\usepackage{relsize}

Chapter 1: Bayesian Model

--- Start Class on 11-2-2014 (session 1)

What is a statistical model

A statistical model (experiment) is a list (a set) of probability models indexed by a parameter that is known to belong to a parameter space \(\Omega\).

\(M = {P(y | \theta), \theta \in \Omega}\), \(\Omega\) parameter space.

when we do inference we assume that data \(Y = y\) is coming from a probability model \(P(y| \theta^{*})\) that is known to belong to M. We assume that \(\theta^{*} \in \Omega\).

We will claim that the model M is correct if the probability model that generated the data \(P(y | \theta^{*}) \in M\).

alt text

\(\underline{Example 1}\)

\(M_{1} = {binomial(n, \theta), \theta \in [0,1]}\)

One could also think about using \(\Omega = [.1, .9]\) or \(\Omega = {.7, .8, .9}\) (trichotomy).

Tossing "pins"

\(\theta = P(\bot)\)

\(1 - \theta = P(\vdash)\)

Once you have observed \(n = 10\) and you get \(y = 4\).

\({binomial(10, \theta), \theta \in [0,1]}\) and \(y = 4\)

a) Point estimation
- What is the \(\hat \theta\) that best represent your data
b) Interval estimation
c) Testing
d) Prediction of future values

You treat all models as they were equally model algo mal apuntat. \(\theta = 0\), \(\theta = 0.5\), \(\theta = 1\). Are all \(\theta\) equally credible?

Implicitly in a statistical model is the assumption that all probability models in \(M = {P(y | \theta), \theta \in \Omega}\) are "equally credible". Is that a sensible assumption?

\(\theta = Probability(heads)\)
\(\theta = Probability(\bot)\)
\(\theta = Probability(blue eyes in Bcn 2014)\)

Make sense to do inference in the same way?

alt text

The three problems in Statistics

A) Design of expermients (sampling)

You do not have data.

Choose the statistical model \(M = {P(y | \theta), \theta \in \Omega}\) you are going to obtain your data from out of a list of possible statistical models \(M_{1}, M_{2}, \dots M\).

B) Model Checking (identification)

You have data, \(Y = y\), and you have a candidate model \(M = {P(y | \theta), \theta \in \Omega}\) and you need to decide whether that model is correct or not. That is, you have to decide whether the probability model \(P(y | \theta^{*})\) that generated your data \(Y = y\) is in M or not?

C) Statistical Inference

Given your data \(Y = y\) and assuming that the mode \(M = {P(y | \theta), \theta \in \Omega}\) is correct (because you know that the model \(P(y | \theta^{*})\) generating your data is in M), then you do:

Point estimation
Interval estimation
Testing
Prediction
\(\dots\)

Guessing what \(\theta^{*}\) generated your data.

--- Start Class on 13-2-2014 (session 2)

Critique of frequentist inference

Frequentist statistician is someone doing statistics using only statistical model. Doing that is extremely difficult.

alt text

Goal is to guess (formula) \(\theta^{*} \in \Omega\) that generated \(Y = y\)

point estimation
testing
prediction

a) Point estimation:

Picking up a function of the data \(\hat\theta(y):S_y\longrightarrow\Omega\) such that is close to the \(\hat\theta(y)\) that generates the data with a large probability.

How do you chose \(\hat\theta(y)\)?

\(\hat\theta_{ML}(y)\) Maximum likelihood
\(\hat\theta_{MM}(y)\) Moments
\(\hat\theta_{LS}(y)\) Least squares. Minimizes a distance between \(y, \hat y\)
\(\hat\theta_{R}(y)\) Robust estimation

Good asymptotic properties. All are heuristics.

The only way to rank \(\hat\theta(y)\)'s and choose one based on how do they perform based on repeated sampling from \(M = {p(y| \theta, \theta \in \Omega)}\)

Distribution of \(\hat\theta(y)\) when\(y\sim P(y|\theta^*)\)

How to choose the best? Based on what?

In order to make this comparison more feasible one often resorts to selecting \(\hat\theta(y)\) based on the squared mean error.

\(MSE_{\hat \theta}(\theta) = E((\hat\theta(y) - \theta)^{2}| \theta) =\left(E(\hat\theta(y)|\theta)-\theta\right)^{2} + V (\hat\theta(y)|\theta)=B(\hat\theta(y)|\theta)^2+V(\hat\theta(y)|\theta)\)

Problems?

Difficult to compute.
Is a function of \(\theta\) and you don't know the truth.
This does not rank your estimator either.

b) Interval estimation:

What is the subset of \(\Omega\) that best represents truth data.

Answer: An interval with confidence p

Difficulties: How do yo build such \(C(p) = [d_{L}^p (y), d_{U}^p (y)] \subset \Omega\)

How do you find/get \(d_{L}^p (y)\) and \(d_{U}^p (y)\)?.

Its difficult to do it for other thing different than a normal. The real problem when you have to explain what does it mean that \(d_{L}^p (y), d_{U}^p (y)]\) has confidence p?

It means that

\(f(\theta^*) = P_{y|\theta^{*}}(d_{L}^p (y) < \theta^* < d_{U}^p (y)) > p \forall \theta^* \in \Omega\)

If you repeat the experiment with the same \(\theta^{*}\) many times and compute these kind of intervals each time, they would include \(\theta^{*}\) with a probability larger than p.

Is a function of \(\theta^{*}\), the real value.

Intervals has to be very large in the worst case scenario.

Do we really care only about this?

In practice many people do as if confidence p is the same as

\(f(y) = P_{\theta | y}(d_{L}(y) < \theta^* < d_{U}(y)) = P(\theta^* \in [d_L, d_U])\) in your actual experiment.

\(d_{L}(y)\) fixed, \(\theta^*\) random, \(d_{U}(y)\) fixed, but this is not confidence intervals.

Confidence Intervals is that is very difficult to compute and understand and interpret. Very hard to justify, like assuming normality and need large sampling.

Paper on binomial distribution.

c) Testing:

To reduce your parameter space by splitting it in two pieces and pick up one of them

\(\Omega = \Omega_{0} \cup \Omega_{0}\)

\(H_{0}: \theta \in \Omega_{0}\) \(H_{a}: \theta \in \Omega_{a}\)

How do you choose a test?

Likelihood ratio test
Wald
Permutation

Only neyman Pearson the only one not asymptotic. But just simple hypothesis. Anything else is just heuristics.

Difficult to choose. Difficult to implement. What do you get with a test? A p-value.

You end up with a p-value. It is very difficult to understand what a p-value means.

You cheat and do as if a p-value = \(P(H_{0} is true | data)\)

If people want \(P(H_{0} is true | data)\) why give a p-value.

The problem with frequentist inference is that:

It is purely heuristic
It is grounded on asymptotic results even though N is always finite.
It is extremely difficult to
- justify
- compute
- Interpret

Statistics is to difficult to handle starting only from the statistical model.

Likelihood based Infernce

Between Bayesian Model and Frequentist model.

Likelihood function
Can one use a likelihood function as a probability function
Likelihood principle

Likelihood function

Likelihood function ins a function on \(\Omega\) obtained by plugging in your observed data into \(P(Y|\theta)\).

\(\ell_{Y = y} (\theta) \propto P(Y = y | \theta)\)

The heuristic behind this is that the largest the \(\ell_{Y = y} (\theta_{i})\), the more likely it is that the data observed comes from \(P(y| \theta^{*} = \theta_{i})\).

With \(\ell(\theta)\) ranks the \(P(y | \theta) \in M\) from making data more likely to making data less likely.

The initial idea come out of the blue and is kind of weird.

\(\underline{Example}\)

\(\theta = P(\bot) = 1 - P(\vdash)\)

\(M_1 = {binomial(n = (10, \theta), \theta \in [0, \theta])}\) and \(y = 6\)

\(\ell_{Y = 6}(\theta) = {10 \choose 6}\theta^{6}(1 - \theta)^4\)

The maximum likelihood estimator is \(\hat \theta_{ML} = 6/10\)

alt text

If \[\frac{\ell(\theta_2)}{\ell(\theta_1)} = 2\]

Then \(\theta_2\) is true more likely to have generated your data than \(\theta_1\)

alt text

How do you compare subset A with subset B in terms of likelihood?

Can we use areas? Can we read the likelihood function as if it was a probability density function?

There are 2 problems for treating \(\ell(\theta)\) as a pdf.

\[\int_{\Omega}\ell_{y}(\theta)d\theta \neq 1\]

Is not always equal to 1. It can be fixed by to a standardized likelihood function.

\[\int_{\Omega}^{est}\ell_{y}(\theta)d\theta = \frac{\ell_{Y = y}(\theta)}{\int_{\Omega}\ell_{Y = y}(\theta)d\theta} = \frac{{10 \choose 6}\theta^{6}(1 - \theta)^4}{\int_{0}^{1} {10 \choose 6}\theta^{6}(1 - \theta)^4 d\theta}\]

Sometimes is infinite and you can't do that. If you reparametrize the model Probability function have to integrate to 1.

alt text

--- Class finishes

--- Start Class on 18-2-2014 (session 3)

(dibuix explicatiu frequentist vs likelihood based statistics, Potser el va fer l'altre dia)

What can you say about the likelihood of (a,b)?

Let's use area under \(\ell(\theta)\). Does that allow one to compare intervals? Not for two reasons.

a) Area under \(\ell(\theta)\) might be larger than one and even \(\infty\). That can be partially fixed by standardizing

\(\ell^{st}(\theta)=\frac{\ell(\theta)}{\int_{\Omega}\ell(\theta)d\theta}\)

b) Area under the likelihood function is not invariant under reparametrizations.

\(\underline{Example}\)

You care about \(\beta = \frac{1}{\theta}\).

\(M=\left\{p(y|\theta) = \binom{10}{y}\left(\frac{1}{\beta}\right)^{y}\left(1-\frac{1}{\beta}\right)^{n-y},\beta\in[1,\infty)\right\}\)

This is the same model but reparametrized. The same statistical model! And now the \ell(\(\theta\)) with y=6 'll be proportional to,

\(\ell^{st} = \frac{\left(\frac{1}{\beta}\right)^{6}\left(\frac{\beta-1}{\beta}\right)^{4}}{\int_{1}^{\infty}\left(\frac{1}{\beta}\right)^{6}\left(\frac{\beta-1}{\beta}\right)^{4}d\theta}\)

The problem will be that in the new parametrization the areas will be different. The likelihood function does not behave/work like a probability density function.

The role in the Jacobian in repramarizations is to mantain areas.

Fisher spent the last years of his live to obtain a likelihood function behave like a probability density function. But it didn't work.

Choosing a prior would have finished the problem.

Can one use a likelihood function as a probability function

Talked in the previous section.

Likelihood principle

If you observe from two different statistical models

\(M_{1} = \left\{P_{1}(y | \theta), \theta \in \Omega\right\}\)

\(M_{2} = \left\{P_{2}(y | \theta), \theta \in \Omega\right\}\)

And the likelihood function are proportional

\(P_{1}(Y=y|\theta)=\ell^{1}_{Y=y}(\theta)~\propto~P_{2}(Y=y|\theta)=\ell^{2}_{Y=y}(\theta)\)

then your inferences (decisions) about \(\theta^{*}\) have to be identical.

Paper Bimbaun (1964) JASA (Will be posted in atenea) stated that if the sufficiency principle + conditionality principle are satisfied, then Likelihood principle are reached.

\(\underline{Example}\)

\(M_{1}\) Tossing a pin 10 times and counting how many times you get pin up. \(M_{1}=\left\{Binom(10, \theta), \theta\in\Omega\right\}\) with \(y=6\)

\(M_{2}\) Toss that pin until you observe six times pin up and count the number of tosses \(M_{2}=\left\{NegBinom(6, \theta), \theta\in\Omega\right\}\) \(y' = 10\)

\(M_{3}\) Toss that pin until the fire alarm sounds \(y'' =\) 6 pin up and 4 pin down

\(\ell_{1}(\theta)~\propto~\theta^{6}(1-\theta)^{4}~\propto~\ell_{2}(\theta)~\propto~\ell_{3}(\theta)\)

Your conclusion about \(\theta^{*}\) will be the same if you keep in mind the likelihood principle.

If you algo by the likelihood principle.

Irrelevance of the stooping rule. Only the things that happen matter.

The problem with many frequentist concepts is that they violate the likelihood principle. Frequentist have to judge everything based on what will happen with repeated simulation from \(P(y|\theta)\) and they worry about what could happen if. The likelihood principle tells you that only what happened matter.

Bayesian Model

A Bayesian statistician also starts with a statistical model,

\(M = {P(y | \theta), \theta \in \Omega}\)

but on top of this considers \(\theta\) to be a random variable, and he/she is ready to choose a probability distribution on \(\Omega\) for \(\theta\) (the so called prior distribution), \(\pi(\theta)\). This prior distribution \(\pi(\theta)\) reflects what believe about \(\theta\).

alt text

We have a list of probability distributions on a sample space \(P(y| \theta)\) that are ordered form more believable to less believable based on the prior \(\pi(\theta)\).

\(\underline{Example}\)

\(M = \left\{Bin(10, \theta), \theta \in [0,1]\right\}\)

\(\pi(\theta)\) is supposed to capture what you know about \(\theta\) without data.

A possible prior could be a uniform. This is not informative, it tells you that being near 0 has the same probability than being near 1.

Beta distributions are convenient. \(\pi(\theta) = Beta(a, b)\). In the past it was useful because you could calculated by hand.

pdf: \(f(x;a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}\)

In the particular case with \(Beta(1,1)\), we get a \(U(0,1)\). If \(a>1\) and \(b>1\) then the pdf is unimodal.

\(E[\theta] = \frac{a}{a+b}\) \(V[\theta] = \frac{ab}{(a+b+1)(a+b)^2}\) \(Mode = \frac{a-1}{a + b -2}\)

The larger a+b the smaller \(V(\theta)\) and so the more informative \(\pi(\theta) (\downarrow V[\theta])\)

alt text

Toss coin, \(\pi(\theta) = \beta(50, 50)\), \(E[\theta] = \frac{1}{2}\) and \(V[\theta]=0.002\). A more informative prior, because we know with more reliability something about \(\theta\).
Push pin, \(\pi(\theta) = \beta(2, 2)\), \(E[\theta] = \frac{1}{2}\) and \(V[\theta]=0.5\)
Blue eyes, \(\pi(\theta) = \beta(1, 4)\), \(E[\theta] =\frac{1}{5}\) and \(V[\theta]\). As \(a < b\) it will be more density on the left of the pdf.

alt text

Assuming that \(\pi(\theta)\sim\textrm{Beta}(a,b)\) is like assuming that you have tossed your pin a+b times in your head and you have observed a times pin up and b times pin down.

One problem is that if you don't agree with a prior in your research. Choosing a prior is answering about \(\theta\) and who knows about \(\theta\) is the client. The statistician has to help the client to choose the distribution.

Posterior distribution

We start with \(M = \left\{P(y | \theta), \theta \in \Omega\right\}\) and \(\pi(\theta)\).

alt text

We start with \(P(y | \theta)\). Once you have the data \(P(y | \theta)\) in the wrong kind of statement, because y is already known an \(\theta\) unknown. What you would really like to have is a probability statement \(P(\theta|y)\), knew in the pas as inverse probability. Likelihood function can not be read in this terms, to get \(P(\theta|y)\) (posterior distribution), we'd need a prior.

--- Start class on 20-2-2014 (Session 4)

Once you have observed your data, you feel the need to go from \({p(y|\theta), \theta\in\Omega}\) to a statement \(\theta | y\) with y known and \(\theta\) unknown.

Bayes Theorem

In Bayesian mode we have a statistical model \(M={p(y|\theta),\theta\in\Omega}\) and a prior distribution \(\pi(\theta)\). That's equivalent to \(\pi(\theta)p(y|\theta)=f_{\Omega}(\theta,y)\) (joint distribution).

The Bayes theorem states that to compute the posterior distribution:

\(\pi(\theta|y)=\frac{f(y,\theta)}{P_{\pi}(y)}=\frac{\pi(\theta)P(y|\theta)}{P_{\pi}(y)}\)

Where \(P_{\pi}(y)\) (prior predictive distribution) is a constant needed to that the posterior \(\pi(\theta|y)\) integrates to 1.

\(P_{\pi}(y)=\int_{\Omega}P(y|\theta)\pi(\theta)d\theta\)

\(formula \rightarrow\) The constant needed so that \(\pi(\theta | y)\) integrate to 1. You have y because is your data.

This is the only thing that you have to compute in order to have your posterior. If you know how to integrate you are done. If you don't pick specific priors the integral is going to be difficult. In case you cannot integrate you will use simulation and estimate your posterior.

Sometimes you don't need to integrate. If you compute \(\pi(\theta)P(y|\theta)\) and you recognize the distribution, then \(\frac{1}{P_{\pi}(\y)}\) will be the constant that gets \(\pi(\theta | y)\) to integrate to one.

\(\pi(\theta|y)=\pi(\theta)\frac{1}{P_{\pi}(y)}P(y|\theta)~\propto~\pi(\theta)\ell_{y}(\theta)\) where the second term is a version of the likelihood function and \(P(y|\theta)\) your statistical model.

By incorporating the prior in the analysis, \(\ell_{y}(\theta)\) becomes a proper probability model. This is all the math that there is in Bayesian Analysis.

A Bayesian will do:

Chose a prior \(\pi(\theta)\)
Compute (or simulate) from the posterior \(\pi(\theta | y)\) and show it to the client.

Since data gets into the posterior only through the likelihood, \(\ell_{y}(\theta)\) Bayesian inference will always satisfy the likelihood principle.

A flat prior (a uniform) is a prior that is constant on \(\Omega\). If you use them, then \(\pi(\theta|y)~\propto~\ell_{u}\) like \(\ell_{y}^{st}\)

alt text

In Frequentist statistics you fit a model, namely, from all your possible infinite models in M you just pick one.
In likelihood based statistics after data you get M and the likelihood function, that ranks your statistical models \(P(y|\theta)\) from more to less likely.
In Bayesian statistics first you order yours \(P(y|\theta)\in M\) from more to less credible through the prior. After getting your data you don't fit a model you just update your model and rank the elements of M in terms of credibility in the light of the data. We have the same object before and after data. This is very convenient if you plan to work sequentially (always do that).

\(\underline{Example (coin and pin)}\)

\(\left\{P(y|\theta)=\binom{10}{y}\theta^{y}(1-\theta)^{10-y}, \theta\in[0,1]\right\}\)

\(\pi(\theta) = Beta(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}\)

With \(Y = 6\), then:

\(\pi(\theta | y) = \frac{\pi(\theta)P(y|\theta)}{\int_{\Omega}\pi(\theta)P(y|\theta)d\theta}=\frac{\binom{10}{y}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a+y-1}(1-\theta)^{10+b-y-1}}{\mathlarger{\int}_{0}^{1}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a+y-1}(1-\theta)^{10+b-y-1}d\theta}\)

\(\pi(\theta | y) = Beta(a + y, b + n - y) = Beta(a + 6, b + 10 - 6)\)

alt text

The likelihood

alt text

Frequentist chooses just one model from \(M = {P(y | \theta), \theta \in \Omega}\), Bayesian chooses \(M = {P(y | \theta), \theta \in \Omega}\) and \(\pi(\theta)\). Teacher opinion: Choosing M is more dangerous than choosing \(\pi(\theta)\) as long as the \(\pi(\theta)\) does not truncate parts of \(\Omega\) out.

Imagine that for binomial I pick \(\pi(\theta)\) that's a Uniform with \(\Omega=[0.6,1]\), then the support of the posterior \(\pi(\theta|y)\) will be \([0.6,1]\), and if \(\theta^{*}=0.4\) we will never get there, and data will never lead you out of \(\Omega = [.6, 1]\). Otherwise ( by not truncating \(\Omega\) through your prior) then a mistake in choosing a \(\pi(\theta)\) will be self correcting because with enough data, \(\pi(\theta | y)\) always ends up close to the truth \(\Omega^*\).

\mbox{Equivocar-se a l'hora d'escollir un model significa agafar-ne un que no conté \(\theta^{*}\); en canvi equivocar-se amb la tria de la distro. a priori significa concentrar-te lluny de \(\theta^{*}\), cosa que es pot solucionar amb dades}

In Statistical Inference you deal with two kind of questions

a) Answer questions about \(\theta^{*}\) starting from a sample \(Y = y\) from \(M = {P(y | \theta), \theta \in \Omega}\).

\(\pi(\theta) \rightarrow \pi(\theta| y)\)

b) Answer questions about future values of \(y\), \(\tilde{y}\), assuming that \(\theta^*\) will be the same that generated \(Y = y\).

\(P_{pi}(\tilde{y}) \rightarrow P_{\pi}(\tilde{y}|y)\), namely, going from the prior predictive to the posterior predictive.

--- Start Class on 25-2-2014 (Session 5)

Prior predictive and posterior predictive distributions

What do you do if you care about future values of \(y\), \(\tilde y\)? (assuming that \(\theta^*\) will stay the same)?.

What do we know about \(\tilde y\) before answering the data?

\(\left\{\begin{matrix} M = {p(\tilde y | \theta), \theta \in \Omega} \\ \pi(\theta) \end{matrix}\right.\)

If you assume this, what is your best bet (guess) as a distribution of \(\tilde y\)?

\(P_{\pi}(\tilde y) = \int_{\Omega}{p(\tilde y | \theta)\pi(\theta) d\theta} = E_{\pi}(p(\tilde y | \theta))\) is the prior predictive distribution.

\(p(\tilde y | \theta)\) weighted average of \(p(y | \theta)\)

This is what you will use if you want to predict \(\tilde y\) without data.

NOTE: \(P_{\pi}(\tilde y) \notin M = {p(y | \theta), \theta \in \Omega}\) (it don't have too).

What will we do if we want to make a probability statement about future values of \(y\), \(\tilde y\), after observing data \(Y = y\).

Our best guess for a distribution of \(\tilde y\) will be:

\(P(\tilde y | y) = \int_{\Omega}{P(\tilde y | \theta)\pi(\theta | y) d\theta}\)

Posterior predictive distribution.

Having \(p_{\pi}(\tilde y)\) and a \(p_{\pi}(\tilde y| y)\) is having a LOT more than just having point estimator, \(\tilde y\), interval estimator, \(\tilde y \pm S_{\tilde y}\).

alt text

\(\underline{Example (binomial)}\)

\(\left\{\begin{matrix} M = {binomial(n , \theta), \theta \in [0, \theta]} \\ \pi(\theta) = Beta(a, b) \end{matrix}\right.\)

\(p_{\pi}(\tilde y) = \int_{\Omega}{p(\tilde y | \theta)\pi(\theta) d\theta} = \int_{0}^1{{n \choose \tilde y} \theta^{\tilde y} (1- \theta)^{n - \tilde y} \frac{\Gamma (a+b))}{\Gamma (a) \Gamma(b)} \theta^{a - 1} (1- \theta)^{b -1} d\theta} = {n \choose \tilde y} \frac{\Gamma (a+b)}{\Gamma (a) \Gamma(b)} \frac{\Gamma (a+\tilde y)}{\Gamma (a + b + n) }\Gamma (b + n - \tilde y)\)

Named beta-binomial \(\tilde y = 0, 1, 2, \dots , n\)

(dibuix) -> (dibuix) Having this is having a lot more than what you are used to have.

\(p_{\pi}(\tilde y| y) = Beta-binomial(a + y, b + n -y, {n}') = \int_{0}^1{{n \choose \tilde y} \theta^{\tilde y} (1- \theta)^{n - \tilde y} \pi(\theta | y) d\theta}\)

\({n}'\) sample size for \(\tilde y\)

Choice of the prior distribution

It is not a big deal.

Choosing \(\pi(\theta)\) is the same as hosing \(p_{\pi}(\tilde y) = \int_{\Omega}{p(y | \theta)\pi(\theta) d\theta}\)

\(\tilde y\) This is the same thing that you will eventually observe.

If you have information about \(\theta\) or about \(\tilde y\), then you want to make sure that all of it is included in \(\pi(\theta)\), \(p_{\pi}(\tilde y)\) through an informative prior.

What will we do if we know little or nothing about \(\theta\) and \(\tilde y\)?

Ideally you would like you to have a catalogue of priors to be used when "nothing is known". These priors are called "reference priors".

You will have diferent ways of defining reference priors, and so for a given \(M = {P(y | \theta), \theta \in \Omega}\) one gets more than one candidate.

In practice there is nothing wrong in trying different priors and checking how the posterior changes.

Informative priors

a) Conjugate priors

A family of priors for a statistical model, M, is conjugate if the posterior is in the same family.

Only (all) statistical models that are in the exponential family have conjugate priors.

Conjugate priors were the only ones that could be used in practice 30 years ago. They allow you to update the \(\pi(\theta)\) into the \(\pi(\theta | y)\) by just updating the parameters of the \(\pi(\theta)\).

(Catalog of conjugate priors in atenea?)

It is easy to check that finite mixtures of conjugate priors are conjugate

\(\pi(\theta) =\sum_{i = 1}^k w_i Beta(a_i, b_i)\) with \(\sum_{i = 1}^k w_i\) is a conjugate prior for a \(binomial(n, \theta)\) model.

You could get very close to almost any distribution on [0,1] with a Beta distribution.

alt text

b) Non-conjugate priors

The problem with choosing a prior is more challenging the larger is \(\Omega\) \((\theta \in \mathbb{R}^p)\).

\(\underline{Example (\text{Normal linear model})}\)

\(M_x = {Normal(\beta_0 + \beta_{1}x, \sigma^2)}, (\beta_0, \beta_1, \sigma) \in \mathbb{R}^2 \times (0, \infty)\)

Any distribution on \(\mathbb{R}^2 \times (0, \infty)\) is allowed.

alt text

You might want to chose a prior by stating what you know about \(\beta\), and about \(E(y = 0 |x) = \frac{- \beta_0}{\beta_1}\). So one possible informative prior is, according to the client,

\(\beta_1 \sim N(5,1)\)

\(\frac{- \beta_0}{\beta_1} \sim N(-1, 3)\)

Sometimes you want to chose \(\pi(\theta)\) by making probability statements about future values \(\tilde y(x)\)

alt text

It requires a lot of care but it is doable. Then everything is straightforward.

It is not easy to translate what people know about \(\tilde y\) or about \(\theta\) in terms of probability. This is the most (only) challenging step in Bayesian Inference.

(dibuix)

Start asking about the maximum, mean, etc. Depending on the problem.

Reference priors

The selection of prior distributions by formal rules (Kass-Waserman-1995)

Flat priors
Jeffrey's priors
Limits of conjugate priors with \(formula\)
Others

Different answers to what prior do I want to use if I do not want to chose prior.

We look for priors that do not mess the information in \(\ell_y (\theta)\).

\(\pi(\theta)\) that are as if they were not there.

If the probability space is finite it is fine. If the probability space is infinite it doesn't integrate to one, they are called improper ones. They are allowed as long as the posterior integrates to one.

--- Start class 27-2-2014 (Session 6)

If \(N \rightarrow \infty\), \(\pi(\theta|Y) \rightarrow l_y(\theta)\) and therefore \(\pi(\theta)\) becomes irrelevant.

Reference priors are an attempt to get the same kind of deal for small N. (objective, non informative, default priors).

a) Flat priors: All elements of \(\theta \in \Omega\) have the same prior probability.

When \(\Omega = \{\theta_1, \theta_2, \dots, \theta_k\}\), \(\pi(\theta) = \frac{1}{k} ~ \forall \theta_i \in \Omega\).

This prior is hard to beat as a reference prior. (for finite \(\Omega\)).

I'm not sure about that. Imagine that your goal is to answer the question

\(H_{0}: \theta = \theta_{1}\)

\(H_{a}: \theta \neq \theta_1\)

Are you sure that you would want to use \(\pi(\theta) = \frac{1}{k}\) by default?

One alternative that makes sense here would be:

\(\pi(\theta_1)=\frac{1}{2}\)

\(\pi(\theta_j)=\frac{\frac{1}{2}}{k-1}\) for \(j=2,3,..,n\)

If the goal of the analysis is to estimate \(\theta\) then the uniform \(\pi(\theta_i) = \frac{1}{k}\) is indeed hard to beat.

In general, when \(\Omega\) is uncountable and not compact or countable, the problem with using a flat prior is that the integral is infinity. They are called improper priors.

In practice, improper priors "can be used" as long as the posterior is proper. This has to be checked! Otherwise you could do silly things.

An option could be to truncate \(\Omega\) far away from your likelihood and then it would be proper.

The problem here is that a flat prior is no invariant under re-parametrization.

\(\underline{Example}\)

\(M = \{Binomial(n,\theta),\theta \in \Omega\}\)

\(\pi(\theta) = U(0,1)\), being 1 if \(\theta \in [0,1]\) and 0 otherwise.

Imagine that you want to make inference about \(\eta = \frac{1}{\theta}\)

\(\theta \sim U(0,1)\) then \(\pi(\eta) = \frac{1}{\eta^2} \neq flat\)

\(\theta \sim U(0,1)\) then \(\pi(log(\theta)) = e^{\phi} \neq flat\)

This is a reason for not calling them flat priors non-informative priors, although it is not a reason for not using them by default either.

b) Jeffreys priors

\(\pi_{j}(\theta) = \sqrt{|I(\theta)|}\) (Fisher information matrix)

\(\pi_{j}(\theta) = \sqrt{-E_{y(\theta)}[\frac{\partial^2}{\partial \theta^2}log~p(y|\theta)]}\)

Invariant against re-parametrization.

Very often Jeffrey's priors are improper.

c) Conjugate with \(V(\theta) = \infty\)

What do we get for \(\{Binomial(n,\theta),\theta \in \Omega\}\)?

a) \(\pi_F(\theta) = U(0,1) = Beta(1,1)\)

b) \(\pi_{j}(\theta) = \sqrt{|\frac{1}{\theta (1-\theta)}|} = Beta(0.5, 0.5)\)

c) Conjugate prior is \(Beta(a, b)\)

\(V(\theta) = \frac{ab}{(a + b + 1)(a + b)^2}\)

If (a + b) decreases to 0, the Variance increases.

\(\theta \sim Beta(0, 0) \Rightarrow V(\theta) = \infty\)

alt text

a + b is equivalent to prior sample size.

b) and c) could be improper.

It is useful to try different priors and check how does the posterior change (sensitivity analysis). Robust Bayes. If the three times you check you get similar results you are done, if not you have too think better which one to use.

Empirical Bayes is not Bayesian.

To chose the prior that makes the probability distribution similar to the histogram to your data.

Choosing the prior is the same than hosing a prior predictive.

alt text

This is cheating. It requires to chose the prior a posteriori, after data. It violates the likelihood principle.

This is not Bayesian and is tempting because what you get from this tends to have very good frequentist properties.

Bayesian model as a probability model and as a data simulator

Care about \(\theta\)

alt text

Care about \(\tilde y\)

alt text

Care about both \((\tilde y,\theta)\) \(\tilde y\) Observable and \(\theta\) not observable.

alt text

Bayesian model is a probability model and hence Bayesian mathematics are probability mathematics.

Building a Bayesian model is very much like building a data simulation model that you will train with real data. (we will talk more in chapter 4 model checking)

With a statistical model you can not simulate.

alt text

Advantages and disadvantages of going Bayesian

Non-Bayesian frequentist

He can not talk about the probability of \(\theta\). He assesses everything indirectly by checking what happens under repeated sampling from

\(P(Y | \theta), \theta \in \Omega\)

How do they deal with non-repetitive phenomenon.

Is tomorrow going to rain? How would you deal with this statistically unless you are Bayesian?

Frequentist statistics only have \(M = {P(y | \theta), \theta \in \Omega}\) and proceeding from that only is difficult to:

justify
implement
interpret

Once you have observed your data y becomes known and \(\theta\) is still unknown. Why do you want to keep thinking in terms of \(P(Y| \theta)\), and hence in terms of what could have happened if \(\theta\) with \(\theta\).

\(P(Y| \theta)\) Why don't you switch? A frequentist answer is because you can't. You can not do this unless you have a prior.

--- Start Class on 4-3-2014 (session 7)

You have to be willing to assign probabilities to everything and not just repetitive events \((\theta)\).

If you are willing then everything is a lot easier to:

justify
implement
interpret

Everything falls down to computing probabilities.

One only needs to do two steps:

a) Write the joint distribution of what will be observed and what will not be observed and what will not be observed. I have to choose a \(\pi(\theta)\)

\(p_{\pi}(\tilde y, \theta) = \pi(\theta)p(\tilde y| \theta)\)

You get data \(Y = y\).

b) Condition with respect to what we know \((Y = y)\) and we integrate over everything that we do not know and we do not care about.

\(f_{pi}(\tilde y, \theta | Y = y) = \pi(\theta | y) * p(\tilde y | \theta)\)

If you care about \(\theta\) \(\pi(\theta) = \int\)

If you care about \(\tilde y\) \(P_{\pi}(\tilde y | y) = \int P(\tilde y | \theta) \pi(\theta|y)d\theta\)

a) Choosing a \(\pi(\theta)\) b) Compute (simulate) \(formula\) and decide how to present the results.

Bayesian and frequentist provide different answers to different questions.

Frequentist ask what happens when one simulates repeatedly from \(p(y | theta^*), \theta^* \in \Omega\) Bayesian ask what can be done with \(\pi(\theta | y)\) \(p_{\pi}(\tilde y | y)\) Be optimal given what was observed.