Chapter 2: Bayesian Inference

--- Start Class on 4-3-2014 (session 7)

1.Posterior distribution as an estimator.

Once you have computed (simulated) from \(\pi(\theta| y)\), \(p(\tilde y | y)\).

You are basically done.

The posterior distribution is a sufficient statistic. If the model is correct, you can throw away your data and stick only with \(\pi(\theta | y)\).

One can claim that the Bayesian estimate of \(\theta\) is \(\pi(\theta | y)\). One can claim that the Bayesian estimate of \(\tilde y\) is \(p_{\pi}(\tilde y | y)\).

Why does any Bayesian need anything else (point estimation, interval estimator, test, predictive intervals)?

Compare with what frequentist do (cynic answer)
Bayesian need to summarize what is in the posterior in order to communicate results.

Imagine that your parameter space \(\theta = (\theta_1, \dots \theta_p) \in R^p\) and you only care about \(\theta_1\). What do you do if you only have:

\(\ell_{Y = y}(\theta_1, \dots \theta_p)\)

\(\tilde \ell(\theta_1) = max \ell_{Y = y}(\theta_1, \dots \theta_p)\)

Why is this a good answer?

Bayesian answer will be the marginal:

\(\pi(\theta_1 | y) = \int \dots \int \pi(\theta_1 \dots \theta_p |y) d\theta_1 \dots d\theta_p\)

Imagine that you have worked things out for \(\theta\) and now you ask questions about \(g(\theta)\).

alt text

Frequentist have difficulties translating inferences about \(\theta\) into inferences about \(g(\theta)\).

Bayesian will do \(\pi(\theta | y) \rightarrow \pi(g(\theta) | y)\) just a change of variables.

\(\theta_1 \dots \theta_p \rightarrow g(\theta_1) \dots g(\theta_p)\) sample from \(\pi(g(\theta) | y)\).

2.Point estimation.

alt text

How do we summarize the \(formula\) with a number if we must?

Expectation \(formula\)
Mode \(formula\)
Median \(formula\)
etc

Observation:

a) If \(\pi(\theta)\) is flat then \(\hat \theta_{pmode} = \hat \theta_{MLikelihood}\)
b) to compute the \(\hat \theta_{pmode}\) you don't need to know the posterior distribution. \(formula\) maximizing this is equivalent to maximizing \(\pi(\theta | y)\).
c) Under regularity conditions often satisfied \(\hat \theta_{pmode}\) has the same asymptotic properties as \(\hat \theta_{ML}\)

You can only use the Median if you deal with real valued \(\theta\). If \(\theta \in \Omega R^2\) you can't sort \(\Omega\).

What happens if your posterior is ??

alt text

What do you give one point estimation? Do you want to give a point estimate? Probably not.

If you are in \(R^10\) you probably want to do point estimate.

An estimator is neither Bayesian nor Frequentist. An estimator is a function of the data \(formula\) that hopefully will be close to the truth \(\theta^*\) most of the time.

What will be Bayesian or Frequentist is how you judge (assess) the estimator.

Frequentsit assessment: \(E_{y | \theta^*} ( | \theta) formula\)
Bayesian assessment: \(_{\theta | y} formula\)

3.Interval (region) estimation.

We define a region with posterior credibility p to be a subset of \(\Omega\), \(formula\), such that \(\int_{C_{p}(y)}formula\)

http://en.wikipedia.org/wiki/Credible_interval

alt text

The same concept applies to \(formula\).

They are useful as summaries of the uncertainty in \(formula\), \(formula\), \(formula\), \(formula\).

There are 2 families of credibility regions

a) HPD regions (Highest Posterior Density) They are the smallest regions that have credibility p.

alt text

You restrict yourself to picking the values of \(\theta\) with the highest density.

They are the smallest. (strength)
They generalize well when \(\theta \in R^p, p > 1\). (strength)
They are not easy to compute. Even if you have \(\pi(\theta|y)\) in closed form it is not easy. (weakness)
They are not parametric invariant. (weakness)

alt text

They might no be connected.(fact)

b) Central p-credible intervals

alt text

--- Start Class on 6-3-2014 (session 8)

You can not use the if you are not in the real line (weakness)
You are not getting the smallest p-credible region possible (weakness)
Easy to compute (strength)
It is easy to estimate from a sample simulated from \(\pi(\theta), formula\) (strength)

\(formula\) Sample simulated from \(\pi(\theta | y)\) ordered from small to big

\(\hat q^{\frac{1-p}{2}} = formula\)

Invariant when you re-parametrize. (strength)
They have to be intervals (fact)

An interval \(| a(y), b(y)|\) is neither Bayesian nor frequentist. What is Bayesian or Frequentist is the way in which you asses (judge) it.

A Bayesian will judge the interval through the posterior \(\pi(\theta | y)\)

\(P_{\theta | y}(\theta \in | a(y), b(y)| | y) = p\) credible \(\theta\) unknown \(y\) fixed

A Frequentist will judge it through repeated sampling from \(M = {P(y | \theta^*), \theta^* \in \Omega}\)

\(\inf_{\theta^* \in \Omega} = P_{y|\theta^*}(\theta^* \in | a(y), b(y)| | \theta^*) = p(\theta^*)\) confidence of the interval.

\(\theta\) fixed, \(y\) random.

It is extreamly rare that \(p\) and \(1 - \alpha\) for a given \(| a(y), b(y)|\).

There is a temptation of selling an 95% confidence interval as if it was 95% credible interval. This is cheating.

Confidence it is not a probability.

4.Two-hypothesis test.

\(\Omega = \Omega_1 \cup \Omega_2\)

\(M = {P(y | \theta), \theta \in \Omega} = formula\)

\(\left\{\begin{matrix}H_{1}: \theta \in \Omega_{1}\\H_{2}: \theta \in \Omega_{2}\end{matrix}\right.\)

\(M_1: \tilde y \sim M_1\) \(M_2: \tilde y \sim M_2\)

alt text

\(formula\)

You will choose the \(H_1\) that has the largest posterior probability.

\(formula\) posterior odds.

Note that we are treating the null and the alternative symetrically. Compute both probability and choose the one that has more probability.

There is no difference between \(H_0\) and \(H_a\).

\(\underline{Example 1}\)

Simple against simple.

\(M = {P(y | \theta), \theta \in {\theta_1, \theta_2}} = {p(y| \theta_1), p(y| \theta_2)}\) is a Dichothomy.

\(complex formula\)

\(P(H_1 | y) = \frac{formula}{formula}\) \(P(H_2 | y) = \frac{formula}{formula}\)

Posterior odds \(\frac{P(H_1 | y)}{P(H_2 | y)} = \frac{P(H_1 | y)}{1 - P(H_1 | y)} = \frac{P(H_1)}{P(H_2)}\frac{P(y | \theta_1)}{P(y | \theta_2)} = \frac{P(H_1)}{P(H_2)}\frac{\ell_y(\theta_1)}{\ell_y(\theta_2)}\)

Posterior odds = prior odds x likelihood ratio (Bayes factor)

Neyman-Pearson states that it is optimal to have a rejection region based on

\(\frac{\ell_y(\theta_1)}{\ell_y(\theta_2)} = \frac{P(y | \theta_1)}{P(y | \theta_2)}\)

C os a constant that depends on the size of your test. If it is > C then \(H_1\) If it is < C then \(H_2\)

\(p-value = formula\)

Only works for simple against simple.

Often we do as if the \(p-value = P(H_1|y) = P(H_0|y)\).

Instances when a p-value is approximately equal to \(P(H_1|y)\) are rare.

\(\underline{Example 2}\)

Chance between two submodels

\(M = M_1 \cup M_2 =formula \cup formula\)

\({Poisson(\lambda), \lambda \in (0, \infty)} \cup {NegativeBinomial(r, \theta), \theta \in (0,1)}\)

\(\left\{\begin{matrix}H_{1}: \theta \in \Omega_{1}\\H_{2}: \theta \in \Omega_{2}\end{matrix}\right.\)

\(M_1: \tilde y \sim P_1(y | \theta)\) \(P(H_1), \pi(\theta | H_1)\)

\(M_2: \tilde y \sim P_2(y | \theta)\) \(P(H_1), \pi(\theta | H_2)\)

\(P(H_1 |y)\)

Complex problems will be dealed as the simple case.

5.More than two-hypothesis test and model comparison.

6.Prediction.

7.Model averaging.

8.Simulation based inference.

9.Frequentist asymptotic behavior of the posterior distribution.

10.Bayesian asymptotic behavior of the posterior distribution.

11.Decision theory and frequentist (Bayesian) assessment of the Bayesian (frequentist) inference.

Chapter 2: Bayesian Inference

1.Posterior distribution as an estimator.

2.Point estimation.

3.Interval (region) estimation.

4.Two-hypothesis test.

5.More than two-hypothesis test and model comparison.

6.Prediction.

7.Model averaging.

8.Simulation based inference.

9.Frequentist asymptotic behavior of the posterior distribution.

10.Bayesian asymptotic behavior of the posterior distribution.

11.Decision theory and frequentist (Bayesian) assessment of the Bayesian (frequentist) inference.

12.Summary.