19-Bayesian 2

The document discusses Bayesian inference and its comparison with the maximum likelihood approach in statistical modeling. It explains the concept of treating parameters as random variables and using prior distributions to derive posterior distributions through Bayes' theorem. Additionally, it provides examples, including a binomial model, to illustrate how to choose priors and compute posterior distributions.


STAT 5703: Statistical Inference and Modeling for Data Science
Dobrin Marchev
Last week: Markov chains, a computational tool for Bayesian models.

Bayesian Inference
• In this lecture we will define the basic terminology used in Bayesian inference.
• We will also consider some examples to illustrate the new concepts.
• Finally, we will explain why Monte Carlo methods are typically needed in Bayesian inference.
Maximum Likelihood Approach


Recall the maximum likelihood technique, which is based on a
fixed unknown parameter 𝜃 ∈ Θ, and an iid sample 𝑿 =
(𝑋1 , … , 𝑋𝑛 ) from a population with density 𝑓(𝑥 | 𝜃).

The likelihood function 𝐿(𝜃 | 𝒙) is defined as

𝐿(𝜃 | 𝒙) = 𝑓(𝒙 | 𝜃) = ∏_{i=1}^{n} 𝑓(𝑥𝑖 | 𝜃),

and the goal is to find a suitable estimate 𝜃̂ of 𝜃.

Most often the difficulties with this approach are optimization-type problems, like finding the MLE = argmax_{𝜃 ∈ Θ} 𝐿(𝜃 | 𝒙), which in turn are often reduced to solving equations.

Remark: Very often the parameter 𝜃 is a vector 𝛉 = (𝜃1 , … , 𝜃𝑘 ).


Parameters not fixed data .
More likely to
get higher
↑ values than other.
𝑓(𝒙 | 𝜃) 𝜋(𝜃)

Bayesian
Approach

a
Middle
• The main idea is that 𝜃 is
now a random variable.

• It is useful to think that 𝜃 likely to be 0 .


5
being random represents
now this obtained ?
our uncertainty about its - is

value. Apply Bayes


knowledge of parameter Theorem

a curve .
𝑓(𝜃| 𝒙)
is modelled into
Bayesian Approach
• The main idea is that 𝜃 is now a random variable.
• Therefore, 𝜃 has a distribution with a pdf.
• The information brought by the sample x is combined with prior information, specified by a prior distribution 𝜋(𝜃), to form the posterior distribution 𝑓(𝜃 | 𝒙) by using Bayes' formula:

𝑓(𝜃 | 𝒙) = 𝑓(𝒙, 𝜃) / 𝑚(𝒙) = 𝑓(𝒙 | 𝜃) 𝜋(𝜃) / 𝑚(𝒙),

where

𝑚(𝒙) = ∫ 𝑓(𝒙 | 𝜃) 𝜋(𝜃) 𝑑𝜃

is the marginal distribution of the data X (note that this can be a high-dimensional integral).


• Before seeing the data, our uncertainty is represented by 𝜋(𝜃)
and after that by 𝑓(𝜃| 𝒙).
Where does the formula come from?
Extend conditional probability to discrete random variables:

𝑃(𝑌 = 𝑦 | 𝑋 = 𝑥) = 𝑃(𝑌 = 𝑦, 𝑋 = 𝑥) / 𝑃(𝑋 = 𝑥)

And then to conditional densities:

𝑓(𝑦 | 𝑥) = 𝑓(𝑥, 𝑦) / 𝑓(𝑥)

Then Bayes' Theorem becomes:

𝑓(𝑦 | 𝑥) = 𝑓(𝑥 | 𝑦) 𝑓(𝑦) / ∫ 𝑓(𝑥 | 𝑦) 𝑓(𝑦) 𝑑𝑦

When we apply it to y = 𝜃 we obtain the posterior

𝑓(𝜃 | 𝒙) = 𝑓(𝒙 | 𝜃) 𝜋(𝜃) / 𝑚(𝒙)
Remarks
• Both procedures have advantages and disadvantages.
• The Bayesian approach has been criticized for over-reliance on convenient priors or heavy computations.
• The frequentist approach has been criticized for inflexibility (failure to incorporate prior information) and incoherence (failure to process information systematically).
• The Bayesian approach has an intuitive interpretation: the outputs of Bayesian analyses are usually probabilities that measure our (un)certainty. In the context of hypothesis testing, Bayesian analyses directly measure the probability that the null hypothesis is true, which usually provides a more straightforward interpretation. The Bayesian process is very similar to how we process information in our minds.
• For large n, as well as when the prior is uniform, the Bayesian method will provide results similar to the classical likelihood approach.
How Do You Choose a Prior?
• Hard to answer: it depends.
• If you know something about your parameter, specify it. It could be as simple as 𝜃 > 0.
• There might already have been research done on similar problems, and some "expert" knowledge might be available. This is difficult for the experts in the subject, who don't have detailed knowledge about statistics, and it's difficult for the statisticians, because they don't have the subject knowledge.
• You can also use priors based on the mean and variance of your knowledge. For example, you can say "𝜃 is around 1".
• It is possible to adopt "flat" or "non-informative" priors when no information is available, e.g., a uniform prior when the parameter lies in [0, 1].
Example: Binomial Model
Assume we have conditional iid binary data

𝑋1, 𝑋2, …, 𝑋𝑛 | 𝜃 ~ Bernoulli(𝜃)

Note that 𝑓(𝑥𝑖 | 𝜃) = 𝜃^{𝑥𝑖} (1 − 𝜃)^{1−𝑥𝑖}, 𝑖 = 1, …, 𝑛.

Then the conditional likelihood is:

𝑓(𝑥1, …, 𝑥𝑛 | 𝜃) = ∏_{i=1}^{n} 𝜃^{𝑥𝑖} (1 − 𝜃)^{1−𝑥𝑖}
                = 𝜃^{Σ_{i=1}^{n} 𝑥𝑖} (1 − 𝜃)^{Σ_{i=1}^{n} (1−𝑥𝑖)}
                = 𝜃^{Σ_{i=1}^{n} 𝑥𝑖} (1 − 𝜃)^{𝑛 − Σ_{i=1}^{n} 𝑥𝑖}

What remains to be specified is the prior distribution. In this case the parameter is 𝜃 ∈ [0,1].
Choosing a prior
The parameter is 𝜃 ∈ [0,1], so you need to specify a pdf with domain [0,1]. One family of possible distributions is the Beta(a, b). We will do this later in general. For now, we pick one of the Beta family members, the uniform distribution on [0, 1].
Uniform Prior
Suppose we have no information about the value of 𝜃. This means we assume

𝜃 ~ U(0, 1), i.e., 𝜋(𝜃) = 1, 0 ≤ 𝜃 ≤ 1

Then the posterior is

𝑓(𝜃 | 𝑥1, …, 𝑥𝑛) = 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) 𝜋(𝜃) / 𝑚(𝑥1, …, 𝑥𝑛)
                = 𝑐(𝑥1, …, 𝑥𝑛) 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) ∝ 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃)

The last line says that 𝑓(𝜃 | 𝑥1, …, 𝑥𝑛) and 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) are proportional to each other as functions of 𝜃 and only the normalizing constant 𝑐(𝑥1, …, 𝑥𝑛) = 1 / 𝑚(𝑥1, …, 𝑥𝑛) is missing.
Example 1

Suppose you are playing a new game and want to estimate your
chance 𝜃 ∈ [0,1] of winning it. You don’t know anything about
the game before you start playing, so the prior on 𝜃 is assumed
uniform(0,1).
Then you play 9 games, and they result in the following
sequence:
WLWWWLWLW
Derive and plot the posterior distribution of your chance 𝜃 of
winning.

Example 1 Solution
Given: You play 9 games, and they result in the following sequence:
WLWWWLWLW
Plot the posterior distribution of your chance of winning using a grid approximation.

Solution:
We have n = 9 trials, and 𝑥1 = 1, 𝑥2 = 0, …, 𝑥9 = 1, with Σ_{i=1}^{9} 𝑥𝑖 = 6.
Therefore, the posterior is

𝑓(𝜃 | 𝑥1, …, 𝑥9) = 𝑝(𝑥1, …, 𝑥9 | 𝜃) 𝜋(𝜃) / 𝑚(𝑥1, …, 𝑥9)
               ∝ 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) = 𝜃^{Σ𝑥𝑖} (1 − 𝜃)^{𝑛 − Σ𝑥𝑖}
               ∝ 𝜃^6 (1 − 𝜃)^3, 𝜃 ∈ [0,1]

Note: ∫_{0}^{1} 𝜃^6 (1 − 𝜃)^3 𝑑𝜃 ≠ 1, that's why the proportional sign.
Example 1: Graph using a grid
[Figure: the unnormalized posterior 𝜃^6 (1 − 𝜃)^3 evaluated on a grid of 𝜃 values in [0, 1].]
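Below is a minimal R sketch of the grid approximation described in the solution (variable names are illustrative, not taken from the course code): evaluate the unnormalized posterior 𝜃^6 (1 − 𝜃)^3 on a grid and normalize it numerically.

# Grid approximation of the posterior for Example 1 (6 wins in 9 games, uniform prior)
theta  <- seq(0, 1, length.out = 1000)          # grid of parameter values
unnorm <- theta^6 * (1 - theta)^3               # unnormalized posterior
post   <- unnorm / sum(unnorm * diff(theta)[1]) # normalize so it integrates to about 1
plot(theta, post, type = "l",
     xlab = expression(theta), ylab = "posterior density")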
Normalizing Constant
To find the marginal density of the data, aka the normalizing constant, observe the following:

𝑚(𝑥1, …, 𝑥𝑛) = ∫_{0}^{1} 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) 𝜋(𝜃) 𝑑𝜃 = ∫_{0}^{1} 𝜃^{Σ𝑥𝑖} (1 − 𝜃)^{𝑛 − Σ𝑥𝑖} 𝑑𝜃

Recall from calculus:

∫_{0}^{1} 𝜃^{𝑎−1} (1 − 𝜃)^{𝑏−1} 𝑑𝜃 = Γ(𝑎) Γ(𝑏) / Γ(𝑎 + 𝑏), for all 𝑎, 𝑏 > 0

Therefore, with 𝑎 = 𝑥 + 1, 𝑏 = 𝑛 − 𝑥 + 1, where 𝑥 = Σ_{i=1}^{n} 𝑥𝑖, we have that

𝑚(𝑥1, …, 𝑥𝑛) = Γ(𝑥 + 1) Γ(𝑛 − 𝑥 + 1) / Γ(𝑛 + 2)
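As a quick numerical check of this formula (an illustrative sketch, not part of the original slides), we can compare R's integrate() against the gamma-function expression for the Example 1 data:

# Numerical check of the normalizing constant for x = 6, n = 9
x <- 6; n <- 9
numeric_m <- integrate(function(t) t^x * (1 - t)^(n - x), 0, 1)$value
exact_m   <- gamma(x + 1) * gamma(n - x + 1) / gamma(n + 2)
c(numeric = numeric_m, exact = exact_m)   # both equal 1/840, about 0.00119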
Posterior Distribution
We can now find the posterior distribution exactly:

𝑓(𝜃 | 𝑥1, …, 𝑥𝑛) = 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) 𝜋(𝜃) / 𝑝(𝑥1, …, 𝑥𝑛)
                = 𝜃^{Σ𝑥𝑖} (1 − 𝜃)^{𝑛 − Σ𝑥𝑖} / [Γ(𝑥 + 1) Γ(𝑛 − 𝑥 + 1) / Γ(𝑛 + 2)]
                = [Γ(𝑛 + 2) / (Γ(𝑥 + 1) Γ(𝑛 − 𝑥 + 1))] 𝜃^𝑥 (1 − 𝜃)^{𝑛−𝑥}, 𝜃 ∈ (0,1)

where 𝑥 = Σ_{i=1}^{n} 𝑥𝑖.

It can even be recognized that

𝜃 | 𝑥1, …, 𝑥𝑛 ~ Beta(𝑥 + 1, 𝑛 − 𝑥 + 1)

That is, the posterior has a beta distribution with parameters x + 1 and n − x + 1.
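Since the posterior for Example 1 is Beta(7, 4), it can be plotted directly with dbeta (a small illustrative sketch):

# Exact posterior for Example 1: theta | data ~ Beta(x + 1, n - x + 1) = Beta(7, 4)
x <- 6; n <- 9
theta <- seq(0, 1, length.out = 1000)
plot(theta, dbeta(theta, x + 1, n - x + 1), type = "l",
     xlab = expression(theta), ylab = "posterior density")
abline(v = x / n, lty = 2)   # the posterior mode equals the MLE x/n = 2/3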
Aside: Beta Distribution
• The pdf of Beta(a, b) is defined for a, b > 0 by

𝑓(𝑥) = [Γ(𝑎 + 𝑏) / (Γ(𝑎) Γ(𝑏))] 𝑥^{𝑎−1} (1 − 𝑥)^{𝑏−1}, 𝑥 ∈ [0,1]

• It is available in R with the dbeta function.

[Figure: the Beta(5, 7) density, plotted in R with dbeta(x, 5, 7).]
Different Prior
Notice that the uniform distribution we used is equivalent to a beta distribution with parameters a = b = 1. That is,

U(0, 1) = Beta(1, 1)

Suppose now that the prior is any beta distribution:

𝜃 ~ Beta(a, b)

Then the posterior is

𝑓(𝜃 | 𝑥1, …, 𝑥𝑛) = 𝜃^𝑥 (1 − 𝜃)^{𝑛−𝑥} · [Γ(𝑎 + 𝑏) / (Γ(𝑎) Γ(𝑏))] 𝜃^{𝑎−1} (1 − 𝜃)^{𝑏−1} / 𝑝(𝑥1, …, 𝑥𝑛)
                ∝ 𝜃^{𝑥+𝑎−1} (1 − 𝜃)^{𝑛−𝑥+𝑏−1}, 𝜃 ∈ (0,1)

where 𝑥 = Σ_{i=1}^{n} 𝑥𝑖.
The only distribution with such a shape is Beta(a + x, b + n − x). Throughout the course we will use this trick to identify posterior distributions.
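A short R sketch of this conjugate update (the prior hyperparameters below are an arbitrary illustration, not values from the slides):

# Conjugate updating for the binomial model: Beta(a, b) prior -> Beta(a + x, b + n - x) posterior
a <- 2; b <- 2                  # illustrative prior hyperparameters (assumed for this sketch)
x <- 6; n <- 9                  # Example 1 data: 6 wins in 9 games
theta <- seq(0, 1, length.out = 1000)
plot(theta, dbeta(theta, a + x, b + n - x), type = "l",
     xlab = expression(theta), ylab = "density")   # posterior
lines(theta, dbeta(theta, a, b), lty = 2)          # prior, for comparison
legend("topleft", legend = c("posterior", "prior"), lty = c(1, 2))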
Exercise: Gaussian Model with one observation
It is easier to consider first a model with one unknown parameter and one
observation.
Suppose that 𝜎² is known and we have only one observation such that:

𝑋 | 𝜃 ∼ 𝑁(𝜃, 𝜎²)
𝜃 ∼ 𝑁(𝜇, 𝜏²)

That is,

𝑓(𝑥 | 𝜃) = (1/√(2𝜋𝜎²)) exp(−(𝑥 − 𝜃)² / (2𝜎²)), 𝑥 ∈ ℝ
𝜋(𝜃) = (1/√(2𝜋𝜏²)) exp(−(𝜃 − 𝜇)² / (2𝜏²)), 𝜃 ∈ ℝ
𝑓(𝑥, 𝜃) = (1/(2𝜋𝜎𝜏)) exp(−(1/2)[(𝑥 − 𝜃)²/𝜎² + (𝜃 − 𝜇)²/𝜏²])

Find m(x) and 𝑓(𝜃 | 𝑥). (It is possible to calculate this posterior by hand.)

Example: Normal distribution


Known variance, unknown mean
Now suppose we have a sample of Normal data:

𝑋1, …, 𝑋𝑛 ~ 𝑁(𝜃, 𝜎²)

Let us again assume we know the variance, 𝜎², and we assume a prior distribution for the mean, 𝜃, based on our prior beliefs:

𝜃 ∼ 𝑁(𝜇, 𝜏²)

Now we wish to construct the posterior distribution 𝑓(𝜃 | 𝑥1, …, 𝑥𝑛).
(See exercises)
Posterior for Normal distribution mean
The unknown parameter is the mean 𝜃.
The prior is 𝜃 ~ 𝑁(𝜇, 𝜏²). That is,

𝜋(𝜃) = (1/√(2𝜋𝜏²)) exp(−(𝜃 − 𝜇)² / (2𝜏²))

The model for the data is 𝑋𝑖 ~ 𝑁(𝜃, 𝜎²). That is,

𝑓(𝑥𝑖 | 𝜃) = (1/√(2𝜋𝜎²)) exp(−(𝑥𝑖 − 𝜃)² / (2𝜎²))

The conditional likelihood is:

∏_{i=1}^{n} (1/√(2𝜋𝜎²)) exp(−(𝑥𝑖 − 𝜃)² / (2𝜎²))
Posterior for the mean of a normal distribution
Hence, without the normalizing constant m(x), the posterior is proportional to:

𝑓(𝜃 | 𝒙) ∝ 𝜋(𝜃) 𝑓(𝒙 | 𝜃) = (1/√(2𝜋𝜏²)) exp(−(𝜃 − 𝜇)² / (2𝜏²)) × ∏_{i=1}^{n} (1/√(2𝜋𝜎²)) exp(−(𝑥𝑖 − 𝜃)² / (2𝜎²))

After using some algebra, you can prove that

Σ_{i=1}^{n} (𝑥𝑖 − 𝜃)²/𝜎² = (1/𝜎²) Σ_{i=1}^{n} 𝑥𝑖² − (2𝜃/𝜎²) Σ_{i=1}^{n} 𝑥𝑖 + 𝑛𝜃²/𝜎²

Then …

𝑓(𝜃 | 𝒙) ∝ exp{ −(1/2) 𝜃² (1/𝜏² + 𝑛/𝜎²) + 𝜃 (𝜇/𝜏² + Σ_{i=1}^{n} 𝑥𝑖/𝜎²) + const }

To obtain the exact posterior distribution, you have to complete the square… (exercise at the end).
Posterior Distribution
• When you finish the previous example as an exercise, you will find out that the posterior is a normal distribution:

𝑋1, …, 𝑋𝑛 | 𝜃 ~ 𝑁(𝜃, 𝜎²)
𝜃 ∼ 𝑁(𝜇, 𝜏²)
⇒ 𝜃 | 𝑥1, …, 𝑥𝑛 ~ 𝑁(𝜇_post, 𝜎²_post)

• Note that we ended up with a posterior distribution in the same family as the prior: normal. Such a phenomenon is known as a conjugate prior.
• How are the posterior mean and variance 𝜇_post, 𝜎²_post related to the prior mean and variance and the sample mean and variance?
• It turns out that the posterior mean is a compromise between the prior mean and the sample mean!
Precisions and means
Posterior mean and variance:

𝜇_post = 𝐸(𝜃 | 𝒙) = [𝜇/𝜏² + 𝑥̄/(𝜎²/𝑛)] / [1/𝜏² + 𝑛/𝜎²]
        = (1/𝜏²) / (1/𝜏² + 𝑛/𝜎²) · 𝜇 + (𝑛/𝜎²) / (1/𝜏² + 𝑛/𝜎²) · 𝑥̄

Posterior variance:

Var(𝜃 | 𝒙) = 1 / (1/𝜏² + 𝑛/𝜎²)

In Bayesian statistics the precision = 1/variance is often more important than the variance.

For the Normal model we have that the posterior precision = sum of prior precision and data precision, and the posterior mean is a (precision-weighted) average of the prior mean and data mean. If the sample is large, the data term dominates.
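A small R helper (illustrative, not from the slides) that applies these precision-weighting formulas for the normal model with known 𝜎²:

# Posterior mean and variance for the normal model with known sigma^2
normal_posterior <- function(x, mu, tau2, sigma2) {
  n <- length(x)
  prior_prec <- 1 / tau2          # prior precision
  data_prec  <- n / sigma2        # data precision
  post_var   <- 1 / (prior_prec + data_prec)
  post_mean  <- post_var * (prior_prec * mu + data_prec * mean(x))
  list(mean = post_mean, var = post_var)
}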
School Example
The following data are on the amount of time (in hours) students
from some high school spent on studying or homework for a
particular subject during an exam one-week period:

2.11 9.75 13.88 11.30 8.93 15.66 16.38 4.54 8.86 11.94 12.47 11.11 11.65
14.53 9.61 7.38 3.34 9.06 9.45 5.98 7.44 8.50 1.55 11.45 9.73

Suppose that we are about 90% sure that students dedicate 2-3 hours on average per credit hour, and the class was 3 credits. Assume study times vary between students with standard deviation 4 hours.

Obtain a reasonable prior, compute the posterior mean and variance, draw the posterior curve, and compare it to the prior.
School Example Prior
Suppose that we are about 90% sure that students dedicate 2-3 hours on average per credit hour, and the class was 3 credits. Assume study times vary between students with standard deviation 4 hours.

Obtain a reasonable prior:

We can take the middle of the interval to be

𝜇 = 3 × (2 + 3)/2 = 7.5

The length of the interval is approximately 2 × 1.645 × 𝜏, so:

3 × (3 − 2) = 2 × 1.645 × 𝜏 ⇒ 𝜏 = 3/3.29 = 0.9 ≈ 1

𝜎² = 4² = 16
Prior and posterior comparison

Note how the posterior has more precision (narrower curve) than the prior.
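For this example, the posterior mean and variance and the prior/posterior curves can be computed directly in R (a small illustrative sketch using the prior 𝑁(7.5, 1) and 𝜎² = 16 chosen above):

# School example: prior N(7.5, 1), within-student sd assumed 4 (sigma^2 = 16)
hours <- c(2.11, 9.75, 13.88, 11.30, 8.93, 15.66, 16.38, 4.54, 8.86, 11.94,
           12.47, 11.11, 11.65, 14.53, 9.61, 7.38, 3.34, 9.06, 9.45, 5.98,
           7.44, 8.50, 1.55, 11.45, 9.73)
mu <- 7.5; tau2 <- 1; sigma2 <- 16; n <- length(hours)
post_var  <- 1 / (1 / tau2 + n / sigma2)
post_mean <- post_var * (mu / tau2 + sum(hours) / sigma2)
theta <- seq(3, 12, length.out = 500)
plot(theta, dnorm(theta, post_mean, sqrt(post_var)), type = "l",
     xlab = expression(theta), ylab = "density")        # posterior
lines(theta, dnorm(theta, mu, sqrt(tau2)), lty = 2)     # prior
legend("topright", legend = c("posterior", "prior"), lty = c(1, 2))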
Estimating 𝜃 under the Bayesian Paradigm
Q: We found the posterior 𝑓(𝜃 | 𝑥). Now what can we do with it?

A: It can be reported as the entire posterior distribution of the parameter, but very often we report just a point estimator 𝜃̂. The most commonly used estimator is the posterior mean:

𝜃̂ = 𝐸(𝜃 | 𝑥) = ∫ 𝜃 𝑓(𝜃 | 𝑥) 𝑑𝜃

In general, we can find an estimate of ℎ(𝜃) for any integrable function ℎ(⋅) by

ĥ = 𝐸[ℎ(𝜃) | 𝑥] = ∫ ℎ(𝜃) 𝑓(𝜃 | 𝑥) 𝑑𝜃

Therefore, the Bayesian approach often results in integration problems. Some of these difficulties are:
• 𝑓(𝜃 | 𝑥) might not be available in closed form, or only partially available due to m(x) being unavailable.
• the integration ∫ ℎ(𝜃) 𝑓(𝜃 | 𝑥) 𝑑𝜃 cannot be done analytically.
Confidence Regions
• The Bayesian analogs of CIs are the confidence regions, also called credible regions, with highest posterior density:

𝐶(𝒙) = { 𝜃: 𝑓(𝜃 | 𝒙) ≥ 𝑘 }

• where k is such that

𝑃(𝜃 ∈ 𝐶(𝒙) | 𝒙) = 𝛾 = 𝑃(𝑓(𝜃 | 𝒙) ≥ 𝑘 | 𝒙)

• This leads to a different computational problem, namely solving

𝑓(𝜃 | 𝒙) = 𝑘
CI Examples
• Example (normal, continued): If the posterior is unimodal and symmetric:

𝑓(𝜃 | 𝑥) = (1/√(2𝜋 𝜎²𝜏²/(𝜎² + 𝜏²))) exp( −(𝜃 − (𝜇𝜎² + 𝑥𝜏²)/(𝜎² + 𝜏²))² / (2𝜎²𝜏²/(𝜎² + 𝜏²)) )

the credible region has the form

(𝜇𝜎² + 𝑥𝜏²)/(𝜎² + 𝜏²) ± 𝑧_{𝛼/2} √(𝜎²𝜏²/(𝜎² + 𝜏²))

• Example (binomial, continued): In this case the posterior is not symmetric:

𝜃 | 𝑥 ∼ Beta(𝑥 + 𝛼, 𝑛 − 𝑥 + 𝛽)

We must find l(x) and u(x) such that

∫_{𝑙(𝑥)}^{𝑢(𝑥)} 𝑓(𝜃 | 𝑥) 𝑑𝜃 = 0.95

and 𝑓(𝑙(𝑥) | 𝑥) = 𝑓(𝑢(𝑥) | 𝑥), which cannot be done analytically.

[Figure: an asymmetric Beta(2, 5) posterior density.]
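Numerically, the highest-posterior-density interval for a Beta posterior can be found as the shortest interval with the required probability (a sketch under the assumption that the posterior is unimodal; the function name is illustrative):

# Numerical 95% HPD interval for a Beta(a, b) posterior,
# found as the shortest interval containing 95% posterior probability
hpd_beta <- function(a, b, level = 0.95) {
  width <- function(p) qbeta(p + level, a, b) - qbeta(p, a, b)
  p_low <- optimize(width, c(0, 1 - level))$minimum   # lower-tail probability of the interval
  c(lower = qbeta(p_low, a, b), upper = qbeta(p_low + level, a, b))
}
hpd_beta(7, 4)   # e.g., for the Beta(7, 4) posterior from Example 1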
Conjugate Priors

The computational drawbacks used to be so severe that in the


past many researchers worked only with the so-called conjugate
priors.

Definition: A family of probability distributions ℱ = { 𝜋(𝜃) } is conjugate for the model 𝑓(𝑥 | 𝜃) if

𝑓(𝜃 | 𝑥) ∈ ℱ

Note: The Gaussian example was conjugate, because the prior had a normal distribution, and the posterior also had a normal distribution.
Example: Beta-Binomial Model
Let

𝑥 | 𝜃 ∼ Binom(𝑛, 𝜃)
𝜃 ∼ Beta(𝛼, 𝛽)

That is,

𝑓(𝑥 | 𝜃) = 𝑃(𝑋 = 𝑥 | 𝜃) = (𝑛 choose 𝑥) 𝜃^𝑥 (1 − 𝜃)^{𝑛−𝑥}
𝜋(𝜃) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1}
⇒ 𝑓(𝑥, 𝜃) = (𝑛 choose 𝑥) [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼+𝑥−1} (1 − 𝜃)^{𝑛+𝛽−𝑥−1}
⇒ 𝜃 | 𝑥 ∼ Beta(𝛼 + 𝑥, 𝑛 + 𝛽 − 𝑥)

Therefore, the prior is Beta distributed and the posterior is Beta distributed, meaning the Beta distribution is a conjugate prior.

Note: See R examples.

Improper Priors
Sometimes we may wish that 𝜋(𝜃) contains no information about 𝜃. For example, if Θ = [𝑎, 𝑏], then 𝜋(𝜃) = 1/(𝑏 − 𝑎), i.e. 𝜃 ∼ 𝑈(𝑎, 𝑏), is a natural choice when Θ is bounded.

However, if, say, Θ = ℝ and 𝜋(𝜃) = 𝑐, then

∫ 𝜋(𝜃) 𝑑𝜃 = ∫_{−∞}^{∞} 𝑐 𝑑𝜃 = ∞

Definition: When ∫ 𝜋(𝜃) 𝑑𝜃 = ∞ the prior is called improper.

If 𝜋(𝜃) is improper, then Bayesian inference is still possible, provided 𝑓(𝜃 | 𝑥) is a proper density. For example, if 𝜋(𝜃) = 𝑐, then we must have ∫ 𝑓(𝑥 | 𝜃) 𝑑𝜃 = 𝑘 < ∞. In such cases

𝑓(𝜃 | 𝑥) = 𝑓(𝑥 | 𝜃) 𝜋(𝜃) / ∫ 𝑓(𝑥 | 𝜃) 𝜋(𝜃) 𝑑𝜃 = 𝑐 𝑓(𝑥 | 𝜃) / ∫ 𝑐 𝑓(𝑥 | 𝜃) 𝑑𝜃 = 𝑓(𝑥 | 𝜃) / 𝑘
Multistage Models
The basic Bayesian model has two stages:
Stage 1: 𝑓(𝑥 | 𝜃, 𝜂)
Stage 2: 𝑓(𝜃 | 𝜂)
where 𝜂 is a vector of hyperparameters. For example, in the Beta-Binomial model 𝜂 = (𝛼, 𝛽).

If we are not sure about the specific value of the hyperparameter 𝜂, then we can have a third stage by placing a hyperprior on it:
Stage 3: 𝜂 ∼ 𝑔(𝜂), 𝜂 ∈ 𝑌
Then the posterior is calculated as:

𝑓(𝜃 | 𝑥) = 𝑓(𝑥, 𝜃) / 𝑚(𝑥) = ∫ 𝑓(𝑥, 𝜃, 𝜂) 𝑑𝜂 / ∬ 𝑓(𝑥, 𝜃, 𝜂) 𝑑𝜂 𝑑𝜃
         = ∫ 𝑓(𝑥 | 𝜃, 𝜂) 𝜋(𝜃 | 𝜂) 𝑔(𝜂) 𝑑𝜂 / ∬ 𝑓(𝑥 | 𝜃, 𝜂) 𝜋(𝜃 | 𝜂) 𝑔(𝜂) 𝑑𝜂 𝑑𝜃

These types of models usually result in intractable posteriors and need approximate solutions.
Example
Goal: estimate the death rates 𝜃𝑗, 𝑗 = A, …, L, for patients undergoing cardiac surgery in 12 hospitals A, …, L. Data are mortality counts 𝑟𝑗 out of 𝑚𝑗 surgeries, 𝑗 = A, …, L.

Stage 1: 𝑟𝑗 | 𝜃𝑗, 𝜂 ~ Binomial(𝑚𝑗, 𝜃𝑗), 𝑗 = A, …, L
Stage 2: 𝜃𝐴, …, 𝜃𝐿 | 𝜂 ~ 𝑓(𝜃 | 𝜂)
Stage 3: 𝜂 ∼ 𝜋(𝜂), 𝜂 ∈ 𝑌

For Stage 2, define 𝛽𝑗 = log(𝜃𝑗 / (1 − 𝜃𝑗)) and assume the betas are independent with

𝛽𝑗 | 𝜇, 𝜎² ~ 𝑁(𝜇, 𝜎²)

For Stage 3, assume 𝜇 and 𝜎² are independent with

𝜇 ~ 𝑁(0, 𝑐²)
𝜎² ~ IG(𝑎, 𝑏)

where IG is the inverse gamma prior distribution.
Let a = b = 0.001 and c = 10³.
Hierarchical Models: Why?
Consider the mortality rate example again. How would you model the data
without hierarchy?

At one extreme, we could consider that each hospital has its own mortality rate and estimate the means individually for each hospital. This is known as the
Advantage: Unbiased individual estimates.
Disadvantage: Low accuracy (high variance).

At the other extreme, we can consider all hospitals similar enough, and just
estimate a single pooled estimate (aka complete pooling of data).

Advantage: High accuracy. Disadvantage: No individual estimates (bias).

Hierarchical estimates are a compromise between these two extremes. One lets
the data decide on the degree of pooling (judged by the Stage 2 estimated
variance).
Non-exact Estimation
Q: What do we do when the integrals are not available in closed form?

A: There are three major alternatives to exact (calculus) solutions.

• Numerical integration: grid approximation, Riemann integration, trapezoid rule, Gaussian quadrature, ... However, these don't work well in high dimensions.
  Aside: How do we deal with integrals over infinite intervals? One approach is a change of variables, like this one (a quick numerical check appears after this list):

  ∫_{−∞}^{∞} 𝑓(𝑥) 𝑑𝑥 = ∫_{−1}^{1} 𝑓(𝑡/(1 − 𝑡²)) (1 + 𝑡²)/(1 − 𝑡²)² 𝑑𝑡

• Analytical approach, like the Laplace approximation, but it is too complicated and still doesn't work well, especially with hierarchical (multistage) models.
• Monte Carlo (MC) methods: simple and work equally well in any dimension. We will concentrate on this third option and study it in more detail.
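As a quick sanity check of that substitution (an illustrative sketch; the test integrand is my own choice, not from the slides), transform the standard normal density onto (−1, 1) and integrate numerically:

# Check the change of variables on a known integral:
# the standard normal density integrates to 1 over the whole real line
f <- function(x) dnorm(x)
g <- function(t) f(t / (1 - t^2)) * (1 + t^2) / (1 - t^2)^2   # transformed integrand on (-1, 1)
integrate(g, -1, 1)$value   # approximately 1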
Monte Carlo Estimation
Basic idea: If we want to find 𝐸[ℎ(𝜃)] = ∫ ℎ(𝜃) 𝑓(𝜃 | 𝑥) 𝑑𝜃 for some ℎ(⋅) and some 𝜃 ∼ 𝑓(𝜃) (within the Bayesian framework we typically have 𝑓(𝜃) = 𝑓(𝜃 | 𝑥) and ℎ(𝜃) = 𝜃), then we can obtain an iid sample 𝜃1, …, 𝜃𝑁 from 𝑓(𝜃) and estimate E[h(𝜃)] with

ĥ = (1/𝑁) Σ_{i=1}^{N} ℎ(𝜃𝑖)

Then by the SLLN, ĥ → E[h(𝜃)] almost surely as 𝑁 → ∞. That is, ĥ is a consistent estimator of E[h(𝜃)].

Of course, the above result assumes E[h(𝜃)] < ∞. If we assume that the second moment also exists, then we also have results about the rate of convergence in the form of the CLT.

If obtaining an iid sample is not an option, then we can employ a correlated and non-identically distributed sample in the form of a Markov chain, which leads to the Markov chain Monte Carlo methods (MCMC), studied later.
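A minimal R sketch of the Monte Carlo idea, using the Beta(7, 4) posterior from Example 1 (where the exact answers are known, so the estimates can be checked):

# Monte Carlo estimates based on an iid sample from the Beta(7, 4) posterior
set.seed(1)
N <- 100000
theta_draws <- rbeta(N, 7, 4)     # iid sample from the posterior
mean(theta_draws)                 # MC estimate of E(theta | x); exact value is 7/11
mean(theta_draws > 0.5)           # MC estimate of P(theta > 0.5 | x)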
Exercises

1. Verify algebraically that

(𝑥 − 𝜃)²/𝜎² + (𝜃 − 𝜇)²/𝜏²
  = [ (𝜃 − (𝑥𝜏² + 𝜇𝜎²)/(𝜎² + 𝜏²))² + (𝑥²𝜏² + 𝜎²𝜇²)/(𝜎² + 𝜏²) − ((𝑥𝜏² + 𝜇𝜎²)/(𝜎² + 𝜏²))² ] / [ 𝜎²𝜏²/(𝜎² + 𝜏²) ]

2. Redo the normal example with n observations

𝑋1, …, 𝑋𝑛 ∼ 𝑁(𝜃, 𝜎²)

3. Check that 𝜋(𝜃) ∝ 1 is a valid improper prior for the Gaussian model.
