19 - Bayesian 2

Last week: Markov chains, a computational tool for Bayesian models.
Bayesian Inference
• In this lecture we will define the basic terminology used in Bayesian
inference.
• Recall the likelihood of an iid sample 𝒙 = (x₁, …, xₙ):

L(θ | 𝒙) = f(𝒙 | θ) = ∏ᵢ₌₁ⁿ f(xᵢ | θ)

• In the frequentist approach, θ is a fixed unknown constant and the likelihood is maximized with respect to it.
Bayesian Approach
• The main idea is that 𝜃 is now a random variable.
• Therefore, 𝜃 has a distribution with a pdf.
• The information brought by the sample 𝒙 is combined with prior information, specified by a prior distribution π(θ), to form the posterior distribution π(θ | 𝒙) by using Bayes' formula:
π(θ | 𝒙) = f(𝒙 | θ) π(θ) / m(𝒙)

The numerator is the joint density f(𝒙, θ) = f(𝒙 | θ) π(θ): the likelihood (the conditional distribution of the data) times the chosen prior. The denominator is the marginal density of the data,

m(𝒙) = ∫ f(𝒙 | θ) π(θ) dθ,

which can be a high-dimensional integral (e.g. 15-dimensional, if θ is a 15-dimensional vector).
Then, for a generic unknown y, Bayes' Theorem becomes:

f(y | x) = f(x | y) f(y) / ∫ f(x | y) f(y) dy
• You can also use priors based on the mean and variance of your knowledge. For example, you can say "θ is around 1".
• It is possible to adopt "flat" or "non-informative" priors when no information is available, e.g. a uniform prior when the parameter lies within [0, 1].
Example: Binomial Model
Assume we have conditionally iid binary data, with a prior on the parameter θ:

X₁, X₂, …, Xₙ | θ ~ Bernoulli(θ)

f(x₁, …, xₙ | θ) = ∏ᵢ₌₁ⁿ θ^{xᵢ} (1 − θ)^{1−xᵢ}
                 = θ^{Σᵢ xᵢ} (1 − θ)^{Σᵢ (1−xᵢ)}
                 = θ^{Σᵢ xᵢ} (1 − θ)^{n − Σᵢ xᵢ}
What remains to be specified is the prior distribution. In this case
the parameter is 𝜃 ∈ [0,1]
Choosing a prior
• The parameter is θ ∈ [0, 1].
• You need to specify a pdf with domain [0, 1].
• One family of possible distributions is the Beta(a, b). We will do this later in general.
• For now, we pick one of the Beta family members, the uniform distribution on [0, 1].
Uniform Prior
Taking π(θ) = 1 on [0, 1], so we are not multiplying the likelihood by anything that depends on θ:

f(θ | x₁, …, xₙ) = p(x₁, …, xₙ | θ) π(θ) / m(x₁, …, xₙ)
                 = c(x₁, …, xₙ) p(x₁, …, xₙ | θ)
                 ∝ p(x₁, …, xₙ | θ)
Example 1
Suppose you are playing a new game and want to estimate your chance θ ∈ [0, 1] of winning it. You don't know anything about the game before you start playing, so the prior on θ is assumed uniform(0, 1).
Then you play 9 games, and they result in the following sequence:
WLWWWLWLW
Derive and plot the posterior distribution of your chance θ of winning.
Example 1 Solution
Given: You play 9 games, and they result in the following sequence:
WLWWWLWLW
Plot the posterior distribution of your chance of winning using a grid approximation.
Solution:
We have n = 9 trials, and x₁ = 1, x₂ = 0, …, x₉ = 1, so Σᵢ xᵢ = 6.
Therefore, the posterior (not normalized, leaving out the constant) is

f(θ | x₁, …, x₉) = p(x₁, …, x₉ | θ) π(θ) / m(x₁, …, x₉)
                 ∝ p(x₁, …, x₉ | θ) = θ^{Σᵢ xᵢ} (1 − θ)^{n − Σᵢ xᵢ}
                 ∝ θ⁶ (1 − θ)³,  θ ∈ [0, 1].

Note: ∫₀¹ θ⁶ (1 − θ)³ dθ ≠ 1, that's why the proportionality sign.
Example 1 Graph Using a Grid
In general we won't be able to find the normalizing integral in closed form, and normalizing the posterior can be computationally expensive; note that the normalized posterior differs from the likelihood by a constant here.
[Figure: the unnormalized posterior θ⁶(1 − θ)³ evaluated on a grid over [0, 1]; it peaks near θ ≈ 0.67, though values as low as θ ≈ 0.2 remain plausible.]
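To reproduce this plot, here is a minimal R sketch of the grid approximation (the variable names are my own, not from the lecture):

# Grid approximation of the posterior for the game example
x <- c(1, 0, 1, 1, 1, 0, 1, 0, 1)              # WLWWWLWLW coded as W = 1, L = 0
n <- length(x); s <- sum(x)                    # n = 9 trials, s = 6 wins
theta <- seq(0, 1, length.out = 1000)          # grid over [0, 1]
post <- theta^s * (1 - theta)^(n - s)          # unnormalized posterior
post <- post / (sum(post) * (theta[2] - theta[1]))   # normalize so it integrates to ~1
plot(theta, post, type = "l", xlab = expression(theta), ylab = "posterior density")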
Normalizing Constant
To find the marginal density of the data, aka the normalizing constant, observe the following:

m(x₁, …, xₙ) = ∫₀¹ p(x₁, …, xₙ | θ) π(θ) dθ = ∫₀¹ θ^{Σᵢ xᵢ} (1 − θ)^{n − Σᵢ xᵢ} dθ

Recall from calculus:

∫₀¹ θ^{a−1} (1 − θ)^{b−1} dθ = Γ(a) Γ(b) / Γ(a + b),  for all a, b > 0.

So here the normalizing constant can be computed analytically as well (with pen and paper!); dividing the unnormalized posterior by it gives the posterior density.
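As a worked check for the game example (my own arithmetic, following directly from the formula above): with Σᵢ xᵢ = 6 and n − Σᵢ xᵢ = 3, take a = 7 and b = 4, so

m(x₁, …, x₉) = ∫₀¹ θ⁶ (1 − θ)³ dθ = Γ(7) Γ(4) / Γ(11) = (6! · 3!) / 10! = 4320 / 3628800 = 1/840,

and the normalized posterior is f(θ | 𝒙) = 840 θ⁶ (1 − θ)³, which is the Beta(7, 4) density.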
Beta Distribution
If X ~ Beta(a, b), its density is

f(x) = [Γ(a + b) / (Γ(a) Γ(b))] x^{a−1} (1 − x)^{b−1},  x ∈ [0, 1]

• It is available in R with the dbeta function.
[Figure: the Beta(5, 7) density on [0, 1].]
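As a quick check in R (a sketch; Beta(7, 4) is the posterior derived above for the game example):

# Exact posterior for the game example via dbeta
theta <- seq(0, 1, length.out = 1000)
exact <- dbeta(theta, 7, 4)
plot(theta, exact, type = "l", xlab = expression(theta), ylab = "density")
max(abs(exact - 840 * theta^6 * (1 - theta)^3))   # ~0: matches the analytic form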
Different Prior
Notice that the uniform distribution we used is equivalent to a beta distribution with parameters a = b = 1. That is, U(0, 1) = Beta(1, 1).
Suppose now that the prior is any beta distribution: θ ~ Beta(a, b).
Then the posterior is

f(θ | x₁, …, xₙ) ∝ θ^{Σᵢ xᵢ + a − 1} (1 − θ)^{n − Σᵢ xᵢ + b − 1},

that is, θ | x₁, …, xₙ ~ Beta(a + Σᵢ xᵢ, b + n − Σᵢ xᵢ): any Beta prior ends up with a Beta posterior.
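A minimal R helper for this update (the function name is my own, not from the lecture):

# Conjugate Beta-Bernoulli update: prior Beta(a, b), binary data x
beta_update <- function(x, a = 1, b = 1) {
  c(a = a + sum(x), b = b + length(x) - sum(x))
}
beta_update(c(1, 0, 1, 1, 1, 0, 1, 0, 1))   # game data: posterior Beta(7, 4)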
Normal Model
Suppose now a normal model with known variance σ² and a normal prior on θ:

X | θ ~ N(θ, σ²)
θ ~ N(μ, τ²)

That is,

f(x | θ) = (1/√(2πσ²)) exp(−(x − θ)² / (2σ²)),  x ∈ ℝ
π(θ) = (1/√(2πτ²)) exp(−(θ − μ)² / (2τ²)),  θ ∈ ℝ

and the joint density is

f(x, θ) = (1/(2πστ)) exp(−½ [(x − θ)²/σ² + (θ − μ)²/τ²])
For an iid sample 𝒙 = (x₁, …, xₙ) with sample mean x̄, the posterior mean is

μ_post = E[θ | 𝒙] = (μ/τ² + x̄/(σ²/n)) / (1/τ² + n/σ²)
       = [(1/τ²) / (1/τ² + n/σ²)] μ + [(n/σ²) / (1/τ² + n/σ²)] x̄

and the posterior variance is

Var(θ | 𝒙) = 1 / (1/τ² + n/σ²).

The denominator is the total precision: prior precision plus sample precision. If you have a large sample, the data term will dominate.
For the Normal model we have that the posterior precision = sum
of prior precision and data precision, and the posterior mean is a
(precision weighted) average of the prior mean and data mean.
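A short R sketch of this update (function and argument names are my own):

# Posterior mean and variance for the normal model with known sigma
# Data: n observations with mean xbar; prior: theta ~ N(mu, tau^2)
normal_posterior <- function(xbar, n, sigma, mu, tau) {
  prec <- 1 / tau^2 + n / sigma^2                        # total precision
  post_mean <- (mu / tau^2 + n * xbar / sigma^2) / prec  # precision-weighted average
  c(mean = post_mean, var = 1 / prec)
}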
School Example
The following data are on the amount of time (in hours) students from some high school spent studying or doing homework for a particular subject during a one-week exam period:
2.11 9.75 13.88 11.30 8.93 15.66 16.38 4.54 8.86 11.94 12.47 11.11 11.65
14.53 9.61 7.38 3.34 9.06 9.45 5.98 7.44 8.50 1.55 11.45 9.73
Suppose that we are about 90% sure that students dedicate 2-3 hours on average per credit hour, and that the class was 3 credits. Assume study times vary between students with standard deviation 4 hours.
[Figure: prior and posterior densities for the school example. Note how the posterior has more precision (narrower curve) than the prior.]
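One way to turn the stated information into a prior, using the helper above (the translation is my own reading of the problem: 2-3 hours per credit hour over 3 credits gives a 90% prior range of 6-9 hours for the mean, so μ = 7.5 and 1.645τ = 1.5):

hours <- c(2.11, 9.75, 13.88, 11.30, 8.93, 15.66, 16.38, 4.54, 8.86, 11.94, 12.47,
           11.11, 11.65, 14.53, 9.61, 7.38, 3.34, 9.06, 9.45, 5.98, 7.44, 8.50,
           1.55, 11.45, 9.73)
mu <- 7.5; tau <- 1.5 / qnorm(0.95)    # prior N(7.5, tau^2) puts 90% mass on [6, 9]
sigma <- 4                             # between-student sd (given)
normal_posterior(mean(hours), length(hours), sigma, mu, tau)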
Estimating θ under the Bayesian Paradigm
Q: We found the posterior f(θ | 𝒙). Now what can we do with it?
A: One option is to report a credible region C(𝒙), chosen so that

P(θ ∈ C(𝒙) | 𝒙) = γ = P(f(θ | 𝒙) ≥ k | 𝒙)

• This leads to a different kind of interval, which is not necessarily symmetric; its boundary is where f(θ | 𝒙) = k.
CI Examples
• Example (normal, continued): the posterior is unimodal and symmetric,

f(θ | x) = (1 / √(2π σ²τ²/(σ² + τ²))) exp( −(θ − (μσ² + xτ²)/(σ² + τ²))² / (2 σ²τ²/(σ² + τ²)) ),

so the credible region has the form

(μσ² + xτ²)/(σ² + τ²) ± z_{α/2} √(σ²τ²/(σ² + τ²))
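In R this is a few lines (a sketch; the values of μ, τ, x, σ below are illustrative, not from the lecture):

mu <- 0; tau <- 1; x <- 2; sigma <- 1                          # illustrative values
post_mean <- (mu * sigma^2 + x * tau^2) / (sigma^2 + tau^2)
post_sd   <- sqrt(sigma^2 * tau^2 / (sigma^2 + tau^2))
post_mean + c(-1, 1) * qnorm(0.975) * post_sd                  # 95% credible interval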
• Example (binomial, continued): the posterior

θ | x ~ Beta(x + α, n − x + β)

is in general not symmetric (it can have more area on one side of the mode). We must find l(x) and u(x) such that

∫_{l(x)}^{u(x)} f(θ | x) dθ = 0.95,

which typically cannot be done analytically.
[Figure: an asymmetric Beta posterior density on [0, 1], with more area to the left of the mode.]
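For an equal-tailed interval, R's beta quantile function does the numerical work (using the game example's Beta(7, 4) posterior):

qbeta(c(0.025, 0.975), 7, 4)   # 95% equal-tailed credible interval for theta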
Conjugate Priors
When the prior and the posterior belong to the same family, the prior is called conjugate. For the binomial model, θ ~ Beta(α, β) ⇒ θ | x ∼ Beta(α + x, n + β − x).

Improper Priors
Some priors do not integrate to 1; for example, a flat prior π(θ) = c on ℝ gives ∫_{−∞}^{∞} π(θ) dθ = ∫_{−∞}^{∞} c dθ = ∞.
Definition: When ∫_{−∞}^{∞} π(θ) dθ = ∞ the prior is called improper.
If we are not sure about the specific value of the hyperparameter 𝜂, then we
can have a third stage by placing a hyperprior on it:
Stage 3: η ∼ g(η), η ∈ Y
Then the posterior is calculated as:

f(θ | x) = f(x, θ) / m(x) = ∫ f(x, θ, η) dη / ∫∫ f(x, θ, η) dη dθ
         = ∫ f(x | θ, η) π(θ | η) g(η) dη / ∫∫ f(x | θ, η) π(θ | η) g(η) dη dθ

These types of models usually result in intractable posteriors and need approximate solutions.
Example
Goal: estimate death rates θⱼ, j = A, …, L after undergoing cardiac surgery in 12 hospitals A, …, L. Data are mortality counts rⱼ out of mⱼ surgeries, j = A, …, L.

Stage 1: f(rⱼ | θⱼ, η) ~ Binomial(mⱼ, θⱼ), j = A, …, L
Stage 2: θ_A, …, θ_L | η ~ f(θ | η)
Stage 3: η ∼ π(η), η ∈ Y

For Stage 2, define βⱼ = log(θⱼ / (1 − θⱼ)) and assume the betas are independent with

βⱼ | μ, σ² ~ N(μ, σ²)
At one extreme, we could consider that each hospital has its own mortality rate and estimate the rates individually for each hospital. This is known as the no pooling of data approach.
Advantage: unbiased individual estimates.
Disadvantage: low accuracy (high variance).
At the other extreme, we can consider all hospitals similar enough and just compute a single pooled estimate (aka complete pooling of data).
Hierarchical estimates are a compromise between these two extremes: one lets the data decide on the degree of pooling (judged by the Stage 2 estimated variance).
Non-exact Estimation
Q: What do we do when the integrals are not available in closed form?
A: Monte Carlo. If we can draw an iid sample θ⁽¹⁾, …, θ⁽ᴺ⁾ from the posterior, then by the law of large numbers the sample average (1/N) Σᵢ h(θ⁽ⁱ⁾) converges to E[h(θ) | 𝒙].
Of course, the above result assumes E[h(θ)] < ∞. If we assume that the second moment also exists, then we also have results about the rate of convergence in the form of the CLT.
If obtaining an iid sample is not an option, then we can employ a correlated and non-identically distributed sample in the form of a Markov chain, which leads to the Markov chain Monte Carlo (MCMC) methods, studied later.
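A minimal R illustration of the iid Monte Carlo idea (my own example, reusing the game posterior Beta(7, 4), which we can sample from directly):

set.seed(1)
draws <- rbeta(1e5, 7, 4)     # iid sample from the posterior
mean(draws)                   # Monte Carlo estimate of E[theta | x]; exact value 7/11
mean(draws > 0.5)             # e.g. posterior probability that theta exceeds 0.5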
Exercises