19-Bayesian 2

The document discusses Bayesian inference and its comparison with the maximum likelihood approach in statistical modeling. It explains the concept of treating parameters as random variables and using prior distributions to derive posterior distributions through Bayes' theorem. Additionally, it provides examples, including a binomial model, to illustrate how to choose priors and compute posterior distributions.


STAT 5703: Statistical Inference and Modeling for Data Science
Dobrin Marchev
Last week: Markov chains, a computational tool for Bayesian models.

Bayesian Inference
• In this lecture we will define the basic terminology used in Bayesian inference.
• We will also consider some examples to illustrate the new concepts.
• Finally, we will explain why Monte Carlo methods are typically needed in Bayesian inference.
Maximum Likelihood Approach


Recall the maximum likelihood technique, which is based on a
fixed unknown parameter 𝜃 ∈ Θ, and an iid sample 𝑿 =
(𝑋1 , … , 𝑋𝑛 ) from a population with density 𝑓(𝑥 | 𝜃).

The likelihood function 𝐿(𝜃 | 𝒙) is defined as

𝐿(𝜃 | 𝒙) = 𝑓(𝒙 | 𝜃) = ∏_{i=1}^{n} 𝑓(𝑥𝑖 | 𝜃),

and the goal is to find a suitable estimate 𝜃̂ of 𝜃.

Most often the difficulties with this approach are optimization-type problems, like finding the MLE = argmax_{𝜃 ∈ Θ} 𝐿(𝜃 | 𝒙), which in turn are often reduced to solving equations.

Remark: Very often the parameter 𝜃 is a vector 𝛉 = (𝜃1 , … , 𝜃𝑘 ).


Parameters not fixed data .
More likely to
get higher
↑ values than other.
𝑓(𝒙 | 𝜃) 𝜋(𝜃)

Bayesian
Approach

a
Middle
• The main idea is that 𝜃 is
now a random variable.

• It is useful to think that 𝜃 likely to be 0 .


5
being random represents
now this obtained ?
our uncertainty about its - is

value. Apply Bayes


knowledge of parameter Theorem

a curve .
𝑓(𝜃| 𝒙)
is modelled into
Bayesian Approach
• The main idea is that 𝜃 is now a random variable.
• Therefore, 𝜃 has a distribution with a pdf.
• The information brought by the sample x is combined with prior information, specified by a prior distribution 𝜋(𝜃), to form the posterior distribution 𝑓(𝜃 | 𝒙) by using Bayes' formula:

𝑓(𝜃 | 𝒙) = 𝑓(𝒙, 𝜃) / 𝑚(𝒙) = 𝑓(𝒙 | 𝜃) 𝜋(𝜃) / 𝑚(𝒙),

where

𝑚(𝒙) = ∫ 𝑓(𝒙 | 𝜃) 𝜋(𝜃) 𝑑𝜃

is the marginal distribution of the data X (note that this can be a high-dimensional integral).


• Before seeing the data, our uncertainty is represented by 𝜋(𝜃)
and after that by 𝑓(𝜃| 𝒙).
Where does the formula come from?
Extend conditional probability to discrete random variables:

𝑃(𝑌 = 𝑦 | 𝑋 = 𝑥) = 𝑃(𝑌 = 𝑦, 𝑋 = 𝑥) / 𝑃(𝑋 = 𝑥)

And then to conditional densities:

𝑓(𝑦 | 𝑥) = 𝑓(𝑥, 𝑦) / 𝑓(𝑥)

Then Bayes' Theorem becomes:

𝑓(𝑦 | 𝑥) = 𝑓(𝑥 | 𝑦) 𝑓(𝑦) / ∫ 𝑓(𝑥 | 𝑦) 𝑓(𝑦) 𝑑𝑦

When we apply it to y = 𝜃 we obtain the posterior

𝑓(𝜃 | 𝒙) = 𝑓(𝒙 | 𝜃) 𝜋(𝜃) / 𝑚(𝒙)
Remarks
• Both procedures have advantages and disadvantages.
• The Bayesian approach has been criticized for over-reliance on convenient priors or heavy computations.
• The frequentist approach has been criticized for inflexibility (failure to incorporate prior information) and incoherence (failure to process information systematically).
• The Bayesian approach has an intuitive interpretation: the outputs of Bayesian analyses are usually probabilities that measure our (un)certainty. In the context of hypothesis testing, Bayesian analyses directly measure the probability that the null hypothesis is true, which usually provides a more straightforward interpretation. The Bayesian process is very similar to how we process information in our minds.
• For large n, as well as when the prior is uniform, the Bayesian method will provide results similar to the classical likelihood approach.
How Do You Choose a Prior?
• Hard to answer: it depends.
• If you know something about your parameter, specify it. It could be as simple as 𝜃 > 0.
• There might already have been research done on similar problems, and some "expert" knowledge might be available. This is difficult for the experts in the subject, who don't have detailed knowledge about statistics, and it's difficult for the statisticians, because they don't have the subject knowledge.
• You can also use priors based on the mean and variance of your knowledge. For example, you can say "𝜃 is around 1".
• It is possible to adopt "flat" or "non-informative" priors when no information is available, e.g., a uniform prior when the parameter lies in [0, 1].
Example: Binomial Model
Assume we have conditional iid binary data

𝑋1, 𝑋2, …, 𝑋𝑛 | 𝜃 ~ Bernoulli(𝜃)

Note that 𝑓(𝑥𝑖 | 𝜃) = 𝜃^{𝑥𝑖} (1 − 𝜃)^{1−𝑥𝑖}, 𝑖 = 1, …, 𝑛.

Then the conditional likelihood is:

𝑓(𝑥1, …, 𝑥𝑛 | 𝜃) = ∏_{i=1}^{n} 𝜃^{𝑥𝑖} (1 − 𝜃)^{1−𝑥𝑖}
                = 𝜃^{Σ_{i=1}^{n} 𝑥𝑖} (1 − 𝜃)^{Σ_{i=1}^{n} (1−𝑥𝑖)}
                = 𝜃^{Σ_{i=1}^{n} 𝑥𝑖} (1 − 𝜃)^{𝑛 − Σ_{i=1}^{n} 𝑥𝑖}

What remains to be specified is the prior distribution. In this case the parameter is 𝜃 ∈ [0,1].
Choosing a prior
The parameter is 𝜃 ∈ [0,1], so you need to specify a pdf with domain [0,1]. One family of possible distributions is the Beta(a, b). We will do this later in general. For now, we pick one of the Beta family members, the uniform distribution on [0, 1].
Uniform Prior
Suppose we have no information about the value of 𝜃. This means we assume

𝜃 ~ U(0, 1), i.e., 𝜋(𝜃) = 1, 0 ≤ 𝜃 ≤ 1

Then the posterior is

𝑓(𝜃 | 𝑥1, …, 𝑥𝑛) = 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) 𝜋(𝜃) / 𝑚(𝑥1, …, 𝑥𝑛)
                = 𝑐(𝑥1, …, 𝑥𝑛) 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) ∝ 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃)

The last line says that 𝑓(𝜃 | 𝑥1, …, 𝑥𝑛) and 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) are proportional to each other as functions of 𝜃 and only the normalizing constant 𝑐(𝑥1, …, 𝑥𝑛) = 1 / 𝑚(𝑥1, …, 𝑥𝑛) is missing.
Example 1

Suppose you are playing a new game and want to estimate your
chance 𝜃 ∈ [0,1] of winning it. You don’t know anything about
the game before you start playing, so the prior on 𝜃 is assumed
uniform(0,1).
Then you play 9 games, and they result in the following
sequence:
WLWWWLWLW
Derive and plot the posterior distribution of your chance 𝜃 of
winning.

Example 1 Solution
Given: You play 9 games, and they result in the following sequence:
WLWWWLWLW
Plot the posterior distribution of your chance of winning using a grid approximation.

Solution:
We have n = 9 trials, and 𝑥1 = 1, 𝑥2 = 0, …, 𝑥9 = 1, with Σ_{i=1}^{9} 𝑥𝑖 = 6.
Therefore, the posterior is

𝑓(𝜃 | 𝑥1, …, 𝑥9) = 𝑝(𝑥1, …, 𝑥9 | 𝜃) 𝜋(𝜃) / 𝑚(𝑥1, …, 𝑥9)
               ∝ 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) = 𝜃^{Σ𝑥𝑖} (1 − 𝜃)^{𝑛 − Σ𝑥𝑖}
               ∝ 𝜃^6 (1 − 𝜃)^3, 𝜃 ∈ [0,1]

Note: ∫_{0}^{1} 𝜃^6 (1 − 𝜃)^3 𝑑𝜃 ≠ 1, that's why the proportional sign.
Example 1: Graph using a grid
[Figure: the unnormalized posterior 𝜃^6 (1 − 𝜃)^3 evaluated on a grid of 𝜃 values in [0, 1].]
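Below is a minimal R sketch of the grid approximation described in the solution (variable names are illustrative, not taken from the course code): evaluate the unnormalized posterior 𝜃^6 (1 − 𝜃)^3 on a grid and normalize it numerically.

# Grid approximation of the posterior for Example 1 (6 wins in 9 games, uniform prior)
theta  <- seq(0, 1, length.out = 1000)          # grid of parameter values
unnorm <- theta^6 * (1 - theta)^3               # unnormalized posterior
post   <- unnorm / sum(unnorm * diff(theta)[1]) # normalize so it integrates to about 1
plot(theta, post, type = "l",
     xlab = expression(theta), ylab = "posterior density")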
Normalizing Constant
To find the marginal density of the data, aka the normalizing constant, observe the following:

𝑚(𝑥1, …, 𝑥𝑛) = ∫_{0}^{1} 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) 𝜋(𝜃) 𝑑𝜃 = ∫_{0}^{1} 𝜃^{Σ𝑥𝑖} (1 − 𝜃)^{𝑛 − Σ𝑥𝑖} 𝑑𝜃

Recall from calculus:

∫_{0}^{1} 𝜃^{𝑎−1} (1 − 𝜃)^{𝑏−1} 𝑑𝜃 = Γ(𝑎) Γ(𝑏) / Γ(𝑎 + 𝑏), for all 𝑎, 𝑏 > 0

Therefore, with 𝑎 = 𝑥 + 1, 𝑏 = 𝑛 − 𝑥 + 1, where 𝑥 = Σ_{i=1}^{n} 𝑥𝑖, we have that

𝑚(𝑥1, …, 𝑥𝑛) = Γ(𝑥 + 1) Γ(𝑛 − 𝑥 + 1) / Γ(𝑛 + 2)
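As a quick numerical check of this formula (an illustrative sketch, not part of the original slides), we can compare R's integrate() against the gamma-function expression for the Example 1 data:

# Numerical check of the normalizing constant for x = 6, n = 9
x <- 6; n <- 9
numeric_m <- integrate(function(t) t^x * (1 - t)^(n - x), 0, 1)$value
exact_m   <- gamma(x + 1) * gamma(n - x + 1) / gamma(n + 2)
c(numeric = numeric_m, exact = exact_m)   # both equal 1/840, about 0.00119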
Posterior Distribution
We can now find the posterior distribution exactly:

𝑓(𝜃 | 𝑥1, …, 𝑥𝑛) = 𝑝(𝑥1, …, 𝑥𝑛 | 𝜃) 𝜋(𝜃) / 𝑝(𝑥1, …, 𝑥𝑛)
                = 𝜃^{Σ𝑥𝑖} (1 − 𝜃)^{𝑛 − Σ𝑥𝑖} / [Γ(𝑥 + 1) Γ(𝑛 − 𝑥 + 1) / Γ(𝑛 + 2)]
                = [Γ(𝑛 + 2) / (Γ(𝑥 + 1) Γ(𝑛 − 𝑥 + 1))] 𝜃^𝑥 (1 − 𝜃)^{𝑛−𝑥}, 𝜃 ∈ (0,1)

where 𝑥 = Σ_{i=1}^{n} 𝑥𝑖.

It can even be recognized that

𝜃 | 𝑥1, …, 𝑥𝑛 ~ Beta(𝑥 + 1, 𝑛 − 𝑥 + 1)

That is, the posterior has a beta distribution with parameters x + 1 and n − x + 1.
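Since the posterior for Example 1 is Beta(7, 4), it can be plotted directly with dbeta (a small illustrative sketch):

# Exact posterior for Example 1: theta | data ~ Beta(x + 1, n - x + 1) = Beta(7, 4)
x <- 6; n <- 9
theta <- seq(0, 1, length.out = 1000)
plot(theta, dbeta(theta, x + 1, n - x + 1), type = "l",
     xlab = expression(theta), ylab = "posterior density")
abline(v = x / n, lty = 2)   # the posterior mode equals the MLE x/n = 2/3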
Aside: Beta Distribution
• The pdf of Beta(a, b) is defined for a, b > 0 by

𝑓(𝑥) = [Γ(𝑎 + 𝑏) / (Γ(𝑎) Γ(𝑏))] 𝑥^{𝑎−1} (1 − 𝑥)^{𝑏−1}, 𝑥 ∈ [0,1]

• It is available in R with the dbeta function.

[Figure: the Beta(5, 7) density, plotted in R with dbeta(x, 5, 7).]
Different Prior
Notice that the uniform distribution we used is equivalent to a beta distribution with parameters a = b = 1. That is,

U(0, 1) = Beta(1, 1)

Suppose now that the prior is any beta distribution:

𝜃 ~ Beta(a, b)

Then the posterior is

𝑓(𝜃 | 𝑥1, …, 𝑥𝑛) = 𝜃^𝑥 (1 − 𝜃)^{𝑛−𝑥} · [Γ(𝑎 + 𝑏) / (Γ(𝑎) Γ(𝑏))] 𝜃^{𝑎−1} (1 − 𝜃)^{𝑏−1} / 𝑝(𝑥1, …, 𝑥𝑛)
                ∝ 𝜃^{𝑥+𝑎−1} (1 − 𝜃)^{𝑛−𝑥+𝑏−1}, 𝜃 ∈ (0,1)

where 𝑥 = Σ_{i=1}^{n} 𝑥𝑖.
The only distribution with such a shape is Beta(a + x, b + n − x). Throughout the course we will use this trick to identify posterior distributions.
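A short R sketch of this conjugate update (the prior hyperparameters below are an arbitrary illustration, not values from the slides):

# Conjugate updating for the binomial model: Beta(a, b) prior -> Beta(a + x, b + n - x) posterior
a <- 2; b <- 2                  # illustrative prior hyperparameters (assumed for this sketch)
x <- 6; n <- 9                  # Example 1 data: 6 wins in 9 games
theta <- seq(0, 1, length.out = 1000)
plot(theta, dbeta(theta, a + x, b + n - x), type = "l",
     xlab = expression(theta), ylab = "density")   # posterior
lines(theta, dbeta(theta, a, b), lty = 2)          # prior, for comparison
legend("topleft", legend = c("posterior", "prior"), lty = c(1, 2))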
Exercise: Gaussian Model with one observation
It is easier to consider first a model with one unknown parameter and one
observation.
Suppose that 𝜎² is known and we have only one observation such that:

𝑋 | 𝜃 ∼ 𝑁(𝜃, 𝜎²)
𝜃 ∼ 𝑁(𝜇, 𝜏²)

That is,

𝑓(𝑥 | 𝜃) = (1/√(2𝜋𝜎²)) exp(−(𝑥 − 𝜃)² / (2𝜎²)), 𝑥 ∈ ℝ
𝜋(𝜃) = (1/√(2𝜋𝜏²)) exp(−(𝜃 − 𝜇)² / (2𝜏²)), 𝜃 ∈ ℝ
𝑓(𝑥, 𝜃) = (1/(2𝜋𝜎𝜏)) exp(−(1/2)[(𝑥 − 𝜃)²/𝜎² + (𝜃 − 𝜇)²/𝜏²])

Find m(x) and 𝑓(𝜃 | 𝑥). (It is possible to calculate this posterior by hand.)

Example: Normal distribution


Known variance, unknown mean
Now suppose we have a sample of Normal data:

𝑋1, …, 𝑋𝑛 ~ 𝑁(𝜃, 𝜎²)

Let us again assume we know the variance, 𝜎², and we assume a prior distribution for the mean, 𝜃, based on our prior beliefs:

𝜃 ∼ 𝑁(𝜇, 𝜏²)

Now we wish to construct the posterior distribution 𝑓(𝜃 | 𝑥1, …, 𝑥𝑛).
(See exercises)
Posterior for Normal distribution mean
The unknown parameter is the mean 𝜃.
The prior is 𝜃 ~ 𝑁(𝜇, 𝜏²). That is,

𝜋(𝜃) = (1/√(2𝜋𝜏²)) exp(−(𝜃 − 𝜇)² / (2𝜏²))

The model for the data is 𝑋𝑖 ~ 𝑁(𝜃, 𝜎²). That is,

𝑓(𝑥𝑖 | 𝜃) = (1/√(2𝜋𝜎²)) exp(−(𝑥𝑖 − 𝜃)² / (2𝜎²))

The conditional likelihood is:

∏_{i=1}^{n} (1/√(2𝜋𝜎²)) exp(−(𝑥𝑖 − 𝜃)² / (2𝜎²))
Posterior for the mean of a normal distribution
Hence, without the normalizing constant m(x), the posterior is proportional to:

𝑓(𝜃 | 𝒙) ∝ 𝜋(𝜃) 𝑓(𝒙 | 𝜃) = (1/√(2𝜋𝜏²)) exp(−(𝜃 − 𝜇)² / (2𝜏²)) × ∏_{i=1}^{n} (1/√(2𝜋𝜎²)) exp(−(𝑥𝑖 − 𝜃)² / (2𝜎²))

After using some algebra, you can prove that

Σ_{i=1}^{n} (𝑥𝑖 − 𝜃)²/𝜎² = (1/𝜎²) Σ_{i=1}^{n} 𝑥𝑖² − (2𝜃/𝜎²) Σ_{i=1}^{n} 𝑥𝑖 + 𝑛𝜃²/𝜎²

Then …

𝑓(𝜃 | 𝒙) ∝ exp{ −(1/2) 𝜃² (1/𝜏² + 𝑛/𝜎²) + 𝜃 (𝜇/𝜏² + Σ_{i=1}^{n} 𝑥𝑖/𝜎²) + const }

To obtain the exact posterior distribution, you have to complete the square… (exercise at the end).
Posterior Distribution
• When you finish the previous example as an exercise, you will find out that the posterior is a normal distribution:

𝑋1, …, 𝑋𝑛 | 𝜃 ~ 𝑁(𝜃, 𝜎²)
𝜃 ∼ 𝑁(𝜇, 𝜏²)
⇒ 𝜃 | 𝑥1, …, 𝑥𝑛 ~ 𝑁(𝜇_post, 𝜎²_post)

• Note that we ended up with a posterior distribution in the same family as the prior: normal. Such a phenomenon is known as a conjugate prior.
• How are the posterior mean and variance 𝜇_post, 𝜎²_post related to the prior mean and variance and the sample mean and variance?
• It turns out that the posterior mean is a compromise between the prior mean and the sample mean!
Precisions and means
Posterior mean and variance:

𝜇_post = 𝐸(𝜃 | 𝒙) = [𝜇/𝜏² + 𝑥̄/(𝜎²/𝑛)] / [1/𝜏² + 𝑛/𝜎²]
        = (1/𝜏²) / (1/𝜏² + 𝑛/𝜎²) · 𝜇 + (𝑛/𝜎²) / (1/𝜏² + 𝑛/𝜎²) · 𝑥̄

Posterior variance:

Var(𝜃 | 𝒙) = 1 / (1/𝜏² + 𝑛/𝜎²)

In Bayesian statistics the precision = 1/variance is often more important than the variance.

For the Normal model we have that the posterior precision = sum of prior precision and data precision, and the posterior mean is a (precision-weighted) average of the prior mean and data mean. If the sample is large, the data term dominates.
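A small R helper (illustrative, not from the slides) that applies these precision-weighting formulas for the normal model with known 𝜎²:

# Posterior mean and variance for the normal model with known sigma^2
normal_posterior <- function(x, mu, tau2, sigma2) {
  n <- length(x)
  prior_prec <- 1 / tau2          # prior precision
  data_prec  <- n / sigma2        # data precision
  post_var   <- 1 / (prior_prec + data_prec)
  post_mean  <- post_var * (prior_prec * mu + data_prec * mean(x))
  list(mean = post_mean, var = post_var)
}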
School Example
The following data are on the amount of time (in hours) students
from some high school spent on studying or homework for a
particular subject during an exam one-week period:

2.11 9.75 13.88 11.30 8.93 15.66 16.38 4.54 8.86 11.94 12.47 11.11 11.65
14.53 9.61 7.38 3.34 9.06 9.45 5.98 7.44 8.50 1.55 11.45 9.73

Suppose that we are about 90% sure that students dedicate 2-3 hours on average per credit hour, and the class was 3 credits. Assume study times vary between students with standard deviation 4 hours.

Obtain a reasonable prior, compute the posterior mean and variance, draw the posterior curve, and compare it to the prior.
School Example Prior
Suppose that we are about 90% sure that students dedicate 2-3 hours on average per credit hour, and the class was 3 credits. Assume study times vary between students with standard deviation 4 hours.

Obtain a reasonable prior:

We can take the middle of the interval to be

𝜇 = 3 × (2 + 3)/2 = 7.5

The length of the interval is approximately 2 × 1.645 × 𝜏, so:

3 × (3 − 2) = 2 × 1.645 × 𝜏 ⇒ 𝜏 = 3/3.29 = 0.9 ≈ 1

𝜎² = 4² = 16
Prior and posterior comparison

Note how the posterior has more precision (narrower curve) than the prior.
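For this example, the posterior mean and variance and the prior/posterior curves can be computed directly in R (a small illustrative sketch using the prior 𝑁(7.5, 1) and 𝜎² = 16 chosen above):

# School example: prior N(7.5, 1), within-student sd assumed 4 (sigma^2 = 16)
hours <- c(2.11, 9.75, 13.88, 11.30, 8.93, 15.66, 16.38, 4.54, 8.86, 11.94,
           12.47, 11.11, 11.65, 14.53, 9.61, 7.38, 3.34, 9.06, 9.45, 5.98,
           7.44, 8.50, 1.55, 11.45, 9.73)
mu <- 7.5; tau2 <- 1; sigma2 <- 16; n <- length(hours)
post_var  <- 1 / (1 / tau2 + n / sigma2)
post_mean <- post_var * (mu / tau2 + sum(hours) / sigma2)
theta <- seq(3, 12, length.out = 500)
plot(theta, dnorm(theta, post_mean, sqrt(post_var)), type = "l",
     xlab = expression(theta), ylab = "density")        # posterior
lines(theta, dnorm(theta, mu, sqrt(tau2)), lty = 2)     # prior
legend("topright", legend = c("posterior", "prior"), lty = c(1, 2))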
Estimating 𝜃 under the Bayesian Paradigm
Q: We found the posterior 𝑓(𝜃 | 𝑥). Now what can we do with it?

A: It can be reported as the entire posterior distribution of the parameter, but very often we report just a point estimator 𝜃̂. The most commonly used estimator is the posterior mean:

𝜃̂ = 𝐸(𝜃 | 𝑥) = ∫ 𝜃 𝑓(𝜃 | 𝑥) 𝑑𝜃

In general, we can find an estimate of ℎ(𝜃) for any integrable function ℎ(⋅) by

ĥ = 𝐸[ℎ(𝜃) | 𝑥] = ∫ ℎ(𝜃) 𝑓(𝜃 | 𝑥) 𝑑𝜃

Therefore, the Bayesian approach often results in integration problems. Some of these difficulties are:
• 𝑓(𝜃 | 𝑥) might not be available in closed form, or only partially available due to m(x) being unavailable.
• the integration ∫ ℎ(𝜃) 𝑓(𝜃 | 𝑥) 𝑑𝜃 cannot be done analytically.
Confidence Regions
• The Bayesian analogs of CIs are the confidence regions, also called credible regions, with highest posterior density:

𝐶(𝒙) = { 𝜃: 𝑓(𝜃 | 𝒙) ≥ 𝑘 }

• where k is such that

𝑃(𝜃 ∈ 𝐶(𝒙) | 𝒙) = 𝛾 = 𝑃(𝑓(𝜃 | 𝒙) ≥ 𝑘 | 𝒙)

• This leads to a different computational problem, namely solving

𝑓(𝜃 | 𝒙) = 𝑘
CI Examples
• Example (normal, continued): If the posterior is unimodal and symmetric:

𝑓(𝜃 | 𝑥) = (1/√(2𝜋 𝜎²𝜏²/(𝜎² + 𝜏²))) exp( −(𝜃 − (𝜇𝜎² + 𝑥𝜏²)/(𝜎² + 𝜏²))² / (2𝜎²𝜏²/(𝜎² + 𝜏²)) )

the credible region has the form

(𝜇𝜎² + 𝑥𝜏²)/(𝜎² + 𝜏²) ± 𝑧_{𝛼/2} √(𝜎²𝜏²/(𝜎² + 𝜏²))

• Example (binomial, continued): In this case the posterior is not symmetric:

𝜃 | 𝑥 ∼ Beta(𝑥 + 𝛼, 𝑛 − 𝑥 + 𝛽)

We must find l(x) and u(x) such that

∫_{𝑙(𝑥)}^{𝑢(𝑥)} 𝑓(𝜃 | 𝑥) 𝑑𝜃 = 0.95

and 𝑓(𝑙(𝑥) | 𝑥) = 𝑓(𝑢(𝑥) | 𝑥), which cannot be done analytically.

[Figure: an asymmetric Beta(2, 5) posterior density.]
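Numerically, the highest-posterior-density interval for a Beta posterior can be found as the shortest interval with the required probability (a sketch under the assumption that the posterior is unimodal; the function name is illustrative):

# Numerical 95% HPD interval for a Beta(a, b) posterior,
# found as the shortest interval containing 95% posterior probability
hpd_beta <- function(a, b, level = 0.95) {
  width <- function(p) qbeta(p + level, a, b) - qbeta(p, a, b)
  p_low <- optimize(width, c(0, 1 - level))$minimum   # lower-tail probability of the interval
  c(lower = qbeta(p_low, a, b), upper = qbeta(p_low + level, a, b))
}
hpd_beta(7, 4)   # e.g., for the Beta(7, 4) posterior from Example 1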
Conjugate Priors

The computational drawbacks used to be so severe that in the


past many researchers worked only with the so-called conjugate
priors.

Definition: A family of probability distributions ℱ = { 𝜋(𝜃) } is conjugate for the model 𝑓(𝑥 | 𝜃) if

𝑓(𝜃 | 𝑥) ∈ ℱ

Note: The Gaussian example was conjugate, because the prior had a normal distribution, and the posterior also had a normal distribution.
Example: Beta-Binomial Model
Let

𝑥 | 𝜃 ∼ Binom(𝑛, 𝜃)
𝜃 ∼ Beta(𝛼, 𝛽)

That is,

𝑓(𝑥 | 𝜃) = 𝑃(𝑋 = 𝑥 | 𝜃) = (𝑛 choose 𝑥) 𝜃^𝑥 (1 − 𝜃)^{𝑛−𝑥}
𝜋(𝜃) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1}
⇒ 𝑓(𝑥, 𝜃) = (𝑛 choose 𝑥) [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼+𝑥−1} (1 − 𝜃)^{𝑛+𝛽−𝑥−1}
⇒ 𝜃 | 𝑥 ∼ Beta(𝛼 + 𝑥, 𝑛 + 𝛽 − 𝑥)

Therefore, the prior is Beta distributed and the posterior is Beta distributed, meaning the Beta distribution is a conjugate prior.

Note: See R examples.

Improper Priors
Sometimes we may wish that 𝜋(𝜃) contains no information about 𝜃. For example, if Θ = [𝑎, 𝑏], then 𝜋(𝜃) = 1/(𝑏 − 𝑎), i.e. 𝜃 ∼ 𝑈(𝑎, 𝑏), is a natural choice when Θ is bounded.

However, if, say, Θ = ℝ and 𝜋(𝜃) = 𝑐, then

∫ 𝜋(𝜃) 𝑑𝜃 = ∫_{−∞}^{∞} 𝑐 𝑑𝜃 = ∞

Definition: When ∫ 𝜋(𝜃) 𝑑𝜃 = ∞ the prior is called improper.

If 𝜋(𝜃) is improper, then Bayesian inference is still possible, provided 𝑓(𝜃 | 𝑥) is a proper density. For example, if 𝜋(𝜃) = 𝑐, then we must have ∫ 𝑓(𝑥 | 𝜃) 𝑑𝜃 = 𝑘 < ∞. In such cases

𝑓(𝜃 | 𝑥) = 𝑓(𝑥 | 𝜃) 𝜋(𝜃) / ∫ 𝑓(𝑥 | 𝜃) 𝜋(𝜃) 𝑑𝜃 = 𝑐 𝑓(𝑥 | 𝜃) / ∫ 𝑐 𝑓(𝑥 | 𝜃) 𝑑𝜃 = 𝑓(𝑥 | 𝜃) / 𝑘
Multistage Models
The basic Bayesian model has two stages:
Stage 1: 𝑓(𝑥 | 𝜃, 𝜂)
Stage 2: 𝑓(𝜃 | 𝜂)
where 𝜂 is a vector of hyperparameters. For example, in the Beta-Binomial model 𝜂 = (𝛼, 𝛽).

If we are not sure about the specific value of the hyperparameter 𝜂, then we can have a third stage by placing a hyperprior on it:
Stage 3: 𝜂 ∼ 𝑔(𝜂), 𝜂 ∈ 𝑌
Then the posterior is calculated as:

𝑓(𝜃 | 𝑥) = 𝑓(𝑥, 𝜃) / 𝑚(𝑥) = ∫ 𝑓(𝑥, 𝜃, 𝜂) 𝑑𝜂 / ∬ 𝑓(𝑥, 𝜃, 𝜂) 𝑑𝜂 𝑑𝜃
         = ∫ 𝑓(𝑥 | 𝜃, 𝜂) 𝜋(𝜃 | 𝜂) 𝑔(𝜂) 𝑑𝜂 / ∬ 𝑓(𝑥 | 𝜃, 𝜂) 𝜋(𝜃 | 𝜂) 𝑔(𝜂) 𝑑𝜂 𝑑𝜃

These types of models usually result in intractable posteriors and need approximate solutions.
Example
Goal: estimate the death rates 𝜃𝑗, 𝑗 = A, …, L, for patients undergoing cardiac surgery in 12 hospitals A, …, L. Data are mortality counts 𝑟𝑗 out of 𝑚𝑗 surgeries, 𝑗 = A, …, L.

Stage 1: 𝑟𝑗 | 𝜃𝑗, 𝜂 ~ Binomial(𝑚𝑗, 𝜃𝑗), 𝑗 = A, …, L
Stage 2: 𝜃𝐴, …, 𝜃𝐿 | 𝜂 ~ 𝑓(𝜃 | 𝜂)
Stage 3: 𝜂 ∼ 𝜋(𝜂), 𝜂 ∈ 𝑌

For Stage 2, define 𝛽𝑗 = log(𝜃𝑗 / (1 − 𝜃𝑗)) and assume the betas are independent with

𝛽𝑗 | 𝜇, 𝜎² ~ 𝑁(𝜇, 𝜎²)

For Stage 3, assume 𝜇 and 𝜎² are independent with

𝜇 ~ 𝑁(0, 𝑐²)
𝜎² ~ IG(𝑎, 𝑏)

where IG is the inverse gamma prior distribution.
Let a = b = 0.001 and c = 10³.
Hierarchical Models: Why?
Consider the mortality rate example again. How would you model the data
without hierarchy?

At one extreme, we could consider that each hospital has its own mortality rate and estimate the means individually for each hospital. This is known as the
Advantage: Unbiased individual estimates.
Disadvantage: Low accuracy (high variance).

At the other extreme, we can consider all hospitals similar enough, and just
estimate a single pooled estimate (aka complete pooling of data).

Advantage: High accuracy. Disadvantage: No individual estimates (bias).

Hierarchical estimates are a compromise between these two extremes. One lets
the data decide on the degree of pooling (judged by the Stage 2 estimated
variance).
Non-exact Estimation
Q: What do we do when the integrals are not available in closed form?

A: There are three major alternatives to exact (calculus) solutions.

• Numerical integration: grid approximation, Riemann integration, trapezoid rule, Gaussian quadrature, ... However, these don't work well in high dimensions.
  Aside: How do we deal with integrals over infinite intervals? One approach is a change of variables, like this one (a quick numerical check appears after this list):

  ∫_{−∞}^{∞} 𝑓(𝑥) 𝑑𝑥 = ∫_{−1}^{1} 𝑓(𝑡/(1 − 𝑡²)) (1 + 𝑡²)/(1 − 𝑡²)² 𝑑𝑡

• Analytical approach, like the Laplace approximation, but it is too complicated and still doesn't work well, especially with hierarchical (multistage) models.
• Monte Carlo (MC) methods: simple and work equally well in any dimension. We will concentrate on this third option and study it in more detail.
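As a quick sanity check of that substitution (an illustrative sketch; the test integrand is my own choice, not from the slides), transform the standard normal density onto (−1, 1) and integrate numerically:

# Check the change of variables on a known integral:
# the standard normal density integrates to 1 over the whole real line
f <- function(x) dnorm(x)
g <- function(t) f(t / (1 - t^2)) * (1 + t^2) / (1 - t^2)^2   # transformed integrand on (-1, 1)
integrate(g, -1, 1)$value   # approximately 1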
Monte Carlo Estimation
Basic idea: If we want to find 𝐸[ℎ(𝜃)] = ∫ ℎ(𝜃) 𝑓(𝜃 | 𝑥) 𝑑𝜃 for some ℎ(⋅) and some 𝜃 ∼ 𝑓(𝜃) (within the Bayesian framework we typically have 𝑓(𝜃) = 𝑓(𝜃 | 𝑥) and ℎ(𝜃) = 𝜃), then we can obtain an iid sample 𝜃1, …, 𝜃𝑁 from 𝑓(𝜃) and estimate E[h(𝜃)] with

ĥ = (1/𝑁) Σ_{i=1}^{N} ℎ(𝜃𝑖)

Then by the SLLN, ĥ → E[h(𝜃)] almost surely as 𝑁 → ∞. That is, ĥ is a consistent estimator of E[h(𝜃)].

Of course, the above result assumes E[h(𝜃)] < ∞. If we assume that the second moment also exists, then we also have results about the rate of convergence in the form of the CLT.

If obtaining an iid sample is not an option, then we can employ a correlated and non-identically distributed sample in the form of a Markov chain, which leads to the Markov chain Monte Carlo methods (MCMC), studied later.
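A minimal R sketch of the Monte Carlo idea, using the Beta(7, 4) posterior from Example 1 (where the exact answers are known, so the estimates can be checked):

# Monte Carlo estimates based on an iid sample from the Beta(7, 4) posterior
set.seed(1)
N <- 100000
theta_draws <- rbeta(N, 7, 4)     # iid sample from the posterior
mean(theta_draws)                 # MC estimate of E(theta | x); exact value is 7/11
mean(theta_draws > 0.5)           # MC estimate of P(theta > 0.5 | x)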
Exercises

1. Verify algebraically that

(𝑥 − 𝜃)²/𝜎² + (𝜃 − 𝜇)²/𝜏²
  = [ (𝜃 − (𝑥𝜏² + 𝜇𝜎²)/(𝜎² + 𝜏²))² + (𝑥²𝜏² + 𝜎²𝜇²)/(𝜎² + 𝜏²) − ((𝑥𝜏² + 𝜇𝜎²)/(𝜎² + 𝜏²))² ] / [ 𝜎²𝜏²/(𝜎² + 𝜏²) ]

2. Redo the normal example with n observations

𝑋1, …, 𝑋𝑛 ∼ 𝑁(𝜃, 𝜎²)

3. Check that 𝜋(𝜃) ∝ 1 is a valid improper prior for the Gaussian model.
