Mathematical Statistics 2 (MS2)

Lecture 1: Introduction to Bayesian inference

Amine Hadji <[email protected]>


Leiden University February 9, 2022
1 Introduction to Bayesian inference
1.1 Introduction

About this course

• Introduction to theory of Bayesian inference

• Implementation of common algorithms

• 3 Lectures + 3 Computer labs

• One practical assignment (plus possibly a presentation)


In this chapter, we will discuss:


• Bayes' theorem & the Bayesian approach

• Prior & posterior distribution

• Bayesian inference (estimation, testing)

• Choice of the prior


Bayesian and frequentist probabilities


Frequentism: The probability of an event A is the limit of the frequency of the occurrence of A in a repeated experiment.

Bayesianism: The probability of an event A is a quantification of the belief one has that A occurs in a specific experiment.

Both philosophies have pros and cons. Those will not be discussed in this
course.


Bayes’ rule

Proposition 1.1 (Bayes’ theorem)


Let (Ω, F, P) be a probability space with sample space Ω and probability measure P. Let A, B ∈ F be two events such that P(A) ≠ 0. Then

P(B|A) = P(A|B)P(B) / P(A)
       = P(A|B)P(B) / ( P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ) ).


Example - Sensitivity and specificity


A medical test for disease X has outcomes positive and negative. The
disease has a prevalence of 1% (i.e. P(sick)=1%). We are given the
sensitivity and the specificity of the test:

• sensitivity (i.e. P(+ | sick)): 90%

• specificity (i.e. P(– | healthy)): 99%


You get tested positive for disease X. What is the probability you are
actually sick?
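
A quick numerical check of this example (a minimal Python sketch using only the numbers stated above; the answer is perhaps surprisingly low):

prior_sick = 0.01                      # prevalence P(sick)
p_pos_given_sick = 0.90                # sensitivity P(+ | sick)
p_pos_given_healthy = 1 - 0.99         # false-positive rate, i.e. 1 - specificity
p_pos = p_pos_given_sick * prior_sick + p_pos_given_healthy * (1 - prior_sick)
print(p_pos_given_sick * prior_sick / p_pos)   # ≈ 0.476: fewer than half of the positives are actually sick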

1 Introduction to Bayesian inference
1.2 Bayesian approach

Posterior probability - Binary case


Let θ and Y be two binary random variables (both taking values in {0, 1}). Then

P(θ = 1|Y = y) = P(Y = y|θ = 1)P(θ = 1) / ( P(Y = y|θ = 1)P(θ = 1) + P(Y = y|θ = 0)P(θ = 0) )

Terminology:
• P(θ = 1), P(θ = 0) are prior probabilities

• P(Y = y |θ = 1) is the likelihood

• P(θ = 1|Y = y ) is the posterior probability


Posterior probability - Categorical case


Let now θ ∈ {θ1, ..., θK} be a categorical random variable, and Y an arbitrary discrete random variable. Then

P(θ = θk|Y = y) = P(Y = y|θ = θk)P(θ = θk) / Σ_{i=1}^K P(Y = y|θ = θi)P(θ = θi)

Short-hand notation
P(θk|y) = P(y|θk)P(θk) / Σ_{i=1}^K P(y|θi)P(θi)
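
A minimal Python sketch of this update (the prior and likelihood values below are made-up illustration numbers, not from the slides):

import numpy as np

prior = np.array([0.2, 0.5, 0.3])           # P(θ_k), k = 1, ..., K
likelihood = np.array([0.10, 0.40, 0.05])   # P(y | θ_k) for the observed y
posterior = prior * likelihood / np.sum(prior * likelihood)
print(posterior)                             # posterior probabilities, summing to 1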


Posterior probability - Continuous version


Let now θ be a continuous random variable, and Y an arbitrary random variable with L(θ|Y = y) the likelihood of θ. Then

P(θ|y) = P(θ)L(θ|Y = y) / ∫ P(θ)L(θ|Y = y)dθ

P(θ|y ) ∝ P(θ)L(θ|Y = y )
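
When the normalizing integral has no convenient closed form, the posterior density can be approximated numerically. A minimal Python sketch (the Beta(2, 2) prior and the Binomial data are arbitrary illustration choices, not from the slides):

import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 1000)          # grid over the parameter space
dtheta = theta[1] - theta[0]
prior = stats.beta.pdf(theta, 2, 2)              # prior density P(θ)
likelihood = stats.binom.pmf(7, 10, theta)       # L(θ | Y = y) with y = 7 successes in n = 10 trials
unnorm = prior * likelihood                      # P(θ) L(θ | Y = y)
posterior = unnorm / (unnorm.sum() * dtheta)     # divide by a Riemann-sum estimate of ∫ P(θ) L(θ | y) dθ
print(posterior.sum() * dtheta)                  # ≈ 1, as a density should integrate to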

1 Introduction to Bayesian inference
1.3 Prior & posterior distribution

Examples

• Binomial case (sketched briefly after this list)

• Poisson case

• Gaussian case (σ² known)
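
For the binomial case, a standard computation (sketched here for reference, not worked out on the slide): if Y|θ ~ Bin(n, θ) and θ ~ Beta(a, b), then P(θ|y) ∝ θ^(a+y−1)(1 − θ)^(b+n−y−1), i.e. θ|y ~ Beta(a + y, b + n − y). The Poisson and Gaussian cases work analogously with Gamma and Gaussian priors, respectively.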

1 Introduction to Bayesian inference
1.4 Bayesian inference

Posterior summary measures

• Posterior mode, mean, median

• Posterior variance

• Credible intervals

• Bayesian hypothesis testing


Posterior mode

θ̂M = arg max_θ P(θ|y)

Proposition 1.2
Let Y = (Y1, ..., Yn)|θ be a conditionally iid sample, and θ a random variable on Θ with prior distribution P(θ) and posterior distribution P(θ|y) := P(θ|Y = y). Let θ̂M be the posterior mode. The following properties hold:
• if P(θ) is constant for all possible values of θ, then θ̂M = θ̂MLE (illustrated below)

• there exist bijective functions h : Θ → ∆ for which the posterior mode of h(θ) is not h(θ̂M), i.e. the posterior mode is not invariant under reparametrization.
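
As an illustration of the first point (a standard example, not from the slides): if Y|θ ~ Bin(n, θ) with the flat prior θ ~ Uniform(0, 1), then P(θ|y) ∝ θ^y (1 − θ)^(n−y), which is maximized at θ̂M = y/n, exactly the maximum likelihood estimator.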


Posterior mean
θ̂ = E[θ|y] = ∫ θ P(θ|y)dθ

Proposition 1.3
Let Y = (Y1, ..., Yn)|θ be a conditionally iid sample, and θ a random variable on Θ with prior distribution P(θ) and posterior distribution P(θ|y) := P(θ|Y = y). Let θ̂ be the posterior mean. The following properties hold:
• θ̂ = arg min_{θ* ∈ Θ} ∫ (θ − θ*)² P(θ|y)dθ

• there exist bijective functions h : Θ → ∆ for which the posterior mean of h(θ) is not h(θ̂), i.e. the posterior mean is not invariant under reparametrization.
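
A minimal Python sketch of the posterior mean in the conjugate Beta-Binomial setting (the prior parameters and data are made-up illustration values):

from scipy import stats

a, b = 2, 2                                # hypothetical Beta(2, 2) prior
n, y = 10, 7                               # hypothetical data: 7 successes in 10 trials
posterior = stats.beta(a + y, b + n - y)   # conjugate posterior Beta(a + y, b + n - y)
print(posterior.mean())                    # posterior mean (a + y)/(a + b + n) = 9/14 ≈ 0.643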


Posterior median

θ̂Med = ½ ( inf{θ* : P(θ ≤ θ*|y) ≥ 1/2} + sup{θ* : P(θ ≥ θ*|y) ≥ 1/2} )

Proposition 1.4
Let Y = (Y1, ..., Yn)|θ be a conditionally iid sample, and θ a random variable on Θ with prior distribution P(θ) and posterior distribution P(θ|y) := P(θ|Y = y). Let θ̂Med be the posterior median. The following properties hold:
• θ̂Med = arg min_{θ* ∈ Θ} ∫ |θ − θ*| P(θ|y)dθ

• all bijective functions h : Θ → ∆ verify that the posterior median of h(θ) equals h(θ̂Med).
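
Continuing the Beta-Binomial sketch above, the posterior median is the 0.5-quantile of the posterior (same hypothetical numbers):

from scipy import stats

posterior = stats.beta(9, 5)     # Beta(2 + 7, 2 + 10 - 7) posterior from the earlier sketch
print(posterior.ppf(0.5))        # posterior median ≈ 0.65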


Posterior variance

Var(θ|y) = E[(θ − θ̂)²|y] = ∫ (θ − θ̂)² P(θ|y)dθ

Proposition 1.5
Let Y = (Y1, ..., Yn)|θ be a conditionally iid sample, and θ a random variable on Θ with prior distribution P(θ) and posterior distribution P(θ|y) := P(θ|Y = y). Let θ̂ be the posterior mean and Var(θ|y) the posterior variance. The following properties hold:

Var(θ|y) = E[θ²|y] − θ̂²

Var(θ) = E[Var(θ|y)] + Var(θ̂)
(The posterior variance is on average smaller than the prior variance; a numerical check follows below.)
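
A quick Monte Carlo check of the second identity in the Beta-Binomial setting (a sketch; the prior parameters and sample size are arbitrary illustration values):

import numpy as np

rng = np.random.default_rng(0)
a, b, n = 2.0, 3.0, 10                     # hypothetical Beta(2, 3) prior and Binomial(n, θ) data
theta = rng.beta(a, b, size=200_000)       # θ drawn from the prior
y = rng.binomial(n, theta)                 # Y drawn given θ
post_a, post_b = a + y, b + n - y          # conjugate Beta posterior parameters for each simulated y
post_mean = post_a / (post_a + post_b)
post_var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
print(theta.var())                         # prior variance Var(θ) ≈ 0.04
print(post_var.mean() + post_mean.var())   # E[Var(θ|y)] + Var(θ̂) ≈ 0.04 as well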


Credible sets

Definition 1.6
Let Y = (Y1, ..., Yn)|θ be a conditionally iid sample, and θ a random variable on Θ with prior distribution P(θ) and posterior distribution P(θ|y) := P(θ|Y = y). The set Ĉ ⊂ Θ is a (1 − α)-credible set if it verifies the following:

P(θ ∈ Ĉ|y) ≥ 1 − α.

A (1 − α)-credible set for a specific posterior distribution is not unique.


Credible sets
Two special types of credible sets:
• Highest posterior density set ĈHPD: for all θ ∈ ĈHPD and θ′ ∉ ĈHPD, we have P(θ|y) ≥ P(θ′|y)

• Equal-tail interval C̃ := [a(y), b(y)] (sketched numerically below):

  a(y) := sup{a : P(θ ≤ a | y) < α/2}
  b(y) := inf{b : P(θ ≥ b | y) < α/2}
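
A sketch of the equal-tail interval in the Beta-Binomial illustration used earlier, with α = 0.05:

from scipy import stats

alpha = 0.05
posterior = stats.beta(9, 5)            # Beta(9, 5) posterior from the earlier sketch
lower = posterior.ppf(alpha / 2)        # a(y): the α/2 posterior quantile
upper = posterior.ppf(1 - alpha / 2)    # b(y): the 1 − α/2 posterior quantile
print(lower, upper)                     # endpoints of a 95% equal-tail credible interval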


Bayesian hypothesis testing


Bayesian tools for hypothesis testing H0 : θ = θ0 :

• Using credible sets (reject H0 if θ0 ∉ Ĉ)

• Putting a prior belief on H0 and H1:

  P(H0|y) = P(y|H0)P(H0) / ( P(y|H0)P(H0) + P(H1) ∫_{H1} P(θ)L(θ|Y = y)dθ )

• Using a Bayes factor for model selection


Bayes factor

Definition 1.7
Let Y = (Y1, ..., Yn)|θi be a conditionally iid sample, and θ1, θ2 two random variables on Θ1, Θ2 respectively with prior distributions P1(θ1) and P2(θ2). The Bayes factor B12 is the marginal likelihood ratio

B12 = ∫ L(θ1|Y = y)P1(θ1)dθ1 / ∫ L(θ2|Y = y)P2(θ2)dθ2.
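
A minimal sketch of a Bayes factor for binomial data under two different Beta priors (the priors and data are arbitrary illustration values; for a Beta(a, b) prior the marginal likelihood of y successes in n trials has the closed form C(n, y) B(a + y, b + n − y)/B(a, b)):

from math import comb
from scipy.special import beta as beta_fn

n, y = 10, 7                                  # hypothetical data

def marginal_likelihood(a, b):
    # ∫ L(θ | Y = y) P(θ) dθ for a Binomial(n, θ) likelihood and a Beta(a, b) prior
    return comb(n, y) * beta_fn(a + y, b + n - y) / beta_fn(a, b)

print(marginal_likelihood(2, 2) / marginal_likelihood(1, 1))   # B12 with Model 1: Beta(2, 2), Model 2: Beta(1, 1)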


Bayes factor - Interpretation


Jeffreys' classification for favoring Model 1 (i.e. Θ1) according to the Bayes factor is as follows:

• BF12 ∈ [1, 3.2): ’not worth mentioning’

• BF12 ∈ [3.2, 10): ’substantial’ evidence for Model 1

• BF12 ∈ [10, 32): ’strong’ evidence for Model 1

• BF12 ∈ [32, 100): ’very strong’ evidence for Model 1

• BF12 > 100: ’decisive’ evidence for Model 1



If B12 < 1, then the evidence favors Model 2, and we use B21 = 1/B12 to
make interpretations.

1 Introduction to Bayesian inference
1.5 Choice of prior

What is a good prior?


A prior distribution is supposed to represent our prior belief on a
parameter...


... but choosing a prior without thinking about the posterior might lead to a computationally intractable posterior.


Conjugate priors

Definition 1.8
Let P be a family of probability distributions. Let Y = (Y1, ..., Yn)|θ be a conditionally iid sample, and θ a random variable on Θ with prior distribution P(θ) and posterior distribution P(θ|y) := P(θ|Y = y). We say that the prior and posterior distributions are conjugate distributions of P for the likelihood L(θ|Y = y) if P(θ), P(θ|y) ∈ P. Moreover, we call the prior P(θ) a conjugate prior.


Exponential family

Definition 1.9
A family of probability distributions defined by its likelihood L(θ|Y = y), depending on a parameter θ, is called a k-dimensional exponential family if there exist functions c, h, Qj and Vj such that

L(θ|Y = y) = c(θ)h(y) exp{ Σ_{j=1}^k Qj(θ)Vj(y) }.


The statistic V (y ) := {V1 (y ), ..., Vk (y )} is sufficient for θ
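
A standard illustration (not on the slide): for a single Poisson(θ) observation, L(θ|Y = y) = e^(−θ) θ^y / y! = c(θ)h(y) exp{Q(θ)V(y)} with c(θ) = e^(−θ), h(y) = 1/y!, Q(θ) = log θ and V(y) = y, so the Poisson model is a 1-dimensional exponential family with sufficient statistic V(y) = y.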


Exponential family - Result

Proposition 1.10
Let L(θ|Y = y) be the likelihood of a k-dimensional exponential family. All distributions of the family

Pα,β = { P(θ) ∝ c(θ)^β exp{ Σ_{j=1}^k Qj(θ)αj } }

are conjugate priors for L(θ|Y = y ).
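
Continuing the Poisson illustration above: with c(θ) = e^(−θ) and Q(θ) = log θ, the family Pα,β consists of densities proportional to e^(−βθ) θ^α, i.e. Gamma(α + 1, β) distributions, so Gamma priors are conjugate for the Poisson likelihood.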


Jeffreys non-informative prior


Definition 1.11
Let Y = (Y1, ..., Yn)|θ be a conditionally iid sample with likelihood L(θ|Y = y). The non-informative Jeffreys prior is the prior verifying

PJ(θ) ∝ √|I(θ)|,

where I(θ) is the Fisher information in one observation of θ, i.e.

I(θ) = Varθ( ∂ log L(θ|Y1 = y1) / ∂θ ) = −Eθ( ∂² log L(θ|Y1 = y1) / ∂θ² ).

Careful:
The Fisher information uses the log-likelihood of one observation
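
A standard illustration (added here, not on the slide): for a single Bernoulli(θ) observation, log L(θ|Y1 = y1) = y1 log θ + (1 − y1) log(1 − θ), so I(θ) = 1/(θ(1 − θ)) and PJ(θ) ∝ θ^(−1/2)(1 − θ)^(−1/2), i.e. the Jeffreys prior is the Beta(1/2, 1/2) distribution.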


Jeffreys non-informative prior

• If θ and φ are possible parametrizations of a statistical model, their Jeffreys priors are invariant under reparametrization:

  PJ(φ) = PJ(θ) |dθ/dφ|

• Most Jeffreys priors are improper (i.e. they cannot be probability distributions because ∫ PJ(θ)dθ is infinite); an example follows below
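
For example (a standard case, added for illustration): if Y1|θ ~ N(θ, σ²) with σ² known, then I(θ) = 1/σ² does not depend on θ, so PJ(θ) ∝ 1 on ℝ, which cannot be normalized; the corresponding posterior is nevertheless a proper N(ȳ, σ²/n) distribution.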
