Lecture 4: Inference, Asymptotics & Monte Carlo
August 11, 2018
Outline
1. Posterior Inference
• Loss functions, predictive inference
2. Posterior Asymptotics
3. Monte Carlo Methods
   • Importance sampling
Summarising Posterior Information
The posterior π(θ | x) is a complete summary of the inference about θ.
In some sense π(θ | x) is the inference.
Loss Functions
How do we define “best” or optimal?
Loss Function
For a prediction d ∈ D, a loss function ℓ(θ, d) quantifies the penalty incurred by choosing d when the true value of the parameter is θ.
Loss functions
Loss functions for estimating θ
Quadratic Loss
For quadratic loss ℓ(θ, d) = (θ − d)², the expected posterior loss is

  E_π[ℓ(θ, d)] = ∫ ℓ(θ, d) π(θ | x) dθ
               = ∫ (θ − d)² π(θ | x) dθ
               = ∫ (θ − E(θ | x) + E(θ | x) − d)² π(θ | x) dθ
               = ∫ (θ − E(θ | x))² π(θ | x) dθ + ∫ (E(θ | x) − d)² π(θ | x) dθ
                 + 2 ∫ (θ − E(θ | x))(E(θ | x) − d) π(θ | x) dθ .

The cross term is zero, since ∫ (θ − E(θ | x)) π(θ | x) dθ = 0, and the first term does not depend on d, so the expected loss is minimised by d = E(θ | x): under quadratic loss the optimal estimate is the posterior mean.
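A quick numerical check (my own sketch, not from the slides; the Beta(3, 5) "posterior" is an arbitrary stand-in): estimate the expected quadratic loss from posterior draws over a grid of candidate estimates d and confirm that the minimiser is essentially the posterior mean.

# Sketch: the minimiser of expected quadratic loss is the posterior mean
set.seed(1)
theta=rbeta(1e4,3,5)                    # draws from an assumed Beta(3,5) posterior (mean 3/8)
d.grid=seq(0,1,by=0.001)                # candidate estimates d
risk=sapply(d.grid,function(d) mean((theta-d)^2))
d.grid[which.min(risk)]                 # close to the posterior mean
mean(theta)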
Linear Loss
Linear loss:
  ℓ(θ, d) = α(d − θ)   if d > θ
            β(θ − d)   if d < θ

for given α, β > 0.
Linear Loss
Here X denotes a random variable with the posterior distribution, and d* is the β/(α+β) quantile of that distribution. For d < d*,

  E ℓ(X, d) − E ℓ(X, d*) = (α + β) { E[X; d < X < d*] − d P[d < X < d*] }
                         = (α + β) P[d < X < d*] ( E[X | d < X < d*] − d )
                         ≥ 0 ,

since E[X | d < X < d*] ≥ d. The case d* < d is dealt with similarly, so we obtain the following: expected linear loss is minimised at the β/(α+β) posterior quantile.
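A small numerical check (my own sketch; the Beta(3, 5) posterior and the values α = 1, β = 3 are arbitrary choices): minimising the Monte Carlo estimate of expected linear loss over a grid recovers the β/(α + β) = 0.75 posterior quantile.

# Sketch: the minimiser of expected linear loss is the beta/(alpha+beta) quantile
set.seed(1)
alpha=1; beta=3
theta=rbeta(1e4,3,5)                    # assumed posterior draws
lin.loss=function(d) mean(ifelse(d>theta, alpha*(d-theta), beta*(theta-d)))
d.grid=seq(0,1,by=0.001)
d.grid[which.min(sapply(d.grid,lin.loss))]   # numerical minimiser
quantile(theta, beta/(alpha+beta))           # the 0.75 posterior quantile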
Absolute Error Loss
Absolute error loss ℓ(θ, d) = |θ − d| is linear loss with α = β, so the expected loss is minimised at the 1/2 posterior quantile, i.e. the posterior median.
0-1 Loss
0-1 loss:
  ℓ(θ, d) = 0   if |d − θ| ≤ ε
            1   if |d − θ| > ε

Here ε > 0 is a small tolerance. The expected loss is 1 − P(|θ − d| ≤ ε | x), so it is minimised by placing d where the posterior has most mass; as ε → 0 the optimal d tends to the posterior mode.
Loss functions for making other decisions
• Loss = −Profit

  ℓ(b, d) = c(b − d)         for b ≥ d   (cost of baking surplus loaves)
            (s − c)(d − b)   for b < d   (lost profit from not baking enough loaves)

Here b is the number of loaves baked, d is the demand, c is the unit cost of baking and s is the selling price per loaf. This is linear loss with α = c and β = s − c, so the optimal b is the (s − c)/s quantile of the (predictive) distribution of demand; a minimal numerical sketch follows.
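A minimal sketch of the baker's problem (my own code; the Poisson(50) predictive demand and the prices c = 1, s = 3 are made-up numbers): the numerical minimiser of expected loss agrees with the (s − c)/s quantile of demand.

# Sketch: optimal number of loaves = (s - c)/s quantile of predictive demand
set.seed(1)
c.cost=1; s.price=3                     # hypothetical unit cost and selling price
demand=rpois(1e4, lambda=50)            # hypothetical predictive demand draws
exp.loss=function(b) mean(ifelse(b>=demand, c.cost*(b-demand), (s.price-c.cost)*(demand-b)))
b.grid=0:100
b.grid[which.min(sapply(b.grid, exp.loss))]    # numerical minimiser
quantile(demand, (s.price-c.cost)/s.price)     # the (s - c)/s = 2/3 quantile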
Predictive Inference
Standard conjugate families can be used to give tractable forms for the predictive distribution in some cases.
Predictive Inference
Example: Binomial model. Suppose X ∼ Bin(n, θ) with a Beta(a, b) prior, so that

  θ | X = x ∼ Beta(a + x, b + n − x).

For a future observation Y ∼ Bin(N, θ), the predictive pmf for y = 0, 1, . . . , N is

  f(y | x) = ∫₀¹ (N choose y) θ^y (1 − θ)^(N−y) × θ^(a+x−1) (1 − θ)^(b+n−x−1) / B(a + x, b + n − x) dθ
           = (N choose y) B(y + a + x, N − y + b + n − x) / B(a + x, b + n − x) ,

the Beta-Binomial distribution.
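The Beta-Binomial formula can be evaluated directly in R; the sketch below (my own, with made-up values a = b = 1, n = 10, x = 3, N = 5) computes the predictive pmf and checks that it sums to one.

# Beta-Binomial posterior predictive pmf for y = 0, ..., N
pred.binom=function(y, x, n, N, a, b) {
  choose(N, y) * beta(y + a + x, N - y + b + n - x) / beta(a + x, b + n - x)
}
a=1; b=1; n=10; x=3; N=5                # hypothetical prior parameters and data
p=pred.binom(0:N, x, n, N, a, b)
round(p, 4)
sum(p)                                  # should equal 1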
Example: Poisson model. Suppose X1, . . . , Xn ∼ Poisson(θ) with a Gamma(a, b) prior, so that θ | x ∼ Gamma(a + n x̄_n, b + n). For a future observation Y ∼ Poisson(θ),

  f(y | θ) = exp(−θ) θ^y / y! .

Hence the predictive distribution is

  f(y | x) = ∫₀^∞ [exp(−θ) θ^y / y!] × [(b + n)^(a + n x̄_n) / Γ(a + n x̄_n)] θ^(a + n x̄_n − 1) exp(−(b + n)θ) dθ
           = ((y + a + n x̄_n − 1) choose y) (1/(b + n + 1))^y (1 − 1/(b + n + 1))^(a + n x̄_n)      (Tute 1),

a Negative Binomial distribution.
Monte Carlo posterior predictive distributions
How do we generate samples from f(y | x)?

  f(y | x) = ∫_Θ f(y | θ, x) π(θ | x) dθ

By inspection of this formula:
• Obtain posterior samples θ^(i) ∼ π(θ | x)
• For each θ^(i), generate Y^(i) ∼ f(y | θ^(i))
• This gives joint samples (θ^(i), Y^(i)) from f(y | θ) π(θ | x)
• To obtain samples from f(y | x), "integrate out" θ (i.e. discard the θ^(i) values), leaving Y^(i) ∼ f(y | x)

Hugely simpler than calculating the exact algebraic expression! A minimal sketch for the Poisson–Gamma example above follows.
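A minimal sketch for the Poisson–Gamma example (my own code; the prior a = 2, b = 1 and the data are made up): draw θ^(i) from the Gamma posterior, then Y^(i) from the Poisson likelihood, discard the θ^(i), and compare with the exact Negative Binomial predictive.

# Monte Carlo posterior predictive for the Poisson-Gamma model
set.seed(1)
a=2; b=1                                # hypothetical Gamma(a, b) prior
x=c(3, 5, 2, 4, 6); n=length(x)         # hypothetical observed counts
theta=rgamma(1e5, shape=a+sum(x), rate=b+n)    # posterior draws theta^(i)
y=rpois(1e5, theta)                     # predictive draws Y^(i) (theta discarded)
sapply(0:5, function(k) mean(y==k))     # Monte Carlo estimates of f(y | x)
dnbinom(0:5, size=a+sum(x), prob=(b+n)/(b+n+1))   # exact Negative Binomial predictive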
Posterior Asymptotics
1. Consistency:
If the “true” value of θ is θ0, and if π(θ0) ≠ 0 (or the prior is non-zero in a neighbourhood of θ0), then with increasing amounts of data (n → ∞) the posterior probability that θ equals (or lies in a neighbourhood of) θ0 tends to 1.
Posterior Asymptotics
1. Consistency:
Let X1, . . . , Xn ∼ f(x | θ0) be iid observations, and suppose the prior is such that π(θ0) ≠ 0. Then the posterior density satisfies

  π(θ | x1, . . . , xn) / π(θ0 | x1, . . . , xn) = [π(θ) / π(θ0)] exp{−n Dn(θ)},

where Dn(θ) = (1/n) Σ_{i=1}^n ln[ f(Xi | θ0) / f(Xi | θ) ].
Posterior Asymptotics
For fixed θ, Dn(θ) is the average of n iid random variables, so it converges in probability to its expectation (law of large numbers):

  E[Dn(θ)] = ∫ f(x | θ0) ln[ f(x | θ0) / f(x | θ) ] dx := D(θ) .

D(θ) is the Kullback–Leibler divergence from f(· | θ) to f(· | θ0); it is non-negative and is minimised (with value 0) at θ = θ0. In particular,

  dDn/dθ (θ0) →_P dD/dθ (θ0) = 0 .
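As an illustration of this convergence (my own sketch, not from the slides): for the N(θ, 1) model, D(θ) = (θ − θ0)²/2 in closed form, and simulated values of Dn(θ) approach it as n grows.

# Sketch: D_n(theta) -> D(theta) = (theta - theta0)^2/2 for the N(theta, 1) model
set.seed(1)
theta0=0; theta=1
Dn=function(n) {
  x=rnorm(n, mean=theta0, sd=1)
  mean(dnorm(x, theta0, 1, log=TRUE) - dnorm(x, theta, 1, log=TRUE))
}
sapply(c(10, 100, 1000, 10000), Dn)     # approaches D(theta) = 0.5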
Posterior Asymptotics
Further, again by the law of large numbers,

  d²Dn/dθ² (θ0) →_P −∫ f(x | θ0) [ d² ln f(x | θ) / dθ² ]_{θ=θ0} dx := I(θ0) ,

the Fisher information at θ0. Expanding Dn(θ) about θ0 then shows that, for large n, the posterior is approximately a N(θ̂n, In(θ̂n)^{−1}) distribution, where θ̂n is the maximum likelihood estimate and In(θ) = n I(θ).
Posterior Asymptotics
Example: Normal model
Let X1, . . . , Xn ∼ N(θ, σ²) where σ² is known. As usual, this gives the log-likelihood

  ln f(x | θ) = −Σ_{i=1}^n (xi − θ)² / (2σ²) + c1

from which we obtain

  d ln f(x | θ) / dθ = (1/σ²) Σ_{i=1}^n (xi − θ)

and so

  d² ln f(x | θ) / dθ² = −n/σ² .

The mle is θ̂ = X̄n and In(θ) = n I(θ) = n/σ². So asymptotically, as n → ∞,

  θ | X1, . . . , Xn ∼ N(X̄n, σ²/n).

This is true for any prior distribution which places non-zero probability around the true value of θ.
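A quick numerical illustration of the last remark (my own sketch; the N(−5, 1) prior is deliberately poor and the data are simulated): with n = 500 observations, the exact conjugate posterior mean and variance are already very close to X̄n and σ²/n.

# Sketch: exact conjugate posterior vs. the asymptotic N(xbar, sigma^2/n)
set.seed(1)
sigma=2; theta.true=1; n=500
x=rnorm(n, theta.true, sigma)
m0=-5; s0=1                             # hypothetical (poor) N(m0, s0^2) prior
post.var=1/(1/s0^2 + n/sigma^2)         # exact posterior variance
post.mean=post.var*(m0/s0^2 + sum(x)/sigma^2)   # exact posterior mean
c(post.mean, mean(x))                   # posterior mean vs. xbar
c(post.var, sigma^2/n)                  # posterior variance vs. sigma^2/n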
Likelihood Asymptotics
Consider again the likelihood model X ∼ Bin(n, θ):

  f(x | θ) = (n choose x) θ^x (1 − θ)^(n−x) ,   x = 0, . . . , n

thus

  log f(x | θ) = x log θ + (n − x) log(1 − θ) + constant

so

  d ln f(x | θ) / dθ = x/θ − (n − x)/(1 − θ)

and

  d² ln f(x | θ) / dθ² = −x/θ² − (n − x)/(1 − θ)² .

Consequently, using E X = nθ,

  In(θ) = nθ/θ² + n(1 − θ)/(1 − θ)² = n / (θ(1 − θ)) .

Thus, as n → ∞, we have

  θ | X →_d N( θ, θ(1 − θ)/n ) .
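A visual check (my own sketch, assuming a uniform Beta(1, 1) prior and made-up data x = 30, n = 100, with the mle θ̂ = x/n plugged into the asymptotic variance): the exact Beta posterior and the normal approximation nearly coincide.

# Sketch: exact Beta posterior vs. asymptotic normal approximation (Binomial model)
n=100; x=30                             # hypothetical data
a=1; b=1                                # uniform Beta(1, 1) prior
theta.hat=x/n                           # mle
tt=seq(0.1, 0.5, length=200)
plot(tt, dbeta(tt, a+x, b+n-x), type="l", xlab="theta", ylab="Density")   # exact posterior
lines(tt, dnorm(tt, theta.hat, sqrt(theta.hat*(1-theta.hat)/n)), col=2)   # normal approximation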
Importance Sampling
Notes:
• The samples (X^(1), W^(1)), . . . , (X^(N), W^(N)) are weighted samples from f(x).
• The weight is ∝ f(X)/g(X), not f(X)/(K g(X)) as in rejection sampling; the constant K is absorbed into the proportionality.
Importance Sampling
How does inference work for weighted samples? (Assume ∫ f(x) dx = 1.)

Unweighted expectation, with X^(1), . . . , X^(N) ∼ g:

  E_g[h(X)] = ∫ h(x) g(x) dx ≈ (1/N) Σ_{i=1}^N h(X^(i)) .

Weighted expectation, with w(x) = f(x)/g(x):

  E_g[w(X) h(X)] = ∫ h(x) [f(x)/g(x)] g(x) dx = ∫ h(x) f(x) dx = E_f[h(X)]
                 ≈ (1/N) Σ_{i=1}^N w(X^(i)) h(X^(i)) .
Importance Sampling
What if f(x) is unnormalised? We then have f(x) = f̃(x)/Z, where Z = ∫ f̃(x) dx is unknown. Writing w̃(x) = f̃(x)/g(x), note that Z = ∫ w̃(x) g(x) dx = E_g[w̃(X)]. Then

  E_f[h(X)] = ∫ h(x) f(x) dx = (1/Z) ∫ h(x) f̃(x) dx
            = (1/Z) ∫ w̃(x) h(x) g(x) dx = E_g[w̃(X) h(X)] / E_g[w̃(X)]
            ≈ [ (1/N) Σ_{i=1}^N W̃(X^(i)) h(X^(i)) ] / [ (1/N) Σ_{i=1}^N W̃(X^(i)) ]
            = Σ_{i=1}^N W(X^(i)) h(X^(i)) ,

where W(X^(i)) = W̃(X^(i)) / Σ_{j=1}^N W̃(X^(j)) and X^(1), . . . , X^(N) ∼ g(x). This is the self-normalised importance sampling estimator.
Example:
Simulate from the density

  f(x) = 20 x (1 − x)³ ,   0 ≤ x ≤ 1
         0 ,                otherwise

(this is the Beta(2, 4) density, which has mean 1/3).
Importance Sampling
# Rejection sampling: accept x ~ U(0,1) with probability f(x)/K
L=10000
K=135/64                               # K = max of f(x) = 20x(1-x)^3 on [0,1]
x=runif(L)
ind=(runif(L)<(20*x*(1-x)^3/K))        # acceptance indicator
hist(x[ind],probability=T,xlab="x",ylab="Density",main="")
xx=seq(0,1,length=100)
lines(xx,20*xx*(1-xx)^3,lwd=2,col=2)   # true density (red)
d=density(x[ind],from=0,to=1)
lines(d,col=4)                         # rejection-sampling density estimate (blue)

# Importance sampling: y ~ U(0,1) with self-normalised weights
y=runif(L)
w=20*y*(1-y)^3                         # weights f(y)/g(y) with g = U(0,1)
wTilde=y*(1-y)^3                       # unnormalised weights (constant 20 dropped)
W=wTilde/sum(wTilde)                   # self-normalised weights
d=density(y,weights=W,from=0,to=1)     # weighted density estimate (weights must sum to 1)
lines(d,col=3)                         # importance-sampling density estimate (green)

[Figure: histogram of the accepted rejection samples, with the true density (red), the rejection-sampling density estimate (blue) and the importance-sampling density estimate (green) overlaid.]

Estimating E[X] = 1/3 from the output:

> mean(x[ind])    # rejection sampling
[1] 0.3353401
> mean(y)         # unweighted proposal mean (estimates 1/2, not 1/3)
[1] 0.5042156
> mean(w*y)       # importance sampling with normalised weights
[1] 0.3334709
> sum(W*y)        # self-normalised importance sampling
[1] 0.3363843
Importance Sampling
[Figure: the self-normalised importance weights W(X^(i)), ranging from 0 to about 0.0002, plotted for the L = 10000 samples.]
Importance Sampling
Note:
• Variability in the weights means that some samples contribute much more to the estimate than others; ideally we would like the samples to contribute as equally as possible.

[Figure: the importance weights plotted against the proposal values y ∈ [0, 1].]
Importance Sampling
Example: Obtain N samples from Beta(2, 2).
[Figure: four panels over x ∈ [0, 1] (density scale 0 to 1.5) for the Beta(2, 2) importance sampling example.]

  ∫ φ(x) dx = ∫ [φ(x)/g(x)] g(x) dx ≈ (1/N) Σ_i φ(X^(i)) / g(X^(i)) .
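A minimal sketch of this example (my own code; the U(0, 1) proposal and the choice φ(x) = 6x(1 − x) are assumptions for illustration): self-normalised weights proportional to the Beta(2, 2) density give a weighted estimate of E[X] = 1/2, resampling the weighted draws gives approximate Beta(2, 2) samples, and the final display estimates ∫ φ(x) dx = 1.

# Sketch: importance sampling for the Beta(2, 2) target with a U(0, 1) proposal
set.seed(1)
N=1e5
x=runif(N)                              # proposals from g = U(0, 1)
wTilde=dbeta(x, 2, 2)                   # unnormalised weights f(x)/g(x), since g(x) = 1
W=wTilde/sum(wTilde)                    # self-normalised weights
sum(W*x)                                # weighted estimate of E[X] = 0.5
xs=sample(x, N, replace=TRUE, prob=W)   # approximate Beta(2, 2) samples by resampling
phi=function(x) 6*x*(1-x)               # phi whose integral over [0, 1] equals 1
mean(phi(x)/dunif(x))                   # (1/N) sum phi(X_i)/g(X_i), approximately 1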