Lecture 5 - 8 Bayesian Estimation
Bayesian Estimation
Shaobo Jin
Department of Mathematics
θ ∼ π (θ) , X | θ ∼ f (x | θ) .
Given the data x, we can make inferences about the current data-generating process.
Definition
The maximum a posteriori (MAP) estimator is the mode of the posterior distribution,
θ̂MAP = argmax_θ π (θ | x).
MAP: Example
Example
Let X1 , ..., Xn be iid N (0, σ²). The parameter of interest is θ = σ⁻².
We assume that the prior of θ is Gamma (a, b). Find the MAP
estimator.
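As a quick numerical sketch (not part of the slides): the log posterior here is proportional to (a + n/2 − 1) log θ − (b + Σ_{i} xi²/2) θ, so the closed-form MAP can be checked against a direct maximization. The data and hyperparameters below are made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data and hyperparameters (not from the slides).
rng = np.random.default_rng(0)
sigma2_true, n, a, b = 2.0, 50, 2.0, 1.0
x = rng.normal(0.0, np.sqrt(sigma2_true), size=n)

# Log posterior of theta = sigma^{-2}, up to an additive constant:
# (a + n/2 - 1) * log(theta) - (b + sum(x^2)/2) * theta.
def neg_log_post(theta):
    return -((a + n / 2 - 1) * np.log(theta) - (b + 0.5 * np.sum(x**2)) * theta)

numerical_map = minimize_scalar(neg_log_post, bounds=(1e-6, 100.0), method="bounded").x
closed_form_map = (a + n / 2 - 1) / (b + 0.5 * np.sum(x**2))
print(numerical_map, closed_form_map)  # the two values should agree
```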
Nuisance Parameter
Consider the Neyman-Scott problem, where
Xij | θ ∼ N (μi , σ²) , i = 1, ..., n, j = 1, 2.
We are interested in σ², and the μi 's are nuisance parameters.
The MLE of σ² is
σ̂² = Σ_{i=1}^n (xi1 − xi2)² / (4n) →P σ²/2 ≠ σ².
Consider the reference prior π (θ) ∝ σ⁻¹. The MAP of π (σ² | x) is
σ̂² = Σ_{i=1}^n (xi1 − xi2)² / (2n + 4) →P σ².
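A small simulation (with made-up values) illustrating the point of this slide: as n grows, the MLE stabilizes around σ²/2, while the MAP under the reference prior approaches σ².

```python
import numpy as np

# Simulated Neyman-Scott data (hypothetical values).
rng = np.random.default_rng(1)
sigma2 = 1.5
for n in (100, 10_000, 1_000_000):
    mu = rng.normal(0.0, 3.0, size=n)                 # nuisance means, one per pair
    x = rng.normal(mu[:, None], np.sqrt(sigma2), size=(n, 2))
    s = np.sum((x[:, 0] - x[:, 1]) ** 2)
    print(n, s / (4 * n), s / (2 * n + 4))            # MLE -> sigma2/2, MAP -> sigma2
```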
A Cautious Note
Example
μ | σ², x ∼ N (μn , σ²/(λ0 + n)) , σ² | x ∼ InvGamma (an , bn ) ,
where µn , an , and bn are known constants. Find the joint and marginal
MAPs.
Example
π (θ) = p N (μ, σ²) + (1 − p) N (−μ, σ²) .
Regularized Estimator
The MAP estimator can be viewed as a regularized (penalized) estimator, where the negative log-prior plays the role of a penalty function g (·).
Posterior Mean
θ̂Mean = E [θ | x] .
Example
LW (θ, d) = (θ − d)T W (θ − d) ,
Suppose that E [LW (θ, d) | x] = ∫ LW (θ, d) π (θ | x) dθ < ∞.
Then, the posterior mean minimizes E [LW (θ, d) | x], where W does not
depend on θ.
Example
μ | σ², x ∼ N (μn , σ²/(λ0 + n)) , σ² | x ∼ InvGamma (an , bn ) .
Given draws θ1 , ..., θm from the posterior, the posterior mean can be approximated by the Monte Carlo average (1/m) Σ_{j=1}^m θj .
Example
f (x | θ, β) = (β x^{β−1} / θ^β) exp{−(x/θ)^β} , x > 0, θ > 0, β > 0.
π (θ | β) = (β b0^{a0} / Γ(a0)) θ^{−(a0 β + 1)} exp(−b0 /θ^β) , ("InvGamma" prior)
π (β) = (d0^{c0} / Γ(c0)) β^{c0 −1} exp(−d0 β) . (Gamma prior)
With probability 1, the posterior mean of θk does not exist for any
k > 0.
Example
Consider an iid sample (X1 , ..., Xn ) from Poisson (θ). The prior of θ is
Gamma (a0 , b0 ) with density
π (θ) = (b0^{a0} / Γ(a0)) θ^{a0 −1} exp (−b0 θ) .
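A minimal sketch of this conjugate update (data and hyperparameters are hypothetical): the posterior is Gamma (a0 + Σ xi , b0 + n), so the posterior mean is (a0 + Σ xi)/(b0 + n).

```python
import numpy as np
from scipy import stats

# Made-up data and hyperparameters.
rng = np.random.default_rng(2)
a0, b0, theta_true = 2.0, 1.0, 3.0
x = rng.poisson(theta_true, size=40)

# Conjugacy: theta | x ~ Gamma(a0 + sum(x), b0 + n).
a_n, b_n = a0 + x.sum(), b0 + x.size
print("posterior mean:", a_n / b_n)

draws = stats.gamma(a_n, scale=1.0 / b_n).rvs(100_000, random_state=3)
print("Monte Carlo check:", draws.mean())
```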
Yi = xTi β + ϵi , i = 1, ..., n,
1. E [ϵ | X] = 0,
2. Var (ϵ | X) = Σ, where Σ > 0.
A typical assumption is Σ = σ 2 In , where In is an n×n identity matrix.
β̂OLS = (XᵀX)⁻¹ Xᵀy.
Y | X, β, Σ ∼ N (Xβ, Σ) .
β̂ML = (XᵀX)⁻¹ Xᵀy.
Result
The conjugate prior for β is Np (μ0 , Λ0⁻¹). The posterior is β | y ∼ N (μn , Λn⁻¹), where
Λn = Λ0 + XᵀΣ⁻¹X,
μn = Λn⁻¹ (Λ0 μ0 + XᵀΣ⁻¹y) .
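A short sketch of this result in code, with made-up data and the typical choice Σ = σ²In; the prior hyperparameters Λ0 and μ0 below are arbitrary.

```python
import numpy as np

# Sketch of the conjugate update with Sigma = sigma2 * I_n (made-up data).
rng = np.random.default_rng(4)
n, p, sigma2 = 200, 3, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, np.sqrt(sigma2), size=n)

Sigma_inv = np.eye(n) / sigma2
Lambda0, mu0 = np.eye(p), np.zeros(p)      # arbitrary prior precision and mean

Lambda_n = Lambda0 + X.T @ Sigma_inv @ X
mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ Sigma_inv @ y)
print(mu_n)      # posterior mean of beta
```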
Ridge Regression
Suppose that Y | β ∼ Nn (Xβ, σ²In ), where σ² is known. Let μ0 = 0 and Λ0 = (λ/σ²) Ip , that is,
β ∼ Np (0, (σ²/λ) Ip ) .
The posterior is
β | y ∼ Np ( (XᵀX + λIp )⁻¹ Xᵀy, σ² (XᵀX + λIp )⁻¹ ) .
The posterior mean and the MAP give the ridge regression estimator, which minimizes
(y − Xβ)ᵀ(y − Xβ) + λ βᵀβ.
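A quick check on hypothetical data that the posterior mean under this prior coincides with the ridge estimator (XᵀX + λIp)⁻¹Xᵀy.

```python
import numpy as np

# Made-up data; sigma2 and lam are arbitrary.
rng = np.random.default_rng(5)
n, p, sigma2, lam = 100, 4, 1.0, 3.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.normal(0.0, np.sqrt(sigma2), size=n)

ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

Lambda_n = lam / sigma2 * np.eye(p) + X.T @ X / sigma2      # posterior precision
posterior_mean = np.linalg.solve(Lambda_n, X.T @ y / sigma2)
print(np.allclose(ridge, posterior_mean))                    # True
```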
Laplace Prior
Consider the independent Laplace prior
βi ∼ iid Laplace (0, σ²/λ) .
The posterior satisfies
π (β | y) ∝ exp{ −(1/σ²) [ (1/2) (y − Xβ)ᵀ(y − Xβ) + λ Σ_{j=1}^p |βj | ] }.
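Since σ² only rescales the log posterior, the MAP under this prior minimizes (1/2)(y − Xβ)ᵀ(y − Xβ) + λ Σ_{j} |βj|, a lasso-type objective. A rough numerical sketch with made-up data:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize 0.5*||y - X b||^2 + lam * sum(|b_j|) directly (illustrative values).
rng = np.random.default_rng(6)
n, p, lam = 150, 5, 25.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(size=n)

def neg_log_post(b):
    resid = y - X @ b
    return 0.5 * resid @ resid + lam * np.abs(b).sum()

beta_map = minimize(neg_log_post, x0=np.zeros(p), method="Powell").x
print(np.round(beta_map, 3))   # coefficients with little support are shrunk toward 0
```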
The likelihood is
f (y | β, σ²) = exp{ −(1/2) (y − Xβ)ᵀ(σ²In )⁻¹ (y − Xβ) } / [ (2π)^{n/2} √det(σ²In ) ]
∝ (σ²)^{−n/2} exp{ −(βᵀXᵀXβ − 2yᵀXβ) / (2σ²) }.
Consider the prior
β | σ² ∼ Np (μ0 , σ²Λ0⁻¹) ,
σ² ∼ InvGamma (a0 , b0 ) ,
a normal-inverse-gamma distribution.
Normal-Inverse-Gamma Distribution
Definition
A random vector (X, λ), with X ∈ Rᵖ and λ > 0, follows a normal-inverse-gamma distribution NIG (a, b, μ, Σ) if the joint density is
f (x, λ) = c · exp{ −(x − μ)ᵀΣ⁻¹(x − μ)/(2λ) − b/λ } · λ^{−(a + p/2 + 1)} ,
where
c = bᵃ / [ (2π)^{p/2} Γ(a) √det(Σ) ] .
Marginal Distribution
A random vector X ∈ Rp follows a multivariate t-distribution tv (µ, Σ),
if its density is
f (x) = [ Γ((v + p)/2) / ( Γ(v/2) v^{p/2} π^{p/2} √det(Σ) ) ] [ 1 + (1/v)(x − μ)ᵀΣ⁻¹(x − μ) ]^{−(v+p)/2} .
Result
For the NIG distribution (X, λ) ∼ NIG (a, b, µ, Σ), the marginal
distributions are
λ ∼ InvGamma (a, b) ,
X ∼ t2a (μ, (b/a) Σ) .
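A Monte Carlo sketch of this result (hyperparameters made up): drawing λ first and then X | λ should produce draws whose covariance matches that of t2a (μ, (b/a)Σ), namely b/(a − 1) · Σ.

```python
import numpy as np
from scipy import stats

# Made-up NIG-style hyperparameters.
rng = np.random.default_rng(7)
a, b, p, m = 4.0, 2.0, 2, 200_000
mu, Sigma = np.zeros(p), np.array([[1.0, 0.3], [0.3, 2.0]])

lam = stats.invgamma(a, scale=b).rvs(m, random_state=8)       # lambda ~ InvGamma(a, b)
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
X = mu + np.sqrt(lam)[:, None] * Z                            # X | lambda ~ N(mu, lambda*Sigma)

print(np.cov(X, rowvar=False))     # should be close to ...
print(b / (a - 1) * Sigma)         # ... the covariance of t_{2a}(mu, (b/a) Sigma)
```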
Posterior Distribution
Result
β | y, σ² ∼ N (μn , σ²Λn⁻¹) ,
σ² | y ∼ InvGamma (an , bn ) ,
where
Λn = XᵀX + Λ0 ,
μn = Λn⁻¹ (Λ0 μ0 + Xᵀy) ,
an = n/2 + a0 ,
bn = b0 + (1/2) (yᵀy + μ0ᵀΛ0 μ0 − μnᵀΛn μn ) .
That is, (β, σ²) | y ∼ NIG (an , bn , μn , Λn⁻¹).
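A minimal sketch of these updates in code, with made-up data and arbitrary hyperparameters; the last line prints the posterior mean of σ² implied by InvGamma (an , bn ).

```python
import numpy as np

# Made-up data and hyperparameters.
rng = np.random.default_rng(9)
n, p = 120, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(0.0, 1.5, size=n)

mu0, Lambda0, a0, b0 = np.zeros(p), np.eye(p), 2.0, 2.0

Lambda_n = X.T @ X + Lambda0
mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ y)
a_n = n / 2 + a0
b_n = b0 + 0.5 * (y @ y + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)

print(mu_n)                # posterior mean of beta (center of the t marginal)
print(b_n / (a_n - 1))     # posterior mean of sigma^2 under InvGamma(a_n, b_n)
```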
β | y, σ² ∼ N (μn , σ²Λn⁻¹) ,
σ² | y ∼ InvGamma (an , bn ) ,
that is, (β, σ²) | y ∼ NIG (an , bn , μn , Λn⁻¹). Then,
β | y ∼ t2an (μn , (bn /an ) Λn⁻¹) ,
σ² | y ∼ InvGamma (an , bn ) .
Predictive Distribution
Result
Suppose that Y | β ∼ Nn (Xβ, σ²In ), where σ² is unknown. Let μ0 = 0 and Λ0 = λIp , that is,
β | σ² ∼ Np (μ0 , (σ²/λ) Ip ) , σ² ∼ InvGamma (a0 , b0 ) .
Then
β | y, σ² ∼ N (μn , σ²Λn⁻¹) , β | y ∼ t2an (μn , (bn /an ) Λn⁻¹) ,
where
μn = (XᵀX + λIp )⁻¹ Xᵀy.
Similarly, an independent Laplace prior can be used instead:
βj | σ² ∼ iid Laplace (0, σ²/λ) .
Tuning Parameter
y | β, σ² ∼ N (Xβ, σ²In ) ,
β | σ², λ ∼ Np (0, (σ²/λ) Ip ) ,
σ² ∼ InvGamma (a0 , b0 ) ,
λ ∼ InvGamma (c0 , d0 ) .
That is, the prior is π (β, σ², λ) = π (β | σ², λ) π (σ²) π (λ).
Consider the linear model Y = Xβ + e, where e | X ∼ Nn (0, σ²In ). The ML estimator is given by
β̂ML = (XᵀX)⁻¹ Xᵀy,
with sampling distribution
β̂ML | σ² ∼ Np (β, σ² (XᵀX)⁻¹) .
Zellner's g-prior is given by β | σ² ∼ Np (μ0 , gσ² (XᵀX)⁻¹), where g > 0.
Posterior Distribution
Result
Under the g-prior β | σ² ∼ Np (μ0 , gσ² (XᵀX)⁻¹) and σ² ∼ InvGamma (a0 , b0 ), the posterior distribution is
β | y, σ² ∼ N (μn , (g/(g + 1)) σ² (XᵀX)⁻¹) ,
σ² | y ∼ InvGamma (n/2 + a0 , bn ) ,
where
μn = (1/(g + 1)) μ0 + (g/(g + 1)) (XᵀX)⁻¹ Xᵀy,
bn = b0 + (1/2) [ yᵀy − (g/(g + 1)) yᵀX (XᵀX)⁻¹ Xᵀy ] + (1/2) [ (1/(g + 1)) μ0ᵀXᵀXμ0 − (2/(g + 1)) yᵀXμ0 ] .
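A sketch of the g-prior updates with made-up data; note that μn is a convex combination of μ0 and the ML/OLS estimator.

```python
import numpy as np

# Made-up data; g, mu0, a0, b0 are arbitrary.
rng = np.random.default_rng(10)
n, p, g = 80, 3, 50.0
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, 1.0, -1.5]) + rng.normal(size=n)
mu0, a0, b0 = np.zeros(p), 2.0, 2.0

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y

mu_n = mu0 / (g + 1) + g / (g + 1) * beta_ols        # shrinks the ML estimator toward mu0
b_n = (b0
       + 0.5 * (y @ y - g / (g + 1) * y @ X @ XtX_inv @ X.T @ y)
       + 0.5 * (mu0 @ X.T @ X @ mu0 / (g + 1) - 2 / (g + 1) * y @ X @ mu0))
print(mu_n, n / 2 + a0, b_n)
```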
Consider f (x) = a1 x1 + a2 x2 , where x = (x1 , x2 )ᵀ is a column vector. The gradient is
∂f (x)/∂x = (∂f /∂x1 , ∂f /∂x2 )ᵀ = (a1 , a2 )ᵀ,
and the Hessian is
∂²f (x)/∂x∂xᵀ = [ ∂²f /∂x1²  ∂²f /∂x1 ∂x2 ; ∂²f /∂x2 ∂x1  ∂²f /∂x2² ] = 0_{2×2} .
Now consider f (x) = a11 x1² + a12 x1 x2 + a21 x2 x1 + a22 x2² = xᵀAx. The gradient is
∂f (x)/∂x = (2a11 x1 + a12 x2 + a21 x2 , a12 x1 + a21 x1 + 2a22 x2 )ᵀ,
and the Hessian is
∂²f (x)/∂x∂xᵀ = [ 2a11  a12 + a21 ; a12 + a21  2a22 ] .
In general, for f (x) = aᵀx,
∂f (x)/∂x = a,
∂²f (x)/∂x∂xᵀ = 0_{p×p} ,
and for f (x) = xᵀAx,
∂f (x)/∂x = (A + Aᵀ) x,
∂²f (x)/∂x∂xᵀ = A + Aᵀ.
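A quick finite-difference check of these identities on random inputs (purely illustrative):

```python
import numpy as np
from scipy.optimize import approx_fprime

# Random a, A, x for the check.
rng = np.random.default_rng(11)
p = 3
A, a, x = rng.normal(size=(p, p)), rng.normal(size=p), rng.normal(size=p)

grad_linear = approx_fprime(x, lambda z: a @ z, 1e-6)
grad_quad = approx_fprime(x, lambda z: z @ A @ z, 1e-6)

print(np.allclose(grad_linear, a, atol=1e-4))            # d(a'x)/dx = a
print(np.allclose(grad_quad, (A + A.T) @ x, atol=1e-4))  # d(x'Ax)/dx = (A + A')x
```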
Jacobian Matrix
Suppose that the output of f (x) is an m×1 vector, where the input x is a p×1 vector. The Jacobian matrix of f is defined to be the m×p matrix
∂f (x)/∂xᵀ = ( ∂f1 (x)/∂xᵀ ; ∂f2 (x)/∂xᵀ ; ... ; ∂fm (x)/∂xᵀ ) ,
whose (i, j) element is ∂fi (x)/∂xj .
" #
∂f1 (x)
∂f (x) ∂xT
a1 a2
= ∂f2 (x) = .
∂xT b1 −b2
∂xT
In general, we have
∂Ax
= A.
∂xT
Jeffreys Prior
Result
Consider the linear model Y | β, σ² ∼ Nn (Xβ, σ²In ). The Fisher information of the above model is
I (β, σ²) = [ (1/σ²) XᵀX  0 ; 0  n/(2σ⁴) ] .
The Jeffreys prior is
π (β, σ²) ∝ 1/(σ²)^{p/2+1} ,
whereas the independent Jeffreys prior is
π (β, σ²) ∝ 1/σ².
π (β, σ²) ∝ 1/σ².
Consider the change of variables β = β and τ = log σ². Then,
π (β, τ ) ∝ (1/σ²) |det( ∂ (β, σ²) / ∂ (β, τ )ᵀ )| = 1.
Hence, the independent Jeffreys prior means that the prior is uniform on (β, log σ²).
Theorem
Consider the linear regression model Y | β, σ² ∼ Nn (Xβ, σ²In ). Let the prior be
π (β, σ²) ∝ (σ²)^{−m} .
The posterior is
β | σ², y ∼ Np (μn , σ² (XᵀX)⁻¹) ,
σ² | y ∼ InvGamma ( (n − p)/2 + m − 1, (1/2) yᵀ(In − H) y ) ,
where μn = (XᵀX)⁻¹ Xᵀy and H = X (XᵀX)⁻¹ Xᵀ is the hat matrix.
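A sketch of simulating from this posterior for m = 1 (the independent Jeffreys prior), with made-up data: draw σ² from the inverse gamma, then β given σ².

```python
import numpy as np
from scipy import stats

# Made-up data.
rng = np.random.default_rng(12)
n, p, m = 100, 3, 1
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, 2.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
mu_n = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T

shape = (n - p) / 2 + m - 1
scale = 0.5 * y @ (np.eye(n) - H) @ y
sigma2_draws = stats.invgamma(shape, scale=scale).rvs(5_000, random_state=13)
beta_draws = np.array([rng.multivariate_normal(mu_n, s2 * XtX_inv) for s2 in sigma2_draws])
print(beta_draws.mean(axis=0), sigma2_draws.mean())
```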
β − μn | σ², y ∼ Np (0, σ² (XᵀX)⁻¹) .
Compare this with the sampling distribution of the ML estimator,
β̂ − β | β, σ² ∼ Np (0, σ² (XᵀX)⁻¹) .
We can compare what our model predicts with the observed data,
or compare statistics applied to the simulated data with the same
statistics applied to the observed data.
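A minimal posterior predictive check in code, reusing the earlier Poisson-Gamma example with made-up data and hyperparameters: replicate data sets from the posterior predictive and compare a summary statistic with its observed value.

```python
import numpy as np
from scipy import stats

# Made-up observed data and prior.
rng = np.random.default_rng(14)
a0, b0 = 1.0, 1.0
x_obs = rng.poisson(4.0, size=100)

# Posterior Gamma(a_n, b_n), then replicated data sets from the posterior predictive.
a_n, b_n = a0 + x_obs.sum(), b0 + x_obs.size
theta_draws = stats.gamma(a_n, scale=1.0 / b_n).rvs(8, random_state=15)
x_rep = np.array([rng.poisson(t, size=x_obs.size) for t in theta_draws])

# Compare, e.g., the sample variance of each replication with the observed one.
print("observed variance:   ", x_obs.var())
print("replicated variances:", x_rep.var(axis=1).round(2))
```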
[Figure: posterior predictive check for Model 1. Histograms (count versus response) of the observed data and eight replicated data sets, Rep1-Rep8.]
[Figure: posterior predictive check for Model 2. Histograms (count versus response) of the observed data and eight replicated data sets, Rep1-Rep8.]
Gaussian Process
Definition
A Gaussian process is written as
f (x) ∼ GP (m (x) , k (x, x′)) ,
where x ∈ Rᵖ, m (x) is the mean function, and k (x, x′) = cov (f (x) , f (x′)) is the covariance function.
For any finite collection of inputs x1 , ..., xn , the vector (f (x1 ) , ..., f (xn )) is multivariate normal. For a large enough n, the multivariate normal vector seems to produce a smooth function in x.
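A small sketch of drawing (f (x1 ), ..., f (xn )) from a zero-mean Gaussian process; the squared-exponential kernel and its length-scale below are assumed choices, not specified on the slides.

```python
import numpy as np

# Draw one realization of (f(x_1), ..., f(x_n)) from a zero-mean GP.
def sq_exp_kernel(x1, x2, length=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length**2)

rng = np.random.default_rng(16)
for n in (5, 100):
    x = np.linspace(-4, 4, n)
    K = sq_exp_kernel(x, x) + 1e-8 * np.eye(n)    # jitter for numerical stability
    f = rng.multivariate_normal(np.zeros(n), K)
    print(n, np.round(f[:3], 2))                  # larger n gives a smoother-looking path
```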
[Figure: realizations of (f (x1 ) , ..., f (xn )) plotted against x for n = 5 and n = 100.]
Consider the model y = f (x) + ϵ, where f (x) ∼ GP (0, k (x, x′)) and ϵ | σ² ∼ N (0, σ²). Then
Y | σ² ∼ N (0, K (X, X) + σ²In ) ,
where K (X, X) is the n×n matrix with (i, j) element k (xi , xj ).
Recall that, for a partitioned multivariate normal vector,
Y1 | Y2 = y2 ∼ N ( μ1 + Σ12 Σ22⁻¹ (y2 − μ2 ) , Σ11 − Σ12 Σ22⁻¹ Σ21 ) .
Predicted Value
( f (X∗ ) ; Y ) | σ² ∼ N ( 0, [ K (X∗ , X∗ )  K (X∗ , X) ; K (X, X∗ )  K (X, X) + σ²In ] ) .
Hence, f (X∗ ) | Y = y, σ² is normal with
mean K (X∗ , X) (K (X, X) + σ²In )⁻¹ y,
covariance K (X∗ , X∗ ) − K (X∗ , X) (K (X, X) + σ²In )⁻¹ K (X, X∗ ) .
The fitted function is then K (X∗ , X) (K (X, X) + σ²In )⁻¹ y.
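A sketch of these predictive formulas on made-up one-dimensional data, again with an assumed squared-exponential kernel.

```python
import numpy as np

# GP regression predictions at new inputs X* (made-up data, assumed kernel).
def k(x1, x2, length=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length**2)

rng = np.random.default_rng(17)
sigma2 = 0.1
x = np.sort(rng.uniform(-4, 4, size=30))
y = np.sin(x) + rng.normal(0.0, np.sqrt(sigma2), size=x.size)
x_star = np.linspace(-4, 4, 5)                      # new inputs X*

A_inv = np.linalg.inv(k(x, x) + sigma2 * np.eye(x.size))
pred_mean = k(x_star, x) @ A_inv @ y                # K(X*,X)(K(X,X)+sigma2 I)^{-1} y
pred_cov = k(x_star, x_star) - k(x_star, x) @ A_inv @ k(x, x_star)
print(np.round(pred_mean, 2))
print(np.round(np.diag(pred_cov), 3))
```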
[Figure: Gaussian process regression fits, y plotted against x.]
y = f (x) + ϵ, f (x) = xT β,
where ϵ | σ² ∼ N (0, σ²).
Under the conjugate prior β ∼ Np (0, Λ0⁻¹), the posterior is β | y ∼ N (μn , Λn⁻¹), where
μn = (Λ0 + XᵀX)⁻¹ Xᵀy.
Transform x
Consider instead the model y = γᵀϕ (x) + ϵ, where ϵ | σ² ∼ N (0, σ²) and ϕ (x) is a d×1 vector of transformed features. Under the conjugate prior γ ∼ Nd (0, Ω0⁻¹), the predictive distribution at a new point x0 is available in closed form (its mean and variance are given below).
Kernel Function
Example
Let Φ = ϕ (X) be the n×d matrix with rows ϕᵀ(x1 ) , ..., ϕᵀ(xn ). We can show that
Ω0⁻¹ Φᵀ (σ²In + ΦΩ0⁻¹Φᵀ)⁻¹ = (σ²Ω0 + ΦᵀΦ)⁻¹ Φᵀ.
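A numerical check of this identity on random matrices (sizes and values made up):

```python
import numpy as np

# Random Phi and a symmetric positive definite Omega0.
rng = np.random.default_rng(18)
n, d, sigma2 = 6, 3, 0.7
Phi = rng.normal(size=(n, d))
A = rng.normal(size=(d, d))
Omega0 = A @ A.T + d * np.eye(d)
Omega0_inv = np.linalg.inv(Omega0)

lhs = Omega0_inv @ Phi.T @ np.linalg.inv(sigma2 * np.eye(n) + Phi @ Omega0_inv @ Phi.T)
rhs = np.linalg.inv(sigma2 * Omega0 + Phi.T @ Phi) @ Phi.T
print(np.allclose(lhs, rhs))           # True
```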
Example
Show that κ (x, z) = ϕᵀ(x) Ω0⁻¹ ϕ (z) is a kernel function for a symmetric Ω0 .
The predicted value at a new point x0 can be written as
ϕᵀ(x0 ) μn = K (x0 , X) (σ²In + K (X, X))⁻¹ y,
where
K (x0 , X) is the 1×n vector with elements ϕᵀ(x0 ) Ω0⁻¹ ϕ (xi ) = ψᵀ(x0 ) ψ (xi ) ,
K (X, X) is the n×n matrix with elements ϕᵀ(xi ) Ω0⁻¹ ϕ (xj ) = ψᵀ(xi ) ψ (xj ) .
Kernel Trick
Suppose the kernel admits the expansion
κ (x, z) = Σ_{m=1}^∞ ρm em (x) em (z) ,
which can be rewritten as
κ (x, z) = Σ_{m=1}^∞ [√ρm em (x)] [√ρm em (z)] ,
where √ρm em (x) and √ρm em (z) play the role of new features ψm (x) and ψm (z).
The predictive distribution based on the transformed linear model has
mean K (x0 , X) (σ²In + K (X, X))⁻¹ y,
variance σ² + ϕᵀ(x0 ) (Ω0 + σ⁻²ΦᵀΦ)⁻¹ ϕ (x0 ) ,
where the second term of the variance equals
K (x0 , x0 ) − K (x0 , X) (σ²In + K (X, X))⁻¹ K (X, x0 ) .
Compare with the Gaussian process prediction of f (x0 ), which has
mean K (x0 , X) (σ²In + K (X, X))⁻¹ y,
covariance K (x0 , x0 ) − K (x0 , X) (K (X, X) + σ²In )⁻¹ K (X, x0 ) .
Prior on Function
Consider the linear regression model y = γᵀϕ (x) + ϵ, where ϵ | σ² ∼ N (0, σ²), and write f (x) = γᵀϕ (x). The conjugate prior γ ∼ Nd (0, Ω0⁻¹) implies that
(f (x1 ) , ..., f (xn ))ᵀ = ϕ (X) γ ∼ Nn (0, ϕ (X) Ω0⁻¹ ϕᵀ(X)) .
Posterior on Function
γ | y, σ² ∼ N (μn , Ωn⁻¹) ,
where
μn = (σ²Ω0 + ϕᵀ(X) ϕ (X))⁻¹ ϕᵀ(X) y.