
Lecture 5 - 8 Bayesian Estimation

The document discusses Bayesian estimation, focusing on the Maximum a Posteriori (MAP) estimator, which is the mode of the posterior distribution. It contrasts MAP with the frequentist approach, highlights the role of nuisance parameters, and provides examples including the Neyman-Scott problem. Additionally, it addresses the differences between MAP and Maximum Likelihood Estimation (MLE), as well as the posterior mean, and their implications in various statistical models.


Bayesian Statistics

Bayesian Estimation

Shaobo Jin

Department of Mathematics

Shaobo Jin (Math) Bayesian Statistics 1 / 65


Bayesian Estimation MAP

Maximum a Posteriori Estimator


In a Bayes model, the parameter θ is a random variable with known
distribution π.
Finding the true parameter makes no sense in a Bayes model.

Data that we observe are generated in a hierarchical manner:

θ ∼ π (θ) , X | θ ∼ f (x | θ) .

Given the data x, we can make inference for the current data
generating process.

Definition

The maximum a posteriori (MAP) estimator is the mode of the posterior π(θ | x),

θ̂_MAP(x) = arg max_θ π(θ | x).

Shaobo Jin (Math) Bayesian Statistics 2 / 65


Bayesian Estimation MAP

MAP: Example

Note that π(θ | x) ∝ f(x | θ) π(θ) and m(x) does not depend on θ.

The MAP estimator only requires the kernel of f(x | θ) π(θ).
We can skip the integration step needed to obtain m(x).

Example

Let X1, ..., Xn be iid N(0, σ²). The parameter of interest is θ = σ^{-2}.
We assume that the prior of θ is Gamma(a, b). Find the MAP estimator.
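A sketch of the calculation (using only the kernel): the likelihood is f(x | θ) ∝ θ^{n/2} exp(−θ Σᵢ xᵢ²/2) and the prior kernel is θ^{a−1} exp(−bθ), so

π(θ | x) ∝ θ^{a + n/2 − 1} exp{ −θ (b + Σᵢ xᵢ²/2) },

which is the kernel of Gamma(a + n/2, b + Σᵢ xᵢ²/2). Provided a + n/2 > 1, setting the derivative of the log-posterior to zero gives

θ̂_MAP = (a + n/2 − 1) / (b + Σᵢ xᵢ²/2).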

Shaobo Jin (Math) Bayesian Statistics 3 / 65


Bayesian Estimation MAP

MAP and 0-1 Loss

For an estimator d, consider the loss function

L(θ, d) = 1(∥θ − d∥ > ϵ),

where ∥θ − d∥ = √((θ − d)^T (θ − d)). Consider the expected value of L(θ, d) under the posterior distribution:

E[L(θ, d) | x] = ∫ 1(∥θ − d∥ > ϵ) π(θ | x) dθ = 1 − P(∥θ − d∥ ≤ ϵ | x).

To minimize the expected loss, we want P(∥θ − d∥ ≤ ϵ | x) to be as large as possible, that is, we want the distribution of θ | x to be concentrated around d. As ϵ → 0, the maximizing d approaches the posterior mode (under regularity conditions), which is why the MAP estimator is associated with the 0-1 loss.

Shaobo Jin (Math) Bayesian Statistics 4 / 65


Bayesian Estimation MAP

Nuisance Parameter

Suppose that data are generated from f (x | θ, τ ), where θ is the


parameter of interest and τ is the nuisance parameter.

The frequentist approach will find a sufficient statistic T(x) for τ and make inference for θ using the conditional distribution of X | T(X).
Alternatively, inference is based on the profile likelihood

L(θ) = max_τ f(x | θ, τ) = f(x | θ, τ̂(θ)),

where τ̂(θ) maximizes f(x | θ, τ) for fixed θ.


An advantage of the Bayes approach is that we can simply integrate
out the nuisance parameter τ and make inference from the marginal
posterior π (θ | x), instead of the joint posterior π (θ, τ | x).

Shaobo Jin (Math) Bayesian Statistics 5 / 65


Bayesian Estimation MAP

Neyman-Scott Problem: Example

Consider the Neyman-Scott problem, where Xij | θ ∼ N(µi, σ²), i = 1, ..., n and j = 1, 2. We are interested in σ², and the µi's are nuisance parameters.

The MLE of σ² is

σ̂²_ML = Σᵢ (xi1 − xi2)² / (4n) →_P σ²/2 ≠ σ².

Consider the reference prior π(θ) ∝ σ^{-1}. The MAP of π(σ² | x) is

σ̂²_MAP = Σᵢ (xi1 − xi2)² / (2n + 4) →_P σ².
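A small simulation sketch can illustrate the two limits; the estimator formulas are taken from above, while the sample size, true σ², and the nuisance means below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 100_000, 2.0                      # arbitrary illustration values
mu = rng.normal(size=n)                        # nuisance means, one per pair
x = mu[:, None] + np.sqrt(sigma2) * rng.normal(size=(n, 2))

d2 = (x[:, 0] - x[:, 1]) ** 2                  # (x_i1 - x_i2)^2
sigma2_mle = d2.sum() / (4 * n)                # MLE: tends to sigma2 / 2
sigma2_map = d2.sum() / (2 * n + 4)            # MAP under the reference prior: tends to sigma2
print(sigma2_mle, sigma2_map)                  # roughly 1.0 and 2.0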

Shaobo Jin (Math) Bayesian Statistics 6 / 65


Bayesian Estimation MAP

A Cautious Note

Suppose that the posterior is π (θ, τ | x), where θ is the parameter of


interest. The mode of π (θ, τ | x) may not equal the marginal posterior
mode, the mode of π (θ | x).

Example

Consider the normal-inverse-gamma model, where the posterior is

µ | σ², x ∼ N(µn, σ²/(λ0 + n)),   σ² | x ∼ InvGamma(an, bn),

where µn, an, and bn are known constants. Find the joint and marginal MAPs.
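A sketch of the comparison: the marginal posterior σ² | x ∼ InvGamma(an, bn) has mode bn/(an + 1). For the joint mode, maximizing π(µ, σ² | x) over µ gives µ̂ = µn, and plugging this back in leaves a function of σ² proportional to (σ²)^{-1/2} (σ²)^{-(an + 1)} exp(−bn/σ²), which is maximized at σ² = bn/(an + 3/2). The joint MAP of σ² therefore differs from the marginal MAP.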

Shaobo Jin (Math) Bayesian Statistics 7 / 65


Bayesian Estimation MAP

One More Issue: Existence

Example

Suppose that we observe iid Xi | θ ∼ N(θ, 1). The prior of θ is a mixture normal

π(θ) = p N(µ, σ²) + (1 − p) N(−µ, σ²).

Find the mode of π (θ | x).
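The posterior is proportional to the normal likelihood times this two-component mixture, and its mode has no simple closed form; it can also be bimodal. A minimal numerical sketch (with arbitrary choices of p, µ, σ, and the data) could look like the following; since the posterior may be bimodal, the reported maximizer can be a local mode.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

p, mu, sigma = 0.5, 3.0, 1.0                   # arbitrary prior settings
x = np.array([0.2, -0.5, 0.1])                 # arbitrary data

def neg_log_post(theta):
    # unnormalized log posterior: N(theta, 1) log-likelihood plus log mixture prior
    loglik = -0.5 * np.sum((x - theta) ** 2)
    prior = p * norm.pdf(theta, mu, sigma) + (1 - p) * norm.pdf(theta, -mu, sigma)
    return -(loglik + np.log(prior))

theta_map = minimize_scalar(neg_log_post, bounds=(-10, 10), method="bounded").x
print(theta_map)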

Shaobo Jin (Math) Bayesian Statistics 8 / 65


Bayesian Estimation MAP

Regularized Estimator

The MAP estimator essentially maximizes f(x | θ) π(θ) or

log f(x | θ) + log π(θ),

provided that the logarithms are well defined. Intuitively speaking, we maximize the log-likelihood log f(x | θ), while the term log π(θ) acts as a penalty that keeps the estimate in regions where the prior density is not too small.

Suppose that data are generated from Xi | θ = θ0 ∼ f (x | θ0 ),


i = 1, ..., n.
If n^{-1} log π(θ) → 0 as n → ∞, we should expect the MAP and the MLE to share similar properties.

Shaobo Jin (Math) Bayesian Statistics 9 / 65


Bayesian Estimation MAP

One Difference Between MLE and MAP

Theorem (Invariance of MLE)

Let θ̂_ML be the MLE of θ. Then, g(θ̂_ML) is the MLE of g(θ) for any g(·).

However, MAP is not invariant with respect to reparametrization.

Example

Suppose that we observe one observation from X | θ ∼ Binomial(n, θ). Let the prior be θ ∼ Beta(a0, b0), where a0 > 1 and b0 > 1.
1. Find the MAP estimator of θ.
2. Find the MAP estimator of η = θ/(1 − θ).
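A sketch of the two answers, which makes the lack of invariance concrete: the posterior is Beta(a0 + x, b0 + n − x); write a = a0 + x and b = b0 + n − x.

1. The MAP of θ is the Beta mode, θ̂_MAP = (a − 1)/(a + b − 2).
2. A change of variables gives the posterior density of η = θ/(1 − θ) as proportional to η^{a − 1} (1 + η)^{−(a + b)}, whose mode is η̂_MAP = (a − 1)/(b + 1).

In contrast, transforming the answer in 1 gives θ̂_MAP/(1 − θ̂_MAP) = (a − 1)/(b − 1) ≠ (a − 1)/(b + 1), so the MAP of η is not the transformation of the MAP of θ.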

Shaobo Jin (Math) Bayesian Statistics 10 / 65


Bayesian Estimation Posterior Mean

Posterior Mean

An alternative to MAP is the posterior mean

θ̂Mean = E [θ | x] .

Example

Suppose that we observe one observation from X | θ ∼ Binomial (n, θ).


Let the prior be θ ∼ Beta (a0 , b0 ), where a0 > 1 and b0 > 1.
The posterior is Beta (a0 + x, b0 + n − x).
Hence, θ̂_Mean = (a0 + x)/(a0 + b0 + n).

Shaobo Jin (Math) Bayesian Statistics 11 / 65


Bayesian Estimation Posterior Mean

Posterior Mean and L2 Loss

Consider the weighted L2 loss

LW (θ, d) = (θ − d)T W (θ − d) ,

where W is a p × p positive definite matrix and d is an estimator of θ based on x.
Theorem

Suppose that there exists an estimator d such that

E[L_W(θ, d) | x] = ∫ L_W(θ, d) π(θ | x) dθ < ∞.

Then, the posterior mean minimizes E[L_W(θ, d) | x], provided that W does not depend on θ.
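A sketch of why this holds: write θ̄ = E[θ | x] and expand the loss around it,

E[L_W(θ, d) | x] = E[(θ − θ̄)^T W (θ − θ̄) | x] + (d − θ̄)^T W (d − θ̄),

since the cross term has posterior expectation zero. The first term does not involve d, and the second term is nonnegative and equals zero exactly when d = θ̄, so the posterior mean minimizes the posterior expected loss.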

Shaobo Jin (Math) Bayesian Statistics 12 / 65


Bayesian Estimation Posterior Mean

Posterior Mean versus MAP


Suppose that the posterior is π (θ, τ | x), where θ is the parameter of
interest.

The mode of π (θ, τ | x) may not equal the marginal posterior


mode, the mode of π (θ | x).
But the marginal posterior mean is the same as the joint posterior
mean.

Example

Consider the normal-inverse-gamma model, where the posterior is

µ | σ², x ∼ N(µn, σ²/(λ0 + n)),   σ² | x ∼ InvGamma(an, bn),

where µn, an, and bn are known constants. The posterior mean of σ² is the mean of InvGamma(an, bn).

Shaobo Jin (Math) Bayesian Statistics 13 / 65


Bayesian Estimation Posterior Mean

Posterior Mean versus MAP


To obtain a closed-form expression for E[θ | x], we need the normalizing constant of π(θ | x).
The MAP estimator only requires the kernel f(x | θ) π(θ). We can skip the integration step needed to obtain m(x).
Even if we know m(x), the integral giving E[θ | x] may not be tractable.

But if we can sample from π(θ | x), we need neither compute m(x) nor evaluate the integral for E[θ | x].

If we have a sample θ1, ..., θm from π(θ | x), then we can approximate the posterior mean by

(1/m) Σ_{j=1}^m θj.
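A minimal sketch of this approximation in the Beta-Binomial setting used earlier (the prior and data values below are arbitrary): sampling from the posterior and averaging recovers the closed-form posterior mean without ever computing m(x).

import numpy as np

rng = np.random.default_rng(1)
a0, b0, n, x = 2.0, 2.0, 20, 7                 # arbitrary prior and data
a_n, b_n = a0 + x, b0 + n - x                  # posterior is Beta(a_n, b_n)

draws = rng.beta(a_n, b_n, size=100_000)       # sample theta_1, ..., theta_m from the posterior
print(draws.mean())                            # Monte Carlo approximation of E[theta | x]
print(a_n / (a_n + b_n))                       # exact posterior mean (a0 + x)/(a0 + b0 + n)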

Shaobo Jin (Math) Bayesian Statistics 14 / 65


Bayesian Estimation Posterior Mean

Posterior Mean versus MAP


It can even happen that the posterior mean does not exist, even though
the posterior is proper.

Example

Let X1, ..., Xn be iid from a two-parameter Weibull distribution with density

f(x | θ, β) = (β x^{β−1} / θ^β) exp{ −(x/θ)^β },   x > 0, θ > 0, β > 0.

Consider the proper priors

π(θ | β) = (β b0^{a0} / Γ(a0)) θ^{−(a0 β + 1)} exp(−b0 / θ^β),   ("InvGamma" prior)
π(β) = (d0^{c0} / Γ(c0)) β^{c0 − 1} exp(−d0 β).   (Gamma prior)

With probability 1, the posterior mean of θ^k does not exist for any k > 0.
Shaobo Jin (Math) Bayesian Statistics 15 / 65
Bayesian Estimation Posterior Mean

Posterior Mean versus MAP

It can happen that the likelihood involves intractable integrals. The MAP is then not easy to obtain, but we may still be able to sample easily from the posterior.

Example

Suppose that Yij | Zi, β, λ ∼ Bernoulli(pij), i = 1, ..., n, j = 1, ..., k, where

pij = exp(βj + λ zi) / (1 + exp(βj + λ zi)).

But we only observe {Yij}; the latent variables Zi are unobserved.

Shaobo Jin (Math) Bayesian Statistics 16 / 65


Bayesian Estimation Posterior Mean

Invariance of Posterior Mean

The posterior mean is not invariant with respect to reparametrization


either.

Example

Suppose that we observe one observation from X | θ ∼ Binomial (n, θ).


Let the prior be θ ∼ Beta (a0 , b0 ), where a0 > 1 and b0 > 1.
1. Find the posterior mean of θ.
2. Find the posterior mean of η = θ/(1 − θ).

Shaobo Jin (Math) Bayesian Statistics 17 / 65


Bayesian Estimation Prediction

Predict New Value


In frequentist statistics, the prediction of a new observation z after observing x is

ẑ(x) = ∫ z f(z | x, θ̂) dz.

In Bayesian statistics, the predictive distribution of a new observation z after observing x is

f(z | x) = ∫ f(z | x, θ) π(θ | x) dθ.

A predictor can be the predictive mean

ẑ(x) = ∫ z f(z | x) dz,

or the predictive mode arg max_z f(z | x).
Shaobo Jin (Math) Bayesian Statistics 18 / 65
Bayesian Estimation Prediction

Derive Predictive Distribution

Example

Consider an iid sample (X1, ..., Xn) from Poisson(θ). The prior of θ is Gamma(a0, b0) with density

π(θ) = (b0^{a0} / Γ(a0)) θ^{a0 − 1} exp(−b0 θ).

1. Find the posterior π(θ | x).
2. Let z be a future value. Find the predictive distribution f(z | x).
3. Propose a predictor of z.
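A sketch of the answers: the posterior is Gamma(an, bn) with an = a0 + Σᵢ xᵢ and bn = b0 + n. Integrating the Poisson density against this Gamma posterior gives a negative binomial predictive distribution,

f(z | x) = [ Γ(an + z) / (Γ(an) z!) ] ( bn/(bn + 1) )^{an} ( 1/(bn + 1) )^z,   z = 0, 1, 2, ...,

and a natural predictor is the predictive mean E[z | x] = an/bn = (a0 + Σᵢ xᵢ)/(b0 + n).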

Shaobo Jin (Math) Bayesian Statistics 19 / 65


Bayesian Estimation Normal Linear Model, Known σ2

Multiple Linear Regression


A multiple linear regression is

Yi = xTi β + ϵi , i = 1, ..., n,

where Yi is the response, xi is the vector of covariates (or regressors, or features), and β is the vector of unknown regression parameters.

In matrix notation, the model is

Yn×1 = Xn×p βp×1 + ϵn×1 .

Some examples are:

1. Y is apartment price; the covariates include crime rate, number of rooms, size of the apartment, year of construction, etc.

2. Y is wastewater flow rate; the covariates include temperature, precipitation, date of the year, time, etc.

Shaobo Jin (Math) Bayesian Statistics 20 / 65


Bayesian Estimation Normal Linear Model, Known σ2

Normal Linear Model

Yn×1 = Xn×p βp×1 + ϵn×1

The usual assumptions are

1 E [ϵ | X] = 0,
2 Var (ϵ | X) = Σ, where Σ > 0.
A typical assumption is Σ = σ 2 In , where In is an n×n identity matrix.

The ordinary least squares (OLS) estimator of β minimizes (y − Xβ)^T (y − Xβ), and the minimizer is

β̂_OLS = (X^T X)^{-1} X^T y.

Shaobo Jin (Math) Bayesian Statistics 21 / 65


Bayesian Estimation Normal Linear Model, Known σ2

Normal Linear Model


In the normal linear model, we further assume that ϵ is normal:
ϵ | X ∼ Nn (0, Σ). Hence,

Y | X, β, Σ ∼ N (Xβ, Σ) .

The likelihood function is

f(y | X, β, Σ) = (2π)^{-n/2} det(Σ)^{-1/2} exp{ −(1/2)(y − Xβ)^T Σ^{-1} (y − Xβ) }.

If Σ = σ² In, then the MLE of β coincides with the OLS estimator:

β̂_ML = (X^T X)^{-1} X^T y.

For notational simplicity, we will treat X as fixed and drop it from the conditioning.

Shaobo Jin (Math) Bayesian Statistics 22 / 65


Bayesian Estimation Normal Linear Model, Known σ2

Bayesian Linear Model: Known Σ


Suppose that Σ is completely known, i.e., β is the only unknown
parameter.

Result

The conjugate prior for β is Np(µ0, Λ0^{-1}). The posterior is β | y ∼ N(µn, Λn^{-1}), where

Λn = Λ0 + X^T Σ^{-1} X,
µn = Λn^{-1} (Λ0 µ0 + X^T Σ^{-1} y).

Suppose that we observe a new x0 and want to predict the new y0. If y0 | β ∼ N(x0^T β, σ²), where σ² is known, then the predictive distribution is

y0 | y ∼ N( x0^T µn, σ² + x0^T Λn^{-1} x0 ).
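A minimal numpy sketch of these posterior and predictive formulas with Σ = σ² In; the data-generating values and prior settings below are arbitrary illustrations.

import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 50, 3, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + np.sqrt(sigma2) * rng.normal(size=n)

mu0 = np.zeros(p)                              # prior beta ~ N(mu0, Lambda0^{-1})
Lambda0 = np.eye(p)

Lambda_n = Lambda0 + X.T @ X / sigma2          # Lambda_n = Lambda_0 + X' Sigma^{-1} X
mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ y / sigma2)

x0 = rng.normal(size=p)                        # predictive distribution at a new x0
pred_mean = x0 @ mu_n
pred_var = sigma2 + x0 @ np.linalg.solve(Lambda_n, x0)
print(mu_n, pred_mean, pred_var)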

Shaobo Jin (Math) Bayesian Statistics 23 / 65


Bayesian Estimation Normal Linear Model, Known σ2

Ridge Regression
Suppose that Y | β ∼ Nn(Xβ, σ² In), where σ² is known. Let µ0 = 0 and Λ0 = (λ/σ²) Ip, that is,

β ∼ Np(0, (σ²/λ) Ip).

The posterior is

β | y ∼ Np( (X^T X + λ Ip)^{-1} X^T y, σ² (X^T X + λ Ip)^{-1} ).

The posterior mean and MAP give the ridge regression estimator that minimizes the penalized least squares criterion:

β̂ = arg min_β (y − Xβ)^T (y − Xβ) + λ β^T β
  = arg max_β −(1/(2σ²)) (y − Xβ)^T (y − Xβ) − (1/(2σ²/λ)) β^T β.
Shaobo Jin (Math) Bayesian Statistics 24 / 65
Bayesian Estimation Normal Linear Model, Known σ2

Laplace Prior
Consider the independent Laplace prior

βj ∼iid Laplace(0, σ²/λ).

The posterior satisfies

π(β | y) ∝ exp{ −(1/σ²) [ (1/2)(y − Xβ)^T (y − Xβ) + λ Σ_{j=1}^p |βj| ] }.

The MAP gives the lasso regression estimator that minimizes the penalized criterion:

β̂ = arg min_β (1/2)(y − Xβ)^T (y − Xβ) + λ Σ_{j=1}^p |βj|
  = arg max_β −(1/σ²) [ (1/2)(y − Xβ)^T (y − Xβ) + λ Σ_{j=1}^p |βj| ].

Shaobo Jin (Math) Bayesian Statistics 25 / 65


Bayesian Estimation Normal Linear Model, Unknown σ2

Bayesian Linear Model: Unknown σ 2


Suppose that Σ = σ² In, but σ² is unknown. The parameter is θ = (β, σ²).

The likelihood is

f(y | β, σ²) = (2π)^{-n/2} det(σ² In)^{-1/2} exp{ −(1/2)(y − Xβ)^T (σ² In)^{-1} (y − Xβ) }
∝ (σ²)^{-n/2} exp{ −(β^T X^T X β − 2 y^T X β) / (2σ²) }.

The conjugate prior is

β | σ² ∼ Np(µ0, σ² Λ0^{-1}),
σ² ∼ InvGamma(a0, b0),

a normal-inverse-gamma distribution.

Shaobo Jin (Math) Bayesian Statistics 26 / 65


Bayesian Estimation Normal Linear Model, Unknown σ2

Normal-Inverse-Gamma Distribution
Definition

A random vector X ∈ R^p and a positive random scalar λ > 0 follow a normal-inverse-gamma (NIG) distribution if

X | λ ∼ Np(µ, λΣ),   and   λ ∼ InvGamma(a, b).

It is denoted by (X, λ) ∼ NIG(a, b, µ, Σ). The joint density is

f(x, λ) = c exp{ −(x − µ)^T Σ^{-1} (x − µ) / (2λ) − b/λ } (1/λ)^{a + p/2 + 1},

where the constant c is given by

c = b^a / ( (2π)^{p/2} Γ(a) √det(Σ) ).

Shaobo Jin (Math) Bayesian Statistics 27 / 65


Bayesian Estimation Normal Linear Model, Unknown σ2

Marginal Distribution
A random vector X ∈ R^p follows a multivariate t-distribution t_v(µ, Σ) if its density is

f(x) = [ Γ((v + p)/2) / ( Γ(v/2) v^{p/2} π^{p/2} √det(Σ) ) ] [ 1 + (1/v)(x − µ)^T Σ^{-1} (x − µ) ]^{-(v + p)/2},

where v is the degrees of freedom, µ = E[X] for v > 1, and Var(X) = v Σ/(v − 2) for v > 2.

Result

For the NIG distribution (X, λ) ∼ NIG(a, b, µ, Σ), the marginal distributions are

λ ∼ InvGamma(a, b),
X ∼ t_{2a}( µ, (b/a) Σ ).
Shaobo Jin (Math) Bayesian Statistics 28 / 65
Bayesian Estimation Normal Linear Model, Unknown σ2

Posterior Distribution
Result

Under the conjugate prior, the posterior distribution is

β | y, σ² ∼ N(µn, σ² Λn^{-1}),
σ² | y ∼ InvGamma(an, bn),

where

Λn = X^T X + Λ0,
µn = Λn^{-1} (Λ0 µ0 + X^T y),
an = n/2 + a0,
bn = b0 + (1/2)( y^T y + µ0^T Λ0 µ0 − µn^T Λn µn ).

That is, (β, σ²) | y ∼ NIG(an, bn, µn, Λn^{-1}).
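A minimal numpy sketch of these updating formulas; the prior hyperparameters and data below are arbitrary illustrations.

import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# NIG prior: beta | sigma2 ~ N(mu0, sigma2 * Lambda0^{-1}), sigma2 ~ InvGamma(a0, b0)
mu0, Lambda0, a0, b0 = np.zeros(p), np.eye(p), 2.0, 2.0

Lambda_n = X.T @ X + Lambda0
mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ y)
a_n = n / 2 + a0
b_n = b0 + 0.5 * (y @ y + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)

# posterior mean of sigma2 is the InvGamma(a_n, b_n) mean, and mu_n is the posterior mean of beta
print(b_n / (a_n - 1), mu_n)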
Shaobo Jin (Math) Bayesian Statistics 29 / 65
Bayesian Estimation Normal Linear Model, Unknown σ2

Marginal Posterior of Normal Linear Model

Under the conjugate prior, the posterior distribution is

β | y, σ² ∼ N(µn, σ² Λn^{-1}),
σ² | y ∼ InvGamma(an, bn),

that is, (β, σ²) | y ∼ NIG(an, bn, µn, Λn^{-1}). Then,

β | y ∼ t_{2an}( µn, (bn/an) Λn^{-1} ),
σ² | y ∼ InvGamma(an, bn).
Shaobo Jin (Math) Bayesian Statistics 30 / 65


Bayesian Estimation Normal Linear Model, Unknown σ2

Predictive Distribution

Result

Suppose that we observe a new x0 and want to predict the new y0. Assume that y0 ⊥ y | β, σ². Under the conjugate prior, the predictive distribution is

y0 | y ∼ t_{2an}( x0^T µn, (bn/an)(1 + x0^T Λn^{-1} x0) ),

which has the same expectation as when σ² is known.

Shaobo Jin (Math) Bayesian Statistics 31 / 65


Bayesian Estimation Normal Linear Model, Unknown σ2

Ridge Regression Again

Suppose that Y | β ∼ Nn(Xβ, σ² In), where σ² is unknown. Let µ0 = 0 and Λ0 = λ Ip, that is,

β | σ² ∼ Np(0, (σ²/λ) Ip),   σ² ∼ InvGamma(a0, b0).

The posterior satisfies

β | y, σ² ∼ N(µn, σ² Λn^{-1}),   β | y ∼ t_{2an}( µn, (bn/an) Λn^{-1} ),

where

µn = (X^T X + λ Ip)^{-1} X^T y

coincides with the ridge regression estimator.

Shaobo Jin (Math) Bayesian Statistics 32 / 65


Bayesian Estimation Normal Linear Model, Unknown σ2

Laplace Prior Again

Consider the independent Laplace prior

βj | σ² ∼iid Laplace(0, σ²/λ).

The posterior satisfies

π(β, σ² | y) ∝ [ π(σ²) / (σ²)^{p + n/2} ] exp{ −(1/σ²) [ (1/2)(y − Xβ)^T (y − Xβ) + λ Σ_{j=1}^p |βj| ] }.

The MAP gives the lasso regression estimator.

Shaobo Jin (Math) Bayesian Statistics 33 / 65


Bayesian Estimation Normal Linear Model, Unknown σ2

Tuning Parameter

The tuning parameter λ is often selected using cross validation in


ridge/lasso regression.

In a Bayesian linear model, we can instead treat λ as an unknown variable and use a prior for λ. A hierarchical setup can be

y | β, σ² ∼ N(Xβ, σ² In),
β | σ², λ ∼ Np(0, (σ²/λ) Ip),
σ² ∼ InvGamma(a0, b0),
λ ∼ InvGamma(c0, d0).

That is, the prior is π(β, σ², λ) = π(β | σ², λ) π(σ²) π(λ).

Shaobo Jin (Math) Bayesian Statistics 34 / 65


Bayesian Estimation Zellner's g-prior

Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) of the model

Y = Xβ + e,   e | X ∼ Nn(0, σ² In),

is given by

β̂_ML = (X^T X)^{-1} X^T y.

Its sampling distribution is

β̂_ML | σ² ∼ Np( β, σ² (X^T X)^{-1} ).

Zellner's g-prior is given by β | σ² ∼ Np( µ0, g σ² (X^T X)^{-1} ), where the constant g > 0.

Shaobo Jin (Math) Bayesian Statistics 35 / 65


Bayesian Estimation Zellner's g-prior

Posterior Distribution
Result

Under the g-prior β | σ² ∼ Np( µ0, g σ² (X^T X)^{-1} ) and σ² ∼ InvGamma(a0, b0), the posterior distribution is

β | y, σ² ∼ N( µn, (g/(g + 1)) σ² (X^T X)^{-1} ),
σ² | y ∼ InvGamma( n/2 + a0, bn ),

where

µn = (1/(g + 1)) µ0 + (g/(g + 1)) (X^T X)^{-1} X^T y,
bn = b0 + (1/2)[ y^T y − (g/(g + 1)) y^T X (X^T X)^{-1} X^T y ] + (1/2)[ (1/(g + 1)) µ0^T X^T X µ0 − (2/(g + 1)) y^T X µ0 ].
Shaobo Jin (Math) Bayesian Statistics 36 / 65
Bayesian Estimation Jeffreys Prior For Linear Model

Detour: Gradient and Hessian of Linear Form

Consider the function

f(x) = a1 x1 + a2 x2,

where x = (x1, x2)^T is a column vector. The gradient is

∂f(x)/∂x = (∂f/∂x1, ∂f/∂x2)^T = (a1, a2)^T.

The Hessian matrix is

∂²f(x)/∂x∂x^T = [ ∂²f/∂x1², ∂²f/∂x1∂x2 ; ∂²f/∂x2∂x1, ∂²f/∂x2² ] = 0_{2×2}.

Shaobo Jin (Math) Bayesian Statistics 37 / 65


Bayesian Estimation Jeffreys Prior For Linear Model

Detour: Gradient and Hessian of Quadratic Form


Consider

f(x) = (x1, x2) [ a11, a12 ; a21, a22 ] (x1, x2)^T = a11 x1² + a12 x1 x2 + a21 x2 x1 + a22 x2².

The gradient is

∂f(x)/∂x = (∂f/∂x1, ∂f/∂x2)^T = ( 2 a11 x1 + a12 x2 + a21 x2, a12 x1 + a21 x1 + 2 a22 x2 )^T.

The Hessian matrix is

∂²f(x)/∂x∂x^T = [ 2 a11, a12 + a21 ; a12 + a21, 2 a22 ].

Shaobo Jin (Math) Bayesian Statistics 38 / 65


Bayesian Estimation Jeffreys Prior For Linear Model

General Results for Linear and Quadratic Form

If f(x) = a^T x with a and x being p × 1 column vectors, then

∂f(x)/∂x = a,
∂²f(x)/∂x∂x^T = 0_{p×p}.

If f(x) = x^T A x with x being a p × 1 column vector, then

∂f(x)/∂x = (A + A^T) x,
∂²f(x)/∂x∂x^T = A + A^T.
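As an application that connects this detour back to the linear model, these two rules give the derivative of the least squares criterion:

∂/∂β [ (y − Xβ)^T (y − Xβ) ] = ∂/∂β [ y^T y − 2 y^T Xβ + β^T X^T X β ] = −2 X^T y + 2 X^T X β,

and setting it to zero yields the normal equations X^T X β = X^T y, hence β̂_OLS = (X^T X)^{-1} X^T y.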

Shaobo Jin (Math) Bayesian Statistics 39 / 65


Bayesian Estimation Jeffreys Prior For Linear Model

Jacobian Matrix

Suppose that the output of f(x) is an m × 1 vector, where the input x is a p × 1 vector. The Jacobian matrix of f is defined to be the m × p matrix

∂f(x)/∂x^T = [ ∂f1(x)/∂x^T ; ∂f2(x)/∂x^T ; ... ; ∂fm(x)/∂x^T ],

whose (i, j)th entry is ∂fi(x)/∂xj.

Shaobo Jin (Math) Bayesian Statistics 40 / 65


Bayesian Estimation Jeffreys Prior For Linear Model

Example: Compute Jacobian Matrix


Example

Find the Jacobian matrix of f(x) = [ a1, a2 ; b1, −b2 ] x.

Note that

∂f1(x)/∂x = ∂(a1 x1 + a2 x2)/∂x = (a1, a2)^T,   ∂f2(x)/∂x = ∂(b1 x1 − b2 x2)/∂x = (b1, −b2)^T.

Hence, the Jacobian matrix is

∂f(x)/∂x^T = [ ∂f1(x)/∂x^T ; ∂f2(x)/∂x^T ] = [ a1, a2 ; b1, −b2 ].

In general, we have

∂(Ax)/∂x^T = A.
Shaobo Jin (Math) Bayesian Statistics 41 / 65
Bayesian Estimation Jeffreys Prior For Linear Model

Jeffreys Prior

Result

Consider the linear model Y | β, σ² ∼ Nn(Xβ, σ² In). The Fisher information of the above model is

I(β, σ²) = [ X^T X/σ², 0 ; 0, n/(2σ⁴) ].

Hence, the Jeffreys prior is

π(β, σ²) ∝ 1/(σ²)^{p/2 + 1},

and the independent Jeffreys prior is

π(β, σ²) ∝ 1/σ².
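A sketch of where the exponent comes from: the Jeffreys prior is proportional to √det I(β, σ²), and since the information matrix is block diagonal,

det I(β, σ²) = det(X^T X/σ²) × n/(2σ⁴) ∝ (σ²)^{−p} (σ²)^{−2},

so √det I(β, σ²) ∝ (σ²)^{−(p/2 + 1)}. The independent Jeffreys prior instead applies the rule to β and σ² separately (holding the other fixed), which gives a flat prior for β and π(σ²) ∝ 1/σ².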
Shaobo Jin (Math) Bayesian Statistics 42 / 65
Bayesian Estimation Jeffreys Prior For Linear Model

Independent Jeffreys Prior

The independent Jeffreys prior for the linear model Y | β, σ² ∼ Nn(Xβ, σ² In) is

π(β, σ²) ∝ 1/σ².

Consider the change of variables β = β and τ = log σ². Then,

π(β, τ) ∝ (1/σ²) |det( ∂(β, σ²)/∂(β, τ) )| = (1/σ²) σ² = 1.

Hence, the independent Jeffreys prior means that the prior is uniform on (β, log σ²).

Shaobo Jin (Math) Bayesian Statistics 43 / 65


Bayesian Estimation Jeffreys Prior For Linear Model

Posterior with Jeffreys Prior

Theorem

Consider the linear regression model Y | β, σ² ∼ Nn(Xβ, σ² In). Let the prior be

π(β, σ²) ∝ (σ²)^{−m}.

The posterior is

β | σ², y ∼ Np( µn, σ² (X^T X)^{-1} ),
σ² | y ∼ InvGamma( (n − p)/2 + m − 1, (1/2) y^T (In − H) y ),

where µn = (X^T X)^{-1} X^T y and H = X (X^T X)^{-1} X^T is the hat matrix.

Shaobo Jin (Math) Bayesian Statistics 44 / 65


Bayesian Estimation Jeffreys Prior For Linear Model

MLE versus Posterior

The previous theorem shows that

β − µn | σ², y ∼ Np( 0, σ² (X^T X)^{-1} ).

If maximum likelihood is used to estimate β, then the MLE is β̂ = (X^T X)^{-1} X^T y and

β̂ − β | β, σ² ∼ Np( 0, σ² (X^T X)^{-1} ).

Shaobo Jin (Math) Bayesian Statistics 45 / 65


Bayesian Estimation Posterior Predictive Check

Posterior Predictive Checks

Posterior predictive check is a way to investigate whether our model


can capture some relevant aspects of the data.

We simulate data x_new from the posterior predictive distribution

f(x_new | x) = ∫ f(x_new | x, θ) π(θ | x) dθ.

We can compare what our model predicts with the observed data,
or compare statistics applied to the simulated data with the same
statistics applied to the observed data.
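A minimal sketch of a posterior predictive check in the Beta-Binomial setting used earlier (all numbers are arbitrary): draw θ from the posterior, simulate replicated data sets, and compare a statistic of the replicates with the observed value.

import numpy as np

rng = np.random.default_rng(4)
a0, b0, n = 2.0, 2.0, 20
x_obs = 7                                                   # observed count (arbitrary)

theta = rng.beta(a0 + x_obs, b0 + n - x_obs, size=1000)     # posterior draws
x_rep = rng.binomial(n, theta)                              # one replicated data set per draw

# posterior predictive p-value for the statistic T(x) = x
print(np.mean(x_rep >= x_obs))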

Shaobo Jin (Math) Bayesian Statistics 46 / 65


Bayesian Estimation Posterior Predictive Check

Model 1
[Figure: histograms of the observed data and eight replicated data sets (Rep1-Rep8) simulated from the posterior predictive distribution under Model 1; count versus response.]

Shaobo Jin (Math) Bayesian Statistics 47 / 65


Bayesian Estimation Posterior Predictive Check

Model 2
[Figure: histograms of the observed data and eight replicated data sets (Rep1-Rep8) simulated from the posterior predictive distribution under Model 2; count versus response.]

Shaobo Jin (Math) Bayesian Statistics 48 / 65


Bayesian Estimation Gaussian Process Regression

Gaussian Process

Definition

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Let f be a scalar-valued function. We denote a Gaussian process by

f(x) ∼ GP( m(x), k(x, x′) ),

where x ∈ R^p,

m(x) = E[f(x)]   (mean function),
k(x, x′) = cov( f(x), f(x′) )   (covariance function).

Shaobo Jin (Math) Bayesian Statistics 49 / 65


Bayesian Estimation Gaussian Process Regression

Gaussian Process As Smooth Function


 
By the definition, the joint distribution of any finite vector (f(x1), ..., f(xn))^T is multivariate normal. For a large enough n, the multivariate normal vector seems to produce a smooth function in x.

[Figure: draws of (f(x1), ..., f(xn)) plotted against x for n = 5 and n = 100; the n = 100 draw looks like a smooth function of x.]

Shaobo Jin (Math) Bayesian Statistics 50 / 65


Bayesian Estimation Gaussian Process Regression

Gaussian Process Regression


Consider the Gaussian process regression model

y = f(x) + ϵ,

where ϵ | σ² ∼ N(0, σ²) and f(x) ∼ GP(0, k(x, x′)). If we have observed n observations from this model, then

Y | σ² ∼ N( 0, K(X, X) + σ² In ),

where K(X, X) is the n × n matrix with (i, j)th entry k(xi, xj).

Shaobo Jin (Math) Bayesian Statistics 51 / 65


Bayesian Estimation Gaussian Process Regression

Recap: Conditional Distribution

Result: Conditional Distribution of the Multivariate Gaussian Distribution

Suppose that (Y1, Y2)^T ∼ Np( (µ1, µ2)^T, [ Σ11, Σ12 ; Σ21, Σ22 ] ) such that Σ22 > 0. Then,

Y1 | Y2 = y2 ∼ N( µ1 + Σ12 Σ22^{-1} (y2 − µ2), Σ11 − Σ12 Σ22^{-1} Σ21 ).

Shaobo Jin (Math) Bayesian Statistics 52 / 65


Bayesian Estimation Gaussian Process Regression

Predicted Value

Suppose that we want to predict the response value based on a new X∗. Then,

( f(X∗), Y )^T | σ² ∼ N( 0, [ K(X∗, X∗), K(X∗, X) ; K(X, X∗), K(X, X) + σ² In ] ).

Hence, f(X∗) | Y, σ² is also Gaussian with

mean   K(X∗, X) ( K(X, X) + σ² In )^{-1} y,
covariance   K(X∗, X∗) − K(X∗, X) ( K(X, X) + σ² In )^{-1} K(X, X∗).

The fitted function is then K(X∗, X) ( K(X, X) + σ² In )^{-1} y.
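A minimal numpy sketch of these prediction formulas using the squared exponential kernel k(x, z) = exp{−∥x − z∥²/2} that appears on a later slide; the training data below are synthetic and arbitrary.

import numpy as np

def k(a, b):
    # squared exponential kernel for one-dimensional inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(5)
sigma2 = 0.1
x = np.linspace(-4, 4, 30)                                  # training inputs
y = np.sin(x) + np.sqrt(sigma2) * rng.normal(size=x.size)   # noisy responses
x_star = np.linspace(-4, 4, 200)                            # prediction grid

K = k(x, x) + sigma2 * np.eye(x.size)                       # K(X, X) + sigma^2 I_n
mean = k(x_star, x) @ np.linalg.solve(K, y)                 # fitted function at x_star
cov = k(x_star, x_star) - k(x_star, x) @ np.linalg.solve(K, k(x, x_star))
print(mean[:5], np.diag(cov)[:5])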

Shaobo Jin (Math) Bayesian Statistics 53 / 65


Bayesian Estimation Gaussian Process Regression

Fitted Function Curve: k(x, z) = x^T Λ0^{-1} z

[Figure: fitted function curve, y versus x, using the linear kernel k(x, z) = x^T Λ0^{-1} z; the fitted curve is linear in x.]

Shaobo Jin (Math) Bayesian Statistics 54 / 65


Bayesian Estimation Gaussian Process Regression
Fitted Function Curve: k(x, z) = exp{ −∥x − z∥²/2 }

[Figure: fitted function curve, y versus x, using the squared exponential kernel; the fitted curve is a smooth nonlinear function of x.]

Shaobo Jin (Math) Bayesian Statistics 55 / 65


Bayesian Estimation Gaussian Process Regression

Bayesian Linear Model

Consider the linear regression model

y = f(x) + ϵ,   f(x) = x^T β,

where ϵ | σ² ∼ N(0, σ²).

Under the conjugate prior β ∼ Np(0, Λ0^{-1}), the posterior is β | y ∼ N(µn, Λn^{-1}), where

µn = (Λ0 + X^T X)^{-1} X^T y.

Suppose that we observe a new x0 and want to predict the new y0. The predictive distribution is

y0 | y ∼ N( x0^T µn, σ² + x0^T Λn^{-1} x0 ).

Shaobo Jin (Math) Bayesian Statistics 56 / 65


Bayesian Estimation Gaussian Process Regression

Transform x

If we transform x ∈ R^p and obtain ϕ(x) ∈ R^d, then we can consider the linear regression model

y = f(x) + ϵ,   f(x) = ϕ^T(x) γ,

where ϵ | σ² ∼ N(0, σ²).

Under the conjugate prior γ ∼ Nd(0, Ω0^{-1}), the predictive distribution is

y0 | y, σ² ∼ N( ϕ^T(x0) µn, σ² + ϕ^T(x0) Ωn^{-1} ϕ(x0) ),

where µn = ( σ² Ω0 + ϕ^T(X) ϕ(X) )^{-1} ϕ^T(X) y.

The predictor is not linear in x0 but linear in ϕ(x0).

Shaobo Jin (Math) Bayesian Statistics 57 / 65


Bayesian Estimation Gaussian Process Regression

Kernel Function

A function κ (x, z) is a kernel function if

1. it is symmetric, κ(x, z) = κ(z, x),

2. the kernel matrix K with (i, j)th entry κ(xi, xj) is positive semi-definite for all x1, ..., xn.

Example

Show that κ(x, z) = x^T Λ0^{-1} z is a kernel function for a symmetric Λ0.

Shaobo Jin (Math) Bayesian Statistics 58 / 65


Bayesian Estimation Gaussian Process Regression

Rewrite Predictive Distribution


Let Φ = ϕ(X), the n × d matrix whose ith row is ϕ^T(xi). We can show that

Ω0^{-1} Φ^T ( σ² In + Φ Ω0^{-1} Φ^T )^{-1} = ( σ² Ω0 + Φ^T Φ )^{-1} Φ^T.
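A sketch of why the identity holds: multiplying out,

Φ^T ( σ² In + Φ Ω0^{-1} Φ^T ) = σ² Φ^T + Φ^T Φ Ω0^{-1} Φ^T = ( σ² Ω0 + Φ^T Φ ) Ω0^{-1} Φ^T,

and pre-multiplying by (σ² Ω0 + Φ^T Φ)^{-1} and post-multiplying by (σ² In + Φ Ω0^{-1} Φ^T)^{-1} gives the stated identity.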

Hence, the predictor from the predictive distribution is

ϕ^T(x0) µn = ϕ^T(x0) ( σ² Ω0 + Φ^T Φ )^{-1} Φ^T y
= ϕ^T(x0) Ω0^{-1} Φ^T ( σ² In + Φ Ω0^{-1} Φ^T )^{-1} y,

where ϕ^T(x0) Ω0^{-1} Φ^T is a 1 × n vector with elements ϕ^T(x0) Ω0^{-1} ϕ(xi), and Φ Ω0^{-1} Φ^T is an n × n matrix with elements ϕ^T(xi) Ω0^{-1} ϕ(xj).

Example
Show that κ(x, z) = ϕ^T(x) Ω0^{-1} ϕ(z) is a kernel function for a symmetric Ω0.

Shaobo Jin (Math) Bayesian Statistics 59 / 65


Bayesian Estimation Gaussian Process Regression

Predictive Distribution Using Kernel Function


Ifκ (x, z) is a kernel function, then we can nd a function ψ () such
T
that κ (x, z) = ψ (x) ψ (z).
h iT
−1/2 −1/2
κ (x, z) = ϕT (x) Ω−1
0 ϕ (z) = Ω 0 ϕ (x) Ω0 ϕ (z), where
−1/2
ψ (x) = Ω0 ϕ (x).
We can express the predictor from the predictive distribution as

−1
ϕT (x0 ) µn = K (x0 , X) σ 2 In + K (X, X)

y,

where

ϕT (x0 ) Ω−1
  T
K (x0 , X) = 0 ϕ (xi ) = ψ (x0 ) ψ (xi )

is a 1×n vector and

ϕ (xi ) Ω−1
 T  T
K (X, X) = 0 ϕ (xj ) = ψ (xi ) ψ (xj )

is a n×n matrix.
Shaobo Jin (Math) Bayesian Statistics 60 / 65
Bayesian Estimation Gaussian Process Regression

Kernel Trick

Our predictor ϕ^T(x0) µn depends on x only through the inner products ψ^T(x) ψ(z), such as ψ^T(x0) ψ(xi) and ψ^T(xi) ψ(xj).
The kernel trick is a commonly used way to create new features from the original observed features whenever the prediction depends on x only through such inner products.

By varying the kernel function, we obtain different sets of ϕ(x) and ψ(x) as our new features.

Shaobo Jin (Math) Bayesian Statistics 61 / 65


Bayesian Estimation Gaussian Process Regression

Create New Feature

If κ(x, z) is a kernel function, then we have an eigen-decomposition

κ(x, z) = Σ_{m=1}^∞ ρm em(x) em(z),

for some eigenvalues ρm and eigenfunctions em(x).

It can be viewed as creating (possibly infinitely many) new features:

κ(x, z) = Σ_{m=1}^∞ [ √ρm em(x) ] [ √ρm em(z) ],

with new features ψm(x) = √ρm em(x) and ψm(z) = √ρm em(z).

Shaobo Jin (Math) Bayesian Statistics 62 / 65


Bayesian Estimation Gaussian Process Regression

Bayesian Regression and Gaussian Process


The predictive distribution y0 | y, σ 2 is Gaussian with

−1
K (x0 , X) σ 2 In + K (X, X)

mean y,
−1
σ 2 + ϕT (x0 ) Ω0 + σ −2 ΦT Φ

variance ϕ (x0 ) .

We can show that the variance is equivalent to

−1
K (x0 , x0 ) − K (x0 , X) σ 2 In + K (X, X) K (X, x0 ) .

Recall that in Gaussian process regression, f (x0 ) | y, σ 2 is also


Gaussian with

−1
K (x0 , X) σ 2 In + K (X, X)

mean y,
−1
K (x0 , x0 ) − K (x0 , X) K (X, X) + σ 2 In

covariance K (X, x0 ) .

They are the same thing!


Shaobo Jin (Math) Bayesian Statistics 63 / 65
Bayesian Estimation Gaussian Process Regression

Prior on Function
Consider the linear regression model

y = f(x) + ϵ,   f(x) = ϕ^T(x) γ,

where ϵ | σ² ∼ N(0, σ²).

The conjugate prior γ ∼ Nd(0, Ω0^{-1}) implies that

f(x) ∼ N( 0, ϕ^T(x) Ω0^{-1} ϕ(x) ).

It can be viewed as placing a Gaussian prior on the function itself.

The prior distribution of any set of function values satisfies

( f(x1), ..., f(xn) )^T = ϕ(X) γ ∼ Nn( 0, ϕ(X) Ω0^{-1} ϕ(X)^T ).

Shaobo Jin (Math) Bayesian Statistics 64 / 65


Bayesian Estimation Gaussian Process Regression

Posterior on Function

The corresponding posterior is

γ | y, σ² ∼ N(µn, Ωn^{-1}),

where

µn = ( σ² Ω0 + ϕ^T(X) ϕ(X) )^{-1} ϕ^T(X) y.

It can be viewed as the function having a Gaussian posterior,

f(x) | y, σ² ∼ N( ϕ^T(x) µn, ϕ^T(x) Ωn^{-1} ϕ(x) ).

The predictive distribution is also Gaussian.

Shaobo Jin (Math) Bayesian Statistics 65 / 65
