The document discusses Bayesian linear regression when the variance σ² is unknown. It introduces the posterior distribution p(w, σ²|D) when using a normal-inverse-gamma prior. The posterior takes the form of a normal-inverse-gamma distribution with updated hyperparameters. Marginal posterior distributions for w and σ² can be derived in closed form. Empirical Bayes approximations can be used to perform model selection in Bayesian linear regression.


Bayesian Linear Regression
Prof. Nicholas Zabaras

Email: [email protected]
URL: https://www.zabaras.com/

October 2, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
• Bayesian inference in linear regression when σ² is unknown; posterior, posterior marginals and predictive distributions

• Zellner's g-prior, uninformative (semi-conjugate) prior, credible intervals, variable selection

• Example of computing the model evidence; example: regression model selection

• Limitations of fixed basis functions

Following closely:

• Chris Bishop's PRML book, Chapter 3

• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 7

• Regression using parametric discriminative models in pmtk3 (run TutRegr.m in pmtk3)


Goals
• The goals for today's lecture include the following:

• Learn to work with Bayesian regression models with unknown σ²

• Introduce Zellner's g-prior and its uninformative limits

• Understand how to compute the posterior, posterior marginals and predictive distributions, as well as credible intervals

• Understand how to implement variable selection

• Learn to apply empirical Bayes/the evidence approximation to Bayesian linear regression models


Bayesian Inference when σ² is Unknown

• Let us extend the previous results for linear regression, assuming now that σ² is unknown.

• Assume a likelihood of the form:*

$$p(\mathbf{y}\mid \mathbf{X},\mathbf{w},\sigma^2) = \mathcal{N}\!\left(\mathbf{y}\mid \mathbf{X}\mathbf{w},\, \sigma^2 \mathbf{I}_N\right) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left(-\frac{(\mathbf{y}-\mathbf{X}\mathbf{w})^T(\mathbf{y}-\mathbf{X}\mathbf{w})}{2\sigma^2}\right)$$

• A conjugate prior has the following form:

$$p(\mathbf{w},\sigma^2) = \mathcal{NIG}(\mathbf{w},\sigma^2\mid \mathbf{w}_0, \mathbf{V}_0, a_0, b_0) \triangleq \mathcal{N}(\mathbf{w}\mid \mathbf{w}_0, \sigma^2\mathbf{V}_0)\,\mathcal{IG}(\sigma^2\mid a_0, b_0)$$
$$= \frac{b_0^{a_0}}{(2\pi)^{D/2}|\mathbf{V}_0|^{1/2}\Gamma(a_0)}\, (\sigma^2)^{-(a_0+D/2+1)} \exp\!\left(-\frac{(\mathbf{w}-\mathbf{w}_0)^T \mathbf{V}_0^{-1}(\mathbf{w}-\mathbf{w}_0) + 2b_0}{2\sigma^2}\right)$$

• The posterior is now derived as:

$$p(\mathbf{w},\sigma^2\mid \mathcal{D}) \propto \frac{b_0^{a_0}}{(2\pi)^{(N+D)/2}|\mathbf{V}_0|^{1/2}\Gamma(a_0)}\, (\sigma^2)^{-(a_0+(D+N)/2+1)} \exp\!\left(-\frac{(\mathbf{w}-\mathbf{w}_0)^T \mathbf{V}_0^{-1}(\mathbf{w}-\mathbf{w}_0) + (\mathbf{y}-\mathbf{X}\mathbf{w})^T(\mathbf{y}-\mathbf{X}\mathbf{w}) + 2b_0}{2\sigma^2}\right)$$

* Here the response is denoted as $\mathbf{y}$, the dimensionality of $\mathbf{w}$ as $D$, and the design matrix as $\mathbf{X}$.
Bayesian Inference when σ² is Unknown

$$p(\mathbf{w},\sigma^2\mid \mathcal{D}) \propto \frac{b_0^{a_0}}{(2\pi)^{(N+D)/2}|\mathbf{V}_0|^{1/2}\Gamma(a_0)}\, (\sigma^2)^{-(a_0+(D+N)/2+1)} \exp\!\left(-\frac{(\mathbf{w}-\mathbf{w}_0)^T \mathbf{V}_0^{-1}(\mathbf{w}-\mathbf{w}_0) + (\mathbf{y}-\mathbf{X}\mathbf{w})^T(\mathbf{y}-\mathbf{X}\mathbf{w}) + 2b_0}{2\sigma^2}\right)$$

• Let us define the following:

$$\mathbf{V}_N = \left(\mathbf{V}_0^{-1} + \mathbf{X}^T\mathbf{X}\right)^{-1}, \qquad \mathbf{w}_N = \mathbf{V}_N\left(\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{X}^T\mathbf{y}\right)$$
$$a_N = a_0 + N/2, \qquad b_N = b_0 + \tfrac{1}{2}\left(\mathbf{w}_0^T\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{y}^T\mathbf{y} - \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N\right)$$

• With these definitions, simple algebra shows:

$$p(\mathbf{w},\sigma^2\mid \mathcal{D}) \propto (\sigma^2)^{-(a_N+D/2+1)} \exp\!\left(-\frac{(\mathbf{w}-\mathbf{w}_0)^T\mathbf{V}_0^{-1}(\mathbf{w}-\mathbf{w}_0) + (\mathbf{y}-\mathbf{X}\mathbf{w})^T(\mathbf{y}-\mathbf{X}\mathbf{w}) + 2b_N - \mathbf{w}_0^T\mathbf{V}_0^{-1}\mathbf{w}_0 - \mathbf{y}^T\mathbf{y} + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N}{2\sigma^2}\right)$$
$$\propto (\sigma^2)^{-(a_N+D/2+1)} \exp\!\left(-\frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) + 2b_N}{2\sigma^2}\right)$$

$$p(\mathbf{w},\sigma^2\mid \mathcal{D}) = \mathcal{NIG}(\mathbf{w},\sigma^2\mid \mathbf{w}_N, \mathbf{V}_N, a_N, b_N) \triangleq \mathcal{N}(\mathbf{w}\mid \mathbf{w}_N, \sigma^2\mathbf{V}_N)\,\mathcal{IG}(\sigma^2\mid a_N, b_N)$$

• The posterior marginals can now be derived explicitly:

$$p(\sigma^2\mid \mathcal{D}) = \mathcal{IG}(\sigma^2\mid a_N, b_N), \qquad p(\mathbf{w}\mid \mathcal{D}) = \mathcal{T}_D\!\left(\mathbf{w}_N,\ \tfrac{b_N}{a_N}\mathbf{V}_N,\ 2a_N\right) \propto \left[1 + \frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N)}{2b_N}\right]^{-\frac{2a_N+D}{2}}$$
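To make the update concrete, here is a minimal NumPy sketch of the NIG posterior update above. This is not the lecture's PMTK3 code; the function name, the synthetic data and the prior values (w₀ = 0, V₀ = 10I, a₀ = b₀ = 1) are illustrative assumptions.

```python
# Sketch: conjugate normal-inverse-gamma (NIG) update for Bayesian linear
# regression with unknown noise variance, following the slide's formulas.
import numpy as np

def nig_posterior(X, y, w0, V0, a0, b0):
    """Return (wN, VN, aN, bN) of the NIG posterior."""
    V0_inv = np.linalg.inv(V0)
    VN_inv = V0_inv + X.T @ X
    VN = np.linalg.inv(VN_inv)                       # V_N = (V0^-1 + X^T X)^-1
    wN = VN @ (V0_inv @ w0 + X.T @ y)                # w_N = V_N (V0^-1 w0 + X^T y)
    aN = a0 + 0.5 * len(y)                           # a_N = a0 + N/2
    bN = b0 + 0.5 * (w0 @ V0_inv @ w0 + y @ y - wN @ VN_inv @ wN)
    return wN, VN, aN, bN

# Illustrative usage on synthetic data; the marginal posterior of w is then a
# multivariate Student-t with mean wN, scale (bN/aN) VN and 2*aN d.o.f.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.3 * rng.normal(size=50)
wN, VN, aN, bN = nig_posterior(X, y, np.zeros(3), 10.0 * np.eye(3), 1.0, 1.0)
print(wN, bN / aN)   # posterior mean of w and the scale b_N/a_N for sigma^2
```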
Posterior Marginals

$$p(\mathbf{w}\mid \mathcal{D}) = \int_0^{\infty} (\sigma^2)^{-(a_N+D/2+1)} \exp\!\left(-\frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) + 2b_N}{2\sigma^2}\right) d\sigma^2 \propto \left[1 + \frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N)}{2b_N}\right]^{-\frac{2a_N+D}{2}}$$

• The marginal posterior can be directly written as:

$$p(\mathbf{w}\mid \mathcal{D}) = \mathcal{T}_D\!\left(\mathbf{w}_N,\ \tfrac{b_N}{a_N}\mathbf{V}_N,\ 2a_N\right) \propto \left[1 + \frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N)}{2b_N}\right]^{-\frac{2a_N+D}{2}}$$

• Student's $\mathcal{T}$ distribution:

$$\mathcal{T}(\mathbf{x}\mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{D}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\,\frac{|\boldsymbol{\Sigma}|^{-1/2}}{(\nu\pi)^{D/2}}\left[1 + \frac{\Delta^2}{\nu}\right]^{-\frac{\nu+D}{2}}, \qquad \Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\ \ (\text{Mahalanobis distance})$$
$$\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}\ \text{if}\ \nu>1, \qquad \mathrm{cov}[\mathbf{x}] = \frac{\nu}{\nu-2}\boldsymbol{\Sigma}\ \text{if}\ \nu>2, \qquad \mathrm{mode}[\mathbf{x}] = \boldsymbol{\mu}$$

To compute the integral above, simply set $\lambda = \sigma^{-2}$, $d\sigma^2 = -\lambda^{-2}d\lambda$, and use the normalizing factor of the Gamma distribution, $\int_0^{\infty} \lambda^{a-1} e^{-b\lambda}\, d\lambda = \Gamma(a)\, b^{-a} \propto b^{-a}$.
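As a small sketch of the Student's $\mathcal{T}$ density written on this slide (useful for evaluating the marginal posterior of $\mathbf{w}$ numerically), assuming NumPy/SciPy; `mvt_logpdf` is a hypothetical helper name, and the Cholesky-based computation of the Mahalanobis distance is just one convenient way to evaluate $\Delta^2$.

```python
# Sketch: log density of the multivariate Student-t T(x | mu, Sigma, nu)
# exactly as written on the slide.
import numpy as np
from scipy.special import gammaln

def mvt_logpdf(x, mu, Sigma, nu):
    D = len(mu)
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, x - mu)
    maha = z @ z                                   # (x-mu)^T Sigma^-1 (x-mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log |Sigma|
    return (gammaln(0.5 * (nu + D)) - gammaln(0.5 * nu)
            - 0.5 * D * np.log(nu * np.pi) - 0.5 * logdet
            - 0.5 * (nu + D) * np.log1p(maha / nu))
```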
Posterior Predictive Distribution

• Consider the posterior predictive for $m$ new test inputs:

$$p(\tilde{\mathbf{y}}\mid \tilde{\mathbf{X}}, \mathcal{D}) \propto \iint \frac{(\sigma^2)^{-m/2}}{(2\pi)^{m/2}} \exp\!\left(-\frac{(\tilde{\mathbf{y}}-\tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{y}}-\tilde{\mathbf{X}}\mathbf{w})}{2\sigma^2}\right) (\sigma^2)^{-(a_N+D/2+1)} \exp\!\left(-\frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) + 2b_N}{2\sigma^2}\right) d\mathbf{w}\, d\sigma^2$$

• As a first step, let us integrate in $\mathbf{w}$ by completing the square; the quadratic term in $\mathbf{w}$ integrates out as a Gaussian normalization:

$$(\tilde{\mathbf{y}}-\tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{y}}-\tilde{\mathbf{X}}\mathbf{w}) + (\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) + 2b_N$$
$$= \left(\mathbf{w} - \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\right)^{-1}\!\left(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N\right)\right)^T \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\right) \left(\mathbf{w} - \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\right)^{-1}\!\left(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N\right)\right)$$
$$\quad - \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N\right)^T \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\right)^{-1} \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N\right) + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N + \tilde{\mathbf{y}}^T\tilde{\mathbf{y}} + 2b_N$$

• Let us denote the remaining terms (everything after the completed square) in the equation above as

$$2\beta = - \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N\right)^T \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\right)^{-1} \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N\right) + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N + \tilde{\mathbf{y}}^T\tilde{\mathbf{y}} + 2b_N$$


Posterior Predictive Distribution

• The posterior predictive

$$p(\tilde{\mathbf{y}}\mid \tilde{\mathbf{X}}, \mathcal{D}) \propto \iint \frac{(\sigma^2)^{-m/2}}{(2\pi)^{m/2}} \exp\!\left(-\frac{(\tilde{\mathbf{y}}-\tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{y}}-\tilde{\mathbf{X}}\mathbf{w})}{2\sigma^2}\right) (\sigma^2)^{-(a_N+D/2+1)} \exp\!\left(-\frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) + 2b_N}{2\sigma^2}\right) d\mathbf{w}\, d\sigma^2$$

is now simplified using $\lambda = 1/\sigma^2$ and recalling the normalization of the Gamma distribution:

$$p(\tilde{\mathbf{y}}\mid \tilde{\mathbf{X}}, \mathcal{D}) \propto \int \lambda^{m/2 + a_N - 1} \exp(-\beta\lambda)\, d\lambda \propto \beta^{-\left(\frac{m}{2} + a_N\right)}$$

• Substituting $\beta$ and comparing the two equations, one can verify:

$$p(\tilde{\mathbf{y}}\mid \tilde{\mathbf{X}}, \mathcal{D}) \propto \left[- \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N\right)^T \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\right)^{-1} \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N\right) + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N + \tilde{\mathbf{y}}^T\tilde{\mathbf{y}} + 2b_N\right]^{-\left(\frac{m}{2}+a_N\right)}$$
$$\propto \left[1 + \frac{(\tilde{\mathbf{y}}-\tilde{\mathbf{X}}\mathbf{w}_N)^T \left(\frac{b_N}{a_N}\left(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\right)\right)^{-1} (\tilde{\mathbf{y}}-\tilde{\mathbf{X}}\mathbf{w}_N)}{2a_N}\right]^{-\left(\frac{m}{2}+a_N\right)}$$

• Use the Sherman-Morrison-Woodbury formula here to show that (symmetry of $\mathbf{V}_0$ is assumed)

$$\left(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\right)^{-1} = \mathbf{I}_m - \tilde{\mathbf{X}}\left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\right)^{-1}\tilde{\mathbf{X}}^T$$
Bayesian Inference when σ² is Unknown

• The posterior predictive is also a Student's $\mathcal{T}$:

$$p(\tilde{\mathbf{y}}\mid \tilde{\mathbf{X}}, \mathcal{D}) = \mathcal{T}_m\!\left(\tilde{\mathbf{y}}\,\middle|\, \tilde{\mathbf{X}}\mathbf{w}_N,\ \frac{b_N}{a_N}\left(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\right),\ 2a_N\right)$$

• The predictive variance has two terms:

• $\frac{b_N}{a_N}\mathbf{I}_m$, due to the measurement noise, and

• $\frac{b_N}{a_N}\tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T$, due to the uncertainty in $\mathbf{w}$. The second term depends on how close a test input is to the training data.
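A sketch of the predictive parameters implied by this slide, assuming the NIG posterior quantities (w_N, V_N, a_N, b_N) have already been computed, e.g. with the earlier sketch; `posterior_predictive` is an illustrative name, not a PMTK3 routine.

```python
# Sketch: parameters of the Student-t posterior predictive
# p(y~ | X~, D) = T_m( X~ wN, (bN/aN)(I_m + X~ VN X~^T), 2 aN ).
import numpy as np

def posterior_predictive(Xtest, wN, VN, aN, bN):
    mean = Xtest @ wN
    scale = (bN / aN) * (np.eye(len(Xtest)) + Xtest @ VN @ Xtest.T)
    dof = 2.0 * aN
    # For dof > 2, the predictive covariance is dof / (dof - 2) * scale;
    # the diagonal of scale splits into the noise and the w-uncertainty terms.
    return mean, scale, dof
```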


Zellner's g-Prior

$$p(\mathbf{w},\sigma^2) = \mathcal{NIG}(\mathbf{w},\sigma^2\mid \mathbf{w}_0, \mathbf{V}_0, a_0, b_0) \triangleq \mathcal{N}(\mathbf{w}\mid \mathbf{w}_0, \sigma^2\mathbf{V}_0)\,\mathcal{IG}(\sigma^2\mid a_0, b_0)$$

• It is common to set $a_0 = b_0 = 0$, corresponding to an uninformative prior for $\sigma^2$, and to set $\mathbf{w}_0 = \mathbf{0}$ and $\mathbf{V}_0 = g(\mathbf{X}^T\mathbf{X})^{-1}$ for any positive value $g$.

• This is called Zellner's g-prior. Here $g$ plays a role analogous to $1/\lambda$ in ridge regression. However, the prior covariance is proportional to $(\mathbf{X}^T\mathbf{X})^{-1}$ rather than $\mathbf{I}$.

$$p(\mathbf{w},\sigma^2) = \mathcal{NIG}\!\left(\mathbf{w},\sigma^2\mid \mathbf{0},\ g(\mathbf{X}^T\mathbf{X})^{-1},\ 0, 0\right) \triangleq \mathcal{N}\!\left(\mathbf{w}\mid \mathbf{0},\ \sigma^2 g(\mathbf{X}^T\mathbf{X})^{-1}\right)\mathcal{IG}(\sigma^2\mid 0, 0)$$

• This ensures that the posterior is invariant to scaling of the inputs.

• Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques, Studies in Bayesian Econometrics and Statistics, volume 6. North-Holland.
• Minka, T. (2000). Bayesian linear regression. Technical report, MIT.
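As a sketch, the g-prior simply fixes the NIG hyperparameters to w₀ = 0, V₀ = g(X^T X)^{-1} and a₀ = b₀ = 0, so the posterior update specializes as below; `g_prior_posterior` is an illustrative helper (choosing g = N recovers the unit information prior discussed on the next slide).

```python
# Sketch: NIG posterior under Zellner's g-prior.
import numpy as np

def g_prior_posterior(X, y, g):
    """w0 = 0, V0 = g (X^T X)^-1, a0 = b0 = 0."""
    N = len(y)
    XtX = X.T @ X
    VN_inv = XtX / g + XtX          # V0^-1 + X^T X, with V0^-1 = (X^T X)/g
    VN = np.linalg.inv(VN_inv)
    wN = VN @ (X.T @ y)             # w0 = 0, so w_N = V_N X^T y
    aN = 0.5 * N                    # a0 = 0
    bN = 0.5 * (y @ y - wN @ VN_inv @ wN)   # b0 = 0, w0 = 0
    return wN, VN, aN, bN
```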


Unit Information Prior

$$p(\mathbf{w},\sigma^2) = \mathcal{NIG}\!\left(\mathbf{w},\sigma^2\mid \mathbf{0},\ g(\mathbf{X}^T\mathbf{X})^{-1},\ 0, 0\right) \triangleq \mathcal{N}\!\left(\mathbf{w}\mid \mathbf{0},\ \sigma^2 g(\mathbf{X}^T\mathbf{X})^{-1}\right)\mathcal{IG}(\sigma^2\mid 0, 0)$$

• We will see below that if we use an uninformative prior, the posterior precision given $N$ measurements is $\mathbf{V}_N^{-1} = \mathbf{X}^T\mathbf{X}$.

• The unit information prior is defined to contain as much information as one sample.

• To create a unit information prior for linear regression, we need to use $\mathbf{V}_0^{-1} = \frac{1}{N}\mathbf{X}^T\mathbf{X}$, which is equivalent to the g-prior with $g = N$.

• Zellner's prior depends on the data: this is contrary to much of our Bayesian inference discussion!

• Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. of the Am. Stat. Assoc. 90(431), 928–934.
Uninformative Prior

• An uninformative prior can be obtained by considering the uninformative limit of the conjugate g-prior, which corresponds to setting $g = \infty$. This is equivalent to an improper NIG prior with $\mathbf{w}_0 = \mathbf{0}$, $\mathbf{V}_0 = \infty\mathbf{I}$, $a_0 = 0$ and $b_0 = 0$, which gives $p(\mathbf{w},\sigma^2) \propto \sigma^{-(D+2)}$.

$$p(\mathbf{w},\sigma^2) = \mathcal{NIG}(\mathbf{w},\sigma^2\mid \mathbf{0}, \infty\mathbf{I}, 0, 0) \triangleq \mathcal{N}(\mathbf{w}\mid \mathbf{0}, \sigma^2\infty\mathbf{I})\,\mathcal{IG}(\sigma^2\mid 0, 0) \propto \sigma^{-(D+2)}$$

• Alternatively, we can start with the semi-conjugate prior $p(\mathbf{w},\sigma^2) = p(\mathbf{w})\,p(\sigma^2)$ and take each term to its uninformative limit individually, which gives $p(\mathbf{w},\sigma^2) \propto \sigma^{-2}$.

• This is equivalent to an improper NIG prior with $\mathbf{w}_0 = \mathbf{0}$, $\mathbf{V}_0 = \infty\mathbf{I}$, $a_0 = -D/2$ and $b_0 = 0$.

$$p(\mathbf{w},\sigma^2) = \mathcal{NIG}\!\left(\mathbf{w},\sigma^2\mid \mathbf{0}, \infty\mathbf{I}, -\tfrac{D}{2}, 0\right) \triangleq \mathcal{N}(\mathbf{w}\mid \mathbf{0}, \sigma^2\infty\mathbf{I})\,\mathcal{IG}\!\left(\sigma^2\mid -\tfrac{D}{2}, 0\right) \propto \sigma^{-2}$$
Uninformative Prior

• Using the uninformative prior $p(\mathbf{w},\sigma^2) \propto \sigma^{-2}$, the corresponding posterior and marginal posteriors are given by

$$p(\mathbf{w},\sigma^2\mid \mathcal{D}) = \mathcal{NIG}(\mathbf{w},\sigma^2\mid \mathbf{w}_N, \mathbf{V}_N, a_N, b_N)$$
$$p(\mathbf{w}\mid \mathcal{D}) = \mathcal{T}_D\!\left(\mathbf{w}_N,\ \tfrac{b_N}{a_N}\mathbf{V}_N,\ 2a_N\right) = \mathcal{T}_D\!\left(\mathbf{w}\,\middle|\,\hat{\mathbf{w}}_{MLE},\ \frac{s^2}{N-D}\mathbf{C},\ N-D\right)$$

where

$$\mathbf{V}_N = \mathbf{C} = \left(\mathbf{V}_0^{-1} + \mathbf{X}^T\mathbf{X}\right)^{-1} \rightarrow \left(\mathbf{X}^T\mathbf{X}\right)^{-1}, \qquad \mathbf{w}_N = \mathbf{V}_N\left(\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{X}^T\mathbf{y}\right) \rightarrow \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y} = \hat{\mathbf{w}}_{MLE}$$
$$a_N = a_0 + N/2 = (N-D)/2, \qquad b_N = b_0 + \tfrac{1}{2}\left(\mathbf{w}_0^T\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{y}^T\mathbf{y} - \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N\right) = s^2/2, \qquad s^2 = \left(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}}_{MLE}\right)^T\left(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}}_{MLE}\right)$$

• Note in the calculation of $s^2$:

$$s^2 = \left(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}}_{MLE}\right)^T\left(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}}_{MLE}\right) = \left(\mathbf{y}-\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\right)^T\left(\mathbf{y}-\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\right)$$
$$= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{y}^T\mathbf{y} - \hat{\mathbf{w}}_{MLE}^T\mathbf{V}_N^{-1}\hat{\mathbf{w}}_{MLE}$$


Frequentist Confidence Interval vs. Bayesian Marginal Credible Interval

• The use of a (semi-conjugate) uninformative prior is quite interesting, since the resulting posterior turns out to be equivalent to the results obtained from frequentist statistics.

$$p(w_j\mid \mathcal{D}) = \mathcal{T}\!\left(w_j\,\middle|\,\hat{w}_j,\ \frac{C_{jj}\,s^2}{N-D},\ N-D\right)$$

• This is equivalent to the sampling distribution of the MLE, which is given by:

$$\frac{w_j - \hat{w}_j}{s_j} \sim \mathcal{T}_{N-D}, \qquad s_j = \sqrt{\frac{C_{jj}\,s^2}{N-D}}$$

where $s_j$ is the standard error of the estimated parameter.

• The frequentist confidence interval and the Bayesian marginal credible interval for the parameters are the same.

• Rice, J. (1995). Mathematical Statistics and Data Analysis. Duxbury, 2nd edition (page 542).
• Casella, G. and R. Berger (2002). Statistical Inference. Duxbury, 2nd edition (page 554).
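A sketch of the closed-form marginal credible intervals on this slide under the uninformative prior (this is not the linregBayesCaterpillar script used for the next slide's table); the function name and the 95% default level are illustrative.

```python
# Sketch: marginal 95% credible intervals for the weights under the
# uninformative prior; these coincide with the frequentist intervals.
import numpy as np
from scipy.stats import t as student_t

def marginal_credible_intervals(X, y, level=0.95):
    N, D = X.shape
    C = np.linalg.inv(X.T @ X)
    w_hat = C @ X.T @ y                          # posterior mean = MLE
    s2 = np.sum((y - X @ w_hat) ** 2)            # s^2 = ||y - X w_hat||^2
    se = np.sqrt(np.diag(C) * s2 / (N - D))      # standard errors s_j
    q = student_t.ppf(0.5 * (1 + level), df=N - D)
    ci = np.column_stack([w_hat - q * se, w_hat + q * se])
    return w_hat, se, ci
```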


Variable Selection: The Caterpillar Example

• As a worked example of the uninformative prior, consider the caterpillar dataset. We can compute the posterior mean and standard deviation, and the 95% credible intervals (CI) for the regression coefficients.

• The 95% credible intervals are identical to the 95% confidence intervals computed using standard frequentist methods.

  coeff   mean     stddev    95% CI              sig
  w0      10.998   3.06027   [ 4.652, 17.345]    *
  w1      -0.004   0.00156   [-0.008, -0.001]    *
  w2      -0.054   0.02190   [-0.099, -0.008]    *
  w3       0.068   0.09947   [-0.138,  0.274]
  w4      -1.294   0.56381   [-2.463, -0.124]    *
  w5       0.232   0.10438   [ 0.015,  0.448]    *
  w6      -0.357   1.56646   [-3.605,  2.892]
  w7      -0.237   1.00601   [-2.324,  1.849]
  w8       0.181   0.23672   [-0.310,  0.672]
  w9      -1.285   0.86485   [-3.079,  0.508]
  w10     -0.433   0.73487   [-1.957,  1.091]

Run linregBayesCaterpillar from PMTK3.

• Marin, J.-M. and C. Robert (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer.
Variable Selection
• We can use these marginal posteriors to assess whether the coefficients are significantly different from 0, i.e., check whether each 95% CI excludes 0.

• The CIs for coefficients 0, 1, 2, 4, 5 are all significant.

• These results are the same as those produced by a frequentist approach using p-values at the 5% level.

• Note, however, that the MLE does not even exist when $N < D$, so standard frequentist inference theory breaks down in this setting. Bayesian inference theory still works using proper priors.

• Maruyama, Y. and E. George (2008). A g-prior extension for p > n. Technical report, U. Tokyo.
Example: Model Evidence

• We have seen that the conjugate prior for a Gaussian with unknown mean and unknown precision is a normal-gamma distribution.

• We apply this to the regression case with likelihood ($\mathbf{X}$ denoting the training data)

$$p(\mathbf{t}\mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n\mid \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\ \beta^{-1}\right)$$

for which the conjugate prior for $\mathbf{w}$ and the precision $\beta$ is:

$$p(\mathbf{w},\beta) = \mathcal{N}\!\left(\mathbf{w}\mid \mathbf{m}_0, \beta^{-1}\mathbf{S}_0\right)\mathrm{Gam}(\beta\mid a_0, b_0)$$

• It can be shown that the corresponding posterior takes the form:

$$p(\mathbf{w},\beta\mid \mathbf{t}) = \mathcal{N}\!\left(\mathbf{w}\mid \mathbf{m}_N, \beta^{-1}\mathbf{S}_N\right)\mathrm{Gam}(\beta\mid a_N, b_N)$$


Example: Posterior Distribution

• The posterior takes the form:

$$p(\mathbf{w},\beta\mid \mathbf{t}) \propto \beta^{N/2}\exp\!\left(-\frac{\beta}{2}\left[\mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} - 2\mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{t} + \sum_{n=1}^{N} t_n^2\right]\right)\beta^{M/2}\exp\!\left(-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right)\beta^{a_0-1}e^{-b_0\beta}$$

• Completing the square in $\mathbf{w}$:

$$p(\mathbf{w},\beta\mid \mathbf{t}) \propto \beta^{(N+M)/2}\exp\!\left(-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N)\right)\beta^{a_0-1}\exp\!\left(-\beta\left[b_0 + \tfrac{1}{2}\left(\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N + \sum_{n=1}^{N} t_n^2\right)\right]\right)$$

$$p(\mathbf{w},\beta\mid \mathbf{t}) = \mathcal{N}\!\left(\mathbf{w}\mid \mathbf{m}_N, \beta^{-1}\mathbf{S}_N\right)\mathrm{Gam}(\beta\mid a_N, b_N)$$

where:

$$\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \boldsymbol{\Phi}^T\mathbf{t}\right), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}$$
$$a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2}\left(\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N + \sum_{n=1}^{N} t_n^2\right)$$
Example: Model Evidence

• The model evidence for our example is given as:

$$p(\mathbf{t}) = \iint p(\mathbf{t}\mid \mathbf{w},\beta)\, p(\mathbf{w}\mid \beta)\, d\mathbf{w}\; p(\beta)\, d\beta$$
$$= \iint \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\!\left(-\frac{\beta}{2}(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})\right)\left(\frac{\beta}{2\pi}\right)^{M/2}|\mathbf{S}_0|^{-1/2}\exp\!\left(-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right)\Gamma(a_0)^{-1}b_0^{a_0}\beta^{a_0-1}e^{-b_0\beta}\, d\mathbf{w}\, d\beta$$
$$= \frac{b_0^{a_0}}{(2\pi)^{(M+N)/2}|\mathbf{S}_0|^{1/2}\Gamma(a_0)}\iint \exp\!\left(-\frac{\beta}{2}(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})\right)\exp\!\left(-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right)\beta^{a_0-1+N/2+M/2}\, e^{-b_0\beta}\, d\mathbf{w}\, d\beta$$


Example: Model Evidence

• Using some of our earlier results in deriving the posterior,

$$\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \boldsymbol{\Phi}^T\mathbf{t}\right), \quad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}, \quad a_N = a_0 + \frac{N}{2}, \quad b_N = b_0 + \frac{1}{2}\left(\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N + \sum_{n=1}^{N} t_n^2\right)$$

$$p(\mathbf{t}) = \frac{b_0^{a_0}}{(2\pi)^{(M+N)/2}|\mathbf{S}_0|^{1/2}\Gamma(a_0)}\iint \exp\!\left(-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N)\right)\exp\!\left(-\frac{\beta}{2}\left(\mathbf{t}^T\mathbf{t} + \mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N\right)\right)\beta^{a_N-1+M/2}\, e^{-b_0\beta}\, d\mathbf{w}\, d\beta$$

• Performing the integration in $\mathbf{w}$ and using the normalization factor of the Gamma distribution:

$$p(\mathbf{t}) = \frac{b_0^{a_0}}{(2\pi)^{(M+N)/2}|\mathbf{S}_0|^{1/2}\Gamma(a_0)}\,(2\pi)^{M/2}|\mathbf{S}_N|^{1/2}\int \beta^{a_N-1}\exp(-b_N\beta)\, d\beta = \frac{b_0^{a_0}}{(2\pi)^{(M+N)/2}}\,\frac{|\mathbf{S}_N|^{1/2}}{|\mathbf{S}_0|^{1/2}}\,\frac{(2\pi)^{M/2}}{\Gamma(a_0)}\,\frac{\Gamma(a_N)}{b_N^{a_N}}$$
$$= \frac{1}{(2\pi)^{N/2}}\,\frac{|\mathbf{S}_N|^{1/2}}{|\mathbf{S}_0|^{1/2}}\,\frac{b_0^{a_0}}{b_N^{a_N}}\,\frac{\Gamma(a_N)}{\Gamma(a_0)}$$
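A sketch that evaluates this closed-form evidence, in log form for numerical stability; `log_evidence` is an illustrative name and the snippet assumes NumPy/SciPy rather than any particular course code.

```python
# Sketch: log of the closed-form evidence derived above,
# log p(t) = -N/2 log(2 pi) + 1/2 (log|S_N| - log|S_0|)
#            + a0 log b0 - aN log bN + log Gamma(aN) - log Gamma(a0).
import numpy as np
from scipy.special import gammaln

def log_evidence(Phi, t, m0, S0, a0, b0):
    N = len(t)
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + Phi.T @ Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ t)
    aN = a0 + 0.5 * N
    bN = b0 + 0.5 * (m0 @ S0_inv @ m0 - mN @ SN_inv @ mN + t @ t)
    _, logdet_SN = np.linalg.slogdet(SN)
    _, logdet_S0 = np.linalg.slogdet(S0)
    return (-0.5 * N * np.log(2 * np.pi) + 0.5 * (logdet_SN - logdet_S0)
            + a0 * np.log(b0) - aN * np.log(bN) + gammaln(aN) - gammaln(a0))
```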


Example: Regression Model Selection

• Let us consider a regression model with the following particulars ($d = M+1$ dimensional data):*

$$\mathbf{y}\mid \mathbf{w}, \sigma^2, \mathcal{M} \sim \mathcal{N}_N\!\left(\boldsymbol{\Phi}\mathbf{w},\ \sigma^2\mathbf{I}_N\right)$$
$$\mathbf{w}\mid \sigma^2, \mathcal{M} \sim \mathcal{N}_{M+1}\!\left(\mathbf{w}_0,\ \sigma^2\boldsymbol{\Sigma}_{M+1}\right), \quad \boldsymbol{\Sigma}_{M+1}\ \text{an}\ (M+1)\times(M+1)\ \text{positive definite symmetric matrix}$$
$$\sigma^2\mid \mathcal{M} \sim \mathrm{InvGamma}(a,b), \quad a, b > 0$$
$$\boldsymbol{\Sigma}_{M+1}^{-1} = \mathbf{I}_{M+1}/c, \quad c > 0, \quad \text{and} \quad \mathbf{w}_0 = \mathbf{0}_{M+1}$$

• Our data enter through a design matrix of dimension $N\times(M+1)$ whose rows are $[1,\ x_i,\ x_i^2,\ \dots,\ x_i^M]$:

$$\boldsymbol{\Phi} = \begin{pmatrix} 1 & x_1 & x_1^2 & \dots & x_1^M \\ 1 & x_2 & x_2^2 & \dots & x_2^M \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \dots & x_N^M \end{pmatrix}$$

* We only slightly change our notation here to conform with the MatLab program implementing this example (from Zoubin Ghahramani).
Example: Likelihood Calculation

• We will derive the model evidence analytically. First, the likelihood can be written as:

$$\ell(\mathbf{w},\sigma^2\mid \mathbf{y},\boldsymbol{\Phi}) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left(-\frac{1}{2\sigma^2}(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w})\right)$$

• With simple algebra, we can rewrite the likelihood introducing the MLE estimate of the parameters as follows:

$$\ell(\mathbf{w},\sigma^2\mid \mathbf{y},\boldsymbol{\Phi}) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left(-\frac{1}{2\sigma^2}(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}_{ML})^T(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}_{ML}) - \frac{1}{2\sigma^2}(\mathbf{w}-\mathbf{w}_{ML})^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}(\mathbf{w}-\mathbf{w}_{ML})\right)$$
$$= \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left(-\frac{1}{2\sigma^2}s^2 - \frac{1}{2\sigma^2}(\mathbf{w}-\mathbf{w}_{ML})^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}(\mathbf{w}-\mathbf{w}_{ML})\right)$$

where

$$\mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{y}, \qquad s^2 \triangleq (\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}_{ML})^T(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}_{ML})$$


Example: Computing Model Evidence

• Our unnormalized posterior (the product of the likelihood and the prior) is of the following form:

$$\ell(\mathbf{w},\sigma^2\mid \mathbf{y},\boldsymbol{\Phi})\, p(\mathbf{w},\sigma^2) = \frac{b^a\, c^{-d/2}}{(2\pi)^{(N+d)/2}\,\Gamma(a)}\left(\sigma^2\right)^{-\frac{M+N+2a+3}{2}}\exp\!\left(-\frac{A}{2\sigma^2}\right)$$
$$A \equiv s^2 + 2b + \mathbf{w}^T\!\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)\mathbf{w} - 2\mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML} + \mathbf{w}_{ML}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML}$$

• The evidence is now computed as (use first the normalization of the Inverse Gamma distribution to integrate out $\sigma^2$):

$$p(\mathbf{y}\mid \mathcal{M}) = \iint \ell(\mathbf{w},\sigma^2\mid \mathbf{y},\boldsymbol{\Phi})\, p(\mathbf{w},\sigma^2)\, d\mathbf{w}\, d\sigma^2 = \frac{b^a\, c^{-d/2}}{(2\pi)^{(N+d)/2}\,\Gamma(a)}\int\!\left[\int_0^{\infty}\left(\sigma^2\right)^{-\frac{M+N+2a+3}{2}}\exp\!\left(-\frac{A}{2\sigma^2}\right)d\sigma^2\right]d\mathbf{w}$$
$$= \frac{b^a\, c^{-d/2}}{(2\pi)^{(N+d)/2}\,\Gamma(a)}\,\Gamma\!\left(\frac{M+N+2a+1}{2}\right)\int\left(\frac{A}{2}\right)^{-\frac{M+N+2a+1}{2}}d\mathbf{w}$$

with $A$ as defined above; the remaining integral over $\mathbf{w}$ is evaluated on the next slide.


Example: Computing Model Evidence

• We now use the normalization of the multivariate Student-t distribution. Completing the square in $\mathbf{w}$ with

$$\boldsymbol{\mu} = \left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML}, \qquad g = s^2 + 2b + \mathbf{w}_{ML}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML} - \boldsymbol{\mu}^T\!\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)\boldsymbol{\mu}$$

we can write $A = g\left[1 + (\mathbf{w}-\boldsymbol{\mu})^T\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)(\mathbf{w}-\boldsymbol{\mu})/g\right]$, so that

$$p(\mathbf{y}\mid \mathcal{M}) = \frac{b^a\, c^{-d/2}}{(2\pi)^{(N+d)/2}\,\Gamma(a)}\,\Gamma\!\left(\frac{d+N+2a}{2}\right)\left(\frac{g}{2}\right)^{-\frac{d+N+2a}{2}}\int\left[1 + \frac{1}{N+2a}\,(\mathbf{w}-\boldsymbol{\mu})^T\,\frac{(N+2a)\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)}{g}\,(\mathbf{w}-\boldsymbol{\mu})\right]^{-\frac{(N+2a)+d}{2}}d\mathbf{w}$$

• The integrand is an unnormalized multivariate Student-t with $\nu = N+2a$ degrees of freedom, dimension $d = M+1$, and matrix $\boldsymbol{\Lambda} = (N+2a)\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)/g$; its normalizing factor is $\Gamma\!\left(\tfrac{\nu}{2}\right)(\nu\pi)^{d/2}|\boldsymbol{\Lambda}|^{-1/2}\big/\Gamma\!\left(\tfrac{\nu+d}{2}\right)$.


Example: Computing Model Evidence

• Performing the integration in $\mathbf{w}$ in the last slide:

$$p(\mathbf{y}\mid \mathcal{M}) = \frac{b^a\, c^{-d/2}}{(2\pi)^{(N+d)/2}\,\Gamma(a)}\,\Gamma\!\left(\frac{d+N+2a}{2}\right)\left(\frac{g}{2}\right)^{-\frac{d+N+2a}{2}}\frac{\Gamma\!\left(\frac{N+2a}{2}\right)\left((N+2a)\pi\right)^{d/2}}{\Gamma\!\left(\frac{d+N+2a}{2}\right)}\left(\frac{g}{N+2a}\right)^{d/2}\left|\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right|^{-1/2}$$
$$= c^{-d/2}\,\frac{1}{(2\pi)^{N/2}}\,\frac{b^a}{\Gamma(a)}\,\Gamma\!\left(\frac{N}{2}+a\right)\, E^{-\left(\frac{N}{2}+a\right)}\left|\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right|^{-1/2}, \qquad E \equiv \frac{s^2}{2} + b + \frac{1}{2}\mathbf{w}_{ML}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML} - \frac{1}{2}\boldsymbol{\mu}^T\!\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)\boldsymbol{\mu}$$

• Using some of the earlier definitions,

$$\mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{y}, \qquad s^2 \triangleq (\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}_{ML})^T(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}_{ML}), \qquad \boldsymbol{\mu} = \left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML}$$

the term $E$ simplifies as

$$E = \frac{1}{2}(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}_{ML})^T(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w}_{ML}) + b + \frac{1}{2}\mathbf{w}_{ML}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML} - \frac{1}{2}\mathbf{w}_{ML}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML}$$
$$= b + \frac{1}{2}\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\boldsymbol{\Phi}\mathbf{w}_{ML} + \mathbf{w}_{ML}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML} - \frac{1}{2}\mathbf{y}^T\boldsymbol{\Phi}\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{y} = b + \frac{1}{2}\mathbf{y}^T\mathbf{y} - \frac{1}{2}\mathbf{y}^T\boldsymbol{\Phi}\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{y}$$

(using $\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML} = \boldsymbol{\Phi}^T\mathbf{y}$, so that $\mathbf{y}^T\boldsymbol{\Phi}\mathbf{w}_{ML} = \mathbf{w}_{ML}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML}$).


Example: Model Evidence

• The final evidence in analytical form is given as:

$$p(\mathbf{y}\mid \mathcal{M}) = c^{-d/2}\,\frac{1}{(2\pi)^{N/2}}\,\frac{b^a}{\Gamma(a)}\,\Gamma\!\left(\frac{N}{2}+a\right)\left[b + \frac{1}{2}\mathbf{y}^T\mathbf{y} - \frac{1}{2}\mathbf{y}^T\boldsymbol{\Phi}\left(\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{y}\right]^{-\left(\frac{N}{2}+a\right)}\left|\tfrac{1}{c}\mathbf{I}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right|^{-1/2}$$

• Compare this with what is given in the MatLab implementation.

• The model evidence and samples of different order ($M$) regression models are given below. The specific data of the problem can be found in the MatLab file.

• We are looking for the order of the polynomial that maximizes the evidence. Note that the MatLab implementation utilizes random input/output data.
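A sketch that evaluates this evidence over polynomial orders and picks the maximizer, in the spirit of (but not identical to) the referenced MatLab demo; `log_evidence_poly` and the hyperparameter values a = b = 1, c = 10 are illustrative assumptions.

```python
# Sketch: log p(y|M) from the formula above, for a polynomial model of order M.
import numpy as np
from scipy.special import gammaln

def log_evidence_poly(x, y, M, a=1.0, b=1.0, c=10.0):
    N = len(y)
    Phi = np.vander(x, M + 1, increasing=True)     # rows [1, x, x^2, ..., x^M]
    d = M + 1
    A = np.eye(d) / c + Phi.T @ Phi                # (1/c) I + Phi^T Phi
    _, logdetA = np.linalg.slogdet(A)
    quad = y @ y - y @ Phi @ np.linalg.solve(A, Phi.T @ y)
    return (-0.5 * d * np.log(c) - 0.5 * N * np.log(2 * np.pi)
            + a * np.log(b) - gammaln(a) + gammaln(0.5 * N + a)
            - (0.5 * N + a) * np.log(b + 0.5 * quad) - 0.5 * logdetA)

# Illustrative usage, assuming arrays x, y of training data:
# best_M = max(range(6), key=lambda M: log_evidence_poly(x, y, M))
```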


Example: Bayesian Model Comparison

[Figure: polynomial fits of order M = 0, 1, 2, 3, 4, 5 to the sampled data (left panels) and the model evidence P(D|M) as a function of the model order M (right panel).]

MatLab implementation of Bayesian Model Selection (from Zoubin Ghahramani).


Limitations of Fixed Basis Functions
• Up to now the basis functions φⱼ(x) are fixed before the training data set is observed.

• As a consequence, the number of basis functions grows exponentially with the dimensionality D of the input space.

• There are two properties of real data sets that we can exploit to alleviate the curse of dimensionality:

• The data vectors {xₙ} typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space, as a result of strong correlations between the input variables.

• If we are using localized basis functions, we can arrange that they are scattered in input space only in regions containing data (radial basis functions, support vector and relevance vector machines).


Adaptive Basis Functions
• Neural network models using adaptive basis functions having, e.g., sigmoidal nonlinearities can adapt their parameters so that the regions of input space over which the basis functions vary correspond to the data manifold.

• The target variables may have significant dependence on only a small number of possible directions within the data manifold.

• Neural networks can exploit this property by choosing the directions in input space to which the basis functions respond.