Lec 24: Bayesian Linear Regression
Prof. Nicholas Zabaras
Email: [email protected]
URL: https://www.zabaras.com/
October 2, 2020
Following closely:
Here the response is denoted as 𝒚, the dimensionality of 𝒘 as 𝐷 and the design matrix as 𝑿.
Bayesian Inference when σ² is Unknown
With the conjugate normal-inverse-gamma (NIG) prior $p(\boldsymbol{w},\sigma^2)=NIG(\boldsymbol{w},\sigma^2\mid\boldsymbol{w}_0,\boldsymbol{V}_0,a_0,b_0)$ and the likelihood $\boldsymbol{y}\sim\mathcal{N}(\boldsymbol{X}\boldsymbol{w},\sigma^2\boldsymbol{I}_N)$, the joint posterior is

$$p(\boldsymbol{w},\sigma^2\mid\mathcal{D})\propto\left(\sigma^2\right)^{-\left(a_0+\frac{D+N}{2}+1\right)}\exp\left[-\frac{(\boldsymbol{w}-\boldsymbol{w}_0)^T\boldsymbol{V}_0^{-1}(\boldsymbol{w}-\boldsymbol{w}_0)+(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{w})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{w})+2b_0}{2\sigma^2}\right]$$

This is again of NIG form, $p(\boldsymbol{w},\sigma^2\mid\mathcal{D})=NIG(\boldsymbol{w},\sigma^2\mid\boldsymbol{w}_N,\boldsymbol{V}_N,a_N,b_N)$, with

$$\boldsymbol{V}_N=\left(\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X}\right)^{-1},\qquad \boldsymbol{w}_N=\boldsymbol{V}_N\left(\boldsymbol{V}_0^{-1}\boldsymbol{w}_0+\boldsymbol{X}^T\boldsymbol{y}\right)$$
$$a_N=a_0+\frac{N}{2},\qquad b_N=b_0+\frac{1}{2}\left[\boldsymbol{w}_0^T\boldsymbol{V}_0^{-1}\boldsymbol{w}_0+\boldsymbol{y}^T\boldsymbol{y}-\boldsymbol{w}_N^T\boldsymbol{V}_N^{-1}\boldsymbol{w}_N\right]$$

The posterior marginals are

$$p(\sigma^2\mid\mathcal{D})=IG\left(\sigma^2\mid a_N,b_N\right),\qquad p(\boldsymbol{w}\mid\mathcal{D})=\mathcal{T}_D\left(\boldsymbol{w}\mid\boldsymbol{w}_N,\frac{b_N}{a_N}\boldsymbol{V}_N,2a_N\right)$$
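As an illustration of these conjugate updates, here is a minimal NumPy sketch (the function name and arguments are our own; the lecture's computations are in MatLab, so this is only an assumed translation):

```python
import numpy as np

def nig_posterior(X, y, w0, V0, a0, b0):
    """Posterior NIG(w_N, V_N, a_N, b_N) for y = X w + eps, eps ~ N(0, sigma^2 I)."""
    V0_inv = np.linalg.inv(V0)
    VN_inv = V0_inv + X.T @ X                      # V_N^{-1} = V_0^{-1} + X^T X
    VN = np.linalg.inv(VN_inv)
    wN = VN @ (V0_inv @ w0 + X.T @ y)              # w_N = V_N (V_0^{-1} w_0 + X^T y)
    aN = a0 + len(y) / 2.0                         # a_N = a_0 + N/2
    bN = b0 + 0.5 * (w0 @ V0_inv @ w0 + y @ y - wN @ VN_inv @ wN)
    return wN, VN, aN, bN
```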
Posterior Marginals
Integrating out σ² gives the marginal posterior of the weights:

$$p(\boldsymbol{w}\mid\mathcal{D})=\int_0^\infty p(\boldsymbol{w},\sigma^2\mid\mathcal{D})\,d\sigma^2\propto\int_0^\infty\left(\sigma^2\right)^{-\left(a_N+\frac{D}{2}+1\right)}\exp\left[-\frac{(\boldsymbol{w}-\boldsymbol{w}_N)^T\boldsymbol{V}_N^{-1}(\boldsymbol{w}-\boldsymbol{w}_N)+2b_N}{2\sigma^2}\right]d\sigma^2$$
$$\propto\left[(\boldsymbol{w}-\boldsymbol{w}_N)^T\boldsymbol{V}_N^{-1}(\boldsymbol{w}-\boldsymbol{w}_N)+2b_N\right]^{-\left(a_N+\frac{D}{2}\right)}\;\Rightarrow\;p(\boldsymbol{w}\mid\mathcal{D})=\mathcal{T}_D\left(\boldsymbol{w}\mid\boldsymbol{w}_N,\frac{b_N}{a_N}\boldsymbol{V}_N,2a_N\right)$$

Student's $\mathcal{T}$ distribution:
$$\mathcal{T}(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma},\nu)=\frac{\Gamma\left(\frac{\nu}{2}+\frac{D}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\frac{|\boldsymbol{\Sigma}|^{-1/2}}{(\nu\pi)^{D/2}}\left[1+\frac{\Delta^2}{\nu}\right]^{-\frac{\nu+D}{2}},\qquad\Delta^2=(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\ \text{(Mahalanobis distance)}$$
$$\mathbb{E}[\boldsymbol{x}]=\boldsymbol{\mu}\ \text{if}\ \nu>1,\qquad\mathrm{cov}[\boldsymbol{x}]=\frac{\nu}{\nu-2}\boldsymbol{\Sigma}\ \text{if}\ \nu>2,\qquad\mathrm{mode}[\boldsymbol{x}]=\boldsymbol{\mu}$$

To compute the integral above, simply set $\lambda=\sigma^{-2}$, $d\sigma^2=-\lambda^{-2}d\lambda$, and use the normalizing factor of the Gamma distribution, $\int_0^\infty\lambda^{a-1}e^{-b\lambda}d\lambda=\Gamma(a)\,b^{-a}\propto b^{-a}$.
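As a quick numerical check, this marginal can be evaluated with SciPy's multivariate Student-t, reusing the hypothetical nig_posterior helper sketched above on made-up toy data:

```python
import numpy as np
from scipy.stats import multivariate_t

# Toy data (illustrative only) and a weakly informative NIG prior
rng = np.random.default_rng(0)
N, D = 20, 2
X = np.column_stack([np.ones(N), rng.uniform(-1.0, 1.0, N)])   # design matrix
y = X @ np.array([1.0, 2.0]) + 0.3 * rng.standard_normal(N)

# nig_posterior is the helper defined in the previous sketch
wN, VN, aN, bN = nig_posterior(X, y, w0=np.zeros(D), V0=10.0 * np.eye(D), a0=1.0, b0=1.0)

# Marginal posterior p(w|D) = T_D(w | w_N, (b_N/a_N) V_N, 2 a_N)
post_w = multivariate_t(loc=wN, shape=(bN / aN) * VN, df=2.0 * aN)
print(post_w.logpdf(wN))        # log-density at the posterior mean
```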
Posterior Predictive Distribution
Consider the posterior predictive for 𝑚 new test inputs:
$$p(\tilde{\boldsymbol{y}}\mid\tilde{\boldsymbol{X}},\mathcal{D})\propto\iint\left(2\pi\sigma^2\right)^{-m/2}\exp\left[-\frac{(\tilde{\boldsymbol{y}}-\tilde{\boldsymbol{X}}\boldsymbol{w})^T(\tilde{\boldsymbol{y}}-\tilde{\boldsymbol{X}}\boldsymbol{w})}{2\sigma^2}\right]\left(\sigma^2\right)^{-\left(a_N+\frac{D}{2}+1\right)}\exp\left[-\frac{(\boldsymbol{w}-\boldsymbol{w}_N)^T\boldsymbol{V}_N^{-1}(\boldsymbol{w}-\boldsymbol{w}_N)+2b_N}{2\sigma^2}\right]d\boldsymbol{w}\,d\sigma^2$$

Integrating out $\boldsymbol{w}$ (a Gaussian integral) and then setting $\lambda=\sigma^{-2}$, the remaining integral is of Gamma form,
$$p(\tilde{\boldsymbol{y}}\mid\tilde{\boldsymbol{X}},\mathcal{D})\propto\int\lambda^{m/2+a_N-1}\exp(-\beta\lambda)\,d\lambda\propto\beta^{-\left(m/2+a_N\right)}$$
which leads to a multivariate Student-$\mathcal{T}$ in $\tilde{\boldsymbol{y}}$:
$$p(\tilde{\boldsymbol{y}}\mid\tilde{\boldsymbol{X}},\mathcal{D})=\mathcal{T}_m\left(\tilde{\boldsymbol{y}}\mid\tilde{\boldsymbol{X}}\boldsymbol{w}_N,\frac{b_N}{a_N}\left(\boldsymbol{I}_m+\tilde{\boldsymbol{X}}\boldsymbol{V}_N\tilde{\boldsymbol{X}}^T\right),2a_N\right)$$

The predictive covariance has two terms: $\frac{b_N}{a_N}\boldsymbol{I}_m$ due to the measurement noise, and $\frac{b_N}{a_N}\tilde{\boldsymbol{X}}\boldsymbol{V}_N\tilde{\boldsymbol{X}}^T$ due to the uncertainty in $\boldsymbol{w}$. The second term depends on how close a test input is to the training data.
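A small sketch of this predictive distribution (again reusing the illustrative helper and toy data from the previous snippets; the test inputs are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_t

def posterior_predictive(X_new, wN, VN, aN, bN):
    """Student-t predictive: T_m(X_new w_N, (b_N/a_N)(I_m + X_new V_N X_new^T), 2 a_N)."""
    m = X_new.shape[0]
    mean = X_new @ wN
    shape = (bN / aN) * (np.eye(m) + X_new @ VN @ X_new.T)   # noise + weight-uncertainty terms
    return multivariate_t(loc=mean, shape=shape, df=2.0 * aN)

# Example: predictive at two hypothetical test inputs
X_new = np.array([[1.0, 0.0], [1.0, 5.0]])    # the second point lies far from the training data
pred = posterior_predictive(X_new, wN, VN, aN, bN)
print(pred.rvs(random_state=0))               # a sample from the predictive distribution
```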
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques, Studies in Bayesian Econometrics and Statistics, volume 6. North-Holland.
Minka, T. (2000b). Bayesian linear regression. Technical report, MIT.
We will see below that if we use an uninformative prior, the posterior precision given $N$ measurements is $\boldsymbol{V}_N^{-1}=\boldsymbol{X}^T\boldsymbol{X}$.
Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. of the Am. Stat. Assoc. 90(431), 928-934.
Uninformative Prior
An uninformative prior can be obtained by considering the uninformative limit
of the conjugate g-prior, which corresponds to setting 𝑔 = ∞. This is
equivalent to an improper NIG prior with 𝒘0 = 0, 𝑽0 = ∞𝐈, 𝑎0 = 0 and 𝑏0 =
0, which gives 𝑝(𝒘, 𝜎2) ∝ 𝜎 −(𝐷+2) .
The posterior for $\boldsymbol{w}$ then takes the form

$$p(\boldsymbol{w}\mid\mathcal{D})=\mathcal{T}_D\left(\boldsymbol{w}\mid\boldsymbol{w}_N,\frac{b_N}{a_N}\boldsymbol{V}_N,2a_N\right)=\mathcal{T}_D\left(\boldsymbol{w}\mid\hat{\boldsymbol{w}}_{MLE},\frac{s^2}{N-D}\boldsymbol{C},N-D\right)$$

where

$$\boldsymbol{V}_N=\left(\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\rightarrow\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}=\boldsymbol{C},\qquad\boldsymbol{w}_N=\boldsymbol{V}_N\left(\boldsymbol{V}_0^{-1}\boldsymbol{w}_0+\boldsymbol{X}^T\boldsymbol{y}\right)\rightarrow\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}=\hat{\boldsymbol{w}}_{MLE}$$
$$a_N=a_0+N/2=(N-D)/2,\qquad b_N=b_0+\frac{1}{2}\left[\boldsymbol{w}_0^T\boldsymbol{V}_0^{-1}\boldsymbol{w}_0+\boldsymbol{y}^T\boldsymbol{y}-\boldsymbol{w}_N^T\boldsymbol{V}_N^{-1}\boldsymbol{w}_N\right]=s^2/2$$
$$s^2\triangleq\left(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{w}}_{MLE}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{w}}_{MLE}\right),\qquad\boldsymbol{w}_N=\hat{\boldsymbol{w}}_{MLE}=\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}$$

Note in the calculation of $s^2$:
$$s^2=\left(\boldsymbol{y}-\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}\right)=\boldsymbol{y}^T\boldsymbol{y}-\boldsymbol{y}^T\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}=\boldsymbol{y}^T\boldsymbol{y}-\hat{\boldsymbol{w}}_{MLE}^T\boldsymbol{V}_N^{-1}\hat{\boldsymbol{w}}_{MLE}$$

This is equivalent to the sampling distribution of the MLE, which is given by:
$$\frac{w_j-\hat{w}_j}{s_j}\sim\mathcal{T}_{N-D},\qquad s_j=\sqrt{\frac{s^2\,C_{jj}}{N-D}}$$
But note that the MLE does not even exist when 𝑁 < 𝐷, so standard
frequentist inference theory breaks down in this setting. Bayesian inference
theory still works using proper priors.
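A minimal sketch of this correspondence, assuming N > D; it forms the marginal intervals for each weight from the formulas above (the names and the 95% level are illustrative choices, and X, y are the toy data from the earlier snippet):

```python
import numpy as np
from scipy.stats import t as student_t

def uninformative_posterior_intervals(X, y, level=0.95):
    """Marginal credible intervals for w_j under the improper prior p(w, sigma^2) ~ sigma^{-(D+2)}."""
    N, D = X.shape
    C = np.linalg.inv(X.T @ X)
    w_mle = C @ X.T @ y
    s2 = np.sum((y - X @ w_mle) ** 2)
    s_j = np.sqrt(s2 * np.diag(C) / (N - D))          # marginal scale of each weight
    half = student_t.ppf(0.5 + level / 2.0, df=N - D) * s_j
    return w_mle - half, w_mle + half                 # identical to the frequentist t-intervals

lo, hi = uninformative_posterior_intervals(X, y)
print(np.column_stack([lo, hi]))
```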
Maruyama, Y. and E. George (2008). A g-prior extension for p > n. Technical report, U. Tokyo.
Example: Model Evidence
We have seen that the conjugate prior for a Gaussian with unknown mean and
unknown precision is a normal-gamma distribution.
We apply this to the regression case with likelihood ($\boldsymbol{X}$ denoting the training inputs, $\boldsymbol{\Phi}$ the design matrix with rows $\boldsymbol{\phi}(\boldsymbol{x}_n)^T$, and $M$ the number of basis functions):

$$p(\boldsymbol{t}\mid\boldsymbol{X},\boldsymbol{w},\beta)=\prod_{n=1}^{N}\mathcal{N}\left(t_n\mid\boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x}_n),\beta^{-1}\right)$$

for which the conjugate prior for $\boldsymbol{w}$ and the precision $\beta$ is:

$$p(\boldsymbol{w},\beta)=\mathcal{N}\left(\boldsymbol{w}\mid\boldsymbol{m}_0,\beta^{-1}\boldsymbol{S}_0\right)\mathrm{Gam}\left(\beta\mid a_0,b_0\right)$$

It can be shown that the corresponding posterior takes the form $p(\boldsymbol{w},\beta\mid\boldsymbol{t})=\mathcal{N}\left(\boldsymbol{w}\mid\boldsymbol{m}_N,\beta^{-1}\boldsymbol{S}_N\right)\mathrm{Gam}\left(\beta\mid a_N,b_N\right)$. Indeed, multiplying the likelihood by the prior and completing the square in $\boldsymbol{w}$:

$$p(\boldsymbol{w},\beta\mid\boldsymbol{t})\propto\beta^{N/2}\exp\left[-\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n-\boldsymbol{w}^T\boldsymbol{\phi}_n\right)^2\right]\beta^{M/2}\exp\left[-\frac{\beta}{2}(\boldsymbol{w}-\boldsymbol{m}_0)^T\boldsymbol{S}_0^{-1}(\boldsymbol{w}-\boldsymbol{m}_0)\right]\beta^{a_0-1}e^{-b_0\beta}$$
$$\propto\beta^{M/2}\exp\left[-\frac{\beta}{2}(\boldsymbol{w}-\boldsymbol{m}_N)^T\boldsymbol{S}_N^{-1}(\boldsymbol{w}-\boldsymbol{m}_N)\right]\exp\left[-\frac{\beta}{2}\left(\boldsymbol{m}_0^T\boldsymbol{S}_0^{-1}\boldsymbol{m}_0-\boldsymbol{m}_N^T\boldsymbol{S}_N^{-1}\boldsymbol{m}_N+\sum_{n=1}^{N}t_n^2\right)\right]\beta^{a_0+N/2-1}e^{-b_0\beta}$$
$$\propto\beta^{M/2}\exp\left[-\frac{\beta}{2}(\boldsymbol{w}-\boldsymbol{m}_N)^T\boldsymbol{S}_N^{-1}(\boldsymbol{w}-\boldsymbol{m}_N)\right]\beta^{a_N-1}e^{-b_N\beta}$$

where:

$$\boldsymbol{m}_N=\boldsymbol{S}_N\left(\boldsymbol{S}_0^{-1}\boldsymbol{m}_0+\boldsymbol{\Phi}^T\boldsymbol{t}\right),\qquad\boldsymbol{S}_N^{-1}=\boldsymbol{S}_0^{-1}+\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$
$$a_N=a_0+\frac{N}{2},\qquad b_N=b_0+\frac{1}{2}\left(\boldsymbol{m}_0^T\boldsymbol{S}_0^{-1}\boldsymbol{m}_0-\boldsymbol{m}_N^T\boldsymbol{S}_N^{-1}\boldsymbol{m}_N+\sum_{n=1}^{N}t_n^2\right)$$
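A minimal NumPy sketch of these updates for a polynomial basis (the degree, prior settings, and data below are assumptions for illustration, not those of the lecture's example):

```python
import numpy as np

def normal_gamma_posterior(Phi, t, m0, S0, a0, b0):
    """Posterior N(w|m_N, beta^{-1} S_N) Gam(beta|a_N, b_N) for the basis-function model."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + Phi.T @ Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ t)
    aN = a0 + len(t) / 2.0
    bN = b0 + 0.5 * (m0 @ S0_inv @ m0 - mN @ SN_inv @ mN + t @ t)
    return mN, SN, aN, bN

# Polynomial design matrix Phi with columns x^0, ..., x^M for a chosen order M
x = np.linspace(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(1).standard_normal(x.size)
M = 3
Phi = np.vander(x, M + 1, increasing=True)
mN, SN, aN, bN = normal_gamma_posterior(Phi, t, np.zeros(M + 1), np.eye(M + 1), a0=1.0, b0=1.0)
```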
Example: Model Evidence
The model evidence for our example is given as:
$$p(\boldsymbol{t})=\iint p(\boldsymbol{t}\mid\boldsymbol{w},\beta)\,p(\boldsymbol{w},\beta)\,d\boldsymbol{w}\,d\beta$$
$$=\iint\left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left[-\frac{\beta}{2}(\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{w})^T(\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{w})\right]\left(\frac{\beta}{2\pi}\right)^{M/2}|\boldsymbol{S}_0|^{-1/2}\exp\left[-\frac{\beta}{2}(\boldsymbol{w}-\boldsymbol{m}_0)^T\boldsymbol{S}_0^{-1}(\boldsymbol{w}-\boldsymbol{m}_0)\right]\frac{b_0^{a_0}}{\Gamma(a_0)}\beta^{a_0-1}e^{-b_0\beta}\,d\boldsymbol{w}\,d\beta$$

Completing the square in $\boldsymbol{w}$ as before,
$$=\frac{b_0^{a_0}}{\Gamma(a_0)}\frac{|\boldsymbol{S}_0|^{-1/2}}{(2\pi)^{(N+M)/2}}\iint\beta^{(N+M)/2+a_0-1}\exp\left[-\frac{\beta}{2}(\boldsymbol{w}-\boldsymbol{m}_N)^T\boldsymbol{S}_N^{-1}(\boldsymbol{w}-\boldsymbol{m}_N)\right]\exp\left[-\frac{\beta}{2}\left(\boldsymbol{t}^T\boldsymbol{t}+\boldsymbol{m}_0^T\boldsymbol{S}_0^{-1}\boldsymbol{m}_0-\boldsymbol{m}_N^T\boldsymbol{S}_N^{-1}\boldsymbol{m}_N\right)\right]e^{-b_0\beta}\,d\boldsymbol{w}\,d\beta$$

Performing the integration in $\boldsymbol{w}$ and using the normalization factor for the Gamma distribution:
$$p(\boldsymbol{t})=\frac{b_0^{a_0}}{\Gamma(a_0)}\frac{|\boldsymbol{S}_N|^{1/2}}{|\boldsymbol{S}_0|^{1/2}}\frac{1}{(2\pi)^{N/2}}\int\beta^{a_N-1}\exp\left(-b_N\beta\right)d\beta=\frac{1}{(2\pi)^{N/2}}\frac{|\boldsymbol{S}_N|^{1/2}}{|\boldsymbol{S}_0|^{1/2}}\frac{b_0^{a_0}}{b_N^{a_N}}\frac{\Gamma(a_N)}{\Gamma(a_0)}$$
In the example below we take $\boldsymbol{S}_0=\frac{1}{c}\boldsymbol{I}_{M\times M}$ ($c\rightarrow 0$) and $\boldsymbol{m}_0=\boldsymbol{0}_{M\times 1}$.
* We only slightly change our notation here again to conform with the MatLab program implementing this example (from Zoubin Ghahramani)
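The closed form above can be evaluated directly in log space; here is a sketch reusing the hypothetical normal_gamma_posterior helper, Phi, t, and M from the previous snippet:

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(Phi, t, m0, S0, a0, b0):
    """log p(t) from the closed form: Gaussian and Gamma normalizers of prior and posterior."""
    N = len(t)
    mN, SN, aN, bN = normal_gamma_posterior(Phi, t, m0, S0, a0, b0)
    _, logdet_SN = np.linalg.slogdet(SN)
    _, logdet_S0 = np.linalg.slogdet(S0)
    return (-0.5 * N * np.log(2 * np.pi)
            + 0.5 * (logdet_SN - logdet_S0)
            + a0 * np.log(b0) - aN * np.log(bN)
            + gammaln(aN) - gammaln(a0))

print(log_evidence(Phi, t, np.zeros(M + 1), np.eye(M + 1), a0=1.0, b0=1.0))
```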
Example: Likelihood Calculation
We will derive the model evidence analytically. At first the likelihood can be written as:

$$\mathcal{L}(\boldsymbol{w},\sigma^2\mid\boldsymbol{y},\boldsymbol{\Phi})=\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}(\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{w})^T(\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{w})\right]$$

With simple algebra, we can rewrite the likelihood introducing the MLE estimate of the parameters as follows:

$$\mathcal{L}(\boldsymbol{w},\sigma^2\mid\boldsymbol{y},\boldsymbol{\Phi})=\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}(\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{w}_{ML})^T(\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{w}_{ML})-\frac{1}{2\sigma^2}(\boldsymbol{w}-\boldsymbol{w}_{ML})^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}(\boldsymbol{w}-\boldsymbol{w}_{ML})\right]$$
$$=\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{s^2}{2\sigma^2}-\frac{1}{2\sigma^2}(\boldsymbol{w}-\boldsymbol{w}_{ML})^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}(\boldsymbol{w}-\boldsymbol{w}_{ML})\right]$$

where

$$\boldsymbol{w}_{ML}=\left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{y},\qquad s^2\triangleq(\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{w}_{ML})^T(\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{w}_{ML})$$

The prior on the weights is $p(\boldsymbol{w}\mid\sigma^2)=\mathcal{N}\left(\boldsymbol{w}\mid\boldsymbol{0},\frac{\sigma^2}{c}\boldsymbol{I}_d\right)$ together with $p(\sigma^2)=IG(\sigma^2\mid a,b)$, where $d$ denotes the number of basis functions (the $M$ of the previous slides). The evidence is now computed as (use first the normalization of the Inverse-Gamma distribution):

$$p(\boldsymbol{y}\mid\mathcal{M})=\iint\mathcal{L}(\boldsymbol{w},\sigma^2\mid\boldsymbol{y},\boldsymbol{\Phi})\,p(\boldsymbol{w}\mid\sigma^2)\,p(\sigma^2)\,d\boldsymbol{w}\,d\sigma^2=\frac{b^a}{\Gamma(a)}\frac{c^{d/2}}{(2\pi)^{(N+d)/2}}\iint\left(\sigma^2\right)^{-\left(\frac{N+d}{2}+a+1\right)}\exp\left[-\frac{A(\boldsymbol{w})}{2\sigma^2}\right]d\boldsymbol{w}\,d\sigma^2$$

where

$$A(\boldsymbol{w})=s^2+2b+\boldsymbol{w}^T\left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+c\boldsymbol{I}\right)\boldsymbol{w}-2\boldsymbol{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{w}_{ML}+\boldsymbol{w}_{ML}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{w}_{ML}$$

The $\sigma^2$ integral is of Inverse-Gamma form,

$$\int_0^\infty\left(\sigma^2\right)^{-\left(\frac{N+d}{2}+a+1\right)}\exp\left[-\frac{A(\boldsymbol{w})}{2\sigma^2}\right]d\sigma^2=\Gamma\left(\frac{N+d}{2}+a\right)\left[\frac{A(\boldsymbol{w})}{2}\right]^{-\left(\frac{N+d}{2}+a\right)}$$

so that

$$p(\boldsymbol{y}\mid\mathcal{M})=\frac{b^a}{\Gamma(a)}\frac{c^{d/2}}{(2\pi)^{(N+d)/2}}\,\Gamma\left(\frac{N+d}{2}+a\right)\int\left[\frac{A(\boldsymbol{w})}{2}\right]^{-\left(\frac{N+d}{2}+a\right)}d\boldsymbol{w}$$

Completing the square in $\boldsymbol{w}$,

$$A(\boldsymbol{w})=(\boldsymbol{w}-\bar{\boldsymbol{w}})^T\left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+c\boldsymbol{I}\right)(\boldsymbol{w}-\bar{\boldsymbol{w}})+2E,\qquad\bar{\boldsymbol{w}}=\left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+c\boldsymbol{I}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{y}$$
$$E=b+\frac{1}{2}(\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{w}_{ML})^T(\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{w}_{ML})+\frac{1}{2}\boldsymbol{w}_{ML}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{w}_{ML}-\frac{1}{2}\boldsymbol{y}^T\boldsymbol{\Phi}\left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+c\boldsymbol{I}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{y}=b+\frac{1}{2}\boldsymbol{y}^T\boldsymbol{y}-\frac{1}{2}\boldsymbol{y}^T\boldsymbol{\Phi}\left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+c\boldsymbol{I}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{y}$$

and the remaining $\boldsymbol{w}$ integral is of multivariate Student-$\mathcal{T}$ form:

$$\int\left[(\boldsymbol{w}-\bar{\boldsymbol{w}})^T\left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+c\boldsymbol{I}\right)(\boldsymbol{w}-\bar{\boldsymbol{w}})+2E\right]^{-\left(\frac{N+d}{2}+a\right)}d\boldsymbol{w}=\pi^{d/2}\,\frac{\Gamma\left(\frac{N}{2}+a\right)}{\Gamma\left(\frac{N+d}{2}+a\right)}\left|\boldsymbol{\Phi}^T\boldsymbol{\Phi}+c\boldsymbol{I}\right|^{-1/2}(2E)^{-\left(\frac{N}{2}+a\right)}$$

Putting everything together:

$$p(\boldsymbol{y}\mid\mathcal{M})=\frac{b^a\,\Gamma\left(\frac{N}{2}+a\right)}{\Gamma(a)\,(2\pi)^{N/2}}\,c^{d/2}\left|\boldsymbol{\Phi}^T\boldsymbol{\Phi}+c\boldsymbol{I}\right|^{-1/2}E^{-\left(\frac{N}{2}+a\right)}=\frac{b^a\,\Gamma\left(\frac{N}{2}+a\right)}{\Gamma(a)\,(2\pi)^{N/2}}\left|\boldsymbol{I}_d+\frac{1}{c}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right|^{-1/2}E^{-\left(\frac{N}{2}+a\right)}$$

with $E=b+\frac{1}{2}\boldsymbol{y}^T\boldsymbol{y}-\frac{1}{2}\boldsymbol{y}^T\boldsymbol{\Phi}\left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+c\boldsymbol{I}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{y}$ as above.
The model evidence and samples of different order (𝑀) regression models are given
below. The specific data of the problem can be found in the MatLab file.
We are looking for the order of the polynomial that maximizes the evidence. Note
that the MatLab implementation utilized random input/output data.
[Figure: samples from the fitted regression models of order M = 3, 4, 5, together with the model evidence P(D|M) plotted against the polynomial order M = 0, ..., 5.]
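A sketch of the corresponding model-comparison loop, using the closed-form evidence derived above with the N(0, (σ²/c)I) prior on the weights; the data and the c, a, b values are illustrative assumptions rather than those in the MatLab file:

```python
import numpy as np
from scipy.special import gammaln

def log_evidence_poly(x, y, order, c=1.0, a=1.0, b=1.0):
    """log p(y|M) for a polynomial model of the given order under the conjugate prior."""
    Phi = np.vander(x, order + 1, increasing=True)          # columns x^0, ..., x^order
    N, d = Phi.shape
    G = Phi.T @ Phi + c * np.eye(d)
    E = b + 0.5 * y @ y - 0.5 * y @ Phi @ np.linalg.solve(G, Phi.T @ y)
    _, logdet_G = np.linalg.slogdet(G)
    return (a * np.log(b) + gammaln(N / 2.0 + a) - gammaln(a)
            - 0.5 * N * np.log(2 * np.pi)
            + 0.5 * d * np.log(c) - 0.5 * logdet_G
            - (N / 2.0 + a) * np.log(E))

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 0.5 * x - 0.1 * x**2 + rng.standard_normal(x.size)
log_ev = [log_evidence_poly(x, y, m) for m in range(6)]
print(int(np.argmax(log_ev)))      # polynomial order that maximizes the evidence
```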
The target variables may have significant dependence on only a small number of
possible directions within the data manifold.
Neural networks can exploit this property by choosing the directions in input space to
which the basis functions respond.