Bishop CH 3 Notes

# Maximum likelihood and least squares

The target is modelled as a deterministic function plus additive noise:

$t = y(\mathbf{x}, \mathbf{w}) + \epsilon$

Assuming the noise follows a normal distribution,

$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$

For N datapoints the log likelihood takes the form

$\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \frac{\beta}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\}^2$

where the last sum is the sum-of-squares loss function.

Setting the gradient to zero to find $\mathbf{w}_{\mathrm{ML}}$:

$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\}\,\boldsymbol{\phi}(\mathbf{x}_n)^{\mathsf T} = 0$

where $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x}))^{\mathsf T}$. Solving gives the maximum-likelihood parameters

$\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}$

with the $N \times M$ design matrix

$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$

If we solve explicitly for the bias $w_0$,

$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j$

with $\bar{t} = \frac{1}{N}\sum_{n=1}^{N} t_n$ and $\bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N} \phi_j(\mathbf{x}_n)$; the bias compensates for the difference between the average of the targets and the weighted sum of the averaged basis-function (regressor) outputs.
Solving for $\beta$ we get

$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N}\sum_{n=1}^{N}\{t_n - \mathbf{w}_{\mathrm{ML}}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\}^2$

The extension to multiple target variables follows the same approach, with $\mathbf{t}$ replaced by the matrix $\mathbf{T}$:

$\mathbf{W}_{\mathrm{ML}} = (\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathsf T}\mathbf{T}$

Adding regularization:

After adding a regularization term with coefficient $\lambda$, the loss function becomes

$\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^{\mathsf T}\mathbf{w}$

and its minimizer is

$\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}$

# Bias-Variance Decomposition
The expected squared loss is

$\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2 p(\mathbf{x})\, d\mathbf{x} + \iint \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$

The first term could be reduced to zero if we had infinite data (the second term is the intrinsic noise and cannot be removed).

For a finite dataset $\mathcal{D}$, the first term is written in terms of $\{y(\mathbf{x}; \mathcal{D}) - h(\mathbf{x})\}^2$.

NOTE: $y(\mathbf{x}; \mathcal{D})$ is the function we come up with using dataset $\mathcal{D}$.

Its expected value over datasets is

$\mathbb{E}_{\mathcal{D}}[\{y(\mathbf{x}; \mathcal{D}) - h(\mathbf{x})\}^2] = \{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2 + \mathrm{var}_{\mathcal{D}}(y(\mathbf{x}; \mathcal{D}))$

i.e. (bias)$^2$ + variance.

We can divide the dataset into folds and observe varying bias and variance by changing the regularization (for a fixed model):

high $\lambda$ -> low variance but high bias
low $\lambda$ -> low bias but high variance
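A small Monte-Carlo sketch of this trade-off (my own illustration; the synthetic sin data and polynomial basis are assumptions, not from the notes): fit many datasets at each $\lambda$ and estimate the (bias)$^2$ and variance terms empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)                     # "true" regression function h(x)

def fit_predict(lam, n=25):
    """Fit one regularized polynomial regressor on a fresh dataset, predict on x_test."""
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    Phi = np.vander(x, 10, increasing=True)        # polynomial design matrix
    Phi_test = np.vander(x_test, 10, increasing=True)
    w = np.linalg.solve(lam * np.eye(10) + Phi.T @ Phi, Phi.T @ t)
    return Phi_test @ w

for lam in [1e-6, 1e-2, 10.0]:
    preds = np.stack([fit_predict(lam) for _ in range(200)])   # predictions over many datasets D
    bias2 = np.mean((preds.mean(axis=0) - h) ** 2)             # (E_D[y] - h)^2 averaged over x
    var = np.mean(preds.var(axis=0))                           # var_D(y) averaged over x
    print(f"lambda={lam:g}  bias^2={bias2:.3f}  variance={var:.3f}")
```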
# Bayesian Linear Regressor
Let's define the prior distribution for $\mathbf{w}$:

$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$

where $\mathbf{m}_0$ is the mean vector and $\mathbf{S}_0$ is the covariance matrix.

The posterior can be found using the likelihood

$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}(t_n \mid \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$

Now $p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$, and since both factors are Gaussian the posterior is also a normal distribution:

$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$

$\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^{\mathsf T}\mathbf{t})$ and $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}$
For a Gaussian, the mode equals the mean; thus $\mathbf{m}_N = \mathbf{w}_{\mathrm{MAP}}$.
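A minimal sketch of this posterior update, assuming the common zero-mean isotropic prior $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$ (the general $\mathbf{m}_0$, $\mathbf{S}_0$ case follows the same formulas); the function name is my own:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the prior N(w | 0, alpha^-1 I)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # S_N^-1 = S_0^-1 + beta Phi^T Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t                       # m_N = S_N (S_0^-1 m_0 + beta Phi^T t), m_0 = 0
    return m_N, S_N
```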

If we take the variance of the prior over the weights to be infinite, i.e. $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$ with $\alpha \to 0$ (all weights equally likely a priori), then $\mathbf{w}_{\mathrm{MAP}} = \mathbf{w}_{\mathrm{ML}}$.

For the prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$, the log posterior is

$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\}^2 - \frac{\alpha}{2}\mathbf{w}^{\mathsf T}\mathbf{w} + \mathrm{const}$

The regularized loss function thus corresponds to the log posterior over the weights, with $\lambda = \alpha/\beta$.

Sequential Learning:

The posterior over the weights is computed for each datapoint and used as the prior for the next update.
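A sketch of one sequential update step, assuming the single-observation special case of the $\mathbf{m}_N$, $\mathbf{S}_N$ formulas above (function and variable names are my own):

```python
import numpy as np

def sequential_update(m_prev, S_prev, phi_n, t_n, beta):
    """Use the previous posterior N(m_prev, S_prev) as the prior for one new point."""
    S_prev_inv = np.linalg.inv(S_prev)
    S_new_inv = S_prev_inv + beta * np.outer(phi_n, phi_n)       # S_N^-1 = S_0^-1 + beta phi phi^T
    S_new = np.linalg.inv(S_new_inv)
    m_new = S_new @ (S_prev_inv @ m_prev + beta * t_n * phi_n)   # m_N = S_N (S_0^-1 m_0 + beta phi t)
    return m_new, S_new
```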
The predictive distribution for a new target $t$, given the old target vector $\mathbf{t}$, the prior precision $\alpha$ of the weights, and the noise precision $\beta$, is obtained by integrating over the weights: the probability of the target given particular weights and the noise variance, weighted by the posterior over the weights.

$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w}$

$p(t \mid \mathbf{t}, \alpha, \beta)$ takes a normal form:

$p(t \mid \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$

$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathsf T}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})$

with $\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^{\mathsf T}\mathbf{t})$ and $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}$ as before.
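A minimal sketch of the predictive mean and variance for a single input, reusing the `posterior` helper sketched above (all names are assumptions):

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive N(t | m_N^T phi(x), sigma_N^2(x)) for one feature vector phi_x."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x   # sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x)
    return mean, var
```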

Equivalent Kernel

The mean of the predictive distribution can be written as

$y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}) = \beta\boldsymbol{\phi}(\mathbf{x})^{\mathsf T}\mathbf{S}_N\boldsymbol{\Phi}^{\mathsf T}\mathbf{t} = \sum_{n=1}^{N}\beta\boldsymbol{\phi}(\mathbf{x})^{\mathsf T}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_n)\, t_n$

Here the kernel is $k(\mathbf{x}, \mathbf{x}') = \beta\boldsymbol{\phi}(\mathbf{x})^{\mathsf T}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}')$. Thus the predictive mean becomes

$y(\mathbf{x}, \mathbf{m}_N) = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\, t_n$

NOTE: the equivalent kernel depends on the input data (through $\mathbf{S}_N$). The predictive mean at $\mathbf{x}$ is a weighted sum of the targets, where more weight is given to targets whose inputs are closer to $\mathbf{x}$.

$\mathrm{cov}[y(\mathbf{x}), y(\mathbf{x}')] = \beta^{-1}\, k(\mathbf{x}, \mathbf{x}')$

so the kernel also quantifies the closeness (correlation) among multiple predictions.
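A short sketch computing the equivalent-kernel weights of a query input against all training inputs (names are mine; `Phi_train` is the design matrix and `phi_x` the query's feature vector):

```python
import numpy as np

def equivalent_kernel(phi_x, Phi_train, S_N, beta):
    """Return k(x, x_n) = beta * phi(x)^T S_N phi(x_n) for every training input x_n."""
    return beta * Phi_train @ (S_N @ phi_x)

# The predictive mean is then the kernel-weighted sum of training targets:
#   y(x, m_N) = equivalent_kernel(phi_x, Phi_train, S_N, beta) @ t
```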
# Bayesian Model Comparison

Let's consider a set of models $\{\mathcal{M}_i\}$ with prior probabilities $p(\mathcal{M}_i)$. We want to compute the posterior distribution

$p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{M}_i)\, p(\mathcal{D} \mid \mathcal{M}_i)$

where $p(\mathcal{D} \mid \mathcal{M}_i)$ is the model evidence, expressing how well the current data is explained by the model. We basically update the prior over models using our data to help make new predictions.

The ratio of model evidences, $p(\mathcal{D} \mid \mathcal{M}_i)/p(\mathcal{D} \mid \mathcal{M}_j)$, called the Bayes factor, compares how well the data is explained by one model as opposed to the other.


The predictive distribution averaged over all models (the mixture distribution) is given by

$p(t \mid \mathbf{x}, \mathcal{D}) = \sum_{i} p(t \mid \mathbf{x}, \mathcal{M}_i, \mathcal{D})\, p(\mathcal{M}_i \mid \mathcal{D})$

Due to computational limitations, however, usually just the most probable model is chosen (model selection).

Model evidence:

$p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w}$

This is the normalization constant for the posterior $p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}_i)$, which is proportional to $p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})$.

Approximation -> Let the posterior be sharply peaked at $w_{\mathrm{MAP}}$ with width $\Delta w_{\mathrm{post}}$, and let the prior be flat with width $\Delta w_{\mathrm{prior}}$, so that $p(w) = 1/\Delta w_{\mathrm{prior}}$.

Then the integral becomes

$p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw \simeq p(\mathcal{D} \mid w_{\mathrm{MAP}})\,\frac{\Delta w_{\mathrm{post}}}{\Delta w_{\mathrm{prior}}}$

and the log evidence is thus

$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} \mid w_{\mathrm{MAP}}) + \ln\!\left(\frac{\Delta w_{\mathrm{post}}}{\Delta w_{\mathrm{prior}}}\right)$
If the weight vector has $M$ dimensions, this becomes

$\ln p(\mathcal{D} \mid \mathcal{M}_i) \simeq \ln p(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}, \mathcal{M}_i) + M \ln\!\left(\frac{\Delta w_{\mathrm{post}}}{\Delta w_{\mathrm{prior}}}\right)$

Since $\Delta w_{\mathrm{post}} < \Delta w_{\mathrm{prior}}$, this term is negative, and increasing $M$ makes the penalty larger, lowering the evidence.

NOTE: As we increase $M$, the first term increases (the data is fitted better) but the second term decreases. Thus the model evidence is maximized at an intermediate model complexity $M$.
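A toy numerical illustration of this trade-off (all numbers below are made up and purely hypothetical, just to visualize the two opposing terms): the fit term improves with $M$ while the penalty $M\ln(\Delta w_{\mathrm{post}}/\Delta w_{\mathrm{prior}})$ grows more negative, so the approximate log evidence peaks at an intermediate $M$.

```python
import numpy as np

# Hypothetical numbers: fit improves with complexity, but each extra parameter
# adds a fixed negative Occam penalty since dw_post < dw_prior.
log_width_ratio = np.log(0.1)          # assumed ln(dw_post / dw_prior) per weight
for M in [1, 3, 5, 8, 12]:
    log_fit = -50.0 / M                # made-up: log-likelihood of the best fit
    log_evidence = log_fit + M * log_width_ratio
    print(f"M={M:2d}  fit={log_fit:7.2f}  penalty={M * log_width_ratio:7.2f}  log evidence ~= {log_evidence:7.2f}")
```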

# Evidence Approximation
The true predictive distribution, after introducing $\alpha$ and $\beta$, is

$p(t \mid \mathbf{t}) = \iiint p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta$

Let's assume $p(\alpha, \beta \mid \mathbf{t})$ is sharply peaked at $\hat{\alpha}$ and $\hat{\beta}$. Then

$p(t \mid \mathbf{t}) \simeq p(t \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t \mid \mathbf{w}, \hat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, d\mathbf{w}$

Since $p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta)$, where $p(\mathbf{t} \mid \alpha, \beta)$ is the model evidence, finding $\hat{\alpha}$ and $\hat{\beta}$ just requires maximizing the model evidence, like earlier.
The log evidence comes out as

$\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi)$

where $\mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}$ and $E(\mathbf{m}_N) = \frac{\beta}{2}\lVert\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N\rVert^2 + \frac{\alpha}{2}\mathbf{m}_N^{\mathsf T}\mathbf{m}_N$. We can maximize this iteratively for the optimal $\alpha$ and $\beta$ (and evaluate it across $M$ to compare model complexities).

Maximizing w.r.t. $\alpha$ and $\beta$ gives

$\alpha = \frac{\gamma}{\mathbf{m}_N^{\mathsf T}\mathbf{m}_N}$ and $\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^{N}\{t_n - \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\}^2$

where $\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$ and the $\lambda_i$ are the eigenvalues of $\beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}$.

$\gamma$ depends on both $\alpha$ and $\beta$: we first choose $\alpha$ and $\beta$, compute $\gamma$ and the re-estimates, and iterate accordingly.
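A minimal sketch of that iteration (function name mine; assumes a fixed design matrix `Phi` and target vector `t`):

```python
import numpy as np

def evidence_approximation(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    """Iterative re-estimation of alpha and beta via the gamma-based update rules above."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)          # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi  # A = S_N^-1
        m_N = beta * np.linalg.solve(A, Phi.T @ t)  # posterior mean
        lam = beta * eig0                           # eigenvalues of beta Phi^T Phi
        gamma = np.sum(lam / (alpha + lam))         # effective number of parameters
        alpha = gamma / (m_N @ m_N)                 # alpha = gamma / m_N^T m_N
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)   # from 1/beta = sum(...)/(N - gamma)
    return alpha, beta, m_N
```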
