
Maximum Likelihood Estimation (MLE)

Regularizations

Faculty of Computer Science


University of Information Technology (UIT)
Vietnam National University - Ho Chi Minh City (VNU-HCM)

Math for Computer Science, Fall 2022



References

The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. https://probml.github.io/pml-book/book1.html
Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization





Introduction

The process of estimating θ from D is called model fitting, or training, and is at the heart of machine learning.
There are many methods for estimating θ, and they typically involve an optimization problem of the form

θ̂ = argmin_θ L(θ)

where L(θ) is some kind of loss function or objective function.


The process of quantifying uncertainty about an unknown quantity
estimated from a finite sample of data is called inference.
In deep learning, the term “inference” refers to “prediction”, namely
computing
p(y|x, θ̂)





Maximum Likelihood Estimation
The most common approach to parameter estimation is to pick the
parameters that assign the highest probability to the training data.
This is called maximum likelihood estimation or MLE.
θ̂_mle = argmax_θ p(D|θ)
We usually assume the training examples are independent and identically distributed (the iid assumption), i.e., sampled independently from the same distribution. The conditional likelihood then becomes

p(D|θ) = p(y1, y2, ..., yN | x1, x2, ..., xN, θ) = ∏_{n=1}^{N} p(yn | xn, θ)
We usually work with the log likelihood, which decomposes into a sum of terms, one per example:

LL(θ) = log p(D|θ) = log ∏_{n=1}^{N} p(yn | xn, θ) = Σ_{n=1}^{N} log p(yn | xn, θ)



Maximum Likelihood Estimation
The MLE is given by

θ̂_mle = argmax_θ Σ_{n=1}^{N} log p(yn | xn, θ)

Because most optimization algorithms are designed to minimize cost functions, we redefine the objective function to be the conditional negative log likelihood or NLL:

NLL(θ) = −log p(D|θ) = −Σ_{n=1}^{N} log p(yn | xn, θ)

Minimizing this will give the MLE:

θ̂_mle = argmin_θ −Σ_{n=1}^{N} log p(yn | xn, θ)

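As a concrete illustration of this recipe, the NLL can be coded as a sum of per-example terms and handed to a generic optimizer. The sketch below assumes a Bernoulli model for a made-up sequence of coin tosses (the closed-form answer for this model is derived in the Examples section); it is illustrative only.

import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical coin-toss data: 1 = heads, 0 = tails.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def nll(theta):
    # NLL(theta) = -sum_n log p(y_n | theta) under a Bernoulli(theta) model.
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Minimizing the NLL gives the MLE.
result = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, y.mean())   # the numerical minimizer matches the empirical fraction of heads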




Bernoulli Random Variables

A Bernoulli r.v. X takes two possible values, usually 0 and 1,


modeling random experiments that have two possible outcomes (e.g.,
“success” and “failure”).
e.g., tossing a coin. The outcome is either Head or Tail.
e.g., taking an exam. The result is either Pass or Fail.
e.g., classifying images. An image is either Cat or Non-cat.



Bernoulli Random Variables

Definition
A random variable X is a Bernoulli random variable with parameter
p ∈ [0, 1], written as X ∼ Bernoulli(p) if its PMF is given by
P_X(x) = p        for x = 1
P_X(x) = 1 − p    for x = 0
Figure: PMF of a Bernoulli(p) random variable: mass p at x = 1 and 1 − p at x = 0.
Example

A bag contains 3 balls, each ball is either red or blue.


The number of blue balls θ can be 0, 1, 2, 3.
Choose 4 balls randomly with replacement.
Random variables X1, X2, X3, X4 are defined as Xi = 1 if the i-th chosen ball is blue, and Xi = 0 if the i-th chosen ball is red.

After doing the experiment, the following values for the Xi's are observed: x1 = 1, x2 = 0, x3 = 1, x4 = 1.
Note that the Xi's are i.i.d. (independent and identically distributed) and Xi ∼ Bernoulli(θ/3). For which value of θ is the probability of the observed sample the largest?



Example
P_{Xi}(x) = θ/3        for x = 1
P_{Xi}(x) = 1 − θ/3    for x = 0

Since the Xi's are independent, the joint PMF of X1, X2, X3, X4 can be written as

P_{X1X2X3X4}(x1, x2, x3, x4) = P_{X1}(x1) P_{X2}(x2) P_{X3}(x3) P_{X4}(x4)

P_{X1X2X3X4}(1, 0, 1, 1) = (θ/3) · (1 − θ/3) · (θ/3) · (θ/3) = (θ/3)³ (1 − θ/3)

θ    P_{X1X2X3X4}(1, 0, 1, 1; θ)
0    0
1    0.0247
2    0.0988
3    0
The observed data is most likely to occur for θ = 2.
We may choose θ̂ = 2 as our estimate of θ.
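The table can be reproduced with a few lines of Python; a minimal sketch using the observed sample (1, 0, 1, 1) from the example:

# Likelihood of the observed sample (1, 0, 1, 1) for each candidate theta.
observed = [1, 0, 1, 1]

for theta in range(4):                 # theta = 0, 1, 2, 3 blue balls
    p = theta / 3                      # P(blue) for a single draw
    likelihood = 1.0
    for x in observed:
        likelihood *= p if x == 1 else (1 - p)
    print(theta, round(likelihood, 4))
# Prints 0 0.0, 1 0.0247, 2 0.0988, 3 0.0 -> theta = 2 maximizes the likelihood.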
MLE for the Bernoulli distribution

Suppose Y is a random variable representing a coin toss.


The event Y = 1 corresponds to heads, Y = 0 corresponds to tails.
The probability distribution for this rv is the Bernoulli. The NLL for the Bernoulli distribution is

NLL(θ) = −log ∏_{n=1}^{N} p(yn|θ) = −log ∏_{n=1}^{N} θ^{I(yn=1)} (1 − θ)^{I(yn=0)}
       = −Σ_{n=1}^{N} [ I(yn = 1) log θ + I(yn = 0) log(1 − θ) ]
       = −[ N1 log θ + N0 log(1 − θ) ]

where N1 = Σ_{n=1}^{N} I(yn = 1) is the number of heads, N0 = Σ_{n=1}^{N} I(yn = 0) is the number of tails, and N = N0 + N1 is the sample size.
MLE for the Bernoulli distribution

NLL(θ) = −[N1 log θ + N0 log(1 − θ)]

The derivative of the NLL is

d/dθ NLL(θ) = −N1/θ + N0/(1 − θ)

The MLE can be found by solving d/dθ NLL(θ) = 0. The MLE is given by

θ̂_mle = N1 / (N0 + N1)

which is the empirical fraction of heads.

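The stationary-point calculation can also be checked symbolically; a small sketch using sympy (assuming N0, N1 > 0):

import sympy as sp

theta, N0, N1 = sp.symbols("theta N0 N1", positive=True)

# NLL for the Bernoulli model and the solution of d/dtheta NLL = 0.
nll = -(N1 * sp.log(theta) + N0 * sp.log(1 - theta))
print(sp.solve(sp.diff(nll, theta), theta))   # [N1/(N0 + N1)]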


MLE for the categorical distribution

Suppose we roll a K-sided die N times.
Let Yn ∈ {1, ..., K} be the n-th outcome, where Yn ∼ Cat(θ).
We want to estimate θ from the dataset D = {yn : n = 1 : N}.
The NLL is given by

NLL(θ) = −Σ_k Nk log θk

where Nk is the number of times the event Y = k is observed.


The compute the MLE, we have to minimize the NLL subject to the
constraint that
K
X
θk = 1
k=1



MLE for the categorical distribution
We use the method of Lagrange multipliers. The Lagrangian is

L(θ, λ) = −Σ_k Nk log θk − λ(1 − Σ_k θk)

Taking derivatives with respect to λ yields the original constraint:

∂L/∂λ = 1 − Σ_k θk = 0

Taking derivatives with respect to θk yields

∂L/∂θk = −Nk/θk + λ = 0  →  Nk = λ θk

We can solve for λ using the sum-to-one constraint:

Σ_k Nk = N = λ Σ_k θk = λ

Thus the MLE is given by θ̂k = Nk/λ = Nk/N, the empirical fraction of times event k occurs.
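A quick numerical illustration of θ̂_k = Nk/N; the die rolls below are synthetic:

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1000)          # hypothetical rolls of a K = 6 sided die

counts = np.bincount(rolls, minlength=7)[1:]   # N_k for k = 1..6
theta_mle = counts / counts.sum()              # empirical fractions N_k / N
print(theta_mle, theta_mle.sum())              # the estimates sum to 1 by construction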
Standard Normal (Gaussian) Random Variable N (0, 1)

Figure: PDF of the standard normal random variable.

∫_{−∞}^{∞} e^{−x²/2} dx = √(2π)

f_X(x) = (1/√(2π)) e^{−x²/2}



General Normal (Gaussian) Random Variable N (µ, σ 2 )

Figure: PDF of the general normal (Gaussian) random variable N(µ, σ²).

f_X(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} = (1/√(2πσ²)) exp(−(1/(2σ²)) (x − µ)²)

E[X] = µ,   Var(X) = σ²
General Normal (Gaussian) Random Variable N (µ, σ 2 )

Figure: PDFs of N(µ, σ²) for σ = 0.5, 1, 2, 3.
Smaller σ, narrower PDF.
Let Y = aX + b, where X ∼ N(µ, σ²).
Then E[Y] = aE[X] + b = aµ + b and Var(Y) = a²σ² (always true, for any X).
But also, Y ∼ N(aµ + b, a²σ²).

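A quick Monte Carlo sanity check of the affine-transformation property; the constants a = 2, b = 3, µ = 1, σ = 0.5 are arbitrary:

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, a, b = 1.0, 0.5, 2.0, 3.0

x = rng.normal(mu, sigma, size=1_000_000)
y = a * x + b

# Empirical mean and variance of Y should be close to a*mu + b and a**2 * sigma**2.
print(y.mean(), a * mu + b)        # ~5.0 vs 5.0
print(y.var(), a**2 * sigma**2)    # ~1.0 vs 1.0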


Example

We have N = 3 data points y1 = 1, y2 = 0.5, y3 = 1.5 which are


independent and Gaussian with unknown mean µ and variance 1:

yi ∼ N (µ, 1)

Likelihood: P(y1, y2, y3 | µ) = P(y1|µ) P(y2|µ) P(y3|µ).


Consider two guesses µ = 1.0 and µ = 2.5. Which has higher
likelihood?
Finding the µ that maximizes the likelihood is equivalent to moving
the Gaussian until the product P (y1 |µ)P (y2 |µ)P (y3 |µ) is maximized.

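The comparison can be carried out directly; a minimal sketch using scipy.stats with the three data points from the slide:

import numpy as np
from scipy.stats import norm

y = np.array([1.0, 0.5, 1.5])

def likelihood(mu):
    # Product of the three Gaussian densities N(y_i | mu, 1).
    return np.prod(norm.pdf(y, loc=mu, scale=1.0))

print(likelihood(1.0), likelihood(2.5))   # mu = 1.0 gives the much higher likelihood
# The maximizer turns out to be the sample mean, y.mean() = 1.0 (derived next).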


MLE for the univariate Gaussian
Let Y ∼ N(µ, σ²) and let D = {yn : n = 1 : N} be an iid sample of size N.

p(y|θ) = N(y | µ, σ²) = (1/√(2πσ²)) exp(−(1/(2σ²)) (y − µ)²)

We can estimate the parameters θ = (µ, σ²) using MLE.
We derive the NLL, which is given by

NLL(µ, σ²) = −Σ_{n=1}^{N} log[ (1/√(2πσ²)) exp(−(1/(2σ²)) (yn − µ)²) ]
           = (1/(2σ²)) Σ_{n=1}^{N} (yn − µ)² + (N/2) log(2πσ²)

The minimum of this function must satisfy the following conditions:

∂/∂µ NLL(µ, σ²) = 0,   ∂/∂σ² NLL(µ, σ²) = 0



MLE for the univariate Gaussian
The solution is given by

µ̂_mle = (1/N) Σ_{n=1}^{N} yn = ȳ

σ̂²_mle = (1/N) Σ_{n=1}^{N} (yn − µ̂_mle)² = (1/N) Σ_{n=1}^{N} (yn² + µ̂²_mle − 2 yn µ̂_mle) = s² − ȳ²

where s² ≜ (1/N) Σ_{n=1}^{N} yn².

The quantities ȳ and s² are called the sufficient statistics of the data, because they are sufficient to compute the MLE.
Sometimes, we might see the estimate for the variance written as

σ̂² = (1/(N − 1)) Σ_{n=1}^{N} (yn − µ̂_mle)²

which is not the MLE, but is a different kind of estimate (the unbiased estimator of the variance).
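A short numerical check of the closed-form estimates on synthetic data:

import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=2.0, scale=1.5, size=10_000)   # hypothetical sample

y_bar = y.mean()                # sufficient statistic: sample mean
s2 = np.mean(y ** 2)            # sufficient statistic: mean of squares

mu_mle = y_bar
sigma2_mle = s2 - y_bar ** 2    # equals the 1/N sample variance

print(mu_mle, sigma2_mle, np.var(y))   # sigma2_mle agrees with np.var(y) (ddof = 0)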


MLE for linear regression

We can make the parameters of the Gaussian be functions of some input variables:

p(y|x; θ) = N(y | fµ(x; θ), fσ(x; θ)²)

where fµ(x; θ) ∈ ℝ predicts the mean, and fσ(x; θ)² ∈ ℝ₊ predicts the variance.
It is common to assume that the variance is fixed, and is independent of the input. This is called homoscedastic regression.
Furthermore, it is common to assume the mean is a linear function of the input. The resulting model is called linear regression:

p(y|x; θ) = N(y | wᵀx + b, σ²)

where θ = (w, b, σ).



MLE for linear regression

Figure: Linear regression using a Gaussian output with mean µ(x) = b + wx and fixed variance σ².

The figure plots the 95% predictive interval [µ(x) − 2σ, µ(x) + 2σ]. This is the uncertainty in the predicted observation y given x, and captures the variability in the observed data (the blue dots).
MLE for linear regression
Linear regression model:

p(y|x; θ) = N(y | wᵀx, σ²)

where θ = (w, σ²), and w = (b, w1, w2, ..., wD) (the bias b is absorbed into w by prepending a constant 1 to x).
Assuming that σ² is fixed, we estimate the weights w. The NLL is

NLL(w) = −Σ_{n=1}^{N} log[ (1/√(2πσ²)) exp(−(1/(2σ²)) (yn − wᵀxn)²) ]

Dropping the irrelevant additive constants gives the simplified objective, known as the residual sum of squares or RSS:

RSS(w) = Σ_{n=1}^{N} (yn − wᵀxn)² = Σ_{n=1}^{N} rn²

where rn is the n-th residual error.


MLE for linear regression
Residual sum of squares or RSS:

RSS(w) = Σ_{n=1}^{N} (yn − wᵀxn)²

Mean squared error or MSE:

MSE(w) = (1/N) Σ_{n=1}^{N} (yn − wᵀxn)²

Root mean squared error or RMSE:

RMSE(w) = √(MSE(w)) = √( (1/N) Σ_{n=1}^{N} (yn − wᵀxn)² )

We can compute the MLE by minimizing the NLL, RSS, MSE, or


RMSE. All give the same results.
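Since these objectives differ only by a positive scaling or a monotone square root, they share the same minimizer; a small sketch computing all three for an arbitrary candidate w on synthetic data (X is assumed to have a leading column of ones for the bias):

import numpy as np

rng = np.random.default_rng(3)
N, D = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])   # synthetic inputs, bias column first
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

def rss(w):  return np.sum((y - X @ w) ** 2)
def mse(w):  return rss(w) / N
def rmse(w): return np.sqrt(mse(w))

w = np.zeros(D)                      # an arbitrary candidate weight vector
print(rss(w), mse(w), rmse(w))       # different values, but the same minimizing w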
MLE for linear regression
The RSS can be written in matrix notation as follows:

RSS(w) = Σ_{n=1}^{N} (yn − wᵀxn)² = ‖Xw − y‖²₂ = (Xw − y)ᵀ(Xw − y)

The gradient is given by

∇_w RSS(w) = 2(XᵀXw − Xᵀy)

Setting the gradient to zero, ∇_w RSS(w) = 0, and solving gives

XᵀXw = Xᵀy

These are known as the normal equations.
The MLE solution ŵ_mle is called the ordinary least squares (OLS) solution:

ŵ_mle = argmin_w RSS(w) = (XᵀX)⁻¹Xᵀy



MLE for linear regression

ŵ_mle = argmin_w RSS(w) = (XᵀX)⁻¹Xᵀy

The quantity X† = (XᵀX)⁻¹Xᵀ is the (left) pseudo-inverse of the (non-square) matrix X.
Is the solution ŵ_mle unique?
The gradient is ∇_w RSS(w) = 2(XᵀXw − Xᵀy), so the Hessian is

H(w) = ∂²/∂w² RSS(w) = 2XᵀX

If X is full rank (i.e., the columns of X are linearly independent), then H is positive definite, since for any v ≠ 0 we have

vᵀ(XᵀX)v = (Xv)ᵀ(Xv) = ‖Xv‖² > 0

In the full rank case, RSS(w) has a unique global minimum.
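A minimal numerical sketch of the OLS solution via the normal equations on synthetic data; in practice np.linalg.lstsq (QR/SVD based) is preferred over forming (XᵀX)⁻¹ explicitly:

import numpy as np

rng = np.random.default_rng(4)
N, D = 100, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
w_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ w_true + rng.normal(scale=0.05, size=N)

# Full rank (linearly independent columns) guarantees a unique minimizer.
print(np.linalg.matrix_rank(X) == D)

# Solve the normal equations X^T X w = X^T y.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(w_mle)
print(np.linalg.lstsq(X, y, rcond=None)[0])   # same solution, numerically more stable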


Overfitting
MLE will try to pick parameters that minimize loss on the training
set, but this may not result in a model that has low loss on future
data. This is called overfitting.
Ex: We want to predict the probability of heads when tossing a coin. We toss it N = 3 times and observe 3 heads. The MLE is

θ̂_mle = N1 / (N0 + N1) = 3 / (0 + 3) = 1
If we use this Ber(y|θ̂mle ) to make predictions, we will predict that all
future coin tosses will also be heads!!!
The model has enough parameters to perfectly fit the observed
training data, so it can perfectly match the empirical distribution.
In most cases, the empirical distribution is not the same as the true
distribution. Putting all the probability mass on the observed set of N
examples will not leave over any probability for novel data in the
future. The model may not generalize.
Example: MLE for Linear Regression

Example 1:
Training data: x1 = (1, 0)ᵀ, y1 = 1;  x2 = (1, ε)ᵀ, y2 = 1.

X = [[1, 0], [1, ε]],   y = (1, 1)ᵀ

ŵ_mle = (XᵀX)⁻¹Xᵀy = ?

Example 2:
Training data: x1 = (1, 0)ᵀ, y1 = 1 + ε;  x2 = (1, ε)ᵀ, y2 = 1.

X = [[1, 0], [1, ε]],   y = (1 + ε, 1)ᵀ

ŵ_mle = (XᵀX)⁻¹Xᵀy = ?



Example: MLE for Linear Regression

Example 1:
Training data: x1 = (1, 0)ᵀ, y1 = 1;  x2 = (1, ε)ᵀ, y2 = 1.

X = [[1, 0], [1, ε]],   y = (1, 1)ᵀ

ŵ_mle = (XᵀX)⁻¹Xᵀy

XᵀX = [[1, 1], [0, ε]] [[1, 0], [1, ε]] = [[2, ε], [ε, ε²]]

(XᵀX)⁻¹ = [[1, −1/ε], [−1/ε, 2/ε²]]

ŵ_mle = [[1, −1/ε], [−1/ε, 2/ε²]] [[1, 1], [0, ε]] (1, 1)ᵀ = (1, 0)ᵀ



Example: MLE for Linear Regression

Example 2:
Training data: x1 = (1, 0)ᵀ, y1 = 1 + ε;  x2 = (1, ε)ᵀ, y2 = 1.

X = [[1, 0], [1, ε]],   y = (1 + ε, 1)ᵀ

ŵ_mle = (XᵀX)⁻¹Xᵀy

XᵀX = [[1, 1], [0, ε]] [[1, 0], [1, ε]] = [[2, ε], [ε, ε²]]

(XᵀX)⁻¹ = [[1, −1/ε], [−1/ε, 2/ε²]]

ŵ_mle = [[1, −1/ε], [−1/ε, 2/ε²]] [[1, 1], [0, ε]] (1 + ε, 1)ᵀ = (1 + ε, −1)ᵀ

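A quick numerical check of the two worked examples with ε = 0.1 (the value used again in the MAP example at the end): perturbing a single target by ε flips the fitted slope from 0 to −1, illustrating how sensitive the unregularized MLE can be.

import numpy as np

eps = 0.1
X = np.array([[1.0, 0.0],
              [1.0, eps]])

y_ex1 = np.array([1.0, 1.0])          # Example 1
y_ex2 = np.array([1.0 + eps, 1.0])    # Example 2: first target perturbed by eps

for y in (y_ex1, y_ex2):
    w = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
    print(w)
# Prints approximately [1. 0.] and [1.1 -1.]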


Regularization
The main solution to overfitting is to use regularization.
We add a penalty term to the NLL (or empirical risk):

L(θ; λ) = [ (1/N) Σ_{n=1}^{N} ℓ(yn, f(xn; θ)) ] + λ C(θ)

where λ ≥ 0 is the regularization parameter, and C(θ) is some form of complexity penalty.
A common complexity penalty is to use C(θ) = −log p(θ), where p(θ) is the prior for θ.
If ℓ is the log loss, the regularized objective becomes

L(θ; λ) = −(1/N) Σ_{n=1}^{N} log p(yn | xn, θ) − λ log p(θ)



Maximum a posteriori estimation (MAP)

L(θ; λ) = −(1/N) Σ_{n=1}^{N} log p(yn | xn, θ) − λ log p(θ)

By setting λ = 1 and rescaling p(θ) appropriately, we can equivalently minimize the following:

L(θ; 1) = −[ Σ_{n=1}^{N} log p(yn | xn, θ) + log p(θ) ] = −[ log p(D|θ) + log p(θ) ]

Minimizing this is equivalent to maximizing the log posterior:

θ̂_map = argmax_θ log p(θ|D) = argmax_θ log [ p(D|θ) p(θ) / p(D) ]
      = argmax_θ [ log p(D|θ) + log p(θ) − const ]

This is MAP estimation, or maximum a posteriori estimation.


MAP estimation for Bernoulli distribution
Coin tossing. If we observe just one head, the MLE is θ̂mle = 1.
To avoid this, we can add a penalty to θ to discourage “extreme”
values, such as θ = 0 or θ = 1.
We can use a beta distribution as our prior p(θ) = Beta(θ|a, b),
where a, b > 1 encourages values of θ near to a/(a + b).
If a = b = 1, we get the uniform distribution.
If a and b are both less than 1, we get a bimodal distribution.
If a and b are both greater than 1, the distribution is unimodal.

mean = a / (a + b)
var = ab / ((a + b)² (a + b + 1))



MAP estimate for the Bernoulli distribution

Using the beta distribution as our prior p(θ) = Beta(θ|a, b), the log likelihood plus log prior becomes

LL(θ) = log p(D|θ) + log p(θ)
      = [ N1 log θ + N0 log(1 − θ) ] + [ (a − 1) log θ + (b − 1) log(1 − θ) ]

The MAP estimate is

θ̂_map = (N1 + a − 1) / (N1 + N0 + a + b − 2)

If we set a = b = 2, which weakly favors a value of θ near 0.5, the estimate becomes

θ̂_map = (N1 + 1) / (N1 + N0 + 2)

This is called add-one smoothing, and it avoids the zero-count problem.

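A tiny numeric illustration of add-one smoothing versus the MLE for the single observed head (N1 = 1, N0 = 0):

def theta_mle(n1, n0):
    return n1 / (n1 + n0)

def theta_map(n1, n0, a=2, b=2):
    # MAP estimate under a Beta(a, b) prior.
    return (n1 + a - 1) / (n1 + n0 + a + b - 2)

print(theta_mle(1, 0))   # 1.0 -> predicts heads forever
print(theta_map(1, 0))   # 0.666... -> pulled toward 0.5 by the prior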


Black swan paradox

The zero-count problem, and overfitting more generally, are analogous to the black swan paradox.
It is used to illustrate the problem of
induction: how to draw general
conclusions about the future from
specific observations from the past.
The solution to the paradox is to admit
that induction is in general impossible.
The best we can do is to make plausible
guesses by combining the empirical data
with prior knowledge.



Weight decay

Polynomial regression with too many degrees of freedom can result in overfitting. One solution is to reduce the degree of the polynomial.
A more general solution is to penalize the magnitude of the weights (regression coefficients).
We use a zero-mean Gaussian prior p(w). The MAP estimate is

ŵ_map = argmin_w NLL(w) + λ‖w‖²₂

where ‖w‖²₂ = Σ_{d=1}^{D} w_d². We penalize the magnitude of the weight vector w, rather than the bias term b.
This is called ℓ2 regularization or weight decay.
The larger the value of λ, the more the parameters are penalized for being large (i.e., deviating from the zero-mean prior), and thus the less flexible the model.



Ridge regression

In the case of linear regression, the weight decay penalization scheme


is called ridge regression.
Consider polynomial regression, where the predictor has the form

f(x; w) = Σ_{d=0}^{D} w_d x^d = wᵀ[1, x, x², ..., x^D]

Suppose we use a high degree polynomial, say D = 14, even though


we have a small dataset with just N = 21 examples.
MLE for the parameters will enable the model to fit the data very
well, but the resulting function is very “wiggly”, thus resulting in
overfitting.
Increasing λ can reduce overfitting.





Ridge regression
MAP estimation with a zero-mean Gaussian prior p(w) = N(w | 0, τ²I):

ŵ_map = argmin_w (1/(2σ²)) (y − Xw)ᵀ(y − Xw) + (1/(2τ²)) wᵀw
      = argmin_w RSS(w) + λ‖w‖²₂

where λ = σ²/τ² is proportional to the strength of the prior, and

‖w‖₂ = √( Σ_{d=1}^{D} |w_d|² ) = √(wᵀw)

is the ℓ2 norm of the vector w.


We do not penalize the offset w0 , since that only affects the global
mean of the output, and does not contribute to the overfitting.
Ridge Regression

The MAP estimate corresponds to minimizing the penalized objective:

J(w) = (y − Xw)ᵀ(y − Xw) + λ‖w‖²₂

where λ = σ²/τ² is the strength of the regularizer.
The derivative is given by

∇_w J(w) = 2(XᵀXw − Xᵀy + λw)

Therefore,

ŵ_map = (XᵀX + λI_D)⁻¹Xᵀy = ( Σ_n xn xnᵀ + λI_D )⁻¹ ( Σ_n yn xn )



Example: MAP for Linear Regression

Maximum likelihood estimation. Let ε = 0.1.

Ex. 1: ŵ_mle = [[1, −1/ε], [−1/ε, 2/ε²]] [[1, 1], [0, ε]] (1, 1)ᵀ = (1, 0)ᵀ
Ex. 2: ŵ_mle = [[1, −1/ε], [−1/ε, 2/ε²]] [[1, 1], [0, ε]] (1 + ε, 1)ᵀ = (1 + ε, −1)ᵀ = (1.1, −1)ᵀ

Maximum a posteriori estimation. Let λ = 0.05.

XᵀX + λI_D = [[2 + λ, ε], [ε, ε² + λ]] = [[2.05, 0.1], [0.1, 0.06]]

(XᵀX + λI_D)⁻¹ = [[0.531, −0.885], [−0.885, 18.1416]]

Ex. 1: ŵ_map = [[0.531, −0.885], [−0.885, 18.1416]] [[1, 1], [0, ε]] (1, 1)ᵀ = (0.9735, 0.0442)ᵀ
Ex. 2: ŵ_map = [[0.531, −0.885], [−0.885, 18.1416]] [[1, 1], [0, ε]] (1 + ε, 1)ᵀ = (1.0265, −0.0442)ᵀ
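These numbers can be reproduced with a few lines of NumPy (ε = 0.1, λ = 0.05 as above). Note how the MAP/ridge estimates for the two datasets stay close to each other, whereas the MLE slope jumps from 0 to −1:

import numpy as np

eps, lam = 0.1, 0.05
X = np.array([[1.0, 0.0],
              [1.0, eps]])
targets = {"Ex. 1": np.array([1.0, 1.0]),
           "Ex. 2": np.array([1.0 + eps, 1.0])}

for name, y in targets.items():
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    print(name, w_mle.round(4), w_map.round(4))
# Ex. 1: MLE [1. 0.],   MAP ~[0.9735 0.0442]
# Ex. 2: MLE [1.1 -1.], MAP ~[1.0265 -0.0442]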
