Maximum Likelihood Estimation
Regularizations
The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. https://probml.github.io/pml-book/book1.html
Table of Contents
Introduction
Examples
Regularization
We estimate parameters by minimizing an objective (loss) function L(θ); for maximum likelihood estimation, L(θ) is the negative log likelihood:
θ̂ = argmin_θ L(θ)
Definition
A random variable X is a Bernoulli random variable with parameter
p ∈ [0, 1], written as X ∼ Bernoulli(p) if its PMF is given by
P_X(x) = p for x = 1,
P_X(x) = 1 − p for x = 0.
Figure: PMF of a Bernoulli(p) random variable: p_X(1) = p and p_X(0) = 1 − p.
Example
θ    P_{X1 X2 X3 X4}(1, 0, 1, 1; θ)
0    0
1    0.0247
2    0.0988
3    0
The observed data is most likely to occur for θ = 2.
We may choose θ̂ = 2 as our estimate of θ.
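The table can be reproduced numerically; a minimal sketch, assuming (as the likelihood values suggest) that the four observations are i.i.d. Bernoulli(θ/3) draws with θ ∈ {0, 1, 2, 3}:

```python
# Likelihood of the observed sample (1, 0, 1, 1) as a function of theta,
# assuming X_i ~ Bernoulli(theta / 3) for theta in {0, 1, 2, 3}.
data = [1, 0, 1, 1]

for theta in range(4):
    p = theta / 3.0
    likelihood = 1.0
    for x in data:
        likelihood *= p if x == 1 else 1.0 - p
    print(theta, round(likelihood, 4))  # 0 -> 0.0, 1 -> 0.0247, 2 -> 0.0988, 3 -> 0.0
```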
MLE for the Bernoulli distribution
The MLE is θ̂ = N1 / (N0 + N1), where N1 = Σ_{n=1}^N I(yn = 1) is the number of heads, N0 = Σ_{n=1}^N I(yn = 0) is the number of tails, and N = N0 + N1 is the sample size.
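A minimal numerical sketch of this estimator (the flips below are illustrative data, not from the slides):

```python
import numpy as np

# MLE for a Bernoulli parameter: theta_hat = N1 / (N0 + N1).
y = np.array([1, 0, 1, 1, 0, 1])      # example coin flips (1 = heads)
N1 = np.sum(y == 1)                   # number of heads
N0 = np.sum(y == 0)                   # number of tails
theta_hat = N1 / (N0 + N1)            # sample proportion of heads
print(theta_hat)                      # 0.666...
```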
Standard Normal Random Variable N(0, 1)
Figure: PDF of the standard normal distribution, f_X(x) = (1/√(2π)) e^(−x²/2).
∫_{−∞}^{∞} e^(−x²/2) dx = √(2π)
f_X(x) = (1/√(2π)) e^(−x²/2)
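The normalizing constant can be checked numerically; a minimal sketch using scipy's quadrature (scipy is an assumption of this sketch, not something the slides rely on):

```python
import numpy as np
from scipy.integrate import quad

# Numerically verify that the Gaussian integral equals sqrt(2*pi),
# so that f_X(x) = exp(-x**2 / 2) / sqrt(2*pi) integrates to 1.
integral, _ = quad(lambda x: np.exp(-x**2 / 2), -np.inf, np.inf)
print(integral, np.sqrt(2 * np.pi))   # both approximately 2.5066
```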
Figure: PDF of the general normal N(µ, σ²), f_X(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)).
f_X(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)) = (1/√(2πσ²)) exp(−(x−µ)²/(2σ²))
E[X] = µ,   Var(X) = σ²
General Normal (Gaussian) Random Variable N(µ, σ²)
Figure: PDFs of N(µ, σ²) for σ = 0.5, 1, 2, 3.
Smaller σ, narrower PDF.
Let Y = aX + b, where X ∼ N(µ, σ²).
Then E[Y] = aE[X] + b and Var(Y) = a²σ² (always true).
But also, Y ∼ N(aµ + b, a²σ²).
Example: for i.i.d. samples y_i ∼ N(µ, 1), the MLE of µ is the sample mean ȳ = (1/N) Σ_n y_n.
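A minimal numerical sketch of this estimate (the data below is synthetic):

```python
import numpy as np

# MLE of mu for i.i.d. samples y_i ~ N(mu, 1) is the sample mean.
rng = np.random.default_rng(0)
mu_true = 2.5
y = rng.normal(loc=mu_true, scale=1.0, size=1000)
mu_hat = y.mean()                      # maximizes the Gaussian log likelihood
print(mu_hat)                          # close to 2.5
```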
p(y | x; θ) = N(y | w^T x + b, σ²)
Figure: Linear regression using a Gaussian output with mean µ(x) = b + wx and fixed variance σ².
The figure plots the 95% predictive interval [µ(x) − 2σ, µ(x) + 2σ].
This is the uncertainty in the predicted observation y given x, and it captures the variability in the blue dots.
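A minimal sketch of that predictive interval for a scalar input (the parameter values w, b, σ below are illustrative, not read off the figure):

```python
import numpy as np

# 95% predictive interval [mu(x) - 2*sigma, mu(x) + 2*sigma] for a
# linear-Gaussian model p(y | x) = N(y | w*x + b, sigma**2).
w, b, sigma = 1.5, 0.5, 0.8            # illustrative parameters
x = np.linspace(0, 5, 6)
mu = w * x + b                         # predicted mean
lower, upper = mu - 2 * sigma, mu + 2 * sigma
print(np.column_stack([x, lower, upper]))
```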
MLE for linear regression
Linear regression model:
p(y | x; θ) = N(y | w^T x, σ²)
∇_w RSS(w) = X^T X w − X^T y
Setting the gradient to zero gives the normal equations: X^T X w = X^T y.
H(w) = ∂²RSS(w)/∂w² = X^T X
If X is full rank (i.e., the columns of X are linearly independent), then H is positive definite, since for any v ≠ 0 we have
v^T (X^T X) v = (Xv)^T (Xv) = ‖Xv‖² > 0.
In the full rank case, RSS(w) has a unique global minimum.
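A minimal numpy sketch of solving the normal equations on synthetic data (the data and names below are illustrative):

```python
import numpy as np

# Solve the normal equations X^T X w = X^T y for the least-squares weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))            # design matrix with full column rank
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)   # unique minimizer of RSS(w)
print(w_mle)                                # close to w_true
```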
Example 1:
Training data: x1 = (1, 0)^T, y1 = 1;  x2 = (1, ε)^T, y2 = 1.
X = [1 0; 1 ε] (rows separated by semicolons),  y = (1, 1)^T
ŵmle = (X^T X)^{-1} X^T y = ?
Example 2:
Training data: x1 = (1, 0)^T, y1 = 1 + ε;  x2 = (1, ε)^T, y2 = 1.
X = [1 0; 1 ε],  y = (1 + ε, 1)^T
ŵmle = (X^T X)^{-1} X^T y = ?
Example 1:
Training data: x1 = (1, 0)^T, y1 = 1;  x2 = (1, ε)^T, y2 = 1.
X = [1 0; 1 ε],  y = (1, 1)^T
ŵmle = (X^T X)^{-1} X^T y
X^T X = [1 1; 0 ε] [1 0; 1 ε] = [2 ε; ε ε²]
(X^T X)^{-1} = [1 −1/ε; −1/ε 2/ε²]
ŵmle = [1 −1/ε; −1/ε 2/ε²] [1 1; 0 ε] (1, 1)^T = (1, 0)^T.
Example 2:
Training data: x1 = (1, 0)^T, y1 = 1 + ε;  x2 = (1, ε)^T, y2 = 1.
X = [1 0; 1 ε],  y = (1 + ε, 1)^T
ŵmle = (X^T X)^{-1} X^T y
X^T X = [1 1; 0 ε] [1 0; 1 ε] = [2 ε; ε ε²]
(X^T X)^{-1} = [1 −1/ε; −1/ε 2/ε²]
ŵmle = [1 −1/ε; −1/ε 2/ε²] [1 1; 0 ε] (1 + ε, 1)^T = (1 + ε, −1)^T.
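Both solutions can be verified numerically; a minimal numpy sketch (ε is set to 0.1 here only so the snippet runs, matching the value used later in the slides):

```python
import numpy as np

eps = 0.1   # the slides later use eps = 0.1

X = np.array([[1.0, 0.0],
              [1.0, eps]])

# Example 1: y = (1, 1)
y1 = np.array([1.0, 1.0])
print(np.linalg.solve(X.T @ X, X.T @ y1))   # [1, 0]

# Example 2: y = (1 + eps, 1)
y2 = np.array([1.0 + eps, 1.0])
print(np.linalg.solve(X.T @ X, X.T @ y2))   # [1 + eps, -1] = [1.1, -1]
```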
Regularization adds a complexity penalty (here, a log prior) to the average negative log likelihood:
L(θ; λ) = −(1/N) Σ_{n=1}^N log p(yn | xn, θ) − λ log p(θ)
Using the beta distribution as our prior, p(θ) = Beta(θ | a, b), the log likelihood plus log prior becomes
log p(D | θ) + log p(θ) = (N1 + a − 1) log θ + (N0 + b − 1) log(1 − θ) + const,
and the MAP estimate is θ̂_map = (N1 + a − 1) / (N1 + N0 + a + b − 2).
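A minimal numerical sketch of this MAP estimate (the counts N1, N0 and hyperparameters a, b below are illustrative, not from the slides):

```python
# MAP estimate for a Bernoulli parameter with a Beta(a, b) prior:
# theta_map = (N1 + a - 1) / (N1 + N0 + a + b - 2).
N1, N0 = 7, 3        # illustrative head/tail counts
a, b = 2.0, 2.0      # Beta prior hyperparameters
theta_map = (N1 + a - 1) / (N1 + N0 + a + b - 2)
theta_mle = N1 / (N1 + N0)
print(theta_mle, theta_map)   # 0.7 vs 0.666..., pulled toward the prior mean 0.5
```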
For linear models, using a zero-mean Gaussian prior on the weights gives the penalized objective
L(w; λ) = NLL(w) + λ‖w‖₂²,
where ‖w‖₂² = Σ_{d=1}^D w_d². We penalize the magnitude of the weight vector w, rather than the bias term b.
This is called ℓ2 regularization or weight decay.
The larger the value of λ, the more the parameters are penalized for being large (i.e., deviating from the zero-mean prior), and thus the less flexible the model.
ŵmap = argmin_w [ (1/(2σ²)) (y − Xw)^T (y − Xw) + (1/(2τ²)) w^T w ]
     = argmin_w [ RSS(w) + λ‖w‖₂² ]
where λ = σ²/τ² is proportional to the strength of the prior, and
‖w‖₂ = √( Σ_{d=1}^D |w_d|² ) = √(w^T w)
Therefore,
ŵmap = (X^T X + λI_D)^{-1} X^T y = (Σ_n xn xn^T + λI_D)^{-1} (Σ_n yn xn)
Maximum likelihood estimation. Let ε = 0.1.
Ex. 1: ŵmle = [1 −1/ε; −1/ε 2/ε²] [1 1; 0 ε] (1, 1)^T = (1, 0)^T.
Ex. 2: ŵmle = [1 −1/ε; −1/ε 2/ε²] [1 1; 0 ε] (1 + ε, 1)^T = (1 + ε, −1)^T = (1.1, −1)^T.
Maximum a posteriori estimation. Let λ = 0.05
X^T X + λI_D = [2 + λ  ε; ε  ε² + λ] = [2.05 0.1; 0.1 0.06]
(X^T X + λI_D)^{-1} = [0.531 −0.885; −0.885 18.1416]
Ex. 1: ŵmap = [0.531 −0.885; −0.885 18.1416] [1 1; 0 0.1] (1, 1)^T = (0.9735, 0.0442)^T
Ex. 2: ŵmap = [0.531 −0.885; −0.885 18.1416] [1 1; 0 0.1] (1.1, 1)^T = (1.0265, −0.0442)^T
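These MLE and MAP numbers can be reproduced with a few lines of numpy; a minimal sketch:

```python
import numpy as np

eps, lam = 0.1, 0.05
X = np.array([[1.0, 0.0],
              [1.0, eps]])
I = np.eye(2)

for y in (np.array([1.0, 1.0]),           # Example 1
          np.array([1.0 + eps, 1.0])):    # Example 2
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    w_map = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
    print(w_mle, w_map)

# Example 1: w_mle = [1, 0],    w_map approx. [0.9735, 0.0442]
# Example 2: w_mle = [1.1, -1], w_map approx. [1.0265, -0.0442]
```

As expected, the MAP (ridge) estimate shrinks the second weight toward zero, so the solution is far less sensitive to the perturbation ε than the MLE.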