
Maximum Likelihood Estimation (MLE)

Regularizations

Faculty of Computer Science


University of Information Technology (UIT)
Vietnam National University - Ho Chi Minh City (VNU-HCM)

Math for Computer Science, Fall 2022



References

The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. https://probml.github.io/pml-book/book1.html
Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization





Introduction

The process of estimating θ from D is called model fitting, or training, and is at the heart of machine learning.
There are many methods for estimating θ, and they typically involve an optimization problem of the form

θ̂ = argmin_θ L(θ)

where L(θ) is some kind of loss function or objective function.


The process of quantifying uncertainty about an unknown quantity
estimated from a finite sample of data is called inference.
In deep learning, the term “inference” refers to “prediction”, namely
computing
p(y|x, θ̂)





Maximum Likelihood Estimation
The most common approach to parameter estimation is to pick the
parameters that assign the highest probability to the training data.
This is called maximum likelihood estimation or MLE.
θ̂_mle = argmax_θ p(D|θ)
We usually assume the training examples are independent and identically distributed (the iid assumption), i.e., sampled independently from the same distribution. The conditional likelihood then becomes

p(D|θ) = p(y1, y2, ..., yN | x1, x2, ..., xN, θ) = ∏_{n=1}^{N} p(yn | xn, θ)
We usually work with the log likelihood, which decomposes into a sum of terms, one per example:

LL(θ) = log p(D|θ) = log ∏_{n=1}^{N} p(yn | xn, θ) = Σ_{n=1}^{N} log p(yn | xn, θ)



Maximum Likelihood Estimation
The MLE is given by

θ̂_mle = argmax_θ Σ_{n=1}^{N} log p(yn | xn, θ)

Because most optimization algorithms are designed to minimize cost functions, we redefine the objective function to be the conditional negative log likelihood or NLL:

NLL(θ) = −log p(D|θ) = −Σ_{n=1}^{N} log p(yn | xn, θ)

Minimizing this will give the MLE:

θ̂_mle = argmin_θ −Σ_{n=1}^{N} log p(yn | xn, θ)

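As a concrete illustration of this recipe, the NLL can be coded as a sum of per-example terms and handed to a generic optimizer. The sketch below assumes a Bernoulli model for a made-up sequence of coin tosses (the closed-form answer for this model is derived in the Examples section); it is illustrative only.

import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical coin-toss data: 1 = heads, 0 = tails.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def nll(theta):
    # NLL(theta) = -sum_n log p(y_n | theta) under a Bernoulli(theta) model.
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Minimizing the NLL gives the MLE.
result = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, y.mean())   # the numerical minimizer matches the empirical fraction of heads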




Bernoulli Random Variables

A Bernoulli r.v. X takes two possible values, usually 0 and 1,


modeling random experiments that have two possible outcomes (e.g.,
“success” and “failure”).
e.g., tossing a coin. The outcome is either Head or Tail.
e.g., taking an exam. The result is either Pass or Fail.
e.g., classifying images. An image is either Cat or Non-cat.



Bernoulli Random Variables

Definition
A random variable X is a Bernoulli random variable with parameter
p ∈ [0, 1], written as X ∼ Bernoulli(p) if its PMF is given by
P_X(x) = p        for x = 1
P_X(x) = 1 − p    for x = 0
Figure: PMF of a Bernoulli(p) random variable: mass p at x = 1 and 1 − p at x = 0.
Example

A bag contains 3 balls, each ball is either red or blue.


The number of blue balls θ can be 0, 1, 2, 3.
Choose 4 balls randomly with replacement.
Random variables X1, X2, X3, X4 are defined as Xi = 1 if the i-th chosen ball is blue, and Xi = 0 if the i-th chosen ball is red.

After doing the experiment, the following values for the Xi's are observed: x1 = 1, x2 = 0, x3 = 1, x4 = 1.
Note that the Xi's are i.i.d. (independent and identically distributed) and Xi ∼ Bernoulli(θ/3). For which value of θ is the probability of the observed sample the largest?



Example
P_{Xi}(x) = θ/3        for x = 1
P_{Xi}(x) = 1 − θ/3    for x = 0

Since the Xi's are independent, the joint PMF of X1, X2, X3, X4 can be written as

P_{X1X2X3X4}(x1, x2, x3, x4) = P_{X1}(x1) P_{X2}(x2) P_{X3}(x3) P_{X4}(x4)

P_{X1X2X3X4}(1, 0, 1, 1) = (θ/3) · (1 − θ/3) · (θ/3) · (θ/3) = (θ/3)³ (1 − θ/3)

θ    P_{X1X2X3X4}(1, 0, 1, 1; θ)
0    0
1    0.0247
2    0.0988
3    0
The observed data is most likely to occur for θ = 2.
We may choose θ̂ = 2 as our estimate of θ.
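The table can be reproduced with a few lines of Python; a minimal sketch using the observed sample (1, 0, 1, 1) from the example:

# Likelihood of the observed sample (1, 0, 1, 1) for each candidate theta.
observed = [1, 0, 1, 1]

for theta in range(4):                 # theta = 0, 1, 2, 3 blue balls
    p = theta / 3                      # P(blue) for a single draw
    likelihood = 1.0
    for x in observed:
        likelihood *= p if x == 1 else (1 - p)
    print(theta, round(likelihood, 4))
# Prints 0 0.0, 1 0.0247, 2 0.0988, 3 0.0 -> theta = 2 maximizes the likelihood.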
MLE for the Bernoulli distribution

Suppose Y is a random variable representing a coin toss.


The event Y = 1 corresponds to heads, Y = 0 corresponds to tails.
The probability distribution for this rv is the Bernoulli. The NLL for the Bernoulli distribution is

NLL(θ) = −log ∏_{n=1}^{N} p(yn|θ) = −log ∏_{n=1}^{N} θ^{I(yn=1)} (1 − θ)^{I(yn=0)}
       = −Σ_{n=1}^{N} [ I(yn = 1) log θ + I(yn = 0) log(1 − θ) ]
       = −[ N1 log θ + N0 log(1 − θ) ]

where N1 = Σ_{n=1}^{N} I(yn = 1) is the number of heads, N0 = Σ_{n=1}^{N} I(yn = 0) is the number of tails, and N = N0 + N1 is the sample size.
MLE for the Bernoulli distribution

NLL(θ) = −[N1 log θ + N0 log(1 − θ)]

The derivative of the NLL is

d/dθ NLL(θ) = −N1/θ + N0/(1 − θ)

The MLE can be found by solving d/dθ NLL(θ) = 0. The MLE is given by

θ̂_mle = N1 / (N0 + N1)

which is the empirical fraction of heads.

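The stationary-point calculation can also be checked symbolically; a small sketch using sympy (assuming N0, N1 > 0):

import sympy as sp

theta, N0, N1 = sp.symbols("theta N0 N1", positive=True)

# NLL for the Bernoulli model and the solution of d/dtheta NLL = 0.
nll = -(N1 * sp.log(theta) + N0 * sp.log(1 - theta))
print(sp.solve(sp.diff(nll, theta), theta))   # [N1/(N0 + N1)]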


MLE for the categorical distribution

Suppose we roll a K-sided die N times.
Let Yn ∈ {1, ..., K} be the n-th outcome, where Yn ∼ Cat(θ).
We want to estimate θ from the dataset D = {yn : n = 1 : N}.
The NLL is given by

NLL(θ) = −Σ_k Nk log θk

where Nk is the number of times the event Y = k is observed.


The compute the MLE, we have to minimize the NLL subject to the
constraint that
K
X
θk = 1
k=1



MLE for the categorical distribution
We use the method of Lagrange multipliers. The Lagrangian is

L(θ, λ) = −Σ_k Nk log θk − λ(1 − Σ_k θk)

Taking derivatives with respect to λ yields the original constraint:

∂L/∂λ = 1 − Σ_k θk = 0

Taking derivatives with respect to θk yields

∂L/∂θk = −Nk/θk + λ = 0  →  Nk = λ θk

We can solve for λ using the sum-to-one constraint:

Σ_k Nk = N = λ Σ_k θk = λ

Thus the MLE is given by θ̂k = Nk/λ = Nk/N, the empirical fraction of times event k occurs.
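A quick numerical illustration of θ̂_k = Nk/N; the die rolls below are synthetic:

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1000)          # hypothetical rolls of a K = 6 sided die

counts = np.bincount(rolls, minlength=7)[1:]   # N_k for k = 1..6
theta_mle = counts / counts.sum()              # empirical fractions N_k / N
print(theta_mle, theta_mle.sum())              # the estimates sum to 1 by construction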
Standard Normal (Gaussian) Random Variable N (0, 1)

Figure: PDF of the standard normal random variable.

∫_{−∞}^{∞} e^{−x²/2} dx = √(2π)

f_X(x) = (1/√(2π)) e^{−x²/2}



General Normal (Gaussian) Random Variable N (µ, σ 2 )

Figure: PDF of the general normal (Gaussian) random variable N(µ, σ²).

f_X(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} = (1/√(2πσ²)) exp(−(1/(2σ²)) (x − µ)²)

E[X] = µ,   Var(X) = σ²
General Normal (Gaussian) Random Variable N (µ, σ 2 )

Figure: PDFs of N(µ, σ²) for σ = 0.5, 1, 2, 3.
Smaller σ, narrower PDF.
Let Y = aX + b, where X ∼ N(µ, σ²).
Then E[Y] = aE[X] + b = aµ + b and Var(Y) = a²σ² (always true, for any X).
But also, Y ∼ N(aµ + b, a²σ²).

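A quick Monte Carlo sanity check of the affine-transformation property; the constants a = 2, b = 3, µ = 1, σ = 0.5 are arbitrary:

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, a, b = 1.0, 0.5, 2.0, 3.0

x = rng.normal(mu, sigma, size=1_000_000)
y = a * x + b

# Empirical mean and variance of Y should be close to a*mu + b and a**2 * sigma**2.
print(y.mean(), a * mu + b)        # ~5.0 vs 5.0
print(y.var(), a**2 * sigma**2)    # ~1.0 vs 1.0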


Example

We have N = 3 data points y1 = 1, y2 = 0.5, y3 = 1.5 which are


independent and Gaussian with unknown mean µ and variance 1:

yi ∼ N (µ, 1)

Likelihood: P(y1, y2, y3 | µ) = P(y1|µ) P(y2|µ) P(y3|µ).


Consider two guesses µ = 1.0 and µ = 2.5. Which has higher
likelihood?
Finding the µ that maximizes the likelihood is equivalent to moving
the Gaussian until the product P (y1 |µ)P (y2 |µ)P (y3 |µ) is maximized.

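The comparison can be carried out directly; a minimal sketch using scipy.stats with the three data points from the slide:

import numpy as np
from scipy.stats import norm

y = np.array([1.0, 0.5, 1.5])

def likelihood(mu):
    # Product of the three Gaussian densities N(y_i | mu, 1).
    return np.prod(norm.pdf(y, loc=mu, scale=1.0))

print(likelihood(1.0), likelihood(2.5))   # mu = 1.0 gives the much higher likelihood
# The maximizer turns out to be the sample mean, y.mean() = 1.0 (derived next).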


MLE for the univariate Gaussian
Let Y ∼ N(µ, σ²) and let D = {yn : n = 1 : N} be an iid sample of size N.

p(y|θ) = N(y | µ, σ²) = (1/√(2πσ²)) exp(−(1/(2σ²)) (y − µ)²)

We can estimate the parameters θ = (µ, σ²) using MLE.
We derive the NLL, which is given by

NLL(µ, σ²) = −Σ_{n=1}^{N} log[ (1/√(2πσ²)) exp(−(1/(2σ²)) (yn − µ)²) ]
           = (1/(2σ²)) Σ_{n=1}^{N} (yn − µ)² + (N/2) log(2πσ²)

The minimum of this function must satisfy the following conditions:

∂/∂µ NLL(µ, σ²) = 0,   ∂/∂σ² NLL(µ, σ²) = 0



MLE for the univariate Gaussian
The solution is given by

µ̂_mle = (1/N) Σ_{n=1}^{N} yn = ȳ

σ̂²_mle = (1/N) Σ_{n=1}^{N} (yn − µ̂_mle)² = (1/N) Σ_{n=1}^{N} (yn² + µ̂²_mle − 2 yn µ̂_mle) = s² − ȳ²

where s² ≜ (1/N) Σ_{n=1}^{N} yn².

The quantities ȳ and s² are called the sufficient statistics of the data, because they are sufficient to compute the MLE.
Sometimes, we might see the estimate for the variance written as

σ̂² = (1/(N − 1)) Σ_{n=1}^{N} (yn − µ̂_mle)²

which is not the MLE, but is a different kind of estimate (the unbiased estimator of the variance).
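A short numerical check of the closed-form estimates on synthetic data:

import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=2.0, scale=1.5, size=10_000)   # hypothetical sample

y_bar = y.mean()                # sufficient statistic: sample mean
s2 = np.mean(y ** 2)            # sufficient statistic: mean of squares

mu_mle = y_bar
sigma2_mle = s2 - y_bar ** 2    # equals the 1/N sample variance

print(mu_mle, sigma2_mle, np.var(y))   # sigma2_mle agrees with np.var(y) (ddof = 0)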


MLE for linear regression

We can make the parameters of the Gaussian be functions of some input variables:

p(y|x; θ) = N(y | fµ(x; θ), fσ(x; θ)²)

where fµ(x; θ) ∈ ℝ predicts the mean, and fσ(x; θ)² ∈ ℝ₊ predicts the variance.
It is common to assume that the variance is fixed, and is independent of the input. This is called homoscedastic regression.
Furthermore, it is common to assume the mean is a linear function of the input. The resulting model is called linear regression:

p(y|x; θ) = N(y | wᵀx + b, σ²)

where θ = (w, b, σ).



MLE for linear regression

Figure: Linear regression using a Gaussian output with mean µ(x) = b + wx and fixed variance σ².

The figure plots the 95% predictive interval [µ(x) − 2σ, µ(x) + 2σ]. This is the uncertainty in the predicted observation y given x, and captures the variability in the observed data (the blue dots).
MLE for linear regression
Linear regression model:

p(y|x; θ) = N(y | wᵀx, σ²)

where θ = (w, σ²), and w = (b, w1, w2, ..., wD) (the bias b is absorbed into w by prepending a constant 1 to x).
Assuming that σ² is fixed, we estimate the weights w. The NLL is

NLL(w) = −Σ_{n=1}^{N} log[ (1/√(2πσ²)) exp(−(1/(2σ²)) (yn − wᵀxn)²) ]

Dropping the irrelevant additive constants gives the simplified objective, known as the residual sum of squares or RSS:

RSS(w) = Σ_{n=1}^{N} (yn − wᵀxn)² = Σ_{n=1}^{N} rn²

where rn is the n-th residual error.


MLE for linear regression
Residual sum of squares or RSS:

RSS(w) = Σ_{n=1}^{N} (yn − wᵀxn)²

Mean squared error or MSE:

MSE(w) = (1/N) Σ_{n=1}^{N} (yn − wᵀxn)²

Root mean squared error or RMSE:

RMSE(w) = √(MSE(w)) = √( (1/N) Σ_{n=1}^{N} (yn − wᵀxn)² )

We can compute the MLE by minimizing the NLL, RSS, MSE, or


RMSE. All give the same results.
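Since these objectives differ only by a positive scaling or a monotone square root, they share the same minimizer; a small sketch computing all three for an arbitrary candidate w on synthetic data (X is assumed to have a leading column of ones for the bias):

import numpy as np

rng = np.random.default_rng(3)
N, D = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])   # synthetic inputs, bias column first
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

def rss(w):  return np.sum((y - X @ w) ** 2)
def mse(w):  return rss(w) / N
def rmse(w): return np.sqrt(mse(w))

w = np.zeros(D)                      # an arbitrary candidate weight vector
print(rss(w), mse(w), rmse(w))       # different values, but the same minimizing w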
MLE for linear regression
The RSS can be written in matrix notation as follows:

RSS(w) = Σ_{n=1}^{N} (yn − wᵀxn)² = ‖Xw − y‖²₂ = (Xw − y)ᵀ(Xw − y)

The gradient is given by

∇_w RSS(w) = 2(XᵀXw − Xᵀy)

Setting the gradient to zero, ∇_w RSS(w) = 0, and solving gives

XᵀXw = Xᵀy

These are known as the normal equations.
The MLE solution ŵ_mle is called the ordinary least squares (OLS) solution:

ŵ_mle = argmin_w RSS(w) = (XᵀX)⁻¹Xᵀy



MLE for linear regression

ŵ_mle = argmin_w RSS(w) = (XᵀX)⁻¹Xᵀy

The quantity X† = (XᵀX)⁻¹Xᵀ is the (left) pseudo-inverse of the (non-square) matrix X.
Is the solution ŵ_mle unique?
The gradient is ∇_w RSS(w) = 2(XᵀXw − Xᵀy), so the Hessian is

H(w) = ∂²/∂w² RSS(w) = 2XᵀX

If X is full rank (i.e., the columns of X are linearly independent), then H is positive definite, since for any v ≠ 0 we have

vᵀ(XᵀX)v = (Xv)ᵀ(Xv) = ‖Xv‖² > 0

In the full rank case, RSS(w) has a unique global minimum.
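A minimal numerical sketch of the OLS solution via the normal equations on synthetic data; in practice np.linalg.lstsq (QR/SVD based) is preferred over forming (XᵀX)⁻¹ explicitly:

import numpy as np

rng = np.random.default_rng(4)
N, D = 100, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
w_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ w_true + rng.normal(scale=0.05, size=N)

# Full rank (linearly independent columns) guarantees a unique minimizer.
print(np.linalg.matrix_rank(X) == D)

# Solve the normal equations X^T X w = X^T y.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(w_mle)
print(np.linalg.lstsq(X, y, rcond=None)[0])   # same solution, numerically more stable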


Overfitting
MLE will try to pick parameters that minimize loss on the training
set, but this may not result in a model that has low loss on future
data. This is called overfitting.
Ex: We want to predict the probability of heads when tossing a coin. We toss it N = 3 times and observe 3 heads. The MLE is

θ̂_mle = N1 / (N0 + N1) = 3 / (0 + 3) = 1
If we use this Ber(y|θ̂mle ) to make predictions, we will predict that all
future coin tosses will also be heads!!!
The model has enough parameters to perfectly fit the observed
training data, so it can perfectly match the empirical distribution.
In most cases, the empirical distribution is not the same as the true
distribution. Putting all the probability mass on the observed set of N
examples will not leave over any probability for novel data in the
future. The model may not generalize.
Example: MLE for Linear Regression

Example 1:
Training data: x1 = (1, 0)ᵀ, y1 = 1;  x2 = (1, ε)ᵀ, y2 = 1.

X = [[1, 0], [1, ε]],   y = (1, 1)ᵀ

ŵ_mle = (XᵀX)⁻¹Xᵀy = ?

Example 2:
Training data: x1 = (1, 0)ᵀ, y1 = 1 + ε;  x2 = (1, ε)ᵀ, y2 = 1.

X = [[1, 0], [1, ε]],   y = (1 + ε, 1)ᵀ

ŵ_mle = (XᵀX)⁻¹Xᵀy = ?



Example: MLE for Linear Regression

Example 1:
Training data: x1 = (1, 0)ᵀ, y1 = 1;  x2 = (1, ε)ᵀ, y2 = 1.

X = [[1, 0], [1, ε]],   y = (1, 1)ᵀ

ŵ_mle = (XᵀX)⁻¹Xᵀy

XᵀX = [[1, 1], [0, ε]] [[1, 0], [1, ε]] = [[2, ε], [ε, ε²]]

(XᵀX)⁻¹ = [[1, −1/ε], [−1/ε, 2/ε²]]

ŵ_mle = [[1, −1/ε], [−1/ε, 2/ε²]] [[1, 1], [0, ε]] (1, 1)ᵀ = (1, 0)ᵀ



Example: MLE for Linear Regression

Example 2:
Training data: x1 = (1, 0)ᵀ, y1 = 1 + ε;  x2 = (1, ε)ᵀ, y2 = 1.

X = [[1, 0], [1, ε]],   y = (1 + ε, 1)ᵀ

ŵ_mle = (XᵀX)⁻¹Xᵀy

XᵀX = [[1, 1], [0, ε]] [[1, 0], [1, ε]] = [[2, ε], [ε, ε²]]

(XᵀX)⁻¹ = [[1, −1/ε], [−1/ε, 2/ε²]]

ŵ_mle = [[1, −1/ε], [−1/ε, 2/ε²]] [[1, 1], [0, ε]] (1 + ε, 1)ᵀ = (1 + ε, −1)ᵀ

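A quick numerical check of the two worked examples with ε = 0.1 (the value used again in the MAP example at the end): perturbing a single target by ε flips the fitted slope from 0 to −1, illustrating how sensitive the unregularized MLE can be.

import numpy as np

eps = 0.1
X = np.array([[1.0, 0.0],
              [1.0, eps]])

y_ex1 = np.array([1.0, 1.0])          # Example 1
y_ex2 = np.array([1.0 + eps, 1.0])    # Example 2: first target perturbed by eps

for y in (y_ex1, y_ex2):
    w = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
    print(w)
# Prints approximately [1. 0.] and [1.1 -1.]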


Regularization
The main solution to overfitting is to use regularization.
We add a penalty term to the NLL (or empirical risk):

L(θ; λ) = [ (1/N) Σ_{n=1}^{N} ℓ(yn, f(xn; θ)) ] + λ C(θ)

where λ ≥ 0 is the regularization parameter, and C(θ) is some form of complexity penalty.
A common complexity penalty is to use C(θ) = −log p(θ), where p(θ) is the prior for θ.
If ℓ is the log loss, the regularized objective becomes

L(θ; λ) = −(1/N) Σ_{n=1}^{N} log p(yn | xn, θ) − λ log p(θ)



Maximum a posteriori estimation (MAP)

L(θ; λ) = −(1/N) Σ_{n=1}^{N} log p(yn | xn, θ) − λ log p(θ)

By setting λ = 1 and rescaling p(θ) appropriately, we can equivalently minimize the following:

L(θ; 1) = −[ Σ_{n=1}^{N} log p(yn | xn, θ) + log p(θ) ] = −[ log p(D|θ) + log p(θ) ]

Minimizing this is equivalent to maximizing the log posterior:

θ̂_map = argmax_θ log p(θ|D) = argmax_θ log [ p(D|θ) p(θ) / p(D) ]
      = argmax_θ [ log p(D|θ) + log p(θ) − const ]

This is MAP estimation, or maximum a posteriori estimation.


MAP estimation for Bernoulli distribution
Coin tossing. If we observe just one head, the MLE is θ̂mle = 1.
To avoid this, we can add a penalty to θ to discourage “extreme”
values, such as θ = 0 or θ = 1.
We can use a beta distribution as our prior p(θ) = Beta(θ|a, b),
where a, b > 1 encourages values of θ near to a/(a + b).
If a = b = 1, we get the uniform distribution.
If a and b are both less than 1, we get a bimodal distribution.
If a and b are both greater than 1, the distribution is unimodal.

mean = a / (a + b)
var = ab / ((a + b)² (a + b + 1))



MAP estimate for the Bernoulli distribution

Using the beta distribution as our prior p(θ) = Beta(θ|a, b), the log likelihood plus log prior becomes

LL(θ) = log p(D|θ) + log p(θ)
      = [ N1 log θ + N0 log(1 − θ) ] + [ (a − 1) log θ + (b − 1) log(1 − θ) ]

The MAP estimate is

θ̂_map = (N1 + a − 1) / (N1 + N0 + a + b − 2)

If we set a = b = 2, which weakly favors a value of θ near 0.5, the estimate becomes

θ̂_map = (N1 + 1) / (N1 + N0 + 2)

This is called add-one smoothing, and it avoids the zero-count problem.

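A tiny numeric illustration of add-one smoothing versus the MLE for the single observed head (N1 = 1, N0 = 0):

def theta_mle(n1, n0):
    return n1 / (n1 + n0)

def theta_map(n1, n0, a=2, b=2):
    # MAP estimate under a Beta(a, b) prior.
    return (n1 + a - 1) / (n1 + n0 + a + b - 2)

print(theta_mle(1, 0))   # 1.0 -> predicts heads forever
print(theta_map(1, 0))   # 0.666... -> pulled toward 0.5 by the prior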


Black swan paradox

The zero-count problem, and overfitting more generally, are analogous to the black swan paradox.
It is used to illustrate the problem of
induction: how to draw general
conclusions about the future from
specific observations from the past.
The solution to the paradox is to admit
that induction is in general impossible.
The best we can do is to make plausible
guesses by combining the empirical data
with prior knowledge.



Weight decay

Polynomial regression with too many degrees of freedom can result in overfitting. One solution is to reduce the degree of the polynomial.
A more general solution is to penalize the magnitude of the weights (regression coefficients).
We use a zero-mean Gaussian prior p(w). The MAP estimate is

ŵ_map = argmin_w NLL(w) + λ‖w‖²₂

where ‖w‖²₂ = Σ_{d=1}^{D} w_d². We penalize the magnitude of the weight vector w, rather than the bias term b.
This is called ℓ2 regularization or weight decay.
The larger the value of λ, the more the parameters are penalized for being large (i.e., deviating from the zero-mean prior), and thus the less flexible the model.



Ridge regression

In the case of linear regression, the weight decay penalization scheme


is called ridge regression.
Consider polynomial regression, where the predictor has the form

f(x; w) = Σ_{d=0}^{D} w_d x^d = wᵀ[1, x, x², ..., x^D]

Suppose we use a high degree polynomial, say D = 14, even though


we have a small dataset with just N = 21 examples.
MLE for the parameters will enable the model to fit the data very
well, but the resulting function is very “wiggly”, thus resulting in
overfitting.
Increasing λ can reduce overfitting.





Ridge regression
MAP estimation with a zero-mean Gaussian prior p(w) = N(w | 0, τ²I):

ŵ_map = argmin_w (1/(2σ²)) (y − Xw)ᵀ(y − Xw) + (1/(2τ²)) wᵀw
      = argmin_w RSS(w) + λ‖w‖²₂

where λ = σ²/τ² is proportional to the strength of the prior, and

‖w‖₂ = √( Σ_{d=1}^{D} |w_d|² ) = √(wᵀw)

is the ℓ2 norm of the vector w.


We do not penalize the offset w0 , since that only affects the global
mean of the output, and does not contribute to the overfitting.
Ridge Regression

The MAP estimate corresponds to minimizing the penalized objective:

J(w) = (y − Xw)ᵀ(y − Xw) + λ‖w‖²₂

where λ = σ²/τ² is the strength of the regularizer.
The derivative is given by

∇_w J(w) = 2(XᵀXw − Xᵀy + λw)

Therefore,

ŵ_map = (XᵀX + λI_D)⁻¹Xᵀy = ( Σ_n xn xnᵀ + λI_D )⁻¹ ( Σ_n yn xn )



Example: MAP for Linear Regression

Maximum likelihood estimation. Let ε = 0.1.

Ex. 1: ŵ_mle = [[1, −1/ε], [−1/ε, 2/ε²]] [[1, 1], [0, ε]] (1, 1)ᵀ = (1, 0)ᵀ
Ex. 2: ŵ_mle = [[1, −1/ε], [−1/ε, 2/ε²]] [[1, 1], [0, ε]] (1 + ε, 1)ᵀ = (1 + ε, −1)ᵀ = (1.1, −1)ᵀ

Maximum a posteriori estimation. Let λ = 0.05.

XᵀX + λI_D = [[2 + λ, ε], [ε, ε² + λ]] = [[2.05, 0.1], [0.1, 0.06]]

(XᵀX + λI_D)⁻¹ = [[0.531, −0.885], [−0.885, 18.1416]]

Ex. 1: ŵ_map = [[0.531, −0.885], [−0.885, 18.1416]] [[1, 1], [0, ε]] (1, 1)ᵀ = (0.9735, 0.0442)ᵀ
Ex. 2: ŵ_map = [[0.531, −0.885], [−0.885, 18.1416]] [[1, 1], [0, ε]] (1 + ε, 1)ᵀ = (1.0265, −0.0442)ᵀ
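These numbers can be reproduced with a few lines of NumPy (ε = 0.1, λ = 0.05 as above). Note how the MAP/ridge estimates for the two datasets stay close to each other, whereas the MLE slope jumps from 0 to −1:

import numpy as np

eps, lam = 0.1, 0.05
X = np.array([[1.0, 0.0],
              [1.0, eps]])
targets = {"Ex. 1": np.array([1.0, 1.0]),
           "Ex. 2": np.array([1.0 + eps, 1.0])}

for name, y in targets.items():
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    print(name, w_mle.round(4), w_map.round(4))
# Ex. 1: MLE [1. 0.],   MAP ~[0.9735 0.0442]
# Ex. 2: MLE [1.1 -1.], MAP ~[1.0265 -0.0442]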
