The document discusses the Expectation-Maximization (EM) algorithm, an iterative method to find maximum likelihood or maximum a posteriori estimates of parameters in statistical models, where the model depends on unobserved latent variables. It explains the E-step and M-step of the algorithm and provides an example of using EM for Gaussian mixture modeling and clustering. The EM algorithm is guaranteed to increase the likelihood at each iteration and converge, though it may find a local rather than global maximum.


The Expectation-Maximization (EM) Algorithm
Last lecture: how to estimate μ given data

For that problem, we got a nice, closed-form solution, allowing calculation of the μ, σ that maximize the likelihood of the observed data.

[Figure: observed data points plotted on a line from roughly −3 to 2, with the estimated mean μ marked.]

We’re not always so lucky...
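For reference (not from the slide itself), the closed-form estimates being referred to are the standard Gaussian maximum-likelihood estimates:

\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \bigl(x_i - \hat{\mu}\bigr)^2 .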
More Complex Example

This? Or this?

[Figure: two candidate fits to the observed data, not reproduced here.]

(A modeling decision, not a math problem... but if the latter, what math?)
A Real Example: CpG content of human gene promoters

[Figure from “A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters,” Saxonov, Berg, and Brutlag, PNAS 2006;103:1412-1417. ©2006 by National Academy of Sciences.]
Gaussian Mixture Models / Model-based Clustering

Parameters θ:
  means:              µ1, µ2
  variances:          σ1², σ2²
  mixing parameters:  τ1, τ2 = 1 − τ1
  P.D.F.s:            f(x | µ1, σ1²), f(x | µ2, σ2²)

Likelihood (no closed-form maximum):

L(x_1, x_2, \dots, x_n \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \tau_1, \tau_2)
  \;=\; \prod_{i=1}^{n} \sum_{j=1}^{2} \tau_j\, f(x_i \mid \mu_j, \sigma_j^2)
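Although this likelihood has no closed-form maximizer, it is easy to evaluate numerically. A minimal sketch (my own illustration using numpy/scipy, not code from the slides):

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(x, mu, sigma2, tau):
    """log L = sum_i log( sum_j tau_j * f(x_i | mu_j, sigma_j^2) ) for a 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    # densities[i, j] = f(x_i | mu_j, sigma_j^2)
    densities = norm.pdf(x[:, None], loc=np.asarray(mu), scale=np.sqrt(np.asarray(sigma2)))
    return float(np.sum(np.log(densities @ np.asarray(tau))))
```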
Likelihood Surface

[Figure: 3-D plot of the likelihood as a function of μ1 and μ2, each ranging over roughly −20 to 20.]
[Figure: the likelihood surface evaluated for the data below.]

xi = −10.2, −10, −9.8,  −0.2, 0, 0.2,  11.8, 12, 12.2
σ² = 1.0, τ1 = τ2 = 0.5 (held fixed)
[Figure: the same likelihood surface for the same data, with four points marked at (μ1, μ2) = (−5, 12), (−10, 6), (6, −10), and (12, −5): the surface's four "bumps."]
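A surface like this can be tabulated directly with the mixture_log_likelihood sketch above (the grid and resolution here are my own choices):

```python
import numpy as np

x = [-10.2, -10, -9.8, -0.2, 0, 0.2, 11.8, 12, 12.2]
grid = np.linspace(-20, 20, 201)

# log L over a (mu1, mu2) grid with sigma^2 = 1.0 and tau1 = tau2 = 0.5 held fixed
surface = np.array([[mixture_log_likelihood(x, [m1, m2], [1.0, 1.0], [0.5, 0.5])
                     for m1 in grid] for m2 in grid])
i, j = np.unravel_index(np.argmax(surface), surface.shape)
print("grid maximum near mu1 =", grid[j], ", mu2 =", grid[i])
```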
A What-If Puzzle

Messy: no closed-form solution known for finding the θ maximizing L

But what if we knew the hidden data?
EM as Egg vs Chicken

IF zij known, could estimate parameters θ
  E.g., only points in cluster 2 influence µ2, σ2
IF parameters θ known, could estimate zij
  E.g., if |xi − µ1|/σ1 << |xi − µ2|/σ2, then zi1 >> zi2

But we know neither; (optimistically) iterate:
  E-step: calculate expected zij, given the parameters
  M-step: calculate the "MLE" of the parameters, given E(zij)
Overall, a clever "hill-climbing" strategy
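The two halves of this chicken-and-egg can each be written directly. A minimal numpy sketch for the two-component 1-D mixture (my own illustration; the formulas behind the two steps are spelled out on the slides that follow):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, mu, sigma2, tau):
    """Given parameters theta, return z[i, j] = E[z_ij] = P(x_i came from component j | theta)."""
    x = np.asarray(x, dtype=float)
    dens = np.asarray(tau) * norm.pdf(x[:, None], loc=np.asarray(mu), scale=np.sqrt(np.asarray(sigma2)))
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(x, z):
    """Given E[z_ij], return the maximizing parameters theta (weighted MLEs)."""
    x = np.asarray(x, dtype=float)
    n_j = np.maximum(z.sum(axis=0), 1e-12)        # guard against an empty component
    mu = (z * x[:, None]).sum(axis=0) / n_j
    sigma2 = np.maximum((z * (x[:, None] - mu) ** 2).sum(axis=0) / n_j, 1e-6)
    tau = n_j / len(x)
    return mu, sigma2, tau
```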
Simple Version:
“Classification EM”
If zij < .5, pretend it's 0; if zij > .5, pretend it's 1
I.e., classify each point as component 0 or 1
Now recalc θ, assuming that partition
Then recalc zij , assuming that θ
Then re-recalc θ, assuming new zij, etc., etc.
“Full EM” is a bit more involved, but this is the crux.

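A sketch of this classification variant, reusing the e_step/m_step functions from the sketch above (the rounding line is the only real difference from full EM):

```python
def classification_em(x, mu, sigma2, tau, n_iter=20):
    """Classification EM: harden each z_ij to 0/1 before re-estimating theta."""
    for _ in range(n_iter):
        z = e_step(x, mu, sigma2, tau)
        z = (z > 0.5).astype(float)     # zij < .5 -> 0, zij > .5 -> 1
        mu, sigma2, tau = m_step(x, z)
    return mu, sigma2, tau
```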
Full EM

The E-step: find E(Zij), i.e., P(Zij = 1)

Assume θ known & fixed.
A (B): the event that xi was drawn from f1 (f2)
D: the observed datum xi
The expected value of zi1 is P(A | D), since E[zi1] = 0 · P(zi1 = 0) + 1 · P(zi1 = 1) = P(A | D).
Repeat for each xi.
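Written out with Bayes' rule (implied by, rather than shown on, the slide), using the mixture densities and mixing parameters defined earlier:

E[z_{i1}] \;=\; P(A \mid D)
        \;=\; \frac{P(D \mid A)\,P(A)}{P(D)}
        \;=\; \frac{\tau_1\, f(x_i \mid \mu_1, \sigma_1^2)}
                   {\tau_1\, f(x_i \mid \mu_1, \sigma_1^2) + \tau_2\, f(x_i \mid \mu_2, \sigma_2^2)} .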
Complete Data Likelihood (Better):
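The slide's own formula is not reproduced in this text, but the standard complete-data likelihood for this two-component mixture (my reconstruction, consistent with the notation above) is:

P(x_i, z_{i1}, z_{i2} \mid \theta)
  \;=\; \prod_{j=1}^{2} \bigl[\tau_j\, f(x_i \mid \mu_j, \sigma_j^2)\bigr]^{z_{ij}},
\qquad
\log L \;=\; \sum_{i=1}^{n} \sum_{j=1}^{2} z_{ij}\,
       \bigl[\log \tau_j + \log f(x_i \mid \mu_j, \sigma_j^2)\bigr],

which is linear in the zij, so taking its expectation simply replaces each zij with E[zij].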
M-step: find θ maximizing E(log(Likelihood))
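For the Gaussian mixture, this maximization has the familiar weighted-average form (stated here for completeness; it is what the m_step sketch above computes):

\hat{\mu}_j = \frac{\sum_i E[z_{ij}]\, x_i}{\sum_i E[z_{ij}]},
\qquad
\hat{\sigma}_j^2 = \frac{\sum_i E[z_{ij}]\,\bigl(x_i - \hat{\mu}_j\bigr)^2}{\sum_i E[z_{ij}]},
\qquad
\hat{\tau}_j = \frac{1}{n}\sum_i E[z_{ij}] .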
2-Component Mixture
σ1 = σ2 = 1; τ = 0.5

Essentially converged in 2 iterations
(Excel spreadsheet on course web)
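The spreadsheet itself is not reproduced here, but a comparable run can be pieced together from the e_step/m_step sketch above. The data below are illustrative only (my own numbers, not the slide's); σ1 = σ2 = 1 and τ = 0.5 are used as starting values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])  # illustrative data

mu, sigma2, tau = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for it in range(5):
    z = e_step(x, mu, sigma2, tau)          # E-step: soft assignments
    mu, sigma2, tau = m_step(x, z)          # M-step: re-estimate theta
    print(it, mu.round(3), sigma2.round(3), tau.round(3))
```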


Applications

Clustering is a remarkably successful exploratory data analysis tool
  Web search, information retrieval, gene expression, ...
  The model-based approach above is one of the leading ways to do it
Gaussian mixture models are widely used
  With many components, they can empirically match an arbitrary distribution
  Often well-justified, due to "hidden parameters" driving the visible data
EM is extremely widely used for "hidden-data" problems
  E.g., Hidden Markov Models
EM Summary

Fundamentally a maximum-likelihood parameter estimation problem
Useful when there are hidden data, and when the analysis is more tractable if the 0/1 hidden data z were known
Iterate:
  E-step: estimate E(z) for each z, given θ
  M-step: estimate the θ maximizing E(log likelihood), given E(z)
  [where "E(log L)" is taken with respect to the random z ~ E(z) = P(z = 1)]
EM Issues

Under mild assumptions, EM is guaranteed not to decrease the likelihood with each E-M iteration, hence the likelihood values converge.
But it may converge to a local, not global, maximum. (Recall the 4-bump surface...)
The issue is (probably) intrinsic, since EM is often applied to problems, including the clustering above, that are NP-hard (next 3 weeks!).
Nevertheless, it is widely used and often effective.
