The Expectation-Maximization Algorithm
Last lecture:
How to estimate μ given data
[Figure: observed data points plotted on the real line ("Observed Data"), with the mean μ to be estimated.]
More Complex Example
[Figure: two candidate fits to the same data, labeled "This?" and "Or this?"]
Parameters θ:
means: μ1, μ2
variances: σ1², σ2²
mixing parameters: τ1, τ2 = 1 − τ1
P.D.F.s: f(x | μ1, σ1²), f(x | μ2, σ2²)
Likelihood:
L(x1, x2, ..., xn | μ1, μ2, σ1², σ2², τ1, τ2) = ∏_{i=1}^{n} ∑_{j=1}^{2} τj · f(xi | μj, σj²)
No closed-form max.
Likelihood Surface
[Figure: 3D plots of the likelihood surface as a function of μ1 and μ2 for small example datasets xi; the surface shows multiple local maxima.]
Simple Version: “Classification EM”
If zij < .5, pretend it’s 0; zij > .5, pretend it’s 1
I.e., classify points as component 0 or 1
Now recalc θ, assuming that partition
Then recalc zij , assuming that θ
Then re-recalc θ, assuming new zij, etc., etc.
“Full EM” is a bit more involved, but this is the crux.
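A minimal sketch of this classification-EM loop in Python for the two-component Gaussian mixture above (the function name, data, and starting values are illustrative, not from the lecture):

import numpy as np
from scipy.stats import norm

def classification_em(x, mu1, mu2, sigma1=1.0, sigma2=1.0, tau1=0.5, n_iter=25):
    """Hard-assignment ("classification") EM for a 2-component Gaussian mixture."""
    for _ in range(n_iter):
        # z_i1 = P(x_i came from component 1 | x_i, current parameters)
        p1 = tau1 * norm.pdf(x, mu1, sigma1)
        p2 = (1 - tau1) * norm.pdf(x, mu2, sigma2)
        z1 = p1 / (p1 + p2)
        # Pretend z is 0/1: classify each point to one component...
        in1 = z1 >= 0.5
        # ...then recalculate the means from that partition.
        if in1.any():
            mu1 = x[in1].mean()
        if (~in1).any():
            mu2 = x[~in1].mean()
    return mu1, mu2

# Illustrative data drawn near -5 and +5
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-5, 1, 50), rng.normal(5, 1, 50)])
print(classification_em(x, mu1=-1.0, mu2=1.0))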
Full EM
The E-step:
Find E(Zij), i.e. P(Zij=1)
Assume θ known & fixed
A (B): the event that xi was drawn from f1 (f2)
D: the observed datum xi
Expected value of zi1 is E[zi1] = 0 · P(zi1 = 0 | D) + 1 · P(zi1 = 1 | D) = P(A | D)
By Bayes' rule: P(A | D) = τ1 f(xi | μ1, σ1²) / (τ1 f(xi | μ1, σ1²) + τ2 f(xi | μ2, σ2²))
Repeat for each xi.
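A sketch of this E-step in Python, assuming the two-component mixture notation above (the function name and array layout are illustrative):

import numpy as np
from scipy.stats import norm

def e_step(x, mu, sigma2, tau):
    """E-step: z[i, j] = E[z_ij] = P(x_i was drawn from component j | x_i, theta).

    x: n observations; mu, sigma2, tau: length-2 arrays of means, variances,
    and mixing parameters. Returns an (n, 2) array of responsibilities.
    """
    # Bayes' rule numerators: tau_j * f(x_i | mu_j, sigma_j^2)
    weighted = np.column_stack(
        [tau[j] * norm.pdf(x, mu[j], np.sqrt(sigma2[j])) for j in range(2)]
    )
    # Normalize over components so each row sums to 1
    return weighted / weighted.sum(axis=1, keepdims=True)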
Complete Data Likelihood (Better):
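One standard form of the complete-data log-likelihood for this two-component mixture (a sketch assuming the usual formulation, where zij ∈ {0, 1} indicates that xi came from component j):

log L(x, z | θ) = ∑_{i=1}^{n} ∑_{j=1}^{2} zij [ log τj + log f(xi | μj, σj²) ]

Because each zij selects exactly one component per point, the log applies to a single term rather than a sum, which is what makes maximization tractable in the M-step.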
M-step:
Find θ maximizing E(log(Likelihood))
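A sketch of the corresponding M-step for the two-component Gaussian mixture, using the standard closed-form maximizers of the expected complete-data log-likelihood (array layout matches the e_step sketch above):

import numpy as np

def m_step(x, z):
    """M-step: choose theta maximizing E[log complete-data likelihood],
    given the expected hidden values z from the E-step.

    x: array of n observations; z: (n, 2) array of responsibilities E[z_ij].
    Returns updated (mu, sigma2, tau).
    """
    n_j = z.sum(axis=0)                      # expected count per component
    tau = n_j / len(x)                       # mixing parameters
    mu = (z * x[:, None]).sum(axis=0) / n_j  # responsibility-weighted means
    sigma2 = (z * (x[:, None] - mu) ** 2).sum(axis=0) / n_j  # weighted variances
    return mu, sigma2, tau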
2 Component Mixture
σ1 = σ2 = 1; τ = 0.5
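A self-contained sketch of full EM for exactly this setting, with σ1 = σ2 = 1 and τ1 = τ2 = 0.5 held fixed so only the two means are estimated (data and starting values are made up for illustration):

import numpy as np
from scipy.stats import norm

def em_fixed_sigma_tau(x, mu, n_iter=50, tol=1e-8):
    """EM for a 2-component Gaussian mixture with sigma1 = sigma2 = 1 and
    tau1 = tau2 = 0.5 fixed; only the means are updated."""
    mu = np.asarray(mu, dtype=float)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: component-weighted densities and responsibilities
        weighted = np.column_stack([0.5 * norm.pdf(x, m, 1.0) for m in mu])
        z = weighted / weighted.sum(axis=1, keepdims=True)
        # Observed-data log-likelihood at the current means; EM never decreases it
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: responsibility-weighted means (sigma and tau stay fixed)
        mu = (z * x[:, None]).sum(axis=0) / z.sum(axis=0)
    return mu, ll

# Illustrative data from means -2 and +3
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
print(em_fixed_sigma_tau(x, mu=[0.0, 1.0]))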
EM Summary
Fundamentally a maximum likelihood parameter
estimation problem
Useful when there is hidden data, and when analysis would be more
tractable if the 0/1 hidden data z were known
Iterate:
E-step: estimate E(z) for each z, given θ
M-step: estimate θ maximizing E(log likelihood)
given E(z) [where E(log L) is taken with respect to the random z, with P(z = 1) = E(z)]
EM Issues
Under mild assumptions, EM is guaranteed to increase (or at least not
decrease) the likelihood at every E-M iteration, hence the likelihood
converges.
But it may converge to a local, not global, max.
(Recall the 4-bump surface...)
Issue is intrinsic (probably), since EM is often
applied to problems (including clustering,
above) that are NP-hard (next 3 weeks!)
Nevertheless, widely used, often effective