GMM EM Notes
This problem occurs within the log likelihood for GMM, so it is difficult to maximize the likelihood. The Expectation-Maximization (EM) procedure is a way to handle $\log \sum$. It uses Jensen's inequality to create a lower bound (called an auxiliary function) for the likelihood that uses $\sum \log$ instead. We can maximize the auxiliary function, which leads to an increase in the likelihood. We repeat this process at each iteration (constructing the auxiliary function and maximizing it), leading to a local maximum of the log likelihood for GMM. Let us walk through this process, deriving the EM algorithm along the way.
$$z_i \mid w \sim \text{Categorical}(w),$$

which means that wk is the probability that i's cluster is k. That is, P(zi = k | w) = wk. Also,

$$x_i \mid z_i = k \sim \mathcal{N}(\mu_k, \Sigma_k).$$
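To make the generative story concrete, here is a minimal sketch of sampling from this model with NumPy (my own illustration; the parameter values and cluster count are made up):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed for this sketch): K = 2 clusters in 2 dimensions.
w = np.array([0.3, 0.7])                        # mixture weights w_k
mu = np.array([[0.0, 0.0], [3.0, 3.0]])         # cluster centers mu_k
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # cluster spreads Sigma_k

n = 500
z = rng.choice(len(w), size=n, p=w)             # z_i | w ~ Categorical(w)
x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])  # x_i | z_i = k ~ N(mu_k, Sigma_k)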
Just to recap the notation:
xi → data
zi → cluster assignment for i
µk → center of cluster k
Σk → spread of cluster k
wk → proportion of data in cluster k (mixture weights)
In the likelihood, the sum over k is over the possible cluster assignments for point i. Taking the log,
$$\begin{aligned}
\log \text{likelihood}(\theta) &= \log \prod_i \sum_k P(X_i = x_i \mid z_i = k, \theta)\, P(z_i = k \mid \theta) \\
&= \sum_i \log \sum_k P(X_i = x_i \mid z_i = k, \theta)\, P(z_i = k \mid \theta).
\end{aligned}$$
As we know, we cannot pass the log through the sum.
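In practice, the log of a sum is evaluated directly with a log-sum-exp, which keeps the computation numerically stable even though it does not make the maximization any easier. A minimal sketch, assuming the GMM notation above (the function name is mine):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(x, w, mu, Sigma):
    """Evaluate sum_i log sum_k w_k N(x_i; mu_k, Sigma_k)."""
    # log of each term inside the inner sum: log w_k + log N(x_i; mu_k, Sigma_k)
    log_terms = np.column_stack([
        np.log(w[k]) + multivariate_normal.logpdf(x, mean=mu[k], cov=Sigma[k])
        for k in range(len(w))
    ])                                         # shape (n, K)
    return logsumexp(log_terms, axis=1).sum()  # the log wraps the sum; it does not distribute over it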
You might think this problem is specific to the model we are working on (Gaussian mixture models), but the problem is much more general! In fact, every time you have a latent variable like z, the same problem happens. Latent variables occur in lots of problems, not just clustering. For clustering, they appear almost all the time, since you do not know which cluster a point really belongs to, so the cluster assignment is latent (hidden). Here is where we need a tool. That tool is Expectation-Maximization (EM). We will get back to Gaussian mixture models after introducing EM.
Expectation Maximization
EM creates an iterative procedure where we update the zi's and then update µ, Σ, and w. It is an alternating optimization scheme similar to k-means; a small sketch of the resulting loop follows the list below.
• E-step: compute cluster assignments (which are probabilistic)
• M-step: update θ (which are the clusters’ properties)
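Here is that loop sketched in Python. The e_step and m_step functions are placeholders (my own naming, not from the notes); their GMM-specific versions are derived below.

import numpy as np

def run_em(x, theta0, e_step, m_step, log_likelihood, n_iters=100, tol=1e-6):
    """Generic EM: alternate the E-step and M-step until the log likelihood stops improving."""
    theta = theta0
    prev_ll = -np.inf
    for _ in range(n_iters):
        gamma = e_step(x, theta)        # E-step: probabilistic cluster assignments
        theta = m_step(x, gamma)        # M-step: update the cluster parameters theta
        ll = log_likelihood(x, theta)   # never decreases from one iteration to the next
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta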
Incidentally, if we looked instead at the “complete” log likelihood p(x, z | θ) (meaning that you know the zi's), there is no sum and the issue with the sum and the log goes away! This is because you no longer need to sum over k; you already know which cluster k each point i is in.
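To see this concretely for GMM (my own expansion of the point), the complete log likelihood is a plain sum of logs:

$$\log p(x, z \mid \theta) = \sum_i \log \big( w_{z_i}\, \mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i}) \big) = \sum_i \Big( \log w_{z_i} + \log \mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i}) \Big),$$

with no sum over k inside the log.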
Let's start over from scratch. We are now in a very general setting. The data are still drawn independently, and each data point has a hidden variable associated with it. Notation for the data and hidden variables:
x1, ..., xn → data
z1, ..., zn → hidden variables, taking values k = 1, ..., K
θ → parameters
Then,
$$\begin{aligned}
\log \text{likelihood}(\theta) &= \log P(X_1, \ldots, X_n = x_1, \ldots, x_n \mid \theta) \\
&= \sum_i \log P(X_i = x_i \mid \theta) && \text{(by independence)} \\
&= \sum_i \log \sum_k P(X_i = x_i, Z_i = k \mid \theta) && \text{(hidden variables)} \\
&= \sum_i \log \sum_k P(Z_i = k \mid \theta)\, P(X_i = x_i \mid Z_i = k, \theta).
\end{aligned}$$
The figure below illustrates a few iterations of EM. Starting at θt (iteration t, shown in orange), we construct the surrogate lower bound A(θ, θt). When we maximize it, our likelihood increases. The maximum of A(θ, θt) occurs at θt+1, which we will use at the next iteration. We evaluate the log likelihood at θt+1, again construct a surrogate lower bound A(θ, θt+1), and maximize it to get to the next iterate, θt+2, and so on. At each iteration, the likelihood increases. Note that this procedure leads to local maxima, not necessarily global maxima.
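The reason the likelihood can only go up is the following short chain of inequalities (a standard argument, stated in the notation used here):

$$\log \text{likelihood}(\theta_{t+1}) \;\ge\; A(\theta_{t+1}, \theta_t) \;\ge\; A(\theta_t, \theta_t) \;=\; \log \text{likelihood}(\theta_t).$$

The first inequality holds because A(·, θt) is a lower bound on the log likelihood, the second because θt+1 maximizes A(·, θt), and the final equality is the sanity check carried out below.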
Let us write out the procedure for constructing A, starting with the log likelihood.
$$\begin{aligned}
\log \text{likelihood}(\theta) &= \sum_i \log \sum_k P(X_i = x_i, Z_i = k \mid \theta) && \text{(from above)} \\
&= \sum_i \log \sum_k P(Z_i = k \mid x_i, \theta_t)\, \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)},
\end{aligned}$$
where we have multiplied by 1 in disguise, namely P(Zi = k | xi, θt) in both the numerator and denominator. (This turns out to be the best possible choice for this 1 in disguise.) The weighted average ∑k P(Zi = k | xi, θt)⟨stuff⟩ can be viewed as an expectation because it is a sum of elements weighted by probabilities that add up to 1. We will call it Ez.
$$\log \text{likelihood}(\theta) = \sum_i \log\, \mathbb{E}_z\!\left[ \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)} \right].$$
We will now use Jensen's inequality, which allows us to switch a log and an expectation. However, it is easy to forget which way Jensen's inequality goes. I have a picture that helps me remember. The distribution on the x-axis is uniform. We first find E(X), then f(E(X)), which is fairly small. Afterwards we note that E(f(X)) is larger, because it averages over f(x), which has large values in it because f is convex. So for a convex f, E(f(X)) ≥ f(E(X)); since log is concave, the inequality flips, giving log E(X) ≥ E(log X), which is exactly the direction we need for a lower bound.
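A quick numerical check of that picture (my own sketch; the convex function and the uniform distribution are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=100_000)    # X uniform on [1, 5]

f = np.exp                                 # a convex function
print(f(x.mean()), f(x).mean())            # f(E[X]) < E[f(X)]   (Jensen, convex case)
print(np.log(x.mean()), np.log(x).mean())  # log(E[X]) > E[log X] (log is concave, so it flips)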
Back to where we were:
$$\begin{aligned}
\log \text{likelihood}(\theta) &= \sum_i \log\, \mathbb{E}_z\!\left[ \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)} \right] \\
&\ge \sum_i \mathbb{E}_z\!\left[ \log \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)} \right] && \text{(Jensen's inequality)} \\
&= \sum_i \sum_k P(Z_i = k \mid x_i, \theta_t)\, \log \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)} \;=:\; A(\theta, \theta_t).
\end{aligned}$$
Sanity check
Let's make sure that A(θt, θt) is log likelihood(θt).
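Here is a short version of that check (my own writing). The key fact is that P(Xi = xi, Zi = k | θt) / P(Zi = k | xi, θt) = P(Xi = xi | θt), which does not depend on k:

$$\begin{aligned}
A(\theta_t, \theta_t) &= \sum_i \sum_k P(Z_i = k \mid x_i, \theta_t)\, \log \frac{P(X_i = x_i, Z_i = k \mid \theta_t)}{P(Z_i = k \mid x_i, \theta_t)} \\
&= \sum_i \sum_k P(Z_i = k \mid x_i, \theta_t)\, \log P(X_i = x_i \mid \theta_t) \\
&= \sum_i \log P(X_i = x_i \mid \theta_t) \;=\; \log \text{likelihood}(\theta_t),
\end{aligned}$$

since for each i the weights P(Zi = k | xi, θt) sum to 1 over k.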
Back to EM
Recall our auxiliary function, which is a function of θ.
$$A(\theta, \theta_t) := \sum_i \sum_k P(Z_i = k \mid x_i, \theta_t)\, \log \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)}.$$
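As a numerical aid (my own sketch, not part of the notes), the following evaluates A(θ, θt) for a GMM so you can check the two properties we just used: A(θ, θt) ≤ log likelihood(θ) for any θ, with equality at θ = θt. It reuses the hypothetical gmm_log_likelihood helper sketched earlier, and θ is packed as a tuple (w, mu, Sigma).

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_joint(x, w, mu, Sigma):
    """log P(X_i = x_i, Z_i = k | theta) = log w_k + log N(x_i; mu_k, Sigma_k); shape (n, K)."""
    return np.column_stack([
        np.log(w[k]) + multivariate_normal.logpdf(x, mean=mu[k], cov=Sigma[k])
        for k in range(len(w))
    ])

def auxiliary(x, theta, theta_t):
    """A(theta, theta_t) as defined in the display above."""
    lj_t = log_joint(x, *theta_t)
    log_gamma = lj_t - logsumexp(lj_t, axis=1, keepdims=True)  # log P(Z_i = k | x_i, theta_t)
    lj = log_joint(x, *theta)
    return np.sum(np.exp(log_gamma) * (lj - log_gamma))

# For any theta:  auxiliary(x, theta, theta_t) <= gmm_log_likelihood(x, *theta),
# and the two sides agree (up to floating-point error) when theta == theta_t.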
• M-step:

$$\max_\theta A(\theta, \theta_t) = \max_\theta \sum_i \sum_k \gamma_{ik}\, \log \frac{P(X_i = x_i, Z_i = k \mid \theta)}{\gamma_{ik}},$$

where γik := P(Zi = k | xi, θt). The term in the denominator doesn't depend on θ, so it is not involved in the maximization. Thus it becomes:

$$\max_\theta \sum_i \sum_k \gamma_{ik}\, \log P(X_i = x_i, Z_i = k \mid \theta).$$
This might not be very satisfying, but it’s too late to rename it I suppose.
Back to GMM
Let us now apply EM to GMM. Here is a reminder of the notation:
xi → data
zi → cluster assignment for i
µk → center of cluster k
Σk → spread of cluster k
wk → proportion of data in cluster k (mixture weights)
• E-step: Using Bayes' Rule,

$$P(Z_i = k \mid x_i, \theta_t) = \frac{P(X_i = x_i \mid z_i = k, \theta_t)\, P(Z_i = k \mid \theta_t)}{P(X_i = x_i \mid \theta_t)}.$$
The denominator equals a sum over k of terms like those in the numerator,
by the law of total probability. We can calculate all of the terms thanks to
our assumptions for GMM.
$$P(Z_i = k \mid x_i, \theta_t) = \frac{\mathcal{N}(x_i; \mu_{k,t}, \Sigma_{k,t})\, w_{k,t}}{\sum_{k'} \mathcal{N}(x_i; \mu_{k',t}, \Sigma_{k',t})\, w_{k',t}} =: \gamma_{ik}.$$

• M-step: Skipping some steps, setting the derivative of the auxiliary function to 0 to get µk,t+1:

$$\mu_{k,t+1} = \frac{\sum_i \gamma_{ik}\, x_i}{\sum_i \gamma_{ik}},$$

which is the mean of the xi's, weighted by the probability of being in cluster k. (It's hard to imagine this calculation could turn out any other way.)
Again skipping steps, setting the derivatives of the auxiliary function to 0
to get Σk,t+1 :
$$\Sigma_{k,t+1} = \frac{\sum_i \gamma_{ik}\, (x_i - \mu_{k,t+1})(x_i - \mu_{k,t+1})^T}{\sum_i \gamma_{ik}}.$$
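Putting the E-step and these two updates into NumPy gives the GMM-specific pieces for the generic loop sketched earlier (my own code; the weight update is added a little further down):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(x, w, mu, Sigma):
    """gamma[i, k] = P(Z_i = k | x_i, theta_t), computed in log space for stability."""
    log_num = np.column_stack([
        np.log(w[k]) + multivariate_normal.logpdf(x, mean=mu[k], cov=Sigma[k])
        for k in range(len(w))
    ])
    return np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))

def m_step_mu_Sigma(x, gamma):
    """Weighted mean and covariance updates from the two displays above."""
    Nk = gamma.sum(axis=0)                    # sum_i gamma_ik, one entry per cluster
    mu = (gamma.T @ x) / Nk[:, None]          # mu_{k,t+1}
    Sigma = np.stack([
        (gamma[:, k, None] * (x - mu[k])).T @ (x - mu[k]) / Nk[k]   # Sigma_{k,t+1}
        for k in range(gamma.shape[1])
    ])
    return mu, Sigma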
For the mixture weights we also need the constraint that the wk sum to 1, so we maximize the Lagrangian

$$L(\theta, \theta_t) = A(\theta, \theta_t) - \lambda \Big( \sum_k w_k - 1 \Big),$$

where λ is the Lagrange multiplier. Remember that wk is part of θ. Taking the derivative, and using the index k′ so as not to be confused with the sum over k:
$$\begin{aligned}
\frac{\partial L(\theta, \theta_t)}{\partial w_{k'}} &= \frac{\partial A(\theta, \theta_t)}{\partial w_{k'}} - \lambda \\
&= \frac{\partial}{\partial w_{k'}} \left[ \sum_i \sum_k \gamma_{ik}\, \log P(X_i = x_i, Z_i = k \mid \theta) \right] - \lambda \\
&= \sum_i \left( \frac{\partial}{\partial w_{k'}} \big[ \gamma_{ik'} \log w_{k'} \big] + \frac{\partial}{\partial w_{k'}} \big[ \gamma_{ik'} \log \mathcal{N}(x_i; \mu_{k'}, \Sigma_{k'}) \big] \right) - \lambda,
\end{aligned}$$

where we used log P(Xi = xi, Zi = k | θ) = log wk + log N(xi; µk, Σk), and only the k = k′ term has a nonzero derivative with respect to wk′. Here, log N(xi; µk′, Σk′) does not depend on wk′, so we can remove that term:

$$\frac{\partial L(\theta, \theta_t)}{\partial w_{k'}} = \sum_i \frac{\partial}{\partial w_{k'}} \big[ \gamma_{ik'} \log w_{k'} \big] - \lambda = \sum_i \gamma_{ik'} \frac{1}{w_{k'}} - \lambda = \frac{1}{w_{k'}} \sum_i \gamma_{ik'} - \lambda.$$
Setting this derivative to 0 and solving gives wk′ = (1/λ) ∑i γik′. Summing over k′ and using the constraint that the wk′ sum to 1 gives λ = ∑i ∑k γik = n, where ∑k P(Zi = k | xi, θt) = 1 because it is the sum over the whole probability distribution. Thus, we finally have our last update for the iterative procedure to optimize the parameters of GMM:

$$w_{k',t+1} = \frac{\sum_i \gamma_{ik'}}{n}.$$
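In code this last update is a one-liner, completing the M-step sketched earlier (again my own naming):

def m_step_w(gamma):
    """w_{k,t+1} = (sum_i gamma_ik) / n."""
    return gamma.sum(axis=0) / gamma.shape[0]

# One full EM iteration for GMM, using the earlier sketches:
#   gamma = e_step(x, w, mu, Sigma)
#   mu, Sigma = m_step_mu_Sigma(x, gamma)
#   w = m_step_w(gamma)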
We are now done with Gaussian mixture models. I’ll leave you with a final big-
picture summary of the update procedure, which looks quite similar to k-means:
E: What is the current estimate of the probability that xi comes from cluster k? It is γik.
M: Given those probabilities, what are the best estimates of the clusters' centers, spreads, and weights? They are the weighted updates for µk, Σk, and wk derived above.