
Gaussian Mixture Models and Expectation Maximization

Duke Course Notes


Cynthia Rudin
Gaussian Mixture Models is a “soft” clustering algorithm, where each point prob-
abilistically “belongs” to all clusters. This is different than k-means where each
point belongs to one cluster (“hard” cluster assignments).
We remind the reader of the following fact: $\log \sum$ is not fun. In other words, when we have a log of a sum, there is no way to reduce it.

This problem occurs within the log likelihood for GMM, so it is difficult to maximize the likelihood. The Expectation-Maximization (EM) procedure is a way to handle $\log \sum$. It uses Jensen's inequality to create a lower bound (called an auxiliary function) for the likelihood that uses $\sum \log$ instead. We can maximize the auxiliary function, which leads to an increase in the likelihood. We repeat this process at each iteration (constructing the auxiliary function and maximizing it), leading to a local maximum of the log likelihood for GMM. Let us walk through this process, deriving the EM algorithm along the way.

Here is GMM’s generative model:


• First, generate which cluster i is going to be generated from:

$$z_i \mid w \sim \text{Categorical}(w),$$

which means that $w_k$ is the probability that i's cluster is k. That is,

$$P(z_i = k \mid w) = w_k.$$

Here, the $w_k$ are called the mixture weights, and they are a discrete probability distribution: $\sum_k w_k = 1$, $0 \le w_k \le 1$.

• Then, generate $x_i$ from the cluster's distribution:

$$x_i \mid z_i = k \sim \mathcal{N}(\mu_k, \Sigma_k).$$

Just to recap the notation:
xi → data
zi → cluster assignment for i
µk → center of cluster k
Σk → spread of cluster k
wk → proportion of data in cluster k (mixture weights)

As a reminder, here is the formula for the normal distribution:


$$p(X = x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}\sqrt{|\Sigma|}} \exp\left( -\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu) \right).$$
Here is a picture of the generative process, where first I generated the cluster
centers and covariances, and then generated points for each cluster, where the
number of points I generated is proportional to the mixture weights.
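To make the generative story concrete, here is a minimal sampling sketch (my own illustration, not part of the notes; the specific weights, means, and covariances are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters for K = 2 clusters in 2 dimensions.
w = np.array([0.3, 0.7])                      # mixture weights, sum to 1
mu = np.array([[0.0, 0.0], [5.0, 5.0]])       # cluster means
Sigma = np.array([np.eye(2), 2 * np.eye(2)])  # cluster covariances

n = 500
# Step 1: draw cluster assignments z_i ~ Categorical(w).
z = rng.choice(len(w), size=n, p=w)
# Step 2: draw x_i ~ N(mu_k, Sigma_k) for the chosen cluster k = z_i.
x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
```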

Likelihood for GMM

$$\text{likelihood} = P(\{X_1, ..., X_n\} = \{x_1, ..., x_n\} \mid w, \mu, \Sigma),$$


where $w = [w_1, ..., w_K]$, $\mu = [\mu_1, ..., \mu_K]$, $\Sigma = [\Sigma_1, ..., \Sigma_K]$. I will denote the collection of these variables as θ. Assuming independence of data points,

$$\begin{aligned}
\text{likelihood}(\theta) &= \prod_i P(X_i = x_i \mid \theta) \\
&= \prod_i \sum_{k=1}^{K} P(X_i = x_i \mid z_i = k, \theta)\, P(z_i = k \mid \theta) \quad \text{(law of total probability)} \\
&= \prod_i \sum_{k=1}^{K} \mathcal{N}(x_i; \mu_k, \Sigma_k)\, w_k.
\end{aligned}$$

On the second and third lines above, the sum is over possible cluster assignments
k for point i. Taking the log,
$$\begin{aligned}
\log \text{likelihood}(\theta) &= \log \prod_i \sum_k P(X_i = x_i \mid z_i = k, \theta)\, P(z_i = k \mid \theta) \\
&= \sum_i \log \sum_k P(X_i = x_i \mid z_i = k, \theta)\, P(z_i = k \mid \theta).
\end{aligned}$$
As we know, we cannot pass the log through the sum.
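Evaluating this log likelihood for a fixed θ is straightforward, even though maximizing it in closed form is not. Here is a minimal numerical sketch (my own illustration, not from the notes), handling the inner sum over k with a log-sum-exp for stability:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(x, w, mu, Sigma):
    """log likelihood(theta) = sum_i log sum_k N(x_i; mu_k, Sigma_k) w_k."""
    K = len(w)
    # log of each term N(x_i; mu_k, Sigma_k) * w_k, shape (n, K)
    log_terms = np.column_stack([
        multivariate_normal.logpdf(x, mean=mu[k], cov=Sigma[k]) + np.log(w[k])
        for k in range(K)
    ])
    # log of the inner sum over k, then sum over data points i
    return logsumexp(log_terms, axis=1).sum()
```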

You might think this problem is specific just to the one we’re working on (Gaus-
sian mixture models) but the problem is much more general! In fact, every time
you have a latent variable like z, the same problem happens. Latent variables
occur in lots of problems, not just clustering. For clustering, they happen almost
all the time since you do not know which cluster a point may really belong to, so
the cluster assignment is latent (hidden). Here is where we need a tool. That
tool is Expectation-Maximization (EM). We will get back to Gaussian Mixture
models after introducing EM.

Expectation Maximization
EM creates an iterative procedure where we update the zi ’s and then update µ,
Σ, and w. It is an alternating optimization scheme similar to k-means.
• E-step: compute cluster assignments (which are probabilistic)
• M-step: update θ (which are the clusters’ properties)
Incidentally, if we looked instead at the "complete" log likelihood p(x, z|θ) (meaning that you know the zi's), there is no sum inside the log, and the issue with the sum and the log goes away! This is because you no longer need to sum over k; you already know which cluster k unit i is in.
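To spell this out (my own short expansion, using the GMM notation above): if every $z_i$ were observed, the complete log likelihood would be

$$\log p(x, z \mid \theta) = \sum_i \log\big( w_{z_i}\, \mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i}) \big) = \sum_i \big( \log w_{z_i} + \log \mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i}) \big),$$

where each term involves only a single known cluster, so there is no sum inside the log.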

Let’s start over from scratch. We are now in a very general setting. The data are
still drawn independently, and each data point has a hidden variable associated with
it. Notation for data and hidden variables is:
x1 , ..., xn data
z1 , ..., zn hidden variables, taking values k = 1...K
θ parameters

Then,
$$\begin{aligned}
\log \text{likelihood}(\theta) &= \log P(X_1, ..., X_n = x_1, ..., x_n \mid \theta) \\
&= \sum_i \log P(X_i = x_i \mid \theta) \quad \text{(by independence)} \\
&= \sum_i \log \sum_k P(X_i = x_i, Z_i = k \mid \theta) \quad \text{(hidden variables)} \\
&= \sum_i \log \sum_k P(Z_i = k \mid \theta)\, P(X_i = x_i \mid Z_i = k, \theta).
\end{aligned}$$

The idea of Expectation Maximization (EM) is to find a lower bound on likelihood(θ) that involves P(x, z|θ). Maximizing the lower bound always leads to higher values of likelihood(θ).

The figure below illustrates a few iterations of EM. Starting at θt with iteration t
in orange, we construct the surrogate lower bound A(θ, θt ). When we maximize
it, our likelihood increases. The maximum of A(θ, θt ) occurs at θt+1 that we will
use at the next iteration. We evaluate the log likelihood of θt+1 , again construct
a surrogate lower bound A(θ, θt+1 ), and maximize it to get to the next iteration,
which occurs at point θt+2 , etc. At each iteration, the likelihood increases.

Note that this procedure leads to local maxima, not necessarily global maxima.

Let us write out the procedure for constructing A, starting with the log likelihood.
$$\begin{aligned}
\log \text{likelihood}(\theta) &= \sum_i \log \sum_k P(X_i = x_i, Z_i = k \mid \theta) \quad \text{(from above)} \\
&= \sum_i \log \sum_k P(Z_i = k \mid x_i, \theta_t)\, \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)},
\end{aligned}$$

where we have multiplied by 1 in disguise, namely P(Zi = k|xi, θt) in both the numerator and denominator. (This turns out to be the best possible choice for this 1 in disguise.) The weighted average $\sum_k P(Z_i = k \mid x_i, \theta_t)\,\langle\text{stuff}\rangle$ can be viewed as an expectation because it's a sum of elements weighted by probabilities that add up to 1. We will call it $\mathbb{E}_z$.

$$\log \text{likelihood}(\theta) = \sum_i \log \mathbb{E}_z \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)}.$$

We will now use Jensen’s inequality for convex functions, which allows us to
switch a log and an expectation. However, it is easy to forget which way Jensen’s
inequality goes. I have a picture that helps me remember. The distribution on
the x-axis is uniform. We first find E(X), then f (E(X)), which is fairly small.
Afterwards we note that E(f (X)) is larger, because it averages over f (x), which
has large values in it because f is convex.

At this point, it is clear which way Jensen’s inequality goes.

Lemma (Jensen’s Inequality). If f is convex, then f (EX) ≤ E(f (X)).

If f is convex, then −f is concave, and thus −f(EX) ≥ −E(f(X)) = E(−f(X)). Here, −f(x) = log(x), which is concave; thus, log(EX) ≥ E log X.
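As a quick numerical check (my own example, not in the notes): let X be uniform on {1, 3}. Then

$$\log(\mathbb{E}X) = \log 2 \approx 0.693, \qquad \mathbb{E}\log X = \tfrac{1}{2}(\log 1 + \log 3) \approx 0.549,$$

so indeed $\log(\mathbb{E}X) \ge \mathbb{E}\log X$.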

Back to where we were:
$$\begin{aligned}
\log \text{likelihood}(\theta) &= \sum_i \log \mathbb{E}_z \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)} \\
&\ge \sum_i \mathbb{E}_z \log \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)} \quad \text{(Jensen's inequality)} \\
&= \sum_i \sum_k P(Z_i = k \mid x_i, \theta_t) \log \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)} =: A(\theta, \theta_t).
\end{aligned}$$

A(·, θt ) is called the auxiliary function.

Sanity check
Let’s make sure that A(θt , θt ) is log likelihood(θt ).

$$A(\theta_t, \theta_t) = \sum_i \sum_k P(Z_i = k \mid x_i, \theta_t) \log \frac{P(X_i = x_i, Z_i = k \mid \theta_t)}{P(Z_i = k \mid x_i, \theta_t)}.$$

From the definition of conditional probability, $P(X_i = x_i, Z_i = k \mid \theta_t) = P(Z_i = k \mid x_i, \theta_t)\, P(X_i = x_i \mid \theta_t)$. Plugging this in,

$$A(\theta_t, \theta_t) = \sum_i \sum_k P(Z_i = k \mid x_i, \theta_t) \log P(X_i = x_i \mid \theta_t).$$

Note that $\sum_k P(Z_i = k \mid x_i, \theta_t) = 1$ because this is a sum over a whole probability distribution, and the other term doesn't depend on k. So,

$$A(\theta_t, \theta_t) = \sum_i \log P(X_i = x_i \mid \theta_t) = \log \prod_i P(X_i = x_i \mid \theta_t) = \log \text{likelihood}(\theta_t).$$

Sanity check complete.

Back to EM
Recall our auxiliary function, which is a function of θ.
$$A(\theta, \theta_t) := \sum_i \sum_k P(Z_i = k \mid x_i, \theta_t) \log \frac{P(X_i = x_i, Z_i = k \mid \theta)}{P(Z_i = k \mid x_i, \theta_t)},$$

where the same term, $P(Z_i = k \mid x_i, \theta_t)$, appears twice: as the weight in front of the log and in the denominator inside the log.


• E-step: compute $P(Z_i = k \mid x_i, \theta_t) =: \gamma_{ik}$ for each i, k.

• M-step:

$$\max_\theta A(\theta, \theta_t) = \max_\theta \sum_i \sum_k \gamma_{ik} \log \frac{P(X_i = x_i, Z_i = k \mid \theta)}{\gamma_{ik}}.$$

The term in the denominator doesn't depend on θ, so it is not involved in the maximization. Thus it becomes:

$$\max_\theta \sum_i \sum_k \gamma_{ik} \log P(X_i = x_i, Z_i = k \mid \theta).$$

To maximize, we take the derivative and set it to 0, as usual.


Why is the “E” step called “Expectation” rather than “Probability”? Let us
define indicator ξik = 1 if Zi = k and 0 otherwise. (Remember, I showed you in
a past lecture that expectations of indicator variables are probabilities.) Then,

$$P(Z_i = k \mid x_i, \theta_t) = 1 \cdot P(\xi_{ik} = 1 \mid x_i, \theta_t) + 0 \cdot P(\xi_{ik} = 0 \mid x_i, \theta_t) = \mathbb{E}_{\xi_{ik}}[\xi_{ik}].$$

This might not be very satisfying, but it’s too late to rename it I suppose.

Back to GMM
Let us now apply EM to GMM. Here is a reminder of the notation:

$w_{k,t}$ = probability to belong to cluster k at iteration t
$\mu_{k,t}$ = mean of cluster k at iteration t
$\Sigma_{k,t}$ = covariance of cluster k at iteration t

and $\theta_t$ is the collection of $(w_{k,t}, \mu_{k,t}, \Sigma_{k,t})$'s at iteration t.

• E-step: Using Bayes' Rule,

$$P(Z_i = k \mid x_i, \theta_t) = \frac{P(X_i = x_i \mid z_i = k, \theta_t)\, P(Z_i = k \mid \theta_t)}{P(X_i = x_i \mid \theta_t)}.$$

The denominator equals a sum over k of terms like those in the numerator, by the law of total probability. We can calculate all of the terms thanks to our assumptions for GMM:

$$P(Z_i = k \mid x_i, \theta_t) = \frac{\mathcal{N}(x_i; \mu_{k,t}, \Sigma_{k,t})\, w_{k,t}}{\sum_{k'} \mathcal{N}(x_i; \mu_{k',t}, \Sigma_{k',t})\, w_{k',t}} =: \gamma_{ik}.$$

This is similar to k-means, where we assign each point to a cluster at iteration t. Here, though, the cluster assignments are probabilistic. (I could have indexed $\gamma_{ik}$ also by t since it changes at each t, but instead I will just replace its value at each iteration.) A code sketch of both the E-step and the M-step updates appears after the derivation below.
• M-step: Here is the auxiliary function we will maximize:

$$\max_\theta A(\theta, \theta_t) = \max_\theta \sum_i \sum_k \gamma_{ik} \log P(X_i = x_i, Z_i = k \mid \theta).$$

Update θ, which is the collection w, µ, Σ, by setting derivatives of A to 0, with one constraint: $\sum_k w_k = 1$, so that the categorical distribution is well-defined. After a small amount of calculation (skipping steps here, setting the derivatives to zero and solving), the result for the cluster means is:

$$\mu_{k,t+1} = \frac{\sum_i x_i\, \gamma_{ik}}{\sum_i \gamma_{ik}},$$

which is the mean of the $x_i$'s, weighted by the probability of being in cluster k. (It's hard to imagine this calculation could turn out any other way.) Again skipping steps, setting the derivatives of the auxiliary function to 0 to get $\Sigma_{k,t+1}$:

$$\Sigma_{k,t+1} = \frac{\sum_i \gamma_{ik}\, (x_i - \mu_{k,t+1})(x_i - \mu_{k,t+1})^T}{\sum_i \gamma_{ik}}.$$

The update for w is trickier because of the constraint. We need to do constrained optimization. The Lagrangian is:

$$L(\theta, \theta_t) = A(\theta, \theta_t) + \lambda \left( 1 - \sum_k w_k \right),$$

where λ is the Lagrange multiplier. Remember that wk is part of θ. Taking
the derivative, and using index k ′ so as not to be confused with the sum
over k:
$$\begin{aligned}
\frac{\partial L(\theta, \theta_t)}{\partial w_{k'}} &= \frac{\partial A(\theta, \theta_t)}{\partial w_{k'}} - \lambda \\
&= \frac{\partial}{\partial w_{k'}} \left( \sum_i \sum_k \gamma_{ik} \log P(X_i = x_i, Z_i = k \mid \theta) \right) - \lambda. \qquad (1)
\end{aligned}$$

As an aside, we know, by the probabilistic model for generating data according to GMM (here we're using the fact that we already solved for the means and covariances at iteration t + 1),

$$P(X_i = x_i, Z_i = k \mid \theta) = P(Z_i = k \mid w) \cdot P(X_i = x_i \mid Z_i = k, \mu_{k,t+1}, \Sigma_{k,t+1}) = w_k \cdot \mathcal{N}(x_i; \mu_{k,t+1}, \Sigma_{k,t+1}).$$

Plugging back into (1),

$$\begin{aligned}
\frac{\partial L(\theta, \theta_t)}{\partial w_{k'}} &= \sum_i \frac{\partial}{\partial w_{k'}} \big[ \gamma_{ik'} \log\big( w_{k'}\, \mathcal{N}(x_i; \mu_{k',t+1}, \Sigma_{k',t+1}) \big) \big] - \lambda \\
&= \sum_i \frac{\partial}{\partial w_{k'}} \big[ \gamma_{ik'} \log w_{k'} \big] + \sum_i \frac{\partial}{\partial w_{k'}} \big[ \gamma_{ik'} \log \mathcal{N}(x_i; \mu_{k',t+1}, \Sigma_{k',t+1}) \big] - \lambda.
\end{aligned}$$

Here, $\mathcal{N}(x_i; \mu_{k',t+1}, \Sigma_{k',t+1})$ does not depend on $w_{k'}$, so we can remove that term:

$$\begin{aligned}
\frac{\partial L(\theta, \theta_t)}{\partial w_{k'}} &= \sum_i \frac{\partial}{\partial w_{k'}} \big[ \gamma_{ik'} \log w_{k'} \big] - \lambda \\
&= \sum_i \gamma_{ik'} \frac{1}{w_{k'}} - \lambda = \frac{1}{w_{k'}} \sum_i \gamma_{ik'} - \lambda.
\end{aligned}$$

Setting the derivative to 0, we can now solve for $w_{k',t+1}$:

$$w_{k',t+1} = \frac{\sum_i \gamma_{ik'}}{\lambda}.$$

We know that $\sum_{k'} w_{k',t+1} = 1$, so λ is the normalization factor:

$$\lambda = \sum_k \sum_i \gamma_{ik} = \sum_i \left( \sum_k P(Z_i = k \mid x_i, \theta) \right) = \sum_i 1 = n,$$

where $\sum_k P(Z_i = k \mid x_i, \theta) = 1$ because it is the sum over the whole probability distribution. Thus, we finally have our last update for the iterative procedure to optimize the parameters of GMM:

$$w_{k',t+1} = \frac{\sum_i \gamma_{ik'}}{n}.$$
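Putting the E-step and M-step formulas together, here is a minimal NumPy sketch of one EM iteration (my own illustration, not code from the notes; gamma holds the $\gamma_{ik}$ values):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_iteration(x, w, mu, Sigma):
    """One EM update for a GMM: returns (gamma, w_new, mu_new, Sigma_new)."""
    n, p = x.shape
    K = len(w)

    # E-step: gamma[i, k] = w_k N(x_i; mu_k, Sigma_k) / sum_k' w_k' N(x_i; mu_k', Sigma_k')
    dens = np.column_stack([
        w[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
        for k in range(K)
    ])
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step
    Nk = gamma.sum(axis=0)                  # sum_i gamma_ik for each k
    w_new = Nk / n                          # w_{k,t+1} = sum_i gamma_ik / n
    mu_new = (gamma.T @ x) / Nk[:, None]    # mu_{k,t+1} = sum_i gamma_ik x_i / sum_i gamma_ik
    Sigma_new = np.zeros((K, p, p))
    for k in range(K):
        diff = x - mu_new[k]                # (n, p)
        # Sigma_{k,t+1} = sum_i gamma_ik (x_i - mu)(x_i - mu)^T / sum_i gamma_ik
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

    return gamma, w_new, mu_new, Sigma_new
```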
We are now done with Gaussian mixture models. I’ll leave you with a final big-
picture summary of the update procedure, which looks quite similar to k-means:

E: What is the current estimate of the probability that xi comes from cluster k?
It is γik .

M: Update parameters µ, Σ and w.
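As a usage sketch (again my own illustration, reusing the hypothetical em_iteration and gmm_log_likelihood helpers defined in the code blocks above), the full procedure just alternates the two steps until the log likelihood stops improving:

```python
# Assumes data x of shape (n, p) and initial w, mu, Sigma are already defined,
# e.g. from a random initialization or a k-means run.
prev_ll = -np.inf
for t in range(200):
    gamma, w, mu, Sigma = em_iteration(x, w, mu, Sigma)
    ll = gmm_log_likelihood(x, w, mu, Sigma)  # non-decreasing across iterations
    if ll - prev_ll < 1e-6:                   # stop at a local maximum
        break
    prev_ll = ll
```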

