
Stats 401 - EM Algorithm

Miles Chen, PhD

Department of Statistics

Week 7 Part 2

Copyright Miles Chen. For personal use only. Do not distribute.


Section 1

Basic Ideas



Basic Ideas

The EM algorithm is used in situations where you have both missing data and unknown
parameters.
Example from ritvikmath: https://www.youtube.com/watch?v=xy96ArOpntA



Scenario 1: missing value but parameters known

Let’s say you are told the values (1, 2, x) are drawn from a known distribution: Normal(mean
= 1, sd = 1)
What is your best guess for the missing value x?
The answer is easy: the best guess is the mean of the distribution, which is 1.



Scenario 2: values known but parameters unknown

Let’s say you are given the values (1, 2, 0) and are told they are drawn from a distribution:
Normal(mean = µ, sd = 1)
What is your best guess for µ?
This time, the mean of the distribution is unknown.
Based on the data, your best guess (the maximum likelihood estimate) of the mean of the
distribution is the mean of your sample = (0+1+2)/3 = 1.



Scenario 3: missing value and parameters unknown

Let’s say you have some data as well as a missing value (1, 2, x). You know they are drawn
from a distribution with an unknown mean: Normal(mean = µ, sd = 1)
What is your best guess for the missing value x and the unknown parameter µ?



An iterative process

We can use an iterative approach:


We first take a guess for µ. Let’s guess µ_0 = 0.
Using this, we guess a value for x. Our best guess is x = µ_0 = 0.
Our data is now (1, 2, 0). Based on this, our best guess is µ_1 = 1.
Now we guess a value for x again. Our best guess is x = µ_1 = 1.
Our data is now (1, 2, 1). The MLE is now µ_2 = 4/3 ≈ 1.333.
We iterate, eventually converging to x = 1.5 and µ = 1.5.
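
A minimal R sketch of this loop (assuming the Normal(µ, sd = 1) setup above; the variable names are mine, not from the slides):

obs <- c(1, 2)   # the observed values
mu  <- 0         # initial guess mu_0
x   <- mu
for (i in 1:25) {
  x  <- mu                # best guess for the missing value is the current mean
  mu <- mean(c(obs, x))   # MLE of mu from the completed data (1, 2, x)
}
c(x = x, mu = mu)         # both approach 1.5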



Section 2

EM Algorithm for Gaussian Mixtures



EM Algorithm for Gaussian Mixtures

We apply the EM algorithm to Gaussian Mixtures. The data consists of points that are
generated from several Gaussian distributions.
In this case, the missing data is the cluster label: which distribution each point was generated from.
The parameters (mean and variance) of the Gaussian distributions are also unknown.
We will use the EM algorithm iteratively to figure out both the unknown parameters and the
unknown cluster assignments.



Gaussian Mixtures

We generate some data from a mixture of three multivariate Gaussian distributions.


The data generating process:
Randomly select a value representing the cluster: 1, 2, or 3, with probability vector
α = [0.5, 0.3, 0.2]
Generate a random value from the selected multivariate Gaussian distribution:
Component 1: N_2(\mu = [0, 0], \Sigma = \begin{bmatrix} 9 & 0 \\ 0 & 9 \end{bmatrix})
Component 2: N_2(\mu = [4, 4], \Sigma = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix})
Component 3: N_2(\mu = [-4, -4], \Sigma = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix})



Mixture Models

The mixture model can be represented with:


p(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, p_k(x \mid z_k, \theta_k)

In our example, we have a mixture of 2-dimensional Gaussian distributions.


x is the point in 2-dimensional space.
Θ collects all of the parameters from the different mixture components (all α_k and all θ_k).
α_k are the mixture weights representing the probability that x came from component k, where
\sum_{k=1}^{K} \alpha_k = 1.
p_k(·) are the probability density functions of the components with parameters θ_k. For our example
this is the multivariate Gaussian PDF, and θ_k represents the mean vector µ_k and covariance matrix
Σ_k of component k.
z_k is a vector of K indicator variables where one value is 1 and the rest are 0s; it represents the
class membership of x.
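
To make the notation concrete, here is a minimal R sketch (not from the slides; the function name is mine) that evaluates the mixture density p(x | Θ) at a point, assuming alpha, means, and sigmas are defined as on the next slide:

library(mvtnorm)

# mixture density: sum over k of alpha_k * p_k(x | mu_k, Sigma_k)
mixture_density <- function(x, alpha, means, sigmas) {
  K <- length(alpha)
  sum(sapply(1:K, function(k) alpha[k] * dmvnorm(x, means[[k]], sigmas[[k]])))
}

# example call: mixture_density(c(0, 0), alpha, means, sigmas)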



Generating the data: Specify the component properties

# Specifying properties of the individual components


alpha <- c(.5, .3, .2) # mixture proportions
mean_1 <- c(0, 0) # center of cluster 1
sigma_1 <- matrix(c(9, 0, 0, 9), 2) # sigma of cluster 1
mean_2 <- c(4, 4)
sigma_2 <- matrix(c(1, .9, .9, 1), 2)
mean_3 <- c(-4, -4)
sigma_3 <- matrix(c(2, -1, -1, 2), 2)
means <- list(mean_1, mean_2, mean_3)
sigmas <- list(sigma_1, sigma_2, sigma_3)



Generating the mixture

# generating the mixtures


set.seed(1)
library(mvtnorm)
X <- matrix(0, ncol = 3, nrow = 1000)
for (i in 1:nrow(X)) {
  k <- sample(1:3, 1, prob = alpha)                     # pick a component
  X[i, ] <- c(rmvnorm(1, means[[k]], sigmas[[k]]), k)   # draw a point, store its label
}



The resulting mixture

plot(X[,1], X[,2], cex = 0.5, asp = 1,
     main = "Clusters are not labeled. This is how we see the data.")

[Figure: scatter plot of X[,1] vs X[,2]; clusters are not labeled.]
The mixture consists of three components

plot(X[,1], X[,2], col = X[,3], cex = 0.5, asp = 1,
     main = "Plot with cluster labels. Usually, these are unknown to us.")

[Figure: the same scatter plot, with points colored by cluster label.]
Some points could be “assigned” to more than one cluster

Which cluster does this point belong to?

[Figure: the unlabeled scatter plot; a point lying between clusters is ambiguous.]


Solution: Use a probabilistic assignment

Rather than strictly assigning a point to one cluster like in K-means clustering, we use a Bayes
classifier to get a probabilistic assignment - what we might call a membership weight. Each
point now has a vector of probabilities (that add to 1).
Once we have the membership weights, we calculate the parameters of each cluster
distribution. We use a weighted mean and weighted variance-covariance matrix based on the
membership weights of the points in the clusters. This maximizes the likelihood of the values
“assigned” to the cluster.
We iterate back and forth between calculating membership weights (E-step) and recalculating
the parameters of the cluster distributions (M-step).
It could be useful to think of the EM algorithm as a blend between k-means clustering and a
Bayes classifier.



The E-step: Membership Weights

The membership weights are calculated using a Bayes classifier:


Given the x-values of the point (stored in x_i) and all of the parameter values (the means and
sigma matrices of all clusters, stored in Θ), the weight/probability that point i belongs to
cluster k is:

w_{ik} = P(z_{ik} = 1 \mid x_i, \Theta) = \frac{\text{likelihood} \cdot \text{prior}}{\text{marginal}} = \frac{p_k(x_i \mid z_k, \theta_k) \cdot \alpha_k}{\sum_{m=1}^{K} p_m(x_i \mid z_m, \theta_m) \cdot \alpha_m}

Note that the denominator (the marginal) is equal to the sum of the numerators across all
possible clusters.
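
As an illustration (a sketch with assumed names, not the hidden homework code), the whole E-step can be vectorized in R, returning an N × K matrix of membership weights:

library(mvtnorm)

# E-step sketch: X is an N x 2 data matrix, mus and sigmas are lists of
# K mean vectors and covariance matrices, alpha is a length-K weight vector.
e_step <- function(X, mus, sigmas, alpha) {
  K <- length(alpha)
  # numerators: alpha_k * p_k(x_i) for every point i and cluster k
  num <- sapply(1:K, function(k) alpha[k] * dmvnorm(X, mus[[k]], sigmas[[k]]))
  num / rowSums(num)   # divide each row by its marginal; rows now sum to 1
}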



E-step: Membership Weights - α_k

w_{ik} = P(z_{ik} = 1 \mid x_i, \Theta) = \frac{\text{likelihood} \cdot \text{prior}}{\text{marginal}} = \frac{p_k(x_i \mid z_k, \theta_k) \cdot \alpha_k}{\sum_{m=1}^{K} p_m(x_i \mid z_m, \theta_m) \cdot \alpha_m}

α_k is the prior probability (from the previous iteration) that a point belongs to cluster k. It is
equal to the number of “points that are assigned” to cluster k divided by the total number
of points:

\alpha_k = \frac{N_k}{N}

The “number of points assigned” to cluster k, however, is not an integer. Rather, it is the sum
of the membership weights:

N_k = \sum_{i=1}^{N} w_{ik}
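
In code, with W the N × K weight matrix from the E-step sketch above (names assumed):

Nk    <- colSums(W)    # "effective" number of points in each cluster
alpha <- Nk / nrow(W)  # updated mixing proportions; these sum to 1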



E-step: Membership Weights - p_k

w_{ik} = P(z_{ik} = 1 \mid x_i, \Theta) = \frac{\text{likelihood} \cdot \text{prior}}{\text{marginal}} = \frac{p_k(x_i \mid z_k, \theta_k) \cdot \alpha_k}{\sum_{m=1}^{K} p_m(x_i \mid z_m, \theta_m) \cdot \alpha_m}

The likelihood is the probability density function evaluated for the values in x_i given the
current estimates of the cluster’s mean and sigma matrix. The PDF is the multivariate normal
distribution:

p_k(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right)
Using the mvtnorm package in R, we can easily evaluate the above:
likelihood_k <- dmvnorm(X_new, mean = xbar_k, sigma = var_k)



M-step - Calculating the estimates of µk

Once we have calculated the membership weights of each point for each cluster, we
recalculate the “centroid” of each cluster according to the probabilistic weights of the points
that have been “assigned” to it:

\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} w_{ik} \, x_i
This is just a weighted mean of all the points.
Points that are highly likely to be in cluster k will have a membership weight w_{ik} close to 1
and will contribute more to the calculation of the “mean,” while points that are unlikely to be
in cluster k will have w_{ik} close to 0 and will contribute very little.
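
A one-line sketch in R, where w = W[, k] holds the weights for cluster k and X is the N × 2 data matrix (assumed names, matching the earlier sketches):

# weighted mean of the points: (1 / N_k) * sum_i w_ik * x_i
update_mu <- function(X, w) {
  colSums(w * X) / sum(w)
}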



M-step - Calculating the estimates of Σk

We use the same idea to calculate the Σ matrices of each cluster, using the new µ_k values we
just calculated in the previous step:

\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} w_{ik} (x_i - \mu_k)(x_i - \mu_k)^T
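
A matching sketch in R (same assumed names as above; mu_k is the freshly updated mean):

# weighted covariance: (1 / N_k) * sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T
update_sigma <- function(X, w, mu_k) {
  Xc <- sweep(X, 2, mu_k)        # subtract mu_k from every row
  t(Xc) %*% (w * Xc) / sum(w)    # weighted sum of outer products
}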



[Figure: “Clusters are not labeled. This is how we see the data.” (X[,1] vs X[,2])]

[Figure: “True Groups” (x vs y)]
# use these initial arbitrary values
N <- dim(dat)[1]              # number of data points
alpha <- c(0.33, 0.33, 0.34)  # arbitrary starting mixing parameters
mu <- matrix(                 # starting means
  c( 0,  0,
    -9, -9,
     9,  9),
  nrow = 3, byrow = TRUE
)
sig1 <- matrix(c(1, 0, 0, 1), nrow = 2)  # three arbitrary covariance matrices
sig2 <- matrix(c(1, 0, 0, 1), nrow = 2)
sig3 <- matrix(c(1, 0, 0, 1), nrow = 2)
I’ve intentionally hidden the rest of my code because you will code up the EM algorithm in
your HW assignment.



[Figure: EM − One iteration (x vs y)]
Before we continue

A few notes about the previous graph:


The circles/ellipses show the contour lines of the multivariate normal PDF. The initial
parameter estimates have variance matrices of \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} (the identity). These are clearly bad
estimates of the variances.
Also, the coloring of points is misleading. The graph colors the points according to the cluster
which has the highest probability. However, each point has a probabilistic assignment. The
points that are on the border between the red and black regions are more like “51% red/49%
black” or “49% red/51% black”, but they are represented with a single color in the graph.



[Figures: a sequence of “EM − One iteration” plots showing successive iterations (x vs y)]

[Figure: EM Final Clusters (x vs y)]

[Figure: True Groups (x vs y)]
Different Starting Locations

# use these initial arbitrary values

N <- dim(dat)[1]              # number of data points
alpha <- c(0.33, 0.33, 0.34)  # arbitrary starting mixing parameters
mu <- matrix(                 # starting means
  c(0,   0,
    0,  10,
    0, -10),
  nrow = 3, byrow = TRUE
)
sig1 <- matrix(c(1, 0, 0, 1), nrow = 2)  # three arbitrary covariance matrices
sig2 <- matrix(c(1, 0, 0, 1), nrow = 2)
sig3 <- matrix(c(1, 0, 0, 1), nrow = 2)



[Figure: EM − One iteration − Bad Start Weights (x vs y)]

[Figures: a sequence of “EM − One iteration” plots showing successive iterations (x vs y)]

Fast forward to the end

[Figure: Groups identified by EM (x vs y)]
