
Assignment 1 due this week

FIT5201
Data Analysis Algorithms
Week 7 – Latent Variable Models and Expectation Maximization
Outline

• Clustering
• KMeans
• Gaussian Mixture Models and Expectation-Maximization

Data Clustering

• Is a method of unsupervised learning


• Find a sensible structure from unlabelled data
• A clustering algorithm
– Groups data into their natural categories
– Based on the similarities between them
– Without knowledge of their actual groups
– Revealing the structure of the data
> High intra-cluster similarity
> Low inter-cluster similarity

Data Clustering…

• What is a natural grouping among these objects?

What is the difference between these two clusterings?

(Figure: example groupings labelled Simpsons Family, School Employees, Females, Males)

What is a good cluster?

What is the difference between these three clusterings?
Clustering Algorithms

• Many algorithms exist


– Centre-based (KMeans)
– Density-based (DBSCAN)
– Hierarchical clustering
– Graph-based clustering

Soft vs Hard Clusters

• Hard Clusters
– Data points belong to only one cluster

• Soft Clusters
– Data points could belong to one or more clusters
– Probability of belonging to each cluster is given

The KMeans Algorithm

• The simplest centre-based algorithm for clustering is KMeans

• We are given N unlabelled data points x_1, …, x_N

• Goal: partition the data points into K distinct groups (clusters)
– Similar points are grouped together
– Similarity is based on a distance measure d(·,·)

The KMeans Algorithm

• Is an iterative algorithm
• Starts with an initial random guess of K cluster centres μ_1^(0), μ_2^(0), …, μ_K^(0)
• Iterate the following two steps until a stopping criterion is met (a minimal sketch follows below):
– Update assignment of data points to clusters
> Calculate the distance of each data point to all cluster centres
> Assign the data point to the cluster with the minimum distance
– Update centres of the clusters
> For each cluster, calculate the new centre as the average of all data points assigned to it
> μ_k^(new) = Σ_n r_nk x_n / Σ_n r_nk
> where r_nk = 1 if x_n is assigned to cluster k, and 0 otherwise

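To make the two update steps concrete, here is a minimal NumPy sketch of the procedure above. The function name, the Euclidean distance choice, and the "centres stop moving" stopping criterion are illustrative assumptions, not part of the slides.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal KMeans sketch: X is an (N, D) array, K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initial random guess: pick K distinct data points as the cluster centres
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)   # shape (N, K)
        labels = dists.argmin(axis=1)
        # Update step: each centre becomes the mean of the points assigned to it
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
                                for k in range(K)])
        if np.allclose(new_centres, centres):   # stopping criterion: centres stop moving
            break
        centres = new_centres
    return centres, labels
```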
KMeans visualization

• A good visual simulation is available at


http://tech.nitoyon.com/en/blog/2013/11/07/k-means/

KMeans Remarks

• KMeans is sensitive to initial values
– different executions of KMeans with different initial cluster centres may result in different solutions (see the sketch below)
• KMeans is a non-probabilistic algorithm
– it only supports hard assignment
– a data point can be assigned to one and only one of the clusters

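As a quick illustration of the sensitivity to initial values, the hedged snippet below runs scikit-learn's KMeans from a few different random initialisations on some made-up data and prints the final within-cluster sum of squared distances (inertia); the data and the choice of three clusters are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # arbitrary synthetic, unlabelled data

# Different initial centres can converge to different final solutions;
# compare the within-cluster sum of squared distances (inertia) across runs.
for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    print(f"init seed {seed}: inertia = {km.inertia_:.2f}")
```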
Applications of KMeans

• Data points: pixel colors

• Clusters: similar pixel colors

• Replace the colors in a cluster with the cluster centroid

• Store the centroids only: reduced color resolution and storage space

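A hedged sketch of the colour-quantisation idea above, using Pillow and scikit-learn; the file name photo.png and the choice of K = 16 colours are illustrative assumptions.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=np.float64)  # hypothetical input image
pixels = img.reshape(-1, 3)                   # data points: one row per pixel colour

K = 16                                        # number of colours (clusters) to keep
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)

# Replace every pixel colour with the centroid of its cluster, then save the result
quantised = km.cluster_centers_[km.labels_].reshape(img.shape)
Image.fromarray(quantised.astype(np.uint8)).save("photo_quantised.png")
```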
Latent Variables

• We wanted to partition a set of training data points into K groups of similar data points
• The labels of the training data points are latent or hidden
• We call these latent variables

Gaussian Mixture Models (GMM)

• A Generative Story
– Consider the following hypothetical generative story for generating a label-data point pair (k, x)
– First,
> generate a cluster label k by rolling a die with K faces, where each face of the die corresponds to a cluster label
– Second,
> generate the data point x by sampling from the distribution p_k(·) corresponding to the cluster label k
– We are given the data points x but not the labels
– We model the hidden label by z ∈ {1, …, K}
– Now, given the training data,
> we would like to find the best values for the latent variables, and
> the best estimates for the parameters of the above generative story

The Probabilistic Generative Model

• Rolling a die with K faces
– is the same as sampling from a multinomial (categorical) distribution on K elements
– the parameters of the multinomial are
  φ_k ≥ 0,  Σ_{k=1}^{K} φ_k = 1,  p(z_n = k) = φ_k
• For each k,
– assume data points are sampled from a Gaussian distribution N(μ_k, Σ_k)
– with mean μ_k and covariance matrix Σ_k
– note that we have a collection of these Gaussian distributions,
– each of which corresponds to one of the K die faces
• We don't know the labels, so we try to best guess the latent variables z_1, …, z_N, where z_n ∈ {1, …, K} represents the latent label for data point x_n
• θ ≔ (φ, μ_1, Σ_1, …, μ_K, Σ_K)
• Use maximum likelihood estimation
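To make the generative story concrete, here is a small sampling sketch: draw a component label from the categorical distribution φ, then draw the point from the corresponding Gaussian. The two-component 2D parameters at the bottom are made up for illustration.

```python
import numpy as np

def sample_gmm(N, phi, means, covs, seed=0):
    """Generative story: roll the K-faced die for z_n, then sample x_n ~ N(mu_{z_n}, Sigma_{z_n})."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(phi), size=N, p=phi)                      # p(z_n = k) = phi_k
    x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z

# Hypothetical two-component mixture in 2D
phi   = np.array([0.3, 0.7])
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs  = [np.eye(2), 0.5 * np.eye(2)]
X, Z  = sample_gmm(500, phi, means, covs)
```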
Maximum Likelihood Estimation

• Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making those observations

(Figure: the likelihood of observing the data plotted as a function of μ; the MLE is the μ that maximizes the likelihood of observing this data)

• You can find the variance similarly

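For a single Gaussian the maximum likelihood estimates have closed forms (the sample mean and the biased sample variance); a tiny sketch with made-up observations:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)  # hypothetical observed data

mu_hat  = x.mean()                      # the mu that maximises the likelihood of the observations
var_hat = ((x - mu_hat) ** 2).mean()    # MLE of the variance (divides by N, not N - 1)
print(mu_hat, var_hat)
```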
Gaussian Mixture Model

• If we are given a complete data point (k, x_n)
– where the label is not hidden
– the probability of the pair according to our generative story would be
  p(k, x_n) = p(face k) p(x_n | face k) = φ_k N(x_n | μ_k, Σ_k)
• In practice, we are given incomplete data (only the observed data)
  p(x_n) = Σ_{z_n ∈ {1,…,K}} p(z_n, x_n) = Σ_{k=1}^{K} p(z_n = k) p(x_n | face k) = Σ_{k=1}^{K} φ_k N(x_n | μ_k, Σ_k)
– This model is called the Gaussian Mixture Model

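A hedged sketch of evaluating the incomplete-data density p(x) = Σ_k φ_k N(x | μ_k, Σ_k) for one point, using SciPy; the two-component parameters are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, phi, means, covs):
    """p(x) = sum_k phi_k * N(x | mu_k, Sigma_k) -- the marginal over the hidden label."""
    return sum(p * multivariate_normal(mean=m, cov=c).pdf(x)
               for p, m, c in zip(phi, means, covs))

# Hypothetical two-component mixture in 2D
phi   = [0.3, 0.7]
means = [np.zeros(2), np.array([4.0, 4.0])]
covs  = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), phi, means, covs))
```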
Gaussian Mixture Model

• We are only given x_1, x_2, …, x_N
• The labels are hidden (latent)
• We aim to best guess z_1, z_2, …, z_N, where z_n ∈ {1, …, K}
• z_n is the latent label for data point x_n
• The parameters of this model are
– θ = (φ, μ_1, Σ_1, μ_2, Σ_2, …, μ_K, Σ_K)
– We would like to best estimate these parameters

Latent variable models

• Use the maximum likelihood principle to do the parameter estimation

• Complete data likelihood function
– We are given the class labels
– Gaussian classifier
– p(X, Z) = Π_{n=1}^{N} p(x_n, z_n)
– Easy to get the analytical global solution

• Likelihood function (incomplete data likelihood function)
– p(X) = Π_{n=1}^{N} p(x_n) = Π_{n=1}^{N} Σ_{k=1}^{K} p(x_n, z_n = k)
– Hard to get the analytical global solution (sum inside the log)
– Need an iterative optimization algorithm (the EM method)
– EM: an iterative optimization algorithm for problems with latent variables

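To see the "sum inside the log" concretely, here is a hedged sketch of the incomplete-data log likelihood ln p(X) for a GMM; logsumexp performs the inner sum over components in a numerically stable way. The function name and parameter layout are assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, phi, means, covs):
    """ln p(X) = sum_n ln sum_k phi_k N(x_n | mu_k, Sigma_k) -- the sum over k sits inside the log."""
    log_joint = np.column_stack([np.log(p) + multivariate_normal(mean=m, cov=c).logpdf(X)
                                 for p, m, c in zip(phi, means, covs)])   # shape (N, K)
    return logsumexp(log_joint, axis=1).sum()
```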
Problem to be solved

• Why is it hard to find the global solution of the incomplete data likelihood function?
– using the Gaussian mixture model as an example

• What is the EM algorithm and why does it work?
– Steps
– Theoretical support

Gaussian Mixture Models

L(θ) = ln p(X) = ln Π_{n=1}^{N} p(x_n) = Σ_{n=1}^{N} ln p(x_n) = Σ_{n=1}^{N} ln Σ_{k=1}^{K} p(x_n, z_n = k)

L(θ) = Σ_{n=1}^{N} ln Σ_{k=1}^{K} p(z_n = k) p(x_n | z_n = k) = Σ_{n=1}^{N} ln Σ_{k=1}^{K} φ_k N(x_n | μ_k, Σ_k)

• Prediction rule
– γ(z_nk): the posterior probability that cluster k is assigned to a given data point x_n
– For a given data point, what is the prior probability that cluster k is assigned to it? (φ_k)

Gaussian Mixture Models
L(θ) = Σ_{n=1}^{N} ln Σ_{k=1}^{K} p(z_n = k) p(x_n | z_n = k) = Σ_{n=1}^{N} ln Σ_{k=1}^{K} φ_k N(x_n | μ_k, Σ_k)

• Let's try to find the global optimal solution first
– Three types of parameters: the mean vectors (μ_k); the covariance matrices (Σ_k); the mixing coefficients (φ_k)
– Compute the global optimal solution by setting the partial derivatives with respect to each of these parameters to 0
– Refer to the handwritten materials
– Hard to get the global optimal solution
> as all the solutions rely on γ(z_nk), the posterior probability of the cluster assignment, and γ(z_nk) itself relies on the three types of parameters in a complex way
– Need an iterative algorithm!

Expectation Maximization (EM) for GMMs

Choose some initial values for the parameters.

Alternate between the following two steps until a stopping condition is met (the change in the log likelihood function or in the parameters falls below some threshold), similar to KMeans (a sketch follows below):
• In the E (expectation) step, use the current values of the parameters to calculate the posterior probabilities γ(z_nk)
• In the M (maximization) step, re-estimate the parameters (μ_k, Σ_k, and φ_k) based on the γ(z_nk) values from the E step
– Use the equations in the handwritten materials

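A minimal sketch of the E and M steps described above, for full-covariance Gaussians. The initialisation, the small diagonal term added to the covariances for numerical stability, and the tolerance are implementation assumptions, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """A minimal sketch of EM for a Gaussian mixture, following the E and M steps above."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initial values: K data points as means, identity covariances, uniform mixing coefficients
    means = X[rng.choice(N, size=K, replace=False)].copy()
    covs  = np.array([np.eye(D) for _ in range(K)])
    phi   = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] = p(z_n = k | x_n, theta_old)
        weighted = np.column_stack([phi[k] * multivariate_normal(means[k], covs[k]).pdf(X)
                                    for k in range(K)])              # shape (N, K)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: re-estimate mu_k, Sigma_k and phi_k from the responsibilities
        Nk    = gamma.sum(axis=0)
        means = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff    = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        phi = Nk / N
        # Stopping condition: change in the log likelihood falls below the threshold
        ll = np.log(weighted.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return phi, means, covs, gamma
```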
Example

Two one-dimensional Gaussian clusters, a and b (shown as yellow and blue in the figures):

P(x_i | b) = 1/√(2πσ_b²) · exp( −(x_i − μ_b)² / (2σ_b²) )

E step: for each point, calculate P(blue | x_i) and P(yellow | x_i)

b_i = P(b | x_i) = P(x_i | b) P(b) / ( P(x_i | b) P(b) + P(x_i | a) P(a) )
a_i = P(a | x_i) = 1 − b_i

(Figure: points far from the blue Gaussian have P(blue | x) << P(yellow | x); points near it have P(blue | x) ≈ 1)

M step: update the means and variances using these posteriors as weights

μ_b = (b_1 x_1 + b_2 x_2 + … + b_N x_N) / (b_1 + b_2 + … + b_N)

σ_b² = (b_1 (x_1 − μ_b)² + … + b_N (x_N − μ_b)²) / (b_1 + b_2 + … + b_N)

(Figures: over successive iterations, the two estimated Gaussians a and b shift and reshape to fit their clusters)

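A hedged one-dimensional sketch that mirrors the update equations in this example, keeping the priors P(a) = P(b) = 0.5 fixed since the slides only update the means and variances; the initial guesses are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iters=50):
    """1D EM for two clusters a and b, mirroring the example's updates (fixed equal priors)."""
    mu_a, mu_b = x.min(), x.max()            # crude initial guesses for the two means
    var_a = var_b = x.var()
    p_a = p_b = 0.5                          # fixed priors P(a) and P(b)
    for _ in range(n_iters):
        # E step: b_i = P(b | x_i), a_i = 1 - b_i
        like_a = norm.pdf(x, mu_a, np.sqrt(var_a)) * p_a
        like_b = norm.pdf(x, mu_b, np.sqrt(var_b)) * p_b
        b = like_b / (like_a + like_b)
        a = 1.0 - b
        # M step: responsibility-weighted means and variances
        mu_a, mu_b = (a * x).sum() / a.sum(), (b * x).sum() / b.sum()
        var_a = (a * (x - mu_a) ** 2).sum() / a.sum()
        var_b = (b * (x - mu_b) ** 2).sum() / b.sum()
    return (mu_a, var_a), (mu_b, var_b)
```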
The EM Algorithm: General Case

• Training objective: find the maximum likelihood solution for models having latent variables
– Observed data X, latent variables Z, set of model parameters θ
– Log likelihood function
  ln p(X | θ) = ln Σ_Z p(X, Z | θ)

• Algorithm:
– Choose an initial setting for the parameters θ^old
– While convergence is not met:
> E step: evaluate p(Z | X, θ^old)
> M step: evaluate θ^new given by
  θ^new ← argmax_θ Q(θ, θ^old), where Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ)
> θ^old ← θ^new

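A hedged, generic skeleton of the loop above; the e_step and m_step callables are placeholders that a concrete model (for example the GMM sketch earlier) would supply, and the convergence test on the log likelihood is just one possible stopping condition.

```python
def em(e_step, m_step, theta_init, max_iters=100, tol=1e-6):
    """Generic EM loop: e_step(theta_old) returns p(Z | X, theta_old) (or its sufficient statistics);
    m_step(posterior) returns (theta_new, log_likelihood)."""
    theta, prev_ll = theta_init, float("-inf")
    for _ in range(max_iters):
        posterior = e_step(theta)            # E step: evaluate p(Z | X, theta_old)
        theta, ll = m_step(posterior)        # M step: theta_new = argmax_theta Q(theta, theta_old)
        if abs(ll - prev_ll) < tol:          # stop when the log likelihood stops improving
            break
        prev_ll = ll
    return theta
```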
The EM Algorithm: General Case

• Questions:
– Why do we use the Q function instead of the log likelihood function as the objective function in the M step?

– Is each iteration guaranteed to increase the log likelihood function?

– What is the relationship between the Q function and the log likelihood function?

The EM Algorithm: General Case

• Why the Q function?
  Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ)

– Solving for ln p(X, Z | θ) (the complete data log likelihood) is easy, while solving for ln p(X | θ) (the incomplete data log likelihood) is hard
– So focus on the complete data likelihood
– Intuitive explanation: the Q function is the expected value of the complete data log likelihood under the posterior distribution of the latent variables, p(Z | X, θ^old)

The EM Algorithm: General Case

– Is each iteration guaranteed to increase the log likelihood function?
> Yes; for the proof, refer to the textbook (Bishop, Pattern Recognition and Machine Learning)
– What is the relationship between the Q function and the log likelihood function?
> The Q function is a lower bound of the log likelihood function

The hard-EM Algorithm

• Each data point is assigned to the single class with the largest posterior probability
  Z* = argmax_Z p(Z | X, θ^old)

• There is no expectation over the latent variables Z in the Q function; it reduces to
  ln p(X, Z* | θ)

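A hedged sketch of the hard-EM assignment step: take soft responsibilities γ (for example the gamma array produced by the em_gmm sketch earlier, a hypothetical helper) and replace them with a one-hot assignment to the most probable cluster.

```python
import numpy as np

def harden(gamma):
    """Hard E step: Z* = argmax of the posterior, encoded as a one-hot (N, K) matrix."""
    z_star = gamma.argmax(axis=1)                        # most probable cluster for each point
    hard = np.zeros_like(gamma)
    hard[np.arange(len(gamma)), z_star] = 1.0            # the M step then uses ln p(X, Z* | theta)
    return hard
```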
EM Algorithm for GMMs with Q function

Q(θ^new, θ^old) ≔ Σ_Z p(Z | X, θ^old) ln p(X, Z | θ^new)
              = Σ_{n=1}^{N} Σ_{k=1}^{K} ( γ(z_nk) ln φ_k + γ(z_nk) ln N(x_n | μ_k, Σ_k) )

• γ(z_nk) is given by the E step
• No sum inside the log
• Easy to optimize

• Maximizing the Q function, we get (with N_k ≔ Σ_{n=1}^{N} γ(z_nk)):

  μ_k^new = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n
  Σ_k^new = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − μ_k^new)(x_n − μ_k^new)^T
  φ_k^new = N_k / N
