FIT5201
Data Analysis Algorithms
Week 7 – Latent Variable Models and Expectation Maximization
Outline
• Clustering
• KMeans
• Gaussian Mixture Models and Expectation-Maximization
Data Clustering
Data Clustering…
What is a good cluster?
What is the difference between these three clusterings?
Clustering Algorithms
Soft vs Hard Clusters
• Hard Clusters
– Each data point belongs to exactly one cluster
• Soft Clusters
– Each data point can belong to more than one cluster
– The probability (degree) of membership in each cluster is given
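A tiny illustration of the difference between the two outputs (the array names and numbers below are made up for illustration, not taken from the unit materials):

import numpy as np

# Hard clustering: each data point gets exactly one cluster label.
hard_labels = np.array([0, 2, 1, 0])           # point n belongs to cluster hard_labels[n]

# Soft clustering: each data point gets a probability of membership in every cluster,
# so each row of the matrix sums to 1 (illustrative numbers only).
soft_memberships = np.array([[0.90, 0.05, 0.05],
                             [0.10, 0.20, 0.70],
                             [0.25, 0.60, 0.15],
                             [0.80, 0.15, 0.05]])
assert np.allclose(soft_memberships.sum(axis=1), 1.0)

# A hard clustering can always be recovered from a soft one by taking the most probable cluster.
recovered_hard = soft_memberships.argmax(axis=1)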
The KMeans Algorithm
The KMeans Algorithm
• Is an iterative algorithm
• Starts with an initial random guess of 𝐾 cluster centres
$\mu_1^{(0)}, \mu_2^{(0)}, \ldots, \mu_K^{(0)}$
• Iterate the following two steps until a stopping criterion is met:
– Update assignment of data points to clusters
> Calculate the distance of each data point to all cluster centres
> Assign the data point to the cluster with the minimum distance
– Update centers of the clusters
> For each cluster, calculate the new centre as the average of all data points assigned to it
> $\mu_k^{\text{new}} = \dfrac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$, where
> $r_{nk} = \begin{cases} 1 & \text{if } x_n \text{ is assigned to cluster } k \\ 0 & \text{otherwise} \end{cases}$
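A minimal NumPy sketch of the two update steps described above; the function name kmeans, the choice of random data points as initial centres, and the stopping rule of the centres no longer moving are illustrative choices, not prescribed by the slides:

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch: X is an (N, D) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initial random guess: pick K distinct data points as the initial cluster centres.
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment update: distance of every point to every centre, then the nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)   # shape (N, K)
        assignments = dists.argmin(axis=1)                                    # r_nk, stored as labels
        # Centre update: each centre becomes the average of the points assigned to it.
        new_centres = np.array([X[assignments == k].mean(axis=0)
                                if np.any(assignments == k) else centres[k]
                                for k in range(K)])
        if np.allclose(new_centres, centres):    # stopping criterion: centres stopped moving
            break
        centres = new_centres
    return centres, assignments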
KMeans visualization
KMeans Remarks
Applications of KMeans
Latent Variables
Gaussian Mixture Models (GMM)
• A Generative Story
– Consider the following hypothetical generative story for generating a
label-data point pair (𝑘, 𝒙)
– First
> generate a cluster label 𝑘 by rolling a die with 𝐾 faces, where each face of the die corresponds to a cluster label
– Second,
> generate the data point 𝒙 by sampling from the distribution $p_k(\cdot)$ corresponding to the cluster label 𝑘
– We are given the data points 𝒙 but not the labels
– We model the unobserved label with a latent variable $z \in \{1, \dots, K\}$
– Now given the training data,
> we would like to find the best value for the latent variables, and
> the best estimates for the parameters of the above generative story.
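A short sketch of this two-step generative story for a 1-D Gaussian mixture; the function name sample_gmm and the parameter names phis, mus, sigmas are illustrative assumptions, not from the slides:

import numpy as np

def sample_gmm(n, phis, mus, sigmas, seed=0):
    """Generate n points: first roll the K-faced die to get a label z, then sample x from component z."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(phis), size=n, p=phis)      # step 1: latent cluster labels
    x = rng.normal(loc=np.take(mus, z),            # step 2: x ~ N(mu_z, sigma_z^2)
                   scale=np.take(sigmas, z))
    return x, z                                    # in practice only x is observed; z stays latent

# Example with made-up parameters: two components with mixing weights 0.3 and 0.7.
x, z = sample_gmm(1000, phis=[0.3, 0.7], mus=[-2.0, 3.0], sigmas=[1.0, 0.5])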
The Probabilistic Generative Model
Gaussian Mixture Model
Latent variable models
Problem to be solved
Gaussian Mixture Models
• $L(\Theta) = \ln p(X) = \ln \prod_{n=1}^{N} p(x_n) = \sum_{n=1}^{N} \ln p(x_n) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} p(x_n, z_k)$
  $L(\Theta) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} p(z_k)\, p(x_n \mid z_k) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \varphi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
• Prediction rule: assign $x_n$ to the cluster $k$ with the highest responsibility $p(z_k \mid x_n) \propto \varphi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
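A sketch of how $L(\Theta)$ and the prediction rule above can be computed, assuming SciPy is available; the function names gmm_log_likelihood and predict_cluster are illustrative:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, phis, mus, Sigmas):
    """L(Theta) = sum_n ln sum_k phi_k N(x_n | mu_k, Sigma_k), evaluated stably in log space."""
    # log_probs[n, k] = ln phi_k + ln N(x_n | mu_k, Sigma_k)
    log_probs = np.stack([np.log(phis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
                          for k in range(len(phis))], axis=1)
    return logsumexp(log_probs, axis=1).sum()

def predict_cluster(X, phis, mus, Sigmas):
    """Prediction rule: assign each point to the component with the largest posterior p(z_k | x_n)."""
    log_probs = np.stack([np.log(phis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
                          for k in range(len(phis))], axis=1)
    return log_probs.argmax(axis=1)    # the normalising constant ln p(x_n) is the same for every k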
Expectation Maximization (EM) for GMMs
Example
(figure: a 1-D data set modelled as a mixture of two Gaussian components, labelled a and b)
$P(x_i \mid b) = \dfrac{1}{\sqrt{2\pi\sigma_b^2}} \exp\!\left(-\dfrac{(x_i - \mu_b)^2}{2\sigma_b^2}\right)$
$b_i = P(b \mid x_i) = \dfrac{P(x_i \mid b)\, P(b)}{P(x_i \mid b)\, P(b) + P(x_i \mid a)\, P(a)}$
$a_i = P(a \mid x_i) = 1 - b_i$
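A sketch of the responsibility calculation above for the two-component 1-D case; the function name responsibilities and the numbers in the last two lines are made up for illustration:

import numpy as np
from scipy.stats import norm

def responsibilities(x, mu_a, sigma_a, p_a, mu_b, sigma_b, p_b):
    """E-step of the example: b_i = P(b | x_i) by Bayes' rule, and a_i = 1 - b_i."""
    weighted_a = norm.pdf(x, loc=mu_a, scale=sigma_a) * p_a    # P(x_i | a) P(a)
    weighted_b = norm.pdf(x, loc=mu_b, scale=sigma_b) * p_b    # P(x_i | b) P(b)
    b = weighted_b / (weighted_a + weighted_b)
    return 1.0 - b, b                                          # (a_i, b_i)

x = np.array([-1.5, 0.2, 2.8, 3.1])
a_i, b_i = responsibilities(x, mu_a=0.0, sigma_a=1.0, p_a=0.5, mu_b=3.0, sigma_b=1.0, p_b=0.5)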
The EM Algorithm: General Case
• Algorithm:
– Choose an initial setting for the parameters $\theta^{old}$
– While convergence is not met:
> E step: Evaluate $p(Z \mid X, \theta^{old})$
> M step: Evaluate $\theta^{new}$ given by
  $\theta^{new} \leftarrow \arg\max_{\theta} Q(\theta, \theta^{old})$, where $Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old})\, \ln p(X, Z \mid \theta)$
> $\theta^{old} \leftarrow \theta^{new}$
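A schematic sketch of this loop; the helper names e_step, m_step and log_likelihood are placeholders for problem-specific functions (illustrative assumptions, not part of the unit materials):

import numpy as np

def em(X, theta_init, e_step, m_step, log_likelihood, max_iters=100, tol=1e-6):
    """Generic EM skeleton:
    e_step(X, theta)          -> the posterior p(Z | X, theta), i.e. the responsibilities
    m_step(X, posterior)      -> the theta maximising Q(theta, theta_old)
    log_likelihood(X, theta)  -> ln p(X | theta), used here only to monitor convergence."""
    theta = theta_init
    prev_ll = -np.inf
    for _ in range(max_iters):
        posterior = e_step(X, theta)       # E step: evaluate p(Z | X, theta_old)
        theta = m_step(X, posterior)       # M step: theta_new = arg max_theta Q(theta, theta_old)
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:             # EM never decreases ln p(X | theta)
            break
        prev_ll = ll
    return theta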
The EM Algorithm: General Case
• Questions:
– Why do we use the Q function instead of the log-likelihood function as the objective function in the M step?
The EM Algorithm: General Case
• Why the Q function?
$Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old})\, \ln p(X, Z \mid \theta)$
The EM Algorithm: General Case
– Is each iteration guaranteed to increase the log likelihood function?
– What’s the relationship between the Q function and log likelihood function?
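One standard way to make both answers precise (the usual EM lower-bound argument, sketched here as background rather than taken from the slides): for any distribution $q(Z)$,
$\ln p(X \mid \theta) = \sum_Z q(Z) \ln \dfrac{p(X, Z \mid \theta)}{q(Z)} + \mathrm{KL}\!\left(q(Z) \,\big\|\, p(Z \mid X, \theta)\right)$,
and choosing $q(Z) = p(Z \mid X, \theta^{old})$ in the E step makes the first term equal to $Q(\theta, \theta^{old})$ plus a constant that does not depend on $\theta$. The Q function is therefore, up to that constant, a lower bound on the log likelihood that is tight at $\theta = \theta^{old}$, so maximising it in the M step can never decrease $\ln p(X \mid \theta)$ from one iteration to the next.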
The hard-EM Algorithm
EM Algorithm for GMMs with Q function
• The Q function is easy to optimize for GMMs: the M step has closed-form updates for the mixing weights, means and covariances
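A compact sketch of the resulting EM updates for GMMs (responsibilities in the E step, closed-form weighted estimates in the M step); the function name em_gmm, the initialisation, and the small term added to the covariances for numerical stability are illustrative choices, not taken from the slides:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture on an (N, D) data matrix X with K components."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    phis = np.full(K, 1.0 / K)                          # mixing weights phi_k
    mus = X[rng.choice(N, size=K, replace=False)]       # initial means: K random data points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E step: r[n, k] proportional to phi_k * N(x_n | mu_k, Sigma_k), normalised over k.
        r = np.stack([phis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                      for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M step: weighted versions of the usual proportion / mean / covariance estimates.
        Nk = r.sum(axis=0)                              # effective number of points per component
        phis = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        Sigmas = np.array([
            ((r[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k] + 1e-6 * np.eye(D)
            for k in range(K)])
    return phis, mus, Sigmas, r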