
Assignment 1 due this week

FIT5201
Data Analysis Algorithms
Week 7 – Latent Variable Models and Expectation Maximization
Outline

• Clustering
• KMeans
• Gaussian Mixture Models and Expectation-Maximization

Data Clustering

• Is a method of unsupervised learning
• Finds a sensible structure in unlabelled data
• A clustering algorithm
– Groups data into their natural categories
– Based on the similarities between them
– Without knowledge of their actual groups
– Revealing the structure of the data
> High intra-cluster similarity
> Low inter-cluster similarity

Data Clustering…

• What is a natural grouping among these objects?

What is the difference between these two clusterings?

[Figure: the same cartoon characters grouped in two ways: as the Simpsons family vs. school employees, and as females vs. males]

What is a good cluster?

What is a good cluster?

What is the difference between these three clusterings?

Clustering Algorithms

• Many algorithms exist


– Centre-based (KMeans)
– Density-based (DBSCAN)
– Hierarchical clustering
– Graph-based clustering

Soft vs Hard Clusters

• Hard Clusters
– Data points belong to only one cluster

• Soft Clusters
– Data points could belong to one or more clusters
– Probability of belonging to each cluster is given

The KMeans Algorithm

• KMeans is the simplest centre-based algorithm for solving clustering problems

• $N$ unlabelled data points $x_n$ are given

• Goal: partition the data points into $K$ distinct groups (clusters)
– Similar points are grouped together
– Similarity is based on a distance measure $d(\cdot,\cdot)$

The KMeans Algorithm

• Is an iterative algorithm
• Starts with an initial random guess of the $K$ cluster centres $\mu_1^{(0)}, \mu_2^{(0)}, \dots, \mu_K^{(0)}$
• Iterate the following two steps until a stopping criterion is met:
– Update assignment of data points to clusters
> Calculate the distance of each data point to all cluster centres
> Assign the data point to the cluster with the minimum distance
– Update the centres of the clusters
> For each cluster, calculate the new centre as the average of all data points assigned to it (a minimal code sketch of both steps is given below)
> $\mu_k^{\text{new}} = \dfrac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$
> where $r_{nk} = 1$ if $x_n$ is assigned to cluster $k$, and $r_{nk} = 0$ otherwise
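As a concrete illustration of the two alternating steps, here is a minimal NumPy sketch assuming Euclidean distance; the function and variable names are illustrative, not the unit's reference implementation:

```python
# A minimal KMeans sketch (illustrative; not the unit's reference implementation).
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initial random guess: pick K data points as the initial centres mu_k^(0)
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: r_nk = 1 for the closest centre, 0 otherwise
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre becomes the mean of its assigned points
        new_centres = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
            for k in range(K)
        ])
        if np.allclose(new_centres, centres):   # stopping criterion
            break
        centres = new_centres
    return centres, labels

# Usage: centres, labels = kmeans(np.random.randn(200, 2), K=3)
```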

KMeans visualization

• A good visual simulation is available at


http://tech.nitoyon.com/en/blog/2013/11/07/k-means/

KMeans Remarks

• KMeans is sensitive to initial values
– different executions of KMeans with different initial cluster centres may result in different solutions
• KMeans is a non-probabilistic algorithm
– it only supports hard assignment
– a data point is assigned to one and only one cluster

Applications of KMeans

Ø Data points: pixel colours
Ø Clusters: similar pixel colours
Ø Replace the colours in a cluster with the cluster centroid
Ø Store the centroids only: reduced colour resolution and storage space (a code sketch follows below)
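A rough sketch of this colour-quantisation idea, assuming scikit-learn is available; the function name and parameter values are illustrative placeholders:

```python
# Colour quantisation with KMeans: a sketch assuming an RGB image given as a
# (H, W, 3) array and scikit-learn installed. Values and names are placeholders.
import numpy as np
from sklearn.cluster import KMeans

def quantise_colours(image, n_colours=16):
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)       # data points = pixel colours
    km = KMeans(n_clusters=n_colours, n_init=10).fit(pixels)
    # Replace every pixel with the centroid of its cluster
    quantised = km.cluster_centers_[km.labels_]
    return quantised.reshape(h, w, c)

# Usage (hypothetical): small_image = quantise_colours(image, n_colours=8)
```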

Latent Variables

• We wanted to partition a set of training data points into $K$ groups of similar data points
• The labels of the training data points are latent or hidden
• We call these latent variables

Gaussian Mixture Models (GMM)

• A Generative Story
– Consider the following hypothetical generative story for generating a pair $(k, \boldsymbol{x})$ of a label and a data point
– First,
> generate a cluster label $k$ by tossing a die with $K$ faces, where each face of the die corresponds to a cluster label
– Second,
> generate the data point $\boldsymbol{x}$ by sampling from the distribution $p_k(\cdot)$ corresponding to the cluster label $k$
– We are given the data points $\boldsymbol{x}$ but not the labels
– We model the label by $z \in \{1, \dots, K\}$
– Now, given the training data,
> we would like to find the best values for the latent variables, and
> the best estimates for the parameters of the above generative story (a sampling sketch of this story follows below)
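To make the two steps concrete, here is a small NumPy sketch of the generative story; the mixing weights, means, and covariances below are made-up illustrative values, not from the unit materials:

```python
# A sketch of the GMM generative story (illustrative parameter values).
import numpy as np

rng = np.random.default_rng(0)

phi = np.array([0.5, 0.3, 0.2])                  # die face probabilities (mixing coefficients)
mus = [np.array([0., 0.]), np.array([4., 4.]), np.array([-4., 4.])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]

def sample_pair():
    # First: toss the K-faced die to get a cluster label k
    k = rng.choice(len(phi), p=phi)
    # Second: sample x from the Gaussian corresponding to face k
    x = rng.multivariate_normal(mus[k], Sigmas[k])
    return k, x

# In practice only x is observed; the label k is the latent variable z.
pairs = [sample_pair() for _ in range(5)]
```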

The Probabilistic Generative Model

• Tossing a die with $K$ faces
– is the same as sampling from a multinomial (categorical) distribution on $K$ elements
– the parameters of the multinomial are
  $\phi_k \ge 0, \quad \sum_{k=1}^{K} \phi_k = 1, \quad p(z_n = k) = \phi_k$
• For each $k$,
– assume data points are sampled from a Gaussian distribution $\mathcal{N}(\mu_k, \Sigma_k)$
– with mean $\mu_k$ and covariance matrix $\Sigma_k$
– note that we have a collection of these Gaussian distributions,
– each of which corresponds to one of the $K$ die faces
• We don't know the labels and try to best guess the latent variables $z_1, \dots, z_N$, where $z_n \in \{1, \dots, K\}$ represents the latent label for a data point $x_n$
• $\theta := (\phi, \mu_1, \Sigma_1, \dots, \mu_K, \Sigma_K)$
• Use maximum likelihood estimation
Maximum Likelihood Estimation

• Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model from observations, by finding the parameter values that maximize the likelihood of the observations

[Figure: a one-dimensional Gaussian fitted to data points, illustrating the likelihood of observing the data and the $\mu$ that maximizes it]

• The variance can be found similarly
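For reference (a standard result, not derived on the slide), the MLE for a univariate Gaussian fitted to observations $x_1, \dots, x_N$ is

$\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2$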

Gaussian Mixture Model

• If we were given a complete data point $(k, x_n)$
– where the label was not hidden
– the probability of the pair according to our generative story would be
– $p(k, x_n) = p(\text{face } k)\, p(x_n \mid \text{face } k) = \varphi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
• In practice, we are given incomplete data (only the observed data)
– $p(x_n) = \sum_{z_n \in \{1,\dots,K\}} p(z_n, x_n) = \sum_{k=1}^{K} p(z_n = k)\, p(x_n \mid \text{face } k) = \sum_{k=1}^{K} \varphi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
– This model is called the Gaussian Mixture Model
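As a small illustration of this mixture density, a sketch that evaluates $p(x)$ with SciPy, reusing the made-up parameters from the sampling sketch above (not part of the unit materials):

```python
# Evaluate the incomplete-data (observed-data) density of a GMM:
# p(x) = sum_k phi_k * N(x | mu_k, Sigma_k). Parameters are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

phi = np.array([0.5, 0.3, 0.2])
mus = [np.zeros(2), np.array([4., 4.]), np.array([-4., 4.])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]

def gmm_density(x):
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(phi, mus, Sigmas))

# Usage: gmm_density(np.array([0.5, -0.2]))
```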

Gaussian Mixture Model

• We are only given $x_1, x_2, \dots, x_N$
• The labels are hidden (latent)
• We aim to best guess $z_1, z_2, \dots, z_N$, with $z_n \in \{1, \dots, K\}$
• $z_n$ is the latent label for a data point $x_n$
• The parameters of this model are
– $\theta = (\phi, \mu_1, \Sigma_1, \mu_2, \Sigma_2, \dots, \mu_K, \Sigma_K)$
– We would like to find the best estimates of these parameters

Latent variable models

• Use the maximum likelihood principle for parameter estimation

• Complete-data likelihood function
– We are given the class labels
– Gaussian classifier
– $p(X, Z) = \prod_{n=1}^{N} p(x_n, z_n)$
– Easy to get the analytical global solution

• Likelihood function (incomplete-data likelihood function)
– $p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} p(x_n, z_k)$
– Hard to get analytical global solutions (sum inside the log)
– Need an iterative optimization algorithm (the EM method)
– EM: an iterative optimization algorithm for problems with latent variables

Problem to be solved

• Why is it hard to find the global solution of the incomplete-data likelihood function?
– use the Gaussian mixture model as an example

• What is the EM algorithm, and why does it work?


– Steps
– Theoretical support

Gaussian Mixture Models
• $L(\Theta) = \ln p(X) = \ln \prod_{n=1}^{N} p(x_n) = \sum_{n=1}^{N} \ln p(x_n) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} p(x_n, z_k)$

  $L(\Theta) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} p(z_k)\, p(x_n \mid z_k) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \varphi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$

• Prediction rule

– $\gamma(z_{nk})$: the posterior probability that cluster $k$ is assigned to a given data point $x_n$
– For a given data point, what is the prior probability that cluster $k$ is assigned to it?

Gaussian Mixture Models
• $L(\Theta) = \ln p(X) = \ln \prod_{n=1}^{N} p(x_n) = \sum_{n=1}^{N} \ln p(x_n) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} p(x_n, z_k)$

  $L(\Theta) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} p(z_k)\, p(x_n \mid z_k) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \varphi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$

• Prediction rule

– $\gamma(z_{nk})$: the posterior probability that cluster $k$ is assigned to a given data point $x_n$
– For a given data point, what is the prior probability that cluster $k$ is assigned to it? ($\varphi_k$)

Gaussian Mixture Models
$L(\Theta) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} p(z_k)\, p(x_n \mid z_k) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \varphi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$

• Let's try to find the globally optimal solution first
– Three types of parameters: the means ($\mu_k$); the covariance matrices ($\Sigma_k$); the mixing coefficients ($\varphi_k$)
– Try to compute the globally optimal solution by setting the partial derivatives with respect to each of these parameters to 0
– Refer to the handwritten materials
– Hard to get the globally optimal solution
> all of the solutions depend on $\gamma(z_{nk})$, the posterior probability of the cluster assignment, and $\gamma(z_{nk})$ itself depends on the three types of parameters in a complex way
– Need an iterative algorithm! (the definition of $\gamma(z_{nk})$ is given below for reference)
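For reference, the posterior responsibility referred to above is the standard Bayes' rule expression (not written out on this slide):

$\gamma(z_{nk}) = p(z_n = k \mid x_n) = \dfrac{\varphi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \varphi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$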

Expectation Maximization (EM) for GMMs

Choose some initial values for the parameters;

Alternate between the following two steps until a stopping condition is met (the change in the log-likelihood function or in the parameters falls below some threshold), similar to KMeans:
• In the E (expectation) step, use the current values of the parameters to calculate the posterior probabilities $\gamma(z_{nk})$
• In the M (maximization) step, re-estimate the parameters ($\mu_k$, $\Sigma_k$, and $\varphi_k$) based on the $\gamma(z_{nk})$ values from the E step
– Use the equations in the handwritten materials (a code sketch of the whole loop follows below)
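Putting the E and M steps together, a hedged NumPy/SciPy sketch of the loop; the initialisation, the small regularisation term, and all names are illustrative choices rather than the handwritten materials:

```python
# A minimal EM sketch for a Gaussian Mixture Model with full covariances.
# Illustrative only; not the unit's reference implementation.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    phi = np.full(K, 1.0 / K)                       # mixing coefficients
    mus = X[rng.choice(N, size=K, replace=False)]   # initial means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk) via Bayes' rule
        dens = np.column_stack([
            phi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])                                          # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate phi_k, mu_k, Sigma_k from the responsibilities
        Nk = gamma.sum(axis=0)
        phi = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        # Stopping condition: change in the log likelihood below a threshold
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return phi, mus, Sigmas, gamma
```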

Example

$P(x_i \mid b) = \dfrac{1}{\sqrt{2\pi\sigma_b^2}} \exp\left(-\dfrac{(x_i - \mu_b)^2}{2\sigma_b^2}\right)$

[Figure: two 1-D Gaussian clusters, labelled a (yellow) and b (blue)]

$P(b \mid x_i) = \dfrac{P(x_i \mid b)\, P(b)}{P(x_i \mid b)\, P(b) + P(x_i \mid a)\, P(a)}$

For each point calculate $P(\text{blue} \mid x)$ and $P(\text{yellow} \mid x)$ (a worked numeric example follows below)
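A worked instance of this calculation with made-up numbers (not from the slides): suppose $\mu_b = 0$, $\sigma_b = 1$, $\mu_a = 4$, $\sigma_a = 1$, $P(a) = P(b) = 0.5$, and $x_i = 1$. Then

$P(x_i \mid b) = \tfrac{1}{\sqrt{2\pi}} e^{-0.5} \approx 0.242, \qquad P(x_i \mid a) = \tfrac{1}{\sqrt{2\pi}} e^{-4.5} \approx 0.0044$

$P(b \mid x_i) \approx \dfrac{0.242 \times 0.5}{0.242 \times 0.5 + 0.0044 \times 0.5} \approx 0.98$

so this point is assigned almost entirely to cluster b.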

Example

$P(x_i \mid b) = \dfrac{1}{\sqrt{2\pi\sigma_b^2}} \exp\left(-\dfrac{(x_i - \mu_b)^2}{2\sigma_b^2}\right)$

[Figure: two 1-D Gaussian clusters, labelled a (yellow) and b (blue)]

$P(b \mid x_i) = \dfrac{P(x_i \mid b)\, P(b)}{P(x_i \mid b)\, P(b) + P(x_i \mid a)\, P(a)}$

[Figure annotations: for some points $P(\text{blue} \mid x) \ll P(\text{yellow} \mid x)$, for others $P(\text{blue} \mid x) \approx 1$]

For each point calculate $P(\text{blue} \mid x)$ and $P(\text{yellow} \mid x)$

Example

$P(x_i \mid b) = \dfrac{1}{\sqrt{2\pi\sigma_b^2}} \exp\left(-\dfrac{(x_i - \mu_b)^2}{2\sigma_b^2}\right)$

[Figure: two 1-D Gaussian clusters, labelled a (yellow) and b (blue)]

$b_i = P(b \mid x_i) = \dfrac{P(x_i \mid b)\, P(b)}{P(x_i \mid b)\, P(b) + P(x_i \mid a)\, P(a)}$

$a_i = P(a \mid x_i) = 1 - b_i$

Update the means and variances:

$\mu_b = \dfrac{b_1 x_1 + b_2 x_2 + \dots + b_n x_n}{b_1 + b_2 + \dots + b_n}$

$\sigma_b^2 = \dfrac{b_1 (x_1 - \mu_b)^2 + \dots + b_n (x_n - \mu_b)^2}{b_1 + b_2 + \dots + b_n}$

Example

(This slide repeats the same E-step and update equations as the previous slide.)

[Figure: the two Gaussian clusters a and b re-fitted to the data after updating the means and variances]

Example

(This slide again repeats the same equations.)

[Figure: the two Gaussians a and b after further EM iterations]

The EM Algorithm: General Case

• Training objective: find the maximum likelihood solution for models having latent variables
– Observed data $X$, latent variables $Z$, set of model parameters $\theta$
– Log-likelihood function
  $\ln p(X \mid \theta) = \ln \sum_{Z} p(X, Z \mid \theta)$

• Algorithm:
– Choose an initial setting for the parameters $\theta^{\text{old}}$
– While convergence is not met:
> E step: evaluate $p(Z \mid X, \theta^{\text{old}})$
> M step: evaluate $\theta^{\text{new}}$ given by
  $\theta^{\text{new}} \leftarrow \arg\max_{\theta} \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta) = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})$
> $\theta^{\text{old}} \leftarrow \theta^{\text{new}}$
The EM Algorithm: General Case

• Questions:
– Why do we use the Q function instead of the log-likelihood function as the objective function in the M step?

– Is each iteration guaranteed to increase the log-likelihood function?

– What is the relationship between the Q function and the log-likelihood function?

The EM Algorithm: General Case

• Why the Q function?
  $Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta)$

– Optimizing $\ln p(X, Z \mid \theta)$ (the complete-data log likelihood) is easy, while optimizing $\ln p(X \mid \theta)$ (the incomplete-data log likelihood) is hard
– So focus on the complete-data likelihood
– Intuitive explanation: the Q function is the expected value of the complete-data log-likelihood function under the posterior distribution of the latent variables ($p(Z \mid X, \theta^{\text{old}})$)

The EM Algorithm: General Case
– Is each iteration guaranteed to increase the log-likelihood function?
– What is the relationship between the Q function and the log-likelihood function?

The EM Algorithm: General Case

– Is each iteration guaranteed to increase the log-likelihood function?
> Yes; for the proof, refer to the textbook (Bishop, Pattern Recognition and Machine Learning)
– What is the relationship between the Q function and the log-likelihood function?
> The Q function is, up to a constant that does not depend on $\theta$, a lower bound on the log-likelihood function (see the decomposition below)
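For reference, the standard decomposition behind this statement (Bishop, Chapter 9; not reproduced on the slide): for any distribution $q(Z)$,

$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\big(q \,\|\, p(Z \mid X, \theta)\big), \qquad \mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}$

Since the KL divergence is non-negative, $\mathcal{L}(q, \theta)$ is a lower bound on $\ln p(X \mid \theta)$; choosing $q(Z) = p(Z \mid X, \theta^{\text{old}})$ makes $\mathcal{L}$ equal to the Q function plus an entropy term that does not depend on $\theta$.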

The hard-EM Algorithm

• Each data point is assigned to the single class with the largest posterior probability
  $Z^{*} = \arg\max_{Z}\, p(Z \mid X, \theta^{\text{old}})$

• There is no expectation over the latent variables $Z$ in the Q function; the M step simply maximizes
  $\ln p(X, Z^{*} \mid \theta)$
  (this is a one-line change to the earlier EM sketch, shown below)
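In terms of the earlier EM sketch, hard-EM changes only the E step: the soft responsibilities are replaced by a one-hot arg-max assignment (illustrative, reusing that sketch's variable names):

```python
# Hard E step: replace the soft responsibilities with a one-hot assignment
# to the cluster with the largest posterior probability.
import numpy as np

def harden(gamma):
    z_star = gamma.argmax(axis=1)               # Z* = argmax_Z p(Z | X, theta_old)
    hard = np.zeros_like(gamma)
    hard[np.arange(len(gamma)), z_star] = 1.0   # one-hot responsibilities
    return hard

# The M step then uses `hard` in place of `gamma`; with equal, fixed spherical
# covariances and equal mixing weights this essentially reduces to KMeans.
```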

EM Algorithm for GMMs with Q function

$Q(\theta^{\text{new}}, \theta^{\text{old}}) := \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta^{\text{new}})$
$\qquad = \sum_{n=1}^{N} \sum_{k=1}^{K} \big(\gamma(z_{nk}) \ln \varphi_k + \gamma(z_{nk}) \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\big)$

• $\gamma(z_{nk})$ is given by the E step

• No sum inside the log

• Easy to optimize

EM Algorithm for GMMs with Q function

$Q(\theta^{\text{new}}, \theta^{\text{old}}) := \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta^{\text{new}})$
$\qquad = \sum_{n=1}^{N} \sum_{k=1}^{K} \big(\gamma(z_{nk}) \ln \varphi_k + \gamma(z_{nk}) \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\big)$

• Maximizing the Q function, we get:
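For reference (these are the standard results; see also the handwritten materials), with $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$:

$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\mathsf{T}}, \qquad \varphi_k = \frac{N_k}{N}$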

