Machine Learning
Lecture 6
Spring 2010
Dr. Jianjun Hu
Outline
The EM Algorithm and Derivation
EM Clustering as a special case of Mixture Modeling
EM for Mixture Estimation
Hierarchical Clustering
Introduction
In the last class, the K-means algorithm for clustering was introduced.
The two steps of K-means, assignment and update, appear frequently in data mining tasks.
In fact, a whole framework under the title "EM Algorithm", where EM stands for Expectation-Maximization, is now a standard part of the data mining toolkit.
A Mixture Distribution
Missing Data
We think of clustering as a problem of estimating missing data.
The missing data are the cluster labels.
Clustering is only one example of a missing data problem.
Several other problems can be formulated as missing data problems.
Missing Data Problem (in clustering)
Let D = {x(1), x(2), …, x(n)} be a set of n observations.
Let H = {z(1), z(2), …, z(n)} be a set of n values of a hidden variable Z; z(i) corresponds to x(i).
Assume Z is discrete; in clustering, z(i) is the label of the cluster that generated x(i).
EM Algorithm
The log-likelihood of the observed data is
$\ell(\Phi) = \log p(D \mid \Phi) = \log \sum_{H} p(D, H \mid \Phi)$
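Maximizing this directly is hard because the sum over H sits inside the log. A brief sketch of the standard derivation, with $q$ an arbitrary distribution over the hidden values $H$:

$\ell(\Phi) = \log \sum_{H} q(H)\, \frac{p(D, H \mid \Phi)}{q(H)} \;\ge\; \sum_{H} q(H) \log \frac{p(D, H \mid \Phi)}{q(H)}$ (Jensen's inequality)

The bound is tight when $q(H) = p(H \mid D, \Phi)$. EM therefore alternates:
E-step: compute $Q(\Phi \mid \Phi^l) = E[\log p(D, H \mid \Phi) \mid D, \Phi^l]$.
M-step: $\Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)$.
Each iteration never decreases $\ell(\Phi)$.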
EM for Mixture Estimation
A mixture density with k components:
$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$
where $G_i$ is the $i$th component (cluster) and $P(G_i)$ its mixture proportion; an observation drawn from the $i$th component is distributed according to that component's parameters $\Phi_i$.
For a Gaussian mixture we estimate a mean and a variance (covariance) for each Gaussian, so $\Phi = \{P(G_i), m_i, S_i\}_{i=1}^{k}$. How do we find $\Phi$?
EM iterates two steps until convergence:
• From the observed data and the current parameters $\Phi^l$, estimate the missing data (expectation step).
• From the estimated missing data and the observed data, find the most likely parameters (maximization step, MLE); this gives the new $\Phi^{l+1}$.
E-step:
$h_i^t = E[z_i^t \mid X, \Phi^l] = P(G_i \mid x^t, \Phi^l) = \dfrac{p(x^t \mid G_i, \Phi^l)\, P(G_i)}{\sum_j p(x^t \mid G_j, \Phi^l)\, P(G_j)}$
M-step:
$P(G_i) = \dfrac{\sum_t h_i^t}{N}$
$m_i^{l+1} = \dfrac{\sum_t h_i^t\, x^t}{\sum_t h_i^t}$
$S_i^{l+1} = \dfrac{\sum_t h_i^t\, (x^t - m_i^{l+1})(x^t - m_i^{l+1})^T}{\sum_t h_i^t}$
Use the estimated soft labels $h_i^t$ in place of the unknown labels $z_i^t$.
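A minimal NumPy sketch of these updates for a Gaussian mixture. The function name em_gaussian_mixture, the iteration count n_iter, and the small ridge (1e-6) added to each covariance for numerical stability are illustrative choices, not from the lecture:

import numpy as np

def em_gaussian_mixture(X, k, n_iter=100, seed=0):
    """EM for a Gaussian mixture: returns priors P(G_i), means m_i, covariances S_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    priors = np.full(k, 1.0 / k)                            # P(G_i)
    means = X[rng.choice(n, k, replace=False)]              # m_i: k random data points
    covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)   # S_i
    for _ in range(n_iter):
        # E-step: h[t, i] = P(G_i | x^t, Phi^l)
        h = np.empty((n, k))
        for i in range(k):
            diff = X - means[i]
            quad = np.sum(diff @ np.linalg.inv(covs[i]) * diff, axis=1)
            norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(covs[i]))
            h[:, i] = priors[i] * np.exp(-0.5 * quad) / norm
        h = (h + 1e-300) / (h + 1e-300).sum(axis=1, keepdims=True)  # guard against underflow
        # M-step: re-estimate priors, means, covariances with the soft labels h
        Ni = h.sum(axis=0)                                  # sum_t h_i^t
        priors = Ni / n
        means = (h.T @ X) / Ni[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Ni[i] + 1e-6 * np.eye(d)
    return priors, means, covs

Each loop body maps one-to-one onto the equations above: the h array holds the E-step posteriors $h_i^t$, and the three M-step lines re-estimate $P(G_i)$, $m_i$, and $S_i$.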
EM and K-means
Notice the similarity between EM for Normal mixtures and
K-means.
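To make the similarity concrete (a sketch of the standard argument): if each covariance is fixed to $S_i = s^2 I$ with equal priors $P(G_i)$, and the soft labels $h_i^t$ are replaced by hard 0/1 labels, the E-step becomes nearest-center assignment and the M-step becomes the centroid update of K-means:

$b_i^t = \begin{cases} 1 & \text{if } i = \arg\min_j \lVert x^t - m_j \rVert \\ 0 & \text{otherwise} \end{cases} \qquad m_i^{l+1} = \dfrac{\sum_t b_i^t x^t}{\sum_t b_i^t}$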
Hierarchical Clustering
Cluster based on similarities (distances) between instances $x^r$ and $x^s$.
Minkowski distance ($L_p$; Euclidean for $p = 2$):
$d_m(x^r, x^s) = \left( \sum_{j=1}^{d} |x_j^r - x_j^s|^p \right)^{1/p}$
City-block distance:
$d_{cb}(x^r, x^s) = \sum_{j=1}^{d} |x_j^r - x_j^s|$
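A small NumPy sketch of the two distance measures (the names minkowski and city_block are illustrative):

import numpy as np

def minkowski(xr, xs, p=2):
    # L_p (Minkowski) distance; p = 2 gives Euclidean distance
    return np.sum(np.abs(xr - xs) ** p) ** (1.0 / p)

def city_block(xr, xs):
    # L_1 (city-block / Manhattan) distance
    return np.sum(np.abs(xr - xs))

Note that city_block(xr, xs) equals minkowski(xr, xs, p=1).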
Agglomerative Clustering
Start with N groups, each containing one instance, and merge the two closest groups at each iteration.
Distance between two groups $G_i$ and $G_j$:
Single-link:
$d(G_i, G_j) = \min_{x^r \in G_i,\; x^s \in G_j} d(x^r, x^s)$
Complete-link:
$d(G_i, G_j) = \max_{x^r \in G_i,\; x^s \in G_j} d(x^r, x^s)$
Average-link and centroid distances are also used; see the sketch below.
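A minimal sketch using SciPy's hierarchical-clustering routines (the data array X and the cut into 3 groups are illustrative; linkage also accepts 'complete', 'average', and 'centroid'):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # 20 instances in d = 2

# Each row of Z records one merge: the two groups joined and their distance.
Z = linkage(X, method='single')

# Cut the hierarchy into 3 groups; labels[t] is the group of instance t.
labels = fcluster(Z, t=3, criterion='maxclust')

scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding dendrogram, as in the example on the next slide.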
Example: Single-Link Clustering
[Figure: single-link merges on a small data set and the corresponding dendrogram]
Choosing k
Defined by the application, e.g., image quantization.
Plot the data (after PCA) and check for clusters.
Incremental (leader-cluster) algorithm: add clusters one at a time until an "elbow" appears in the reconstruction error, log likelihood, or intergroup distances; see the sketch below.
Manual check for meaning.
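A minimal sketch of the elbow check with scikit-learn's KMeans, whose inertia_ attribute is the reconstruction error (the data X and the range of k are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # illustrative data

# Reconstruction error (sum of squared distances to nearest center) for each k.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# Choose the k after which the error stops dropping sharply (the "elbow").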