5 Clustering
Sameer Maskey
Topics for Today
Document Clustering
Classification vs. Clustering
Clusters for Classification
Document Clustering
Which cluster does the new document belong to?
[Figure: two document clusters, Baseball Docs and Hockey Docs, with a new unlabeled document]
Document Clustering
Document Clustering Application
Even though we do not have human labels, automatically induced clusters can still be useful, e.g., news clusters.
Document Clustering Application
How to Cluster Documents with No Labeled Data?
Treat cluster IDs or class labels as hidden variables
Maximize the likelihood of the unlabeled data
Cannot simply count for MLE, as we do not know which point belongs to which class
Use an iterative algorithm such as K-Means or EM
K-Means in Words
K-Means Clustering
Distortion Measure
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2
We want to minimize J
Estimating Parameters
Step 1: Minimize J with respect to r_{nk}, keeping \mu_k fixed
Assign each point to its nearest center: r_{nk} = 1 if k = \arg\min_j \| x_n - \mu_j \|^2, and 0 otherwise
Step 2: Minimize J with respect to \mu_k, keeping r_{nk} fixed
Setting the derivative to zero gives \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}, the mean of the points assigned to cluster k
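Putting the two steps together gives the full K-Means loop. The following is a minimal sketch in Python, assuming NumPy and dense document vectors; the function and variable names are illustrative, not the lecture's code.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-Means sketch: X is an (N, D) array of document vectors."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # initialize centers from K random points
    for _ in range(n_iters):
        # Step 1: with mu fixed, assign each point to its nearest center (choose r_nk)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        assign = dists.argmin(axis=1)
        # Step 2: with r_nk fixed, recompute each center as the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[assign]) ** 2).sum()  # distortion J = sum_n ||x_n - mu_assign(n)||^2
    return assign, mu, J

Neither step can increase the distortion J, so the loop stops once the centers no longer move.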
Document Clustering with K-means
K-Means Example
Mixture of Gaussians with Hidden Variables
p(x) = \sum_z p(x, z) = \sum_z p(z) \, p(x|z)
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)
where \pi_k is the mixing component, \mu_k the mean, and \Sigma_k the covariance of mixture component k
p(x) = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)
For a two-component mixture (y \in \{0, 1\}):
l = \sum_{n=1}^{N} \log \sum_{y=0}^{1} p(x_n, y \mid \pi, \mu, \Sigma)
  = \sum_{n=1}^{N} \log \left( \pi_0 \mathcal{N}(x_n \mid \mu_0, \Sigma_0) + \pi_1 \mathcal{N}(x_n \mid \mu_1, \Sigma_1) \right)
Log-likelihood for Mixture of Gaussians
\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)
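As a concrete illustration, this log-likelihood can be evaluated directly for a given set of parameters. Below is a minimal sketch, assuming NumPy and SciPy; the function name gmm_log_likelihood is illustrative.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, sigmas):
    """log p(X | pi, mu, Sigma) = sum_n log( sum_k pi_k N(x_n | mu_k, Sigma_k) )."""
    K = len(pis)
    # weighted[n, k] = pi_k * N(x_n | mu_k, Sigma_k)
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
        for k in range(K)
    ])
    return float(np.log(weighted.sum(axis=1)).sum())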
Explaining Expectation Maximization
Use the assigned points to recompute \mu and \Sigma for the two Gaussians, but weight the updates with the soft labels (the Maximization step)
Expectation Maximization
An expectation-maximization (EM) algorithm is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models that depend on unobserved hidden variables.
E-Step
\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
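A minimal sketch of the E-step in Python (assuming NumPy and SciPy; gmm_responsibilities is an illustrative name) that computes the responsibilities \gamma(z_{nk}):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_responsibilities(X, pis, mus, sigmas):
    """E-step: gamma[n, k] = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)."""
    K = len(pis)
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
        for k in range(K)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)

Each row of the result sums to one: it is the soft assignment of x_n over the K components.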
Estimating Parameters
M-Step
\mu_k' = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
\Sigma_k' = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k')(x_n - \mu_k')^T
\pi_k' = \frac{N_k}{N}
where N_k = \sum_{n=1}^{N} \gamma(z_{nk})
Iterate until convergence of the log-likelihood
\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)
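These updates can be written as a small M-step routine. The sketch below (NumPy assumed, names illustrative) takes the data and the responsibilities from the E-step and re-estimates the parameters:

import numpy as np

def gmm_m_step(X, gamma):
    """M-step: re-estimate (pi_k, mu_k, Sigma_k) from data X (N, D) and responsibilities gamma (N, K)."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)              # N_k = sum_n gamma(z_nk)
    mus = (gamma.T @ X) / Nk[:, None]   # mu_k' = (1/N_k) sum_n gamma(z_nk) x_n
    sigmas = np.empty((K, D, D))
    for k in range(K):
        diff = X - mus[k]
        # Sigma_k' = (1/N_k) sum_n gamma(z_nk) (x_n - mu_k')(x_n - mu_k')^T
        sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pis = Nk / N                        # pi_k' = N_k / N
    return pis, mus, sigmas

Alternating the E-step (responsibilities) with this M-step, and stopping when the log-likelihood no longer improves, gives the full EM loop.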
EM Iterations
EM iterations [1]
Clustering Documents with EM
Clustering Algorithms
Similarity
Similarity for Words
Edit distance
Insertion, deletion, substitution
Dynamic programming algorithm (see the sketch after this list)
Longest Common Subsequence
Bigram overlap of characters
Similarity based on meaning
WordNet synonyms
Similarity based on collocation
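For the edit-distance bullet above, here is a minimal Python sketch of the standard dynamic programming recurrence (the function name edit_distance is illustrative):

def edit_distance(a, b):
    """Levenshtein distance between strings a and b, via dynamic programming."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                               # i deletions
    for j in range(n + 1):
        dp[0][j] = j                               # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

For example, edit_distance("hockey", "jockey") is 1 (a single substitution).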
Similarity of Text: Surface, Syntax, and Semantics
Cosine Similarity (see the sketch after this list)
Binary Vectors
Multinomial Vectors
Edit distance
Insertion, deletion, substitution
Semantic similarity
Look beyond surface forms
WordNet, semantic classes
Syntactic similarity
Syntactic structure
Tree Kernels
There are many ways to look at similarity, and the choice of metric is important for the type of clustering algorithm we are using
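A minimal sketch of the cosine-similarity item above, for two term-count or tf-idf document vectors (assuming NumPy; the function name is illustrative):

import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between document vectors u and v."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

For non-negative count vectors the value ranges from 0 to 1, with 1 meaning the two documents have identical term proportions.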
Clustering Documents
Automatic Labeling of Clusters
[Figure: example document clusters with automatically assigned labels, Cluster Label 1 and Cluster Label 2]
Clustering Sentences by Topic
Summary