
Statistical Methods for NLP

Document and Topic Clustering, K-Means, Mixture Models, Expectation-Maximization

Sameer Maskey

Week 8, March 2010

1
Topics for Today

• Document, topic clustering
• K-Means
• Mixture models
• Expectation Maximization

2
Document Clustering

• Previously we classified documents into two classes
  - Hockey (Class 1) and Baseball (Class 2)
  - We had human-labeled data
  - Supervised learning
• What if we do not have manually tagged documents?
  - Can we still classify documents?
  - Document clustering
  - Unsupervised learning

3
Classification vs. Clustering

Supervised training of a classification algorithm vs. unsupervised training of a clustering algorithm

4
Clusters for Classification

Automatically found clusters can be used for classification

5
Document Clustering
Which cluster does a new document belong to: the baseball docs cluster or the hockey docs cluster?

6
Document Clustering

• Cluster the documents into N clusters/categories
• For classification we were able to estimate parameters using labeled data
  - Perceptron: find the parameters that decide the separating hyperplane
  - Naïve Bayes: count the number of times each word occurs in the given class and normalize
• It is not evident how to find a separating hyperplane when no labeled data is available
• It is not evident how many classes the data has when we do not have labels

7
Document Clustering Application
• Even though we do not have human labels, automatically induced clusters can be useful, e.g., news clusters

8
Document Clustering Application

A Map of Yahoo!, Mappa.Mundi Magazine, February 2000
Map of the Market with Headlines, Smartmoney [2]

9
How to Cluster Documents with No
Labeled Data?
• Treat cluster IDs or class labels as hidden variables
• Maximize the likelihood of the unlabeled data
• We cannot simply count for MLE, as we do not know which point belongs to which class
• Use an iterative algorithm such as K-Means or EM

10
K-Means in Words

• Parameters to estimate for K classes
• Let us assume we can model this data with a mixture of two Gaussians
• Start with 2 Gaussians (initialize the mu values)
• Compute the distance of each point to the mu of each of the 2 Gaussians and assign it to the closest Gaussian (class label Ck)
• Use the assigned points to recompute mu for the 2 Gaussians

11
K-Means Clustering

Let us define a dataset of N points in D dimensions: \{x_1, x_2, ..., x_N\}

We want to cluster the data into K clusters.

Let \mu_k be a D-dimensional vector representing cluster k.

For each x_n define indicator variables r_{nk} \in \{0, 1\}, k = 1, ..., K, with r_{nk} = 1 if x_n is assigned to cluster k (and 0 otherwise).

12
Distortion Measure

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \|x_n - \mu_k\|^2

J represents the sum of squared distances from each data point to the mean \mu_k of its assigned cluster.

We want to minimize J (a short sketch of computing J follows).

13
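A minimal NumPy sketch of the distortion measure, assuming the data are stored as an (N, D) array and the hard assignments as an (N, K) indicator matrix; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def distortion(X, mu, r):
    """J = sum_n sum_k r_nk * ||x_n - mu_k||^2.

    X  : (N, D) data points
    mu : (K, D) cluster means
    r  : (N, K) hard assignments, r[n, k] = 1 if x_n belongs to cluster k
    """
    # Squared Euclidean distance from every point to every mean: shape (N, K)
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return float((r * sq_dist).sum())
```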
Estimating Parameters

• We can estimate the parameters with a two-step iterative process

Step 1: Minimize J with respect to r_{nk}, keeping \mu_k fixed

Step 2: Minimize J with respect to \mu_k, keeping r_{nk} fixed

14
Step 1: Minimize J with respect to r_{nk}, keeping \mu_k fixed

• Optimize for each n separately by choosing the r_{nk} for the k that gives the minimum \|x_n - \mu_k\|^2

r_{nk} = 1 if k = \arg\min_j \|x_n - \mu_j\|^2
       = 0 otherwise

• Assign each data point to the closest cluster
• This is a hard decision on cluster assignment (see the sketch below)

15
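A small sketch of Step 1, under the same array conventions as the earlier distortion sketch (illustrative names, not from the slides):

```python
import numpy as np

def assign_clusters(X, mu):
    """Step 1: set r_nk = 1 for k = argmin_j ||x_n - mu_j||^2, else 0."""
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    closest = sq_dist.argmin(axis=1)                               # index of nearest mean, (N,)
    r = np.zeros_like(sq_dist)
    r[np.arange(X.shape[0]), closest] = 1.0                        # hard assignment
    return r
```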
Step 2: Minimize J with respect to \mu_k, keeping r_{nk} fixed

• J is quadratic in \mu_k; minimize by setting the derivative w.r.t. \mu_k to zero

\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

• Take all the points assigned to cluster k and re-estimate the mean for cluster k (sketched below, together with the full K-means loop)

16
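A sketch of Step 2 and of the full alternating loop, reusing the assign_clusters function from the previous sketch; the empty-cluster guard and the random initialization are implementation choices of this sketch, not part of the slides:

```python
import numpy as np

def update_means(X, r):
    """Step 2: mu_k = sum_n r_nk x_n / sum_n r_nk (mean of the assigned points)."""
    counts = r.sum(axis=0)                      # N_k for each cluster, shape (K,)
    counts = np.maximum(counts, 1e-12)          # guard against empty clusters
    return (r.T @ X) / counts[:, None]          # (K, D)

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate Step 1 and Step 2 until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize means at K random points
    r = assign_clusters(X, mu)                          # assign_clusters: Step 1 sketch above
    for _ in range(n_iter):
        mu = update_means(X, r)
        r_new = assign_clusters(X, mu)
        if np.array_equal(r_new, r):                    # converged: no point changed cluster
            break
        r = r_new
    return mu, r
```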
Document Clustering with K-means

• Assume we have the data from Homework 1, but with no labels for the hockey and baseball documents
• We want to be able to categorize a new document into one of the 2 classes (K = 2)
• We can represent each document as a feature vector
  - Features can be word IDs or other NLP features such as POS tags, word context, etc. (D = total dimension of the feature vectors)
• N documents are available
• Randomly initialize the 2 class means
• Compute the squared distance of each point x_n (D-dimensional) to the class means \mu_k
• Assign the point to the class k whose \mu_k is closest
• Re-compute \mu_k and iterate (see the sketch below)

17
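A toy end-to-end sketch: the documents, vocabulary, and feature choice (raw word counts) are made up for illustration, and kmeans is the sketch defined after the Step 2 slide:

```python
import numpy as np
from collections import Counter

# Toy documents standing in for the hockey/baseball data (made up for illustration).
docs = [
    "puck goalie ice hockey stick",
    "ice rink goalie puck",
    "pitcher home run baseball field",
    "baseball inning pitcher field",
]

# Build a bag-of-words vocabulary (D = vocabulary size).
vocab = sorted({w for d in docs for w in d.split()})
word_id = {w: i for i, w in enumerate(vocab)}

# Represent each document as a D-dimensional count vector x_n.
X = np.zeros((len(docs), len(vocab)))
for n, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        X[n, word_id[w]] = c

# Cluster into K = 2 classes with the k-means sketch above.
mu, r = kmeans(X, K=2)
print(r.argmax(axis=1))   # cluster ID for each document
```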
K-Means Example

K-means algorithm Illustration [1]


18
Mixture Models

Mixture of Gaussians [1]

• 1 Gaussian may not fit the data
• 2 Gaussians may fit the data better
• Each Gaussian can be a class category
• When labeled data is not available, we can treat the class category as a hidden variable

20
Mixture of Gaussians with Hidden Variables
p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x|z)

p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)

Here \pi_k is the mixing coefficient, N(x \mid \mu_k, \Sigma_k) is a component of the K-component mixture, \mu_k is its mean, and \Sigma_k its covariance.

p(x) = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)

• Mixture models can be linear combinations of other distributions as well
• A mixture of binomial distributions, for example
23
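A small sketch of evaluating the mixture density, with a hand-rolled Gaussian pdf (one could equally use scipy.stats.multivariate_normal); the function names are illustrative:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ inv @ diff)

def mixture_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))
```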
Conditional Probability of Label Given
Data
• A mixture model with parameters \mu, \Sigma and the priors \pi can represent the data
• We can maximize the likelihood of the data given the model parameters to find the best parameters
• If we know the best parameters we can estimate

p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k N(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x \mid \mu_j, \Sigma_j)}

• This essentially gives us the probability of a class given the data, i.e., a label for the given data point (a small sketch follows)
24
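A sketch of the class posterior for a single point, reusing gaussian_pdf from the previous sketch; the 1-D numbers (class means, variances, priors) are made up to mimic the hockey/baseball example:

```python
import numpy as np

def class_posterior(x, pis, mus, Sigmas):
    """p(z_k = 1 | x) = pi_k N(x|mu_k, Sigma_k) / sum_j pi_j N(x|mu_j, Sigma_j)."""
    weighted = np.array([pi * gaussian_pdf(x, mu, S)          # gaussian_pdf: earlier sketch
                         for pi, mu, S in zip(pis, mus, Sigmas)])
    return weighted / weighted.sum()

# Toy 1-D example (numbers made up): x = count of the word 'field' in a document.
pis    = [0.5, 0.5]
mus    = [np.array([30.0]), np.array([50.0])]      # class means
Sigmas = [np.eye(1) * 25.0, np.eye(1) * 25.0]      # class variances
print(class_posterior(np.array([45.0]), pis, mus, Sigmas))
```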
Maximizing Likelihood
• If we had labeled data we could maximize the likelihood simply by counting and normalizing to get the mean and variance of the Gaussians for the given classes

l = \sum_{n=1}^{N} \log p(x_n, y_n \mid \pi, \mu, \Sigma)

l = \sum_{n=1}^{N} \log \pi_{y_n} N(x_n \mid \mu_{y_n}, \Sigma_{y_n})

• If we have two classes C1 and C2
• Let's say we have a feature x: x = number of occurrences of the word 'field'
• And a class label y: y = 1 for hockey documents, y = 2 for baseball documents
• Example (x, y) pairs: (30, 1), (55, 2), (24, 1), (40, 1), (35, 2)
• Find \mu_i and \Sigma_i from the data for both classes: N(\mu_1, \Sigma_1) and N(\mu_2, \Sigma_2) (a small sketch follows)
25
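A sketch of the count-and-normalize MLE for the labeled case, using the small (x, y) sample shown on the slide and treating the single feature as a 1-D Gaussian per class:

```python
import numpy as np

# (x, y) pairs from the slide: x = count of 'field', y = class (1 = hockey, 2 = baseball).
data = [(30, 1), (55, 2), (24, 1), (40, 1), (35, 2)]

for c in (1, 2):
    xs = np.array([x for x, y in data if y == c], dtype=float)
    pi_c = len(xs) / len(data)            # prior: fraction of documents in class c
    mu_c = xs.mean()                      # class mean
    var_c = xs.var()                      # class variance (MLE, divides by N_c)
    print(f"class {c}: pi={pi_c:.2f}, mu={mu_c:.1f}, var={var_c:.1f}")
```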
Maximizing Likelihood for Mixture Model with
Hidden Variables
 For a mixture model with a hidden variable
representing 2 classes, log likelihood is
l = \sum_{n=1}^{N} \log p(x_n \mid \pi, \mu, \Sigma)

l = \sum_{n=1}^{N} \log \sum_{y=0}^{1} p(x_n, y \mid \pi, \mu, \Sigma)

= \sum_{n=1}^{N} \log \left( \pi_0 N(x_n \mid \mu_0, \Sigma_0) + \pi_1 N(x_n \mid \mu_1, \Sigma_1) \right)

26
Log-likelihood for Mixture of Gaussians

\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k) \right)

• We want to maximize the above log-likelihood function to find the best parameters, i.e., those that maximize the likelihood of the data given the model
• We can again use an iterative process to maximize this log-likelihood (a sketch of computing it follows)
• This 2-step iterative process is called Expectation-Maximization

27
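A one-function sketch of the log-likelihood, reusing mixture_density (and hence gaussian_pdf) from the earlier mixture sketch:

```python
import numpy as np

def mixture_log_likelihood(X, pis, mus, Sigmas):
    """log p(X | pi, mu, Sigma) = sum_n log( sum_k pi_k N(x_n | mu_k, Sigma_k) )."""
    return float(sum(np.log(mixture_density(x, pis, mus, Sigmas)) for x in X))
```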
Explaining Expectation Maximization

• EM is like fuzzy K-means
• Parameters to estimate for K classes
• Let us assume we can model this data with a mixture of two Gaussians (K = 2)
• Start with 2 Gaussians (initialize the mu and sigma values)
• Expectation: compute the distance of each point to the mu of each of the 2 Gaussians and assign it a soft class label (Ck)
• Maximization: use the assigned points to recompute mu and sigma for the 2 Gaussians, but weight the updates with the soft labels

28
Expectation Maximization
An expectation-maximization (EM) algorithm is used in statistics for
finding maximum likelihood estimates of parameters in
probabilistic models, where the model depends on unobserved
hidden variables.

EM alternates between performing an expectation (E) step, which


computes an expectation of the likelihood by including the latent
variables as if they were observed, and a maximization (M) step,
which computes the maximum likelihood estimates of the
parameters by maximizing the expected likelihood found on the E
step. The parameters found on the M step are then used to begin
another E step, and the process is repeated.

The EM algorithm was explained and given its name in a classic
1977 paper by A. Dempster, N. Laird, and D. Rubin in the Journal of the
Royal Statistical Society.
29
Estimating Parameters
• E-Step (sketched below)

\gamma(z_{nk}) = E[z_{nk} \mid x_n] = p(z_k = 1 \mid x_n)

\gamma(z_{nk}) = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n \mid \mu_j, \Sigma_j)}

30
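A sketch of the E-step, reusing gaussian_pdf from the earlier mixture sketch; the loop over points is kept explicit for clarity rather than vectorized:

```python
import numpy as np

def e_step(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)."""
    N, K = len(X), len(pis)
    gamma = np.zeros((N, K))
    for n, x in enumerate(X):
        weighted = np.array([pi * gaussian_pdf(x, mu, S)      # gaussian_pdf: earlier sketch
                             for pi, mu, S in zip(pis, mus, Sigmas)])
        gamma[n] = weighted / weighted.sum()                  # normalize responsibilities
    return gamma
```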
Estimating Parameters
• M-step (sketched below)

\mu_k' = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n

\Sigma_k' = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k')(x_n - \mu_k')^T

\pi_k' = \frac{N_k}{N}, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})

• Iterate until convergence of the log likelihood

\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k) \right)
31
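A sketch of the M-step and of the overall EM loop, reusing e_step, gaussian_pdf and mixture_log_likelihood from the earlier sketches; the ridge term on the covariances and the initialization scheme are stabilization choices of this sketch, not part of the slides:

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate pi_k, mu_k, Sigma_k from the responsibilities gamma[n, k]."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                                 # effective counts per cluster
    pis = Nk / N
    mus = (gamma.T @ X) / Nk[:, None]
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                                  # (N, D)
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k]
                      + 1e-6 * np.eye(D))                  # small ridge for numerical stability
    return pis, mus, np.array(Sigmas)

def em(X, K, n_iter=100, tol=1e-6, seed=0):
    """Alternate E and M steps until the log-likelihood stops improving."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]          # initialize means at random points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        gamma = e_step(X, pis, mus, Sigmas)                # E-step (sketched above)
        pis, mus, Sigmas = m_step(X, gamma)                # M-step
        ll = mixture_log_likelihood(X, pis, mus, Sigmas)   # convergence check on log-likelihood
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas, gamma
```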
EM Iterations

EM iterations [1]
32
Clustering Documents with EM

• Clustering documents requires representing the documents with a set of features
  - The set of features can be a bag-of-words model
  - Or features such as POS, word similarity, number of sentences, etc.
• Can we use a mixture of Gaussians for any kind of features?
• How about a mixture of multinomials for document clustering?
• How do we derive the EM algorithm for a mixture of multinomials?

33
Clustering Algorithms

• We just described two kinds of clustering algorithms
  - K-means
  - Expectation Maximization
• Expectation-Maximization is a general way to maximize the log likelihood for distributions with hidden variables
  - For example, in EM for HMMs the state sequences were hidden
• For document clustering, other kinds of clustering algorithms exist

39
Similarity

• While clustering documents we are essentially finding 'similar' documents
• How we compute similarity makes a difference in the performance of the clustering algorithm
• Some similarity metrics (compared in the sketch below)
  - Euclidean distance
  - Cross entropy
  - Cosine similarity
• Which similarity metric should we use?

40
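A small sketch comparing Euclidean distance and cosine similarity on two toy count vectors (made up); it illustrates that cosine similarity ignores document length while Euclidean distance does not:

```python
import numpy as np

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy bag-of-words count vectors: the second is the first doubled.
a = np.array([2.0, 0.0, 1.0, 3.0])
b = np.array([4.0, 0.0, 2.0, 6.0])
print(euclidean_distance(a, b))   # large: sensitive to document length
print(cosine_similarity(a, b))    # 1.0: identical word distribution
```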
Similarity for Words

• Edit distance (see the sketch below)
  - Insertion, deletion, substitution
  - Dynamic programming algorithm
• Longest Common Subsequence
• Bigram overlap of characters
• Similarity based on meaning
  - WordNet synonyms
• Similarity based on collocation

41
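A sketch of the dynamic-programming edit distance (Levenshtein distance; the unit costs for insertion, deletion and substitution are an assumption of this sketch):

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n]

print(edit_distance("hockey", "jockey"))   # 1
```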
Similarity of Text : Surface, Syntax and
Semantics
• Cosine similarity
  - Binary vectors
  - Multinomial vectors
• Edit distance
  - Insertion, deletion, substitution
• Semantic similarity
  - Look beyond surface forms
  - WordNet, semantic classes
• Syntactic similarity
  - Syntactic structure
  - Tree kernels
• There are many ways to look at similarity, and the choice of metric is important for the type of clustering algorithm we are using

42
Clustering Documents

• Represent documents as feature vectors

• Decide on a similarity metric for computing similarity across feature vectors

• Use an iterative algorithm that maximizes the log-likelihood of a model whose hidden variables represent the cluster IDs

43
Automatic Labeling of Clusters

• How do you automatically label the clusters?
• For example, how do you find the headline that represents the news pieces in a given topic?
• One possible way is to find the sentence most similar to the centroid of the cluster (a small sketch follows)

44
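A sketch of labeling a cluster by its most central sentence, assuming each sentence has already been mapped to a vector (e.g., bag-of-words counts); the function name is illustrative:

```python
import numpy as np

def label_cluster(sentence_vectors, sentences):
    """Pick the sentence whose vector is most cosine-similar to the cluster centroid."""
    X = np.array(sentence_vectors, dtype=float)
    centroid = X.mean(axis=0)
    sims = X @ centroid / (np.linalg.norm(X, axis=1) * np.linalg.norm(centroid))
    return sentences[int(np.argmax(sims))]
```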
Clustering Sentences by Topic

• We can cluster documents, sentences, or any segment of text
• Similarity across text segments can take account of topic similarity
• We can still use our unsupervised clustering algorithms based on K-means or EM
  - Similarity needs to be computed at the sentence level
• Useful for summarization, question answering, text categorization

45
Summary

• Unsupervised clustering algorithms
  - K-means
  - Expectation Maximization
• EM is a general algorithm that can be used for maximum likelihood estimation in models with hidden variables
• The similarity metric is important when clustering segments of text

46
