
Flat Clustering

Adapted from slides by Prabhakar Raghavan, Christopher Manning, Ray Mooney and Soumen Chakrabarti

Today's Topic: Clustering
- Document clustering
  - Motivations
  - Document representations
  - Success criteria
- Clustering algorithms
  - Partitional
  - Hierarchical

What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects
- The most common form of unsupervised learning
  - Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task that finds many applications in IR and other places

Why cluster documents?
- Whole corpus analysis/navigation
  - Better user interface
- For improving recall in search applications
  - Better search results
- For better navigation of search results
  - Effective "user recall" will be higher
- For speeding up vector space retrieval
  - Faster search

Yahoo! Hierarchy (www.yahoo.com/Science)
[Figure: a topic tree under Science (~30 subcategories), with labels including agriculture, biology, physics, CS, space, dairy, crops, botany, cell, AI, courses, craft, magnetism, forestry, agronomy, evolution, HCI, missions, relativity]

Scatter/Gather: Cutting, Karger, and Pedersen

For visualizing a document collection and its themes
- Wise et al., "Visualizing the non-visual", PNNL
- ThemeScapes, Cartia
  - [Mountain height = cluster size]

For improving search recall
- Cluster hypothesis: documents with similar text are related
- Therefore, to improve search recall:
  - Cluster docs in the corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
- Example: the query "car" will also return docs containing "automobile"
  - Because clustering grouped together docs containing "car" with those containing "automobile"
  - Why might this happen?

For better navigation of search results
- For grouping search results thematically
  - clusty.com / Vivisimo

Issues for clustering
- Representation for clustering
  - Document representation
    - Vector space? Normalization?
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
    - Avoid "trivial" clusters - too large or small

What makes docs "related"?
- Ideal: semantic similarity
- Practical: statistical similarity
  - Docs as vectors
  - For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs
  - We will use cosine similarity (a small sketch follows)

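As a concrete illustration, here is a minimal cosine-similarity sketch in Python/NumPy; the vectors and values are illustrative, not from the slides.

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two document vectors (1.0 = same direction)."""
    denom = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(np.dot(d1, d2) / denom) if denom else 0.0

# Illustrative tf-idf-style vectors over a tiny 4-term vocabulary
doc_a = np.array([0.5, 0.0, 1.2, 0.3])
doc_b = np.array([0.4, 0.1, 1.0, 0.0])
print(cosine_similarity(doc_a, doc_b))  # ~0.97: the docs point in nearly the same direction
```
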
Clustering Algorithms
- Partitional algorithms
  - Usually start with a random (partial) partition
  - Refine it iteratively
    - K-means clustering
    - Model-based clustering
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - Top-down, divisive

Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters
- Given: a set of documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
  - Globally optimal: exhaustively enumerate all partitions
  - Effective heuristic methods: K-means and K-medoids algorithms

K-Means
- Assumes documents are real-valued vectors.
- Clusters are based on centroids (aka the center of gravity or mean) of the points in a cluster, c.
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities)

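For reference (this formula is implicit in the slides): the centroid μ(c) used below is the componentwise mean of a cluster's member vectors, μ(c) = (1/|c|) Σ_{x ∈ c} x.
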
K-Means Algorithm
Select K random docs {s1, s2, ..., sK} as seeds.
Until clustering converges (or another stopping criterion is met):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Next, update the seeds to the centroid of each cluster)
  For each cluster cj:
    sj = μ(cj)

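A minimal runnable sketch of this algorithm in Python/NumPy. It assumes documents are rows of a dense (n, m) array and uses Euclidean distance; the function name and defaults are my own, not from the slides.

```python
import numpy as np

def kmeans(docs, K, max_iters=100, seed=0):
    """Basic K-means; docs is an (n, m) array of document vectors."""
    rng = np.random.default_rng(seed)
    # Select K random docs {s1, ..., sK} as seeds
    centroids = docs[rng.choice(len(docs), size=K, replace=False)].copy()
    assignments = None
    for _ in range(max_iters):
        # Assign each doc di to the cluster cj whose centroid sj is closest
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # doc partition unchanged -> converged
        assignments = new_assignments
        # Update each seed sj to the centroid (mean) of cluster cj
        for j in range(K):
            members = docs[assignments == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return assignments, centroids

# Example usage with random data (illustrative only)
docs = np.random.default_rng(1).random((100, 50))
labels, cents = kmeans(docs, K=3)
```

The loop stops when the assignment of docs to clusters stops changing, one of the termination conditions discussed on a later slide.
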
K-Means Example (K = 2)
[Figure: animation of K-means on 2-D points - pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged!]

Termination conditions
- Several possibilities, e.g.,
  - A fixed number of iterations.
  - Doc partition unchanged.
  - Centroid positions don't change.
- Does this mean that the docs in a cluster are unchanged?

Convergence
- Why should the K-means algorithm ever reach a fixed point?
  - A state in which clusters don't change.
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
  - EM is known to converge.
  - The number of iterations could be large.

Convergence of K-Means
- Define the goodness measure of cluster k (note the lower-case k: it indexes a single cluster) as the sum of squared distances from the cluster centroid:
  - Gk = Σi (di - ck)²   (sum over all di in cluster k)
  - G = Σk Gk
- Reassignment monotonically decreases G, since each vector is assigned to the closest centroid. (A small helper for computing G is sketched below.)

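A small helper computing this goodness measure; it reuses the array layout of the `kmeans` sketch above and is illustrative only.

```python
import numpy as np

def goodness(docs, assignments, centroids):
    """G = sum over clusters k of sum_{di in k} ||di - ck||^2 (lower is better)."""
    diffs = docs - centroids[assignments]  # each doc minus its own cluster's centroid
    return float(np.sum(diffs ** 2))
```
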
Time Complexity
- Computing the distance between two docs is O(m), where m is the dimensionality of the vectors.
- Reassigning clusters: O(Kn) distance computations, i.e. O(Knm).
- Computing centroids: each doc gets added once to some centroid: O(nm).
- Assume these two steps are each done once per iteration, for I iterations: O(IKnm).

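For a sense of scale (illustrative numbers, not from the slides): with n = 100,000 docs, m = 10,000 dimensions, K = 100 clusters and I = 10 iterations, reassignment alone costs on the order of I·K·n·m = 10¹² vector-component operations, which is why sparse document vectors and reduced dimensionality matter in practice.
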
Seed Choice
- Results can vary based on random seed selection.
- Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  - Select good seeds using a heuristic (e.g., a doc least similar to any existing mean)
  - Try out multiple starting points (sketched below)
  - Initialize with the results of another method.
- Example showing sensitivity to seeds: in the accompanying figure (six points A-F), starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.

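One simple way to act on the "multiple starting points" suggestion, assuming the `kmeans` and `goodness` sketches above: run several random restarts and keep the clustering with the lowest G.

```python
def best_of_restarts(docs, K, restarts=10):
    """Run K-means from several random seeds and keep the lowest-G clustering."""
    best = None
    for s in range(restarts):
        assignments, centroids = kmeans(docs, K, seed=s)  # kmeans() from the earlier sketch
        g = goodness(docs, assignments, centroids)        # goodness() from the earlier sketch
        if best is None or g < best[0]:
            best = (g, assignments, centroids)
    return best  # (G, assignments, centroids) of the best run
```
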
How Many Clusters?
- Number of clusters K is given
  - Partition n docs into a predetermined number of clusters
- Finding the "right" number of clusters is part of the problem
  - Given docs, partition them into an "appropriate" number of subsets.
  - E.g., for query results - the ideal value of K is not known up front - though the UI may impose limits.

K not specified in advance
- Say, the results of a query.
- Solve an optimization problem: penalize having lots of clusters
  - Application dependent, e.g., a compressed summary of the search results list.
- Tradeoff between having more clusters (better focus within each cluster) and having too many clusters

K not specified in advance
- Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid
- Define the Total Benefit to be the sum of the individual doc Benefits.
- Why is there always a clustering of Total Benefit n?

Penalize lots of clusters
- For each cluster, we have a Cost C.
- Thus for a clustering with K clusters, the Total Cost is KC.
- Define the Value of a clustering to be Total Benefit - Total Cost.
- Find the clustering of highest Value, over all choices of K. (A sketch of this selection loop follows.)
  - Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much". The Cost term enforces this.

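A sketch of this model-selection idea, reusing the earlier `kmeans` sketch; the per-cluster cost C is an application-dependent knob, and the default value below is purely illustrative.

```python
import numpy as np

def total_benefit(docs, assignments, centroids):
    """Total Benefit = sum over docs of the cosine similarity to their own centroid."""
    total = 0.0
    for d, k in zip(docs, assignments):
        denom = np.linalg.norm(d) * np.linalg.norm(centroids[k])
        total += float(d @ centroids[k] / denom) if denom else 0.0
    return total

def choose_K(docs, K_values, C=0.5):
    """Pick the K maximizing Value = Total Benefit - K * C."""
    best = None
    for K in K_values:
        assignments, centroids = kmeans(docs, K)  # kmeans() from the earlier sketch
        value = total_benefit(docs, assignments, centroids) - K * C
        if best is None or value > best[0]:
            best = (value, K)
    return best  # (best Value, chosen K)
```
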
K-means issues, variations, etc.
- Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve the speed of convergence of K-means
- Assumes clusters are spherical in vector space
  - Sensitive to coordinate changes, weighting, etc.
- Disjoint and exhaustive
  - Doesn't have a notion of "outliers"
