6 Text Clustering
Information Retrieval (3170718)
Prepared by:
Ms. Twinkle P. Kosambia
Computer Engineering Department
C. K. Pithawalla College of Engineering and Technology
Classification has labels, so it needs training and testing datasets (the testing dataset verifies the model created); clustering has no labels, so there is no need for training.
What is clustering?
Clustering: the process of grouping a set of objects
into classes of similar objects
Documents within a cluster should be similar.
Documents from different clusters should be
dissimilar.
The commonest form of unsupervised learning
Unsupervised learning = learning from raw (unlabelled) data, as
opposed to supervised learning, where a classification of the
examples is given
A common and important task that finds many
applications in IR and other places
How would you design an algorithm for finding the three clusters in this case? (Figure: a 2-D scatterplot of points falling into three visible groups.)
Applications of clustering in IR
Whole corpus analysis/navigation
Better user interface: search without typing
For improving recall in search applications
Better search results (like pseudo RF)
For better navigation of search results
Effective “user recall” will be higher
For speeding up vector space retrieval
Cluster-based retrieval gives faster search
Notion of similarity/distance
Ideal: semantic similarity.
Practical: term-statistical similarity
We will use cosine similarity.
Docs as vectors.
For many algorithms, easier to think in
terms of a distance (rather than similarity)
between docs.
We will mostly speak of Euclidean distance
But real implementations use cosine similarity
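As a minimal illustration (not part of the slides), here is a plain-Python sketch of cosine similarity between two hypothetical term-weight vectors; the vectors and numbers are made up:

    import math

    def cosine_similarity(x, y):
        # x, y: equal-length lists of term weights (e.g., tf-idf values)
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        if norm_x == 0 or norm_y == 0:
            return 0.0
        return dot / (norm_x * norm_y)

    # Two hypothetical 4-term document vectors
    d1 = [2, 0, 1, 3]
    d2 = [1, 1, 0, 2]
    print(cosine_similarity(d1, d2))   # ≈ 0.87

A corresponding "distance" is often taken as 1 – similarity.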
Clustering Algorithms
Flat algorithms
Usually start with a random (partial) partitioning
Refine it iteratively
K-means clustering
(Model based clustering)
Hierarchical algorithms
Bottom-up, agglomerative
(Top-down, divisive)
Partitioning Algorithms
Partitioning method: Construct a partition of n
documents into a set of K clusters
Given: a set of documents and the number K
Find: a partition of K clusters that optimizes the
chosen partitioning criterion
Globally optimal solution: intractable for many objective functions
(it would require exhaustively enumerating all partitions)
Effective heuristic methods: the K-means and K-medoids algorithms
See also Kleinberg NIPS 2002 – impossibility for natural clustering
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity
or mean) of the points in a cluster c:
μ(c) = (1/|c|) Σ_{d ∈ c} d
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Next, update the seeds to the centroid of each cluster)
  For each cluster cj:
    sj = μ(cj)
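A minimal plain-Python sketch of the algorithm above (an illustrative assumption, not the slides' code): it uses squared Euclidean distance and a fixed iteration count as the stopping criterion; a real IR implementation would use cosine similarity on sparse tf-idf vectors.

    import random

    def k_means(docs, K, iterations=20, seed=None):
        # docs: list of equal-length real-valued vectors (lists of floats)
        rng = random.Random(seed)
        centroids = [list(d) for d in rng.sample(docs, K)]   # K random docs as seeds
        clusters = [[] for _ in range(K)]
        for _ in range(iterations):                          # fixed-iteration stopping criterion
            # Reassignment step: attach each doc to its nearest centroid
            clusters = [[] for _ in range(K)]
            for d in docs:
                j = min(range(K),
                        key=lambda k: sum((a - b) ** 2 for a, b in zip(d, centroids[k])))
                clusters[j].append(d)
            # Recomputation step: move each centroid to the mean of its cluster
            for k, members in enumerate(clusters):
                if members:
                    centroids[k] = [sum(vals) / len(members) for vals in zip(*members)]
        return centroids, clusters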
K-Means Example (K = 2)
(Figure: animation on 2-D points. Pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged!)
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
Convergence
Why should the K-means algorithm ever reach a
fixed point?
A state in which clusters don’t change.
K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
EM is known to converge.
Number of iterations could be large.
But in practice usually isn’t
Convergence of K-Means
Define the goodness measure of cluster k as the sum of
squared distances from the cluster centroid:
G_k = Σ_i (d_i – c_k)²   (sum over all d_i in cluster k)
G = Σ_k G_k
Reassignment monotonically decreases G since
each vector is assigned to the closest centroid.
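For illustration, G can be computed directly from the output of the k_means sketch above (a hypothetical helper, not from the slides); tracking it between iterations is one way to check that the partition has stopped changing:

    def rss(clusters, centroids):
        # G = sum over clusters k of the squared distances of members d_i to centroid c_k
        return sum(
            sum((a - b) ** 2 for a, b in zip(d, centroids[k]))
            for k, members in enumerate(clusters)
            for d in members
        )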
Convergence of K-Means
Recomputation monotonically decreases each G_k
since (m_k is the number of members in cluster k):
Σ (d_i – a)² reaches its minimum for:
Σ –2(d_i – a) = 0
Σ d_i = Σ a = m_k a
a = (1/m_k) Σ d_i = c_k
K-means typically converges quickly
Time Complexity
Computing distance between two docs is O(M)
where M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations,
or O(KNM).
Computing centroids: Each doc gets added once to
some centroid: O(NM).
Assume these two steps are each done once for I
iterations: O(IKNM).
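As a rough, hypothetical illustration of these bounds (the numbers are not from the slides): with N = 100,000 docs, K = 10 clusters, M = 10,000 dimensions and I = 25 iterations, reassignment alone costs about I·K·N·M = 2.5 × 10¹¹ component operations, which is why sparse vectors and cheap distance computations matter in practice.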
Seed Choice
Results can vary based on random seed selection. (Figure: example showing sensitivity to seeds.)
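One common mitigation (sketched here as an assumption, reusing the hypothetical k_means and rss helpers above) is to run the algorithm from several random seeds and keep the clustering with the lowest G:

    def k_means_best_of(docs, K, restarts=10):
        # Run k_means from several random seeds; keep the result with the lowest G
        best = None
        for seed in range(restarts):
            centroids, clusters = k_means(docs, K, seed=seed)
            g = rss(clusters, centroids)
            if best is None or g < best[0]:
                best = (g, centroids, clusters)
        return best   # (G, centroids, clusters)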
Hierarchical Clustering
Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
(Example dendrogram: "animal" splits into "vertebrate" and "invertebrate".)
Hierarchical Clustering
The number of dendrograms with n leaves is (2n – 3)! / [2^(n–2) · (n – 2)!]. Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this bottom-up, as described below.

Number of leaves | Number of possible dendrograms
       2         |            1
       3         |            3
       4         |           15
       5         |          105
       …         |            …
      10         |   34,459,425

Bottom-Up (agglomerative):
Starting with each item in its own
cluster, find the best pair to merge
into a new cluster. Repeat until all
clusters are fused together.
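As a quick check of the formula for n = 4 leaves: (2·4 – 3)! / [2^(4–2) · (4 – 2)!] = 5! / (4 · 2) = 120 / 8 = 15, matching the table.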
(Figure: agglomerative merging animation. At each step, consider all possible merges and choose the best pair to merge; see the sketch below.)
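A naive plain-Python sketch of this merge loop (an illustration, assuming cluster_sim is any function that scores the similarity of two clusters, e.g. by minimum or average pairwise cosine similarity); it rescans all cluster pairs at every step, which is what makes the naive method cubic in N:

    def naive_hac(docs, cluster_sim):
        clusters = [[d] for d in docs]          # start: each item in its own cluster
        merges = []                             # record of merges (the dendrogram, in effect)
        while len(clusters) > 1:
            # Consider all possible merges and choose the best (most similar) pair
            i, j = max(
                ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                key=lambda ab: cluster_sim(clusters[ab[0]], clusters[ab[1]]),
            )
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]                     # j > i, so index i is unaffected
        return merges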
Note: the resulting clusters are still “hard” and induce a partition
(Figure: iterations 1, 2 and 3 of agglomerative clustering.)
• Builds up a sequence of clusters ("hierarchical"), which can be drawn as a dendrogram.
Cluster Distances
Complete Link
Use minimum similarity of pairs:
sim(c_i, c_j) = min_{x ∈ c_i, y ∈ c_j} sim(x, y)
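As a small illustration (a hypothetical helper, usable as the cluster_sim argument of the naive_hac sketch above, with sim being e.g. the cosine_similarity function from earlier):

    def complete_link_sim(ci, cj, sim):
        # Complete link: score two clusters by the similarity of their *least* similar cross pair
        return min(sim(x, y) for x in ci for y in cj)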
Computational Complexity
In the first iteration, all HAC methods need to
compute the similarity of all pairs of the N initial instances,
which is O(N²).
In each of the subsequent N – 2 merging iterations,
compute the distance between the most recently
created cluster and all other existing clusters.
In order to maintain an overall O(N²) performance,
computing the similarity to each other cluster must be
done in constant time.
Often O(N³) if done naively, or O(N² log N) if done more
cleverly
Group Average
Similarity of two clusters = average similarity of all pairs
within merged cluster.
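Written out (the standard definition, not shown explicitly above), with the average taken over all pairs in the merged cluster, excluding self-pairs:

sim_GA(c_i, c_j) = 1 / (|c_i ∪ c_j| · (|c_i ∪ c_j| – 1)) · Σ_{x ∈ c_i ∪ c_j} Σ_{y ∈ c_i ∪ c_j, y ≠ x} sim(x, y)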
Purity example

                                    Same cluster   Different clusters
Same class in ground truth               20               24
Different classes in ground truth        20               72
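If these counts are read as the usual pair-counting table (20 pairs in the same cluster and same class, 24 in different clusters but the same class, 20 in the same cluster but different classes, 72 in different clusters and different classes), then the fraction of correctly grouped pairs is (20 + 72) / (20 + 24 + 20 + 72) = 92 / 136 ≈ 0.68. (This pair-counting reading is an assumption, not stated on the slide.)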
EM Algorithm: E-step
• Start with parameters describing each cluster
• Mean μ_c, covariance Σ_c, "size" π_c
• E-step ("Expectation")
  – For each datum (example) x_i,
  – compute "r_ic", the probability that it belongs to cluster c
    • Compute its probability under model c
    • Normalize to sum to one (over clusters c)
  – If x_i is very likely under the c-th Gaussian, it gets high weight
  – The denominator just makes the r's sum to one
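In symbols (the standard Gaussian-mixture form, written here in the slide's notation as an illustration), with N(·) denoting the Gaussian density of model c:

r_ic = π_c · N(x_i ; μ_c, Σ_c) / Σ_{c′} π_{c′} · N(x_i ; μ_{c′}, Σ_{c′})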
EM Algorithm: M-step
• Start with the assignment probabilities r_ic
• Update parameters: mean μ_c, covariance Σ_c, "size" π_c
• M-step ("Maximization")
  – For each cluster (Gaussian) c,
  – update its parameters using the (weighted) data points
Total responsibility allocated to cluster c
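Written out (the standard Gaussian-mixture updates, in the slide's notation, shown as an illustration), with N_c = Σ_i r_ic being the total responsibility allocated to cluster c and N the number of data points:

π_c = N_c / N
μ_c = (1 / N_c) · Σ_i r_ic · x_i
Σ_c = (1 / N_c) · Σ_i r_ic · (x_i – μ_c)(x_i – μ_c)^T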
Expectation-Maximization
• Each step increases the log-likelihood of our model
• What should we do
– If we want to choose a single cluster for an “answer”?
– With new data we didn’t see during training?