L12: Flat Clustering
Today’s Topic: Clustering
Document clustering
Motivations
Document representations
Success criteria
Clustering algorithms
Partitional
Hierarchical
What is clustering?
Clustering: the process of grouping a set
of objects into classes of similar objects
The most common form of unsupervised
learning
Unsupervised learning = learning from raw
data, as opposed to supervised learning, where
a classification of the examples is given
A common and important task that finds
many applications in IR and other places
Why cluster documents?
Whole corpus analysis/navigation: better user interface
For improving recall in search applications: better search results
For better navigation of search results: effective “user recall” will be higher
For speeding up vector space retrieval: faster search
Yahoo! Hierarchy
[Figure: a slice of the Yahoo! directory hierarchy, www.yahoo.com/Science]
Scatter/Gather: Cutting, Karger, and Pedersen
For visualizing a document collection and its themes
Wise et al., “Visualizing the Non-Visual”, PNNL
ThemeScapes, Cartia
[Figure: themescape; mountain height = cluster size]
For improving search recall
Cluster hypothesis: documents with similar text are related.
Therefore, to improve search recall:
Cluster docs in the corpus a priori.
When a query matches a doc d, also return other docs in the cluster containing d.
Why might this happen? E.g., the query “car” may also retrieve relevant docs containing “automobile”, because clustering grouped them together.
For better navigation of search results
For grouping search results thematically
clusty.com / Vivisimo
Issues for clustering
Representation for clustering
Document representation
Vector space? Normalization?
Need a notion of similarity/distance
How many clusters?
Fixed a priori?
Completely data driven?
Avoid “trivial” clusters - too large or small
What makes docs “related”?
Ideal: semantic similarity.
Practical: statistical similarity.
Docs as vectors.
For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
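To make “statistical similarity” concrete, here is a minimal sketch (ours, not from the lecture) of cosine similarity between two hypothetical tf-idf vectors, in Python with numpy:

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between two doc vectors; 1.0 = same direction.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    d1 = np.array([0.5, 0.0, 1.2])  # hypothetical tf-idf weights for doc 1
    d2 = np.array([0.4, 0.1, 1.0])  # hypothetical tf-idf weights for doc 2
    print(cosine_similarity(d1, d2))

For unit-length vectors the two views coincide: ||a − b||² = 2 − 2·cos(a, b), so ranking by Euclidean distance is equivalent to ranking by cosine similarity.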
Clustering Algorithms
Partitional algorithms
Usually start with a random (partial)
partition
Refine it iteratively
K-means clustering
Model-based clustering
Hierarchical algorithms
Bottom-up, agglomerative
Top-down, divisive
Partitioning Algorithms
Partitioning method: Construct a partition of n
documents into a set of K clusters
Given: a set of documents and the number K
Find: a partition of K clusters that optimizes
the chosen partitioning criterion
Globally optimal: exhaustively enumerate
all partitions
Effective heuristic methods: K-means and
K-medoids algorithms
K-Means
Assumes documents are real-valued vectors.
Clusters are based on the centroids (aka the center of gravity, or mean) of the points in a cluster c:
μ(c) = (1/|c|) Σ_{x ∈ c} x
Reassignment of instances to clusters is based on distance to the current cluster centroids.
(Or one can equivalently phrase it in terms of similarities.)
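As a one-line illustration of the centroid definition (assuming a hypothetical numpy array `docs` holding one cluster’s vectors, one row per doc):

    import numpy as np

    docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # hypothetical cluster members
    centroid = docs.mean(axis=0)  # mu(c) = (1/|c|) * sum of member vectors
    print(centroid)  # [0.6667 0.6667]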
K-Means Algorithm
Select K random docs {s1, s2,… sK} as seeds.
Until clustering converges or other stopping
criterion:
For each doc di:
Assign di to the cluster cj such that dist(xi, sj) is
minimal.
(Update the seeds to the centroid of each cluster)
For each cluster cj
sj = (cj)
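A compact sketch of this algorithm in Python/numpy. It is an illustration under our own assumptions (Euclidean distance, random seeds, rows of `X` as doc vectors), not the lecture’s reference implementation:

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        """Cluster the rows of X (n x m) into K clusters; return (assignments, centroids)."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)]  # K random docs as seeds
        for _ in range(max_iters):
            # Reassignment: each doc goes to the nearest current centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, K)
            assign = dists.argmin(axis=1)
            # Update: each seed becomes the centroid of its cluster (empty clusters keep their seed).
            new_centroids = np.array([
                X[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
                for k in range(K)
            ])
            if np.allclose(new_centroids, centroids):
                break  # centroids unchanged -> converged
            centroids = new_centroids
        # Final assignment against the final centroids.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1), centroids

    labels, centers = kmeans(np.random.default_rng(1).random((20, 5)), K=2)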
K-Means Example (K = 2)
[Figure: 2-D animation of K-means: pick seeds; reassign clusters; compute centroids (×); reassign clusters; recompute centroids; reassign clusters; converged!]
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Convergence of K-Means
Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:
Gk = Σi (di − ck)²   (sum over all di in cluster k)
G = Σk Gk
Reassignment monotonically decreases G, since each vector is assigned to the closest centroid.
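For concreteness, G can be computed from the output of the `kmeans` sketch above (our names; squared Euclidean distance as in the formula):

    import numpy as np

    def total_G(X, assign, centroids):
        # G = sum over k of sum over d_i in cluster k of ||d_i - c_k||^2
        diffs = X - centroids[assign]  # each doc minus its assigned centroid
        return float((diffs ** 2).sum())

Since there are only finitely many partitions and G never increases, K-means must terminate, though possibly at a local optimum.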
Time Complexity
Computing distance between two docs is O(m)
where m is the dimensionality of the vectors.
Reassigning clusters: O(Kn) distance
computations, or O(Knm).
Computing centroids: Each doc gets added once
to some centroid: O(nm).
Assume these two steps are each done once for
I iterations: O(IKnm).
Seed Choice
Results can vary based on random seed selection.
Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
Select good seeds using a heuristic (e.g., a doc least similar to any existing mean), as sketched after this slide.
Try out multiple starting points.
Initialize with the results of another method.
[Example showing sensitivity to seeds: starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.]
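The “least similar to any existing mean” heuristic can be realized as furthest-first seeding, sketched below (our construction; k-means++ is a randomized refinement of the same idea):

    import numpy as np

    def furthest_first_seeds(X, K, first=0):
        """Pick K seed rows of X: start from row `first`, then repeatedly add the
        doc farthest from its nearest already-chosen seed."""
        seeds = [first]
        while len(seeds) < K:
            # Distance from every doc to its nearest chosen seed.
            d = np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=2).min(axis=1)
            seeds.append(int(d.argmax()))
        return X[seeds]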
How Many Clusters?
Number of clusters K is given
Partition n docs into predetermined number of
clusters
Finding the “right” number of clusters is part of
the problem
Given docs, partition into an “appropriate” number
of subsets.
E.g., for query results the ideal value of K is not known
up front, though the UI may impose limits.
K not specified in advance
Say, the results of a query.
Solve an optimization problem: penalize
having lots of clusters
application dependent, e.g., compressed
summary of search results list.
Tradeoff between having more clusters
(better focus within each cluster) and
having too many clusters
K not specified in advance
Given a clustering, define the Benefit
for a doc to be the cosine similarity to
its centroid
Define the Total Benefit to be the sum
of the individual doc Benefits.
Penalize lots of clusters
For each cluster, we have a Cost C.
Thus for a clustering with K clusters, the Total
Cost is KC.
Define the Value of a clustering as:
Total Benefit − Total Cost.
Find the clustering of highest value, over all
choices of K.
Total benefit increases with increasing K. But can
stop when it doesn’t increase by “much”. The Cost
term enforces this.
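A sketch of this model-selection loop, reusing the `kmeans` sketch from earlier; cosine similarity to the centroid as Benefit, and a hand-picked per-cluster cost C (both our assumptions):

    import numpy as np

    def total_benefit(X, assign, centroids):
        # Sum over docs of cosine similarity to their own cluster centroid
        # (assumes no zero vectors).
        c = centroids[assign]
        return float(np.sum((X * c).sum(axis=1) /
                            (np.linalg.norm(X, axis=1) * np.linalg.norm(c, axis=1))))

    def best_k(X, k_max=10, C=0.5):
        # Value(K) = Total Benefit - K*C; return the (K, Value) of highest Value.
        best = None
        for K in range(1, k_max + 1):
            assign, centroids = kmeans(X, K)  # from the earlier sketch
            value = total_benefit(X, assign, centroids) - K * C
            if best is None or value > best[1]:
                best = (K, value)
        return best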
K-means issues, variations, etc.
Recomputing the centroid after every
assignment (rather than after all points are
re-assigned) can improve speed of
convergence of K-means
Assumes clusters are spherical in vector
space
Sensitive to coordinate changes, weighting, etc.
Disjoint and exhaustive
Doesn’t have a notion of “outliers”