
Lecture 2 - Clustering Methods

The document discusses various clustering methods. It describes partitioning methods, such as k-means and k-medoids, which assign data points to clusters so as to minimize the distances between points and cluster centers or medoids. It also covers hierarchical methods that create cluster hierarchies, density-based methods built on connectivity and density, grid-based methods using multi-level grids, and model-based methods fitting clusters to hypothesized models.


Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary



Major Clustering Approaches (I)

- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue



Major Clustering Approaches (II)
- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE
- Model-based approach:
  - A model is hypothesized for each of the clusters, and the method finds the best fit of the data to that model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based approach:
  - Based on the analysis of frequent patterns
  - Typical methods: pCluster
- User-guided or constraint-based approach:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance between Clusters

- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min d(t_ip, t_jq)
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max d(t_ip, t_jq)
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg d(t_ip, t_jq)
- Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = d(C_i, C_j)
- Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = d(M_i, M_j)
  - A medoid is one chosen, centrally located object in the cluster
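As a concrete illustration, here is a minimal NumPy sketch of these inter-cluster distance measures; the function name and signature are my own, not from the source:

```python
import numpy as np

def cluster_distance(A, B, method="single"):
    """Distance between clusters A (n, d) and B (m, d) under the measures above."""
    # All pairwise Euclidean distances between points of A and points of B
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if method == "single":     # smallest pairwise distance
        return d.min()
    if method == "complete":   # largest pairwise distance
        return d.max()
    if method == "average":    # average pairwise distance
        return d.mean()
    if method == "centroid":   # distance between the two centroids
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(f"unknown method: {method}")
```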
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

- Centroid: the "middle" of a cluster

  $$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$

- Radius: square root of the average distance from any point of the cluster to its centroid

  $$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$

- Diameter: square root of the average mean squared distance between all pairs of points in the cluster

  $$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
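A small NumPy sketch computing the three statistics for a cluster stored as an (N, d) array; the function name is illustrative, and N > 1 is assumed for the diameter:

```python
import numpy as np

def centroid_radius_diameter(X):
    """Centroid, radius, and diameter of a cluster X with shape (N, d), N > 1."""
    N = len(X)
    cm = X.mean(axis=0)                         # centroid: per-dimension mean
    rm = np.sqrt(((X - cm) ** 2).sum() / N)     # radius
    # Double sum of squared distances over all ordered pairs (diagonal is zero)
    pair_sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum()
    dm = np.sqrt(pair_sq / (N * (N - 1)))       # diameter
    return cm, rm, dm
```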



Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary



Partitioning Algorithms: Basic Concept

- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

  $$E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$

- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
    - k-means (MacQueen'67): Each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster



The K-Means Clustering Method

- Given k, the k-means algorithm is implemented in four steps (a code sketch follows the list):
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when no new assignments are made
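A minimal Python/NumPy sketch of these four steps, under the usual seed-point initialization; names and defaults are illustrative, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the (n, d) array X into k clusters; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: an initial partition induced by k randomly chosen seed points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # Step 4: stop when nothing moves
            break
        labels = new_labels
        # Step 2: recompute each seed point as the centroid of its cluster
        for j in range(k):
            if np.any(labels == j):             # keep old centroid if cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

The O(tkn) running time quoted below corresponds to the t passes of the distance computation in this loop.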



The K-Means Clustering Method

- Example (K = 2): arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; repeat updating and reassigning until the assignments no longer change.

[Figure: three scatter plots showing the assign/update/reassign iterations of k-means with K = 2.]



Comments on the K-Means Method

- Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  - Comparison: PAM: O(k(n-k)^2) per iteration; CLARA: O(ks^2 + k(n-k))
- Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
- Weaknesses
  - Applicable only when a mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes



Variations of the K-Means Method

- A few variants of k-means differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes (Huang'98), sketched in code below
  - Replacing the means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update the modes of clusters
- A mixture of categorical and numerical data: the k-prototype method
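The two k-modes ingredients named above, simple-matching dissimilarity and frequency-based mode updates, can be sketched as follows (function names are illustrative, assuming categorical data stored in NumPy arrays):

```python
import numpy as np

def matching_dissimilarity(a, B):
    """Number of attributes on which object a differs from each row of B."""
    return (B != a).sum(axis=1)

def update_mode(cluster):
    """Frequency-based mode update: the most frequent category per attribute."""
    mode = []
    for column in cluster.T:
        values, counts = np.unique(column, return_counts=True)
        mode.append(values[counts.argmax()])
    return np.array(mode)
```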



What Is the Problem of the K-Means Method?

- The k-means algorithm is sensitive to outliers!
  - An object with an extremely large value may substantially distort the distribution of the data.
- K-medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster.



The K-Medoids Clustering Method

- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): Randomized sampling
- Focusing + spatial data structure (Ester et al., 1995)



A Typical K-Medoids Algorithm (PAM)

- K = 2. Arbitrarily choose k objects as the initial medoids, then assign each remaining object to the nearest medoid (total cost = 20). In a loop, randomly select a non-medoid object O_random, compute the total cost of swapping a medoid O with O_random (here total cost = 26), and perform the swap if it improves the quality; repeat until no change.

[Figure: scatter plots illustrating the initial medoid choice, the assignment step, and one candidate swap.]


PAM (Partitioning Around Medoids) (1987)

- PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
- Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. If TC_ih < 0, replace i with h; then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change
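A minimal sketch of this loop over a precomputed (n, n) distance matrix; the structure is illustrative, and a practical PAM computes TC_ih incrementally instead of re-evaluating the full cost after every candidate swap:

```python
import numpy as np

def pam(D, k, seed=0):
    """PAM over a distance matrix D; returns (medoid indices, labels)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))  # step 1: arbitrary medoids

    def total_cost(meds):
        # Each object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:                      # step 4: repeat until no change
        improved = False
        base = total_cost(medoids)
        for i in medoids:                # step 2: every (medoid i, non-medoid h) pair
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                if total_cost(candidate) < base:  # step 3: swap when TC_ih < 0
                    medoids, improved = candidate, True
                    break
            if improved:
                break
    labels = D[:, medoids].argmin(axis=1)  # assign to most similar representative
    return medoids, labels
```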
PAM Clustering: Total Swapping Cost TC_ih = Σ_j C_jih

For each non-selected object j, the contribution C_jih to the cost of swapping medoid i for non-medoid h depends on where j ends up after the swap (t denotes another current medoid):

- j moves from i's cluster to h's cluster: C_jih = d(j, h) - d(j, i)
- j stays in t's cluster: C_jih = 0
- j moves from i's cluster to t's cluster: C_jih = d(j, t) - d(j, i)
- j moves from t's cluster to h's cluster: C_jih = d(j, h) - d(j, t)

[Figure: four scatter plots, one per case, showing the positions of i, h, t, and j.]
What Is the Problem with PAM?

- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well to large data sets
  - O(k(n-k)^2) for each iteration, where n is # of data points and k is # of clusters
- Sampling-based method: CLARA (Clustering LARge Applications)



CLARA (Clustering Large Applications) (1990)

- CLARA (Kaufmann and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as S+
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output (see the sketch below)
- Strength: deals with larger data sets than PAM
- Weaknesses:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
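A minimal sketch of that sample-and-evaluate loop, reusing the pam() sketch above; the sample-count default is illustrative (Kaufman and Rousseeuw suggest samples of size 40 + 2k):

```python
import numpy as np

def clara(X, k, n_samples=5, seed=0):
    """Run PAM on several random samples of X and keep the best medoids."""
    rng = np.random.default_rng(seed)
    sample_size = min(40 + 2 * k, len(X))
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        sample = X[idx]
        D = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=2)
        meds, _ = pam(D, k)              # PAM on the sample only
        centers = sample[meds]
        # Judge the sample's medoids against the whole data set
        cost = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                              axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, centers
    return best_medoids, best_cost
```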
CLARANS ("Randomized" CLARA) (1994)

- CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94)
- CLARANS draws a sample of neighbors dynamically
- The clustering process can be viewed as searching a graph where every node is a potential solution, that is, a set of k medoids
- Once a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum
- It is more efficient and scalable than both PAM and CLARA
- Focusing techniques and spatial access structures may further improve its performance (Ester et al.'95); a code sketch follows
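A minimal sketch of this randomized graph search over a precomputed distance matrix; numlocal and maxneighbor follow the parameter names in Ng and Han's paper, while the rest is illustrative:

```python
import numpy as np

def clarans(D, k, numlocal=2, maxneighbor=20, seed=0):
    """Randomized medoid search; returns (best medoid set, its total cost)."""
    rng = np.random.default_rng(seed)
    n = len(D)

    def cost(meds):
        # Each object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    best, best_cost = None, np.inf
    for _ in range(numlocal):            # restart from several random nodes
        current = list(rng.choice(n, size=k, replace=False))
        tried = 0
        while tried < maxneighbor:
            # A neighbor node differs from the current one by a single medoid
            i, h = int(rng.integers(k)), int(rng.integers(n))
            if h in current:
                continue
            neighbor = current.copy()
            neighbor[i] = h
            if cost(neighbor) < cost(current):
                current, tried = neighbor, 0  # move to the better neighbor
            else:
                tried += 1
        if cost(current) < best_cost:    # current is a local optimum
            best, best_cost = current, cost(current)
    return best, best_cost
```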
