Understanding Cluster Analysis in Data Mining

The document provides an overview of cluster analysis, a method used in data mining to group similar data objects into clusters based on their characteristics. It discusses various clustering techniques, including k-means and k-medoids, and their applications in fields such as biology, marketing, and city planning. Additionally, it highlights the importance of clustering quality and the challenges associated with different methods.

Uploaded by

offadarsh

Data Mining

Clustering

PPT adapted from Data Mining: Concepts and Techniques by Han, Kamber & Pei

What is Cluster Analysis?
• Cluster: a collection of data objects that are
  • similar (or related) to one another within the same group
  • dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
  • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations, in contrast to supervised learning, which learns by examples)
• Typical applications
  • As a stand-alone tool to get insight into data distribution
  • As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: identification of areas of similar land use in an earth observation database
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults
• Climate: understanding the earth's climate, finding patterns in the atmosphere and ocean
• Economic science: market research

Clustering as a Preprocessing Tool (Utility)
• Summarization
  • Preprocessing for regression, PCA, classification, and association analysis
• Compression
  • Image processing: vector quantization
• Finding k-nearest neighbors
  • Localizing search to one or a small number of clusters
• Outlier detection
  • Outliers are often viewed as those "far away" from any cluster

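The "far away from any cluster" idea can be sketched in a few lines of Python. This is only an illustration: the centroids, the sample points, and the distance threshold of 2.0 are all made-up assumptions, and the slides do not prescribe any particular threshold rule.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Distance from `point` to the closest cluster centroid."""
    return min(math.dist(point, c) for c in centroids)

def flag_outliers(points, centroids, threshold):
    """Return the points lying farther than `threshold` from every centroid."""
    return [p for p in points
            if nearest_centroid_distance(p, centroids) > threshold]

# Hypothetical data: two cluster centroids and four observations.
centroids = [(1.0, 1.0), (8.0, 8.0)]
points = [(1.2, 0.9), (7.5, 8.4), (4.5, 4.5), (0.8, 1.1)]

# The threshold is an assumed tuning parameter, not a value from the slides.
outliers = flag_outliers(points, centroids, threshold=2.0)
print(outliers)  # [(4.5, 4.5)]
```

In practice the threshold would have to be chosen relative to the spread of each cluster; a fixed value is used here only to keep the sketch short.
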
Quality: What Is Good Clustering?
• A good clustering method will produce high-quality clusters
  • high intra-class similarity: cohesive within clusters
  • low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
  • the similarity measure used by the method,
  • its implementation, and
  • its ability to discover some or all of the hidden patterns

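One simple way to make cohesion and separation concrete is to compare average pairwise distances within clusters and across clusters. The sketch below assumes Euclidean distance as the (dis)similarity measure, so high intra-class similarity shows up as a small within-cluster distance; the function names and data are illustrative only.

```python
import math
from itertools import combinations

def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    dists = [math.dist(a, b)
             for cluster in clusters
             for a, b in combinations(cluster, 2)]
    return sum(dists) / len(dists)

def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    dists = [math.dist(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return sum(dists) / len(dists)

# Hypothetical clustering of six 2-D points into two groups.
clusters = [[(0, 0), (0, 1), (1, 0)], [(9, 9), (9, 10), (10, 9)]]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(intra < inter)  # True for a cohesive, well-separated clustering
```
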
Partitioning Algorithms: Basic Concept
• Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i)

  $$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$$

• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all partitions
  • Heuristic methods: k-means and k-medoids algorithms
    • k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
    • k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

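The criterion E can be evaluated directly for a given partition. The following is a minimal sketch that takes c_i to be the centroid (mean point) of each cluster and uses squared Euclidean distance; the helper names and the sample partition are assumptions made for illustration.

```python
def centroid(cluster):
    """Mean point of a cluster of equal-length numeric tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def sse(clusters):
    """E: sum over clusters of squared Euclidean distances to the centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        for p in cluster:
            total += sum((x - y) ** 2 for x, y in zip(p, c))
    return total

# Hypothetical partition into k = 2 clusters.
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(8.0, 8.0), (8.0, 10.0)]]
E = sse(clusters)
print(E)  # 4.0: every point lies at squared distance 1 from its centroid
```
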
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when the assignment does not change

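The four steps can be sketched in plain Python as follows. This is a teaching sketch under several assumptions not stated on the slide: Euclidean distance, a shuffled round-robin deal as the "arbitrary" initial partition, a `max_iter` safety cap, and re-seeding an emptied cluster with a random point.

```python
import math
import random

def mean_point(points):
    """Centroid (mean point) of a non-empty list of numeric tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: arbitrary partition into k nonempty subsets
    # (shuffle the indices, then deal them out round-robin).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k
    for _ in range(max_iter):
        # Step 2: centroids of the current partition.
        centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an emptied cluster is re-seeded with a random point.
            centroids.append(mean_point(members) if members else rng.choice(points))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        # Step 4: stop when the assignment does not change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Which cluster receives label 0 depends on the random initial partition, so only the grouping, not the label values, is meaningful.
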
An Example of K-Means Clustering

[Figure: k-means example with K=2. Panels: the initial data set; arbitrarily partition the objects into k groups; update the cluster centroids; reassign objects; loop if needed.]

• Partition objects into k nonempty subsets
• Repeat
  • Compute centroid (i.e., mean point) for each partition
  • Assign each object to the cluster of its nearest centroid
• Until no change

Comments on the K-Means Method
• Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  • Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k)), where s is the sample size
• Comment: often terminates at a local optimum.
• Weaknesses
  • Applicable only to objects in a continuous n-dimensional space
    • Use the k-modes method for categorical data
    • In comparison, k-medoids can be applied to a wide range of data
  • Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
  • Sensitive to noisy data and outliers
  • Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method
• Most variants of k-means differ in
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means
• Handling categorical data: k-modes
  • Replacing means of clusters with modes
  • Using new dissimilarity measures to deal with categorical objects
  • Using a frequency-based method to update modes of clusters
• A mixture of categorical and numerical data: k-prototype method

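The two ingredients that k-modes swaps in, a categorical dissimilarity measure and a frequency-based mode update, might look like the sketch below. It assumes simple matching (the count of mismatched attributes) as the dissimilarity, which is one common choice; the records are invented for illustration.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Most frequent value of each attribute (frequency-based mode update)."""
    return tuple(Counter(column).most_common(1)[0][0]
                 for column in zip(*records))

# Hypothetical cluster of three categorical records.
records = [("red", "small", "round"),
           ("red", "large", "round"),
           ("blue", "small", "round")]

mode = cluster_mode(records)
d = matching_dissimilarity(records[1], mode)
print(mode)  # ('red', 'small', 'round')
print(d)     # 1: the record differs from the mode only in size
```
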
What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers!
  • An object with an extremely large value may substantially distort the distribution of the data
• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

[Figure: two plots, each with both axes running from 0 to 10.]

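A small one-dimensional example makes the contrast visible. The numbers are invented, and the medoid is taken here to be the member with the smallest total absolute distance to the other members.

```python
def mean(values):
    return sum(values) / len(values)

def medoid(values):
    """Member with the smallest total absolute distance to all members."""
    return min(values, key=lambda v: sum(abs(v - u) for u in values))

cluster = [1, 2, 3, 4, 5]
with_outlier = cluster + [100]  # one extremely large value

print(mean(cluster), medoid(cluster))  # 3.0 3
# The mean is dragged toward the outlier; the medoid stays inside the cluster
# (3 and 4 tie on total distance, and min() keeps the first).
print(mean(with_outlier), medoid(with_outlier))
```
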
PAM: A Typical K-Medoids Algorithm

[Figure: PAM example with K=2, plotted on axes running from 0 to 10. Labels in the figure: Total Cost = 20; Total Cost = 26.]

• Arbitrarily choose k objects as initial medoids
• Assign each remaining object to the nearest medoid
• Randomly select a non-medoid object, O_random
• Compute the total cost of swapping
• Swap O and O_random if quality is improved
• Do loop until no change

The K-Medoid Clustering Method
• K-medoids clustering: find representative objects (medoids) in clusters
  • PAM (Partitioning Around Medoids, Kaufman & Rousseeuw 1987)
    • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
    • PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
• Efficiency improvement on PAM
  • CLARA (Kaufman & Rousseeuw, 1990): PAM on samples
  • CLARANS (Ng & Han, 1994): randomized re-sampling

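The swap idea behind PAM can be sketched as a simple greedy loop. This is not the exact published procedure: it assumes Euclidean distance, takes the first k points as the "arbitrary" initial medoids, and tries candidate swaps systematically rather than at random, accepting any swap that lowers the total cost.

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    # Assumption: the first k points serve as the arbitrary initial medoids.
    medoids = list(points[:k])
    best = total_cost(points, medoids)
    improved = True
    while improved:  # loop until no swap improves the clustering
        improved = False
        for i, candidate in product(range(k), points):
            if candidate in medoids:
                continue
            # Tentatively swap medoid i with the non-medoid candidate.
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:  # keep the swap only if quality improves
                medoids, best, improved = trial, cost, True
    return medoids, best

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
medoids, cost = pam(points, k=2)
print(sorted(medoids), cost)  # [(1, 1), (8, 8)] 4.0
```

Every pass evaluates on the order of k(n-k) candidate swaps, each costing a scan over all points, which hints at why the method scales poorly to large data sets.
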
Hierarchical Clustering
• Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) runs from Step 0 to Step 4, forming ab, de, cde, and finally abcde; divisive clustering (DIANA) runs the same steps in the opposite direction, from abcde at its Step 0 down to single objects at its Step 4.]

AGNES (Agglomerative Nesting)
• Introduced in Kaufman and Rousseeuw (1990)
• Implemented in statistical packages, e.g., Splus
• Uses the single-link method and the dissimilarity matrix
• Merges nodes that have the least dissimilarity
• Goes on in a non-descending fashion
• Eventually all nodes belong to the same cluster

[Figure: three scatter plots, each with both axes running from 0 to 10.]

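A naive single-link agglomerative loop in the spirit of AGNES is sketched below. It assumes Euclidean distance and recomputes cluster distances from scratch at each step, which is simple but far slower than what statistical packages do; the function names and data are illustrative.

```python
import math

def single_link(c1, c2):
    """Smallest distance between a point of c1 and a point of c2."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agnes(points):
    """Merge the two least-dissimilar clusters until one cluster remains.

    Returns the merge history as (distance, cluster_a, cluster_b) tuples.
    """
    clusters = [[p] for p in points]  # start with one cluster per object
    history = []
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]],
                                                     clusters[ij[1]]))
        history.append((single_link(clusters[i], clusters[j]),
                        clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return history

# Hypothetical one-dimensional data, written as 1-tuples.
points = [(0,), (1,), (5,), (7,), (20,)]
history = agnes(points)
heights = [d for d, _, _ in history]
print(heights)  # [1.0, 2.0, 4.0, 13.0], a non-descending sequence
```
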
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

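Cutting a dendrogram can be sketched as keeping only the merges at or below the chosen level and reading off the connected components. The merge list below is hypothetical: it reuses the objects a to e from the hierarchical-clustering figure, but the merge heights are invented for illustration.

```python
def cut_dendrogram(items, merges, level):
    """Clusters obtained by keeping only the merges at or below `level`.

    `merges` holds (height, item_a, item_b) tuples: the merge at `height`
    joined the cluster containing item_a with the one containing item_b.
    """
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the representative of x's component.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for height, a, b in merges:
        if height <= level:
            parent[find(a)] = find(b)

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())

items = ["a", "b", "c", "d", "e"]
# Assumed heights; only the merge structure mirrors the figure.
merges = [(1.0, "a", "b"), (2.0, "d", "e"), (3.0, "c", "d"), (4.0, "a", "c")]

print(cut_dendrogram(items, merges, level=2.5))  # [['a', 'b'], ['c'], ['d', 'e']]
```

Raising the level to 4.0 or above returns a single cluster, while a level below 1.0 leaves every object in its own cluster.
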
Distance between Clusters
In the formulas below, t_ip denotes an element of cluster K_i and t_jq an element of cluster K_j.
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \min_{p,q} dist(t_{ip}, t_{jq})$
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \max_{p,q} dist(t_{ip}, t_{jq})$
• Average: average distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \mathrm{avg}_{p,q}\, dist(t_{ip}, t_{jq})$
• Centroid: distance between the centroids of two clusters, i.e., $dist(K_i, K_j) = dist(C_i, C_j)$
• Medoid: distance between the medoids of two clusters, i.e., $dist(K_i, K_j) = dist(M_i, M_j)$
  • Medoid: a chosen, centrally located object in the cluster

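The five inter-cluster distances can be written side by side for comparison. The sketch assumes Euclidean distance between elements and picks the medoid as the member with the smallest total distance to the rest of its cluster; the two sample clusters are invented.

```python
import math
from itertools import product

def pair_distances(ki, kj):
    """All distances between an element of ki and an element of kj."""
    return [math.dist(p, q) for p, q in product(ki, kj)]

def single_link(ki, kj):
    return min(pair_distances(ki, kj))

def complete_link(ki, kj):
    return max(pair_distances(ki, kj))

def average_link(ki, kj):
    d = pair_distances(ki, kj)
    return sum(d) / len(d)

def centroid(k):
    return tuple(sum(coords) / len(k) for coords in zip(*k))

def medoid(k):
    """Member with the smallest total distance to the other members."""
    return min(k, key=lambda p: sum(math.dist(p, q) for q in k))

def centroid_distance(ki, kj):
    return math.dist(centroid(ki), centroid(kj))

def medoid_distance(ki, kj):
    return math.dist(medoid(ki), medoid(kj))

# Hypothetical one-dimensional clusters, written as 1-tuples.
ki = [(0.0,), (1.0,), (2.0,)]
kj = [(5.0,), (6.0,), (10.0,)]

print(single_link(ki, kj))        # 3.0  (2 to 5)
print(complete_link(ki, kj))      # 10.0 (0 to 10)
print(average_link(ki, kj))       # 6.0  (mean of the 9 pairwise distances)
print(centroid_distance(ki, kj))  # 6.0  (centroids at 1 and 7)
print(medoid_distance(ki, kj))    # 5.0  (medoids at 1 and 6)
```
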
