
Lecture 2 - Clustering Methods

The document discusses various clustering methods. It describes partitioning methods, such as k-means and k-medoids, which assign data points to clusters so as to minimize the distances between points and cluster centers or medoids. It also covers hierarchical methods that create cluster hierarchies, density-based methods built on connectivity and density, grid-based methods using multi-level grids, and model-based methods fitting clusters to hypothesized models.


Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary



Major Clustering Approaches (I)

- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue



Major Clustering Approaches (II)
- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE
- Model-based approach:
  - A model is hypothesized for each of the clusters, and the method finds the best fit of the data to that model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based approach:
  - Based on the analysis of frequent patterns
  - Typical methods: pCluster
- User-guided or constraint-based approach:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance between Clusters

- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min d(t_ip, t_jq)
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max d(t_ip, t_jq)
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg d(t_ip, t_jq)
- Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = d(C_i, C_j)
- Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = d(M_i, M_j)
  - A medoid is one chosen, centrally located object in the cluster
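As a concrete illustration, here is a minimal NumPy sketch of these inter-cluster distance measures; the function name and signature are my own, not from the source:

```python
import numpy as np

def cluster_distance(A, B, method="single"):
    """Distance between clusters A (n, d) and B (m, d) under the measures above."""
    # All pairwise Euclidean distances between points of A and points of B
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if method == "single":     # smallest pairwise distance
        return d.min()
    if method == "complete":   # largest pairwise distance
        return d.max()
    if method == "average":    # average pairwise distance
        return d.mean()
    if method == "centroid":   # distance between the two centroids
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(f"unknown method: {method}")
```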
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

- Centroid: the "middle" of a cluster

  $$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$

- Radius: square root of the average distance from any point of the cluster to its centroid

  $$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$

- Diameter: square root of the average mean squared distance between all pairs of points in the cluster

  $$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
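A small NumPy sketch computing the three statistics for a cluster stored as an (N, d) array; the function name is illustrative, and N > 1 is assumed for the diameter:

```python
import numpy as np

def centroid_radius_diameter(X):
    """Centroid, radius, and diameter of a cluster X with shape (N, d), N > 1."""
    N = len(X)
    cm = X.mean(axis=0)                         # centroid: per-dimension mean
    rm = np.sqrt(((X - cm) ** 2).sum() / N)     # radius
    # Double sum of squared distances over all ordered pairs (diagonal is zero)
    pair_sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum()
    dm = np.sqrt(pair_sq / (N * (N - 1)))       # diameter
    return cm, rm, dm
```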



Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary



Partitioning Algorithms: Basic Concept

- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

  $$E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$

- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
    - k-means (MacQueen'67): Each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster



The K-Means Clustering Method

- Given k, the k-means algorithm is implemented in four steps (a code sketch follows the list):
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when no new assignments are made
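A minimal Python/NumPy sketch of these four steps, under the usual seed-point initialization; names and defaults are illustrative, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the (n, d) array X into k clusters; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: an initial partition induced by k randomly chosen seed points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # Step 4: stop when nothing moves
            break
        labels = new_labels
        # Step 2: recompute each seed point as the centroid of its cluster
        for j in range(k):
            if np.any(labels == j):             # keep old centroid if cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

The O(tkn) running time quoted below corresponds to the t passes of the distance computation in this loop.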



The K-Means Clustering Method

- Example (K = 2): arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; repeat updating and reassigning until the assignments no longer change.

[Figure: three scatter plots showing the assign/update/reassign iterations of k-means with K = 2.]



Comments on the K-Means Method

- Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  - Comparison: PAM: O(k(n-k)^2) per iteration; CLARA: O(ks^2 + k(n-k))
- Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
- Weaknesses
  - Applicable only when a mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes



Variations of the K-Means Method

- A few variants of k-means differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes (Huang'98), sketched in code below
  - Replacing the means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update the modes of clusters
- A mixture of categorical and numerical data: the k-prototype method
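The two k-modes ingredients named above, simple-matching dissimilarity and frequency-based mode updates, can be sketched as follows (function names are illustrative, assuming categorical data stored in NumPy arrays):

```python
import numpy as np

def matching_dissimilarity(a, B):
    """Number of attributes on which object a differs from each row of B."""
    return (B != a).sum(axis=1)

def update_mode(cluster):
    """Frequency-based mode update: the most frequent category per attribute."""
    mode = []
    for column in cluster.T:
        values, counts = np.unique(column, return_counts=True)
        mode.append(values[counts.argmax()])
    return np.array(mode)
```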



What Is the Problem of the K-Means Method?

- The k-means algorithm is sensitive to outliers!
  - An object with an extremely large value may substantially distort the distribution of the data.
- K-medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster.



The K-Medoids Clustering Method

- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): Randomized sampling
- Focusing + spatial data structure (Ester et al., 1995)



A Typical K-Medoids Algorithm (PAM)

- K = 2. Arbitrarily choose k objects as the initial medoids, then assign each remaining object to the nearest medoid (total cost = 20). In a loop, randomly select a non-medoid object O_random, compute the total cost of swapping a medoid O with O_random (here total cost = 26), and perform the swap if it improves the quality; repeat until no change.

[Figure: scatter plots illustrating the initial medoid choice, the assignment step, and one candidate swap.]


PAM (Partitioning Around Medoids) (1987)

- PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
- Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. If TC_ih < 0, replace i with h; then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change
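A minimal sketch of this loop over a precomputed (n, n) distance matrix; the structure is illustrative, and a practical PAM computes TC_ih incrementally instead of re-evaluating the full cost after every candidate swap:

```python
import numpy as np

def pam(D, k, seed=0):
    """PAM over a distance matrix D; returns (medoid indices, labels)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))  # step 1: arbitrary medoids

    def total_cost(meds):
        # Each object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:                      # step 4: repeat until no change
        improved = False
        base = total_cost(medoids)
        for i in medoids:                # step 2: every (medoid i, non-medoid h) pair
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                if total_cost(candidate) < base:  # step 3: swap when TC_ih < 0
                    medoids, improved = candidate, True
                    break
            if improved:
                break
    labels = D[:, medoids].argmin(axis=1)  # assign to most similar representative
    return medoids, labels
```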
PAM Clustering: Total Swapping Cost TC_ih = Σ_j C_jih

For each non-selected object j, the contribution C_jih to the cost of swapping medoid i for non-medoid h depends on where j ends up after the swap (t denotes another current medoid):

- j moves from i's cluster to h's cluster: C_jih = d(j, h) - d(j, i)
- j stays in t's cluster: C_jih = 0
- j moves from i's cluster to t's cluster: C_jih = d(j, t) - d(j, i)
- j moves from t's cluster to h's cluster: C_jih = d(j, h) - d(j, t)

[Figure: four scatter plots, one per case, showing the positions of i, h, t, and j.]
What Is the Problem with PAM?

- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well to large data sets
  - O(k(n-k)^2) for each iteration, where n is # of data points and k is # of clusters
- Sampling-based method: CLARA (Clustering LARge Applications)



CLARA (Clustering Large Applications) (1990)

- CLARA (Kaufmann and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as S+
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output (see the sketch below)
- Strength: deals with larger data sets than PAM
- Weaknesses:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
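A minimal sketch of that sample-and-evaluate loop, reusing the pam() sketch above; the sample-count default is illustrative (Kaufman and Rousseeuw suggest samples of size 40 + 2k):

```python
import numpy as np

def clara(X, k, n_samples=5, seed=0):
    """Run PAM on several random samples of X and keep the best medoids."""
    rng = np.random.default_rng(seed)
    sample_size = min(40 + 2 * k, len(X))
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        sample = X[idx]
        D = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=2)
        meds, _ = pam(D, k)              # PAM on the sample only
        centers = sample[meds]
        # Judge the sample's medoids against the whole data set
        cost = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                              axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, centers
    return best_medoids, best_cost
```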
CLARANS ("Randomized" CLARA) (1994)

- CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94)
- CLARANS draws a sample of neighbors dynamically
- The clustering process can be viewed as searching a graph where every node is a potential solution, that is, a set of k medoids
- Once a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum
- It is more efficient and scalable than both PAM and CLARA
- Focusing techniques and spatial access structures may further improve its performance (Ester et al.'95); a code sketch follows
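A minimal sketch of this randomized graph search over a precomputed distance matrix; numlocal and maxneighbor follow the parameter names in Ng and Han's paper, while the rest is illustrative:

```python
import numpy as np

def clarans(D, k, numlocal=2, maxneighbor=20, seed=0):
    """Randomized medoid search; returns (best medoid set, its total cost)."""
    rng = np.random.default_rng(seed)
    n = len(D)

    def cost(meds):
        # Each object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    best, best_cost = None, np.inf
    for _ in range(numlocal):            # restart from several random nodes
        current = list(rng.choice(n, size=k, replace=False))
        tried = 0
        while tried < maxneighbor:
            # A neighbor node differs from the current one by a single medoid
            i, h = int(rng.integers(k)), int(rng.integers(n))
            if h in current:
                continue
            neighbor = current.copy()
            neighbor[i] = h
            if cost(neighbor) < cost(current):
                current, tried = neighbor, 0  # move to the better neighbor
            else:
                tried += 1
        if cost(current) < best_cost:    # current is a local optimum
            best, best_cost = current, cost(current)
    return best, best_cost
```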
