
Cluster Analysis

1
Cluster Analysis
• Cluster analysis or simply clustering is the process of partitioning a set of
data objects (or observations) into subsets.
• Each subset is a cluster, such that objects in a cluster are similar to one
another, yet dissimilar to objects in other clusters.
• Dissimilarities or similarities are most often assessed using distance
measures.
• Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security.
• Clustering is a form of learning by observation rather than learning by
example (as is the case with classification)

2
Cluster Analysis
• Because a cluster is a collection of data objects that are similar to one another
within the cluster and dissimilar to objects in other clusters, a cluster of data
objects can be treated as an implicit class. In this sense, clustering is
sometimes called automatic classification.
• Clustering is also called data segmentation in some applications because
clustering partitions large data sets into groups according to their similarity.
• Clustering can also be used for outlier detection, where outliers (values that
are “far away” from any cluster) may be more interesting than common cases.
• Clustering is known as unsupervised learning because the class label
information is not present. For this reason, clustering is a form of learning by
observation, rather than learning by examples

3
Requirements for Cluster Analysis (I)
Following are typical requirements of clustering in data mining
• Scalability: should be able to handle large datasets in order to avoid biased results.
• Ability to deal with different types of attributes
―Numeric, binary, nominal (categorical), and ordinal data, or mixtures of these data types
• Discovery of clusters with arbitrary shape
―clusters based on Euclidean or Manhattan distance measures tend to be spherical
in shape; clusters of other, more complex shapes (e.g., non-convex) should also be discoverable.
• Minimal requirements for domain knowledge to determine input parameters
―input parameters such as the desired number of clusters should not burden the user
• Ability to deal with noisy data
―Clustering algorithms should not be sensitive to noise

4
Requirements for Cluster Analysis (II)
• Incremental clustering and insensitivity to input order
―Recomputing the whole clustering whenever new data objects arrive, or whenever
objects are presented in a particular order, should be avoided
• Capability of clustering high-dimensionality data
―This is challenging: when clustering documents, for example, each keyword can
be regarded as a dimension, and such data are often highly skewed and
sparse.
• Constraint-based clustering
• Interpretability and usability

5
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters
―high intra-class similarity: cohesive within clusters
―low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
―the similarity measure used by the method
―its implementation, and
―Its ability to discover some or all of the hidden patterns

6
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
― Similarity is expressed in terms of a distance function, typically metric: d(i, j)
― The definitions of distance functions are usually rather different for interval-
scaled, Boolean, categorical, ordinal, ratio-scaled, and vector variables
― Weights should be associated with different variables based on applications
and data semantics
• Quality of clustering:
― There is usually a separate “quality” function that measures the “goodness” of
a cluster.
― It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
7
Comparison criteria for various Clustering
approaches
• Partitioning criteria
―Single level (all clusters are considered at the same level conceptually) vs. hierarchical (clusters
are at different semantic levels; often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
―Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
• Similarity measure
―Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or
contiguity)
• Clustering space
―Full space (often with low dimensional data) vs. subspace (often in high-dimensional data)
8
Major Clustering Approaches (I)
• Partitioning approach:
―Given a set of n objects, a partitioning method constructs a one-level partitioning, dividing the
objects into k partitions where each partition represents a cluster and k ≤ n.
―Typically adopt exclusive cluster separation.
―Most methods are distance based
―May use mean or medoid (etc.) to represent cluster center.
―Effective for small to medium size data sets.
―Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
―Create a hierarchical decomposition of the set of data (or objects).
―Two types: agglomerative (bottom-up) and divisive (top-down).
―Cannot correct erroneous merges or splits
―Typical methods: Diana, Agnes, BIRCH, CAMELEON
9
Major Clustering Approaches (II)
• Density-based approach:
―Based on connectivity and density functions
―Each point must have minimum number of points within its neighborhood
―Can find arbitrary shaped clusters.
―May filter out outliers
―Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
―Based on a multiple-level granularity structure
―Fast processing
―Typical methods: STING, WaveCluster, CLIQUE

10
Overview of various clustering approaches

11
Partitioning approach

12
Partitioning Algorithms: Basic Concept
• These methods organize the objects into several exclusive groups called clusters.
• The number of clusters, k, is given as background knowledge.
• Partitioning method: partitioning a database D of n objects into a set of k clusters, such that
the sum of squared distances, $E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, c_i)^2$, is minimized (where ci is the centroid or medoid of cluster Ci)
• The clusters are formed to optimize an objective partitioning criterion, such as dissimilarity
function based on distance, so that objects in one cluster are "similar" to each other and
“dissimilar” to objects in other clusters
―k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the
cluster
―k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster
is represented by one of the objects in the cluster
13
K-Means
• Suppose a data set D contains n objects.
• k-means method distributes objects in D into k clusters C1, C2, …,Ck
such that Ci ⊂ D and Ci ∩ Cj = ∅ (for i ≠ j)
• It uses centroid of a cluster to represent the cluster
―This can be the mean (for nominal data, the mode can be used)
• The distance between an object p ∈ Ci and the centroid ci is measured by the
Euclidean distance between p and ci and is denoted by dist(p, ci)

14
The K-Means: Centroid based Clustering
Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partitioning
(the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2, stop when the assignment does not change
Quality of clustering is measured by the within-cluster variation
$E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, c_i)^2$
where Ci is a cluster, ci is its centroid, and p is a data object belonging to Ci
15
16
An Example of K-Means Clustering
K = 2
[Figure: the initial data set is arbitrarily partitioned into k groups; the cluster centroids are updated; objects are reassigned; loop if needed.]
 Partition objects into k nonempty subsets
 Repeat
 Compute the centroid (i.e., mean point) of each partition
 Assign each object to the cluster of its nearest centroid
 Until no change
17
Example 1
For the data set 1, 2, 3, 8, 9, 10, and 25. Classify them using k-means clustering algorithm
where k=2
Let’s suppose the initial mean of C1 is 1 and that of C2 is 2
Initial clusters
C1={1}, C2={2,3,8,9,10,25}
Recompute the means of both clusters
c1=1, c2=9.5
Next clusters
C1={1,2,3}, C2={8,9,10,25}
Recompute the means of both clusters
c1=2, c2=13
Next clusters
C1={1,2,3}, C2={8,9,10,25}
As no change is observed, these are the final clusters (see the sketch below)
18
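A minimal sketch of the k-means loop applied to this example; the data set and the initial means 1 and 2 come from the slide, while the function name and structure are my own and not part of the original material.

```python
# Minimal 1-D k-means, following the steps above (illustrative sketch only).
def kmeans_1d(points, means, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: attach each point to the closest current mean.
        clusters = [[] for _ in means]
        for p in points:
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        # Update step: recompute each mean from its cluster members.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:          # no change -> converged
            return clusters, means
        means = new_means
    return clusters, means

clusters, means = kmeans_1d([1, 2, 3, 8, 9, 10, 25], means=[1, 2])
print(clusters, means)   # expected: [[1, 2, 3], [8, 9, 10, 25]] with means [2.0, 13.0]
```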
Example 2
• Given the points P1(2,10), P2(2,5), P3(8,4), M1(5,8), M2(7,5), M3(6,4),
N1(1,2), N2(4,9). Suppose k=3, the distance function is Euclidean
distance, and the initial centroids of the three clusters are P1, M1 and N1
• (a) Show the three cluster centers after the first round of execution.
• (b) The final three clusters.
Solution:
(a) C1={P1}, C2={M1, P3, M2, M3, N2}, C3={N1, P2}
c1=(2,10), c2=(6,6), c3=(1.5,3.5)
(b) Reassigning with these centroids moves N2(4,9) into C1 (it is now closer to c1 than to c2),
so further iterations are needed; the process converges to C1={P1, M1, N2}, C2={P3, M2, M3},
C3={P2, N1} with centroids approximately (3.67, 9), (7, 4.33), (1.5, 3.5).
19
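Example 2 can be checked numerically with scikit-learn's KMeans, fixing the initial centroids to P1, M1 and N1. This is only a hedged verification sketch; the array layout and library call are assumptions, not part of the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

# Points in the order P1, P2, P3, M1, M2, M3, N1, N2 (taken from the slide).
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
init = np.array([[2, 10], [5, 8], [1, 2]])          # initial centroids P1, M1, N1

km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print(km.labels_)            # cluster index of each point after convergence
print(km.cluster_centers_)   # final centroids, approximately (3.67, 9), (7, 4.33), (1.5, 3.5)
```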
Weaknesses of K-Means Method
• Often terminates at a local optimum.
• Results may depend on the initial random selection of cluster centers.
• Applicable only to objects in a continuous n-dimensional space (the k-modes
method can be used for categorical data)
• Need to specify k, the number of clusters, in advance (there are ways
to automatically determine the best k)
• Sensitive to noisy data and outliers
• Not suitable to discover clusters with non-convex shapes

20
Sensitive to noisy data and outliers

• For the data set 1, 2, 3, 8, 9, 10, and 25. Classify them using k-means
clustering algorithm where k=2
• Solution: C1={1,2,3}, C2={8,9,10,25} c1=2, c2=13
• Calculate E = (1−2)² + (2−2)² + (3−2)² + (8−13)² + (9−13)² + (10−13)² + (25−13)²
= 196
• Consider another partitioning C1={1,2,3,8} and C2={9,10,25}, with c1=3.5 and
c2=14.67
• Now recompute E = 189.67
• The reason is the outlier 25: it drags the centroid of the second cluster, so placing 8 with
{8, 9, 10, 25} yields a higher error than the seemingly less natural partitioning (see the sketch below).
21
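The two error values can be reproduced with a few lines of Python (a small sketch; the partitionings are the ones given above):

```python
def sse(clusters):
    """Within-cluster variation E: sum of squared distances of each point to its cluster mean."""
    total = 0.0
    for c in clusters:
        centroid = sum(c) / len(c)
        total += sum((p - centroid) ** 2 for p in c)
    return total

print(sse([[1, 2, 3], [8, 9, 10, 25]]))      # 196.0
print(sse([[1, 2, 3, 8], [9, 10, 25]]))      # about 189.67
```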
Variations of the K-Means Method
• Most variants of k-means differ in
―Selection of the initial k means
―Dissimilarity calculations
―Strategies to calculate cluster means
• Handling categorical data: k-modes
―Replacing means of clusters with modes
―Using new dissimilarity measures to deal with categorical objects
―Using a frequency-based method to update modes of clusters
―A mixture of categorical and numerical data: k-prototype method

22
What Is the Problem of the K-Means
Method?
• The k-means algorithm is sensitive to outliers !
―Since an object with an extremely large value may substantially distort the
distribution of the data
• K-Medoids: Instead of taking the mean value of the object in a cluster as a
reference point, medoids can be used, which is the most centrally located object
in a cluster

[Figure: two 10 × 10 scatter plots contrasting a mean-based cluster representative, which is pulled toward an outlier, with a medoid-based representative.]

23
24
PAM(Partitioning Around Medoids): A Typical K-Medoids
Algorithm
K = 2, Total Cost = 20
1. Arbitrarily choose k objects as the initial medoids
2. Assign each remaining object to the nearest medoid
3. Randomly select a non-medoid object, Orandom
4. Compute the total cost of swapping a medoid with Orandom (in the illustration, Total Cost = 26)
5. If the quality is improved, swap the medoid and Orandom
6. Do loop until no change
[Figure: 10 × 10 scatter plots illustrating the initial medoid selection, the assignment of the remaining objects, and the swap evaluation.]
25
Minimizing dissimilarity
• The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its
corresponding representative object. That is, an absolute-error
criterion is used, defined as
$E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, o_i)$
where E is the sum of the absolute error for all objects p in the data set,
and Oi is the representative object of Ci . This is the basis for the k-
medoids method, which groups n objects into k clusters by minimizing
the absolute error

26
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
―PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
• PAM works effectively for small data sets, but does not scale well for large
data sets (due to the computational complexity)
• Efficiency improvement on PAM
―CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
―CLARANS (Ng & Han, 1994): Randomized re-sampling
27
Example
• Let us consider the following data objects with k=2 and distance
metric as Manhattan distance
Manhattan distance between two points (x1, y1) and (x2, y2) is |x1 − x2| + |y1 − y2|

28
• Consider initial medoid of C1 as (3,4) and C2 as (7,4)
No Data Objects Dissimilarity with C1 Dissimilarity with C2
1 (7,6) 6 2
2 (2,6) 3 7
3 (3,8) 4 8
4 (8,5) 6 2
5 (7,4) 4 0
6 (4,7) 4 6
7 (6,2) 5 3
8 (7,3) 5 1
9 (6,4) 3 1
10 (3,4) 0 4
29
• So C1={2,3,6,10} and C2={1,4,5,7,8,9}
• Cost =(3+4+4+0)+(2+2+0+3+1+1)=20
• Now let us choose Orandom = (7,3) as the new medoid for C2. Repeating
the above procedure, you will see that C1 and C2 come
out to be the same, but the change in cost is S = 22 − 20 = 2 > 0.
• This means that replacing the cluster medoid would incur more cost than
the previous configuration, so the swap is not performed (see the sketch below).

30
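The swap evaluation above can be reproduced with a short sketch; the data points, the two medoid configurations, and the Manhattan distance all come from the slides, while the helper names are mine:

```python
# Total cost of a k-medoids configuration: sum of Manhattan distances
# from every object to its nearest medoid.
points = [(7, 6), (2, 6), (3, 8), (8, 5), (7, 4), (4, 7), (6, 2), (7, 3), (6, 4), (3, 4)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(medoids):
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

print(total_cost([(3, 4), (7, 4)]))   # 20 -> current medoids
print(total_cost([(3, 4), (7, 3)]))   # 22 -> swapping (7,4) with Orandom = (7,3) is rejected
```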
CLARA (Clustering LARge Applications)
• Instead of taking the whole data set into consideration, CLARA uses a random
sample of the data set.
• The PAM algorithm is then applied to compute the best medoids from the
sample.
• Ideally, the sample should closely represent the original data set. In many cases, a
large sample works well if it is created so that each object has equal probability of
being selected into the sample.
• CLARA builds clusterings from multiple random samples and returns the best
clustering as the output.
• The complexity of computing the medoids on a random sample is O(ks² + k(n − k)), where s is the
size of the sample, k is the number of clusters, and n is the total number of
objects. CLARA can deal with larger data sets than PAM.
31
Hierarchical Clustering
• A hierarchical clustering method works by grouping data objects into
a hierarchy or “tree” of clusters
• Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
• Agglomerative clustering vs divisive clustering depending on whether
hierarchical decomposition is formed in a bottom-up (merging) or
top-down (splitting) fashion.

32
Agglomerative Hierarchical Clustering
• It uses a bottom-up strategy.
• It starts by letting each object form its own cluster.
• It then iteratively merges clusters into larger clusters until all the objects
are in a single cluster or certain termination condition is satisfied.
• The single cluster becomes the hierarchy’s root.
• Two clusters that are closest to each other according to some similarity
measure are combined to form one cluster.
• Because two clusters are merged per iteration, where each cluster contains
at least one object, an agglomerative method requires at most n iterations.

33
Divisive Hierarchical Clustering
• It employs top-down strategy.
• It starts by placing all objects in one cluster, which is the root of the
hierarchy.
• It then recursively partitions the root cluster into several smaller
subclusters.
• The partitioning process continues until each cluster at the lowest
level is coherent enough: it either contains a single data object, or the
objects within the cluster are sufficiently similar.

34
Example
Step 0 → Step 4: agglomerative (AGNES, AGglomerative NESting)
Step 4 → Step 0: divisive (DIANA, DIvisive ANAlysis)
[Figure: objects a, b, c, d, e. AGNES merges a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; DIANA splits in the reverse order.]
35
Distance/ Linkage Measures in Hierarchical
Clustering
• Given objects p and p’, |p-p’| is the distance between p and p’, mi is the mean of
cluster Ci, and ni is the number of objects in cluster Ci
• Minimum Distance
$dist_{min}(C_i, C_j) = \min_{p \in C_i,\; p' \in C_j} |p - p'|$
• Maximum Distance
$dist_{max}(C_i, C_j) = \max_{p \in C_i,\; p' \in C_j} |p - p'|$
• Mean Distance
$dist_{mean}(C_i, C_j) = |m_i - m_j|$
• Average Distance
$dist_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
36
• When a method uses minimum distance, it is called a nearest-
neighbour clustering algorithm or single-linkage method. In these
methods, merging two clusters Ci and Cj corresponds to adding an
edge between the nearest pair of data points in Ci and Cj
• When a method uses maximum distance, it is called a farthest-
neighbour clustering or complete-linkage method. The distance
between two clusters is determined by the most distant data points in
the two clusters.

37
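As a concrete illustration of single versus complete linkage, the sketch below uses SciPy's hierarchical clustering routines on a small made-up point set (the points are placeholders, not the slide's example data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Made-up 2-D points; pdist returns the condensed pairwise Euclidean distance matrix.
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0], [8.8, 1.3]])
D = pdist(X)

# Single linkage (nearest neighbour) vs. complete linkage (farthest neighbour).
Z_single = linkage(D, method='single')
Z_complete = linkage(D, method='complete')

# Cut each dendrogram into 3 flat clusters.
print(fcluster(Z_single, t=3, criterion='maxclust'))
print(fcluster(Z_complete, t=3, criterion='maxclust'))
```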
AGNES (Agglomerative Nesting)

• Introduced in Kaufmann and Rousseeuw (1990)


• Use the single-link method and the dissimilarity matrix
―Each cluster is represented by all the objects in the cluster, and the similarity between two
clusters is measured by the similarity of the closest pair of data points belonging to different
clusters.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster

38
DIANA (DIvisive ANAlysis)

• Introduced in Kaufmann and Rousseeuw (1990)


• Inverse order of AGNES
• Eventually each node forms a cluster on its own
• All the objects are used to form one initial cluster.
• The cluster is split according to some principle such as the maximum Euclidean
distance between the closest neighboring objects in the cluster.
• The cluster-splitting process repeats until, eventually, each new cluster contains
only a single object.

39
Dendrogram: Shows How Clusters are
Merged
―Decompose data objects into several levels of nested partitioning (a tree of clusters), called a
dendrogram
―A clustering of the data objects is obtained by cutting the dendrogram at the desired level,
then each connected component forms a cluster

40
Example: Agglomerative Clustering with
single linkage and Euclidean Distance

41
Plot graph

42
Create Distance Matrix

43
Distance Matrix
• The minimum distance is between P3 and P6, so they will be merged to
form one cluster.

44
Update the distance matrix

45
Updated Distance Matrix when P3 and P6
form cluster

46
Updated Distance Matrix when P2 and P5 are
merged in one cluster

47
Updated Distance matrix when (P3,P6) and
(P2,P5) are merged into one cluster

48

Distance between Clusters


• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)

• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)

• Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)

• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)

• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
―Medoid: a chosen, centrally located object in the cluster

49
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
• Centroid: the “middle” of a cluster
$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
• Radius: square root of the average distance from any point of the cluster
to its centroid
$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
• Diameter: square root of the average mean squared distance between all
pairs of points in the cluster
$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$

50
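These three quantities can be computed directly; the following sketch evaluates them for a small made-up cluster with NumPy, following the formulas above:

```python
import numpy as np

def cluster_stats(points):
    """Centroid, radius, and diameter of a cluster of numeric points, per the formulas above."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    centroid = pts.mean(axis=0)
    radius = np.sqrt(((pts - centroid) ** 2).sum(axis=1).mean())
    # Diameter: average squared distance over all ordered pairs of distinct points.
    diffs = pts[:, None, :] - pts[None, :, :]
    diameter = np.sqrt((diffs ** 2).sum() / (n * (n - 1)))
    return centroid, radius, diameter

print(cluster_stats([(1, 2), (2, 2), (3, 4)]))
```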
Extensions to Hierarchical Clustering
• Major weakness of agglomerative clustering methods
―Can never undo what was done previously
―Do not scale well: time complexity of at least O(n²), where n is the
total number of objects
• Integration of hierarchical & distance-based clustering
―BIRCH (1996): uses CF-tree and incrementally adjusts the quality
of sub-clusters
―CHAMELEON (1999): hierarchical clustering using dynamic
modeling
51
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD’96
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
―Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
―Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the
CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the quality
with a few additional scans
• Weakness: handles only numeric data, and sensitive to the order of the data
record
52
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• BIRCH uses the notions of clustering feature to summarize a cluster,
and clustering feature tree (CF-tree) to represent a cluster hierarchy.
• These structures help the clustering method achieve good speed and
scalability in large or even streaming databases, and also make it
effective for incremental and dynamic clustering of incoming objects.

53
BIRCH- clustering feature (CF)
• Consider a cluster of n d-dimensional data objects or points.
• The clustering feature (CF) of the cluster is a 3-D vector summarizing
information about clusters of objects. It is defined as
CF = {n, LS, SS}
• where LS is the linear sum of the n points (i.e., $LS = \sum_{i=1}^{n} x_i$) and SS is the square
sum of the data points (i.e., $SS = \sum_{i=1}^{n} x_i^2$)
• A clustering feature is essentially a summary of the statistics for the
given cluster. Using a clustering feature, we can easily derive many
useful statistics of a cluster

54
BIRCH
• the cluster’s centroid, x0, radius, R, and diameter, D, can be derived from the CF as
$x_0 = \frac{LS}{n}, \quad R = \sqrt{\frac{SS}{n} - \left(\frac{LS}{n}\right)^2}, \quad D = \sqrt{\frac{2n\,SS - 2\,LS^2}{n(n-1)}}$
55
BIRCH- Summarizing a cluster using the
clustering feature
• Summarizing a cluster using the clustering feature can avoid storing
the detailed information about individual objects or points.
• Instead, we only need a constant size of space to store the clustering
feature. This is the key to BIRCH efficiency in space.
• Moreover, clustering features are additive.
―That is, for two disjoint clusters, C1 and C2, with the clustering features
CF1={n1,LS1,SS1} and CF2={n2,LS2,SS2}, respectively, the clustering feature for
the cluster that formed by merging C1 and C2 is simply
CF1+CF2 ={n1+n2,LS1+LS2,SS1+SS2}.

56
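A small sketch of how a clustering feature is built and why its additivity matters (1-D points for brevity; the tuple layout follows the CF = {n, LS, SS} definition above, and the derived statistics use the centroid, radius and diameter expressions given earlier):

```python
def cf(points):
    """Clustering feature of a set of 1-D points: (n, linear sum LS, square sum SS)."""
    return (len(points), sum(points), sum(p * p for p in points))

def cf_merge(cf1, cf2):
    """CFs are additive: the CF of the union of two disjoint clusters is the component-wise sum."""
    return tuple(a + b for a, b in zip(cf1, cf2))

c1, c2 = [1.0, 2.0, 3.0], [6.0, 8.0]
assert cf_merge(cf(c1), cf(c2)) == cf(c1 + c2)

# Statistics derived from the CF alone, without revisiting the raw points:
n, ls, ss = cf(c1 + c2)
centroid = ls / n
radius = (ss / n - (ls / n) ** 2) ** 0.5
diameter = ((2 * n * ss - 2 * ls ** 2) / (n * (n - 1))) ** 0.5
print(centroid, radius, diameter)
```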
BIRCH
• A CF-tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering.
• The nonleaf nodes store sums of the CFs of their children, and thus
summarize clustering information about their children.
• A CF-tree has two parameters: branching factor, B, and threshold, T.
―The branching factor specifies the maximum number of children per nonleaf
node.
―The threshold parameter specifies the maximum diameter of subclusters
stored at the leaf nodes of the tree.
• These two parameters implicitly control the resulting tree’s size.
57
The CF Tree Structure
B = 7, L = 6
[Figure: a CF-tree. The root holds entries CF1, CF2, CF3, ..., CF6, each with a child pointer; a non-leaf node holds entries CF11, CF12, CF13, ..., CF15 with child pointers; leaf nodes hold entries such as CF111 ... CF116 and CF121 ... CF124 and are chained by prev/next pointers.]
58
The Birch Algorithm


• Cluster Diameter
$D = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_j)^2}{n(n-1)}} = \sqrt{\frac{2n\,SS - 2\,LS^2}{n(n-1)}}$


• For each point in the input
―Find closest leaf entry
―Add point to leaf entry and update CF
―If entry diameter > max_diameter, then split leaf, and possibly parents
• Algorithm is O(n)
• Concerns
―Sensitive to insertion order of data points
―Since we fix the size of leaf nodes, so clusters may not be so natural
―Clusters tend to be spherical given the radius and diameter measures
59
BIRCH- Effectiveness
• The time complexity of the algorithm is O(n) where n is the number of
objects to be clustered.
• Experiments have shown the linear scalability of the algorithm with
respect to the number of objects, and good quality of clustering of the
data.
• However, since each node in a CF-tree can hold only a limited number of
entries due to its size, a CF-tree node does not always correspond to
what a user may consider a natural cluster.
• Moreover, if the clusters are not spherical in shape, BIRCH does not
perform well because it uses the notion of radius or diameter to control
the boundary of a cluster.
60
CHAMELEON: Hierarchical Clustering Using Dynamic
Modeling (1999)
• CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
• Measures the similarity based on a dynamic model
― Two clusters are merged only if the interconnectivity and closeness (proximity)
between two clusters are high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
• Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into a large number of
relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine
clusters by repeatedly combining these sub-clusters
61
CHAMELEON: Hierarchical Clustering Using Dynamic
Modeling (1999)
• Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph, where
each vertex of the graph represents a data object, and there exists an edge between two
vertices (objects) if one object is among the k-most similar objects to the other.
• The edges are weighted to reflect the similarity between objects.
• Chameleon uses a graph partitioning algorithm to partition the k-nearest-neighbor graph
into a large number of relatively small subclusters such that it minimizes the edge cut.
• That is, a cluster C is partitioned into subclusters Ci and Cj so as to minimize the weight of
the edges that would be cut should C be bisected into Ci and Cj .
• It assesses the absolute interconnectivity between clusters Ci and Cj .

62
CHAMELEON

• Chameleon then uses an agglomerative hierarchical clustering algorithm that


iteratively merges subclusters based on their similarity.
• To determine the pairs of most similar subclusters, it takes into account both
the interconnectivity and the closeness of the clusters.
• Specifically, Chameleon determines the similarity between each pair of clusters
Ci and Cj according to their relative interconnectivity, RI(Ci ,Cj), and their
relative closeness,RC(Ci ,Cj).

63
CHAMELEON
• The relative interconnectivity, RI(Ci ,Cj), between two clusters, Ci and Cj , is defined
as the absolute interconnectivity between Ci and Cj , normalized with respect to the
internal interconnectivity of the two clusters, Ci and Cj. That is,
$RI(C_i, C_j) = \frac{\left|EC_{\{C_i, C_j\}}\right|}{\frac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}$
• where EC{Ci ,Cj} is the edge cut as previously defined for a cluster containing both Ci
and Cj .
• Similarly, ECCi (or ECCj ) is the minimum sum of the cut edges that partition Ci (or Cj)
into two roughly equal parts.
64
CHAMELEON
• The relative closeness, RC (Ci ,Cj), between a pair of clusters, Ci and
Cj , is the absolute closeness between Ci and Cj , normalized with
respect to the internal closeness of the two clusters, Ci and Cj . It is
defined as
$RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}$
• where SEC{Ci ,Cj} is the average weight of the edges that connect
vertices in Ci to vertices in Cj , and SECCi (or SECCj ) is the average
weight of the edges that belong to the mincut bisector of cluster Ci
(or Cj ).
65
CHAMELEON
• Chameleon has been shown to have greater power at discovering
arbitrarily shaped clusters of high quality than several well-known
algorithms such as BIRCH and the density-based DBSCAN.
• However, the processing cost for high-dimensional data may require
O(n²) time for n objects in the worst case.

66
Overall Framework of CHAMELEON
[Figure: Data Set → construct a sparse k-NN graph (p and q are connected if q is among the top k closest
neighbors of p) → partition the graph into many small subclusters → merge partitions → final clusters.
Relative interconnectivity: interconnectivity of c1 and c2 over their internal interconnectivity.
Relative closeness: closeness of c1 and c2 over their internal closeness.]
67
CHAMELEON (Clustering Complex Objects)

68
Density-Based Methods

69
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-connected
points
• Major features:
―Discover clusters of arbitrary shape
―Handle noise
―One scan
―Need density parameters as termination condition
• Several interesting studies:
―DBSCAN: Ester, et al. (KDD’96)
―OPTICS: Ankerst, et al (SIGMOD’99).
―DENCLUE: Hinneburg & D. Keim (KDD’98)
―CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
70
Density-Based Clustering: Basic Concepts

• Two parameters:
―ϵ-neighbourhood: maximum radius of the neighbourhood (Eps)
―MinPts: minimum number of points in an ϵ-neighbourhood of that point
• NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}: the Eps-neighbourhood of a point p
• Directly density-reachable: A point p is directly density-reachable from a point q
w.r.t. ϵ-neighbourhood, MinPts if
―p belongs to NEps(q)
―core point condition: |NEps(q)| ≥ MinPts
[Figure: a core point q with Eps = 1 cm and MinPts = 5, and a point p inside its Eps-neighbourhood.]
71
Density-Reachable and Density-Connected

• Density-reachable:
―A point p is density-reachable from a point q w.r.t. ϵ-neighbourhood, MinPts if
there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is
directly density-reachable from pi
• Density-connected:
―A point p is density-connected to a point q w.r.t. ϵ-neighbourhood, MinPts if
there is a point o such that both p and q are density-reachable from o
w.r.t. ϵ-neighbourhood and MinPts
[Figure: a chain q → p1 → p illustrating density-reachability, and points p and q both density-reachable from o illustrating density-connectivity.]
72
DBSCAN: Density-Based Spatial Clustering of Applications
with Noise
• Relies on a density-based notion of cluster: A cluster is defined as a maximal set
of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: a density-based cluster with core, border, and outlier points; Eps = 1 cm, MinPts = 5.]

73
DBSCAN: The Algorithm

• Arbitrarily select a point p


• Retrieve all points density-reachable from p w.r.t. ϵ-neighbourhood and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p and DBSCAN visits
the next point of the database
• Continue the process until all of the points have been processed

74
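A hedged usage sketch with scikit-learn's DBSCAN implementation; the synthetic data and the eps / min_samples values are placeholders chosen for illustration, not values from the slides:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away point that should come out as noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               rng.normal(5, 0.3, (30, 2)),
               [[10.0, 10.0]]])

labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)
print(set(labels))   # typically {0, 1, -1}; -1 marks noise/outlier points
```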
75
DBSCAN: Sensitive to Parameters

76
Disadvantages of DBSCAN
• The user carries the responsibility of selecting parameter values that will lead to
the discovery of acceptable clusters
• Real-world, high-dimensional data sets often have very skewed distributions,
such that their intrinsic clustering structure may not be well characterized by
a single set of global density parameters.

77
OPTICS: A Cluster-Ordering Method (1999)

• OPTICS: Ordering Points To Identify the Clustering Structure


―Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
―Produces a special order of the database wrt its density-based clustering
structure
―This is a linear list of all objects under analysis and represents the density-
based clustering structure of the data.
―Objects in a denser cluster are listed closer to each other in the cluster
ordering.
―This cluster-ordering contains info equiv to the density-based clusterings
corresponding to a broad range of parameter settings
―Good for both automatic and interactive cluster analysis, including finding
intrinsic clustering structure.
78
OPTICS: Some Extension from DBSCAN
• This ordering is equivalent to density-based clustering obtained from
a wide range of parameter settings.
• OPTICS does not require the user to provide a specific density
threshold.
• The cluster ordering can be used to extract basic clustering
information (e.g., cluster centers, or arbitrary-shaped clusters), derive
the intrinsic clustering structure, as well as provide a visualization of
the clustering.

79
80
81
Density-Based Clustering: OPTICS & Its Applications

82
DENCLUE: Clustering Based on Density
Distribution Functions
• DENCLUE (DENsity-based CLUstEring) is a clustering method based on
a set of density distribution functions.
• In DBSCAN and OPTICS, density is calculated by counting the number
of objects in a neighborhood defined by a radius parameter,ϵ. Such
density estimates can be highly sensitive to the radius value used.
• To overcome this problem, kernel density estimation can be used,
which is a nonparametric density estimation approach from statistics.

83
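To make the kernel-density idea concrete, here is a small sketch using SciPy's Gaussian kernel density estimator. It only illustrates the density-estimation step that motivates DENCLUE, not the full algorithm:

```python
import numpy as np
from scipy.stats import gaussian_kde

# A 1-D sample with two dense regions; the kernel estimate smooths over it without a hard radius.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 0.5, 200), rng.normal(5, 0.5, 200)])
kde = gaussian_kde(data)

grid = np.linspace(-2, 7, 200)
density = kde(grid)
# Local maxima of the estimated density play the role of DENCLUE's density attractors.
is_peak = np.r_[False, (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:]), False]
print(grid[is_peak])   # approximately [0, 5]
```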
84
Denclue: Technical Essence
• Uses grid cells but only keeps information about grid cells that do actually contain data points and
manages these cells in a tree-based access structure
• Influence function: describes the impact of a data point within its neighborhood
• Overall density of the data space can be calculated as the sum of the influence function of all data
points
• Clusters can be determined mathematically by identifying density attractors
• Density attractors are local maxima of the overall density function
• Center defined clusters: assign to each density attractor the points density attracted to it
• Arbitrary shaped cluster: merge density attractors that are connected through paths of high
density (> threshold)

85
Grid-Based Methods

86
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
―STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz
(1997)
―WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach using wavelet method
―CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering

87
STING: A Statistical Information Grid Approach

• Wang, Yang and Muntz (VLDB’97)


• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of resolution
[Figure: hierarchical grid structure with the 1st layer at the top, the (i−1)-st layer below it, and the i-th layer at the bottom; each cell at one level is divided into smaller cells at the next.]
88
The STING Clustering Method
• Each cell at a high level is partitioned into a number of smaller cells in the next
lower level
• Statistical info of each cell is calculated and stored beforehand and is used to
answer queries
• Parameters of higher level cells can be easily calculated from parameters of lower
level cell
―count, mean, standard deviation (s), min, max
―type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small number of cells
• For each cell in the current level compute the confidence interval
89
STING Algorithm and Its Analysis
• Remove the irrelevant cells from further consideration
• When finish examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
―Query-independent, easy to parallelize, incremental update
―O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
―All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected

90
CLIQUE (Clustering In QUEst)

• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)


• Automatically identifying subspaces of a high dimensional data space that allow better clustering
than original space
• CLIQUE can be considered as both density-based and grid-based
―It partitions each dimension into the same number of equal length interval
―It partitions an m-dimensional data space into non-overlapping rectangular units
―A unit is dense if the fraction of total data points contained in the unit exceeds the input
model parameter
―A cluster is a maximal set of connected dense units within a subspace

91
CLIQUE: The Major Steps

• Partition the data space and find the number of points that lie inside each cell of
the partition.
• Identify the candidate search space
― Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
―Determine dense units in all subspaces of interest
―Determine connected dense units in all subspaces of interest.
• Generate minimal description for the clusters
―Determine maximal regions that cover a cluster of connected dense units for
each cluster
―Determination of minimal cover for each cluster

92
• CLIQUE performs clustering in two steps.
• In the first step, CLIQUE partitions the d-dimensional data space into
nonoverlapping rectangular units, identifying the dense units among
these.
• CLIQUE finds dense cells in all of the subspaces.
• To do so, CLIQUE partitions every dimension into intervals, and
identifies intervals containing at least l points, where l is the density
threshold.

93
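A sketch of this first step, counting points per grid cell and keeping the dense ones; the 2-D subspace, the grid resolution, and the density threshold l are illustrative values, not taken from the slides:

```python
import numpy as np

# Hypothetical 2-D subspace (e.g., age vs. salary), 4 intervals per dimension.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 2))

counts, edges = np.histogramdd(X, bins=(4, 4), range=[(0, 1), (0, 1)])
l = 8                                    # density threshold: minimum points per unit
dense_units = np.argwhere(counts >= l)   # (row, col) indices of dense grid cells
print(dense_units)
```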
• In the second step, CLIQUE uses the dense cells in each subspace to
assemble clusters, which can be of arbitrary shape.
• The idea is to apply the Minimum Description Length (MDL) principle, using
maximal regions to cover the connected dense cells,
• where a maximal region is a hyperrectangle where every cell falling
into this region is dense, and the region cannot be extended further in
any dimension in the subspace

94
[Figure: CLIQUE example over the dimensions age, salary (10,000), and vacation (week), with density threshold = 3. Dense units are found in the (age, salary) and (age, vacation) subspaces; intersecting the corresponding dense regions yields a candidate cluster in the (age, salary, vacation) space.]

95
Strength and Weakness of CLIQUE

• Strength
―automatically finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces
―insensitive to the order of records in input and does not presume some
canonical data distribution
―scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weakness
―The accuracy of the clustering result may be degraded for the sake of the
simplicity of the method

96
Evaluation of Clustering

97
Assessing Clustering Tendency

• Assess if non-random structure exists in the data by measuring the probability that the data is
generated by a uniform data distribution
• Test spatial randomness by a statistical test: the Hopkins Statistic
―Given a dataset D regarded as a sample of a random variable o, determine how far away o is
from being uniformly distributed in the data space
―Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D:
xi = min{dist (pi, v)} where v in D
―Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D – {qi}:
yi = min{dist (qi, v)} where v in D and v ≠ qi
―Calculate the Hopkins Statistic:
$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$
―If D is uniformly distributed, ∑ xi and ∑ yi will be close to each other and H is close to 0.5. If D
is highly skewed (clustered), H is close to 0

98
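A sketch of the Hopkins statistic as described above (uniform probe points are drawn from the bounding box of D; the helper names and sample sizes are my own choices):

```python
import numpy as np

def hopkins(D, n_probes=50, seed=0):
    """Hopkins statistic H = sum(y) / (sum(x) + sum(y)); near 0.5 -> uniform, near 0 -> clustered."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    lo, hi = D.min(axis=0), D.max(axis=0)

    # x_i: nearest-neighbour distance in D of uniform probe points p_i from the data space.
    P = rng.uniform(lo, hi, size=(n_probes, D.shape[1]))
    x = np.array([np.linalg.norm(D - p, axis=1).min() for p in P])

    # y_i: nearest-neighbour distance of sampled data points q_i within D - {q_i}.
    idx = rng.choice(len(D), size=n_probes, replace=False)
    y = np.array([np.partition(np.linalg.norm(D - D[i], axis=1), 1)[1] for i in idx])

    return y.sum() / (x.sum() + y.sum())

# Two tight blobs: clearly clustered data, so H should come out well below 0.5.
rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(5, 0.2, (100, 2))])
print(hopkins(clustered))
```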
Determine the Number of Clusters (I)

• Empirical method
―# of clusters ≈ √(n/2) for a dataset of n points,
with each cluster having about √(2n) points.
• Elbow method
―Use the turning point in the curve of
sum of within cluster variance w.r.t the
# of clusters

99
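A sketch of the elbow method using the within-cluster sum of squares that scikit-learn's KMeans reports as inertia_; the synthetic data and the range of k are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in (0, 4, 8)])   # three synthetic blobs

# Within-cluster sum of squares for k = 1..10; look for the "elbow" in this curve.
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 11)}
for k, v in sse.items():
    print(k, round(v, 1))   # the decrease should flatten sharply around k = 3
```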
Determine the Number of Clusters(II)

• Cross validation method


―Divide a given data set into m parts
―Use m – 1 parts to obtain a clustering model
―Use the remaining part to test the quality of the clustering
• E.g., For each point in the test set, find the closest centroid, and use the sum of squared
distance between all points in the test set and the closest centroids to measure how well
the model fits the test set
―For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and
find # of clusters that fits the data the best

100
Determine the Number of Clusters(III)

• The average silhouette approach determines how well each
object lies within its cluster. A high average silhouette width
indicates a good clustering.
• The average silhouette method computes the average silhouette
of observations for different values of k. The optimal number
of clusters k is the one that maximizes the average silhouette
over a range of possible values for k (Kaufman and
Rousseeuw [1990]).
• The algorithm is similar to the elbow method and can be
computed as follows:
― Compute clustering algorithm (e.g., k-means clustering) for different
values of k. For instance, by varying k from 1 to 10 clusters
― For each k, calculate the average silhouette of observations (avg.sil)
― Plot the curve of avg.sil according to the number of clusters k.
― The location of the maximum is considered as the appropriate number
of clusters.

101
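A sketch of the average-silhouette procedure with scikit-learn's silhouette_score; again the data and the range of k are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in (0, 4, 8)])   # three synthetic blobs

# The silhouette needs at least 2 clusters; pick the k with the highest average silhouette.
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 11)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))   # expected to be 3 for this synthetic data
```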
Measuring Clustering Quality

• Two methods: extrinsic vs. intrinsic


• Extrinsic: supervised, i.e., the ground truth is available
―Compare a clustering against the ground truth using certain clustering quality
measure
―Ex. BCubed precision and recall metrics
• Intrinsic: unsupervised, i.e., the ground truth is unavailable
―Evaluate the goodness of a clustering by considering how well the clusters are
separated, and how compact the clusters are
―Ex. Silhouette coefficient

102
103
Measuring Clustering Quality: Extrinsic Methods

• Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
• Q is good if it satisfies the following 4 essential criteria
―Cluster homogeneity: the purer, the better
―Cluster completeness: should assign objects belonging to the same category in
the ground truth to the same cluster
―Rag bag: putting a heterogeneous object into a pure cluster should be
penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other”
category)
―Small cluster preservation: splitting a small category into pieces is more
harmful than splitting a large category into pieces

104
