Clustering
Cluster Analysis
• Cluster analysis or simply clustering is the process of partitioning a set of
data objects (or observations) into subsets.
• Each subset is a cluster, such that objects in a cluster are similar to one
another, yet dissimilar to objects in other clusters.
• Dissimilarities or similarities are most often assessed using distance
measures.
• Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security.
• Clustering is a form of learning by observation rather than learning by
example (as is the case with classification)
Cluster Analysis
• Because a cluster is a collection of data objects that are similar to one another
within the cluster and dissimilar to objects in other clusters, a cluster of data
objects can be treated as an implicit class. In this sense, clustering is
sometimes called automatic classification.
• Clustering is also called data segmentation in some applications because
clustering partitions large data sets into groups according to their similarity.
• Clustering can also be used for outlier detection, where outliers (values that
are “far away” from any cluster) may be more interesting than common cases.
• Clustering is known as unsupervised learning because the class label
information is not present. For this reason, clustering is a form of learning by
observation, rather than learning by examples
Requirements for Cluster Analysis (I)
Following are typical requirements of clustering in data mining
• Scalability: algorithms should be able to handle large data sets; clustering only a small sample of a large data set may lead to biased results.
• Ability to deal with different types of attributes
―Numeric, binary, nominal (categorical), and ordinal data, or mixtures of these data types
• Discovery of clusters with arbitrary shape
―clusters based on Euclidean or Manhattan distance measures tend to be spherical in shape; algorithms should also be able to discover clusters of other, more complex shapes, such as non-convex ones.
• Requirements for domain knowledge to determine input parameters
―parameters such as the desired number of clusters should not place a heavy burden on the user
• Ability to deal with noisy data
―Clustering algorithms should not be sensitive to noise
Requirements for Cluster Analysis (II)
• Incremental clustering and insensitivity to input order
―Recomputing the entire clustering whenever new data objects arrive, or whenever objects are presented in a different order, should be avoided
• Capability of clustering high-dimensionality data
―It is a challenging task, such as when clustering documents, each keyword can
be regarded as a dimension. These kinds of data are highly skewed and
sparse.
• Constraint-based clustering
• Interpretability and usability
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters
―high intra-class similarity: cohesive within clusters
―low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
―the similarity measure used by the method
―its implementation, and
―its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
― Similarity is expressed in terms of a distance function, typically metric: d(i, j)
― The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables
― Weights should be associated with different variables based on applications
and data semantics
• Quality of clustering:
― There is usually a separate “quality” function that measures the “goodness” of
a cluster.
― It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
Comparison criteria for various Clustering
approaches
• Partitioning criteria
―Single level (all clusters are considered at the same conceptual level) vs. hierarchical (clusters are at different semantic levels; often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
―Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
• Similarity measure
―Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or
contiguity)
• Clustering space
―Full space (often with low dimensional data) vs. subspace (often in high-dimensional data)
Major Clustering Approaches (I)
• Partitioning approach:
―Given a set of n objects, a partitioning method constructs a single-level partitioning that divides the objects into k partitions (k ≤ n), where each partition represents a cluster.
―Typically adopt exclusive cluster separation.
―Most methods are distance based
―May use mean or medoid (etc.) to represent cluster center.
―Effective for small to medium size data sets.
―Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
―Create a hierarchical decomposition of the set of data (or objects).
―Two types: agglomerative (bottom-up) and divisive(top-down).
―Cannot correct erroneous merges or splits
―Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Major Clustering Approaches (II)
• Density-based approach:
―Based on connectivity and density functions
―Each point of a cluster must have at least a minimum number of points within its neighborhood
―Can find arbitrary shaped clusters.
―May filter out outliers
―Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
―Based on a multiple-level granularity structure
―Fast processing
―Typical methods: STING, WaveCluster, CLIQUE
Overview of various clustering approaches
Partitioning approach
Partitioning Algorithms: Basic Concept
• These methods organize the objects into several exclusive groups called clusters.
• The number of clusters, k, is given as background knowledge.
• Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances E = Σ(i=1..k) Σ(p ∈ Ci) dist(p, ci)² is minimized (where ci is the centroid or medoid of cluster Ci)
• The clusters are formed to optimize an objective partitioning criterion, such as dissimilarity
function based on distance, so that objects in one cluster are "similar" to each other and
“dissimilar” to objects in other clusters
―k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the
cluster
―k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster
is represented by one of the objects in the cluster
K-Means
• Suppose a data set D contains n objects.
• k-means method distributes objects in D into k clusters C1, C2, …,Ck
such that Ci ⊂ D and Ci ∩ Cj = ∅ for i ≠ j
• It uses the centroid of a cluster to represent the cluster
―usually the mean (for nominal data, the mode can be used)
• The distance between an object p ∈ Ci and the centroid ci is measured by the Euclidean distance between p and ci, denoted dist(p, ci)
The K-Means: Centroid based Clustering
Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partitioning
(the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2, stop when the assignment does not change
Quality of the clustering is measured by the within-cluster variation
E = Σ(i=1..k) Σ(p ∈ Ci) dist(p, ci)²
where Ci is a cluster, ci is its centroid, and p is a data object belonging to Ci
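To make the four steps and the objective E concrete, here is a minimal NumPy sketch of the k-means loop; the data array X, the choice of k random objects as initial seeds, and the iteration cap are illustrative assumptions, not part of the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's k-means; returns (centroids, labels, E)."""
    rng = np.random.default_rng(seed)
    # initialise by picking k objects at random as seed centroids (a common variant of step 1)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 3: assign each object to the cluster with the nearest seed point (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 2: recompute each centroid as the mean of its cluster (keep old centroid if a cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # step 4: stop when assignments no longer change
            break
        centroids = new_centroids
    # within-cluster variation E = sum of squared distances of each object to its assigned centroid
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    E = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, E

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
print(kmeans(X, k=2))
```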
An Example of K-Means Clustering
(Figure: iterative k-means assignment and centroid-update steps on a 2-D data set with K = 2.)
Sensitive to noisy data and outliers
• For the data set {1, 2, 3, 8, 9, 10, 25}, classify the points using the k-means clustering algorithm with k = 2
• Solution: C1 = {1, 2, 3}, C2 = {8, 9, 10, 25}, with c1 = 2 and c2 = 13
• Calculate E = (1−2)² + (2−2)² + (3−2)² + (8−13)² + (9−13)² + (10−13)² + (25−13)² = 196
• Consider another partitioning C1 = {1, 2, 3, 8} and C2 = {9, 10, 25}, with c1 = 3.5 and c2 = 14.67
• Recomputing gives E = 189.67
• The lower E of the second, counter-intuitive partitioning is caused by the outlier 25: it drags the mean of C2 so far up that placing 8 in the intuitive cluster {8, 9, 10, 25} becomes the more costly choice.
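The two values of E quoted above can be checked with a few lines of plain Python (the numbers are exactly the ones from the example):

```python
def within_cluster_variation(clusters):
    """E = sum over clusters of squared distances of each value to its cluster mean."""
    E = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        E += sum((x - mean) ** 2 for x in c)
    return E

print(within_cluster_variation([[1, 2, 3], [8, 9, 10, 25]]))   # 196.0
print(within_cluster_variation([[1, 2, 3, 8], [9, 10, 25]]))   # ~189.67
```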
Variations of the K-Means Method
• Most variants of k-means differ in
―Selection of the initial k means
―Dissimilarity calculations
―Strategies to calculate cluster means
• Handling categorical data: k-modes
―Replacing means of clusters with modes
―Using new dissimilarity measures to deal with categorical objects
―Using a frequency-based method to update modes of clusters
―A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means
Method?
• The k-means algorithm is sensitive to outliers !
―Since an object with an extremely large value may substantially distort the
distribution of the data
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster
PAM(Partitioning Around Medoids): A Typical K-Medoids
Algorithm
(Figure: PAM on ten points with k = 2, total cost = 20: arbitrarily choose k objects as initial medoids; assign each remaining object to its nearest medoid; then repeat: randomly select a non-medoid object O_random, compute the total cost of swapping a medoid O with O_random, and perform the swap if it improves the quality; continue until no change.)
Minimizing dissimilarity
• The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its
corresponding representative object. That is, an absolute-error criterion is used, defined as
E = Σ(i=1..k) Σ(p ∈ Ci) dist(p, oi)
where E is the sum of the absolute error for all objects p in the data set, and oi is the representative object of Ci. This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute error.
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
―PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
• PAM works effectively for small data sets, but does not scale well for large
data sets (due to the computational complexity)
• Efficiency improvement on PAM
―CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
―CLARANS (Ng & Han, 1994): Randomized re-sampling
Example
• Let us consider the following data objects with k=2 and distance
metric as Manhattan distance
The Manhattan distance between two points (x1, y1) and (x2, y2) is |x1 − x2| + |y1 − y2|
• Consider initial medoid of C1 as (3,4) and C2 as (7,4)
No Data Objects Dissimilarity with C1 Dissimilarity with C2
1 (7,6) 6 2
2 (2,6) 3 7
3 (3,8) 4 8
4 (8,5) 6 2
5 (7,4) 4 0
6 (4,7) 4 6
7 (6,2) 5 3
8 (7,3) 5 1
9 (6,4) 3 1
10 (3,4) 0 4
• So C1={2,3,6,10} and C2={1,4,5,7,8,9}
• Cost =(3+4+4+0)+(2+2+0+3+1+1)=20
• Now let us choose o_random = (7, 3) as the new medoid for C2. Repeating the above procedure, C1 and C2 turn out to be the same, but the change in cost is S = 22 − 20 = 2 > 0.
• This means that replacing the current medoid with o_random would incur a higher total cost than the previous iteration, so the swap is not performed (the two costs are verified in the sketch below).
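The two total costs (20 for medoids (3,4) and (7,4), 22 after swapping in (7,3)) can be verified with a short script; the point list is taken from the table above, everything else is an illustrative helper.

```python
points = [(7, 6), (2, 6), (3, 8), (8, 5), (7, 4), (4, 7), (6, 2), (7, 3), (6, 4), (3, 4)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(medoids):
    # each object contributes its distance to the nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

print(total_cost([(3, 4), (7, 4)]))  # 20
print(total_cost([(3, 4), (7, 3)]))  # 22 -> swap rejected, since 22 - 20 > 0
```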
CLARA (Clustering LARge Applications)
• Instead of taking the whole data set into consideration, CLARA uses a random
sample of the data set.
• The PAM algorithm is then applied to compute the best medoids from the
sample.
• Ideally, the sample should closely represent the original data set. In many cases, a
large sample works well if it is created so that each object has equal probability of
being selected into the sample.
• CLARA builds clusterings from multiple random samples and returns the best
clustering as the output.
• The complexity of computing the medoids on a random sample is O(ks² + k(n − k)), where s is the size of the sample, k is the number of clusters, and n is the total number of objects. CLARA can deal with larger data sets than PAM.
Hierarchical Clustering
• A hierarchical clustering method works by grouping data objects into
a hierarchy or “tree” of clusters
• Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
• Agglomerative clustering vs divisive clustering depending on whether
hierarchical decomposition is formed in a bottom-up (merging) or
top-down (splitting) fashion.
Agglomerative Hierarchical Clustering
• It uses a bottom-up strategy.
• It starts by letting each object form its own cluster.
• It then iteratively merges clusters into larger clusters until all the objects
are in a single cluster or certain termination condition is satisfied.
• The single cluster becomes the hierarchy's root.
• Two clusters that are closest to each other according to some similarity
measure are combined to form one cluster.
• Because two clusters are merged per iteration, and each cluster contains at least one object, an agglomerative method requires at most n iterations.
Divisive Hierarchical Clustering
• It employs top-down strategy.
• It starts by placing all objects in one cluster, which is the root of the
hierarchy.
• It then recursively partitions the root cluster into several smaller subclusters.
• The partitioning process continues until each cluster at the lowest level is coherent enough: it either contains a single data object, or the objects within it are sufficiently similar.
Example
(Figure: AGNES (AGglomerative NESting) merges objects a, b, c, d, e step by step over steps 0-4: {a, b} into ab, {d, e} into de, de with c into cde, and finally ab with cde into abcde; DIANA (DIvisive ANAlysis) performs the same splits in the reverse direction, from abcde back to the individual objects.)
Distance/ Linkage Measures in Hierarchical
Clustering
• Given objects p and p’, |p-p’| is the distance between p and p’, mi is the mean of
cluster Ci, and ni is the number of objects in cluster Ci
• Minimum distance: dist_min(Ci, Cj) = min { |p − p'| : p ∈ Ci, p' ∈ Cj }
• Maximum distance: dist_max(Ci, Cj) = max { |p − p'| : p ∈ Ci, p' ∈ Cj }
• Mean distance: dist_mean(Ci, Cj) = |mi − mj|
• Average distance: dist_avg(Ci, Cj) = (1 / (ni · nj)) Σ(p ∈ Ci) Σ(p' ∈ Cj) |p − p'|
• When a method uses the minimum distance, it is called a nearest-neighbour clustering algorithm or single-linkage method. In these methods, merging two clusters Ci and Cj corresponds to adding an edge between the nearest pair of data points in Ci and Cj.
• When a method uses the maximum distance, it is called a farthest-neighbour clustering algorithm or complete-linkage method. The distance between two clusters is determined by the most distant pair of data points in the two clusters (see the sketch below).
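As a quick illustration of the four inter-cluster distances defined above, here is a NumPy sketch for two small hypothetical clusters (the coordinates are made up for illustration):

```python
import numpy as np

Ci = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])
Cj = np.array([[6.0, 5.0], [7.0, 6.0]])

# all pairwise Euclidean distances |p - p'| for p in Ci, p' in Cj
pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

dist_min = pairwise.min()                                      # single linkage
dist_max = pairwise.max()                                      # complete linkage
dist_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))  # distance between cluster means
dist_avg = pairwise.mean()                                     # average linkage
print(dist_min, dist_max, dist_mean, dist_avg)
```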
AGNES (Agglomerative Nesting)
DIANA (DIvisive ANAlysis)
Dendrogram: Shows How Clusters are
Merged
―Decompose data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram
―A clustering of the data objects is obtained by cutting the dendrogram at the desired level,
then each connected component forms a cluster
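A minimal sketch of building such a dendrogram and cutting it at a desired level, using SciPy's hierarchical-clustering routines (a library choice; the sample points are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.40, 0.50], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# agglomerative clustering with single linkage (nearest neighbour) and Euclidean distance
Z = linkage(X, method="single", metric="euclidean")

# "cut the dendrogram at the desired level": here, ask for 2 connected components
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```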
Example: Agglomerative Clustering with
single linkage and Euclidean Distance
Plot graph
Create Distance Matrix
Distance Matrix
• The minimum distance is between P3 and P6, so they are merged to form one cluster.
Update the distance matrix
Updated Distance Matrix when P3 and P6
form cluster
Updated Distance Matrix when P2 and P5 are
merged in one cluster
Updated Distance matrix when (P3,P6) and
(P2,P5) are merged into one cluster
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max { dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
• Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg { dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
―Medoid: a chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
• Centroid: the "middle" of a cluster: Cm = ( Σ(i=1..N) ti ) / N
• Radius: square root of the average distance from any point of the cluster to its centroid: Rm = sqrt( Σ(i=1..N) (ti − Cm)² / N )
• Diameter: square root of the average mean squared distance between all pairs of points in the cluster: Dm = sqrt( Σ(i=1..N) Σ(j=1..N) (ti − tj)² / (N (N − 1)) )
Extensions to Hierarchical Clustering
• Major weakness of agglomerative clustering methods
―Can never undo what was done previously
―Do not scale well: time complexity of at least O(n²), where n is the total number of objects
• Integration of hierarchical & distance-based clustering
―BIRCH (1996): uses CF-tree and incrementally adjusts the quality
of sub-clusters
―CHAMELEON (1999): hierarchical clustering using dynamic
modeling
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD’96
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
―Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
―Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the
CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the quality
with a few additional scans
• Weakness: handles only numeric data, and is sensitive to the order of the data records
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• BIRCH uses the notions of clustering feature to summarize a cluster,
and clustering feature tree (CF-tree) to represent a cluster hierarchy.
• These structures help the clustering method achieve good speed and
scalability in large or even streaming databases, and also make it
effective for incremental and dynamic clustering of incoming objects.
BIRCH- clustering feature (CF)
• Consider a cluster of n d-dimensional data objects or points.
• The clustering feature (CF) of the cluster is a 3-D vector summarizing information about the cluster. It is defined as
CF = (n, LS, SS)
• where LS is the linear sum of the n points (i.e., Σ(i=1..n) xi) and SS is the square sum of the data points (i.e., Σ(i=1..n) xi²)
• A clustering feature is essentially a summary of the statistics for the given cluster. Using a clustering feature, we can easily derive many useful statistics of a cluster
BIRCH
• From the clustering feature, the cluster's centroid x0, radius R, and diameter D are
x0 = LS / n,  R = sqrt( SS/n − (LS/n)² ),  D = sqrt( (2n·SS − 2·LS²) / (n (n − 1)) )
BIRCH- Summarizing a cluster using the
clustering feature
• Summarizing a cluster using the clustering feature can avoid storing
the detailed information about individual objects or points.
• Instead, we only need a constant size of space to store the clustering
feature. This is the key to BIRCH efficiency in space.
• Moreover, clustering features are additive.
―That is, for two disjoint clusters, C1 and C2, with clustering features CF1 = (n1, LS1, SS1) and CF2 = (n2, LS2, SS2), respectively, the clustering feature for the cluster formed by merging C1 and C2 is simply CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2) (see the sketch below).
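A minimal sketch of the clustering-feature idea: keep only (n, LS, SS) per subcluster, merge summaries by component-wise addition, and derive centroid, radius, and diameter from the summary. The helper class and sample points are illustrative assumptions, not BIRCH's actual implementation.

```python
import numpy as np

class CF:
    """Clustering feature CF = (n, LS, SS) for a set of d-dimensional points."""
    def __init__(self, points):
        points = np.asarray(points, dtype=float)
        self.n = len(points)
        self.LS = points.sum(axis=0)      # linear sum of the points
        self.SS = (points ** 2).sum()     # square sum of the points

    def merge(self, other):
        """CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2): additivity of clustering features."""
        out = CF.__new__(CF)
        out.n, out.LS, out.SS = self.n + other.n, self.LS + other.LS, self.SS + other.SS
        return out

    def centroid(self):
        return self.LS / self.n

    def radius(self):
        # R = sqrt(SS/n - |LS/n|^2): average distance from members to the centroid
        c = self.centroid()
        return np.sqrt(self.SS / self.n - np.dot(c, c))

    def diameter(self):
        # D = sqrt((2n*SS - 2*|LS|^2) / (n*(n-1))): average pairwise distance
        return np.sqrt((2 * self.n * self.SS - 2 * np.dot(self.LS, self.LS))
                       / (self.n * (self.n - 1)))

c1 = CF([[1.0, 2.0], [2.0, 2.0]])
c2 = CF([[8.0, 9.0], [9.0, 8.0], [8.5, 8.5]])
merged = c1.merge(c2)
print(merged.n, merged.centroid(), merged.radius(), merged.diameter())
```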
BIRCH
• A CF-tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering.
• The nonleaf nodes store sums of the CFs of their children, and thus
summarize clustering information about their children.
• A CF-tree has two parameters: branching factor, B, and threshold, T.
―The branching factor specifies the maximum number of children per nonleaf
node.
―The threshold parameter specifies the maximum diameter of subclusters
stored at the leaf nodes of the tree.
• These two parameters implicitly control the resulting tree’s size.
The CF Tree Structure
(Figure: a CF-tree with a root node and non-leaf nodes; each non-leaf node holds entries CF11, CF12, CF13, …, each pointing to a child node child1, child2, child3, …, and the leaf nodes hold the CFs of subclusters.)
The BIRCH Algorithm
• Cluster diameter: D = sqrt( Σi Σj (xi − xj)² / (n (n − 1)) ) = sqrt( (2n·SS − 2·LS²) / (n (n − 1)) )
CHAMELEON
CHAMELEON
• The relative interconnectivity, RI(Ci ,Cj), between two clusters, Ci and Cj , is defined
as the absolute interconnectivity between Ci and Cj , normalized with respect to the
internal interconnectivity of the two clusters, Ci and Cj . That is,
RI(Ci, Cj) = |EC{Ci,Cj}| / ( (|ECCi| + |ECCj|) / 2 )
• where EC{Ci ,Cj} is the edge cut as previously defined for a cluster containing both Ci
and Cj .
• Similarly, ECCi (or ECCj ) is the minimum sum of the cut edges that partition Ci (or Cj)
into two roughly equal parts.
CHAMELEON
• The relative closeness, RC (Ci ,Cj), between a pair of clusters, Ci and
Cj , is the absolute closeness between Ci and Cj , normalized with
respect to the internal closeness of the two clusters, Ci and Cj . It is
defined as
RC(Ci, Cj) = SEC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|)) · SECCi + (|Cj| / (|Ci| + |Cj|)) · SECCj )
• where SEC{Ci ,Cj} is the average weight of the edges that connect
vertices in Ci to vertices in Cj , and SECCi (or SECCj ) is the average
weight of the edges that belong to the mincut bisector of cluster Ci
(or Cj ).
CHAMELEON
• Chameleon has been shown to have greater power at discovering
arbitrarily shaped clusters of high quality than several well-known
algorithms such as BIRCH and density-based DBSCAN.
• However, the processing cost for high-dimensional data may require O(n²) time for n objects in the worst case.
Overall Framework of CHAMELEON
(Figure: data set → construct a sparse k-NN graph, in which p and q are connected if q is among the k closest neighbours of p → partition the graph → merge partitions based on relative interconnectivity (connectivity of c1 and c2 over their internal connectivity) and relative closeness (closeness of c1 and c2 over their internal closeness) → final clusters.)
CHAMELEON (Clustering Complex Objects)
Density-Based Methods
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-connected
points
• Major features:
―Discover clusters of arbitrary shape
―Handle noise
―One scan
―Need density parameters as termination condition
• Several interesting studies:
―DBSCAN: Ester, et al. (KDD’96)
―OPTICS: Ankerst, et al (SIGMOD’99).
―DENCLUE: Hinneburg & D. Keim (KDD’98)
―CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering: Basic Concepts
• Two parameters:
―Eps (ϵ): maximum radius of the neighbourhood
―MinPts: minimum number of points in an Eps-neighbourhood of that point
• N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}, the Eps-neighbourhood of a point p
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
―p ∈ N_Eps(q), and
―q satisfies the core point condition: |N_Eps(q)| ≥ MinPts
(Figure: p in the neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm.)
Density-Reachable and Density-Connected
• Density-reachable:
―A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
• Density-connected:
―A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
(Figure: a chain of directly density-reachable points from q to p, and points p and q both density-reachable from a common point o.)
DBSCAN: Density-Based Spatial Clustering of Applications
with Noise
• Relies on a density-based notion of cluster: A cluster is defined as a maximal set
of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border, and outlier (noise) points for Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm
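As a minimal illustration of density-based clustering with these two parameters, here is a sketch using scikit-learn's DBSCAN (a library choice; eps, min_samples, and the generated sample data are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered noise points
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([5, 5], 0.3, size=(50, 2)),
               rng.uniform(-2, 7, size=(5, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # eps = neighbourhood radius, min_samples = MinPts
labels = db.labels_                         # label -1 marks noise/outlier points
print(set(labels))

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points satisfy |N_eps(p)| >= MinPts
```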
DBSCAN: Sensitive to Parameters
Disadvantages of DBSCAN
• The user has the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters
• Real-world, high-dimensional data sets often have very skewed distributions, so their intrinsic clustering structure may not be well characterized by a single set of global density parameters
OPTICS: A Cluster-Ordering Method (1999)
Density-Based Clustering: OPTICS & Its Applications
DENCLUE: Clustering Based on Density
Distribution Functions
• DENCLUE (DENsity-based CLUstEring) is a clustering method based on
a set of density distribution functions.
• In DBSCAN and OPTICS, density is calculated by counting the number
of objects in a neighborhood defined by a radius parameter, ϵ. Such
density estimates can be highly sensitive to the radius value used.
• To overcome this problem, kernel density estimation can be used,
which is a nonparametric density estimation approach from statistics.
Denclue: Technical Essence
• Uses grid cells but only keeps information about grid cells that do actually contain data points and
manages these cells in a tree-based access structure
• Influence function: describes the impact of a data point within its neighborhood
• Overall density of the data space can be calculated as the sum of the influence function of all data
points
• Clusters can be determined mathematically by identifying density attractors
• Density attractors are local maxima of the overall density function
• Center defined clusters: assign to each density attractor the points density attracted to it
• Arbitrary shaped cluster: merge density attractors that are connected through paths of high
density (> threshold)
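A minimal sketch of the influence-function idea described above: every data point contributes a Gaussian kernel, the overall density is the sum of these contributions, and each point is hill-climbed (here with a mean-shift-style update) toward its density attractor. The bandwidth h and the sample data are illustrative assumptions, not DENCLUE's actual implementation.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
h = 0.5  # kernel bandwidth

def density(x):
    # overall density at x = sum of Gaussian influence functions of all data points
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2)).sum()

def hill_climb(x, iters=50):
    # move x toward the weighted mean of its neighbours, i.e. hill-climb the density estimate
    for _ in range(iters):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
        x = (w[:, None] * X).sum(axis=0) / w.sum()
    return x

attractors = np.round([hill_climb(x) for x in X], 1)
print(attractors)  # points in the same blob converge to (roughly) the same density attractor
```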
Grid-Based Methods
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
―STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz
(1997)
―WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach using wavelet method
―CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
STING: A Statistical Information Grid Approach
(Figure: STING's hierarchical grid, in which each cell at the (i−1)-st layer is partitioned into several smaller cells at the i-th layer.)
The STING Clustering Method
• Each cell at a high level is partitioned into a number of smaller cells in the next
lower level
• Statistical info of each cell is calculated and stored beforehand and is used to
answer queries
• Parameters of higher level cells can be easily calculated from parameters of lower
level cell
―count, mean, standard deviation (s), min, max
―type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small number of cells
• For each cell in the current level compute the confidence interval
STING Algorithm and Its Analysis
• Remove the irrelevant cells from further consideration
• When finished examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
―Query-independent, easy to parallelize, incremental update
―O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
―All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected
CLIQUE (Clustering In QUEst)
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of
the partition.
• Identify the candidate search space
― Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
―Determine dense units in all subspaces of interests
―Determine connected dense units in all subspaces of interests.
• Generate minimal description for the clusters
―Determine maximal regions that cover a cluster of connected dense units for
each cluster
―Determination of minimal cover for each cluster
• CLIQUE performs clustering in two steps.
• In the first step, CLIQUE partitions the d-dimensional data space into
nonoverlapping rectangular units, identifying the dense units among
these.
• CLIQUE finds dense cells in all of the subspaces.
• To do so, CLIQUE partitions every dimension into intervals, and
identifies intervals containing at least l points, where l is the density
threshold.
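A small sketch of this first step for a single dimension: partition the dimension into ξ equal-width intervals and keep those containing at least l points (ξ, l, and the sample values are illustrative assumptions):

```python
import numpy as np

values = np.array([21, 22, 23, 24, 25, 40, 41, 42, 58, 59, 60, 61, 35])  # e.g. an "age" attribute
xi, l = 8, 3            # number of intervals per dimension, density threshold

lo, hi = values.min(), values.max()
edges = np.linspace(lo, hi, xi + 1)
cell = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, xi - 1)

counts = np.bincount(cell, minlength=xi)
dense_units = [(edges[i], edges[i + 1]) for i in range(xi) if counts[i] >= l]
print(dense_units)      # 1-D dense units; candidate k-D units are built from (k-1)-D ones (Apriori)
```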
• In the second step, CLIQUE uses the dense cells in each subspace to
assemble clusters, which can be of arbitrary shape.
• The idea is to apply the Minimum Description Length (MDL) principle and use maximal regions to cover the connected dense cells,
• where a maximal region is a hyperrectangle where every cell falling
into this region is dense, and the region cannot be extended further in
any dimension in the subspace
(Figure: CLIQUE example with a density threshold of 3: dense units are found separately in the (salary, age) and (vacation, age) subspaces (salary in units of $10,000, vacation in weeks, age from 20 to 60), and the intersection of their dense units yields candidate dense units in the (age, salary, vacation) space.)
Strength and Weakness of CLIQUE
• Strength
―automatically finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces
―insensitive to the order of records in input and does not presume some
canonical data distribution
―scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weakness
―The accuracy of the clustering result may be degraded as the price of the method's simplicity
Evaluation of Clustering
Assessing Clustering Tendency
• Assess if non-random structure exists in the data by measuring the probability that the data is
generated by a uniform data distribution
• Test spatial randomness by a statistical test: the Hopkins statistic
―Given a dataset D regarded as a sample of a random variable o, determine how far away o is
from being uniformly distributed in the data space
―Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D:
xi = min{dist (pi, v)} where v in D
―Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D – {qi}:
yi = min{dist (qi, v)} where v in D and v ≠ qi
―Calculate the Hopkins statistic: H = Σ(i=1..n) yi / ( Σ(i=1..n) xi + Σ(i=1..n) yi )
―If D is uniformly distributed, Σ xi and Σ yi will be close to each other, and H is close to 0.5. If D is highly skewed (clustered), H is close to 0
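A minimal sketch of the Hopkins statistic, assuming the points p1, …, pn are drawn uniformly from the bounding box of the data space (one common reading of the procedure above) and H = Σ yi / (Σ xi + Σ yi):

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, n=50, seed=0):
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    tree = cKDTree(D)

    # p_i: points sampled uniformly from the bounding box of the data space
    p = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, D.shape[1]))
    x, _ = tree.query(p, k=1)            # x_i = distance to nearest neighbour in D

    # q_i: points sampled uniformly from D itself; nearest neighbour in D - {q_i}
    q = D[rng.choice(len(D), size=n, replace=False)]
    y = tree.query(q, k=2)[0][:, 1]      # second-nearest neighbour skips q_i itself

    return y.sum() / (x.sum() + y.sum()) # ~0.5 for uniform data, near 0 for clustered data

uniform = np.random.default_rng(1).uniform(size=(500, 2))
clustered = np.vstack([np.random.default_rng(2).normal(c, 0.02, size=(250, 2))
                       for c in ([0.2, 0.2], [0.8, 0.8])])
print(hopkins(uniform), hopkins(clustered))
```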
Determine the Number of Clusters (I)
• Empirical method
―# of clusters k ≈ √(n/2) for a data set of n points, with each cluster having about √(2n) points.
• Elbow method
―Use the turning point in the curve of
sum of within cluster variance w.r.t the
# of clusters
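A minimal sketch of the elbow heuristic using scikit-learn's k-means inertia as the sum of within-cluster variance (a library choice; the data and the range of k are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in ([0, 0], [5, 0], [2.5, 4])])

# sum of within-cluster variance (inertia) as a function of the number of clusters
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # look for the "elbow" (turning point) in this curve
```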
Determine the Number of Clusters (II)
Determine the Number of Clusters (III)
Measuring Clustering Quality
Measuring Clustering Quality: Extrinsic Methods
• Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
• Q is good if it satisfies the following 4 essential criteria
―Cluster homogeneity: the purer, the better
―Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster
―Rag bag: putting a heterogeneous object into a pure cluster should be
penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other”
category)
―Small cluster preservation: splitting a small category into pieces is more
harmful than splitting a large category into pieces