Chap8-Cluster Analysis
• Partitioning criterion: sum of squared errors, $SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - c_k\|^2$, where $c_k$ is the centroid of cluster $C_k$
• Problem definition: Given K, find a partition of K clusters that optimizes the
chosen partitioning criterion
• Global optimum: needs exhaustive enumeration of all possible partitions
• Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-Medoids, etc.
The K-Means Clustering Method
• K-Means (MacQueen’67, Lloyd’57/’82)
• Each cluster is represented by the center of the cluster
• Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows (a code sketch follows this list)
• Select K points as initial centroids
• Repeat
• Form K clusters by assigning each point to its closest centroid
• Re-compute the centroid (i.e., mean point) of each cluster
• Until convergence criterion is satisfied
• Different distance or similarity measures can be used
• Manhattan distance (L1 norm), Euclidean distance (L2 norm), Cosine similarity
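A minimal NumPy sketch of the loop above (Lloyd's algorithm). The function name, random initialization, and convergence test are simplified choices of ours; in practice a library implementation such as scikit-learn's KMeans would normally be used:

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm: X is an (n, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute the centroid (mean point) of each cluster; keep the old one if a cluster is empty
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                                  for k in range(K)])
        # Convergence criterion: centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage: labels, centroids = kmeans(np.random.default_rng(1).random((100, 2)), K=3)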
Example: K-Means Clustering
[Figure: iterations of K-Means — assign points to clusters, then recompute the cluster centers, and repeat until convergence]
[Figure: K-Medoids (PAM) illustration with K = 2 — arbitrarily choose K objects as initial medoids; assign each remaining object to the nearest medoid; then repeatedly select a random non-medoid object O_random, compute the total cost of swapping, and perform the swap if it improves the clustering quality]
Example: Kernel K-Means Clustering
[Figure: three panels — the original data set, the result of K-Means clustering, and the result of Gaussian Kernel K-Means clustering]
• The above data set cannot generate quality clusters by K-Means since it contains non-convex clusters
• The Gaussian RBF kernel transformation maps the data to a kernel matrix K: for any two points $x_i$ and $x_j$, $K_{x_i x_j} = \phi(x_i) \cdot \phi(x_j)$, with the Gaussian kernel $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / (2\sigma^2)}$
• Example kernel matrix for five 2-D points with $\sigma = 4$, e.g., $K(x_1, x_2) = e^{-(4^2 + 4^2)/(2 \cdot 4^2)} = e^{-1}$:

Point  Coordinates   x1        x2        x3        x4        x5
x1     (0, 0)        1         e^{-1}    e^{-1}    e^{-1}    e^{-1}
x2     (4, 4)        e^{-1}    1         e^{-2}    e^{-4}    e^{-2}
x3     (-4, 4)       e^{-1}    e^{-2}    1         e^{-2}    e^{-4}
x4     (-4, -4)      e^{-1}    e^{-4}    e^{-2}    1         e^{-2}
x5     (4, -4)       e^{-1}    e^{-2}    e^{-4}    e^{-2}    1
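A short NumPy check of the kernel matrix above; the variable names are ours, while the five points and σ = 4 come from the example:

import numpy as np

# The five example points
X = np.array([[0, 0], [4, 4], [-4, 4], [-4, -4], [4, -4]], dtype=float)
sigma = 4.0

# Gaussian (RBF) kernel: K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# log(K) gives the exponents: 0 on the diagonal, -1, -2, -4 elsewhere, matching the table
print(np.round(np.log(K), 2))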
[Figure: agglomerative (AGNES) vs. divisive (DIANA) clustering of objects a, b, c, d, e — AGNES merges bottom-up ({a},{b} → {ab}; {d},{e} → {de}; {c},{de} → {cde}; {ab},{cde} → {abcde}) over steps 0 to 4, while DIANA splits top-down in the reverse order]
Dendrogram: How Clusters are Merged
• Dendrogram: Decompose a set of data objects into a tree of clusters
by multi-level nested partitioning
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster (illustrated in the sketch below)
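For instance, with SciPy one can build a dendrogram and cut it at a chosen level; the toy data and the cut threshold below are arbitrary, chosen only to illustrate the idea:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data set
X = np.array([[1, 1], [1.2, 0.8], [5, 5], [5.1, 4.9], [9, 1]])

# Build the dendrogram bottom-up using single link ('complete' and 'average' are alternatives)
Z = linkage(X, method='single')

# Cut the dendrogram at distance 2.0; each connected component becomes a cluster
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)   # e.g., [1 1 2 2 3]: three clusters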
Single Link vs. Complete Link in Hierarchical Clustering
[Figure: two example data sets with outlier points marked X]
• Sensitive to outliers: a single distant point can distort the minimum (single-link) or maximum (complete-link) inter-cluster distance
Agglomerative Clustering: Average vs. Centroid Links
[Figure: example data sets illustrating average-link vs. centroid-link merging; cluster centers marked X]
Divisive Clustering Is a Top-down Approach
• The process starts at the root with all the points as one cluster
• It recursively splits the higher level clusters to build the dendrogram
• Can be considered as a global approach
• More efficient when compared with agglomerative clustering
More on Algorithm Design for Divisive
Clustering
• Choosing which cluster to split
• Check the sum of squared errors (SSE) of each cluster and choose the one with the largest value (see the sketch after this list)
• Splitting criterion: Determining how to split
• One may use Ward’s criterion and choose the split that yields the greatest reduction in the SSE criterion
• For categorical data, Gini-index can be used
• Handling the noise
• Use a threshold to determine the termination criterion (do not generate clusters that are too small because they would contain mainly noise)
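A minimal sketch of the cluster-selection step, assuming clusters are stored in a dict mapping a cluster id to a NumPy array of its points; the data layout and function names are ours, only the largest-SSE rule comes from the slide:

import numpy as np

def pick_cluster_to_split(clusters):
    """clusters: dict mapping cluster id -> (n_k, d) array of points.
    Return the id of the cluster with the largest SSE."""
    def sse(points):
        centroid = points.mean(axis=0)
        return ((points - centroid) ** 2).sum()
    return max(clusters, key=lambda cid: sse(clusters[cid]))

# Example: cluster 'b' is more spread out, so it is chosen for the next split
clusters = {'a': np.array([[0.0, 0.0], [0.1, 0.1]]),
            'b': np.array([[5.0, 5.0], [9.0, 9.0], [5.0, 9.0]])}
print(pick_cluster_to_split(clusters))   # 'b'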
Extensions to Hierarchical Clustering
• Weakness of the agglomerative & divisive hierarchical clustering
methods
• No revisit: cannot undo any merge/split decisions made before
• Scalability bottleneck: Each merge/split needs to examine many possible
options
• Time complexity: at least O(n²), where n is the number of total objects
• Several other hierarchical clustering algorithms
• BIRCH (1996): Use CF-tree and incrementally adjust the quality of sub-clusters
• CURE (1998): Represent a cluster using a set of well-scattered representative
points
• CHAMELEON (1999): Use graph partitioning methods on the K-nearest
neighbor graph of the data
BIRCH: A Multi-Phase Hierarchical Clustering
Method
• BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
• Developed by Zhang, Ramakrishnan & Livny (SIGMOD’96)
• Impacted many new clustering methods and applications (received the 2006 SIGMOD Test of Time Award)
• Major innovation
• Integrating hierarchical clustering (initial micro-clustering phase) and other clustering
methods (at the later macro-clustering phase)
• Multi-phase hierarchical clustering
• Phase 1 (initial micro-clustering): Scan the DB to build an initial CF-tree, a multi-level compression of the data that preserves its inherent clustering structure
• Phase 2 (later macro-clustering): Use an arbitrary clustering algorithm (e.g., iterative partitioning) to flexibly cluster the leaf nodes of the CF-tree (see the usage sketch below)
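As a quick illustration, scikit-learn ships a Birch implementation that follows this two-phase design; the data and parameter values below are arbitrary (threshold controls the sub-cluster radius in the CF-tree, n_clusters drives the final macro-clustering step):

import numpy as np
from sklearn.cluster import Birch

# Toy data: three loose groups of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

# Phase 1 builds the CF-tree; phase 2 clusters the leaf entries into n_clusters groups
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # roughly 50 points per cluster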
Clustering Feature Vector
• Consider a cluster of multi-dimensional data objects/points
• The clustering feature (CF) of the cluster is a 3-D vector summarizing
info about clusters of objects
• Register the 0-th, 1st, and 2nd moments of a cluster
• Clustering Feature (CF): CF = <n, LS, SS>
  • n: number of data points in the cluster
  • LS: linear sum of the n points: $LS = \sum_{i=1}^{n} x_i$
  • SS: square sum of the n points: $SS = \sum_{i=1}^{n} \|x_i\|^2$
• Example: cluster C1 = {(3,4), (2,6), (4,5), (4,7), (3,8)}
  • n = 5; LS = ((3+2+4+4+3), (4+6+5+7+8)) = (16, 30); SS = (3²+2²+4²+4²+3²) + (4²+6²+5²+7²+8²) = 244
  • CF1 = <5, (16, 30), 244>
[Figure: the five points of C1 plotted in the plane]
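A tiny NumPy check of the CF vector for this example (variable names are ours):

import numpy as np

# The five points of cluster C1
C1 = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)

n = len(C1)              # 0-th moment: number of points -> 5
LS = C1.sum(axis=0)      # 1st moment: linear sum -> array([16., 30.])
SS = (C1 ** 2).sum()     # 2nd moment: square sum -> 244.0
print(n, LS, SS)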
Essential Measures of Cluster: Centroid, Radius and Diameter
• Centroid: the “middle” of a cluster

  $x_0 = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{LS}{n}$

• Radius: square root of the average squared distance from member points to the centroid

  $R = \sqrt{\frac{\sum_{i=1}^{n} \|x_i - x_0\|^2}{n}} = \sqrt{\frac{SS}{n} - \frac{\|LS\|^2}{n^2}}$
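Continuing the CF example, both measures can be computed from <n, LS, SS> alone, without revisiting the raw points (a small sketch reusing the values from above):

import numpy as np

n, LS, SS = 5, np.array([16.0, 30.0]), 244.0

centroid = LS / n                                # array([3.2, 6. ])
radius = np.sqrt(SS / n - (LS @ LS) / n ** 2)    # sqrt(48.8 - 46.24) = 1.6
print(centroid, radius)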
[Figure: a non-leaf node of the CF-tree holding entries CF11, CF12, CF13, …, CF15, each with a pointer to a child node (child11, child12, child13, …, child15)]
Ack. Figures from G. Karypis, E.-H. Han, and V. Kumar, COMPUTER, 32(8), 1999
OPTICS: Ordering Points To Identify Clustering
Structure
• OPTICS (Ankerst, Breunig, Kriegel, and Sander, SIGMOD’99)
• DBSCAN is sensitive to parameter setting
• An extension: finding clustering structure
• Observation: Given a fixed MinPts, density-based clusters w.r.t. a higher density are completely contained in clusters w.r.t. a lower density
• Idea: Higher density points should be processed first—find high-
density clusters first
• OPTICS stores such a clustering order using two pieces of information:
• Core distance and reachability distance
Visualization
• Since points belonging to a cluster have a low reachability distance to
their nearest neighbor, valleys correspond to clusters
• The deeper the valley, the denser the cluster
[Figure: reachability plot for a dataset — y-axis: reachability distance; some points have an undefined reachability distance]
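For example, scikit-learn's OPTICS exposes the cluster ordering and the reachability distances, from which such a plot can be drawn; the toy data and min_samples value below are arbitrary:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

# Toy data: two dense blobs plus scattered noise
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.2, (100, 2)),
               rng.normal([4, 4], 0.2, (100, 2)),
               rng.uniform(-2, 6, (30, 2))])

optics = OPTICS(min_samples=10).fit(X)

# Reachability distances in cluster order; valleys correspond to clusters
reach = optics.reachability_[optics.ordering_]
reach[~np.isfinite(reach)] = reach[np.isfinite(reach)].max()   # cap undefined values for display
plt.bar(range(len(reach)), reach)
plt.ylabel('Reachability distance')
plt.show()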
• Examine how well the clustering results match the ground truth in partitioning the
objects in the data set
• Information theory-based methods
• Compare the distribution of the clustering results and that of the ground truth
• Information theory (e.g., entropy) used to quantify the comparison
• Ex. Conditional entropy, normalized mutual information (NMI)
• Pairwise comparison-based methods
• Treat each group in the ground truth as a class, and then check the pairwise
consistency of the objects in the clustering results
• Ex. Four possibilities: TP, FN, FP, TN; Jaccard coefficient
Matching-Based Methods
[Figure: example assignment of objects to clusters C1, C2, C3 and ground-truth groups G1, G2]
• The matching based methods compare clusters in the clustering results and
the groups in the ground truth
• Suppose a clustering method partitions D = {o1, …, on} into m clusters C =
{C1, …, Cm}. The ground truth G partitions D into l groups G = {G1, …, Gl}
• Purity: the extent to which cluster $C_i$ contains points from only one ground-truth group
  • Purity of cluster $C_i$: $\frac{|C_i \cap G_j|}{|C_i|}$, where $G_j$ is the group that maximizes $|C_i \cap G_j|$
  • Total purity of clustering C:

    $purity = \sum_{i=1}^{m} \frac{|C_i|}{n} \max_{j=1}^{l} \frac{|C_i \cap G_j|}{|C_i|} = \frac{1}{n} \sum_{i=1}^{m} \max_{j=1}^{l} |C_i \cap G_j|$
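A small sketch computing total purity from cluster and ground-truth labels; the helper function and the 11-object label assignment are illustrative only (the actual example assignment is in the lost figure):

import numpy as np

def purity(cluster_labels, truth_labels):
    """Total purity: (1/n) * sum over clusters of the largest overlap with any ground-truth group."""
    cluster_labels = np.asarray(cluster_labels)
    truth_labels = np.asarray(truth_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = truth_labels[cluster_labels == c]
        # Largest overlap |Ci ∩ Gj| over all ground-truth groups Gj
        total += np.bincount(members).max()
    return total / len(cluster_labels)

# 11 objects in 3 clusters vs. 2 ground-truth groups (made-up assignment)
C = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
G = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
print(purity(C, G))   # 9/11 ≈ 0.818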
Matching-Based Methods: Example
[Figure: 11 objects assigned to clusters C1, C2, C3 and ground-truth groups G1, G2]
• Consider 11 objects
• Other methods:
• maximum matching; F-measure
Information Theory-Based Methods (I): Conditional Entropy
[Figure: example assignment of objects to clusters C1, C2, C3 and ground-truth groups G1, G2]
• Conditional entropy of G given cluster $C_i$:

  $H(G \mid C_i) = -\sum_{j=1}^{l} \frac{|C_i \cap G_j|}{|C_i|} \log \frac{|C_i \cap G_j|}{|C_i|}$

• Conditional entropy of G given clustering C:

  $H(G \mid C) = \sum_{i=1}^{m} \frac{|C_i|}{n} H(G \mid C_i) = -\sum_{i=1}^{m} \sum_{j=1}^{l} \frac{|C_i \cap G_j|}{n} \log \frac{|C_i \cap G_j|}{|C_i|}$
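A sketch of H(G | C) computed directly from the formula, reusing the made-up labels from the purity example; the natural log is our choice, since the slides do not fix the log base:

import numpy as np

def conditional_entropy(cluster_labels, truth_labels):
    """H(G | C) = -sum_{i,j} (|Ci ∩ Gj| / n) * log(|Ci ∩ Gj| / |Ci|)."""
    C = np.asarray(cluster_labels)
    G = np.asarray(truth_labels)
    n = len(C)
    h = 0.0
    for ci in np.unique(C):
        members = G[C == ci]
        for gj in np.unique(members):
            overlap = np.sum(members == gj)
            h -= (overlap / n) * np.log(overlap / len(members))
    return h

C = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
G = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
print(conditional_entropy(C, G))   # lower is better; 0 means every cluster is pure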
Example
[Figure: the same 11 objects assigned to clusters C1, C2, C3 and ground-truth groups G1, G2]
• Consider 11 objects
Note: conditional entropy cannot detect the issue that the clustering splits the objects of a ground-truth group across two clusters
Information Theory-Based Methods (II): Normalized Mutual Information (NMI)
• Mutual information:

  $I(C, G) = \sum_{i=1}^{m} \sum_{j=1}^{l} p_{ij} \log \frac{p_{ij}}{p_{C_i} \, p_{G_j}}$

  • Quantifies the amount of shared information between the clustering C and the ground-truth partitioning G
  • Measures the dependency between the observed joint probability $p_{ij}$ of C and G and the expected joint probability $p_{C_i} p_{G_j}$ under the independence assumption
  • When C and G are independent, $p_{ij} = p_{C_i} p_{G_j}$ and I(C, G) = 0
  • However, there is no upper bound on the mutual information
• Normalized mutual information:

  $NMI(C, G) = \sqrt{\frac{I(C, G)}{H(C)} \cdot \frac{I(C, G)}{H(G)}} = \frac{I(C, G)}{\sqrt{H(C) \, H(G)}}$

  • Value range of NMI: [0, 1]
  • A value close to 1 indicates a good clustering
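A quick sketch using scikit-learn, reusing the made-up labels from the earlier examples; in recent versions, normalized_mutual_info_score with average_method='geometric' corresponds to the sqrt(H(C)·H(G)) normalization above:

from sklearn.metrics import normalized_mutual_info_score

# Cluster labels C and ground-truth labels G for 11 objects (made-up assignment)
C = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
G = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# Geometric averaging divides I(C, G) by sqrt(H(C) * H(G)), matching the formula above
print(normalized_mutual_info_score(G, C, average_method='geometric'))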
Pairwise Comparison-Based Methods: Jaccard
Coefficient
• Pairwise comparison: treat each group in the ground truth as a class
• For each pair of objects (oi, oj) in D, if they are assigned to the same
cluster/group, the assignment is regarded as positive; otherwise, negative
• Depending on assignments, we have four possible cases: