Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 10 —
Cluster Analysis: Basic Concepts and Methods
Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input, but it
needs a termination condition
[Figure: hierarchical clustering of five objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds from Step 0 to Step 4: a and b merge into ab, d and e into de, c joins de to form cde, and finally ab and cde merge into abcde. Divisive clustering (DIANA) performs the same sequence in reverse, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., S-Plus
Uses the single-link method and the dissimilarity matrix
Merges the nodes that have the least dissimilarity
Continues in a non-descending fashion (successive merge distances never decrease)
Eventually, all nodes belong to the same cluster (see the sketch below)
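A minimal single-link agglomerative run using SciPy; the points and cluster count are invented for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points; each starts as its own cluster
points = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [5, 5]])

# 'single' linkage merges, at each step, the pair of clusters
# with the smallest minimum pairwise distance (least dissimilarity)
Z = linkage(points, method='single')

# Cut the hierarchy into 2 flat clusters (the termination condition)
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```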
[Figure: three scatter plots on a 0-10 grid showing the data points at successive stages of AGNES merging.]
Dendrogram: Shows How Clusters Are Merged
Decomposes data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster.
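A quick sketch of building and drawing a dendrogram with SciPy and matplotlib (the points and labels are made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative points; each starts as its own leaf of the tree
points = np.array([[1, 1], [2, 1], [6, 5], [7, 5], [4, 9]])

# Build the merge tree, then render it as a dendrogram
Z = linkage(points, method='single')
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
plt.ylabel('merge distance')
plt.show()
```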
DIANA (Divisive Analysis)
Works in the inverse order of AGNES: starts with all objects in one cluster and repeatedly splits until each object forms a cluster of its own.
[Figure: three scatter plots on a 0-10 grid showing the data points at successive stages of DIANA splitting.]
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \min \{ d(t_{ip}, t_{jq}) : t_{ip} \in K_i,\ t_{jq} \in K_j \}$
Complete link: largest distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \max \{ d(t_{ip}, t_{jq}) : t_{ip} \in K_i,\ t_{jq} \in K_j \}$
Average: average distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \frac{1}{|K_i|\,|K_j|} \sum_{t_{ip} \in K_i} \sum_{t_{jq} \in K_j} d(t_{ip}, t_{jq})$
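A small sketch computing all three linkage distances with SciPy; the two clusters are invented for the example:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
Ki = np.array([[0, 0], [1, 0]])
Kj = np.array([[4, 3], [5, 5]])

D = cdist(Ki, Kj)              # all pairwise distances d(t_ip, t_jq)
print('single  :', D.min())    # smallest pairwise distance
print('complete:', D.max())    # largest pairwise distance
print('average :', D.mean())   # mean over all |Ki|*|Kj| pairs
```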
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering methods
Can never undo what was done previously
Do not scale well: time complexity of at least O(n²), where
n is the total number of objects
Integration of hierarchical & distance-based clustering
BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
CHAMELEON (1999): hierarchical clustering using
dynamic modeling
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
Zhang, Ramakrishnan & Livny, SIGMOD’96
Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
Weakness: handles only numeric data and is sensitive to the order of the
data records
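scikit-learn ships a BIRCH implementation; a minimal usage sketch (the data and parameter values are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic 2-D data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2))])

# threshold bounds the radius of sub-clusters stored in the CF tree;
# branching_factor caps the number of CF entries per node
model = Birch(threshold=1.0, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(labels[:10])
```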
Clustering Feature Vector in BIRCH
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points: $LS = \sum_{i=1}^{N} X_i$
SS: square sum of the N points: $SS = \sum_{i=1}^{N} X_i^2$
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give
CF = (5, (16,30), (54,190))
[Figure: the five points plotted on a 0-10 grid]
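A tiny sketch verifying the CF triple for the five example points:

```python
import numpy as np

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N = len(points)                  # 0th moment: point count
LS = points.sum(axis=0)          # 1st moment: per-dimension linear sum
SS = (points ** 2).sum(axis=0)   # 2nd moment: per-dimension square sum

print(N, tuple(LS), tuple(SS))   # -> 5 (16, 30) (54, 190)
```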
CF-Tree in BIRCH
Clustering feature: summary of the statistics for a given subcluster (the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view); registers crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering; a non-leaf node stores the sums of the CFs of its children. It has two parameters: a branching factor (maximum number of children per node) and a threshold (maximum diameter of the sub-clusters stored at the leaf nodes)
The CF Tree Structure
[Figure: a CF tree with a root and non-leaf nodes; each node holds entries CF1, CF2, CF3, ..., and each entry CFi points to a child node (child1, child2, child3, ...).]
The BIRCH Algorithm
Cluster diameter: $D = \sqrt{\frac{1}{n(n-1)} \sum_{i \neq j} (x_i - x_j)^2}$
For each incoming point: find the closest leaf entry, add the point to it, and update its CF; if the entry's diameter then exceeds the threshold, split the leaf, and possibly its parents
The algorithm is O(n)
Concerns:
Sensitive to the insertion order of the data points
Because the size of leaf nodes is fixed, the resulting clusters may not be natural
Clusters tend to be spherical given the radius and diameter measures
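A quick numerical check of the diameter formula on a toy 1-D cluster (values invented):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])   # toy 1-D cluster
n = len(x)

# Average squared difference over the n(n-1) ordered pairs i != j
diff2 = (x[:, None] - x[None, :]) ** 2   # diagonal terms are zero
D = np.sqrt(diff2.sum() / (n * (n - 1)))
print(D)   # sqrt(12 / 6) = 1.414...
```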
CHAMELEON: Hierarchical Clustering
Using Dynamic Modeling (1999)
CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity
and closeness (proximity) between the two clusters are
high relative to the internal interconnectivity of the
clusters and the closeness of items within the clusters
Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
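CHAMELEON's own implementation isn't in standard Python libraries, but the phase-1 sparse k-NN graph is easy to sketch with scikit-learn (a hypothetical stand-in, not the authors' code):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.default_rng(2).random((100, 2))   # toy data

# Sparse graph: an edge p -> q if q is among p's k nearest neighbors;
# this graph would then be partitioned into many small sub-clusters
G = kneighbors_graph(X, n_neighbors=5, mode='distance')
print(G.shape, G.nnz)   # 100x100 sparse matrix with 100*5 edges
```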
Overall Framework of CHAMELEON
[Figure: Data Set → construct a sparse k-NN graph → partition the graph into many small sub-clusters → merge partitions → Final Clusters]
k-NN graph: p and q are connected if q is among the top k closest neighbors of p
Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity
Relative closeness: closeness of c1 and c2 over their internal closeness
CHAMELEON (Clustering Complex Objects)
[Figure: example results of CHAMELEON on datasets with complex, non-spherical cluster shapes]
Probabilistic Hierarchical Clustering
Algorithmic hierarchical clustering
Nontrivial to choose a good distance measure
Hard to handle missing attribute values
Optimization goal not clear: heuristic, local search
Probabilistic hierarchical clustering
Use probabilistic models to measure distances between clusters
Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
Easy to understand; same efficiency as algorithmic agglomerative
clustering methods; can handle partially observed data
In practice, the generative models are assumed to adopt common distribution
functions, e.g., the Gaussian or Bernoulli distribution, governed
by parameters
Generative Model
Given a set of 1-D points X = {x1, ..., xn} for clustering
analysis, assume they are generated by a Gaussian distribution:
$\mathcal{N}(\mu, \sigma^2):\; P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
The parameters $\mu$ and $\sigma^2$ are estimated by maximizing the likelihood
$L(\mathcal{N}(\mu, \sigma^2) : X) = P(X \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$
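For a single Gaussian, the maximum likelihood estimates are the sample mean and the (biased, divide-by-n) sample variance; a minimal sketch with invented data:

```python
import numpy as np

x = np.array([1.2, 0.8, 1.1, 5.0, 4.9, 5.2])   # toy 1-D points

# Maximum likelihood estimates for a single Gaussian
mu = x.mean()                      # MLE of the mean
sigma2 = ((x - mu) ** 2).mean()    # MLE of the variance (divides by n)
print(mu, sigma2)
```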
A Probabilistic Hierarchical Clustering
Algorithm
For a set of objects partitioned into m clusters C1, ..., Cm, the quality can
be measured by
$Q(\{C_1, \dots, C_m\}) = \prod_{i=1}^{m} P(C_i)$
where $P(C_i)$ is the maximum likelihood of the objects in $C_i$ under the chosen generative model. The distance between two clusters is then $dist(C_i, C_j) = -\log \frac{P(C_i \cup C_j)}{P(C_i)\,P(C_j)}$, and the algorithm iteratively merges the pair of clusters that maximizes the resulting quality
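A rough sketch of this merge criterion under a 1-D Gaussian model; the helper name log_p and the sample clusters are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def log_p(cluster):
    # Max log-likelihood of a cluster under its own fitted Gaussian
    mu, sigma = cluster.mean(), cluster.std() + 1e-9
    return norm.logpdf(cluster, mu, sigma).sum()

c1 = np.array([1.0, 1.2, 0.9])
c2 = np.array([1.1, 1.3, 0.8])

# dist(Ci, Cj) = -log( P(Ci u Cj) / (P(Ci) P(Cj)) ), in log space;
# a small (or negative) value favors merging the two clusters
dist = -(log_p(np.concatenate([c1, c2])) - log_p(c1) - log_p(c2))
print(dist)
```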
Hierarchical Clustering with the Weka App
Open Weka -> Explorer
Choose the contact-lenses.arff file
Click the Cluster tab
Choose clusterers -> HierarchicalClusterer
In the settings, choose "distanceFunction" and change it to "ManhattanDistance"
In the settings, change numClusters to 6
Click the Start button
You can visualize the output by right-clicking the entry in the result list and selecting "Visualize tree"
Try it with linkType = COMPLETE (a SciPy approximation of the same setup is sketched below)
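Outside Weka, roughly the same setup can be approximated with SciPy; note that contact-lenses has nominal attributes, so this sketch substitutes toy numeric data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy numeric data standing in for an ARFF dataset (24 rows)
X = np.random.default_rng(1).random((24, 4))

# Manhattan (cityblock) distances + complete linkage,
# cut into 6 clusters as in the Weka walkthrough
Z = linkage(pdist(X, metric='cityblock'), method='complete')
labels = fcluster(Z, t=6, criterion='maxclust')
print(labels)
```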
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary