Cluster Analysis: Basic Concepts and Algorithms
What is Cluster Analysis?
Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
Summarization
– Reduce the size of large data sets
(Example figure: clustering precipitation in Australia.)
Supervised and Unsupervised Classification
– What is Classification?
– What is Supervised Classification/Learning?
– What is Unsupervised Classification/Learning?
– SOM – Self-Organizing Maps
Types of Clustering Algorithms
Jain, A. K., Murty, M. N., and Flynn, P. J., Data Clustering: A Survey. ACM Computing
Surveys, 1999. 31: pp. 264-323.
(Figure: taxonomy of clustering algorithms: agglomerative and divisive hierarchical algorithms, gradient-descent and artificial-neural-network methods, and evolutionary algorithms.)
Clustering (e.g., of search results) can be done at:
– Indexing time
– Query time
   Applied to documents
   Applied to snippets
Clustering can be based on:
– URL source: put pages from the same server together
– Text content
   Polysemy (“puma”, “banks”)
   Multiple aspects of a single topic
– Links
   Look at the connected components in the link graph (A/H analysis can do it)
What is not Cluster Analysis?
Supervised classification
– Have class label information
Simple segmentation
– Dividing students into different registration groups
alphabetically, by last name
Results of a query
– Groupings are a result of an external specification
Graph partitioning
– Some mutual relevance and synergy, but areas are not
identical
Notion of a Cluster can be Ambiguous
Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Partitional Clustering
(Figure: a partitional clustering of points p1–p4.)
(Figure: a traditional hierarchical clustering of p1–p4 and the corresponding traditional dendrogram.)
(Figure: a non-traditional hierarchical clustering and the corresponding non-traditional dendrogram.)
Types of Clusters
– Well-separated clusters
– Center-based clusters
– Contiguous clusters
– Density-based clusters
– Property or conceptual
Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster
is closer (or more similar) to every other point in the
cluster than to any point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
Contiguous cluster (nearest neighbor or transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
8 contiguous clusters
Types of Clusters: Density-Based
Density-based
– A cluster is a dense region of points, separated from other regions of high density by regions of low density.
– Used when the clusters are irregular or intertwined, and
when noise and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
Shared property or conceptual clusters
– A cluster is a set of points that share some common property or represent a particular concept
2 overlapping circles
Types of Clusters: Objective Function
Map the clustering problem to a different domain and solve a related problem in that domain
– The proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points
Clustering algorithms covered in the rest of this section: K-means and its variants, hierarchical clustering, and density-based clustering (DBSCAN).
K-means Clustering
K-means example, step 1: pick k = 3 initial cluster centers k1, k2, k3 (randomly).
K-means example, step 2: assign each point to the closest cluster center.
K-means example, step 3: move each cluster center to the mean of its assigned points.
K-means example, step 4: reassign the points that are now closer to a different cluster center. Q: Which points are reassigned?
K-means example, step 4 (continued): A: three points are reassigned (shown with animation on the original slide).
K-means example, step 4b: re-compute the cluster means.
K-means example, step 5: move the cluster centers to the new cluster means, and repeat until nothing changes.
(Figures: scatter plots in the x–y plane illustrating each step with centers k1, k2, k3.)
Squared Error Criterion
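The criterion K-means minimizes is the sum of squared error (SSE); in the usual notation, with clusters $C_1, \dots, C_K$ and centroids $m_1, \dots, m_K$:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

where dist is typically the Euclidean distance; for a fixed assignment, the centroid $m_i$ that minimizes SSE is the mean of the points in $C_i$.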
K-means Clustering – Details
Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for the common similarity measures mentioned above.
Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘until relatively few points change clusters’.
Complexity is O(n * K * I * d)
– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
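To make these details concrete, here is a minimal NumPy sketch of the basic algorithm (random initialization, Euclidean distance, stop when few points change clusters); the function name kmeans and all parameter choices are illustrative, not taken from the slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol_changes=0, seed=0):
    """Basic K-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Pick k distinct points as initial centroids (random initialization).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        changed = np.sum(new_labels != labels)
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > 0:            # leave empty clusters untouched in this sketch
                centroids[j] = pts.mean(axis=0)
        # Stop when (almost) no points change clusters.
        if changed <= tol_changes:
            break
    sse = np.sum((X - centroids[labels]) ** 2)   # squared-error criterion
    return labels, centroids, sse

# Tiny usage example on random data.
X = np.random.default_rng(1).normal(size=(300, 2))
labels, centroids, sse = kmeans(X, k=3)
print(labels[:10], sse)
```

The nested loops reflect the O(n * K * I * d) cost: each of the I iterations computes n * K distances over d attributes.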
Two different K-means Clusterings
(Figure: the same set of original points clustered by two different K-means runs; one finds the natural clustering, the other a sub-optimal one.)
(Figure: iterations 1–6 of K-means for one choice of initial centroids.)
Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE): for each point, the error is its distance to the nearest cluster center (see the Squared Error Criterion above). Given two clusterings produced by different runs of K-means, we prefer the one with the smaller SSE.
(Figure: iterations 1–5 of K-means for a second choice of initial centroids.)
Problems with Selecting Initial Points
– If there are K ‘real’ clusters, the chance of randomly selecting one initial centroid from each cluster is small, and it shrinks quickly as K grows (for K = 10 equal-sized clusters it is roughly 10!/10¹⁰ ≈ 0.00036)
– Sometimes the initial centroids readjust themselves in the ‘right’ way, and sometimes they do not
(Figure: a data set consisting of pairs of clusters.)
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
(Figure: iterations 1–4 of K-means on the 10-cluster data set.)
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
(Figure: iterations 1–4 of K-means on the 10-cluster data set for a different initialization.)
Starting with some pairs of clusters having three initial centroids, while others have only one.
10 Clusters Example
(Figure: iterations 1–4 of K-means for this initialization.)
Starting with some pairs of clusters having three initial centroids, while others have only one.
Solutions to Initial Centroids Problem
Multiple runs
– Helps, but probability is not on your side
Sample and use hierarchical clustering to
determine initial centroids
Select more than k initial centroids and then
select among these initial centroids
– Select most widely separated
Postprocessing
Bisecting K-means
– Not as susceptible to initialization issues
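As a concrete version of the ‘multiple runs’ strategy, one can run K-means several times and keep the clustering with the smallest SSE; a small sketch reusing the illustrative kmeans() function from the earlier K-means sketch:

```python
def best_of_n_runs(X, k, n_runs=10):
    """Run K-means n_runs times and keep the result with the smallest SSE."""
    best = None
    for seed in range(n_runs):
        labels, centroids, sse = kmeans(X, k, seed=seed)
        if best is None or sse < best[2]:
            best = (labels, centroids, sse)
    return best   # (labels, centroids, sse) of the best run
```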
Handling Empty Clusters
Several strategies
– Choose the point that contributes most to SSE
– Choose a point from the cluster with the highest
SSE
– If there are several empty clusters, the above can
be repeated several times.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
An alternative is to update the centroids incrementally, after each assignment:
– Each assignment updates zero or two centroids
– More expensive, and introduces an order dependency
– Never produces an empty cluster
Pre-processing
– Normalize the data
– Eliminate outliers
Post-processing
– Eliminate small clusters that may represent
outliers
– Split ‘loose’ clusters, i.e., clusters with relatively
high SSE
– Merge clusters that are ‘close’ and that have
relatively low SSE
– These steps can also be used during the clustering process, as in ISODATA
Bisecting K-means
Bisecting K-means is a variant of K-means that can produce a partitional or a hierarchical clustering: repeatedly pick a cluster and split it in two with basic K-means until K clusters are obtained (a code sketch follows below).
(Figure: bisecting K-means on a six-point example, with the sequence of splits shown as a dendrogram; leaf order 1 3 2 5 4 6.)
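A rough sketch of bisecting K-means under the description above, again reusing the illustrative kmeans() helper; the split-selection rule (largest SSE first) is one common choice, not the only one:

```python
import numpy as np

def bisecting_kmeans(X, k, trials=5):
    """Split clusters until we have k of them; always bisect the cluster with the largest SSE."""
    clusters = [np.arange(len(X))]            # start with one all-inclusive cluster
    while len(clusters) < k:
        # Pick the cluster with the largest SSE to split next.
        sses = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Try a few 2-means bisections and keep the best one (lowest SSE).
        best = None
        for seed in range(trials):
            labels, _, sse = kmeans(X[target], 2, seed=seed)
            if best is None or sse < best[1]:
                best = (labels, sse)
        labels = best[0]
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters   # list of index arrays, one per cluster
```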
Strengths of Hierarchical Clustering
– No need to assume any particular number of clusters: any desired number can be obtained by cutting the dendrogram at the proper level
– The hierarchy may correspond to meaningful taxonomies
Hierarchical clustering comes in two main types:
– Agglomerative: start with the points as individual clusters; at each step, merge the two closest clusters until only one cluster (or k clusters) remains
– Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters)
Starting situation (agglomerative): start with clusters of individual points and a proximity matrix.
(Figure: points p1…p5 and the p1–p5 proximity matrix.)
Intermediate Situation
After some merging steps, we have a set of clusters C1–C5 and a proximity matrix defined over those clusters.
(Figure: clusters C1–C5 and their proximity matrix.)
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
After Merging
How do we update the proximity matrix? The rows and columns for C2 and C5 are replaced by a single row and column for the merged cluster C2 ∪ C5, whose proximities to C1, C3, and C4 are still to be determined (the “?” entries).
(Figure: the proximity matrix after merging C2 and C5.)
How to Define Inter-Cluster Similarity
(Figure: two clusters over points p1…p5 and their proximity matrix; which entries define the similarity between the clusters?)
– MIN (single link)
– MAX (complete link)
– Group Average
– Distance Between Centroids
– Other methods driven by an objective function
   Ward’s Method uses squared error
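With SciPy, these inter-cluster similarity definitions correspond to the method argument of scipy.cluster.hierarchy.linkage; a brief sketch on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data

Z_min  = linkage(X, method="single")    # MIN / single link
Z_max  = linkage(X, method="complete")  # MAX / complete link
Z_avg  = linkage(X, method="average")   # group average
Z_cent = linkage(X, method="centroid")  # distance between centroids
Z_ward = linkage(X, method="ward")      # Ward's method (squared error)

# e.g. cut the single-link tree into 3 clusters
labels = fcluster(Z_min, t=3, criterion="maxclust")
print(labels)
```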
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters
– Determined by one pair of points, i.e., by one link in the proximity graph
Similarity matrix:
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: MIN
(Figure: single-link (MIN) clustering of six points, with the resulting nested clusters and dendrogram; merge heights between 0.05 and 0.2, dendrogram leaf order 3 6 2 5 4 1.)
Hierarchical Clustering: MAX
(Figure: complete-link (MAX) clustering of the same six points and its dendrogram; merge heights up to about 0.4, leaf order 3 6 4 1 2 5.)
Cluster Similarity: Group Average
Proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

$$\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i,\; p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i|\,|\text{Cluster}_j|}$$

(Figure: group-average clustering of the six points and its dendrogram; leaf order 3 6 4 1 2 5.)
Strengths
– Less susceptible to noise and outliers
Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
Similarity of two clusters is based on the increase in squared error when the two clusters are merged
– Similar to group average if the distance between points is the squared distance
Hierarchical Clustering: Comparison
(Figure: the six-point data set clustered with MIN, MAX, Group Average, and Ward’s Method.)
Agglomerative Algorithm
The agglomerative algorithm is carried out in three steps (a minimal code sketch follows this list):
1) Convert object attributes to a distance matrix
2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning)
3) Repeat until the number of clusters is one (or a known number of clusters):
   – Merge the two closest clusters
   – Update the distance matrix
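A minimal sketch of these three steps, using single-link (MIN) as the cluster distance; it is deliberately naive (quadratic memory, cubic time), so only suitable for small N:

```python
import numpy as np

def agglomerative_single_link(X, num_clusters=1):
    """Naive agglomerative clustering with single-link (MIN) cluster distance."""
    # Step 1: convert object attributes to a distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2: each object starts as its own cluster.
    clusters = [[i] for i in range(len(X))]
    merges = []
    # Step 3: repeat until the desired number of clusters is reached.
    while len(clusters) > num_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the two closest members.
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a], clusters[b], d))   # record the merge and its height
        clusters[a] = clusters[a] + clusters[b]        # merge the two closest clusters
        del clusters[b]                                # cluster distances recomputed next round
    return clusters, merges
```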
Example
Problem: clustering analysis with the agglomerative algorithm
– Data matrix
– Euclidean distance
– Distance matrix
(Figures: the example proceeds iteration by iteration: merge the two closest clusters and update the distance matrix in iterations 1 and 2, merge/update in iterations 3 and 4, and stop when the termination condition is met.)
Example
Dendrogram tree representation
1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B into (A, B) at distance 0.71
(The vertical axis of the dendrogram shows the merge distance, i.e., the cluster “lifetime”.)
Relevant Issues
How to determine the number of clusters
– If the number of clusters is known, the termination condition is given!
– The K-cluster lifetime is the range of threshold values on the dendrogram tree that leads to the identification of K clusters
– Heuristic rule: cut the dendrogram tree at the maximum lifetime
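The maximum-lifetime heuristic can be applied to a SciPy linkage matrix by locating the largest gap between consecutive merge heights and cutting there; a hedged sketch (the helper name is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cut_at_max_lifetime(Z):
    """Cut a dendrogram where the gap between consecutive merge heights is largest."""
    heights = np.sort(Z[:, 2])                     # merge distances, ascending
    gaps = np.diff(heights)                        # lifetime of each threshold range
    i = int(np.argmax(gaps))                       # widest gap
    threshold = (heights[i] + heights[i + 1]) / 2  # cut in the middle of that gap
    return fcluster(Z, t=threshold, criterion="distance")

X = np.random.default_rng(0).normal(size=(30, 2))
Z = linkage(X, method="single")
labels = cut_at_max_lifetime(Z)
print(len(set(labels)), "clusters")
```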
Conclusions
The hierarchical algorithm is a sequential clustering algorithm
– Uses a distance matrix to construct a tree of clusters (dendrogram)
– Gives a hierarchical representation without needing to know the number of clusters (a termination condition can be set when the number of clusters is known)
Major weaknesses of agglomerative clustering methods
– Can never undo what was done previously
– Sensitive to the cluster distance measure and to noise/outliers
– Less efficient: at least O(n²), where n is the total number of objects
There are several variants to overcome its weaknesses
– BIRCH: uses clustering feature tree and incrementally adjusts the
quality of sub-clusters, which scales well for a large data set
– ROCK: clustering categorical data via neighbour and link
analysis, which is insensitive to noise and outliers
– CHAMELEON: hierarchical clustering using dynamic modeling,
which integrates hierarchical method with other clustering
methods
DBSCAN
DBSCAN is a density-based algorithm that labels points as core, border, or noise, based on how many points (MinPts) lie within a given radius (Eps) of each point.
– Eliminate noise points
– Perform clustering on the remaining points
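A short example of this two-step view using scikit-learn's DBSCAN, where points labeled -1 are the noise points that get eliminated (the data set and the Eps/MinPts values below are made up for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=4).fit(X)   # Eps and MinPts chosen for this toy data

labels = db.labels_                          # -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True    # core points
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", np.sum(labels == -1))
```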
DBSCAN: Core, Border and Noise Points
(Figure: sample points labeled as core, border, and noise.)
When DBSCAN works well
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN does NOT work well
(Figure: original points and two DBSCAN results with MinPts=4, Eps=9.75 and Eps=9.92.)
• Varying densities
• High-dimensional data
DBSCAN: Determining EPS and MinPts
Idea: for points in a cluster, the k-th nearest neighbors are at roughly the same distance, while noise points have their k-th nearest neighbor at a farther distance
– So, plot the sorted distance of every point to its k-th nearest neighbor (k = MinPts) and look for a sharp change (knee); the distance at the knee is a candidate Eps
(Figure: a sample data set and its sorted 4th-nearest-neighbor distance plot.)
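The k-distance plot described above can be sketched in a few lines: compute each point's distance to its k-th nearest neighbor, sort the distances, and inspect the curve for a knee (function name and defaults are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    """Sorted distance of every point to its k-th nearest neighbor (k = MinPts)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: the nearest neighbor is the point itself
    dists, _ = nn.kneighbors(X)
    kth = np.sort(dists[:, k])                       # k-th neighbor distance, ascending
    plt.plot(kth)
    plt.xlabel("points sorted by distance")
    plt.ylabel(f"{k}-th nearest neighbor distance")
    plt.show()
    return kth
```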
Different Aspects of Cluster Validation
Two matrices
– Proximity matrix
– “Incidence” matrix
   One row and one column for each data point
   An entry is 1 if the associated pair of points belongs to the same cluster
   An entry is 0 if the associated pair of points belongs to different clusters
Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between n(n−1)/2 entries needs to be calculated
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
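A compact sketch of this test: build the incidence matrix from the cluster labels, take the n(n−1)/2 upper-triangular entries of both matrices, and correlate them (here the proximity matrix is a distance matrix, so a good clustering shows up as a strongly negative correlation):

```python
import numpy as np

def validity_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the incidence matrix."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)       # proximity matrix
    incidence = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster
    iu = np.triu_indices(len(X), k=1)            # only the n(n-1)/2 distinct pairs
    return np.corrcoef(D[iu], incidence[iu])[0, 1]
```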
Measuring Cluster Validity Via Correlation
(Figure: two data sets, their K-means clusterings, and the correlation between the incidence and proximity matrices for each.)
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect it visually; well-separated clusters appear as sharp blocks along the diagonal.
(Figure: points and sorted similarity matrix for a DBSCAN clustering.)
(Figure: the same visualization for K-means.)
(Figure: the same visualization for Complete Link.)
(Figure: sorted similarity matrix for the DBSCAN clusters of a more complex data set.)
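The sorted-similarity-matrix visualization can be produced roughly as follows (the similarity transform used here is a simple made-up choice):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sorted_similarity(X, labels):
    """Show the pairwise similarity matrix with points ordered by cluster label."""
    order = np.argsort(labels, kind="stable")
    D = np.linalg.norm(X[order][:, None, :] - X[order][None, :, :], axis=2)
    S = 1 - D / D.max()                       # crude similarity in [0, 1]
    plt.imshow(S, cmap="viridis")
    plt.colorbar(label="similarity")
    plt.xlabel("points (sorted by cluster)")
    plt.ylabel("points (sorted by cluster)")
    plt.show()
```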
Internal Measures: SSE
Clusters in more complicated figures aren’t well separated
Internal index: used to measure the goodness of a clustering structure without respect to external information
– SSE
SSE is good for comparing two clusterings or two clusters (average SSE).
SSE can also be used to estimate the number of clusters.
(Figure: a ten-cluster data set and the corresponding SSE-versus-K curve; knees in the curve suggest natural numbers of clusters.)
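Estimating the number of clusters from SSE amounts to plotting SSE against K and looking for a knee; a small sketch using scikit-learn's KMeans, whose inertia_ attribute is exactly the SSE (the data set here is synthetic):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=10, random_state=0)

ks = range(2, 16)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()   # look for the knee in the curve
```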
Internal Measures: SSE
(Figure: SSE curve for a more complicated data set whose clusters contain subclusters.)
Internal Measures: Cohesion and Separation
Example: SSE
– BSS (between-cluster sum of squares, separation) + WSS (within-cluster sum of squares, cohesion) = constant
(Figure: points 1, 2, 4, 5 on a number line, with overall mean m and cluster means m1 and m2.)
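A worked instance of BSS + WSS = constant, using the points 1, 2, 4, 5 suggested by the number line above (overall mean m = 3, cluster means m1 = 1.5 and m2 = 4.5); the exact numbers are a reconstruction, not a quote from the slide:

$$K = 1:\quad WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10, \qquad BSS = 4\,(3-3)^2 = 0, \qquad \text{Total} = 10$$

$$K = 2:\quad WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1, \qquad BSS = 2\,(3-1.5)^2 + 2\,(4.5-3)^2 = 9, \qquad \text{Total} = 10$$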
Internal Measures: Silhouette Coefficient
https://slideplayer.com/slide/8520975/
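For reference, the silhouette coefficient combines cohesion and separation: for each point, a is the average distance to points in its own cluster, b is the lowest average distance to points in any other cluster, and s = (b − a) / max(a, b), so values near 1 are good. Computing it with scikit-learn (the data set and algorithm choice below are just an example):

```python
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))   # average s over all points
```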