Module 5
Dr. Sudhamani M J
Associate Professor
Dept of CSE
RNSIT
Clustering
[Figure: points grouped into clusters so that intra-cluster distances are minimized and inter-cluster distances are maximized]
• Clustering is an unsupervised method of data analysis.
• Data instances are grouped according to some notion of similarity.
Applications of Cluster Analysis
• City Planning: identify groups of houses and separate them into different clusters according to similar characteristics – type, size, geographical location
What is not Cluster Analysis?
• Supervised classification
  – Have class label information
• Simple segmentation
  – Dividing students into different registration groups alphabetically, by last name
• Results of a query
  – Groupings are a result of an external specification
• Graph partitioning
Types of Clustering
• Partitional clustering (e.g., K-means)
• Hierarchical clustering
• Density-based clustering (e.g., DBSCAN)
K-means Clustering
Iteration 1 (for the data set and seeds defined in Exercise Problem 1 below):
• We calculate the distance of each point from each of the three cluster centers.
• The distance is calculated using the given distance function; here the Manhattan distance is used.
For example, the distance between A1(2, 10) and C1(2, 10):
ρ(A1, C1) = |2 – 2| + |10 – 10| = 0
Cluster-01: A1(2, 10)
Cluster-02: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
Cluster-03: A2(2, 5), A7(1, 2)
The new centers (the mean of each cluster) are C1(2, 10), C2(6, 6), C3(1.5, 3.5). After a second reassignment pass, A8 moves to Cluster-01, giving:
• C1(3, 9.5)
• C2(6.5, 5.25)
• C3(1.5, 3.5)
(A code sketch of this assign-and-update loop follows.)
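A minimal sketch of the loop above in Python (NumPy only; the points, seeds, and Manhattan distance follow the worked example, everything else is illustrative):

```python
import numpy as np

# Data points A1..A8 from the exercise; seeds are A1, A4, A7.
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = points[[0, 3, 6]].copy()

for epoch in range(2):
    # Manhattan distance from every point to every center.
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)          # assign each point to the nearest center
    # Update each center to the mean of its assigned points.
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    print(f"epoch {epoch + 1}: centers =\n{centers}")
# epoch 1 -> (2,10), (6,6), (1.5,3.5); epoch 2 -> (3,9.5), (6.5,5.25), (1.5,3.5)
```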
Exercise Problem 1
Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
The distance matrix based on the Euclidean distance is given below:
Suppose that the initial seeds (centers of each cluster) are A1, A4 and A7. Run the k-means algorithm for 1 epoch only. At the end of this epoch show:
a) The new clusters (i.e. the examples belonging to each cluster)
b) The centers of the new clusters
c) Draw a 10 by 10 space with all the 8 points and show the clusters after the first epoch and the new centroids.
d) How many more iterations are needed to converge? Draw the result for each epoch.
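A short sketch for regenerating the Euclidean distance matrix referenced above (SciPy's cdist; rounding to two decimals is just for display):

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
names = [f"A{i}" for i in range(1, 9)]

# Pairwise Euclidean distances between all 8 examples.
dist_matrix = cdist(points, points, metric="euclidean")

print("    " + "  ".join(f"{n:>5}" for n in names))
for name, row in zip(names, dist_matrix):
    print(f"{name}  " + "  ".join(f"{d:5.2f}" for d in row))
```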
Two Different K-means Clusterings
[Figure: the same set of original points clustered in two different ways by K-means, plus the iteration-by-iteration evolution of the centroids (Iterations 1–6), showing that the result depends on the initial centroids]
Evaluating K-means Clusters
• The most common measure is the Sum of Squared Error (SSE): the error of each point is its distance to the nearest centroid, and SSE = Σ_k Σ_{x ∈ C_k} dist(m_k, x)², where m_k is the centroid of cluster C_k.
• Given two clusterings of the same data, we can prefer the one with the smaller SSE (a small sketch follows the figure).
[Figure: K-means iterations 1–5 on a sample data set, showing the centroids converging from the initial seeds]
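A small sketch of the SSE computation (NumPy; the labels shown are the first-epoch clustering from the worked exercise):

```python
import numpy as np

def sse(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its centroid."""
    diffs = points - centers[labels]      # per-point offset from its own center
    return float((diffs ** 2).sum())

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
labels = np.array([0, 2, 1, 1, 1, 1, 2, 1])
centers = np.array([[2, 10], [6, 6], [1.5, 3.5]])
print(sse(points, labels, centers))
```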
Handling Empty Clusters
• The basic K-means algorithm can produce empty clusters; several strategies exist for choosing a replacement centroid (a sketch follows this list):
  – Choose the point that contributes most to SSE
  – Choose a point from the cluster with the highest SSE
  – If there are several empty clusters, the above can be repeated several times
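A minimal sketch of the first strategy, reseeding an empty cluster with the point that contributes most to the SSE (NumPy; the function name is illustrative):

```python
import numpy as np

def reseed_empty(points, labels, centers):
    """Replace each empty cluster's center with the current worst-fit point."""
    for k in range(len(centers)):
        if not np.any(labels == k):                  # cluster k is empty
            # Squared distance of every point to its current centroid.
            errs = ((points - centers[labels]) ** 2).sum(axis=1)
            worst = int(errs.argmax())               # biggest SSE contributor
            centers[k] = points[worst]
            labels[worst] = k
    return labels, centers
```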
Pre-processing and Post-processing
• Pre-processing
  – Normalize the data
  – Eliminate outliers
• Post-processing
  – Eliminate small clusters that may represent outliers
  – Split 'loose' clusters, i.e., clusters with relatively high SSE
  – Merge clusters that are 'close' and that have relatively low SSE
  – These steps can also be used during the clustering process
    ◦ ISODATA does this
Bisecting K-means
• A variant of K-means that can produce a partitional or a hierarchical clustering: start with one cluster, and repeatedly split a chosen cluster (e.g., the one with the highest SSE) into two with basic K-means until k clusters are obtained (a sketch follows).
[Figure: bisecting K-means on six points, with the successive splits and the associated tree]
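A compact sketch of bisecting K-means built on scikit-learn's KMeans; the split-selection rule (largest SSE) follows the description above, everything else is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(points, k):
    """Split the cluster with the highest SSE until k clusters remain."""
    clusters = [points]
    while len(clusters) < k:
        # Pick the cluster with the largest SSE around its own mean.
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Bisect it with ordinary 2-means.
        km = KMeans(n_clusters=2, n_init=10).fit(target)
        clusters += [target[km.labels_ == 0], target[km.labels_ == 1]]
    return clusters
```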
Agglomerative and Divisive Hierarchical Clustering
• Agglomerative:
  ◦ Start with the points as individual clusters (bottom up)
  ◦ At each step, merge the closest pair of clusters until only one cluster (or k clusters) remains
• Divisive:
  ◦ Start with one, all-inclusive cluster (top down)
  ◦ At each step, split a cluster until each cluster contains a single point (or there are k clusters)
[Figure: starting situation of agglomerative clustering – individual points p1, p2, … and the full proximity matrix over them]
Intermediate Situation
• After some merging steps, we have clusters C1–C5. We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: current clusters C1–C5 and the proximity matrix over them]
After Merging
• The question is how to update the proximity matrix: the row and column for the merged cluster C2 ∪ C5 are unknown, and how they are filled in depends on the linkage criterion (see the sketch after this slide).
[Figure: proximity matrix after merging, with '?' entries between C2 ∪ C5 and C1, C3, C4]
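A minimal sketch of one merge step on a distance matrix, using single link (MIN) as the criterion; swapping np.minimum for np.maximum or a weighted average gives MAX or group average (NumPy; the function name is illustrative):

```python
import numpy as np

def merge_step(D, i, j):
    """Merge clusters i and j in distance matrix D using single link."""
    # Single link: distance from the merged cluster to any other cluster k
    # is the minimum of the two old distances.
    merged_row = np.minimum(D[i], D[j])
    D[i], D[:, i] = merged_row, merged_row   # reuse row/column i for the merge
    D[i, i] = 0.0
    keep = [k for k in range(len(D)) if k != j]
    return D[np.ix_(keep, keep)]             # drop row/column j
```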
Linkage Criteria: How to Define Inter-Cluster Similarity
• A linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Common choices:
  – MIN (single link)
  – MAX (complete link)
  – Group Average
  – Distance Between Centroids
  – Other methods driven by an objective function
    ◦ Ward's Method uses squared error
Exercise Problem 2
Using the similarity matrix below, perform single-link (MIN) and complete-link (MAX) hierarchical clustering of items I1–I5, and show the results as dendrograms (a SciPy sketch follows the matrix).

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
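A sketch solving this exercise with SciPy; since the table gives similarities, they are converted to distances as 1 - similarity before linkage (that conversion choice is an assumption, the SciPy calls themselves are standard):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

sim = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                [0.90, 1.00, 0.70, 0.60, 0.50],
                [0.10, 0.70, 1.00, 0.40, 0.30],
                [0.65, 0.60, 0.40, 1.00, 0.80],
                [0.20, 0.50, 0.30, 0.80, 1.00]])
dist = squareform(1.0 - sim)                 # condensed distance vector

for method in ("single", "complete", "average"):
    Z = linkage(dist, method=method)         # merge order and merge heights
    plt.figure()
    dendrogram(Z, labels=["I1", "I2", "I3", "I4", "I5"])
    plt.title(f"{method} link")
plt.show()
```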
Hierarchical Clustering: MIN
[Figure: single link (MIN) clustering of six points, with the nested clusters and the resulting dendrogram; leaf order 3, 6, 2, 5, 4, 1]
Hierarchical Clustering: MAX
[Figure: complete link (MAX) clustering of the same six points, with the resulting dendrogram; leaf order 3, 6, 4, 1, 2, 5]
Hierarchical Clustering: Group Average
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters:
proximity(Cluster_i, Cluster_j) = ( Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) ) / ( |Cluster_i| × |Cluster_j| )
[Figure: group average clustering of the six points, with the resulting dendrogram; leaf order 3, 6, 4, 1, 2, 5]
• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters
Cluster Similarity: Ward's Method
• Similarity of two clusters is based on the increase in squared error (SSE) when the two clusters are merged.
• It is the hierarchical analogue of K-means and can be used to initialize K-means.
[Figure: comparison of MIN, MAX, Group Average, and Ward's Method on the same six points; the four linkage criteria group the points differently]
Hierarchical Clustering: Time and Space Requirements
• O(N²) space, since the proximity matrix must be stored (N is the number of points).
• O(N³) time in many cases: there are N merge steps, and at each step the proximity matrix must be searched and updated.

DBSCAN
• DBSCAN is a density-based algorithm that labels each point as a core, border, or noise point:
  – A core point has at least MinPts points within a radius Eps around it.
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
  – A noise point is any point that is neither a core point nor a border point.
• Strengths (a usage sketch follows):
  – Resistant to noise
  – Can handle clusters of different shapes and sizes
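A minimal usage sketch with scikit-learn's DBSCAN, where eps and min_samples correspond to Eps and MinPts (the data set and parameter values are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape K-means cannot separate.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=4).fit(X)
labels = db.labels_                      # -1 marks noise points
print("clusters:", len(set(labels) - {-1}), " noise:", (labels == -1).sum())
print("core points:", len(db.core_sample_indices_))
```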
When DBSCAN Does NOT Work Well
• Varying densities
• High-dimensional data
[Figure: original points and two DBSCAN clusterings, (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92), neither of which recovers the clusters of varying density]
DBSCAN: Determining Eps and MinPts
• Idea: for points in a cluster, their k-th nearest neighbors are at roughly the same distance, while noise points have their k-th nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its k-th nearest neighbor and look for a 'knee': the distance at the knee is a good Eps, with k as MinPts (a sketch follows).
[Figure: sample points and the corresponding sorted 4th-nearest-neighbor distance plot used to pick Eps]
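A sketch of the k-distance plot described above, using scikit-learn's NearestNeighbors (k = 4 here to match MinPts = 4; the data set is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
dists, _ = nn.kneighbors(X)
kth = np.sort(dists[:, k])                        # sorted distance to the k-th neighbor

plt.plot(kth)
plt.xlabel("points sorted by distance")
plt.ylabel(f"{k}th nearest neighbor distance")    # pick Eps at the knee of this curve
plt.show()
```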
Different Aspects of Cluster Validation