Data Mining Cluster Analysis: Basic Concepts and Algorithms
[Table fragment: discovered cluster 4 — Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP; industry group: Oil-UP]
! Summarization
– Reduce the size of large data sets (e.g., clustering precipitation in Australia)
What is Cluster Analysis?
! Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
– Intra-cluster distances are minimized; inter-cluster distances are maximized

What is not Cluster Analysis?
! Supervised classification
– Have class label information
! Simple segmentation
– Dividing students into different registration groups alphabetically, by last name
! Results of a query
– Groupings are a result of an external specification
! Graph partitioning
– Some mutual relevance and synergy, but areas are not identical
Notion of a Cluster can be Ambiguous

Partitional Clustering

! Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
[Figures: Non-traditional Hierarchical Clustering; Non-traditional Dendrogram]
Other Distinctions Between Sets of Clusters

Types of Clusters
! Well-Separated
! Density-based clusters
! Property or Conceptual

! Map the clustering problem to a different domain and solve a related problem in that domain
– Proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points

Clustering Algorithms
! K-means and its variants
! Hierarchical clustering
! Density-based clustering
! K-means will converge for common similarity measures mentioned above.
! Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to "Until relatively few points change clusters"
! Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
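To make the complexity concrete, here is a minimal NumPy sketch of a Lloyd-style K-means loop (an illustrative implementation, not code from the slides): each of the I iterations does roughly n * K distance computations over d attributes, giving the O(n * K * I * d) total, and it stops once relatively few points change clusters (the change_frac parameter is an invented name).

    import numpy as np

    def kmeans(points, k, max_iters=100, change_frac=0.01, seed=0):
        # Minimal Lloyd's K-means sketch (illustrative only).
        rng = np.random.default_rng(seed)
        points = np.asarray(points, dtype=float)
        n = points.shape[0]
        # Pick K initial centroids randomly from the data points.
        centroids = points[rng.choice(n, size=k, replace=False)].copy()
        labels = np.full(n, -1)
        for _ in range(max_iters):                        # I iterations
            # Assignment step: n * K distance computations over d attributes.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # Stop when relatively few points change clusters.
            if np.mean(new_labels != labels) < change_frac:
                labels = new_labels
                break
            labels = new_labels
            # Update step: each centroid becomes the mean of its assigned points.
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = points[labels == j].mean(axis=0)
        return centroids, labels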
[Figure: K-means cluster assignments on example data over Iterations 1–6]
Evaluating K-means Clusters
! Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
– To get SSE, we square these errors and sum them:

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)

– x is a data point in cluster C_i and m_i is the representative point for cluster C_i
! Can show that m_i corresponds to the center (mean) of the cluster
! Given two clusterings, we can choose the one with the smallest error (SSE)
! One easy way to reduce SSE is to increase K, the number of clusters
– A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
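Continuing the sketch above, SSE can be computed directly from the definition (illustrative helper, assuming NumPy arrays for points, centroids, and labels):

    import numpy as np

    def sse(points, centroids, labels):
        # Sum, over clusters, of squared distances from each point to its cluster centroid.
        return sum(np.sum((points[labels == i] - c) ** 2)
                   for i, c in enumerate(centroids))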
[Figure: K-means converging to a poor solution from bad initial centroids, Iterations 1–5]

Problems with Selecting Initial Points
! If there are K "real" clusters then the chance of selecting one centroid from each cluster is small.
– Chance is relatively small when K is large
– If clusters are the same size, n, then

    P = \frac{K! \, n^K}{(Kn)^K} = \frac{K!}{K^K}

– For example, if K = 10, then probability = 10!/10^10 = 0.00036
– Sometimes the initial centroids will readjust themselves in the "right" way, and sometimes they don't
– Consider an example of five pairs of clusters
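A one-line check of the quoted probability for K = 10 (standard-library Python, illustrative):

    from math import factorial

    K = 10
    print(factorial(K) / K**K)   # 0.00036288: chance that 10 random centroids hit one cluster each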
10 Clusters Example
[Figures: K-means iterations on ten clusters (five pairs of clusters).
Left: starting with two initial centroids in one cluster of each pair of clusters.
Right: starting with some pairs of clusters having three initial centroids, while others have only one.]
Solutions to Initial Centroids Problem

Updating Centers Incrementally
Hierarchical Clustering
! Produces a set of nested clusters organized as a hierarchical tree
! Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits
[Figure: nested clusters of points 1–6 and the corresponding dendrogram]

! Two main types of hierarchical clustering
– Agglomerative:
! Start with the points as individual clusters
! At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
! Start with one, all-inclusive cluster
! At each step, split a cluster until each cluster contains a point (or there are k clusters)

! Traditional hierarchical algorithms use a similarity or distance matrix
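As a concrete illustration of agglomerative clustering and its dendrogram, here is a short sketch using SciPy's hierarchical-clustering API (the six sample points are made up for the example):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    points = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0],
                       [1.1, 0.9], [5.0, 5.0], [5.1, 5.2]])
    # method: 'single' = MIN, 'complete' = MAX, 'average' = group average, 'ward' = Ward's method
    Z = linkage(points, method='single')
    dendrogram(Z)                 # tree-like diagram recording the sequence of merges
    plt.show()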
Agglomerative Clustering Algorithm
! More popular hierarchical clustering technique
! Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
(A code sketch of these steps appears after the figures below.)

Starting Situation
! Start with clusters of individual points and a proximity matrix
[Figure: individual points p1, p2, ..., p12 and their proximity matrix]

Intermediate Situation
! After some merging steps, we have some clusters
! We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1–C5 and their proximity matrix]
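For intuition, the six numbered steps above can be written as a direct proximity-matrix loop. The sketch below uses MIN (single link) as the update rule; it is illustrative code with invented names, not code distributed with the slides:

    import numpy as np

    def agglomerative_min(points):
        # Naive single-link (MIN) agglomerative clustering following steps 1-6:
        # O(N^2) space for the proximity matrix, O(N^3) time overall.
        points = np.asarray(points, dtype=float)
        D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)                   # 1. compute the proximity matrix
        clusters = [[i] for i in range(len(points))]  # 2. each point starts as its own cluster
        merges = []
        while len(clusters) > 1:                      # 3./6. repeat until one cluster remains
            i, j = np.unravel_index(np.argmin(D), D.shape)
            i, j = min(i, j), max(i, j)
            merges.append((clusters[i], clusters[j], D[i, j]))   # 4. merge the two closest clusters
            # 5. update the proximity matrix: MIN keeps the smaller of the two distances
            D[i, :] = np.minimum(D[i, :], D[j, :])
            D[:, i] = D[i, :]
            D[i, i] = np.inf
            D = np.delete(np.delete(D, j, axis=0), j, axis=1)
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return merges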
After Merging
! The question is "How do we update the proximity matrix?"
[Figure: proximity matrix after merging C2 and C5 into C2 U C5; entries between C2 U C5 and C1, C3, C4 are marked "?"]

How to Define Inter-Cluster Similarity
! MIN
! MAX
! Group Average
! Distance Between Centroids
! Other methods driven by an objective function
– Ward's Method uses squared error
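The first four proximity definitions above can be sketched for two clusters of points as follows (illustrative helper assuming Euclidean distance; SciPy's cdist computes the pairwise distance matrix):

    import numpy as np
    from scipy.spatial.distance import cdist

    def inter_cluster_proximity(A, B, method="min"):
        D = cdist(A, B)                        # pairwise distances between points of the two clusters
        if method == "min":                    # MIN / single link: closest pair
            return D.min()
        if method == "max":                    # MAX / complete link: farthest pair
            return D.max()
        if method == "average":                # group average: mean over all pairs
            return D.mean()
        if method == "centroid":               # distance between the two cluster centroids
            return np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0))
        raise ValueError(method)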
Cluster Similarity: MIN or Single Link
! Similarity of two clusters is based on the two most similar (closest) points in the different clusters
– Determined by one pair of points, i.e., by one link in the proximity graph.

Similarity matrix for points I1–I5:

       I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
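For example, using the similarity matrix above, the single-link similarity between clusters {I1, I2} and {I3, I4, I5} is the largest pairwise similarity between the two clusters (the cluster choice here is only to illustrate the computation):

    import numpy as np

    sim = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                    [0.90, 1.00, 0.70, 0.60, 0.50],
                    [0.10, 0.70, 1.00, 0.40, 0.30],
                    [0.65, 0.60, 0.40, 1.00, 0.80],
                    [0.20, 0.50, 0.30, 0.80, 1.00]])
    a, b = [0, 1], [2, 3, 4]              # clusters {I1, I2} and {I3, I4, I5}
    print(sim[np.ix_(a, b)].max())        # 0.70: the most similar pair is (I2, I3)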
Hierarchical Clustering: MIN
[Figure: single-link clustering of points 1–6 (Original Points and the resulting Two Clusters), with the corresponding dendrogram]

Strength of MIN
• Can handle non-elliptical shapes
[Figure: MAX (complete-link) clustering of points 1–6 and the corresponding dendrogram]
Cluster Similarity: Group Average
! Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

    proximity(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i, \; p_j \in Cluster_j} proximity(p_i, p_j)}{|Cluster_i| \times |Cluster_j|}

[Figure: group-average clustering of points 1–6 and the corresponding dendrogram]
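Applied to the same I1–I5 similarity matrix, the group-average proximity between {I1, I2} and {I3, I4, I5} averages all 2 x 3 pairwise similarities (again, an illustrative cluster choice):

    import numpy as np

    sim = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                    [0.90, 1.00, 0.70, 0.60, 0.50],
                    [0.10, 0.70, 1.00, 0.40, 0.30],
                    [0.65, 0.60, 0.40, 1.00, 0.80],
                    [0.20, 0.50, 0.30, 0.80, 1.00]])
    a, b = [0, 1], [2, 3, 4]                              # {I1, I2} and {I3, I4, I5}
    # Group average = sum of pairwise similarities / (|Cluster_i| * |Cluster_j|)
    print(sim[np.ix_(a, b)].sum() / (len(a) * len(b)))    # (0.10+0.65+0.20+0.70+0.60+0.50) / 6 ≈ 0.458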
Cluster Similarity: Ward's Method
! Similarity of two clusters is based on the increase in squared error when two clusters are merged
– Similar to group average if distance between points is distance squared
! Less susceptible to noise and outliers
! Biased towards globular clusters

Hierarchical Clustering: Time and Space Requirements
! O(N^2) space since it uses the proximity matrix.
– N is the number of points.
! O(N^3) time in many cases
– There are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched
– Complexity can be reduced to O(N^2 log(N)) time for some approaches

! Once a decision is made to combine two clusters, it cannot be undone
! Use MST for constructing a hierarchy of clusters
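A back-of-the-envelope illustration of the O(N^2) space requirement (the numbers are only an example):

    N = 100_000
    proximity_matrix_bytes = N * N * 8        # one 8-byte float per pair of points
    print(proximity_matrix_bytes / 1e9)       # 80.0 GB: the full matrix quickly becomes impractical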
• Resistant to Noise
• Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well

Cluster Validity
DBSCAN: Determining Eps and MinPts
! Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
! So, plot the sorted distance of every point to its kth nearest neighbor
[Figures: DBSCAN clusters on example data; sorted kth nearest neighbor distances]
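A sketch of this sorted k-dist plot using scikit-learn's NearestNeighbors (k = 4 mirrors a common MinPts choice; the helper name is invented):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    def k_dist_plot(points, k=4):
        # Distance from every point to its k-th nearest neighbor (excluding the point itself).
        nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
        dists, _ = nn.kneighbors(points)
        kth = np.sort(dists[:, k])
        plt.plot(kth)                                  # look for the "knee" to pick Eps
        plt.xlabel("points sorted by distance")
        plt.ylabel(f"{k}th nearest neighbor distance")
        plt.show()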
Different Aspects of Cluster Validation

Measuring Cluster Validity Via Correlation

! Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
! Relative Index: Used to compare two different clusterings or clusters.
! Often an external or internal index is used for this function, e.g., SSE or entropy
– However, sometimes criterion is the general strategy and index is the numerical measure that implements it.
[Figures: scatter plots of well-separated vs. random data and their similarity matrices, used to assess validity via correlation]
Using Similarity Matrix for Cluster Validation
! Clusters in random data are not so crisp
[Figures: point plots and sorted similarity matrices for clusterings of random data]
[Figure: DBSCAN clusters (1–7) and the corresponding sorted similarity matrix]
[Figure: SSE of clusters found using K-means]
Internal Measures: SSE
! Clusters in more complicated figures aren't well separated
! Internal Index: Used to measure the goodness of a clustering structure without respect to external information
[Figure: SSE curve for a more complicated data set]

Framework for Cluster Validity
! Need a framework to interpret any measure.
– For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
! If the value of the index is unlikely, then the cluster results are valid
– These approaches are more complicated and harder to understand.
[Figure: histogram of SSE values for random data (Count vs. SSE)]

Internal Measures: Cohesion and Separation
! Cluster cohesion is measured by the within cluster sum of squares (SSE):

    WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2

! Cluster separation is measured by the between cluster sum of squares:

    BSS = \sum_{i} |C_i| (m - m_i)^2

– Where |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the overall mean
! Example: data points 1, 2, 4, 5; overall mean m = 3

    K=1 cluster: WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
                 BSS = 4 * (3 - 3)^2 = 0
                 Total = 10 + 0 = 10

[Figure: cohesion and separation illustrated for two clusters]
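A quick numeric check of the cohesion/separation decomposition for the data points 1, 2, 4, 5 (plain NumPy, illustrative):

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 5.0])
    m = x.mean()                                       # overall mean = 3

    def wss_bss(clusters):
        wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
        bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
        return wss, bss

    print(wss_bss([x]))                    # K=1: WSS = 10, BSS = 0, total = 10
    print(wss_bss([x[:2], x[2:]]))         # K=2: WSS = 1,  BSS = 9, total = 10 (WSS + BSS is constant)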