Data Mining: Clustering Validation, Minimum Description Length, Information Theory, Co-Clustering
LECTURE 8
Clustering Validation
Minimum Description Length
Information Theory
Co-Clustering
CLUSTERING VALIDITY
Cluster Validity
• How do we evaluate the “goodness” of the resulting clusters?
[Figure: two clusterings of points in the unit square (x, y); left panel: K-means, right panel: Complete Link.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.
Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and inspect it visually.
[Figure: well-separated clusters in the unit square (x, y) and the corresponding points × points similarity matrix, with distances rescaled to similarities:]

$sim(i,j) = 1 - \frac{d_{ij} - d_{min}}{d_{max} - d_{min}}$
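A minimal sketch of this visual check, assuming NumPy and SciPy are available (the function name and setup are illustrative, not from the lecture):

```python
# Build the similarity matrix defined above and reorder it by cluster label,
# so that good clusters show up as bright blocks on the diagonal.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sorted_similarity(X, labels):
    d = squareform(pdist(X))                         # pairwise distances d_ij
    sim = 1 - (d - d.min()) / (d.max() - d.min())    # sim(i,j) as defined above
    order = np.argsort(labels)                       # group points by cluster
    return sim[np.ix_(order, order)]                 # rows/cols sorted by label
```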
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: similarity matrix sorted by cluster label, and the clustered points, for DBSCAN on random data.]
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: similarity matrix sorted by cluster label, and the clustered points, for K-means on random data.]
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: similarity matrix sorted by cluster label, and the clustered points, for Complete Link on random data.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters (1–7) found in real data, and the corresponding sorted similarity matrix over 3,000 points, which shows clear block structure.]
Internal Measures: SSE
• Internal Index: used to measure the goodness of a clustering structure without reference to external information
• Example: SSE
• SSE is good for comparing two clusterings or two clusters (average SSE).
• Can also be used to estimate the number of clusters
[Figure: an example data set and the SSE curve as a function of the number of clusters K (K = 2 to 30).]
Estimating the “right” number of clusters
• Typical approach: find a “knee” in an internal measure curve.
[Figure: SSE as a function of K (K = 2 to 30); the knee of the curve suggests the number of clusters.]

$WSS = \sum_{i} \sum_{x \in C_i} (x - c_i)^2$

WSS measures cohesion (how tight each cluster is); separation is measured analogously by the between-cluster sum of squares.
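A small sketch of this heuristic, assuming scikit-learn (whose KMeans exposes the WSS/SSE above as inertia_); the data set is synthetic and only illustrative:

```python
# Compute SSE for a range of K and look for the knee where the decrease flattens.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia_ = sum of squared distances to centroids
```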
Internal measures – caveats
• Internal measures have the problem that the clustering algorithm did not set out to optimize this measure, so it will not necessarily do well with respect to the measure.
[Figure: random data in the unit square and a histogram of SSE values (roughly 0.016 to 0.034) obtained by clustering many random data sets; an observed SSE can be judged against this distribution.]
Statistical Framework for Correlation
• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.
[Figure: the two data sets in the unit square (x, y): one with well-separated clusters, one with random points.]
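A sketch of the incidence/proximity correlation, assuming NumPy, SciPy, and scikit-learn; the synthetic data is illustrative:

```python
# Correlate pairwise distances with the "same cluster" indicator: for crisp
# clusters the correlation is strongly negative (same cluster => small distance).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

proximity = squareform(pdist(X))                               # distance matrix
incidence = (labels[:, None] == labels[None, :]).astype(float)

iu = np.triu_indices_from(proximity, k=1)                      # matrices are symmetric
print(np.corrcoef(proximity[iu], incidence[iu])[0, 1])
```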
External Measures: Precision, Recall, and F-measure
• Recall of cluster i with respect to class j: $Rec(i,j) = \frac{m_{ij}}{m_j}$, the fraction of the points of class j that end up in cluster i, where $m_{ij}$ is the number of points of class j in cluster i and $m_j$ the size of class j.
• Precision of cluster i with respect to class j: $Prec(i,j) = \frac{m_{ij}}{m_i}$, the fraction of the points of cluster i that belong to class j, where $m_i$ is the size of cluster i.
• For the precision of a clustering you can take the maximum over classes for each cluster.
• F-measure: harmonic mean of precision and recall: $F(i,j) = \frac{2 \cdot Prec(i,j) \cdot Rec(i,j)}{Prec(i,j) + Rec(i,j)}$
Good and bad clustering
[Figure: a good clustering (left) and a bad clustering (right) of the same labeled data, three clusters each.]
Good clustering: Purity (0.94, 0.81, 0.85), overall 0.86; Precision (0.94, 0.81, 0.85); Recall (0.85, 0.90, 0.85)
Bad clustering: Purity (0.38, 0.38, 0.38), overall 0.38; Precision (0.38, 0.38, 0.38); Recall (0.35, 0.42, 0.38)
Another bad clustering
[Figure: a clustering where Cluster 1 contains points of only one class, but captures only part of that class.]
Cluster 1: Purity 1, Precision 1, Recall 0.35
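The numbers above can be reproduced mechanically from a cluster-by-class contingency matrix; a sketch with hypothetical counts (not the lecture's actual data), assuming NumPy:

```python
# m[i][j] = number of points of class j in cluster i (hypothetical counts).
import numpy as np

m = np.array([[94,  3,  3],
              [ 5, 81, 14],
              [ 6, 10, 85]])

m_i = m.sum(axis=1)                    # cluster sizes
m_j = m.sum(axis=0)                    # class sizes

precision = m / m_i[:, None]           # Prec(i, j) = m_ij / m_i
recall = m / m_j[None, :]              # Rec(i, j)  = m_ij / m_j
purity = precision.max(axis=1)         # purity of each cluster
overall = (m_i * purity).sum() / m.sum()

print(purity, overall)
```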
External Measures of Cluster Validity: Entropy and Purity
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
(Jain and Dubes, Algorithms for Clustering Data)
MINIMUM DESCRIPTION LENGTH
Example
• Regression: find a polynomial for describing a set of values
• Model complexity (model cost): the polynomial coefficients
• Goodness of fit (data cost): difference between the real values and the polynomial values
• Clustering: cluster integers into two clusters and describe each cluster by its centroid and the points by their distance from the centroid
• Model cost: cost of the centroids
• Data cost: cost of cluster membership and distance from centroid
• Example: which of the following two strings is easier to describe?
00001000010000100001000010000100001000010001000010000100001
0100111001010011011010100001110101111011011010101110010011100
The first is essentially “00001” repeated and admits a short description; the second is close to random and does not compress.
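A toy sketch of the model-cost/data-cost trade-off for the integer-clustering example; the bit-length function is a crude stand-in, not an optimal universal code:

```python
# Total description length = bits for the two centroids (model cost)
# + one membership bit per point + bits for its residual (data cost).
import math

def bits(x):
    # rough code length for a non-negative integer (illustrative assumption)
    return math.log2(abs(x) + 1) + 1

def mdl_cost(points, c1, c2):
    model = bits(c1) + bits(c2)
    data = sum(1 + bits(min(abs(p - c1), abs(p - c2))) for p in points)
    return model + data

points = [1, 2, 3, 4, 100, 101, 102, 103]
print(mdl_cost(points, 2, 101))   # good centroids: small residuals, low cost
print(mdl_cost(points, 50, 51))   # bad centroids: large residuals, high cost
```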
• Fun lecture: Compression Progress: The Algorithmic Principle Behind Curiosity and Creativity
Issues with MDL
• What is the right model family?
• This determines the kind of solutions that we can have
• E.g., polynomials
• Clusterings, e.g., of the symbol sequence AAABBBAAACCCABACAABBAACCABAC

Kullback-Leibler (KL) divergence
• Measures how distribution Q diverges from distribution P:
$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
• Not symmetric: in general $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$
• Problematic if Q is not defined (is zero) for some x where P is positive.
Some information theoretic measures
• Jensen-Shannon Divergence JS(P,Q): distance between two distributions P and Q:
$JS(P,Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$, where $M = \frac{1}{2}(P + Q)$
• Deals with the shortcomings of KL-divergence: it is symmetric and always finite
• The square root of the Jensen-Shannon divergence is a metric
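A small sketch of both quantities, assuming SciPy (scipy.stats.entropy(P, Q) computes $D_{KL}(P \| Q)$):

```python
import numpy as np
from scipy.stats import entropy   # entropy(P, Q) = sum_x P(x) log(P(x)/Q(x))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
M = 0.5 * (P + Q)

print(entropy(P, Q), entropy(Q, P))               # asymmetric in general
print(0.5 * entropy(P, M) + 0.5 * entropy(Q, M))  # JS(P,Q): symmetric, finite
```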
USING MDL FOR
CO-CLUSTERING
(CROSS-ASSOCIATIONS)
Thanks to Spiros Papadimitriou.
Co-clustering
• Simultaneous grouping of rows and columns of a matrix into homogeneous groups
[Figure: students buying books: a customers × products 0/1 matrix reordered by customer groups and product groups; diagonal blocks are dense (97%, 96%) and off-diagonal blocks sparse (3%), with one mixed block (54%).]
[Figure: why is this better? Clustering row groups alone versus co-clustering: grouping rows and columns simultaneously makes the homogeneous block structure explicit.]
Co-clustering
MDL formalization—Cost objective
Given k row groups and ℓ column groups (e.g., ℓ = 3 column groups of sizes $m_1, m_2, m_3$) of an n × m binary matrix:
• Description cost (block structure): transmit the matrix dimensions ($\log n + \log m$), the number of groups ($\log^* k + \log^* \ell$), the group sizes, and the density of ones $p_{i,j}$ of every block (i, j).
• Data cost: transmit the contents of each block given its density.
• The description cost is low for one row group and one column group, and high for n row groups and m column groups.
[Figure: block structure of the matrix and the terms of the description cost.]
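A simplified sketch of this objective, assuming NumPy; the exact header terms of cross-associations are more detailed (e.g., true log* codes), so treat this as an approximation of the cost structure, not the paper's precise formula:

```python
import numpy as np

def binary_entropy(p, eps=1e-9):
    p = min(max(p, eps), 1 - eps)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def total_cost(A, row_groups, col_groups, k, l):
    n, m = A.shape
    # crude stand-in for the log n + log m + log*k + log*l header
    desc = np.log2(n) + np.log2(m) + np.log2(k) + np.log2(l)
    code = 0.0
    for i in range(k):
        for j in range(l):
            block = A[np.ix_(row_groups == i, col_groups == j)]
            if block.size == 0:
                continue
            desc += np.log2(block.size + 1)                     # transmit # of ones
            code += block.size * binary_entropy(block.mean())   # entropy-code cells
    return desc + code   # description cost + data cost, in bits
```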
Co-clustering
MDL formalization—Cost objective
[Figure: the code (data) cost behaves in the opposite way: high for one row group and one column group, low for the right grouping, here k = 3 row groups and ℓ = 3 column groups.]

MDL formalization—Cost objective
Cost vs. number of groups
[Figure: total bit cost as a function of k and ℓ; its minimum identifies the best number of row and column groups.]
Search for solution
Shuffles
• Alternate row shuffles and column shuffles, keeping a change only if it lowers the total cost.
[Figure: a row shuffle followed by a column shuffle; when there is no cost improvement, the change is discarded.]
• Let $\Phi^I, \Psi^I$ denote the row and column partitions at the I-th iteration.
• Fix the column partition and, for every row x:
• Splice the row into ℓ parts, one for each column group.
• Let $x_j$, for j = 1, …, ℓ, be the number of ones in each part.
• Assign row x to the row group $i^*(x)$ that encodes it most cheaply, i.e., the i among 1, …, k minimizing
$\sum_{j=1}^{\ell} \left[ x_j \log \frac{1}{p_{i,j}} + (m_j - x_j) \log \frac{1}{1 - p_{i,j}} \right]$
where $p_{i,j}$ is the density of ones of block (i, j); these per-group costs behave like KL-divergences between the row fragments and the block densities.
[Figure: a row is compared against the block densities $p_{1,1}, \dots, p_{3,3}$ of each row group and assigned to the second row group.]
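A sketch of one row shuffle under the criterion above, assuming NumPy; cluster assignments are integer arrays, and eps guards against log(0):

```python
import numpy as np

def row_shuffle(A, row_groups, col_groups, k, l, eps=1e-9):
    # current block densities p[i, j]
    p = np.zeros((k, l))
    for i in range(k):
        for j in range(l):
            block = A[np.ix_(row_groups == i, col_groups == j)]
            p[i, j] = block.mean() if block.size else 0.0
    p = np.clip(p, eps, 1 - eps)

    sizes = np.array([(col_groups == j).sum() for j in range(l)])  # m_j
    ones = np.stack([A[:, col_groups == j].sum(axis=1)
                     for j in range(l)], axis=1)                   # x_j per row

    # bits to encode each row under each row group (the sum from the slide)
    cost = -(ones @ np.log2(p.T) + (sizes - ones) @ np.log2(1 - p.T))
    return cost.argmin(axis=1)   # cheapest row group for every row
```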
Search for solution
Overview: number of groups k and ℓ (splits & shuffles)
• Start from k = 1, ℓ = 1 and alternate splits with shuffles:
• Split: increase k or ℓ (try a row split or a column split).
• Shuffle: rearrange rows or columns until the cost stops improving.
• The search passes through k = 1, ℓ = 2; k = 2, ℓ = 2; k = 2, ℓ = 3; k = 3, ℓ = 3; k = 3, ℓ = 4; k = 4, ℓ = 4; k = 4, ℓ = 5; and so on.
• If a split brings no cost improvement (e.g., going from k = 5 to k = 6 at ℓ = 5), it is discarded; the final result here is k = 5, ℓ = 5.
Co-clustering
CLASSIC corpus
• 3,893 documents
• 4,303 words
• Combination of 3 sources:
• MEDLINE (medical)
• CISI (information retrieval)
• CRANFIELD (aerodynamics)
[Figure: the documents × words matrix.]
Graph co-clustering
CLASSIC
[Figure: the co-clustered documents × words matrix; document groups line up with the three sources MEDLINE (medical), CISI (information retrieval), and CRANFIELD (aerodynamics).]
Document groups versus the three sources. Precision per group ranges 0.94–1.00; recall per source 0.97–0.99 (bottom row):

Doc. group   #docs per source     Precision
 6           207     0     0      1.000
 7           188     0     0      1.000
 8           131     0     0      1.000
 9           209     0     0      1.000
10           107     2     0      0.982
11           152     3     2      0.968
12            74     0     0      1.000
13           139     9     0      0.939
14           163     0     0      1.000
15            24     0     0      1.000
Recall     0.996 0.990 0.968