Hierar Scale4
Apes and Us
http://orangutan.org/orangutan-genome-part-1-the-quest-for-leakeys-ancestral-great-ape/
The Tree of Life
http://pixgood.com/evolution-of-life-poster.html
P450 protein superfamily
1600 sequences
http://www.whoi.edu/page.do?pid=75497&tid=441&cid=166573&ct=61&article=109489 16 species
Hierarchical Clustering
• Produces a set of nested clusters organized as
a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequence of
merges or splits
[Figure: nested clusters and the corresponding dendrogram for points 1 to 6]
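A dendrogram can be stored as the list of merges it records, and cutting it at a chosen height yields a flat clustering. The sketch below assumes a hypothetical merge record for six points; the function name `cut_dendrogram` and the heights are illustrative and not taken from the slides.

```python
def cut_dendrogram(n_points, merges, height):
    """Return the flat clusters obtained by cutting the tree at `height`.

    merges: list of (i, j, h) tuples, meaning the clusters that contain
    points i and j were merged at height h.
    """
    parent = list(range(n_points))  # union-find forest, one root per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Apply only the merges that happen at or below the cut.
    for i, j, h in merges:
        if h <= height:
            parent[find(i)] = find(j)

    clusters = {}
    for p in range(n_points):
        clusters.setdefault(find(p), []).append(p)
    return sorted(clusters.values())


# Hypothetical merge record for six points (0..5); the heights are made up.
merges = [(2, 5, 0.05), (1, 4, 0.08), (1, 2, 0.12), (3, 2, 0.16), (0, 3, 0.20)]
print(cut_dendrogram(6, merges, 0.10))  # → [[0], [1, 4], [2, 5], [3]]
```

Raising the cut height merges more of the tree, so the same record gives every clustering from six singletons down to one all-inclusive cluster.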
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point (or there
are k clusters)
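The slides do not say how a cluster is split, so the following is only one possible sketch. It assumes a simple heuristic: pick the cluster with the largest diameter, take its two farthest points as seeds, and assign every point to the nearer seed. The names `diameter`, `split`, and `divisive` are hypothetical.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Largest pairwise distance inside a cluster (0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def split(cluster):
    """Assumed splitting rule: partition around the two farthest points."""
    a, b = max(combinations(cluster, 2), key=lambda pq: math.dist(*pq))
    near_a = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
    near_b = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
    return near_a, near_b


def divisive(points, k):
    """Start with one all-inclusive cluster and split until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        widest = max(clusters, key=diameter)
        if diameter(widest) == 0.0:  # nothing left that can be split
            break
        clusters.remove(widest)
        clusters.extend(split(widest))
    return clusters


points = [(0, 0), (1, 0), (10, 0), (11, 0)]
print(sorted(divisive(points, 2)))  # → [[(0, 0), (1, 0)], [(10, 0), (11, 0)]]
```

Passing `k = len(points)` continues until every cluster contains a single point.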
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1 to C5 and their proximity matrix]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update
the proximity matrix.
[Figure: clusters C1 to C5 and their proximity matrix]
After Merging
• The question is “How do we update the proximity matrix?”
[Figure: proximity matrix after the merge; the entries in the row and column of C2 U C5 are marked “?”]
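One way to answer the question is to compute the new row from the two old rows. The sketch below handles only the MIN and MAX rules, where the distance from the merged cluster to any other cluster is the smaller (or larger) of the two old distances. The 5×5 distance matrix is made up for illustration, and `closest_pair` and `merge_clusters` are hypothetical helper names.

```python
def closest_pair(dist):
    """Indices (a, b), a < b, of the smallest off-diagonal distance."""
    n = len(dist)
    pairs = ((r, c) for r in range(n) for c in range(r + 1, n))
    return min(pairs, key=lambda rc: dist[rc[0]][rc[1]])


def merge_clusters(labels, dist, a, b, rule=min):
    """Merge clusters a and b; rule=min gives MIN, rule=max gives MAX."""
    keep = [k for k in range(len(labels)) if k not in (a, b)]
    new_labels = [labels[k] for k in keep] + [f"{labels[a]} U {labels[b]}"]
    # Copy the rows and columns of the untouched clusters.
    new_dist = [[dist[r][c] for c in keep] for r in keep]
    # Distance from the merged cluster to each remaining cluster.
    merged_row = [rule(dist[a][k], dist[b][k]) for k in keep]
    for row, d in zip(new_dist, merged_row):
        row.append(d)
    new_dist.append(merged_row + [0.0])
    return new_labels, new_dist


labels = ["C1", "C2", "C3", "C4", "C5"]
dist = [  # made-up symmetric distances; C2 and C5 are the closest pair
    [0, 4, 6, 7, 5],
    [4, 0, 3, 8, 1],
    [6, 3, 0, 2, 9],
    [7, 8, 2, 0, 6],
    [5, 1, 9, 6, 0],
]
a, b = closest_pair(dist)  # (1, 4), i.e. C2 and C5
new_labels, new_dist = merge_clusters(labels, dist, a, b, rule=min)
print(new_labels)    # → ['C1', 'C3', 'C4', 'C2 U C5']
print(new_dist[-1])  # → [4, 3, 6, 0.0]
```

Group average and Ward's method need the cluster sizes as well, so they do not fit this two-argument `rule`.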
How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward’s Method uses squared error
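The first four definitions can be written as short functions over two clusters of points. This is a minimal sketch that assumes Euclidean distance as the proximity measure; the function names and the two example clusters are illustrative.

```python
import math
from itertools import product


def single_link(a, b):
    """MIN: distance between the two closest points of the clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """MAX: distance between the two farthest points of the clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))


def group_average(a, b):
    """Average of all pairwise distances between the clusters."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def centroid_distance(a, b):
    """Distance between the cluster centroids."""
    return math.dist(centroid(a), centroid(b))


A = [(0, 0), (0, 2)]
B = [(3, 0), (3, 2)]
print(single_link(A, B))        # → 3.0
print(complete_link(A, B))      # sqrt(13), about 3.606
print(group_average(A, B))      # (3 + 3 + 2 * sqrt(13)) / 4, about 3.303
print(centroid_distance(A, B))  # → 3.0
```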
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two
most similar (closest) points in the different
clusters
– Determined by one pair of points, i.e., by one link
in the proximity graph.
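A naive agglomerative loop under the single-link rule can be sketched as follows. It assumes Euclidean distance and recomputes every inter-cluster distance at each step, which is simple but slow; `single_link_clustering` is a hypothetical name.

```python
import math


def single_link_clustering(points):
    """Return the merges as (members_a, members_b, distance) tuples."""
    clusters = {i: [i] for i in range(len(points))}  # cluster id -> point indices
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Single link: the closest pair of points decides the distance.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


points = [(0, 0), (1, 0), (3, 0), (7, 0)]
for left, right, d in single_link_clustering(points):
    print(left, right, d)
# [0] [1] 1.0
# [0, 1] [2] 2.0
# [0, 1, 2] [3] 4.0
```

Each merge height here equals the length of one link between neighbouring points, which is the sense in which a single pair of points determines the result.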
Hierarchical Clustering: MIN
[Figure: nested clusters and dendrogram]
[Figure: nested clusters and dendrogram]
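\[
\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) =
\frac{\sum_{p_i \in \mathrm{Cluster}_i,\; p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}
     {|\mathrm{Cluster}_i| \, |\mathrm{Cluster}_j|}
\]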
[Figure: nested clusters and dendrogram]
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
• Similarity of two clusters is based on the increase
in squared error when two clusters are merged
– Similar to group average if distance between points is
distance squared
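The increase in squared error can be computed directly from the definition. The sketch below assumes Euclidean distance and also checks the result against the standard closed form, |A||B| / (|A| + |B|) times the squared distance between the centroids. The function names and example clusters are illustrative.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) ** 2 for p in cluster)


def ward_cost(a, b):
    """Increase in squared error caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)


def ward_cost_closed_form(a, b):
    """Same quantity from the cluster sizes and centroids alone."""
    gap = math.dist(centroid(a), centroid(b)) ** 2
    return len(a) * len(b) / (len(a) + len(b)) * gap


A = [(0, 0), (0, 2)]
B = [(4, 0), (4, 2)]
print(ward_cost(A, B))              # 20 - 2 - 2 = 16 (up to rounding)
print(ward_cost_closed_form(A, B))  # 2 * 2 / 4 * 16 = 16
```

At each step Ward's method merges the pair of clusters with the smallest such cost.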
[Figure: nested clusters produced by Group Average and by Ward’s Method]
Hierarchical Clustering: Time and Space requirements
E (mi ( j ) x j )
j 1
mi mi E
mi mi E
mi ( j ) ( x j mi ( j ) )
Scale-based Clustering
• Clustering is done at a “scale”
• An answer to the question of “how many
clusters”
• Best clusters tend to live over the longest
range of scales
Algorithm
• Start with a large number of clusters
• Initialize by selecting from data set
• Initialize “sigma” to a small value
• Update all centroids
• Eliminate duplicate centroids whenever there is a
merger
• Increase sigma by a constant factor
• If more than one unique centroid remains, continue
updating the centroids
• Stop only when a single unique centroid remains
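The steps above can be sketched as a short loop. The slides do not preserve the centroid update rule in readable form, so this sketch assumes a Gaussian-weighted mean of the data with width sigma. The tolerance `tol` for treating centroids as duplicates, the growth `factor`, the `max_steps` guard, and all function names are likewise assumptions.

```python
import math


def update_centroid(m, data, sigma):
    """Assumed update: Gaussian-weighted mean of the data around m."""
    weights = [math.exp(-math.dist(m, x) ** 2 / (2 * sigma ** 2)) for x in data]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed; leave the centroid in place
        return m
    return tuple(
        sum(w * x[d] for w, x in zip(weights, data)) / total
        for d in range(len(m))
    )


def unique_centroids(centroids, tol):
    """Drop centroids that lie within tol of one already kept."""
    kept = []
    for m in centroids:
        if all(math.dist(m, k) > tol for k in kept):
            kept.append(m)
    return kept


def scale_based_clustering(data, sigma=0.1, factor=1.2, tol=1e-3, max_steps=200):
    """Return the final centroids and a history of (sigma, number of centroids)."""
    centroids = unique_centroids(list(data), tol)  # initialize from the data set
    history = [(sigma, len(centroids))]
    for _ in range(max_steps):
        if len(centroids) == 1:  # stop when a single unique centroid remains
            break
        centroids = [update_centroid(m, data, sigma) for m in centroids]
        centroids = unique_centroids(centroids, tol)  # eliminate duplicates
        history.append((sigma, len(centroids)))
        sigma *= factor  # move to a coarser scale
    return centroids, history


# Made-up 1-D data: two tight groups, symmetric about 5.4.
data = [(0.0,), (0.4,), (0.8,), (10.0,), (10.4,), (10.8,)]
centroids, history = scale_based_clustering(data)
print(len(centroids), centroids[0])  # one centroid, close to the data mean 5.4
```

Counting how many entries of `history` share the same number of centroids gives a rough measure of how long each clustering survives across scales.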
Data set
Clustering result:
Evolution of the centroids
Thank you