Clustering
(Figure: scattered data points grouped into clusters)
What can we do with clustering?
• One of the major applications of clustering in
bioinformatics is clustering microarray data to group
genes with similar expression patterns
• Hypotheses:
• Genes with similar expression patterns are likely to be
coexpressed
• Coexpressed genes can imply that
• they are involved in similar functions
• they are somehow related, for instance because their proteins
directly/indirectly interact with each other
• It is widely believed that coexpressed genes are
involved in similar functions
• But still, what can we really gain from doing clustering?
Purpose of Clustering on Microarray
• Suppose genes A and B are grouped in the same
cluster; we then hypothesize that genes A and B are
involved in similar functions.
• If we know that gene A is involved in apoptosis
• but we do not know if gene B is involved in apoptosis
• we can do experiments to confirm if gene B indeed is
involved in apoptosis.
Purpose of Clustering on Microarray
• Suppose genes A and B are grouped in the same
cluster, then we hypothesize that proteins A and B
might interact with each other.
• So we can do experiments to confirm if such interaction
exists.
• Clustering microarray data thus helps us generate
hypotheses about:
• potential functions of genes
• potential protein-protein interactions
Does Clustering Always Work?
• Do coexpressed genes always imply that they have
similar functions?
• Not necessarily
• housekeeping genes
• genes that are always expressed, or never expressed,
regardless of the condition
• there can be noise in microarray data
• But clustering is useful in:
• visualization of data
• hypothesis generation
Overview of clustering
• Dissimilarity matrix (one mode):
$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
(Dis)similarity measures
• Instead of talking about similarity measures, we
often equivalently refer to dissimilarity measures
(I’ll give an example of how to convert between
them in a few slides…)
• Jagota defines a dissimilarity measure as a function
f(x,y) such that f(x,y) > f(w,z) if and only if x is less
similar to y than w is to z
• This is always a pair-wise measure
• Think of x, y, w, and z as gene expression profiles
(rows or columns)
Continuous Variable
• Standardize data
• Calculate the mean absolute deviation:
$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right) $$
where
$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right) $$
• Manhattan distance
$$ d(g_1, g_2) = \sum_{i=1}^{n} |x_i - y_i| $$
• Minkowski distance
$$ d(g_1, g_2) = \sqrt[m]{\sum_{i=1}^{n} |x_i - y_i|^m} $$
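As an illustration, here is a minimal sketch of these distance measures in Python using numpy; the profiles g1 and g2 are made-up example values, not data from the slides (Minkowski with m = 2 is the Euclidean distance).

```python
import numpy as np

# Two hypothetical expression profiles (made-up numbers for illustration)
g1 = np.array([1.0, 2.5, 0.3, 4.1])
g2 = np.array([0.8, 2.0, 1.1, 3.9])

def manhattan(x, y):
    """d(x, y) = sum_i |x_i - y_i|"""
    return np.sum(np.abs(x - y))

def minkowski(x, y, m=2):
    """d(x, y) = (sum_i |x_i - y_i|^m)^(1/m); m = 2 gives the Euclidean distance."""
    return np.sum(np.abs(x - y) ** m) ** (1.0 / m)

print(manhattan(g1, g2))       # Manhattan distance (m = 1)
print(minkowski(g1, g2, m=2))  # Euclidean distance (m = 2)
```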
(Figure: two example pairs of expression profiles with Euclidean distances d_euc = 0.5846 and d_euc = 1.1345)
Pearson linear correlation:
$$ \rho(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\ \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
where
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i $$
Pearson dissimilarity:
$$ d_p = \frac{1 - \rho(x, y)}{2} $$
Pearson Linear Correlation
• PLC only measures the degree of a linear
relationship between two expression profiles!
• If you want to measure other relationships, there
are many other possible measures.
(Figure: two expression profiles with ρ = 0.0249, so d_p = 0.4876; the green curve is the square of the blue curve, a relationship not captured by PLC)
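A small sketch (assuming numpy) of converting the Pearson correlation into the dissimilarity d_p = (1 − ρ)/2 from the previous slide; the profiles below are made up to mimic the "green curve is the square of the blue curve" example.

```python
import numpy as np

def pearson_dissimilarity(x, y):
    """Convert Pearson correlation rho into d_p = (1 - rho) / 2, so that
    d_p = 0 for perfectly correlated profiles and d_p = 1 for perfectly
    anti-correlated ones."""
    rho = np.corrcoef(x, y)[0, 1]
    return (1.0 - rho) / 2.0

# Illustration: y2 is the square of y1 (a nonlinear relationship), so the
# linear correlation, and hence d_p, says little about their dependence.
t = np.linspace(-1, 1, 50)
y1 = t
y2 = t ** 2
print(pearson_dissimilarity(y1, y2))  # close to 0.5, since rho is close to 0
```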
Binary Variable
• A contingency table for binary data:

                   Object j
                   1        0        sum
    Object i   1   a        b        a + b
               0   c        d        c + d
             sum   a + c    b + d    p

• Simple matching dissimilarity:
$$ d(i, j) = \frac{p - m}{p} $$
where $p$ is the total number of binary variables and $m = a + d$ is the number of variables on which objects $i$ and $j$ agree (equivalently, $d(i, j) = \frac{b + c}{p}$).
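A minimal sketch, assuming numpy, of the simple matching dissimilarity d(i, j) = (p − m)/p for two binary profiles; the example vectors are made up.

```python
import numpy as np

def simple_matching_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where p is the number of binary variables and
    m = a + d is the number of positions where the two objects agree."""
    obj_i = np.asarray(obj_i)
    obj_j = np.asarray(obj_j)
    p = obj_i.size
    m = np.sum(obj_i == obj_j)   # a (1/1 matches) + d (0/0 matches)
    return (p - m) / p

# Made-up binary profiles for illustration
print(simple_matching_dissimilarity([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # 0.4
```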
Clustering Algorithms –
1. Hierarchical Clustering
• Agglomerative (bottom-up):
• Start with singletons (clusters containing a single element)
• Keep merging them until the whole data set S is reached as the root
• This is the most common approach.
• Divisive (top-down):
• Recursively partitioning S until singleton sets are reached.
Linkage in Hierarchical Clustering
• We already know about distance measures
between data items, but what about between a
data item and a cluster or between two clusters?
• We just treat a data point as a cluster with a single
item, so our only problem is to define a linkage
method between clusters
• As usual, there are lots of choices…
Average Linkage
• Eisen’s cluster program defines average linkage as
follows:
• Each cluster ci is associated with a mean vector μi, which
is the mean of all the data items in the cluster
• The distance between two clusters ci and cj is then just
d(μi, μj)
• This is somewhat non-standard – this method is
usually referred to as centroid linkage and average
linkage is defined as the average of all pairwise
distances between points in the two clusters
Single Linkage
• The minimum of all pairwise distances between
points in the two clusters
• Tends to produce long, “loose” clusters
Complete Linkage
• The maximum of all pairwise distances between
points in the two clusters
• Tends to produce very “tight” clusters
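To make the linkage definitions concrete, here is a small sketch (assuming numpy and Euclidean point distances) of single, complete, average, and centroid-style linkage between two clusters of points; the helper names are ours, not from any particular library.

```python
import numpy as np

def _pairwise(ci, cj):
    """All pairwise Euclidean distances between points of clusters ci and cj."""
    ci, cj = np.asarray(ci, dtype=float), np.asarray(cj, dtype=float)
    return np.linalg.norm(ci[:, None, :] - cj[None, :, :], axis=-1)

def single_linkage(ci, cj):
    return _pairwise(ci, cj).min()    # minimum pairwise distance

def complete_linkage(ci, cj):
    return _pairwise(ci, cj).max()    # maximum pairwise distance

def average_linkage(ci, cj):
    return _pairwise(ci, cj).mean()   # mean of all pairwise distances

def centroid_linkage(ci, cj):
    # What Eisen's cluster program calls "average linkage":
    # the distance between the two cluster mean vectors
    return np.linalg.norm(np.mean(ci, axis=0) - np.mean(cj, axis=0))
```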
Hierarchical Agglomerative Clustering
• We start with every data point in a separate cluster
• We keep merging the most similar pairs of data
points/clusters until we have one big cluster left
• This is called a bottom-up or agglomerative method
Hierarchical Agglomerative Clustering
• This produces a
binary tree or
dendrogram
• The final cluster is
the root and each
data item is a leaf
• The height of the
bars indicates how
close the items are
Hierarchical Clustering Example
Formation of Clusters
• Forming clusters from dendrograms
Single-Link Method
Euclidean Distance
Example: four points a, b, c, d with the initial distance matrix

          b   c   d
      a   2   5   6
      b       3   5
      c           4

Step 1: merge a and b (distance 2). Single-link distances to the new cluster are the minima of the old ones:

            c   d
      a,b   3   5
      c         4

Step 2: merge {a,b} and c (distance 3):

              d
      a,b,c   4

Step 3: merge {a,b,c} and d (distance 4), giving the single cluster {a,b,c,d}.
Complete-Link Method
Euclidean Distance
Example: the same four points a, b, c, d with the initial distance matrix

          b   c   d
      a   2   5   6
      b       3   5
      c           4

Step 1: merge a and b (distance 2). Complete-link distances to the new cluster are the maxima of the old ones:

            c   d
      a,b   5   6
      c         4

Step 2: merge c and d (distance 4):

            c,d
      a,b   6

Step 3: merge {a,b} and {c,d} (distance 6), giving the single cluster {a,b,c,d}.
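The worked example above can be reproduced with scipy's hierarchical clustering routines; this sketch assumes the 4×4 distance matrix from the slide and simply prints the merge steps.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# Distance matrix from the slide (points a, b, c, d)
D = np.array([[0, 2, 5, 6],
              [2, 0, 3, 5],
              [5, 3, 0, 4],
              [6, 5, 4, 0]], dtype=float)
condensed = squareform(D)   # condensed upper-triangular form expected by linkage()

# Each output row: (cluster 1, cluster 2, merge distance, new cluster size)
print(linkage(condensed, method='single'))    # merges at heights 2, 3, 4
print(linkage(condensed, method='complete'))  # merges at heights 2, 4, 6
```

The merge heights in the third column match the dendrogram heights compared on the next slide: the single-link tree finishes at 4, the complete-link tree at 6.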
Compare Dendrograms
(Figure: single-link vs. complete-link dendrograms for a, b, c, d; the single-link tree completes its merges at height 4, the complete-link tree at height 6)
Hierarchical Clustering Issues
• Distinct clusters are not produced – sometimes this
can be good, if the data has a hierarchical structure
without clear boundaries
• There are methods for producing distinct clusters,
but these usually involve specifying somewhat
arbitrary cutoff values
• What if data doesn’t have a hierarchical structure?
Is HC appropriate?
Hierarchical Clustering
• Advantages
• Dendrograms are great for visualization
• Provides hierarchical relations between clusters
• Shown to be able to capture concentric clusters
• Disadvantages
• It is not easy to choose where to cut the tree to define clusters
• Experiments have shown that other clustering techniques
can outperform hierarchical clustering
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages
• Use the Single-Link method and the dissimilarity matrix.
• Merge the nodes that have the least dissimilarity
• Continue merging in non-descending order of dissimilarity
• Eventually all nodes belong to the same cluster
(Figure: three panels showing AGNES successively merging example 2-D data points into larger clusters)
DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
(Figure: three panels showing DIANA successively splitting the same example 2-D data, the inverse of AGNES)
Clustering Algorithms –
2. k-means Clustering
k-means Clustering
1. Choose a number of clusters k
2. Initialize cluster centers μ1, …, μk
• Could pick k data points and set cluster centers to these
points
• Or could randomly assign points to clusters and take
means of clusters
3. For each data point, compute the cluster center it is
closest to (using some distance measure) and assign the
data point to this cluster
4. Re-compute cluster centers (mean of data points in cluster)
5. Repeat steps 3 and 4; stop when there are no new re-assignments
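A minimal numpy sketch of these five steps (not an optimized or library implementation); the function name and defaults are ours.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n_points, n_features) array."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: initialize centers by picking k distinct data points at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no assignment changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-compute each center as the mean of its assigned points
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return labels, centers
```

For example, `labels, centers = kmeans(X, k=3)` returns a cluster index for every row of X together with the final centers; different seeds can give different clusterings, which is the non-determinism discussed below.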
k-means Clustering
• Example
(Figure: successive iterations of k-means on example 2-D data)
k-means Clustering
• Stopping criteria:
• No change in the members of any cluster, or
• the squared error falls below some small threshold value
• Squared error:
$$ se = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2 $$
• where $m_i$ is the mean of all instances in cluster $C_i$
• i.e., stop when $se^{(j)} < \varepsilon$ after the $j$-th iteration, for a small threshold $\varepsilon$
• Properties of k-means
• Guaranteed to converge
• Guaranteed to reach a local optimum, but not necessarily the
global optimum.
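The squared error se above can be computed directly from the output of the k-means sketch; again a minimal illustration assuming numpy.

```python
import numpy as np

def squared_error(X, labels, centers):
    """se = sum_i sum_{p in C_i} ||p - m_i||^2, with m_i the mean of cluster C_i."""
    X = np.asarray(X, dtype=float)
    se = 0.0
    for i, m_i in enumerate(centers):
        members = X[labels == i]
        se += np.sum(np.linalg.norm(members - m_i, axis=1) ** 2)
    return se
```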
k-means Clustering
• Pros:
• Low complexity
• Cons:
• Necessity of specifying k
• Sensitive to noise and outlier data points
• Outliers: even a small number of outlying points can
substantially influence the mean value
• Clusters are sensitive to initial assignment of centroids
• k-means is not a deterministic algorithm
• Clusters can be inconsistent from one run to another
k-means Clustering Issues
• Random initialization means that you may get
different clusters each time
• Data points are assigned to only one cluster (hard
assignment)
• Implicit assumptions about the “shapes” of clusters
(more about this in project #3)
• You have to pick the number of clusters…
Determining # of Clusters
• We’d like to have a measure of cluster quality Q and
then try different values of k until we get an
optimal value for Q
• But, since clustering is an unsupervised learning
method, we can’t really expect to find a “correct”
measure Q.
• So, once again there are different choices of Q and
our decision will depend on what dissimilarity
measure we’re using and what types of clusters we
want
Cluster Quality Measures
• Jagota (p.36) suggests a measure that emphasizes
cluster tightness or homogeneity:
$$ Q = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x \in C_i} d(x, \mu_i) $$
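A small sketch of Jagota's homogeneity measure Q, assuming numpy, Euclidean distance for d, and the labels/centers produced by a clustering run such as the k-means sketch above.

```python
import numpy as np

def cluster_homogeneity_Q(X, labels, centers):
    """Q = sum_i (1/|C_i|) * sum_{x in C_i} d(x, mu_i), using Euclidean d."""
    X = np.asarray(X, dtype=float)
    Q = 0.0
    for i, mu_i in enumerate(centers):
        members = X[labels == i]
        if len(members) > 0:
            Q += np.mean(np.linalg.norm(members - mu_i, axis=1))
    return Q   # smaller Q means tighter (more homogeneous) clusters
```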
Cluster Quality
• The Q measure takes into account homogeneity within
clusters, but not separation between clusters
• Other measures try to combine these two characteristics
(e.g., the Davies-Bouldin index and the Silhouette)
Davies-Bouldin index:
$$ DB = \frac{1}{k} \sum_{i=1}^{k} D_i, \qquad D_i = \max_{j \neq i} R_{i,j}, \qquad R_{i,j} = \frac{S_i + S_j}{M_{i,j}} $$
$$ S_i = \left( \frac{1}{T_i} \sum_{j=1}^{T_i} \lVert X_j - A_i \rVert^p \right)^{1/p} \ \text{(within-cluster scatter)}, \qquad M_{i,j} = \lVert A_i - A_j \rVert_p \ \text{(between-cluster separation)} $$
where $A_i$ is the centroid of cluster $i$, $T_i$ is its size, and the $X_j$ are its members; lower $DB$ indicates better clustering.
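In practice the Davies-Bouldin index can be computed with scikit-learn's davies_bouldin_score; the two-blob data below is made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Made-up 2-D data: two well-separated blobs with known labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)

# Lower values indicate tighter clusters relative to their separation
print(davies_bouldin_score(X, labels))
```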
Silhouette
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
where $a(i)$ is the average distance between $i$ and all other data points within the
same cluster, and $b(i)$ is the lowest average distance from $i$ to all points in
any other cluster of which $i$ is not a member (that cluster is $i$'s "neighboring cluster").
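A direct, unoptimized sketch of the per-point silhouette s(i) from the formula above, assuming numpy and Euclidean distances; libraries such as scikit-learn also provide silhouette_score for the average value.

```python
import numpy as np

def silhouette_values(X, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point i."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # a(i) excludes the point itself
        a_i = dists[i, same].mean() if same.any() else 0.0
        # b(i): lowest average distance from i to the points of any other cluster
        b_i = min(dists[i, labels == c].mean()
                  for c in np.unique(labels) if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)
    return s
```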