
Clustering Algorithms

The document discusses different clustering techniques including hierarchical and partitional clustering. It describes hierarchical agglomerative clustering and three linkage methods - single, complete, and average linkage. It also explains k-means clustering including how it works, the algorithm, updating cluster means, and stopping criteria. Clustering and biclustering of microarray data is also briefly mentioned.

Uploaded by

Ayesha Khan

Clustering

Dr. Zoya Khalid


[email protected]
Clustering techniques

• Hierarchical: Organizes elements into a tree; leaves represent the objects (genes, etc.) and the lengths of the paths between leaves represent the distances between objects. Similar objects lie within the same subtrees. It has two types:
• Agglomerative (Bottom-Up): Start with every element in its own cluster, and iteratively join clusters together
• Divisive (Top-Down): Start with one cluster and iteratively divide it into smaller clusters
Continued…
Measures of similarity and dissimilarity (distance)
• There are many different ways of calculating similarity and distance
• Knowing your data is important
• When working with distances, pay attention to three properties: positivity, symmetry, and the triangle inequality
• Examples
• Euclidean distance
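As a concrete sketch (the point values are illustrative, not from the slides), Euclidean distance and the three properties named above can be checked directly:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

a, b, c = (0, 0), (3, 4), (6, 0)
assert euclidean(a, b) >= 0                                  # positivity
assert euclidean(a, b) == euclidean(b, a)                    # symmetry
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)  # triangle inequality
```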
Hierarchical Agglomerative Clustering
Most Hierarchical clustering algorithms are agglomerative
Three Techniques
Hierarchical clustering: Recomputing distances

d_min(C, C*) = min d(x, y)  for all elements x in C and y in C*

• Distance between two clusters is the smallest distance between any pair of their elements (single-linkage)

d_max(C, C*) = max d(x, y)  for all elements x in C and y in C*

• Distance between two clusters is the largest distance between any pair of their elements (complete-linkage)

d_avg(C, C*) = (1 / (|C| |C*|)) Σ d(x, y)  for all elements x in C and y in C*

• Distance between two clusters is the average distance between all pairs of their elements (average-linkage)
Single Linkage example
Single Linkage continued

(dendrogram figure over elements A, B, D, F)
Continued
Complete Linkage Method
Continued…
Continued…
Continued…
Which Distance Measure is Better?
• Each method has both advantages and disadvantages; the choice is application-dependent. Single-link and complete-link are the most common methods
• Single-link
• Can find irregular-shaped clusters
• Sensitive to outliers
• Complete-link, Average-link
• Robust to outliers
• Tend to break large clusters
• Prefer spherical clusters (smaller sized)
Partitional clustering
• It determines all clusters at once. Such methods include:
• K-means and derivatives
• Fuzzy c-means clustering
• QT clustering algorithm
K-Means Clustering
• Consider an example in which our vectors have 2 dimensions

(figure: profiles plotted as points in two dimensions, with "+" marking each cluster center)
K-Means clustering
• each iteration involves two steps
• assignment of profiles to clusters
• re-computation of the cluster centers (means)
(figure: left panel, assignment of profiles to the nearest cluster center; right panel, re-computation of the cluster centers)

Example
Distance between two clusters
Distance from Cluster 1 – Cluster 2
Tabulate them
Tabulate the new dataset

(figures on these slides: scatter plots of the example points in the x-y plane at each step)
Elbow Method
• Run the algorithm multiple times with an increasing number of clusters, then plot a clustering score (e.g. the within-cluster sum of squares) as a function of the number of clusters; the "elbow" in the curve suggests a good value of K.
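A minimal sketch of the elbow method on 2-D points, using the within-cluster sum of squares as the clustering score (the helper name `kmeans_inertia` and the toy data are illustrative, not from the slides):

```python
import random

def kmeans_inertia(points, k, iters=50, seed=0):
    """Run a basic k-means on 2-D points and return the within-cluster
    sum of squared distances (the clustering score to plot)."""
    def sq(p, m):
        return (p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2
    means = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest mean
            clusters[min(range(k), key=lambda j: sq(p, means[j]))].append(p)
        means = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                 if c else means[j] for j, c in enumerate(clusters)]
    return sum(min(sq(p, m) for m in means) for p in points)

# two well-separated groups: the score drops sharply from k=1 to k=2,
# then flattens -- the "elbow" suggests k = 2
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
scores = {k: kmeans_inertia(data, k) for k in (1, 2, 3)}
```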
How the K-means algorithm works
K-Means clustering algorithm
• Input: K, the number of clusters, and a set X = {x1, …, xN} of data points, where the xi are p-dimensional vectors
• Initialize
• Select initial cluster means f1, …, fK
• Repeat until convergence
• Assign each xi to cluster C(i) such that

C(i) = argmin_{1 ≤ k ≤ K} ǁ xi − fk ǁ²

• Re-estimate the mean of each cluster based on its new members


K-means: updating the mean
• To compute the mean of the kth cluster:

fk = (1 / Nk) Σ_{i : C(i) = k} xi

where Nk is the number of points in cluster k, and the sum runs over all points assigned to cluster k


K-means stopping criteria
1. Assignments of objects to clusters do not change (convergence)

2. Maximum number of iterations is reached
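Putting the algorithm and both stopping criteria together, a minimal sketch (the slides do not specify an initialization, so this one simply takes the first K points as the starting means):

```python
def kmeans(X, K, max_iters=100):
    """Basic k-means: assign points, re-estimate means, stop when the
    assignments no longer change or max_iters is reached."""
    means = X[:K]  # assumed initialization: first K points as initial means
    assign = [None] * len(X)
    for _ in range(max_iters):
        # assignment step: C(i) = argmin_k || xi - fk ||^2
        new_assign = [
            min(range(K), key=lambda k: sum((xi - mk) ** 2
                                            for xi, mk in zip(x, means[k])))
            for x in X
        ]
        if new_assign == assign:  # stopping criterion 1: no change
            break
        assign = new_assign
        for k in range(K):        # re-estimate each cluster mean
            members = [X[i] for i in range(len(X)) if assign[i] == k]
            if members:
                means[k] = tuple(sum(col) / len(members)
                                 for col in zip(*members))
    return means, assign
```

For two well-separated pairs of points, e.g. `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)`, the loop converges in a few iterations with each pair in its own cluster.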


Microarray data
Clustering of microarray data
Clustering and Biclustering
• Biclustering - Identifies groups of genes with similar/coherent expression patterns under a specific subset of the conditions.
• Clustering - Identifies groups of genes/conditions that show similar activity patterns across the full set of conditions/genes under analysis.
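As a toy illustration of the distinction (the matrix values are made up):

```python
# genes x conditions expression matrix (toy values)
M = [
    [1.0, 1.1, 5.0, 5.1],  # gene g0
    [0.9, 1.0, 5.1, 5.0],  # gene g1
    [4.0, 0.2, 0.1, 4.2],  # gene g2
]

def row_dist(a, b):
    """Euclidean distance between two genes across ALL conditions."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# clustering compares full rows: g0 and g1 behave alike over every condition
assert row_dist(M[0], M[1]) < row_dist(M[0], M[2])

# biclustering instead looks at a subset of genes under a subset of
# conditions, e.g. genes {g0, g1} restricted to conditions {2, 3}
bicluster = [[M[g][c] for c in (2, 3)] for g in (0, 1)]
```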
