
Clustering

Instructor: Junghye Lee


School of Management Engineering
[email protected]
Motivation – Why Do Clustering?
What Is Clustering?
• A way of grouping together data samples that are
similar in some way - according to some criteria
that you pick
• A form of unsupervised learning – you generally
don’t have examples demonstrating how the data
should be grouped together
• So, it’s a method of data exploration – a way of
looking for patterns or structure in the data that
are of interest
Why clustering?
• Let's look at the problem from a different angle
• The issue here is dealing with high-dimensional data
• How do people deal with high-dimensional data?
• Start by finding interesting patterns associated with the data
• Clustering is a well-known technique with successful applications across a wide range of domains for finding patterns
• Some successes in applying clustering on microarray data
• Golub et al. (1999) used clustering techniques to discover
subclasses of AML and ALL from microarray data
• Eisen et al. (1998) used clustering techniques to group genes
of similar function together
• But what is clustering?
What Kind of Clusters?
• Cluster genes = columns
• Measure expression at multiple time-points, different
conditions, etc.
• Similar expression patterns may suggest similar
functions of genes (is this always true?)
• Cluster samples = rows
• e.g., expression levels of thousands of genes for each
tumor sample
• Similar expression patterns may suggest biological
relationship among samples
Introduction
• The goal of clustering is to
• group data points that are close (or similar) to each other
• identify such groupings (or clusters) in an unsupervised
manner
• Unsupervised: no information is provided to the algorithm
on which data points belong to which clusters
• Example
(Scatter plot of data points: what should be the clusters for these data points?)
What can we do with clustering?
• One of the major applications of clustering in
bioinformatics is on microarray data to cluster similar
genes
• Hypotheses:
• Genes with similar expression patterns imply that these genes are coexpressed
• Coexpressed genes can imply that
• they are involved in similar functions
• they are somehow related, for instance because their proteins
directly/indirectly interact with each other
• It is widely believed that coexpressed genes are involved in
similar functions
• But still, what can we really gain from doing clustering?
Purpose of Clustering on Microarray
• Suppose genes A and B are grouped in the same
cluster; then we hypothesize that genes A and B are
involved in similar functions.
• If we know the role of gene A is apoptosis
• but we do not know if gene B is involved in apoptosis
• we can do experiments to confirm if gene B indeed is
involved in apoptosis.
Purpose of Clustering on Microarray
• Suppose genes A and B are grouped in the same
cluster, then we hypothesize that proteins A and B
might interact with each other.
• So we can do experiments to confirm if such interaction
exists.
• So clustering microarray data in a way helps us
make hypotheses about:
• potential functions of genes
• potential protein-protein interactions
Does Clustering Always Work?
• Do coexpressed genes always imply that they have
similar functions?
• Not necessarily
• housekeeping genes
• genes which are always expressed or never expressed regardless of
the conditions
• there can be noise in microarray data
• But clustering is useful in:
• visualization of data
• hypothesis generation
Overview of clustering

• From the paper "Data Clustering: A Review" (Jain et al., 1999)


• Feature Selection
• identifying the most effective subset of the original features to use in
clustering
• Feature Extraction
• transformations of the input features to produce new salient features.
• Interpattern Similarity
• measured by a distance function defined on pairs of patterns.
• Grouping
• methods to group similar patterns in the same cluster
(Dis)similarity Measures
Data Representations for Clustering
• Input data to algorithm is usually a vector (also called a
“tuple” or “record”)
• Example: Clinical Sample Data
• Age (numerical)
• Weight (numerical)
• Gender (categorical)
• Diseased? (binary)
• Types of data
• Numerical
• Categorical
• Boolean
• Must also include a method for computing similarity of
or distance between vectors
How do we define “similarity”?
• Recall that the goal is to group together “similar”
data – but what does this mean?
• No single answer – it depends on what we want to
find or emphasize in the data; this is one reason
why clustering is an “art”
• The similarity measure is often more important
than the clustering algorithm used – don’t overlook
this choice!
Data structures
• Data matrix (two modes):

      | x_11  ...  x_1f  ...  x_1p |
      | ...   ...  ...   ...  ...  |
      | x_i1  ...  x_if  ...  x_ip |
      | ...   ...  ...   ...  ...  |
      | x_n1  ...  x_nf  ...  x_np |

• Dissimilarity matrix (one mode):

      |   0                            |
      | d(2,1)    0                    |
      | d(3,1)  d(3,2)    0            |
      |   :        :      :            |
      | d(n,1)  d(n,2)   ...  ...   0  |
(Dis)similarity measures
• Instead of talking about similarity measures, we
often equivalently refer to dissimilarity measures
(I’ll give an example of how to convert between
them in a few slides…)
• Jagota defines a dissimilarity measure as a function
f(x,y) such that f(x,y) > f(w,z) if and only if x is less
similar to y than w is to z
• This is always a pair-wise measure
• Think of x, y, w, and z as gene expression profiles
(rows or columns)
Continuous Variable
• Standardize data
  • Calculate the mean absolute deviation:

      s_f = (1/n) ( |x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f| )

    where

      m_f = (1/n) ( x_1f + x_2f + ... + x_nf )

  • Calculate the standardized measurement (z-score):

      z_if = (x_if − m_f) / s_f

• Using the mean absolute deviation is more robust than using the
standard deviation
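A minimal NumPy sketch of this standardization; the data matrix below is made up purely for illustration (rows are objects, columns are features).

import numpy as np

X = np.array([[63.0, 71.2],
              [48.0, 80.5],
              [55.0, 65.3]])

m = X.mean(axis=0)               # m_f: per-feature mean
s = np.abs(X - m).mean(axis=0)   # s_f: mean absolute deviation
Z = (X - m) / s                  # z_if: standardized measurement
print(Z)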
Distance Measure
• Euclidean distance

      d(g_1, g_2) = sqrt( Σ_{i=1..n} (x_i − y_i)² )

• Manhattan distance

      d(g_1, g_2) = Σ_{i=1..n} |x_i − y_i|

• Minkowski distance

      d(g_1, g_2) = ( Σ_{i=1..n} |x_i − y_i|^m )^(1/m)
(Figure: pairs of expression profiles with their Euclidean distances.)
• d_euc = 0.5846, d_euc = 1.1345, d_euc = 2.6115: these examples of Euclidean distance match the intuition of dissimilarity pretty well.
• d_euc = 1.41, d_euc = 1.22: what about these? What might be going on with the expression profiles on the left? On the right?
Correlation
• We might care more about the overall shape of expression
profiles rather than the actual magnitudes
• That is, we might want to consider genes similar when they
are “up” and “down” together
• When might we want this kind of measure? What
experimental issues might make this appropriate?
Pearson Linear Correlation
      ρ(x, y) = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) / ( sqrt( Σ_{i=1..n} (x_i − x̄)² ) · sqrt( Σ_{i=1..n} (y_i − ȳ)² ) )

      where x̄ = (1/n) Σ_i x_i and ȳ = (1/n) Σ_i y_i

• We're shifting the expression profiles down (subtracting the
means) and scaling by the standard deviations (i.e., making
the data have mean = 0 and std = 1)
Pearson Linear Correlation
• Pearson linear correlation (PLC) is a measure that is
invariant to scaling and shifting (vertically) of the expression
values
• Always between –1 and +1 (perfectly anti-correlated and
perfectly correlated)
• This is a similarity measure, but we can easily make it into a
dissimilarity measure:

      d_p = (1 − ρ(x, y)) / 2
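A minimal sketch using SciPy's Pearson correlation together with the conversion above; here y is just a scaled and shifted copy of x, so ρ = 1 and d_p = 0, illustrating the invariance discussed on this slide.

import numpy as np
from scipy.stats import pearsonr

x = np.array([0.1, 0.8, 1.5, 0.9, 0.2])
y = 2.0 * x + 0.8                # scaled and shifted copy of x

rho, _ = pearsonr(x, y)          # Pearson linear correlation, in [-1, 1]
d_p = (1.0 - rho) / 2.0          # dissimilarity, in [0, 1]
print(rho, d_p)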
Pearson Linear Correlation
• PLC only measures the degree of a linear
relationship between two expression profiles!
• If you want to measure other relationships, there
are many other possible measures.

 = 0.0249, so dp = 0.4876
The green curve is the square of the blue
curve – this relationship is not captured
with PLC
Binary Variable
• A contingency table for binary data

                        Object j
                      1       0       sum
    Object i    1     a       b       a+b
                0     c       d       c+d
              sum    a+c     b+d       p

• Simple matching coefficient (invariant, if the binary variable is symmetric):

      d(i, j) = (b + c) / (a + b + c + d)

• Jaccard coefficient (noninvariant if the binary variable is asymmetric):

      d(i, j) = (b + c) / (a + b + c)
Dissimilarity of Binary Variables
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
• gender is a symmetric attribute
• the remaining attributes are asymmetric binary
• let the values Y and P be set to 1, and the value N be set to 0
01
d ( jack , mary )   0.33
2 01
11
d ( jack , jim )   0.67
111
1 2
d ( jim , mary )   0.75
11 2
Nominal Variables
• A generalization of the binary variable in that it can
take more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
• m: # of matches, p: total # of variables

      d(i, j) = (p − m) / p

• Method 2: use a large number of binary variables


• creating a new binary variable for each of the M nominal
states
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replacing x_if by its rank r_if ∈ {1, ..., M_f}
• map the range of each variable onto [0, 1] by replacing the i-th
object in the f-th variable by

      z_if = (r_if − 1) / (M_f − 1)
• compute the dissimilarity using methods for interval-scaled
variables
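For example, a minimal sketch of the rank-to-[0, 1] mapping; the rank values below are made up.

# ranks r_if in {1, ..., M_f} for one ordinal variable with M_f = 5 states
ranks = [1, 3, 5, 2]
M_f = 5

z = [(r - 1) / (M_f - 1) for r in ranks]    # z_if = (r_if - 1) / (M_f - 1)
print(z)                                    # [0.0, 0.5, 1.0, 0.25]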
Clustering Algorithms –
1. Hierarchical Clustering
Hierarchical Clustering
• There are two styles of hierarchical clustering
algorithms to build a tree from the input set S:

• Agglomerative (bottom-up):
• Beginning with singletons (sets with 1 element)
• Merging them until S is achieved as the root.
• It is the most common approach.

• Divisive (top-down):
• Recursively partitioning S until singleton sets are reached.
Linkage in Hierarchical Clustering
• We already know about distance measures
between data items, but what about between a
data item and a cluster or between two clusters?
• We just treat a data point as a cluster with a single
item, so our only problem is to define a linkage
method between clusters
• As usual, there are lots of choices…
Average Linkage
• Eisen’s cluster program defines average linkage as
follows:
• Each cluster c_i is associated with a mean vector μ_i, which
is the mean of all the data items in the cluster
• The distance between two clusters c_i and c_j is then just
d(μ_i, μ_j)
• This is somewhat non-standard – this method is
usually referred to as centroid linkage and average
linkage is defined as the average of all pairwise
distances between points in the two clusters
Single Linkage
• The minimum of all pairwise distances between
points in the two clusters
• Tends to produce long, “loose” clusters
Complete Linkage
• The maximum of all pairwise distances between
points in the two clusters
• Tends to produce very “tight” clusters
Hierarchical Agglomerative Clustering
• We start with every data point in a separate cluster
• We keep merging the most similar pairs of data
points/clusters until we have one big cluster left
• This is called a bottom-up or agglomerative method
Hierarchical Agglomerative Clustering

• This produces a
binary tree or
dendrogram
• The final cluster is
the root and each
data item is a leaf
• The height of the bars indicates how close the items are
Hierarchical Clustering Example
Hierarchical Clustering Example
Formation of Clusters
• Forming clusters from dendrograms
Single-Link Method
Euclidean Distance

Merging order: (1) {a, b}, (2) {a, b, c}, (3) {a, b, c, d}

Initial distance matrix:

           b   c   d
      a    2   5   6
      b        3   5
      c            4

After merging a and b (single link keeps the minimum pairwise distance):

             c   d
      a,b    3   5
      c          4

After merging {a, b} and c:

               d
      a,b,c    4
Complete-Link Method
Euclidean Distance

Merging order: (1) {a, b}, (2) {c, d}, (3) {a, b, c, d}

Initial distance matrix:

           b   c   d
      a    2   5   6
      b        3   5
      c            4

After merging a and b (complete link keeps the maximum pairwise distance):

             c   d
      a,b    5   6
      c          4

After merging c and d:

             c,d
      a,b     6
Compare Dendrograms
Single-Link Complete-Link

(Figure: side-by-side dendrograms for the example above; the single-link tree merges at heights 2, 3, and 4, while the complete-link tree merges at heights 2, 4, and 6.)
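This pair of trees can be reproduced with a short SciPy sketch; the pairwise distances are taken from the example above, and matplotlib is assumed to be available for drawing the dendrograms.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# condensed distance vector for pairs (a,b), (a,c), (a,d), (b,c), (b,d), (c,d)
dists = np.array([2.0, 5.0, 6.0, 3.0, 5.0, 4.0])
labels = ["a", "b", "c", "d"]

Z_single = linkage(dists, method="single")      # merges at heights 2, 3, 4
Z_complete = linkage(dists, method="complete")  # merges at heights 2, 4, 6

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
dendrogram(Z_single, labels=labels, ax=axes[0])
axes[0].set_title("Single-Link")
dendrogram(Z_complete, labels=labels, ax=axes[1])
axes[1].set_title("Complete-Link")
plt.show()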
Hierarchical Clustering Issues
• Distinct clusters are not produced – sometimes this
can be good, if the data has a hierarchical structure
without clear boundaries
• There are methods for producing distinct clusters,
but these usually involve specifying somewhat
arbitrary cutoff values
• What if data doesn’t have a hierarchical structure?
Is HC appropriate?
Hierarchical Clustering
• Advantages
• Dendrograms are great for visualization
• Provides hierarchical relations between clusters
• Shown to be able to capture concentric clusters

• Disadvantages
• Not easy to define levels for clusters
• Experiments have shown that other clustering techniques can
outperform hierarchical clustering
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages
• Use the Single-Link method and the dissimilarity matrix.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
(Figure: three scatter plots showing AGNES progressively merging nearby points into larger clusters.)
DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages
• Inverse order of AGNES
• Eventually each node forms a cluster on its own

(Figure: three scatter plots showing DIANA progressively splitting the data until each point forms its own cluster.)
Clustering Algorithms –
2. k-means Clustering
k-means Clustering
1. Choose a number of clusters k
2. Initialize cluster centers μ_1, ..., μ_k
• Could pick k data points and set cluster centers to these
points
• Or could randomly assign points to clusters and take
means of clusters
3. For each data point, compute the cluster center it is
closest to (using some distance measure) and assign the
data point to this cluster
4. Re-compute cluster centers (mean of data points in cluster)
5. Stop when there are no new re-assignments
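A minimal scikit-learn sketch of these steps; the two-dimensional blob data below is randomly generated purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three synthetic 2-D blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # re-computed cluster centers (step 4)
print(kmeans.labels_[:10])      # cluster assignment of each point (step 3)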
k-means Clustering
• Example
(Figure: four scatter plots showing k-means iterations: initial cluster centers, assignment of points to the nearest center, re-computed centers, and the final converged clusters.)
k-means Clustering
• Stopping criteria:
• No change in the members of all clusters
• when the squared error is less than some small threshold
value ε
• Squared error:

      se = Σ_{i=1..k} Σ_{p ∈ C_i} |p − m_i|²

• where m_i is the mean of all instances in cluster C_i
• se(j) < ε (after the j-th iteration)
• Properties of k-means
• Guaranteed to converge
• Guaranteed to reach a local optimum, not necessarily the global
optimum
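For reference, scikit-learn exposes this squared error as the inertia_ attribute of a fitted model, so it can be inspected directly (this assumes the kmeans object from the earlier sketch):

# sum of squared distances of each point to its closest cluster center
print(kmeans.inertia_)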
k-means Clustering
• Pros:
• Low complexity

• Cons:
• Necessity of specifying k
• Sensitive to noise and outlier data points
• Outliers: a small number of such points can substantially
influence the mean value
• Clusters are sensitive to initial assignment of centroids
• k-means is not a deterministic algorithm
• Clusters can be inconsistent from one run to another
k-means Clustering Issues
• Random initialization means that you may get
different clusters each time
• Data points are assigned to only one cluster (hard
assignment)
• Implicit assumptions about the “shapes” of clusters
(more about this in project #3)
• You have to pick the number of clusters…
Determining # of Clusters
• We’d like to have a measure of cluster quality Q and
then try different values of k until we get an
optimal value for Q
• But, since clustering is an unsupervised learning
method, we can’t really expect to find a “correct”
measure Q.
• So, once again there are different choices of Q and
our decision will depend on what dissimilarity
measure we’re using and what types of clusters we
want
Cluster Quality Measures
• Jagota (p.36) suggests a measure that emphasizes
cluster tightness or homogeneity:
      Q = Σ_{i=1..k} (1/|C_i|) Σ_{x ∈ C_i} d(x, μ_i)

• |C_i| is the number of data points in cluster i
• Q will be small if (on average) the data points in
each cluster are close
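A minimal NumPy sketch of this homogeneity measure, assuming Euclidean distance; the function and variable names are chosen for illustration, and the commented call reuses the data and k-means fit from the earlier sketch.

import numpy as np

def cluster_quality(X, labels, centers):
    # Q = sum over clusters of the average distance of members to their center
    q = 0.0
    for i, center in enumerate(centers):
        members = X[labels == i]
        if len(members) > 0:
            q += np.mean(np.linalg.norm(members - center, axis=1))
    return q

# example: cluster_quality(X, kmeans.labels_, kmeans.cluster_centers_)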
Cluster Quality
• This is a plot of the Q measure for k-means clustering on the
data shown earlier.
• How many clusters do you think there actually are?

(Plot: Q versus the number of clusters k.)
Cluster Quality
• The Q measure takes into account homogeneity within
clusters, but not separation between clusters
• Other measures try to combine these two characteristics
(e.g., the Davies-Bouldin measure, the Silhouette coefficient)

• An alternate approach is to look at cluster stability:


• Add random noise to the data many times and count
how many pairs of data points no longer cluster together
• How much noise to add? Should reflect estimated
variance in the data
Davies-Bouldin Measure
      DB = (1/n_c) Σ_{i=1..n_c} D_i

      where D_i = max_{j ≠ i} R_{i,j} and R_{i,j} = (S_i + S_j) / M_{i,j}

      where S_i = ( (1/T_i) Σ_{j=1..T_i} |X_j − A_i|^p )^(1/p)   (within-cluster scatter)
      and M_{i,j} = ||A_i − A_j||_p                              (between-cluster separation)
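scikit-learn ships this measure as davies_bouldin_score; a minimal sketch, assuming the data X and the k-means labels from the earlier sketch (lower values indicate tighter, better-separated clusters):

from sklearn.metrics import davies_bouldin_score

print(davies_bouldin_score(X, kmeans.labels_))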
Silhouette
      s(i) = ( b(i) − a(i) ) / max( a(i), b(i) )

where a(i) is the average distance between i and all other data points within the
same cluster, and b(i) is the lowest average distance of i to all points in any other
cluster of which i is not a member (i.e., the distance to the "neighboring cluster").

Equivalently,

      s(i) = 1 − a(i)/b(i)    if a(i) < b(i)
      s(i) = 0                if a(i) = b(i)
      s(i) = b(i)/a(i) − 1    if a(i) > b(i)

What is the range of this measure?  −1 ≤ s(i) ≤ 1
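scikit-learn also provides silhouette_score (the average s(i) over all points) and silhouette_samples (per-point values); a minimal sketch, again assuming X and the k-means labels from the earlier sketch:

from sklearn.metrics import silhouette_score, silhouette_samples

print(silhouette_score(X, kmeans.labels_))        # closer to +1 is better
print(silhouette_samples(X, kmeans.labels_)[:5])  # per-point s(i) values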
Fuzzy c-means
• An extension of k-means
• Hierarchical clustering and k-means generate partitions
• each data point can only be assigned to one cluster
• Fuzzy c-means allows data points to be assigned
into more than one cluster
• each data point has a degree of membership (or
probability) of belonging to each cluster
Fuzzy c-means Algorithm
• Let x_i be a vector of values for data point g_i.
1. Initialize the membership matrix U(0) = [u_ij] for data points g_i and clusters cl_j at random
2. At the k-th step, compute the fuzzy centroids C(k) = [c_j] for
j = 1, ..., n_c, where n_c is the number of clusters, using

      c_j = Σ_{i=1..n} (u_ij)^m x_i / Σ_{i=1..n} (u_ij)^m

   where m is the fuzzy parameter and n is the number of data points.
Fuzzy c-means Algorithm
3. Update the fuzzy membership U(k) = [u_ij], using

      u_ij = ( 1 / ||x_i − c_j||² )^(1/(m−1)) / Σ_{l=1..n_c} ( 1 / ||x_i − c_l||² )^(1/(m−1))

4. If ||U(k) − U(k−1)|| < ε, then STOP; else return to step 2.
5. Determine a membership cutoff
   • For each data point g_i, assign g_i to cluster cl_j if u_ij of U(k) is greater than the chosen cutoff value
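A compact NumPy sketch of this loop, written directly from the update equations above; the data, the number of clusters, m, and the tolerance are chosen purely for illustration.

import numpy as np

def fuzzy_c_means(X, n_c=2, m=2.0, eps=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, n_c))
    U /= U.sum(axis=1, keepdims=True)       # memberships of each point sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]    # fuzzy centroids c_j
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U_new = (1.0 / d2) ** (1.0 / (m - 1))
        U_new /= U_new.sum(axis=1, keepdims=True)         # membership update u_ij
        if np.linalg.norm(U_new - U) < eps:               # ||U(k) - U(k-1)|| < eps
            U = U_new
            break
        U = U_new
    return centers, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(4.0, 0.5, (20, 2))])
centers, U = fuzzy_c_means(X, n_c=2)
print(centers)
print(U[:3])   # soft memberships of the first three points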
Fuzzy c-means
• Pros:
• Allows a data point to be in multiple clusters
• A more natural representation of the behavior of genes
• genes usually are involved in multiple functions
• Cons:
• Need to define c, the number of clusters
• Need to determine membership cutoff value
• Clusters are sensitive to initial assignment of centroids
• Fuzzy c-means is not a deterministic algorithm
Other Clustering Algorithms
• Clustering is a very popular method of microarray
analysis and also a well established statistical technique
– huge literature out there
• Many variations on k-means, including algorithms in
which clusters can be split and merged or that allow for
soft assignments (multiple clusters can contribute)
• Semi-supervised clustering methods, in which some
examples are assigned by hand to clusters and then
other membership information is inferred
