K-Means Clustering and Related Algorithms
Ryan P. Adams
COS 324 – Elements of Machine Learning
Princeton University
In its broadest definition, machine learning is about automatically discovering structure in data.
Those data can take many forms, depending on what our goals are. The data might be something
like images and labels, in a supervised visual object recognition task, or they might be something
much more abstract, such as the “life experiences” of a robot embodied in the world. Here we will
examine one of the simplest ideas for discovering structure: we will look for groups, or clusters
among the data. Intuitively, a cluster is a subset of data in which the data are in some sense more
similar to each other than they are to data not in the cluster. For example, in a large set of news
articles, one cluster might correspond to a group of documents that are about baseball, because
these will tend to use similar words and be very different from documents about, say, international
politics, which might form a different cluster.
Finding groups in data is just one motivation for clustering. We might also imagine prediction
tasks based on these groups. For example, if we had a data set with images of different kinds of
animals, we might hope that a clustering algorithm would discover the animal types and be able to
answer questions such as “is the animal in image A of the same type as the animal in image B?” We
can also motivate clustering algorithms by thinking about compression. If I had to summarize a
large data set by a small set of examples, it might be sensible to choose those examples to represent
different distinct groups in the data. When each of the data is well described by its associated
example, then I might believe I had discovered some interesting structure. It can also be useful,
as we’ll see later in the course, to think of clustering algorithms as providing features of the data
that summarize a lot of information in the data in a concise way. Good features are critical for
supervised learning tasks to perform well (Coates et al., 2011). Even simple clustering algorithms
can discover important information about, for example, whether a recorded speech signal has a male
or female speaker. Finally, we can also view clustering algorithms as a funny kind of classification
in which we have to discover the labels for ourselves. That is, maybe we have a big bag of images
and we don’t a priori have concepts such as “horse” or “dog” and so we have to discover these
(hopefully) coherent groups from scratch.
1 Clustering
For most clustering algorithms, the main thing that we need is a notion of distance between our
data. If our data live in some space X, then we need a function that takes two points in X, say, x and x′, and computes a distance between them. We'll write such a distance as ||x − x′||.
(a) Initialization (b) Iteration 1 (c) Iteration 2 (d) Iteration 3 (e) Iteration 4
Figure 1: Four iterations of K-Means applied to the lengths and widths of fruit (oranges and
lemons), as measured in centimeters. Here, K = 5 and the initial means were chosen randomly in
the square shown. The colored regions show the Voronoi partitions induced by each cluster center,
and the data are colored according to their association. The cluster centers are shown as ×.
If X is R^D, then a natural choice would be the Euclidean (L2) distance we've studied this term in other contexts:

$$\|x - x'\|_2 = \sqrt{\sum_{d=1}^{D} (x_d - x_d')^2}. \qquad (1)$$
There are many distance metrics that one might come up with, depending on what your data are
and what “similarity” means for the problem you want to solve. For strings or DNA sequences,
one might use edit distance.1 For bit vectors, it might be sensible to use Hamming distance.2 This
choice is important because it will determine whether two objects should want to be in the same
group or not.
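To make this concrete, here is a small Python sketch (my own illustration, not from the original notes) of two distance functions one might plug into a clustering algorithm:

import numpy as np

def euclidean_distance(x, xp):
    """L2 distance between two real-valued vectors."""
    return np.sqrt(np.sum((x - xp) ** 2))

def hamming_distance(b, bp):
    """Number of positions at which two bit vectors differ."""
    return int(np.sum(b != bp))

# Example usage on made-up data.
x, xp = np.array([7.1, 3.2]), np.array([9.4, 4.0])
b, bp = np.array([1, 0, 1, 1]), np.array([1, 1, 1, 0])
print(euclidean_distance(x, xp), hamming_distance(b, bp))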
You also need to decide how many groups you want. This is the number K that gives the
K-means algorithm its name. Choosing K can be a bit of an art, because it depends on what kind of
structure you’re looking for. Sometimes you might know how many clusters there are in advance.
Other times, you might want to “oversegment” or “undersegment” the data, depending on whether
you’d like to end up with many clusters or just a few. In the compression view of clustering, this
boils down to asking whether you’d like a more compressed representation that loses information
(smaller K ), or a less compressed representation that keeps more information about your data
(larger K ). If you’re using K-Means for learning a feature representation, it’s usually a good idea to
use a larger K . If you want to interpret the groups, then perhaps you want to go with a smaller K .
Our data are N points in X . Let’s denote the nth of these as xn , so we can write the data as the
set {x_n}_{n=1}^N. Clustering algorithms assign every one of these data to one of the K clusters. What
we’re doing is trying to find a good (ideally, the best) assignment of the data to the clusters. We
represent these assignments by giving every one of the N data a binary responsibility vector rn .
This vector is all zeros except in one component, which corresponds to the cluster it is assigned to.
That is, if xn is assigned to cluster k , then rnk = 1 and all of the other entries in rn are zero. This
1http://en.wikipedia.org/wiki/Edit_distance
2http://en.wikipedia.org/wiki/Hamming_distance
Algorithm 1 K-Means Clustering (Lloyd's Algorithm). Note: written for clarity, not efficiency.
 1: Input: Data vectors {x_n}_{n=1}^N, number of clusters K
 2: for n ← 1 . . . N do                          ⊲ Initialize all of the responsibilities.
 3:     r_n ← [0, 0, · · · , 0]                   ⊲ Zero out the responsibilities.
 4:     k′ ← RandomInteger(1, K)                  ⊲ Make one of them randomly one to initialize.
 5:     r_{nk′} ← 1
 6: end for
 7: repeat
 8:     for k ← 1 . . . K do                      ⊲ Loop over the clusters.
 9:         N_k ← Σ_{n=1}^N r_{nk}                ⊲ Compute the number assigned to cluster k.
10:         µ_k ← (1/N_k) Σ_{n=1}^N r_{nk} x_n    ⊲ Compute the mean of the kth cluster.
11:     end for
12:     for n ← 1 . . . N do                      ⊲ Loop over the data.
13:         r_n ← [0, 0, · · · , 0]               ⊲ Zero out the responsibilities.
14:         k′ ← arg min_k ||x_n − µ_k||²         ⊲ Find the closest mean.
15:         r_{nk′} ← 1
16:     end for
17: until none of the r_n change
18: Return assignments {r_n}_{n=1}^N for each datum, and cluster means {µ_k}_{k=1}^K.
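For readers who prefer runnable code, the following is a minimal numpy sketch of Lloyd's algorithm as described in Algorithm 1. The function name, the convergence check on assignments, and the handling of empty clusters are my own choices and not part of the original notes.

import numpy as np

def kmeans_lloyd(X, K, max_iters=100, rng=None):
    """Minimal Lloyd's algorithm. X is an (N, D) array, K is the number of clusters.
    Returns integer cluster assignments (N,) and cluster means (K, D)."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    z = rng.integers(0, K, size=N)          # random initial assignments
    mu = np.zeros((K, D))
    for _ in range(max_iters):
        # Update the means: average of the data assigned to each cluster.
        for k in range(K):
            members = X[z == k]
            if len(members) > 0:
                mu[k] = members.mean(axis=0)
            else:
                mu[k] = X[rng.integers(0, N)]   # re-seed an empty cluster (a common heuristic)
        # Update the assignments: each datum goes to its closest mean.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z, mu

For example, z, mu = kmeans_lloyd(np.random.randn(500, 2), 5) clusters 500 random two-dimensional points into five groups.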
Figure 2: This is the result of K-Means clustering applied to the MNIST digits data. (a) The 16
cluster centers. (b-q) 25 data examples are shown for each of the 16 clusters. The clusters roughly
grab digits with similar stroke patterns.
Figure 1 shows several iterations of the K-Means clustering algorithm applied to two-dimensional
data. These are the lengths and widths of fruit (oranges and lemons) purchased by Iain Murray.3
The measurements are in centimeters. I initialized the centers randomly within the square shown, and
used K = 5.
Figure 3: This is the result of K-Means clustering applied to the CIFAR-100 image data. (a) The 16
cluster centers. (b-q) 25 data examples are shown for each of the 16 clusters. The clusters primarily
pick up on low-frequency color variations.
Figure 4: This is the result of K-Means clustering applied to a preprocessed variant of the Labeled
Faces in the Wild image data. (a) The 16 cluster centers. (b-q) 25 data examples are shown for each
of the 16 clusters. The clusters capture a combination of face shape, background, and illumination.
into Matlab, turned it into a big 13233 × 4800 matrix, cast it to a double, and then divided by
255. This gave me a bunch of vectors between 0 and 1, which I standardized. I then initialized with
K-Means++, followed by Lloyd’s algorithm. I did this with K = 16. I used the vectorization tricks
that I mention here, and on my laptop it converged in less than three minutes.
Cluster 1    Cluster 2    Cluster 3    Cluster 4    Cluster 5     Cluster 6
education    south        war          war          art           light
united       population   german       government   century       energy
american     north        british      law          architecture  atoms
public       major        united       political    style         theory
world        west         president    power        painting      stars
social       mi           power        united       period        chemical
government   km           government   party        sculpture     elements
century      sq           army         world        form          electrons
schools      deg          germany      century      artists       hydrogen
countries    river        congress     military     forms         carbon
Table 1: This is the result of a simple application of K-Means clustering to a set of Grolier
encyclopedia articles. Shown above are the words with the highest “mean counts” in each of the
cluster centers, with clusters as columns. Even with this very simple approach, K-Means identifies
groups of words that seem conceptually related.
3 Derivation
Where does this algorithm come from and what does it do? As with many machine learning
algorithms, we begin by defining a loss function which specifies what solutions are good and bad.
This loss function takes as arguments the two sets of parameters that we introduced in the previous
sections, the responsibilities rn and the means µ k . Given these parameters and the data vectors xn ,
what does it mean to be in a good configuration versus a bad one? One intuition is that good
settings of rn and µ k will be those in which as many of the data as possible can be near their
assigned µ_k. This fits well with the compression view of K-Means: if each x_n were replaced by its assigned µ_k, then better solutions would be those for which this replacement error is small on average. We write
this as an objective function in terms of the rn and µ k :
$$J(\{r_n\}_{n=1}^N, \{\mu_k\}_{k=1}^K) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|_2^2. \qquad (2)$$
Here we’re continuing with the assumption that the distance is Euclidean. This function sums
up the squared distances between each example and the prototype it belongs to. The K-Means
algorithm minimizes this via coordinate descent, i.e., it alternates between 1) minimizing each of
the rn , and 2) minimizing each of the µ k .
Minimizing the rn If we look at the sum in Eq. 2, we see that the rn only appears in one of the
outer sums, because it only affects one of the data examples. There are only K possible values
for rn and so we can minimize it (holding everything else fixed) by choosing rnk = 1 for the cluster
that has the smallest distance:
$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_{k'} \|x_n - \mu_{k'}\|_2^2 \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$
Minimizing the µ k Having fixed everything else, we note that each µ k only depends on one of
the parts of the inner sums. We can think about the objective function written in terms of only one
of these:
$$J(\mu_k) = \sum_{n=1}^{N} r_{nk}\, \|x_n - \mu_k\|_2^2 \qquad (4)$$
$$= \sum_{n=1}^{N} r_{nk}\, (x_n - \mu_k)^{\mathsf{T}} (x_n - \mu_k). \qquad (5)$$
Here I’ve written out the squared Euclidean distance as a quadratic form, because it makes the
calculus a bit easier to see. To minimize, we differentiate this objective with respect to µ k , set the
resulting gradient to zero, and then solve for µ k .
$$\nabla_{\mu_k} J(\mu_k) = \nabla_{\mu_k} \sum_{n=1}^{N} r_{nk}\, (x_n - \mu_k)^{\mathsf{T}} (x_n - \mu_k) \qquad (6)$$
$$= \sum_{n=1}^{N} r_{nk}\, \nabla_{\mu_k} (x_n - \mu_k)^{\mathsf{T}} (x_n - \mu_k). \qquad (7)$$
Recall that $\nabla_a\, a^{\mathsf{T}} a = 2a$, and we apply the chain rule:
$$\nabla_{\mu_k} J(\mu_k) = -2 \sum_{n=1}^{N} r_{nk}\, (x_n - \mu_k) = 0 \qquad (8)$$
$$\sum_{n=1}^{N} r_{nk}\, x_n = \mu_k \sum_{n=1}^{N} r_{nk} \qquad (9)$$
$$\mu_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}. \qquad (10)$$
Thus, for Euclidean distances anyway, taking the average of the assigned data gives the mean that
minimizes the distortion (holding everything else fixed). It can also be useful to think of the K-Means clustering algorithm as finding a Voronoi partition10 of the data space.
4 Practical Considerations
4.1 Hardness and Initialization
The objective function in Eq. 2 is highly non-convex, with many local minima. Some of them are
very easy to see. For example, you could clearly permute the indices of the clusters and wind up in
a “different” solution that was just as good. Coordinate descent, as described here, therefore only
finds a local minimum of the objective. Strictly speaking, “K-Means” is not an algorithm, but a
problem specified by finding a configuration of the rn that minimizes Eq. 2 – note that the µ k are
completely determined by the rn for a given data set. The iterative algorithm described here is often
called Lloyd’s algorithm (Lloyd, 1982), but it represents just one way to optimize the K-Means
objective. It turns out that finding (one of) the globally optimal solutions to the K-Means problem
is NP-hard (Aloise et al., 2009), even if there are only two clusters.
When faced with highly non-convex optimization problems, a common strategy is to use
random restarts to try to find a good solution. That is, one runs Algorithm 1 several times (e.g., 10 or 20 times) with different random seeds, so that the initial r_n land in different places. Then one looks
at the final value of the objective in Eq. 2 to choose the best solution. Another practical strategy
for larger data sets is to do these restarts with a smaller subset of the data in order to find some
reasonable cluster centers before running the full iteration.
More recently, an algorithm has been proposed that is a bounded-error approximation to the
solution of K-Means (Arthur and Vassilvitskii, 2007). This algorithm, called K-Means++, is shown
in pseudocode in Algorithm 2, and can be an excellent alternative to the simple random initialization
shown in Algorithm 1. In fact, Arthur and Vassilvitskii (2007) show that K-Means++ can do well
even without using Lloyd’s algorithm at all.
10http://en.wikipedia.org/wiki/Voronoi_diagram
Algorithm 2 K-Means++ Note: written for clarity, not efficiency.
 1: Input: Data vectors {x_n}_{n=1}^N, number of clusters K
 2: n ← RandomInteger(1, N)                       ⊲ Choose a datum at random.
 3: µ_1 ← x_n                                     ⊲ Make this random datum the first cluster center.
 4: for k ← 2 . . . K do                          ⊲ Loop over the rest of the centers.
 5:     for n ← 1 . . . N do                      ⊲ Loop over the data.
 6:         d_n ← min_{k′<k} ||x_n − µ_{k′}||_2   ⊲ Compute the distance to the closest center.
 7:     end for
 8:     for n ← 1 . . . N do                      ⊲ Loop over the data again.
 9:         p_n ← d_n² / Σ_{n′} d_{n′}²           ⊲ Compute a distribution proportional to d_n².
10:     end for
11:     n ← Discrete(p_1, p_2, . . . , p_N)       ⊲ Draw a datum from this distribution.
12:     µ_k ← x_n                                 ⊲ Make this datum the next center.
13: end for
14: Return cluster means {µ_k}_{k=1}^K.
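Below is a minimal numpy sketch of the K-Means++ seeding procedure in Algorithm 2. The function name and interface are my own; in practice one would follow it with Lloyd's algorithm (or use a library implementation such as the one in scikit-learn).

import numpy as np

def kmeans_pp_init(X, K, rng=None):
    """K-Means++ seeding: X is an (N, D) array; returns (K, D) initial centers."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    centers = np.empty((K, D))
    centers[0] = X[rng.integers(0, N)]              # first center: a uniformly random datum
    for k in range(1, K):
        # Squared distance from each datum to its closest existing center.
        d2 = ((X[:, None, :] - centers[None, :k, :]) ** 2).sum(axis=2).min(axis=1)
        p = d2 / d2.sum()                           # sample proportionally to squared distance
        centers[k] = X[rng.choice(N, p=p)]
    return centers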
Algorithm 3 Data Standardization Note: written for clarity, not efficiency.
 1: Input: Data vectors {x_n}_{n=1}^N
 2: for d ← 1 . . . D do                    ⊲ Loop over dimensions.
 3:     m_1 ← 0                             ⊲ For storing the total of the values.
 4:     m_2 ← 0                             ⊲ For storing the total of the squared values.
 5:     for n ← 1 . . . N do                ⊲ Loop over the data.
 6:         m_1 ← m_1 + x_{n,d}
 7:         m_2 ← m_2 + x_{n,d}²
 8:     end for
 9:     µ ← m_1 / N                         ⊲ Compute sample mean.
10:     σ² ← (m_2 / N) − µ²                 ⊲ Compute sample variance.
11:     for n ← 1 . . . N do                ⊲ Loop over the data again to modify.
12:         x′_{n,d} ← (x_{n,d} − µ) / σ    ⊲ Shift by mean, scale by standard deviation.
13:     end for
14: end for
15: Return transformed data {x′_n}_{n=1}^N
of the unknown dimension. That is, we could compute the expectation of the distance between
two points, integrating over the missing values. If we’ve standardized the data (and there aren’t
too many missing values) then we might reasonably assume that the missing data have a N (0, 1)
distribution. The expectation of the squared Euclidean distance, assuming that the d th dimension
is missing, is
$$\mathbb{E}\big[\|x - \mu\|_2^2\big] = \int_{-\infty}^{\infty} \mathcal{N}(x_d \mid 0, 1) \sum_{d'=1}^{D} (x_{d'} - \mu_{d'})^2 \, dx_d \qquad (11)$$
$$= \sum_{d' \neq d} (x_{d'} - \mu_{d'})^2 + \int_{-\infty}^{\infty} \mathcal{N}(x_d \mid 0, 1)\, (x_d - \mu_d)^2 \, dx_d \qquad (12)$$
$$= \sum_{d' \neq d} (x_{d'} - \mu_{d'})^2 + 1 + \mu_d^2. \qquad (13)$$
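As a sanity check on Eq. 13, here is a small numpy sketch (my own, not from the notes) that compares the closed-form expected squared distance with a Monte Carlo estimate when one dimension is missing and modeled as N(0, 1).

import numpy as np

rng = np.random.default_rng(0)
D, d_missing = 5, 2
x = rng.normal(size=D)            # observed datum (dimension d_missing treated as missing)
mu = rng.normal(size=D)           # a cluster center

# Closed form (Eq. 13): sum over observed dimensions, plus 1 + mu_d^2 for the missing one.
observed = [d for d in range(D) if d != d_missing]
closed_form = np.sum((x[observed] - mu[observed]) ** 2) + 1.0 + mu[d_missing] ** 2

# Monte Carlo: impute the missing dimension from N(0, 1) and average the squared distance.
samples = rng.normal(size=100000)
x_rep = np.tile(x, (samples.size, 1))
x_rep[:, d_missing] = samples
monte_carlo = np.mean(np.sum((x_rep - mu) ** 2, axis=1))

print(closed_form, monte_carlo)   # the two values should agree closely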
Listing 1: Python function for standardizing data
import numpy as np

def standardize(data):
    '''Take an NxD numpy matrix as input and return a standardized version.'''
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    return (data - mean) / std
performed in the underlying Fortran or C libraries. These computations can be done very quickly
using BLAS11, for example.
Another very useful trick to know is for computing distances. When the data are one-
dimensional, it seems pretty easy. However, with higher dimensional data, it’s less obvious how
to compute a matrix of distances without looping. Let’s imagine that we have an N × D matrix X
and an M × D matrix Y and that we want to compute an N × M matrix D with all of the squared
distances. One entry in this matrix is given by
$$D_{n,m} = \|x_n - y_m\|_2^2 = x_n x_n^{\mathsf{T}} - 2\, x_n y_m^{\mathsf{T}} + y_m y_m^{\mathsf{T}},$$
where x_n ∈ R^D is the nth row of X and y_m ∈ R^D is the mth row of Y. These are row vectors, so
this is a sum of inner products. Using broadcasting, which numpy does naturally and can be done
in Matlab using bsxfun, this gives us a rapid way to offload distance calculations to BLAS (or
equivalent) without looping in our high-level code.
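Here is a minimal numpy sketch of this trick (my own illustration); it expands the squared distance into inner products and lets broadcasting do the work, so no Python-level loops are needed.

import numpy as np

def pairwise_sq_dists(X, Y):
    """Return the (N, M) matrix of squared Euclidean distances between
    the rows of X (N x D) and the rows of Y (M x D)."""
    X_sq = np.sum(X ** 2, axis=1)[:, None]     # (N, 1) squared row norms of X
    Y_sq = np.sum(Y ** 2, axis=1)[None, :]     # (1, M) squared row norms of Y
    cross = X @ Y.T                            # (N, M) inner products
    return X_sq - 2.0 * cross + Y_sq           # broadcasting fills out the matrix

# Quick check against a naive loop on small random matrices.
X, Y = np.random.randn(4, 3), np.random.randn(5, 3)
naive = np.array([[np.sum((x - y) ** 2) for y in Y] for x in X])
assert np.allclose(pairwise_sq_dists(X, Y), naive)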
[Boxplot: gap statistic (vertical axis) versus number of clusters, 1–10 (horizontal axis).]
Figure 5: The gap statistic computed for the fruit data. Here, it seems reasonable to choose K = 5.
This used 100 reference data sets, each from a Gaussian MLE fit for the null model.
given cluster:
$$D_k = \sum_{n=1}^{N} \sum_{n'=1}^{N} r_{n,k}\, r_{n',k}\, \|x_n - x_{n'}\|^2. \qquad (16)$$
The dispersion for a size-K clustering is the normalized sum of the within-cluster dispersion over
all of the clusters:
$$W_K = \sum_{k=1}^{K} \frac{D_k}{2 N_k} = \sum_{k=1}^{K} \frac{1}{2 N_k} \sum_{n=1}^{N} \sum_{n'=1}^{N} r_{n,k}\, r_{n',k}\, \|x_n - x_{n'}\|^2 \qquad (17)$$
$$N_k = \sum_{n=1}^{N} r_{n,k}. \qquad (18)$$
This is basically a measure of “tightness” of the fit normalized by the size of each cluster. It will be
smaller when the clustered points group together closely. The gap statistic uses a null distribution
for the data we have clustered, from which we generate reference data. You can think of this as
a distribution on the same space, with similar coarse statistics, but in which there aren’t clusters.
For example, if our original data were in the unit hypercube [0, 1]D , we might generate reference
data uniformly from the cube. If we standardized data on R^D so that each feature has zero sample mean and unit sample variance, then it would be natural to make the null distribution N(0, I_D).
To compute the gap statistic, we now generate several sets of reference data, cluster each of them,
and compute their dispersions. This gives us an idea (with error bars) as to what the expected
dispersion would be based on the coarse properties of the data space, for a given K . We can then
cow           humpback whale   german shepherd   mole
walks         hairless         furry             furry
quadrapedal   toughskin        meatteeth         small
vegetation    big              walks             fast
ground        swims            fast              active
big           strong           quadrapedal       newworld
ox            blue whale       siamese cat       hamster
pig           seal             wolf              rat
sheep         walrus           chihuahua         squirrel
buffalo       dolphin          dalmatian         mouse
horse         killer whale     weasel            skunk
Table 2: This table shows the result of applying K-Medoids to binary features associated with 50
animals, using Hamming distance. Here K = 4. The bold animals along the top are the medoids for
each cluster. The top five most common features are shown next, followed by five other non-medoid
animals from the same cluster.
Here the first term is the expected log of the dispersion under the null distribution – something
we can compute by averaging the log dispersions of our reference data. We subtract from this the log dispersion of our actual data. We can then look for the K which maximizes Gap_N(K). We
choose the smallest one that appears to be statistically significant. Figure 5 shows a boxplot of the
gap statistic for the fruit data. Here the null distribution was an MLE Gaussian fit to the data. I
generated 100 sets of reference data.
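To make the procedure concrete, here is a rough numpy sketch of the gap statistic computation (my own illustration, assuming the kmeans_lloyd function sketched earlier and a standard-normal null distribution for standardized data, whereas the figure above used a Gaussian MLE fit). A careful implementation would also track the standard error of the reference dispersions.

import numpy as np

def log_within_dispersion(X, z, K):
    """log W_K: for Euclidean distance, W_K equals the within-cluster sum of
    squared distances to each cluster mean (see Eqs. 16-18)."""
    W = 0.0
    for k in range(K):
        members = X[z == k]
        if len(members) > 0:
            W += np.sum((members - members.mean(axis=0)) ** 2)
    return np.log(W)

def gap_statistic(X, K, n_ref=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    z, _ = kmeans_lloyd(X, K, rng=rng)                 # cluster the real data
    log_WK = log_within_dispersion(X, z, K)
    ref_log_WK = []
    for _ in range(n_ref):                             # cluster reference data from the null
        X_ref = rng.normal(size=X.shape)               # N(0, I) null for standardized data
        z_ref, _ = kmeans_lloyd(X_ref, K, rng=rng)
        ref_log_WK.append(log_within_dispersion(X_ref, z_ref, K))
    return np.mean(ref_log_WK) - log_WK                # Gap_N(K)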
Algorithm 4 K-Medoids Note: written for clarity, not efficiency.
 1: Input: Data vectors {x_n}_{n=1}^N, number of clusters K
 2: for n ← 1 . . . N do                               ⊲ Initialize all of the responsibilities.
 3:     r_n ← [0, 0, · · · , 0]                        ⊲ Zero out the responsibilities.
 4:     k′ ← RandomInteger(1, K)                       ⊲ Make one of them randomly one to initialize.
 5:     r_{nk′} ← 1
 6: end for
 7: repeat
 8:     for k ← 1 . . . K do                           ⊲ Loop over the clusters.
 9:         for n ← 1 . . . N do                       ⊲ Loop over data.
10:             if r_{n,k} = 1 then
11:                 J_n ← Σ_{n′=1}^N r_{n′,k} ||x_n − x_{n′}||   ⊲ Sum distances to this datum.
12:             else
13:                 J_n ← ∞                            ⊲ Infinite cost for data not in this cluster.
14:             end if
15:         end for
16:         n ← arg min_n J_n                          ⊲ Pick the one that minimizes the sum of distances.
17:         µ_k ← x_n                                  ⊲ Make the minimizing one the cluster center.
18:     end for
19:     for n ← 1 . . . N do                           ⊲ Loop over the data.
20:         r_n ← [0, 0, · · · , 0]                    ⊲ Zero out the responsibilities.
21:         k′ ← arg min_k ||x_n − µ_k||²              ⊲ Find the closest medoid.
22:         r_{nk′} ← 1
23:     end for
24: until none of the r_n change
25: Return assignments {r_n}_{n=1}^N for each datum, and cluster medoids {µ_k}_{k=1}^K.
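For comparison with the K-Means sketch above, here is a minimal numpy version of Algorithm 4 (my own sketch; the function name and the precomputation of the full pairwise distance matrix are my choices, and the latter is only sensible for modestly sized data sets).

import numpy as np

def kmedoids(D_mat, K, max_iters=100, rng=None):
    """Minimal K-Medoids: D_mat is an (N, N) matrix of pairwise distances
    (any metric, e.g. Hamming). Returns assignments (N,) and medoid indices (K,)."""
    rng = np.random.default_rng() if rng is None else rng
    N = D_mat.shape[0]
    z = rng.integers(0, K, size=N)                    # random initial assignments
    medoids = np.zeros(K, dtype=int)
    for _ in range(max_iters):
        for k in range(K):
            members = np.where(z == k)[0]
            if len(members) == 0:
                members = np.array([rng.integers(0, N)])   # re-seed an empty cluster
            # The medoid is the member minimizing the sum of distances to the others.
            within = D_mat[np.ix_(members, members)].sum(axis=1)
            medoids[k] = members[np.argmin(within)]
        z_new = np.argmin(D_mat[:, medoids], axis=1)  # assign each datum to its closest medoid
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z, medoids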
Figure 6: This is the result of K-Medoids clustering applied to the CIFAR-100 image data. (a) The
16 cluster medoids. (b-q) 25 data examples are shown for each of the 16 clusters.
K-Medoids clustering instead. Note that the medoids tend to have low spatial frequency.
7 Advanced Topics
These topics are outside the scope of this class, but are good things to look into next if you find
clustering to be an interesting topic.
Spectral Clustering Clusters in data are often more complicated than simple isotropic blobs.
Our human visual system likes to group things in more complicated ways. Spectral clustering (see,
e.g., Ng et al. (2002); Von Luxburg (2007)) constructs a graph over the data first and then performs
operations on that graph using the Laplacian. The idea is that data which are close together tend to
be in the same group, even if they are not close to some single prototype.
Affinity Propagation One powerful way to think about clustering, which we’ll see later in the
semester, is to frame the groups in terms of latent variables in a probabilistic model. Inference in
many probabilistic models can be performed efficiently using “message passing” algorithms which
take advantage of an underlying graph structure. The algorithm of affinity propagation (Frey and
Dueck, 2007) is a nice way to perform K-Medoids clustering efficiently with such a message passing
procedure.
Biclustering The clustering algorithms we’ve looked at here have operated on the data instances.
That is, we’ve thought about finding partitions of the rows of an N × D matrix. Many data are not
well represented by “instances” and “features”, but are interactions between items. For example,
we could imagine our data to be a matrix of outcomes between sports teams, or protein-protein
interaction data. Biclustering (Hartigan, 1972) is an algorithm for simultaneously grouping both
rows and columns of a matrix, in effect discovering blocks. Variants of this technique have become
immensely important to biological data analysis.
References
Adam Coates, Andrew Y. Ng, and Honglak Lee. An analysis of single-layer networks in
unsupervised feature learning. In International Conference on Artificial Intelligence and
Statistics, pages 215–223, 2011. URL http://www.stanford.edu/~acoates/papers/
coatesleeng_aistats_2011.pdf.
Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28
(2):129–137, 1982. URL http://www.nt.tuwien.ac.at/fileadmin/courses/389075/
Least_Squares_Quantization_in_PCM.pdf.
Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean
sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009. URL http://link.
springer.com/article/10.1007%2Fs10994-009-5103-0.
David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceed-
ings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035.
Society for Industrial and Applied Mathematics, 2007. URL http://ilpubs.stanford.edu:
8090/778/1/2006-13.pdf.
Allan D. Gordon. Null models in cluster validation. In From data to knowledge, pages 32–44.
Springer, 1996.
Greg Hamerly and Charles Elkan. Learning the k in k-means. Advances in Neural Information Processing Systems, 16:281, 2004.
Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in
a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 63(2):411–423, 2001. URL http://www.stanford.edu/~hastie/Papers/
gap.pdf.
Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849–856, 2002. URL http://machinelearning.wustl.edu/mlpapers/paper_files/nips02-AA35.pdf.
Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007. URL http://web.mit.edu/~wingated/www/introductions/tutorial_on_spectral_clustering.pdf.
Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.
John A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
Changelog
• 2 November 2018 – Initial revision from old CS181 notes.