
Clustering

DHS 10.6-10.7, 10.9-10.10, 10.4.3-10.4.4


Clustering
Definition
A form of unsupervised learning, where we identify groups in feature space for an unlabeled sample set
• Define class regions in feature space using unlabeled data
• Note: the classes identified are abstract, in the sense that we obtain 'cluster 0' ... 'cluster n' as our classes (e.g. clustering MNIST digits, we may not get 10 clusters)
Applications
Clustering Applications Include:
• Data reduction: represent samples by their associated cluster
• Hypothesis generation: discover possible patterns in the data; validate on other data sets
• Hypothesis testing: test assumed patterns in data
• Prediction based on groups: e.g. selecting medication for a patient using clusters of previous patients and their reactions to medication for a given disease
Kuncheva: Supervised vs. Unsupervised Classification
A Simple Example
Assume Class Distributions Known to be Normal
Can define clusters by mean and covariance matrix

However...
We may need more information to cluster well
• Many different distributions can share a mean and covariance matrix
• ...number of clusters?
FIGURE 10.6. These four data sets have identical statistics up to second-order—that is, the same mean µ and covariance Σ. In such cases it is important to include in the model more parameters to represent the structure more completely. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Steps for Clustering
1. Feature Selection
• Ideal: small number of features with little redundancy
2. Similarity (or Proximity) Measure
• Measure of similarity or dissimilarity
3. Clustering Criterion
• Determine how distance patterns determine cluster likelihood (e.g. preferring circular to elongated clusters)
4. Clustering Algorithm
• Search method used with the clustering criterion to identify clusters
5. Validation of Results
• Using appropriate tests (e.g. statistical)
6. Interpretation of Results
• Domain expert interprets the clusters (clusters are subjective)


Choosing a Similarity Measure
Most Common: Euclidean Distance
Roughly speaking, we want the distance between samples in a cluster to be smaller than the distance between samples in different clusters
• Example (next slide): define clusters by a maximum distance d0 between a point and a point in a cluster
• Rescaling features can be useful (transform the space)
• Unfortunately, normalizing data (e.g. by setting features to zero mean, unit variance) may eliminate subclasses
• One might also choose to rotate the axes so they coincide with eigenvectors of the covariance matrix (i.e. apply PCA)
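To make the d0 example concrete, here is a minimal Python sketch (mine, not from the slides) that forms clusters by linking any two points closer than a chosen threshold d0; the function name threshold_clusters and the synthetic data are illustrative.

import numpy as np

def threshold_clusters(X, d0):
    # Group samples by linking any two points whose Euclidean distance
    # is below d0 (connected components of the "closer than d0" graph).
    n = len(X)
    labels = np.full(n, -1)
    current = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            near = np.where(np.linalg.norm(X - X[j], axis=1) < d0)[0]
            for k in near:
                if labels[k] == -1:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.2, 0.05, (20, 2)), rng.normal(0.8, 0.05, (20, 2))])
for d0 in (0.3, 0.1, 0.03):   # cf. Figure 10.7: the smaller d0, the more clusters
    print(d0, len(set(threshold_clusters(X, d0))))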
FIGURE 10.7. The distance threshold affects the number and size of clusters in similarity-based clustering methods. For three different values of distance d0 (0.3, 0.1, 0.03), lines are drawn between points closer than d0—the smaller the value of d0, the smaller and more numerous the clusters. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
FIGURE 10.8. Scaling axes affects the clusters in a minimum-distance cluster method. The original data and minimum-distance clusters are shown in the upper left; points in one cluster are shown in red, while the others are shown in gray. When the vertical axis is expanded by a factor of 2.0 and the horizontal axis shrunk by a factor of 0.5, the clustering is altered (as shown at the right). Alternatively, if the vertical axis is shrunk by a factor of 0.5 and the horizontal axis is expanded by a factor of 2.0, smaller, more numerous clusters result (shown at the bottom). In both these scaled cases, the assignment of points to clusters differs from that in the original space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

FIGURE 10.9. If the data fall into well-separated clusters (left), normalization by scaling for unit variance for the full data may reduce the separation, and hence be undesirable (right). Such a normalization may in fact be appropriate if the full data set arises from a single fundamental process (with noise), but inappropriate if there are several different processes, as shown here. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Other Similarity Measures

Minkowski Metric (Dissimilarity)
Change the exponent q:

d(x, x') = \left( \sum_{k=1}^{d} |x_k - x'_k|^q \right)^{1/q}

• q = 1: Manhattan (city-block) distance
• q = 2: Euclidean distance (the only form invariant to translation and rotation in feature space)

Cosine Similarity

s(x, x') = \frac{x^T x'}{||x|| \, ||x'||}

Characterizes similarity by the cosine of the angle between two feature vectors (in [0, 1] for non-negative features)
• Ratio of inner product to product of vector magnitudes
• Invariant to rotation and dilation (not translation)
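As a quick illustration (not part of the original slides), a minimal Python sketch of the Minkowski metric and cosine similarity; the function names and example vectors are illustrative.

import numpy as np

def minkowski(x, xp, q):
    # Minkowski dissimilarity: (sum_k |x_k - x'_k|^q)^(1/q)
    return np.sum(np.abs(x - xp) ** q) ** (1.0 / q)

def cosine_similarity(x, xp):
    # Cosine of the angle between two feature vectors
    return float(x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])
print(minkowski(x, y, 1))        # q = 1: Manhattan distance
print(minkowski(x, y, 2))        # q = 2: Euclidean distance
print(cosine_similarity(x, y))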


More on Cosine Similarity

s(x, x') = \frac{x^T x'}{||x|| \, ||x'||}

If features are binary-valued:
• The inner product is the sum of shared feature values (the number of shared attributes)
• The product of magnitudes is the geometric mean of the number of attributes in the two vectors

Variations
Frequently used for Information Retrieval
• Ratio of shared attributes (identical lengths):

s(x, x') = \frac{x^T x'}{d}

• Tanimoto distance: ratio of shared attributes to attributes in x or x':

s(x, x') = \frac{x^T x'}{x^T x + x'^T x' - x^T x'}
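For binary-valued features, a small sketch (mine, not the slides') of the two quantities above; the 0/1 attribute vectors a and b are illustrative.

import numpy as np

def binary_cosine(a, b):
    # Shared attributes over the geometric mean of the attribute counts
    shared = float(a @ b)
    return shared / np.sqrt(a.sum() * b.sum())

def tanimoto(a, b):
    # Shared attributes over attributes present in a or b
    shared = float(a @ b)
    return shared / (a.sum() + b.sum() - shared)

a = np.array([1, 1, 1, 0])
b = np.array([1, 1, 0, 1])
print(binary_cosine(a, b))   # 2 / sqrt(3 * 3) ≈ 0.667
print(tanimoto(a, b))        # 2 / (3 + 3 - 2) = 0.5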
Cosine Similarity: Tag Sets for YouTube Videos (Example by K. Kluever)

Ranking Related Videos by Cosine Similarity
The cosine similarity metric is the standard similarity metric used in Information Retrieval to compare text documents [20]. To rank related videos with respect to a challenge video, a sort is performed using the cosine similarity between the tags on each related video and the tags on the challenge video. Let A and B be binary occurrence vectors of the same length, representing all tags in the union of both tag sets. The cosine similarity between vectors A and B can be computed as:

SIM(A, B) = \cos\theta = \frac{A \cdot B}{||A|| \, ||B||}

where the dot product and the product of magnitudes are:

A \cdot B = \sum_{i=1}^{n} a_i b_i, \qquad ||A|| \, ||B|| = \sqrt{\sum_{i=1}^{n} (a_i)^2} \, \sqrt{\sum_{i=1}^{n} (b_i)^2}

In this case A and B are binary vectors (they only contain 1's and 0's) over the union of the tags in both videos. Therefore, the dot product simply reduces to the size of the intersection of the two tag sets (i.e., |A_t ∩ R_t|), and the product of the magnitudes reduces to the square root of the number of tags in the first tag set times the square root of the number of tags in the second tag set (i.e., \sqrt{|A_t|}\sqrt{|R_t|}). Therefore, the cosine similarity between a set of author tags and a set of related tags can easily be computed as:

\cos\theta = \frac{|A_t \cap R_t|}{\sqrt{|A_t| \, |R_t|}}

Consider an example where A_t = {dog, puppy, funny} and R_t = {dog, puppy, cat}. We can build a simple table which corresponds to the tag occurrence over the union of both tag sets (see Table 1):

Tag Set   Occ. Vector   dog   puppy   funny   cat
A_t       A             1     1       1       0
R_t       B             1     1       0       1

Table 1. Example of a tag occurrence table.

Reading row-wise from this table, the tag occurrence vectors for A_t and R_t are A = {1, 1, 1, 0} and B = {1, 1, 0, 1}, respectively. Computing the dot product (2) and the magnitudes gives SIM(A, B) = 2/3.
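A minimal sketch of the tag-set computation above using Python sets; the function name tag_cosine is illustrative, and the example reproduces the 2/3 result from Table 1.

import math

def tag_cosine(author_tags, related_tags):
    # |A ∩ R| / sqrt(|A| * |R|): the binary-vector cosine similarity
    shared = len(author_tags & related_tags)
    return shared / math.sqrt(len(author_tags) * len(related_tags))

A_t = {"dog", "puppy", "funny"}
R_t = {"dog", "puppy", "cat"}
print(tag_cosine(A_t, R_t))   # 2 / sqrt(3 * 3) = 2/3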
Additional Similarity Metrics

Theodoridis Text
Defines a large number of alternative distance metrics, including:
• Hamming distance: number of locations where two vectors (usually bit vectors) disagree
• Correlation coefficient
• Weighted distances...
Criterion Functions for Clustering

Criterion Function
Quantifies the 'quality' of a set of clusters
• Clustering task: partition data set D into c disjoint sets D1 ... Dc
• Choose the partition maximizing the criterion function
Criterion: Sum of Squared Error

J_e = \sum_{i=1}^{c} \sum_{x \in D_i} ||x - \mu_i||^2

Measures the total squared 'error' incurred by the choice of cluster centers (the cluster means µ_i)

'Optimal' Clustering
Minimizes this quantity

Issues
• Well suited when clusters are compact and well-separated
• Different numbers of points in each cluster can lead to large clusters being split 'unnaturally' (next slide)
• Sensitive to outliers
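A minimal sketch (not from the slides) of computing J_e for a hard partition; sse_criterion and the toy data are illustrative.

import numpy as np

def sse_criterion(X, labels):
    # J_e: sum over clusters of squared distances to the cluster mean
    Je = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        mu = cluster.mean(axis=0)
        Je += np.sum((cluster - mu) ** 2)
    return Je

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]])
labels = np.array([0, 0, 1, 1])
print(sse_criterion(X, labels))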
FIGURE 10.10. When two natural groupings have very different numbers of points, the clusters minimizing a sum-squared-error criterion Je of Eq. 54 may not reveal the true underlying structure. Here the criterion is smaller for the two clusters at the bottom (Je = small) than for the more natural clustering at the top (Je = large). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Related Criteria: Min Variance

An Equivalent Formulation for SSE

J_e = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i, \qquad \bar{s}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} ||x - x'||^2

\bar{s}_i: the mean squared distance between points in cluster i (variance)

Alternative criteria: use the median, maximum, or another descriptive statistic of the distances

Variation: Using Similarity (e.g. Tanimoto)

\bar{s}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} s(x, x') \qquad \text{or} \qquad \bar{s}_i = \min_{x, x' \in D_i} s(x, x')

s may be any similarity function (in this case, maximize)
Criterion: Scatter Matrix-Based

S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - \mu_i)(x - \mu_i)^T

\mathrm{trace}[S_W] = \sum_{i=1}^{c} \sum_{x \in D_i} ||x - \mu_i||^2 = J_e

Minimize the Trace of S_W (within-class scatter)
Equivalent to SSE!

Recall that the total scatter is the sum of the within-class and between-class scatter (S_m = S_W + S_B). This means that by minimizing the trace of S_W, we also maximize the trace of S_B (as S_m is fixed):

\mathrm{trace}[S_B] = \sum_{i=1}^{c} n_i ||\mu_i - \mu_0||^2

where µ_0 is the mean of all samples.
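To illustrate the relation between the scatter matrices and J_e, a small sketch (mine, not from the slides); scatter_matrices and the toy data are illustrative names.

import numpy as np

def scatter_matrices(X, labels):
    # Within-class scatter S_W and between-class scatter S_B for a hard partition
    d = X.shape[1]
    mu0 = X.mean(axis=0)                      # overall mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        cluster = X[labels == c]
        mu = cluster.mean(axis=0)
        diff = cluster - mu
        Sw += diff.T @ diff                   # within-class scatter
        m = (mu - mu0).reshape(-1, 1)
        Sb += len(cluster) * (m @ m.T)        # between-class scatter
    return Sw, Sb

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.2, 0.9]])
labels = np.array([0, 0, 1, 1])
Sw, Sb = scatter_matrices(X, labels)
print(np.trace(Sw))        # equals the SSE criterion J_e
print(np.trace(Sw + Sb))   # trace of the total scatter S_m (fixed for the data)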
Scatter-Based Criteria, Cont'd

Determinant Criterion

J_d = |S_W| = \left| \sum_{i=1}^{c} \sum_{x \in D_i} (x - \mu_i)(x - \mu_i)^T \right|

Roughly measures the square of the scattering volume; proportional to the product of the variances in the principal axes (minimize!)
• The minimum-error partition will not change with axis scaling, unlike SSE
Scatter-Based: Invariant Criteria

Invariant Criteria (Eigenvalue-based)
The eigenvalues of S_W^{-1} S_B measure the ratio of between-cluster to within-cluster scatter in the direction of the corresponding eigenvectors (maximize!)
• The trace of a matrix is the sum of its eigenvalues (here d is the length of the feature vector)
• Eigenvalues are invariant under non-singular linear transformations (rotations, translations, scaling, etc.)

\mathrm{trace}[S_W^{-1} S_B] = \sum_{i=1}^{d} \lambda_i

J_f = \mathrm{trace}[S_m^{-1} S_W] = \sum_{i=1}^{d} \frac{1}{1 + \lambda_i}
Clustering with a Criterion

Choosing a Criterion
Creates a well-defined problem
• Define clusters so as to maximize the criterion function
• A search problem
• Brute-force solution: enumerate partitions of the training set, select the partition with maximum criterion value

Comparison: Scatter-Based Criteria
Hierarchical Clustering

Motivation
Capture similarity/distance relationships between sub-groups and samples within the chosen clusters
• Common in scientific taxonomies (e.g. biology)
• Can operate bottom-up (individual samples to clusters, or agglomerative clustering) or top-down (single cluster to individual samples, or divisive clustering)
Agglomerative Hierarchical Clustering

Problem: Given n samples, we want c clusters
One solution: create a sequence of partitions (clusterings)
• First partition, k = 1: n clusters (one cluster per sample)
• Second partition, k = 2: n - 1 clusters
• Continue reducing the number of clusters by one: merge the 2 closest clusters (a cluster may be a single sample) at each step k until...
• Goal partition, k = n - c + 1: c clusters
• Done; but if we're curious, we can continue on until the...
• ...final partition, k = n: one cluster

Result
All samples and sub-clusters are organized into a tree (a dendrogram)
• A dendrogram diagram often shows cluster similarity on the Y-axis
If, as stated above, whenever two samples share a cluster they remain in a cluster at higher levels, we have a hierarchical clustering
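For reference, agglomerative clustering is available in SciPy; this short sketch (not from the slides) builds the merge tree on synthetic data and cuts it at c = 2 clusters. scipy.cluster.hierarchy.linkage, fcluster, and dendrogram are standard SciPy functions; the data here is illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)),
               rng.normal(1, 0.1, (10, 2))])

# Agglomerative clustering: 'single' = nearest-neighbor (single-linkage),
# 'complete' = farthest-neighbor, 'average' = average inter-sample distance.
Z = linkage(X, method='single')

# Cut the tree to obtain c = 2 clusters (the "goal partition").
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# dendrogram(Z) would draw the full merge tree (requires matplotlib).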
FIGURE 10.11. A dendrogram can represent the results of hierarchical clustering algorithms. The vertical axis shows a generalized measure of similarity among clusters. Here, at level 1 all eight points lie in singleton clusters; each point in a cluster is highly similar to itself, of course. Points x6 and x7 happen to be the most similar, and are merged at level 2, and so forth. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

FIGURE 10.12. A set or Venn diagram representation of two-dimensional data (which was used in the dendrogram of Fig. 10.11) reveals the hierarchical structure but not the quantitative distances between clusters. The levels are numbered by k, in red. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Distance Measures

d_{min}(D_i, D_j) = \min_{x \in D_i, x' \in D_j} ||x - x'||

d_{max}(D_i, D_j) = \max_{x \in D_i, x' \in D_j} ||x - x'||

d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} ||x - x'||

d_{mean}(D_i, D_j) = ||m_i - m_j||

Listed Above:
Minimum, maximum, and average inter-sample distance (samples for clusters i, j: D_i, D_j), and the difference in cluster means (m_i, m_j)
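A compact sketch (illustrative names, not from the slides) of the four inter-cluster distances above.

import numpy as np

def cluster_distances(Di, Dj):
    # Pairwise Euclidean distances between all points of Di and Dj
    pair = np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=2)
    d_min = pair.min()
    d_max = pair.max()
    d_avg = pair.mean()
    d_mean = np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))
    return d_min, d_max, d_avg, d_mean

Di = np.array([[0.0, 0.0], [0.2, 0.1]])
Dj = np.array([[1.0, 1.0], [1.1, 0.8]])
print(cluster_distances(Di, Dj))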
Nearest-Neighbor Algorithm

Also Known as the "Single-Linkage" Algorithm
Agglomerative hierarchical clustering using d_min
• The two nearest neighbors in separate clusters determine which clusters are merged at each step
• If we continue until k = n (c = 1), we produce a minimum spanning tree (similar to Kruskal's algorithm)
• MST: a path exists between all node (sample) pairs, and the sum of edge costs is minimum over all spanning trees

Issues
Sensitive to noise and slight changes in the position of data points (chaining effect)
• Example: next slide
FIGURE 10.13. Two Gaussians were used to generate two-dimensional samples, shown in pink and black. The nearest-neighbor clustering algorithm gives two clusters that well approximate the generating Gaussians (left). If, however, another particular sample is generated (circled red point at the right) and the procedure is restarted, the clusters do not well approximate the Gaussians. This illustrates how the algorithm is sensitive to the details of the samples. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Farthest-Neighbor Algorithm

Agglomerative hierarchical clustering using d_max
• The clusters with the smallest maximum distance between two points are merged at each step
• Goal: minimal increase in the largest cluster diameter at each iteration (discourages elongated clusters)
• Known as the 'Complete-Linkage Algorithm' if terminated when the distance between the nearest clusters exceeds a given threshold distance

Issues
Works well for clusters that are compact and roughly equal in size; with elongated clusters, the result can be meaningless
FIGURE 10.14. The farthest-neighbor clustering algorithm uses the separation between the most distant points as a criterion for cluster membership. If this distance is set very large, then all points lie in the same cluster. In the case shown at the left, a fairly large dmax leads to three clusters; a smaller dmax gives four clusters, as shown at the right. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Using Mean, Avg Distances

Agglomerative hierarchical clustering using:

d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} ||x - x'||

d_{mean}(D_i, D_j) = ||m_i - m_j||

Reduces Sensitivity to Outliers
The mean is less expensive to compute than the avg, min, or max distances (each of which requires computing n_i * n_j distances)
Stepwise Optimal Hierarchical Clustering

Problem
None of the agglomerative methods discussed so far directly minimize a specific criterion function

Modified Agglomerative Algorithm:
For k = 1 to (n - c + 1)
• Find the clusters D_i and D_j whose merger changes the criterion least
• Merge D_i and D_j

Example: minimal increase in SSE (J_e):

d_e(D_i, D_j) = \sqrt{\frac{n_i n_j}{n_i + n_j}} \, ||m_i - m_j||

d_e selects the cluster pair that increases J_e as little as possible. This may not minimize the SSE overall, but it is often a good starting point
• Prefers merging single elements or small clusters with large clusters vs. merging medium-size clusters
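A small sketch of d_e (illustrative function name, not from the slides); merging the pair with the smallest d_e increases J_e the least.

import numpy as np

def sse_merge_cost(Di, Dj):
    # d_e = sqrt(n_i * n_j / (n_i + n_j)) * ||m_i - m_j||
    ni, nj = len(Di), len(Dj)
    mi, mj = Di.mean(axis=0), Dj.mean(axis=0)
    return np.sqrt(ni * nj / (ni + nj)) * np.linalg.norm(mi - mj)

Di = np.array([[0.0, 0.0], [0.2, 0.0]])
Dj = np.array([[1.0, 1.0]])
print(sse_merge_cost(Di, Dj))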
k-Means Clustering

k-Means Algorithm
For a number of clusters k:
1. Choose k data points at random (as the initial cluster centers)
2. Assign all data points to the closest of the k cluster centers
3. Re-compute the k cluster centers as the mean vector of each cluster
• If the cluster centers do not change, stop
• Else, goto 2

Complexity
O(ndcT): T iterations, d features, n points, c clusters; in practice usually T << n (far fewer than n iterations)
Note: the means tend to move so as to minimize the squared-error criterion
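A minimal k-means sketch following the three steps above (mine, not a reference implementation; the names and synthetic data are illustrative, and a max_iter guard is added as a practical safeguard).

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: k data points chosen at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        # (keep the old center if a cluster goes empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
centers, labels = k_means(X, k=2)
print(centers)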
FIGURE 10.2. The k-means clustering procedure is a form of stochastic hill climbing in the log-likelihood function. The contours represent equal log-likelihood values for the one-dimensional data in Fig. 10.1. The dots indicate parameter values after different iterations of the k-means algorithm. Six of the starting points shown lead to local maxima, whereas two (i.e., µ1(0) = µ2(0)) lead to a saddle point near µ = 0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
FIGURE 10.1. (Above) The source mixture density used to generate sample data, and two maximum-likelihood estimates based on the data in the table. (Bottom) Log-likelihood of a mixture model consisting of two univariate Gaussians as a function of their means, for the data in the table. Trajectories for the iterative maximum-likelihood estimation of the means of a two-Gaussian mixture model based on the data are shown as red lines. Two local optima (with log-likelihoods −52.2 and −56.7) correspond to the two density estimates shown above. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
FIGURE 10.3. Trajectories for the means of the k-means clustering procedure applied to two-dimensional data. The final Voronoi tesselation (for classification) is also shown—the means correspond to the "centers" of the Voronoi cells. In this case, convergence is obtained in three iterations. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Fuzzy k-means

Basic Idea
Allow every point to have a probability of membership in every cluster. The criterion (cost function) minimized is:

J_{fuz} = \sum_{i=1}^{c} \sum_{j=1}^{n} [\hat{P}(\omega_i | x_j, \hat{\Theta})]^b \, ||x_j - \mu_i||^2

Θ is the membership function parameter set; b ('blending') is a free parameter:
• b = 0: sum-of-squared-error criterion (one cluster per data point)
• b > 1: each pattern may belong to multiple clusters


Fuzzy k-Means Clustering Algorithm

Update equations:

\mu_i = \frac{\sum_{j=1}^{n} [\hat{P}(\omega_i | x_j)]^b \, x_j}{\sum_{j=1}^{n} [\hat{P}(\omega_i | x_j)]^b}

\hat{P}(\omega_i | x_j) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{r=1}^{c} (1/d_{rj})^{1/(b-1)}}, \qquad d_{ij} = ||x_j - \mu_i||^2

Algorithm
1. Compute the probability of each class for every point in the training set (initialize with uniform probability: equal likelihood in each cluster)
2. Recompute the means using the first expression above
3. Recompute the probability of each class for each point using the second expression above
• If the change in the means and membership probabilities for the points is small, stop
• Else goto 2
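A minimal fuzzy k-means sketch implementing the two update equations above (mine, with illustrative names and data; note that the memberships are initialized randomly rather than exactly uniformly, since exactly uniform memberships would make all initial means coincide).

import numpy as np

def fuzzy_k_means(X, c, b=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: initialize memberships P (rows ~ clusters, columns ~ points).
    P = rng.random((c, n))
    P /= P.sum(axis=0)
    for _ in range(max_iter):
        # Step 2: means as membership-weighted averages of the points.
        W = P ** b
        mu = (W @ X) / W.sum(axis=1, keepdims=True)
        # Step 3: memberships from inverse squared distances to the means.
        d = ((X[None, :, :] - mu[:, None, :]) ** 2).sum(axis=2) + 1e-12
        inv = (1.0 / d) ** (1.0 / (b - 1.0))
        P_new = inv / inv.sum(axis=0)
        if np.abs(P_new - P).max() < tol:   # small change: stop
            P = P_new
            break
        P = P_new
    return mu, P

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.4, (30, 2)), rng.normal(2, 0.4, (30, 2))])
mu, P = fuzzy_k_means(X, c=2)
print(mu)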
FIGURE 10.4. At each iteration of the fuzzy k-means clustering algorithm, the probability of category memberships for each point are adjusted according to Eqs. 32 and 33 (here b = 2). While most points have nonnegligible memberships in two or three clusters, we nevertheless draw the boundary of a Voronoi tesselation to illustrate the progress of the algorithm. After four iterations, the algorithm has converged to the red cluster centers and associated Voronoi tesselation. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Fuzzy k-means, Cont'd

Convergence Properties
Sometimes fuzzy k-means improves convergence over classical k-means
However, the probability of cluster membership depends on the number of clusters; this can lead to problems if a poor choice of k is made
Cluster Validity

So far...
We have assumed that we know the number of clusters

When the number of clusters isn't known
We can try a clustering procedure using c = 1, c = 2, etc., making note of sudden decreases in the error criterion (e.g. SSE)
More formal: statistical tests; however, the problem of testing cluster validity is unsolved
• DHS: Section 10.10 presents a statistical test centered around testing the null hypothesis of having c clusters, by comparing with c + 1 clusters
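As a rough illustration of scanning c and watching for a sudden drop in the error criterion (my sketch, not from the slides; kmeans_sse and the synthetic three-cluster data are illustrative):

import numpy as np

def kmeans_sse(X, c, seed=0, iters=50):
    # Run a basic k-means for c clusters and return the SSE criterion J_e.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(c)])
    return sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(c))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, (40, 2)) for m in (-3, 0, 3)])

# A sharp drop in J_e followed by a plateau suggests a reasonable
# number of clusters (here, around c = 3).
for c in range(1, 7):
    print(c, round(kmeans_sse(X, c), 2))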
