
Cluster Analysis: Basic Concepts and Algorithms

Lecture Notes for Chapter 7

Slides by Tan, Steinbach, Kumar, adapted by Michael Hahsler

Look for accompanying R code on the course web site.
Topics

▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
What is Cluster Analysis?
▪ Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated to)
the objects in other groups: intra-cluster distances are minimized,
inter-cluster distances are maximized.

▪ A clustering is a set of clusters, and each cluster contains a set of
points.
Applications of Cluster Analysis

▪ Understanding
—Group related documents for browsing, group genes and proteins that have
similar functionality, or group stocks with similar price fluctuations.

Example: discovered clusters of stocks and the industry group each corresponds to:
—Cluster 1: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN,
CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN,
Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN,
Sun-DOWN → Technology1-DOWN
—Cluster 2: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN,
Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN,
EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
→ Technology2-DOWN
—Cluster 3: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,
Morgan-Stanley-DOWN → Financial-DOWN
—Cluster 4: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP,
Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP

▪ Summarization
—Reduce the size of large data sets (e.g., clustering precipitation data for
Australia).
What is not Cluster Analysis?

▪ Supervised classification
—Uses class label information

▪ Simple segmentation
—Dividing students into different registration groups alphabetically, by last
name

▪ Results of a query
—Groupings are a result of an external specification

→ Clustering uses only the data


Similarity

▪ How do we measure
similarity/proximity/dissimilarity/distance?

▪ Examples
—Minkowski distance: Manhattan distance, Euclidean distance, etc.
—Jaccard index for binary data
—Gower's distance for mixed data (ratio/interval and nominal)
—Correlation coefficient as similarity between variables
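
As a rough illustration, here is a minimal R sketch of these proximity measures
using base R and the cluster package; the data objects x, b, and mixed are
made-up examples:

  library(cluster)                        # provides daisy() for Gower's distance

  # Hypothetical numeric data: 5 points with 3 ratio-scaled attributes
  x <- matrix(rnorm(15), nrow = 5)

  dist(x, method = "euclidean")           # Euclidean distance
  dist(x, method = "manhattan")           # Manhattan distance
  dist(x, method = "minkowski", p = 3)    # general Minkowski distance

  # Jaccard-style distance for binary data (asymmetric binary in base R)
  b <- matrix(sample(0:1, 20, replace = TRUE), nrow = 5)
  dist(b, method = "binary")

  # Gower's distance for mixed data (ratio/interval and nominal)
  mixed <- data.frame(height = rnorm(5),
                      group  = factor(c("a", "b", "a", "c", "b")))
  daisy(mixed, metric = "gower")

  cor(x)                                  # correlation as similarity between variables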
Notion of a Cluster can be Ambiguous

How many clusters? The same set of points can reasonably be interpreted as
two clusters, four clusters, or six clusters.

Topics

▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
Types of Clusterings

▪ Partitional Clustering: a division of data objects into non-overlapping
subsets (clusters) such that each data object is in exactly one subset.

▪ Hierarchical Clustering: a set of nested clusters organized as a
hierarchical tree.
Partitional Clustering

(Figure: the original points and a partitional clustering of them.)

Hierarchical Clustering

(Figure: a hierarchical clustering of the same points.)
Other Distinctions Between Sets of Clusters

▪ Exclusive versus non-exclusive: in non-exclusive clusterings, points may
belong to multiple clusters.
▪ Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every
cluster with some membership weight between 0 and 1, and the membership
weights must sum to 1. Probabilistic clustering has similar characteristics.
▪ Partial versus complete: in some cases, we only want to cluster some of
the data.
▪ Heterogeneous versus homogeneous: clusters of widely different sizes,
shapes, and densities.
Topics

▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
Types of Clusters

▪ Center-based clusters
▪ Contiguous clusters
▪ Density-based clusters
▪ Conceptual clusters
Center-based Clusters

(Figure: cluster centers for well-separated clusters and for not well-separated,
overlapping clusters.)

A cluster is a set of objects such that an object in a cluster is closer (more
similar) to the “center” of its cluster than to the center of any other cluster.
The center of a cluster is often a centroid, the average of all the points in
the cluster, or a medoid, the most “representative” point of a cluster.
Contiguous and Density-based Clusters

(Figure: examples of contiguous clusters and of high-density regions forming
density-based clusters.)
Conceptual Clusters

Conceptual clusters are hard to detect since they are often not:
▪ Center-based
▪ Contiguity-based
▪ Density-based
Objective Functions

▪ The best clustering minimizes or maximizes an objective function.


▪ Example: Minimize the Sum of Squared Errors (SSE)

  $SSE = \sum_{i=1}^{K} \sum_{\mathbf{x} \in C_i} \lVert \mathbf{x} - \mathbf{m}_i \rVert^2$

  where $\mathbf{x}$ is a data point in cluster $C_i$, $\mathbf{m}_i$ is the center of cluster $C_i$
  (the mean of all points in the cluster), and $\lVert \cdot \rVert$ is the L2 norm (= Euclidean distance).

▪ Problem: enumerating all possible ways of dividing the points into clusters
and evaluating the ‘goodness’ of each potential set of clusters with the given
objective function is NP-hard.
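
A minimal R sketch of this SSE computation, assuming a numeric data matrix x
and an integer vector cl of cluster assignments (both names are hypothetical):

  # SSE of a clustering: for every cluster, sum the squared Euclidean
  # distances of its points to the cluster mean (centroid m_i).
  sse <- function(x, cl) {
    sum(sapply(unique(cl), function(i) {
      pts <- x[cl == i, , drop = FALSE]
      ctr <- colMeans(pts)                  # centroid m_i
      sum(rowSums(sweep(pts, 2, ctr)^2))    # sum of ||x - m_i||^2
    }))
  }

  # Made-up example: random data with a random 3-cluster assignment
  x  <- matrix(rnorm(60), ncol = 2)
  cl <- sample(1:3, nrow(x), replace = TRUE)
  sse(x, cl)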
Objective Functions

Global objective function


▪ Typically used in partitional clustering. k-means uses SSE.
▪ Mixture Models assume that the data is a ‘mixture' of a number of
parametric statistical distributions (e.g., a mixture of Gaussians). Maximize
log-likelihood of the model.

Local objective function


▪ Hierarchical clustering algorithms typically have local objectives.
▪ Density-based clustering is based on local density estimates.
▪ Graph-based approaches: graph partitioning (e.g., min-cut) and shared
nearest neighbors.

We will talk about the objective functions when we talk about individual
clustering algorithms.
Topics

▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
K-means Clustering

▪ Partitional clustering approach


▪ Each cluster is associated with a centroid (center point)
▪ Each point is assigned to the cluster with the closest centroid
▪ Number of clusters, K, must be specified

Lloyd’s algorithm (Voronoi iteration):
1. Select K points as the initial centroids.
2. Repeat: form K clusters by assigning each point to its closest centroid,
   then recompute the centroid of each cluster.
3. Until the centroids do not change.
K-means Clustering – Details

▪ Initial centroids are often chosen randomly.


—Clusters produced vary from one run to another.
▪ The centroid is the mean of the points in the cluster.
▪ ‘Closeness’ is measured by Euclidean distance
▪ K-means will converge (points stop changing assignment) typically in
the first few iterations (<10).
—Sometimes the stopping condition is changed to ‘Until relatively few points
change clusters’

▪ Complexity is $O(n \cdot K \cdot I \cdot d)$, where n = number of points, K = number of
clusters, I = number of iterations, and d = number of attributes.
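
As a usage sketch, K-means as described above can be run with base R roughly as
follows (the data matrix x and the choice of K are made-up examples):

  set.seed(1234)
  x <- matrix(rnorm(200), ncol = 2)     # hypothetical 2-dimensional data

  # K must be specified; nstart = 10 restarts Lloyd's algorithm from 10 random
  # initializations and keeps the solution with the lowest SSE.
  km <- kmeans(x, centers = 3, nstart = 10, iter.max = 100, algorithm = "Lloyd")

  km$centers         # the K centroids
  km$cluster         # cluster assignment of each point
  km$tot.withinss    # SSE (within-cluster sum of squares)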
K-Means Example

(Figure: iterations 1 through 6 of K-means on a 2-dimensional data set, showing
how the centroids and the cluster assignments change until the algorithm
converges.)
See visualization on course web site


Importance of Choosing Initial Centroids …

(Figure: iterations 1 through 5 for a different random initialization on the
same data set; K-means converges to a different, poorer clustering.)
Solutions to Initial Centroids Problem

▪ Multiple runs. This is standard in most tools and typically helps.

▪ Sample and use hierarchical clustering to determine initial centroids.

▪ Select more than k initial centroids and then select among these
initial centroids the ones that are far away from each other.
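
A minimal R sketch of the second idea, clustering a sample hierarchically to
obtain initial centroids (the data matrix x, the sample size, and k are made-up
examples):

  set.seed(42)
  x <- matrix(rnorm(2000), ncol = 2)      # hypothetical data
  k <- 3

  # 1. Take a small sample and cluster it hierarchically.
  s  <- x[sample(nrow(x), 100), ]
  hc <- hclust(dist(s), method = "complete")

  # 2. Cut the dendrogram into k groups and use the group means
  #    as the initial centroids for K-means on the full data set.
  init <- do.call(rbind,
                  lapply(split(as.data.frame(s), cutree(hc, k = k)), colMeans))
  km <- kmeans(x, centers = init)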
Evaluating K-means Clusters

▪ Most common measure is Sum of Squared Error (SSE)


—For each point, the error is the distance to the nearest cluster center
  $SSE = \sum_{i=1}^{K} \sum_{\mathbf{x} \in C_i} \lVert \mathbf{x} - \mathbf{m}_i \rVert^2$

—$\mathbf{x}$ is a data point in cluster $C_i$, $\mathbf{m}_i$ is the center of cluster $C_i$ (the mean
of all points in the cluster), and $\lVert \cdot \rVert$ is the L2 norm (= Euclidean distance).
—Given two clusterings, we can choose the one with the smallest error
—Only compare clusterings with the same K! One easy way to reduce SSE is to
increase K, the number of clusters

▪ Note: K-Means is a heuristic to minimize SSE.


Pre-processing and Post-processing

▪ Pre-processing
—Normalize the data (e.g., scale to unit standard deviation)
—Eliminate outliers

▪ Post-processing
—Eliminate small clusters that may represent outliers
—Split ‘loose’ clusters, i.e., clusters with relatively high SSE
—Merge clusters that are ‘close’ and that have relatively low SSE
Limitations of K-means

▪ K-means has problems when clusters are of differing


—Sizes
—Densities
—Non-globular shapes

▪ K-means has problems when the data contains outliers.


Limitations of K-means: Differing Sizes
Limitations of K-means: Differing Density
Limitations of K-means: Non-globular Shapes
Overcoming K-means Limitations

Use a larger number of clusters, so that several K-means clusters together
represent one true cluster.
Topics

▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
Hierarchical Clustering

▪ Produces a set of nested clusters organized as a hierarchical tree


called a dendrogram. The dendrogram shows at what distance points
join into a cluster.

(Figure: six data points, a nested clustering of them, and the corresponding
dendrogram; the height at which two branches join is the distance at which the
two clusters are merged.)
Strengths of Hierarchical Clustering

▪ You do not have to assume any particular number of clusters
—Any desired number of clusters can be obtained by ‘cutting’ the dendrogram
at the proper level.

▪ They may correspond to


meaningful taxonomies
—Example in biological sciences
(e.g., animal kingdom, phylogeny
reconstruction, …)
Hierarchical Clustering

▪ Two main types of hierarchical clustering


—Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

—Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k clusters)

▪ Traditional hierarchical algorithms


—use a similarity or distance matrix
—merge or split one cluster at a time
Agglomerative Clustering Algorithm

▪ Agglomerative approach is more popular.


▪ Basic algorithm is straightforward

1. Compute the proximity matrix


2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

▪ A key operation is to compute the proximity between two clusters.
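
A minimal R sketch of this basic agglomerative algorithm with base R, assuming
a hypothetical numeric data matrix x; the method argument selects the
inter-cluster proximity definition discussed on the following slides:

  x <- matrix(rnorm(100), ncol = 2)      # hypothetical data

  d  <- dist(x)                          # step 1: proximity matrix
  hc <- hclust(d, method = "complete")   # agglomerative merging
                                         # (also "single", "average", "ward.D2")

  plot(hc)                               # dendrogram
  cutree(hc, k = 4)                      # cut the tree into 4 clusters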


Starting Situation

▪ Start with clusters of individual points and a proximity matrix.

(Figure: individual points p1, p2, p3, p4, p5, ..., the initial proximity
matrix, and an empty dendrogram.)
Intermediate Situation

▪ After some merging steps, we have some clusters.

(Figure: clusters C1 through C5, their proximity matrix, and the partial
dendrogram built so far.)
Intermediate Situation

▪ We want to merge the two closest clusters (C2 and C5) and update the
proximity matrix.

(Figure: clusters C1 through C5 with C2 and C5 highlighted, their proximity
matrix, and the partial dendrogram.)
After Merging

▪ The question is “How do we update the proximity matrix?” The proximities
between the merged cluster C2 ∪ C5 and the remaining clusters C1, C3, and C4
are unknown and must be recomputed.

(Figure: the proximity matrix after the merge, with the row and column for
C2 ∪ C5 filled with question marks, and the updated dendrogram.)
How to Define Inter-Cluster Similarity

(Figure: two clusters of points p1, p2, p3, p4, p5, ... and the proximity
matrix; the entries involved depend on which definition of inter-cluster
similarity is used.)

▪ MIN (Single Link)
▪ MAX (Complete Link)
▪ Group Average (Average Link)
▪ Distance Between Centroids
▪ Other methods driven by an objective function
—Ward’s Method uses squared error
Single Link

Advantage: can find non-spherical, non-convex clusters.
Problem: chaining.

Complete Link

Advantage: more robust against noise (no chaining).
Problem: tends to break large clusters; biased towards globular clusters.

Average Link

A compromise between single link and complete link.


Cluster Similarity: Ward’s Method

▪ Similarity of two clusters is based on the increase in squared error


when two clusters are merged
▪ Less susceptible to noise and outliers
▪ Biased towards globular clusters
▪ Hierarchical analogue of K-means
Hierarchical Clustering: Complexity

▪ Space: $O(N^2)$, since it uses the proximity matrix (N is the number of points).
This restricts the number of points that can be clustered!

▪ Time: $O(N^3)$ in many cases
—There are N steps, and at each step the proximity matrix of size $N^2$ must be
updated and searched.
—Complexity can be reduced to $O(N^2 \log N)$ time for some approaches.
Hierarchical Clustering: Limitations

▪ Greedy: Once a decision is made to combine two clusters, it cannot be undone.

▪ No global objective function is directly minimized

▪ Different schemes have problems with one or more of the following:


—Sensitivity to noise and outliers
—Difficulty handling different sized clusters and convex shapes
—Chaining, breaking large clusters
Topics

▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
DBSCAN

▪ Density = number of points within a specified radius (Eps)

(Figure: a point whose Eps-neighborhood contains 7 points, i.e., density = 7.)
DBSCAN

MinPts = 5

▪ A point is a core point if it has more than a specified number of points
(MinPts) within Eps. These are points in the interior of a cluster.
▪ A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point
▪ A noise point is any point that is not a core point or a border point.
DBSCAN Algorithm (pseudocode)

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)

expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet a member of any cluster
         add P' to cluster C
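
In practice one would normally use an existing implementation rather than the
pseudocode above; a minimal R sketch, assuming the dbscan package and made-up
data and parameter values:

  library(dbscan)

  x <- matrix(rnorm(400), ncol = 2)      # hypothetical data

  # eps is the neighborhood radius, minPts the core-point threshold
  db <- dbscan(x, eps = 0.4, minPts = 5)

  db$cluster                             # cluster labels; 0 marks noise points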
DBSCAN: Core, Border and Noise Points

(Figure: the original points and the point types (core, border, and noise)
found with Eps = 10 and MinPts = 4.)
DBSCAN: Determine Clusters

(Figure: the point types (core, border, and noise) and the resulting clusters.)

▪ Resistant to noise
▪ Can handle clusters of different shapes and sizes
▪ Eps and MinPts depend on each other and can be hard to specify
When DBSCAN Does NOT Work Well

(Figure: the original points and two DBSCAN results, one with MinPts = 4 and
Eps = 9.75, the other with MinPts = 4 and Eps = 9.92.)

▪ Varying densities
▪ High-dimensional data
DBSCAN: Determining EPS and MinPts

▪ Idea is that for points in a cluster, their kth nearest neighbors are at
roughly the same distance
▪ Noise points have the kth nearest neighbor at farther distance
▪ So, plot sorted distance of every point to its kth nearest neighbor

(Figure: sorted kth-nearest-neighbor distance plot; set MinPts = k and read Eps
off the curve at the point where the distances start to increase sharply.)
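
A minimal R sketch of this heuristic, assuming the dbscan package and a made-up
data matrix x and value of k:

  library(dbscan)

  x <- matrix(rnorm(400), ncol = 2)   # hypothetical data

  # Plot each point's (sorted) distance to its 5th nearest neighbor.
  # Choose Eps near the knee of this curve and use MinPts = 5.
  kNNdistplot(x, k = 5)
  abline(h = 0.4, lty = 2)            # candidate Eps read off the plot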
Some Other Clustering Algorithms

▪ Center-based Clustering
—Fuzzy c-means
—PAM (Partitioning Around Medoids)
▪ Mixture Models
—Expectation-maximization (EM) algorithm
▪ Hierarchical
—CURE (Clustering Using Representatives): shrinks points toward the center
—BIRCH (balanced iterative reducing and clustering using hierarchies)
▪ Graph-based Clustering
—Graph partitioning on a sparsified proximity graph
—Shared nearest-neighbor (SNN) graph
▪ Spectral Clustering
—Reduce the dimensionality using the spectrum of the similarity matrix and
cluster in this space.
▪ Subspace Clustering
▪ Data Stream Clustering
Topics

▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
Cluster Validity

▪ For supervised classification (= we have a class label), we have a variety of
measures to evaluate how good our model is: accuracy, precision, recall.

▪ For cluster analysis (= unsupervised learning), the analogous question is:
how do we evaluate the “goodness” of the resulting clusters?
Clusters found in Random Data (Overfitting)

(Figure: randomly generated points, and the clusters reported for them by
DBSCAN, K-means, and complete-link hierarchical clustering.)
If you tell a clustering algorithm to find clusters then it will!


Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e.,


distinguishing whether non-random structure actually exists in the
data (e.g., to avoid overfitting).
2. External Validation: Compare the results of a cluster analysis to
externally known class labels (ground truth).
3. Internal Validation: Evaluating how well the results of a cluster
analysis fit the data without reference to external information.
4. Compare clusterings to determine which is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to


evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster
validity are classified into the following three types.

▪ External Index: Used to measure the extent to which cluster labels


match externally supplied class labels.
—Entropy, Purity, Rand index
▪ Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
—Sum of Squared Error (SSE), Silhouette coefficient
▪ Relative Index: Used to compare two different clusterings or clusters.
—Often an external or internal index is used for this function, e.g., SSE or
entropy
Similarity Matrix Visualization for Cluster Validation

▪ Order the similarity matrix with respect to cluster labels and inspect
visually.

(Figure: a data set with well-separated clusters and its ordered similarity
matrix, which shows crisp blocks along the diagonal.)
Similarity Matrix Visualization for Cluster Validation

▪ Clusters in random data are not as crisp.

(Figure: ordered similarity matrices for the DBSCAN, k-means, and complete-link
hierarchical clusterings of the random data; the diagonal blocks are much less
distinct.)
Measuring Cluster Validity Via Correlation

▪ Two matrices
—Proximity Matrix representing the data
—Incidence Matrix representing the clustering
• One row and one column for each data point
• An entry is 1 if the associated pair of points belong to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
▪ Compute the correlation between the two matrices
▪ High correlation indicates that points that belong to the same cluster
are close to each other.
▪ Not a good measure for some density or contiguity-based clusters
(e.g., single link HC).
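
A minimal R sketch of this correlation measure, assuming a hypothetical data
matrix x and cluster labels cl:

  x  <- matrix(rnorm(200), ncol = 2)        # hypothetical data
  cl <- kmeans(x, centers = 3, nstart = 10)$cluster

  prox <- as.matrix(dist(x))                # proximity (distance) matrix
  inc  <- outer(cl, cl, "==") * 1           # incidence matrix: 1 = same cluster

  # Correlate the corresponding off-diagonal entries; with a distance matrix
  # the correlation is negative, and larger magnitude indicates a better fit.
  lt <- lower.tri(prox)
  cor(prox[lt], inc[lt])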
Measuring Cluster Validity Via Correlation

▪ Correlation of the incidence and proximity matrices for the K-means
clusterings of two data sets: a data set with clear cluster structure gives
Corr = -0.9235, while random data gives Corr = -0.5810.

Note: the correlation between a distance matrix and an incidence matrix is
always negative, so values of larger magnitude indicate a better clustering.
Internal Measures: Cohesion and Separation

▪ Cluster Cohesion: Measures how closely related objects in a cluster


are.
▪ Cluster Separation: Measures how distinct or well-separated a cluster is
from other clusters.

Internal Measures: Sum of Squares

▪ Cluster Cohesion: within-cluster sum of squares (WSS = SSE)

  $WSS = \sum_{i=1}^{K} \sum_{\mathbf{x} \in C_i} \lVert \mathbf{x} - \mathbf{m}_i \rVert^2$

▪ Cluster Separation: between-cluster sum of squares (BSS)

  $BSS = \sum_{i=1}^{K} |C_i| \, \lVert \mathbf{m}_i - \mathbf{m} \rVert^2$

  where $|C_i|$ is the size of cluster $i$, $\mathbf{m}_i$ is the centroid of cluster $C_i$, and
  $\mathbf{m}$ is the centroid of the entire data set.

▪ Total sum of squares: $TSS = \sum_{\mathbf{x}} \lVert \mathbf{x} - \mathbf{m} \rVert^2$

  $TSS = WSS + BSS$
Internal Measures: Sum of Squares

TSS = BSS + WSS = constant for a given data set.

Example: the points 1, 2, 4, 5 on a line, with overall centroid $m = 3$.

K = 1 cluster (centroid $m_1 = m = 3$):
  $WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
  $BSS = 4 \times (3-3)^2 = 0$
  $Total = 10 + 0 = 10$

K = 2 clusters ({1, 2} with centroid $m_1 = 1.5$ and {4, 5} with centroid $m_2 = 4.5$):
  $WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$
  $BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$
  $Total = 1 + 9 = 10$
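
A minimal R sketch checking that TSS = WSS + BSS, using the fields of a base-R
kmeans result (the data and the number of clusters are made-up examples):

  x  <- matrix(rnorm(200), ncol = 2)    # hypothetical data
  km <- kmeans(x, centers = 3, nstart = 10)

  km$tot.withinss                       # WSS (= SSE)
  km$betweenss                          # BSS
  km$totss                              # TSS
  all.equal(km$totss, km$tot.withinss + km$betweenss)   # TRUE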
Internal Measures: Choosing k with Sum of Squares

▪ SSE is good for comparing two clusterings or two clusters (average SSE).
▪ It can also be used to estimate the number of clusters: plot SSE against k
and look for the knee in the curve.

(Figure: data points forming 10 clusters and the corresponding SSE-versus-k
curve.)
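
A minimal R sketch of this "look for the knee" heuristic, computing the K-means
SSE over a range of k values (the data and the range of k are made-up examples):

  set.seed(1)
  x <- matrix(rnorm(500), ncol = 2)     # hypothetical data

  ks  <- 1:10
  sse <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

  plot(ks, sse, type = "b", xlab = "k", ylab = "SSE (WSS)")  # look for the knee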
Internal Measures: Silhouette Coefficient

▪ A proximity graph-based approach can also be used for cohesion and


separation.
—Cluster cohesion is the sum of the weight of all links within a cluster.
—Cluster separation is the sum of the weights between nodes in the cluster
and nodes outside the cluster.

Internal Measures: Silhouette Coefficient

▪ The Silhouette Coefficient combines ideas of both cohesion and separation,
but for individual points. For an individual point i:
—Calculate a(i) = average dissimilarity of i to all other points in its cluster.
—Calculate b(i) = lowest average dissimilarity of i to the points of any other
cluster.
—The silhouette coefficient of i is then $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$.

(Figure: a(i) measures cohesion with respect to the own cluster of point i,
b(i) measures separation from the closest other cluster.)

▪ The closer to 1 the better.
▪ The average silhouette width can be calculated for a cluster or for a whole
clustering.
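
A minimal R sketch computing silhouette widths with the cluster package (the
data and the number of clusters are made-up examples):

  library(cluster)

  x  <- matrix(rnorm(200), ncol = 2)        # hypothetical data
  cl <- kmeans(x, centers = 3, nstart = 10)$cluster

  sil <- silhouette(cl, dist(x))            # s(i) for every point
  plot(sil)                                 # silhouette plot
  mean(sil[, "sil_width"])                  # average silhouette width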
Internal Measures: Silhouette Plot

(Figure: silhouette plot showing the silhouette width of every point, grouped
by cluster.)

Internal Measures: Choosing k using the Average Silhouette Width

(Figure: average silhouette width plotted for different values of k; larger
is better.)
External Measures of Cluster Validity: Entropy and Purity

Other measures: Precision, Recall, F-measure, Rand, Adj. Rand
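
Purity is one of the simplest external measures; a minimal R sketch, assuming
hypothetical cluster labels cl and ground-truth class labels truth:

  # Purity: for each cluster take the count of its most frequent true class,
  # sum over clusters, and divide by the total number of points.
  purity <- function(cl, truth) {
    tab <- table(cl, truth)
    sum(apply(tab, 1, max)) / length(cl)
  }

  # Made-up example
  cl    <- c(1, 1, 1, 2, 2, 3, 3, 3)
  truth <- c("a", "a", "b", "b", "b", "c", "c", "a")
  purity(cl, truth)    # 0.75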


Final Comment on Cluster Validity

“The validation of clustering structures is the most


difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster
analysis will remain a black art accessible only to
those true believers who have experience and great
courage.”

Algorithms for Clustering Data, Jain and Dubes
