Clustering
More Examples of Clustering
Applications
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
Land use: Identification of areas of similar land use in an earth observation
database
Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
City-planning: Identifying groups of houses according to their house type,
value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered
along continental faults
Recapitulation: Clustering
• The goal of clustering is to
– group data points that are close (or similar) to each other
– identify such groupings (or clusters) in an unsupervised manner
• Unsupervised: no information is provided to the algorithm on which
data points belong to which clusters
• Example
What should the clusters
be for these data points?
(Figure: a scatter of unlabeled data points marked with ×.)
What is Clustering?
• Clustering can be considered the most important unsupervised
learning problem; like every other problem of this kind, it deals with
finding a structure in a collection of unlabeled data.
In this case we easily identify the 4 clusters into which the data can be
divided; the similarity criterion is distance: two or more objects belong to
the same cluster if they are “close” according to a given distance. This is
called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects
belong to the same cluster if together they define a concept common to all of
those objects.
In other words, objects are grouped according to their fit to descriptive
concepts, not according to simple similarity measures.
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• Radius: square root of the average squared distance from any point of the cluster to its
centroid

  Rm = √( Σ_{i=1..N} (tmi − Cm)² / N )

• Diameter: square root of the average mean squared distance between all pairs
of points in the cluster

  Dm = √( Σ_{i=1..N} Σ_{j=1..N} (tmi − tmj)² / (N · (N − 1)) )

where tmi is the i-th point of cluster Km, Cm is its centroid and N is the number of points in the cluster.
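A minimal NumPy sketch of these two measures (the function names and the sample points are illustrative, not from the slides):

import numpy as np

def cluster_radius(points, centroid):
    # square root of the average squared distance from each point to the centroid (Rm)
    points = np.asarray(points, dtype=float)
    return np.sqrt(np.mean(np.sum((points - centroid) ** 2, axis=1)))

def cluster_diameter(points):
    # square root of the average squared distance over all ordered pairs of distinct points (Dm)
    points = np.asarray(points, dtype=float)
    n = len(points)
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)   # N x N squared distances
    return np.sqrt(sq.sum() / (n * (n - 1)))                               # diagonal terms are zero

pts = [[1, 1], [2, 1], [1, 2], [2, 2]]                 # a small 2-D cluster
c = np.mean(np.asarray(pts, dtype=float), axis=0)      # its centroid
print(cluster_radius(pts, c), cluster_diameter(pts))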
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of n objects into a
set of k clusters, such that the sum of squared distances to the cluster centroids is minimized:

  E = Σ_{m=1..k} Σ_{tmi ∈ Km} (Cm − tmi)²

where Km is the m-th cluster and Cm its centroid (mean).
(Figure: the K-Means method on a 2-D data set with K = 2. Arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign the objects; repeat until the assignments no longer change.)
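A minimal sketch of this loop in Python/NumPy (a sketch, not the slides' implementation; function and variable names are illustrative):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # arbitrarily choose K objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each object to the most similar (nearest) center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # update the cluster means; keep the old center if a cluster went empty
        new_centers = np.array([X[labels == m].mean(axis=0) if np.any(labels == m) else centers[m]
                                for m in range(k)])
        if np.allclose(new_centers, centers):      # stop when the means no longer move
            break
        centers = new_centers
    sse = ((X - centers[labels]) ** 2).sum()       # the objective E from the previous slide
    return centers, labels, sse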
Variations of the K-Means Method
• A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
Example
• Height and weight measurements are given for 12 objects. Using these two
variables, we need to group the objects.
Data Sample
Height Weight
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
Step 1: Input
Dataset, clustering variables and maximum number of clusters (K in K-Means clustering)
In this dataset, only two variables – height and weight – are considered for
clustering
Step 2: Initialize cluster centroids
In this example, the value of K is taken as 2. The cluster centroids are
initialized with the first two observations.
Initial Centroid
Cluster Height Weight
K1 185 72
K2 170 56
Step 3: Calculate Euclidean Distance
The first two observations are assigned trivially: they were chosen as the initial
centroids, so their cluster membership is already known and the centroids do not
change.
Step 4: Move on to the next observation and calculate its Euclidean distance
Height Weight
168 60
Euclidean distance from Cluster 1: √((168 − 185)² + (60 − 72)²) = 20.81
Euclidean distance from Cluster 2: √((168 − 170)² + (60 − 56)²) = 4.47
Assignment: Cluster 2 (the smaller distance)
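The same check in NumPy (a re-computation of Step 4; the variable names are illustrative):

import numpy as np

x = np.array([168, 60])        # the observation from Step 4 (height, weight)
k1 = np.array([185, 72])       # initial centroid of cluster K1
k2 = np.array([170, 56])       # initial centroid of cluster K2

d1 = np.linalg.norm(x - k1)    # sqrt((168-185)^2 + (60-72)^2) ≈ 20.81
d2 = np.linalg.norm(x - k2)    # sqrt((168-170)^2 + (60-56)^2) ≈ 4.47
print(d1, d2)                  # the smaller distance wins, so the observation joins cluster 2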
The K-Medoids Method (PAM)
(Figure: K-Medoids with K = 2. Arbitrarily choose k objects as the initial medoids and assign each remaining object to the nearest medoid; then, in a loop, compute the total cost of swapping a medoid O with a randomly selected non-medoid O_random and perform the swap if the quality is improved, until no change.)
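A minimal PAM-style sketch of this swapping loop in Python/NumPy (a sketch under the assumptions in the figure; names such as pam and total_cost are illustrative):

import numpy as np

def pam(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances

    def total_cost(meds):
        # cost = sum of distances from every object to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    medoids = rng.choice(n, size=k, replace=False)               # arbitrarily choose k initial medoids
    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                                       # try swapping each medoid O ...
            for o in range(n):                                   # ... with every non-medoid O_random
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o
                new_cost = total_cost(candidate)                 # compute total cost of swapping
                if new_cost < cost:                              # swap only if the quality improves
                    medoids, cost, improved = candidate, new_cost, True
        if not improved:                                         # until no change
            break
    labels = D[:, medoids].argmin(axis=1)                        # assign each object to its nearest medoid
    return medoids, labels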
Hierarchical Clustering
• Clusters are created in levels, producing a set of clusters at each level.
• Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
• Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method does not require
the number of clusters k as an input, but it needs a termination condition
Illustrative Example:
Agglomerative and divisive clustering on the data set {a, b, c, d, e}
(Figure: agglomerative clustering (Step 0 → Step 4) first merges a and b into {a, b} and d and e into {d, e}, then forms {c, d, e}, and finally {a, b, c, d, e}; divisive clustering performs the same sequence in reverse (Step 4 → Step 0). A cluster distance measure drives the merges and a termination condition stops the process.)
Hierarchical Agglomerative Clustering
(HAC)
• Starts with each item (document) in a separate cluster
– then repeatedly joins the closest pair of clusters, until there is only one
cluster.
• The history of merging forms a binary tree or hierarchy.
How do we measure the distance between clusters?
Closest pair of clusters
Many variants to defining closest pair of clusters
• Single-link
– Distance of the “closest” points (single-link)
• Complete-link
– Distance of the “farthest” points
• Centroid
– Distance of the centroids (centers of gravity)
• (Average-link)
– Average distance between pairs of elements
Cluster Distance Measures
• Single link (min): smallest distance between an element in one cluster and an
element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}; note that d(C, C) = 0.
Dendrogram
• Dendrogram: a tree data
structure which illustrates
hierarchical clustering
techniques.
• Each level shows clusters for
that level.
– Leaf – individual clusters
– Root – one cluster
• A cluster at level i is the union of
its children clusters at level i+1.
Cluster Distance Measures
Example: Given a data set of five objects characterized by a single continuous feature,
assume that there are two clusters: C1: {a, b} and C2: {c, d, e}.

Feature values:
         a  b  c  d  e
Feature  1  2  4  5  6

Distance matrix:
    a  b  c  d  e
a   0  1  3  4  5
b   1  0  2  3  4
c   3  2  0  1  2
d   4  3  1  0  1
e   5  4  2  1  0

Single link:
dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = min{3, 4, 5, 2, 3, 4} = 2

Complete link:
dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = max{3, 4, 5, 2, 3, 4} = 5

Average link:
dist(C1, C2) = (d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5
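A quick plain-Python check of the three measures on this example (the dictionary d below simply encodes the distance matrix above):

# d encodes the distance matrix above as a dict of dicts
d = {
    'a': {'a': 0, 'b': 1, 'c': 3, 'd': 4, 'e': 5},
    'b': {'a': 1, 'b': 0, 'c': 2, 'd': 3, 'e': 4},
    'c': {'a': 3, 'b': 2, 'c': 0, 'd': 1, 'e': 2},
    'd': {'a': 4, 'b': 3, 'c': 1, 'd': 0, 'e': 1},
    'e': {'a': 5, 'b': 4, 'c': 2, 'd': 1, 'e': 0},
}
C1, C2 = ['a', 'b'], ['c', 'd', 'e']
pair_dists = [d[p][q] for p in C1 for q in C2]   # d(a,c), d(a,d), ..., d(b,e)

print(min(pair_dists))                    # single link   -> 2
print(max(pair_dists))                    # complete link -> 5
print(sum(pair_dists) / len(pair_dists))  # average link  -> 3.5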
Agglomerative Algorithm
• The Agglomerative algorithm is carried out in three steps:
1) Convert all object features into a
distance matrix
2) Set each object as a cluster (thus if
we have N objects, we will have N
clusters at the beginning)
3) Repeat until the number of clusters is one (or a known number of clusters):
– Merge the two closest clusters
– Update the “distance matrix”
(a minimal sketch of these steps is given below)
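A minimal sketch of this procedure with the single-link distance, in plain Python/NumPy (function and variable names are illustrative; the loop is the naive version that rescans all cluster pairs each round):

import numpy as np

def single_link_agglomerative(D, target_k=1):
    # D is a precomputed N x N distance matrix (step 1);
    # each object starts as its own cluster (step 2).
    D = np.asarray(D, dtype=float)
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > target_k:                 # step 3: repeat until the target number of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: smallest pairwise distance between the two clusters
                dist = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        clusters[a] = clusters[a] + clusters[b]     # merge the two closest clusters
        del clusters[b]                             # (the cluster distances are re-derived on the fly)
    return clusters, merges

# Example run on the distance matrix used earlier for {a, b, c, d, e}:
D = [[0, 1, 3, 4, 5],
     [1, 0, 2, 3, 4],
     [3, 2, 0, 1, 2],
     [4, 3, 1, 0, 1],
     [5, 4, 2, 1, 0]]
print(single_link_agglomerative(D, target_k=2))     # -> clusters [[0, 1], [2, 3, 4]], i.e. {a, b} and {c, d, e}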
Example
• Problem: clustering analysis with agglomerative algorithm
(Figure: the data matrix is converted into a distance matrix using the Euclidean distance.)
Example
• Merge two closest clusters (iteration 1)
Example
• Update distance matrix (iteration 1)
Example
• Merge two closest clusters (iteration 2)
Example
• Update distance matrix (iteration 2)
Example
• Merge two closest clusters/update distance matrix (iteration 3)
Example
• Merge two closest clusters/update distance matrix (iteration 4)
Example
• Final result (meeting termination condition)
Example
• Dendrogram tree representation
Exercise
Given a data set of five objects characterised by a single continuous feature:
         a  b  c  d  e
Feature  1  2  4  5  6
Apply the agglomerative algorithm with the single-link, complete-link and average-link cluster
distance measures to produce three dendrogram trees, respectively (a sketch for checking the
result follows the distance matrix below).
a b c d e
a 0 1 3 4 5
b 1 0 2 3 4
c 3 2 0 1 2
d 4 3 1 0 1
e 5 4 2 1 0
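One way to check the exercise is with SciPy's hierarchical-clustering routines (SciPy is not mentioned in the slides; this is just a suggested tool). Each linkage method corresponds to one of the three cluster distance measures:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# the distance matrix of the exercise (objects a, b, c, d, e)
D = np.array([[0, 1, 3, 4, 5],
              [1, 0, 2, 3, 4],
              [3, 2, 0, 1, 2],
              [4, 3, 1, 0, 1],
              [5, 4, 2, 1, 0]], dtype=float)

condensed = squareform(D)                 # SciPy expects the condensed (upper-triangle) form
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    print(method)
    print(Z)                              # each row: cluster i, cluster j, merge distance, new cluster size
    # dendrogram(Z, labels=list("abcde"))  # uncomment (with matplotlib) to draw the tree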
Density-Based Clustering Algorithms
Density-Based Clustering
• Clustering based on density (local cluster criterion), such as density-
connected points or based on an explicitly constructed density
function
• A cluster is a connected dense component that can grow in any direction
that density leads.
• Density, connectivity and boundary
• Arbitrary shaped clusters and good scalability
• Each cluster has a considerably higher density of points than the region
outside of the cluster
Major Features
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
Two Major Types of Density-Based
Clustering Algorithms
• Connectivity based:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– CLIQUE: Agrawal, et al. (SIGMOD’98)
(Figure: the ε-neighborhoods of two points p and q. The density of p is “high” (MinPts = 4), while the density of q is “low” (MinPts = 4).)
Core, Border & Outlier
Given ε and MinPts, categorize the objects into three exclusive groups: a core point
has at least MinPts points in its ε-neighborhood, a border point falls within the
ε-neighborhood of a core point but is not itself a core point, and an outlier is neither.
(Figure: MinPts = 3, Eps = the radius of the circles.)
Density-Reachability
• Directly density-reachable
– An object q is directly density-reachable from an object p if p is a
core object and q is in p's ε-neighborhood.
• Density-connected
– A point p is density-connected to a point q wrt. Eps, MinPts if
there is a point o such that both p and q are density-reachable
from o wrt. Eps and MinPts.
Formal Description of Cluster
• Given a data set D, a parameter ε and a threshold MinPts.
• A cluster C is a subset of objects satisfying two criteria:
– Connected: ∀ p, q ∈ C: p and q are density-connected.
– Maximal: ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C.
(avoids redundancy)
Review of Concepts
Is an object o in a cluster or an outlier? Are objects p and q in the same cluster?
DBScan Algorithm
DBSCAN: The Algorithm
– Arbitrarily select a point p
– Retrieve all points density-reachable from p w.r.t. ε and MinPts
– If p is a core point, a cluster is formed; if p is a border point, no points are
density-reachable from p and DBSCAN visits the next point of the database
– Continue the process until all of the points have been processed.
DBSCAN Algorithm: Example
• Parameters
– ε = 2 cm
– MinPts = 3

for each o ∈ D do
  if o is not yet classified then
    if o is a core-object then
      collect all objects density-reachable from o
      and assign them to a new cluster
    else
      assign o to NOISE
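The same idea is available off the shelf in scikit-learn (not mentioned in the slides; the data below is made up for illustration), where eps plays the role of ε and min_samples the role of MinPts:

import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups of points plus one isolated point (made-up data)
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3], [1.1, 0.8],
              [8.0, 8.0], [8.2, 8.1], [7.9, 8.3], [8.1, 7.8],
              [4.0, 15.0]])

labels = DBSCAN(eps=0.7, min_samples=3).fit_predict(X)
print(labels)   # cluster ids 0, 1, ...; noise points are labelled -1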
DBSCAN Algorithm: Advantages
• DBSCAN does not require one to specify the number of clusters in the data
a priori, as opposed to k-means.
• DBSCAN can find arbitrarily shaped clusters. It can even find a cluster
completely surrounded by (but not connected to) a different cluster. Due to
the MinPts parameter, the so-called single-link effect (different clusters being
connected by a thin line of points) is reduced.
• DBSCAN has a notion of noise, and is robust to outliers.
• DBSCAN requires just two parameters and is mostly insensitive to the
ordering of the points in the database. (However, points sitting on the edge of
two different clusters might swap cluster membership if the ordering of the
points is changed, and the cluster assignment is unique only up to
isomorphism.)
• The parameters minPts and ε can be set by a domain expert, if the data is well
understood.
DBSCAN Algorithm: Disadvantages
• DBSCAN is not entirely deterministic: border points that are reachable from
more than one cluster can be part of either cluster, depending on the order the
data is processed. Fortunately, this situation does not arise often, and has little
impact on the clustering result: both on core points and noise points,
DBSCAN is deterministic.
• The quality of DBSCAN depends on the distance measure used in the
function regionQuery (P, ε). The most common distance metric used
is Euclidean distance. Especially for high-dimensional data, this metric can be
rendered almost useless due to the so-called "Curse of dimensionality",
making it difficult to find an appropriate value for ε. This effect, however, is
also present in any other algorithm based on Euclidean distance.
• DBSCAN cannot cluster data sets well with large differences in densities,
since the minPts-ε combination cannot then be chosen appropriately for all
clusters.
• If the data and scale are not well understood, choosing a meaningful distance
threshold ε can be difficult.
Steps of Grid-based Clustering
Algorithms
Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and compute the density
of each cell.
3. Eliminate cells whose density is below a certain threshold.
4. Form clusters from contiguous (adjacent) groups of dense cells
(usually minimizing a given objective function). A minimal sketch of these four steps is given below.
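A minimal Python sketch of the four steps above for 2-D points (the function name, the cell size and the 4-neighbourhood adjacency rule are illustrative choices, not prescribed by the slides):

import numpy as np
from collections import deque

def grid_cluster(X, cell_size, density_threshold):
    X = np.asarray(X, dtype=float)
    # steps 1-2: define the grid cells and assign each object to its cell
    cells = {}
    for idx, p in enumerate(X):
        key = tuple((p // cell_size).astype(int))
        cells.setdefault(key, []).append(idx)
    # step 3: keep only the cells whose density (point count) reaches the threshold
    dense = {k: v for k, v in cells.items() if len(v) >= density_threshold}
    # step 4: join contiguous dense cells (4-neighbourhood in 2-D) by flood fill
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        members, queue = [], deque([start])
        seen.add(start)
        while queue:
            c = queue.popleft()
            members.extend(dense[c])
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nb = (c[0] + dx, c[1] + dy)
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(members)
    return clusters          # lists of point indices, one list per cluster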
Advantages of Grid-based Clustering Algorithms
• fast:
– No distance computations
– Clustering is performed on summaries and not individual objects;
complexity is usually O(#-populated-grid-cells) and not O(#objects)
– Easy to determine which clusters are neighboring
• Limitation: cluster shapes are restricted to unions of grid cells
Grid-Based Clustering Methods
• Grid-based methods quantize the object space into a finite number of cells
that form a grid structure (using a multi-resolution grid data structure).
• All the clustering operations are performed on the grid structure.
• Clustering complexity depends on the number of populated grid cells and
not on the number of objects in the dataset
• Several interesting methods (in addition to the basic grid-based algorithm)
– STING (a STatistical INformation Grid approach) by Wang, Yang and
Muntz (1997)
– CLIQUE: Agrawal, et al. (SIGMOD’98)
STING: A Statistical Information Grid
Approach
• Wang, Yang and Muntz (VLDB’97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of
resolution
STING: A Statistical Information Grid
Approach (2)
– Each cell at a high level is partitioned into a number of smaller cells in the
next lower level
– Statistical info of each cell is calculated and stored beforehand and is
used to answer queries
– Parameters of higher level cells can be easily calculated from the parameters
of the lower level cells
• count, mean, standard deviation (s), min, max
• type of distribution—normal, uniform, etc.
– Use a top-down approach to answer spatial data queries
STING: Query Processing(3)
Uses a top-down approach to answer spatial data queries:
1.Start from a pre-selected layer—typically with a small number of cells
2.From the pre-selected layer until you reach the bottom layer do the following:
• For each cell in the current level compute the confidence interval
indicating a cell’s relevance to a given query;
– If it is relevant, include the cell in a cluster
– If it is irrelevant, remove the cell from further consideration
– otherwise, look for relevant cells at the next lower layer
3.Combine relevant cells into relevant regions (based on grid-neighborhood)
and return the so obtained clusters as your answers.
STING: A Statistical Information Grid
Approach (3)
– Advantages:
• Query-independent, easy to parallelize, incremental update
• O(K), where K is the number of grid cells at the lowest level
– Disadvantages:
• All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected