07-Clustering
[Figure: the same set of houses clustered in two ways, geographic-distance based vs. size based; each point represents a house]
Clustering Definition
◘ Clustering
– Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that data points in one cluster are more similar to one another, and data points in separate clusters are less similar to one another.
Clustering Applications
Data Types
◘ Numeric
– Continuous: 12, 45, 34, 9, 48
– Discrete: 0-18, 18-40, 40-100
◘ Categorical
– Binary: Smoker, Non-Smoker
– Nominal: Mountain bicycle, Utility bicycle, Racing bicycle
– Ordinal: Very Unhappy, Unhappy, Neutral, Happy, Very Happy
Similarity/Dissimilarity Measures
◘ Numeric Data
– If attributes are continuous:
• Manhattan Distance (p=1)
• Euclidean Distance (p=2)
• Minkowski Distance
◘ Others
– Problem-specific measures
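As a small illustration of the three measures above (the two points and the helper function below are made up, not from the slides):

```python
# Minimal sketch: Minkowski distance, with p=1 (Manhattan) and p=2 (Euclidean).
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance between two numeric vectors."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

x, y = [1.0, 4.0, 2.0], [3.0, 1.0, 2.0]
print(minkowski(x, y, p=1))  # Manhattan: |1-3| + |4-1| + |2-2| = 5.0
print(minkowski(x, y, p=2))  # Euclidean: sqrt(4 + 9 + 0) ≈ 3.606
```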
Example for Clustering Numeric Data
◘ Document Clustering
– Each document becomes a 'term' vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the corresponding term occurs in the document.
Doc        | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 |  3   |   0   |  5   |  0   |   2   |  6   |  0  |  2   |    0    |   2
Document 2 |  0   |   7   |  0   |  2   |   1   |  0   |  0  |  3   |    0    |   0
Document 3 |  0   |   1   |  0   |  0   |   1   |  2   |  2  |  0   |    3    |   0
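A small sketch applying the Euclidean distance (p = 2) defined above to the three term vectors in this table; the term order follows the reconstructed header:

```python
import numpy as np

terms = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]
docs = np.array([
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],   # Document 1
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],   # Document 2
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],   # Document 3
])

# Pairwise Euclidean distances between the documents' term-frequency vectors.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        dist = np.linalg.norm(docs[i] - docs[j])
        print(f"dist(Document {i+1}, Document {j+1}) = {dist:.2f}")
```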
Dissimilarity Measures for Categorical Data
◘ Categorical Data:
◘ e.g. Binary Variables - 0/1 - presence/absence
– Contingency table for two objects i and j:

              Object j
              1      0     sum
Object i  1   a      b     a+b
          0   c      d     c+d
        sum  a+c    b+d     p

– Dissimilarity (negative matches d ignored): d(i, j) = (b + c) / (a + b + c)
Example for Clustering Categorical Data
◘ Using the contingency table above, with d(i, j) = (b + c) / (a + b + c):
– e.g., for a pair with a = 1, b = 3, c = 0, d = 0: d(i, j) = (3 + 0) / (1 + 3 + 0) = 0.75
– d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
– d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
– d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
Result: Jim and Mary are unlikely to have a similar disease.
Jack and Mary are the most likely to have a similar disease.
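A tiny sketch of the same computation; the contingency counts (a, b, c) below are exactly the ones used in the three distances above:

```python
# Asymmetric binary dissimilarity: d(i, j) = (b + c) / (a + b + c).
def binary_dissim(a, b, c):
    return (b + c) / (a + b + c)

print(binary_dissim(a=2, b=0, c=1))  # d(jack, mary) = 0.33
print(binary_dissim(a=1, b=1, c=1))  # d(jack, jim)  = 0.67
print(binary_dissim(a=1, b=1, c=2))  # d(jim, mary)  = 0.75
```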
Categories of Clustering Algorithms
◘ Partitioning: K-Means, K-Medoids (PAM), CLARA, CLARANS
◘ Hierarchical: AGNES, DIANA, BIRCH, CURE, CHAMELEON
◘ Density-Based: DBSCAN, OPTICS, DENCLUE
◘ Grid-Based: STING, WaveCluster, CLIQUE
◘ Model-Based: COBWEB, CLASSIT, SOM
Partitioning Methods
◘ Strength:
– Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
– Easy to understand
◘ Weakness
– Applicable only when the mean is defined; what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
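A minimal k-means sketch with scikit-learn (assumed available) that makes the points above concrete: k must be supplied in advance, and the result is a set of cluster means, which only exist for numeric data. The dataset is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),     # two synthetic groups of points
               rng.normal(6, 1, (50, 2))])

# k (the number of clusters) has to be specified in advance
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # cluster means (only defined for numeric data)
print(km.labels_[:10])       # cluster index assigned to each object
```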
PAM: A Typical K-Medoids Algorithm
◘ Example with K = 2 (the two configurations in the original figure have Total Cost = 20 and Total Cost = 26):
– Arbitrarily choose k objects as the initial medoids.
– Assign each remaining object to its nearest medoid.
– Randomly select a non-medoid object, O_random.
– Compute the total cost of swapping a current medoid O with O_random.
– If the quality is improved, perform the swap.
– Repeat (do loop) until no change.
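The swap-based search above can be sketched as follows. This is a rough, brute-force illustration of the same idea, not the exact PAM cost-update formulation; it assumes a precomputed pairwise distance matrix D:

```python
import numpy as np

def total_cost(D, medoids):
    """Sum, over all objects, of the distance to the nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """Very small PAM sketch on an n x n distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))   # arbitrary initial medoids
    best = total_cost(D, medoids)
    improved = True
    while improved:                                         # repeat until no change
        improved = False
        for m in range(k):                                  # each current medoid
            for o in range(n):                              # each non-medoid O_random
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o                            # tentative swap
                cost = total_cost(D, candidate)
                if cost < best:                             # keep the swap only if quality improves
                    medoids, best, improved = candidate, cost, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best
```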
The K-Medoid Clustering Method
◘ Cluster quality measures:
– Within-cluster sum of squares: $\mathrm{WCSS} = \sum_{C_k \in \{C_1,\dots,C_n\}} \; \sum_{d_i \in C_k} \mathrm{distance}(d_i, C_k)^2$
– Silhouette coefficient: S(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other points of its own cluster (cohesion) and b(i) is the smallest average distance from i to the points of another cluster (separation).
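A small sketch of computing both measures with scikit-learn (assumed available); for simplicity it uses KMeans, whose inertia_ attribute is exactly the WCSS, together with silhouette_score, which averages S(i) over all points. The data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),     # two synthetic blobs
               rng.normal(6, 1, (50, 2))])

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # km.inertia_ is the WCSS; silhouette_score averages S(i) over all points
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```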
Hierarchical Clustering
◘ Create a hierarchical decomposition of the set of data using some criterion
◘ Strength: This method does not require the number of clusters k as an input.
◘ Weakness: It needs a termination condition.
Dendrogram
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
…
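A minimal sketch of building such a dendrogram with SciPy (assumed available); the data is synthetic, and 'average' linkage is one of the merge criteria discussed next.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))            # small synthetic dataset

Z = linkage(X, method="average")        # agglomerative (AGNES-style) merge tree
dendrogram(Z)                           # nested partitioning drawn as a tree
plt.show()
```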
How the Clusters are Merged?
◘ Average Link: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(t_ip, t_jq)
◘ Centroid Link: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
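As a hedged formalization of the two criteria above (the cluster-size notation |K_i| is an addition; t_ip and t_jq range over the points of the two clusters, and C_i, C_j are their centroids):

```latex
d_{\mathrm{avg}}(K_i, K_j) = \frac{1}{|K_i|\,|K_j|}
  \sum_{t_{ip} \in K_i} \sum_{t_{jq} \in K_j} d(t_{ip}, t_{jq}),
\qquad
d_{\mathrm{centroid}}(K_i, K_j) = d(C_i, C_j)
```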
[Figure: a sample dataset and the dendrograms produced by different merge criteria, including Average Link]
DIANA (Divisive Analysis)
[Figure: DIANA progressively splitting one all-inclusive cluster into smaller clusters on a small 2-D dataset]
DIANA (Divisive Analysis)
◘ Handling noise
– Use a threshold to determine the termination criterion (do not generate clusters that are too small because they contain mainly noise)
Hierarchical Clustering: DIANA
• Step 1: Initially Cl = {a, b, c, d, e}
• Step 2: Ci = Cl and Cj = Φ
• Step 3: Initial iteration
• Calculate the average dissimilarity of each object in Ci to the other objects in Ci.
Average dissimilarity of a:
a = (1/4) × (9 + 3 + 6 + 11) = 7.25
Similarly, we have:
b = 7.75, c = 5.25, d = 7.00, e = 7.75
• The highest average dissimilarity is 7.75, and there are two corresponding objects; arbitrarily choose one of them, say b.
• Move b to Cj.
• The updated cluster elements are:
Ci = {a, c, d, e} and Cj = {b}
Hierarchical Clustering: DIANA
• Step 4: Remaining iterations
• (i) 2nd iteration
• For each object in Ci, calculate the difference between its average dissimilarity to the other objects in Ci and its average dissimilarity to the objects in Cj (e.g., D_d = 0.67).
• The object with the largest positive difference is moved to Cj; the splitting of this cluster stops once all differences become negative (e.g., D_e = -2.5).
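A rough sketch of this splitting rule. The full 5×5 dissimilarity matrix is not shown in the extract, so the matrix below is an assumption chosen to be consistent with the values that do appear (the row for a, 9/3/6/11, and the averages 7.25, 7.75, 5.25, 7.00, 7.75); the point is the mechanics of growing the splinter group Cj.

```python
import numpy as np

names = ["a", "b", "c", "d", "e"]
D = np.array([[0, 9, 3, 6, 11],     # hypothetical dissimilarity matrix (see note above)
              [9, 0, 7, 5, 10],
              [3, 7, 0, 9, 2],
              [6, 5, 9, 0, 8],
              [11, 10, 2, 8, 0]], dtype=float)

Ci, Cj = list(range(5)), []

# Initial iteration: move the object with the highest average dissimilarity within Ci.
avg = [D[i, [j for j in Ci if j != i]].mean() for i in Ci]
Cj.append(Ci.pop(int(np.argmax(avg))))           # ties broken arbitrarily (first maximum)

# Remaining iterations: a positive difference means the object sits closer to Cj.
while len(Ci) > 1:
    diffs = [D[i, [j for j in Ci if j != i]].mean() - D[i, Cj].mean() for i in Ci]
    if max(diffs) <= 0:                          # stop when every difference is negative
        break
    Cj.append(Ci.pop(int(np.argmax(diffs))))

print("Ci =", [names[i] for i in Ci], "  Cj =", [names[i] for i in Cj])
```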
Density-Based Clustering
[Figure: points p and q with their Eps = 1 cm neighborhoods; MinPts = 5]
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
[Figure: core and border points of a cluster for Eps = 1 cm and MinPts = 5]
Density-Reachable and Density-Connected
◘ Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i.
◘ Density-connected:
– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
DBSCAN: The Algorithm
– Arbitrarily select a point p.
– Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
– If p is a core point, a cluster is formed.
– If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
– Continue the process until all of the points have been processed.
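A minimal DBSCAN sketch using scikit-learn (assumed available); the dataset and the parameter values eps and min_samples are made up and would need tuning in practice. Points labelled -1 are treated as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),    # two dense synthetic clusters
               rng.normal(3, 0.3, (100, 2)),
               rng.uniform(-2, 5, (20, 2))])    # scattered noise points

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", sorted(set(labels) - {-1}))
print("points labelled as noise:", int(np.sum(labels == -1)))
```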
DBSCAN: Advantages
◘ Resistant to noise
◘ Can handle clusters of different shapes and sizes
DBSCAN: Disadvantages
• It does not work well with varying densities or high-dimensional data
• Sensitive to parameters
Grid Based Clustering Methods
◘ Quantize the object space into a finite number of cells that form a grid structure; e.g., STING, WaveCluster, CLIQUE
Model Based Methods
◘ Attempt to optimize the fit between the given data and some mathematical model
◘ Use statistical functions
Clustering Algorithms: General Overview
Factors Affecting Clustering Results
◘ Outliers
◘ Inappropriate values for parameters
◘ Drawbacks of the clustering algorithms themselves
◘ Clustering
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
• Intracluster distances are minimized; intercluster distances are maximized.