
Clustering

CME 4416 – Introduction to Data Mining

Asst. Prof. Dr. Göksu Tüysüzoğlu


Outline

◘ What is Cluster Analysis?


◘ Similarity Measure
◘ Clustering Applications
◘ Clustering Methods
◘ Clustering Algorithm Selection
◘ Cluster Validation
What is Cluster Analysis?

◘ Clustering is the process of grouping large data sets according to their similarity.

◘ A cluster is a collection of data objects that are similar to one another within the same cluster.

[Figure: each point represents a house; the same houses can be grouped into size-based clusters or geographic-distance-based clusters.]
Clustering Definition

◘ Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)

◘ Grouping similar data objects into clusters.

◘ Clustering
– Given a set of data points, each with a set of attributes,
– and a similarity measure among them,
– find clusters so that points in the same cluster are similar to one another.
Clustering Applications

◘ Spatial Data Analysis: detect spatial clusters, city planning
◘ Image Processing
◘ WWW: cluster Weblog data
◘ Information retrieval: document clustering
◘ Marketing: help marketers discover distinct groups in their customer bases
◘ Medical Applications
◘ Earthquake studies: observed earthquake epicenters should be clustered along continent faults
◘ Climate: understanding the Earth's climate, finding patterns in atmospheric and ocean data
◘ Biology: taxonomy of living things (kingdom, phylum, class, order, family, genus and species)
Quality: What Is Good Clustering?

◘ A good clustering method will produce high quality clusters with
– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between clusters
◘ The quality of a clustering method depends on
– the similarity measure used by the method
– its implementation, and
– its ability to discover some or all of the hidden patterns
Similarity/Dissimilarity Measures

Data Types
– Categorical: Nominal, Ordinal
– Numeric: Continuous, Discrete

Continuous | Discrete | Binary     | Categorical (Nominal) | Categorical (Ordinal)
12         | 0-18     | Smoker     | Mountain bicycle      | Very Unhappy
45         | 18-40    | Non-Smoker | Utility bicycle       | Unhappy
34         | 40-100   |            | Racing bicycle        | Neutral
9          |          |            |                       | Happy
48         |          |            |                       | Very Happy
Similarity/Dissimilarity Measures

◘ Numeric Data
– If attributes are continuous:
• Manhattan Distance (p = 1)
• Euclidean Distance (p = 2)
• Minkowski Distance: d(i, j) = (Σ_k |x_ik − x_jk|^p)^(1/p)

◘ Categorical Data
– Jaccard's distance: d(i, j) = (b + c) / (a + b + c)
– ...

◘ Others
– Problem-specific measures
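A minimal sketch (not from the lecture) of these numeric distance measures in Python with NumPy; the example vectors are made up for illustration.

```python
import numpy as np

def minkowski(x, y, p):
    """General Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, p=1))  # Manhattan distance: |1-4| + |2-0| + |3-3| = 5
print(minkowski(x, y, p=2))  # Euclidean distance: sqrt(9 + 4 + 0) ≈ 3.606
```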
Example for Clustering Numeric Data

◘ Document Clustering
– Each document becomes a `term' vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the corresponding term occurs in
the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2      0       2
Document 2     0     7     0     2      1     0     0     3      0       0
Document 3     0     1     0     0      1     2     2     0      3       0
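As an illustration of how such term vectors might be produced, here is a sketch assuming scikit-learn's CountVectorizer is available; the toy documents below are invented, not the ones in the table above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "team play play score game game lost season",
    "coach ball score lost",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # document-term count matrix
print(vectorizer.get_feature_names_out())    # the terms (vector components)
print(X.toarray())                           # each row is one document's term vector
```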
Dissimilarity Measures for Categorical Data

◘ Categorical Data:
◘ e.g. Binary Variables - 0/1 - presence/absence

Jaccard's distance (measures dissimilarity):

  d(i, j) = (b + c) / (a + b + c)

Contingency table for binary variables (Object i vs. Object j):

              Object j
              1      0      sum
  Object i 1  a      b      a + b
           0  c      d      c + d
       sum    a + c  b + d  p
Example for Clustering Categorical Data

◘ Find the Jaccard's distance between Apple and Banana.

Feature of Fruit    Sphere shape   Sweet   Sour   Crunchy
Object i = Apple    Yes            Yes     Yes    Yes
Object j = Banana   No             Yes     No     No

  a = 1, b = 3, c = 0, d = 0

  d(i, j) = (b + c) / (a + b + c) = (3 + 0) / (1 + 3 + 0) = 3/4 = 0.75
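A minimal Python sketch (not from the slides) of Jaccard's distance on 0/1 feature vectors, reproducing the Apple/Banana result above.

```python
def jaccard_distance(i, j):
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)  # attributes present in both
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)  # present only in i
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)  # present only in j
    return (b + c) / (a + b + c)

apple  = [1, 1, 1, 1]   # sphere shape, sweet, sour, crunchy
banana = [0, 1, 0, 0]
print(jaccard_distance(apple, banana))  # 0.75
```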
Example for Clustering Categorical Data

◘ Who are the most likely to have a similar disease?


Name Fever Cough Test-1 Test-2 Test-3 Test-4
Jack 1 0 1 0 0 0
Mary 1 0 1 0 1 0
Jim 1 1 0 0 0 0

0 1 b c
d ( jack , mary )  0.33 d (i, j) 
2  0 1 a b  c
11
d ( jack , jim )  0.67 Object j
111 1 0 sum
1 2
d ( jim , mary )  0.75 1 a b a b
11 2 Object i
0 c d c d
sum a  c b  d p
Result: Jim and Mary are unlikely to have a similar disease.
Jack and Mary are the most likely to have a similar disease.
Categories of Clustering Algorithms

Partitioning     Hierarchical   Density-Based   Grid-Based    Model-Based
Methods          Methods        Methods         Methods       Methods

K-Means          AGNES          DBSCAN          STING         COBWEB
K-Medoids (PAM)  DIANA          OPTICS          WaveCluster   CLASSIT
CLARA            BIRCH          DENCLUE         CLIQUE        SOM
CLARANS          CURE
                 CHAMELEON
Partitioning Methods

◘ Construct a partition of a database D of n objects into a set of k clusters

◘ Given a k, find the partition of k clusters that optimizes the chosen partitioning criterion, e.g. minimize the SSE (Sum of Squared Errors)
Partitioning Methods

◘ k-means (MacQueen’67, Lloyd’57/’82)
– Each cluster is represented by the center of the cluster

◘ k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87)
– Each cluster is represented by one of the objects in the cluster
K-Means

◘ K-Means is an algorithm to cluster n objects, based on their attributes, into k partitions, k < n.

◘ Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when the assignment does not change
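The four steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the lecture's code; the toy data are made up and empty clusters are not handled.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: start from k objects chosen as initial seed points (centroids)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2 (repeated): recompute the centroids (mean points) of the current clusters
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids (and hence the assignment) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])  # two toy blobs
labels, centroids = k_means(X, k=2)
print(centroids)  # should land near (0, 0) and (5, 5)
```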
K-Means Example

[Figure: k-means example (illustration slides).]
K-Means: Advantages and Disadvantages

◘ Strength:
– Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
– Easy to understand

◘ Weakness
– Applicable only when mean is defined, then what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes

[Figure: a dataset with non-convex clusters where another clustering algorithm succeeds but K-Means fails.]
Problem of the K-Means Method?

◘ The k-means algorithm is sensitive to outliers!
– An object with an extremely large value may substantially distort the distribution of the data
◘ K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

[Figure: two 10×10 scatter plots contrasting the mean-based representative used by k-means with the medoid used by k-medoids.]
PAM: A Typical K-Medoids Algorithm
[Figure: PAM illustrated on a 10×10 example with K = 2, comparing a configuration with total cost 20 against a candidate swap with total cost 26.]

1. Arbitrarily choose k objects as the initial medoids
2. Assign each remaining object to the nearest medoid
3. Randomly select a non-medoid object, O_random
4. Compute the total cost of swapping a medoid O with O_random
5. If the quality is improved, perform the swap
6. Repeat (the do loop) until no change
The K-Medoid Clustering Method

◘ K-Medoids Clustering: find representative objects (medoids) in clusters
– PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well to large data sets (due to its computational complexity)
◘ Efficiency improvement on PAM
– CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
– CLARANS (Ng & Han, 1994): Randomized re-sampling
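For completeness, a sketch of running k-medoids in Python. This assumes the optional scikit-learn-extra package (not mentioned in the slides), whose KMedoids estimator implements a PAM-style algorithm; the toy data are made up.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

# Toy data: two small groups plus one extreme outlier
X = np.array([[1, 2], [2, 1], [1, 1], [8, 8], [9, 8], [8, 9], [50, 50]])

km = KMedoids(n_clusters=2, metric="euclidean", random_state=0).fit(X)
print(km.labels_)            # cluster assignment of each object
print(km.cluster_centers_)   # the medoids: actual objects from X, unlike k-means centroids
```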
How to Select the Optimum Number of Clusters (k)?

◘ Elbow method
– For each k, calculate the within-cluster sum of squares (WCSS):

    WCSS = Σ_{k=1}^{K} Σ_{d_i ∈ C_k} distance(d_i, c_k)²

  where c_k is the centroid of cluster C_k and d_i is a data point in that cluster.
– Plot the curve of WCSS against the number of clusters
– The location of the bend ("elbow") in the plot is generally considered an indicator of the approximate number of clusters
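A sketch of the elbow method with scikit-learn (illustrative, not the lecture's code): KMeans.inertia_ is the WCSS defined above, computed for each candidate k and plotted; the toy data are random.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.RandomState(0).randn(300, 2)  # toy data for illustration

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.show()   # look for the bend ("elbow") in the curve
```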
How to Select the Optimum Number of Clusters (k)?

◘ Silhouette method
– It is a measure of how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).

– S(i) = (b(i) − a(i)) / max(a(i), b(i)), where
• S(i) is the silhouette coefficient of the data point i,
• a(i) is the average distance between i and all other data points in the cluster to which i belongs,
• b(i) is the average distance from i to the points of the nearest cluster to which i does not belong.

– Calculate the average silhouette for every cluster:
• Avg_silhouette = (Σ_{i ∈ cluster} S(i)) / n, where n is the number of samples in this cluster
– Calculate the overall silhouette score:
• Overall_silhouette = (Σ_clusters Avg_silhouette) / c, where c is the number of clusters
– Select the k value with the highest overall silhouette score.
How to Select the Optimum Number of Clusters (k)?
◘ Silhouette method
– The value of the silhouette coefficient is between [-1, 1]
– A score of 1 denotes the best, meaning that the data point i is very
compact within the cluster to which it belongs and far away from the other
clusters
– The worst value is -1
– Values near 0 denote overlapping clusters.
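A sketch of choosing k by silhouette using scikit-learn's silhouette_score (illustrative, not the lecture's code). Note that silhouette_score averages S(i) over all samples directly, rather than cluster-by-cluster as described on the slide; the toy data are random.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).randn(300, 2)   # toy data for illustration

for k in range(2, 8):                        # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))    # pick the k with the highest score
```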
Hierarchical Clustering
◘ Create a hierarchical decomposition of the set of data using some criterion
◘ Strength: This method does not require the number of clusters k as an input.
◘ Weakness: it needs a termination condition.

[Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging a, b, c, d, e into ab, de, cde and finally abcde; divisive clustering (DIANA) runs the same hierarchy in reverse, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
◘ Introduced in Kaufmann and Rousseeuw (1990)
◘ Implemented in statistical packages, e.g., Splus
◘ Use the single-link method and the dissimilarity matrix
◘ Merge nodes that have the least dissimilarity
◘ Go on in a non-descending fashion
◘ Eventually all nodes belong to the same cluster

[Figure: three 10×10 scatter plots showing clusters being merged step by step.]
Dendrogram
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Hierarchical Clustering: AGNES

1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains


How the Clusters are Merged?

◘ Single Link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
◘ Complete Link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
◘ Average Link: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
◘ Centroid Link: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
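A sketch of agglomerative clustering with SciPy (illustrative, not the lecture's code). The method argument selects the merge criterion listed above: 'single', 'complete', 'average', or 'centroid'; the toy data are random.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.RandomState(0).randn(20, 2)        # toy data for illustration

Z = linkage(X, method="single")                  # AGNES with the single-link criterion
dendrogram(Z)                                    # draw the dendrogram
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
print(labels)
```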
How the Clusters are Merged?

[Figure: dendrograms for the same six points (1-6) under Single Link, Complete Link, and Average Link; the merge order and merge heights differ with the chosen linkage.]
DIANA (Divisive Analysis)

◘ Introduced in Kaufmann and Rousseeuw (1990)


◘ Implemented in statistical analysis packages, e.g., Splus
◘ Inverse order of AGNES
◘ Eventually each node forms a cluster on its own

[Figure: three 10×10 scatter plots showing one cluster being split step by step.]
DIANA (Divisive Analysis)

◘ Choosing which cluster to split
– Check the sum of squared errors of the clusters and choose the one with the largest value

◘ Splitting criterion: determining how to split
– One may use Ward's criterion to chase for greater reduction in the difference in the SSE criterion as a result of a split
– For categorical data, the Gini index can be used

◘ Handling noise
– Use a threshold to determine the termination criterion (do not generate clusters that are too small because they contain mainly noise)
Hierarchical Clustering: DIANA
• Step 1: Initially Cl = {a, b, c, d, e}
• Step 2: Ci = Cl and Cj = Φ
• Step 3: Initial iteration
• Calculate the average dissimilarity of each object in Ci to the other objects in Ci

  Average dissimilarity of a:
    a = (1/4) × (d(a,b) + d(a,c) + d(a,d) + d(a,e)) = (1/4) × (9 + 3 + 6 + 11) = 7.25

  Similarly, we have: b = 7.75, c = 5.25, d = 7.00, e = 7.75

• The highest average dissimilarity is 7.75 and there are two corresponding objects (b and e); arbitrarily choose one of them, say b.
• Move b to Cj
• The updated cluster elements are: Ci = {a, c, d, e} and Cj = {b}
Hierarchical Clustering: DIANA
• Step 4: Remaining iterations
• (i) 2nd iteration
  Calculate the average dissimilarity for each object again:

    Da = (1/3) × (d(a,c) + d(a,d) + d(a,e)) − d(a,b) = 6.67 − 9 = −2.33

    Dc = (1/3) × (d(c,a) + d(c,d) + d(c,e)) − d(c,b) = 4.67 − 7 = −2.33

    Dd = 0.67, De = 0

  Dd is the largest and Dd > 0, so move d to Cj.
  The updated cluster elements are: Ci = {a, c, e} and Cj = {b, d}
Hierarchical Clustering: DIANA
• (ii) 3rd iteration
  Calculate the average dissimilarity for each object again:

    Da = (1/2) × (d(a,c) + d(a,e)) − (1/2) × (d(a,b) + d(a,d)) = −0.5

    Dc = (1/2) × (d(c,a) + d(c,e)) − (1/2) × (d(c,b) + d(c,d)) = −13.5

    De = −2.5

  All are negative, so we stop and form the clusters Ci and Cj.
Hierarchical Clustering: DIANA
• Step5:
• To divide Ci and Cj, the cluster diameters should be computed:

    diameter(Ci) = max{d(a,c), d(a,e), d(c,e)} = max{3, 11, 2} = 11
    diameter(Cj) = max{d(b,d)} = 5

• The cluster with the largest diameter is selected for further partitioning.
• So now split Ci and repeat the process by taking Cl = {a, c, e}, until each cluster contains only one object.
Density-Based Clustering

◘ Dense objects should be grouped together into one cluster.
◘ They use a fixed threshold value to determine dense regions (e.g., MinPts = 3).

◘ Density-based clustering algorithms
– DBSCAN (Ester et al., 1996)
– DENCLUE (Hinneburg & Keim, 1998)
– OPTICS (Ankerst et al., 1999)

◘ Two parameters:
– Eps: maximum radius of the neighbourhood
– MinPts: minimum number of points in an Eps-neighbourhood of that point

[Figure: point q in the Eps-neighbourhood of point p, with MinPts = 5 and Eps = 1 cm.]
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise

◘ Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
◘ Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.]
Density-Reachable and Density-Connected

◘ Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi

◘ Density-connected:
– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: The Algorithm
 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable from p and
DBSCAN visits the next point of the database
 Continue the process until all of the points have been processed
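A sketch of DBSCAN with scikit-learn (illustrative, not the lecture's code). The eps and min_samples arguments correspond to the Eps and MinPts parameters above; the toy data are made up, and the label −1 marks noise points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) * 0.3,            # dense blob
               rng.randn(100, 2) * 0.3 + [3, 3],   # second dense blob
               rng.uniform(-2, 5, size=(10, 2))])  # scattered noise

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; label -1 marks noise (outlier) points
```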
DBSCAN: Advantages
◘ Resistant to noise
◘ Can handle clusters of different shapes and sizes
DBSCAN: Disadvantages
• It does not work well with varying densities or high-dimensional data
• Sensitive to parameters
Grid Based Clustering Methods

◘ The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains
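A minimal sketch of this grid idea with NumPy (illustrative, not the lecture's code); the bin count and the density threshold of 15 points per cell are arbitrary choices for the toy data.

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + 4])   # toy 2-D data

# Divide the region into a 10x10 grid of equal rectangular cells and count points per cell
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=10)
dense_cells = counts >= 15        # cells whose point count exceeds a chosen threshold
print(dense_cells.sum(), "dense cells out of", counts.size)
```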
Model Based Methods
Attempt to optimize the fit between the given data and some mathematical model.
They use statistical functions.
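The slides list COBWEB, CLASSIT and SOM as model-based methods. As a quick illustration of fitting a statistical model to the data, here is a sketch using a Gaussian mixture model with scikit-learn (a common model-based approach, though not one named on the slide); the toy data are made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])   # toy 2-D data

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)        # cluster assignment from the fitted statistical model
print(gm.bic(X))              # model-selection score (lower BIC = better fit)
```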
Clustering Algorithms: General Overview
Factors Affecting Clustering Results
◘ Outliers
◘ Inappropriate values for parameters
◘ Drawbacks of the clustering algorithms themselves

[Figure: an input dataset clustered well with parameter k = 6 and badly with k = 20.]
Cluster Validation - SSE

◘ Clustering
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.

  Intracluster distances are minimized; intercluster distances are maximized.

◘ Sum of Squared Error (SSE): SSE = Σ_k Σ_{d_i ∈ C_k} distance(d_i, c_k)², the same quantity as the WCSS defined earlier.

[Figure: clustering in 3-D space.]

Clustering Algorithm Selection
1. Scalability
– Efficient execution on large databases
– Scanning the database only a few times
2. Running on different data types
– Continuous, discrete, binary, nominal, ordinal, …
3. Updateability
– Updating clusters after insertion and deletion of some data values
4. Efficient memory usage
5. Input parameters
– Results in different outputs on different inputs
– Hard-to-understand or too many input parameters
6. Without any foreknowledge
7. Different cluster shapes
8. Workable on dirty data
– Workable on missing, wrong and noisy data
9. Insensitivity on data ordering
10. Multi-dimensionality
– Workability on multi-dimensional datasets
11. Usability for different areas
