Presentation On Clustering Algorithms
Clustering Algorithms
Presented to:
Presented by:
A.K.M Mahfuzur Rahman
Roll: 231020104
Reg: 7724
Session: MS:2022-2023
Dept. of Computer Science and Engineering
Jatiya Kabi Kazi Nazrul Islam University
Overview
BIRCH
CURE
CHAMELEON
1. BIRCH Agendas:
What is BIRCH?
Data Clustering
BIRCH Goals
Examples 1, 2, 3, 4
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
What is BIRCH?
BIRCH is based on the notion of a Clustering Feature (CF) tree. A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
Data Clustering
It is the partitioning of a database into clusters: closely packed groups, i.e., collections of data objects that are similar to one another and treated collectively as a group.
BIRCH Goals
• Treat dense areas as one and reduce noise.
• Make clustering decisions without scanning the whole data.
• Minimize running time and data scans.
• Exploit the non-uniformity of the data.
In the CF tree, each cluster of data points is represented by a Clustering Feature (CF), a triple of three numbers (N, LS, SS).
Algorithm
Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk.
Phase 2: Condense the tree to a desirable size by building a smaller CF tree.
Phase 3: Global clustering.
Phase 4: Cluster refining. This step is optional and requires more passes over the data to refine the results.
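Not part of the original slides, but a minimal sketch of running these phases end to end with scikit-learn's Birch estimator (assumed available). The data points are taken from Example-4 later in the deck; n_clusters=2 is an illustrative choice, and note that scikit-learn's threshold bounds the radius of a leaf sub-cluster rather than its diameter.

```python
# Hedged sketch: BIRCH via scikit-learn (assumes scikit-learn is installed).
import numpy as np
from sklearn.cluster import Birch

X = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8],
              [6, 2], [7, 2], [7, 4], [8, 4], [7, 9]])  # dataset from Example-4

# threshold caps the radius of each leaf sub-cluster, branching_factor caps the
# CF entries per node, n_clusters drives the Phase-3 global clustering.
model = Birch(threshold=1.5, branching_factor=2, n_clusters=2)
labels = model.fit_predict(X)          # Phases 1-3 run internally
print(labels)                          # cluster label per point
print(model.subcluster_centers_)       # centroids of the leaf CF entries
```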
LS = Σ_{i=1..n} x_i (the linear sum) and SS = Σ_{i=1..n} x_i² (the squared sum).
Example-2
Another example with 2-D objects. Given that C2 = {(1,1), (2,1), (3,2)},
CF(C2) = (3, (6,4), (14,6)),
where n = 3,
LS = (1+2+3, 1+1+2) = (6,4),
SS = (1²+2²+3², 1²+1²+2²) = (14,6).
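Not on the original slide, but the CF triple of Example-2 can be reproduced with a few lines of Python (numpy assumed available):

```python
# Minimal sketch: compute a Clustering Feature <N, LS, SS> for a set of 2-D points.
import numpy as np

def clustering_feature(points):
    """Return (N, LS, SS) for a list of d-dimensional points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

print(clustering_feature([(1, 1), (2, 1), (3, 2)]))
# (3, array([6., 4.]), array([14., 6.]))  -> matches CF(C2) = (3, (6,4), (14,6))
```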
Another important property of CFs is that they are additive. That is, for two disjoint clusters C1 and C2 with CFs CF1 = (n1, LS1, SS1) and CF2 = (n2, LS2, SS2) respectively, the CF of the cluster formed by merging C1 and C2 is given as
CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2)
Example-3
C1 = {(2,5), (3,2), (4,3)} and C2 = {(1,1), (2,1), (3,1)}
Then,
CF1 = (3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)) = (3, (9,10), (29,38))
and CF2 = (3, (1+2+3, 1+1+1), (1²+2²+3², 1²+1²+1²)) = (3, (6,3), (14,3))
Now, if C3 = C1 ∪ C2, then
CF3 = CF1 + CF2 = (6, (15,13), (43,41))
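As a quick check of the additivity property, a short sketch (not from the slides; the CF triples are copied from Example-3):

```python
# Merging two disjoint clusters by adding their CF triples component-wise.
import numpy as np

def merge_cf(cf1, cf2):
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

cf1 = (3, np.array([9, 10]), np.array([29, 38]))  # CF of C1 = {(2,5),(3,2),(4,3)}
cf2 = (3, np.array([6, 3]),  np.array([14, 3]))   # CF of C2 = {(1,1),(2,1),(3,1)}
print(merge_cf(cf1, cf2))   # (6, array([15, 13]), array([43, 41])) = CF3
```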
Formula
Cluster's Centroid: X0 = LS / n
Cluster's Radius: R = sqrt( Σ_{i=1..n} (x_i − X0)² / n ) = sqrt( SS/n − (LS/n)² )
Cluster's Diameter: D = sqrt( Σ_{i=1..n} Σ_{j=1..n} (x_i − x_j)² / (n(n−1)) ) = sqrt( (2n·SS − 2·LS²) / (n(n−1)) )
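These statistics can be read directly off a CF triple. A small sketch (per-dimension radius, as used in the worked Example-4 below; the input CF is that of {(3,4), (2,6)}):

```python
# Centroid and radius of a sub-cluster derived from its CF triple <N, LS, SS>.
import numpy as np

def centroid(n, ls, ss):
    return ls / n

def radius(n, ls, ss):
    # R = sqrt(SS/n - (LS/n)^2), evaluated per dimension
    return np.sqrt(ss / n - (ls / n) ** 2)

n, ls, ss = 2, np.array([5.0, 10.0]), np.array([13.0, 52.0])  # CF of {(3,4), (2,6)}
print(centroid(n, ls, ss))   # [2.5 5. ]
print(radius(n, ls, ss))     # [0.5 1. ]
```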
Example-4
Apply BIRCH to cluster the given dataset D = {(3,4), (2,6), (4,5), (4,7), (3,8), (6,2), (7,2), (7,4), (8,4), (7,9)}.
The branching factor B = 2, the maximum number of sub-clusters at each leaf node L = 2, and the threshold on the diameter of sub-clusters stored in the leaf nodes is 1.5.
• For each data point, find the Radius and CF.
• Consider data point x1 = (3,4).
• It is alone in the feature space. So,
• Radius = 0
• Cluster Feature CF1<N, LS, SS> = <1, (3,4), (9,16)>
• Create the leaf node with data point x1 = (3,4) and branch as CF1.
CF1<1, (3,4), (9,16)>
Leaf
x1 = (3,4)
For each data point, find the Radius and CF.
Consider data point x2 = (2,6):
1. Linear Sum, LS = (3,4) + (2,6) = (5,10)
2. Squared Sum, SS = (2²+9, 6²+16) = (13,52); N = 2
• Radius = sqrt( SS/N − (LS/N)² ) = sqrt( (13,52)/2 − ((5,10)/2)² ) = (0.5, 1)
• R = (0.5, 1) < (T, T) --> True
• So, x2 = (2,6) will be clustered into the leaf with x1 = (3,4).
3. Cluster Feature CF1<N, LS, SS> = <2, (5,10), (13,52)>
CF1<2, (5,10), (13,52)>
Leaf
x1 = (3,4)
x2 = (2,6)
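A small illustrative sketch (not from the slides) of the decision just made: tentatively update the CF and absorb the point only if the new radius stays below the threshold in every dimension. The per-dimension radius and the threshold 1.5 are taken from this example; the helper name try_absorb is hypothetical.

```python
# Leaf-insertion test: absorb a point into a CF only if the radius stays under T.
import numpy as np

def try_absorb(cf, point, threshold):
    n, ls, ss = cf
    p = np.asarray(point, dtype=float)
    n2, ls2, ss2 = n + 1, ls + p, ss + p ** 2       # CF after tentatively adding the point
    r = np.sqrt(ss2 / n2 - (ls2 / n2) ** 2)          # per-dimension radius
    if np.all(r < threshold):
        return (n2, ls2, ss2), True                  # point absorbed into this leaf
    return (n, ls, ss), False                        # point must go to another (possibly new) leaf

cf1 = (1, np.array([3.0, 4.0]), np.array([9.0, 16.0]))  # leaf holding x1 = (3,4)
print(try_absorb(cf1, (2, 6), 1.5))  # ((2, [5. 10.], [13. 52.]), True) -> radius (0.5, 1)
```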
For each data point, find the Radius and CF.
Consider data point x3 = (4,5):
1. Linear Sum, LS = (5,10) + (4,5) = (9,15)
2. Squared Sum, SS = (4²+13, 5²+52) = (29,77); N = 3
• Radius = sqrt( SS/N − (LS/N)² ) = sqrt( (29,77)/3 − ((9,15)/3)² ) ≈ (0.82, 0.82)
• R ≈ (0.82, 0.82) < (T, T) --> True
• So, x3 = (4,5) will be clustered into the same leaf as x1 and x2.
3. Cluster Feature CF1<N, LS, SS> = <3, (9,15), (29,77)>
CF1<3, (9,15), (29,77)>
Leaf
x1 = (3,4)
x2 = (2,6)
x3 = (4,5)
Similarly, x4 = (4,7) and x5 = (3,8) are absorbed into the same leaf:
CF1<5, (16,30), (54,190)>
Leaf
x1 = (3,4)
x2 = (2,6)
x3 = (4,5)
x4 = (4,7)
x5 = (3,8)
Consider data point x6 = (6,2):
1. Linear Sum, LS = (16,30) + (6,2) = (22,32)
2. Squared Sum, SS = (6²+54, 2²+190) = (90,194); N = 6
• Radius = sqrt( SS/N − (LS/N)² ) = sqrt( (90,194)/6 − ((22,32)/6)² ) = (1.24, 1.97)
• R = (1.24, 1.97) < (T, T) --> False, so x6 starts a new leaf with its own CF.
CF1<5, (16,30), (54,190)>      CF2<1, (6,2), (36,4)>
Leaf                           Leaf
x1 = (3,4)                     x6 = (6,2)
x2 = (2,6)
x3 = (4,5)
x4 = (4,7)
x5 = (3,8)
For data point x7 = (7,2), two branches B1 for CF1 and B2 for CF2 exist. Find which of CF1 or CF2 is closest to x7, then find the Radius.
Centroid of CF1 = LS/N = (16,30)/5 = (3.2, 6);  Centroid of CF2 = LS/N = (6,2)/1 = (6, 2) -> closest to x7
So CF2 will be considered.
1. Linear Sum, LS = (6,2) + (7,2) = (13,4)
2. Squared Sum, SS = (7²+36, 2²+4) = (85,8); N = 2
• Radius = sqrt( SS/N − (LS/N)² ) = sqrt( (85,8)/2 − ((13,4)/2)² ) = (0.5, 0)
CF1<5, (16,30), (54,190)>      CF2<2, (13,4), (85,8)>
Leaf                           Leaf
x1 = (3,4)                     x6 = (6,2)
x2 = (2,6)                     x7 = (7,2)
x3 = (4,5)
x4 = (4,7)
x5 = (3,8)
Similarly, x8 = (7,4) and x9 = (8,4) join the second leaf:
CF1<5, (16,30), (54,190)>      CF2<4, (28,12), (198,40)>
Leaf                           Leaf
x1 = (3,4)                     x6 = (6,2)
x2 = (2,6)                     x7 = (7,2)
x3 = (4,5)                     x8 = (7,4)
x4 = (4,7)                     x9 = (8,4)
x5 = (3,8)
For data point x10 = (7,9), two branches B1 for CF1 and B2 for CF2 exist. Find which of CF1 or CF2 is closest to x10, then find the Radius.
Centroid of CF1 = LS/N = (16,30)/5 = (3.2, 6);  Centroid of CF2 = LS/N = (28,12)/4 = (7, 3)
As the branching factor is 2, we cannot create another branch, so we have to create another parent node.
CF12<9, (44, 42), (252, 230)> CF3<1, (7, 9), (49, 81)>
CF1<5, (16, 30), (54, 190)> CF2<4, (28,12), (198, 40)> CF3<1, (7, 9), (49, 81)>
2. CURE Agendas:
What is CURE?
CURE Structure
Algorithm
Example
Outliers
Clusters
CURE: Clustering Using Representatives
It is a hierarchical clustering technique that adopts a middle ground between the centroid-based and the all-point extremes.
Structure
Data → Draw Random Sample → Partition Sample → Partially Cluster Partitions → Elimination of Outliers → Cluster Partial Clusters → Label Data on Disk
Phase 1: Begin with a large dataset D consisting of n data
points.
Phase 2: Randomly select a sample of c points from the dataset D, where c << n. The sample should be representative of the entire dataset.
Phase 3: Use a hierarchical clustering method (e.g., single-link,
complete-link, or average-link) on the sample to form an initial
set of clusters. This is typically done until a desired number of
clusters k is reached.
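A rough sketch of these phases plus CURE's representative-point step (not spelled out on the slides): scikit-learn's AgglomerativeClustering stands in for the hierarchical step, the shrinking factor alpha = 0.2 and the choice of 4 representatives per cluster are illustrative assumptions, and picking the points farthest from the centroid is a simplification of CURE's iterative scattered-point selection.

```python
# Hedged sketch of CURE's sampling, hierarchical clustering, and representative shrinking.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
D = rng.random((1000, 2))                                  # Phase 1: dataset with n points
sample = D[rng.choice(len(D), size=100, replace=False)]    # Phase 2: draw c << n points

# Phase 3: hierarchical clustering of the sample down to k clusters
labels = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(sample)

alpha = 0.2                                                # illustrative shrinking factor
for k in range(3):
    pts = sample[labels == k]
    center = pts.mean(axis=0)
    far = pts[np.argsort(np.linalg.norm(pts - center, axis=1))[-4:]]  # 4 scattered points
    reps = far + alpha * (center - far)                    # shrink representatives toward the centroid
    print(k, reps.round(2))
```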
Example
Advantages
CURE is designed to efficiently process large datasets.
CURE is relatively straightforward to implement.
Although subsampling helps reduce complexity, the
initial phase of clustering a large sample can still be
computationally intensive, especially for very large
datasets.
3. CHAMELEON Agendas:
What is CHAMELEON?
Framework of CHAMELEON
Phases of CHAMELEON
CHAMELEON Advantages
Chameleon is a hierarchical clustering algorithm
that uses dynamic modeling to decide the
similarity among pairs of clusters.
Framework
Data Set → Construct k-NN Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
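A minimal sketch of the first framework step, building the k-nearest-neighbour sparse graph that CHAMELEON then partitions (assumes scikit-learn is available; the toy data and k = 5 are illustrative):

```python
# Build a k-NN sparse graph from a toy 2-D dataset.
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.default_rng(1).random((50, 2))                       # toy data set
knn_graph = kneighbors_graph(X, n_neighbors=5, mode='distance')    # sparse adjacency matrix
print(knn_graph.shape, knn_graph.nnz)                              # (50, 50), ~250 stored edges
```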
A Two-phase Clustering Algorithm.
Phase 1: Partitioning the graph into sub-clusters using a multilevel graph partitioning algorithm.
Fig: An example of the bisections produced by multilevel graph partitioning algorithms on two spatial data sets (panels (a) and (b); figure omitted).
Phase 2: Merging Sub-Clusters using a Dynamic Framework. It employs an agglomerative hierarchical clustering method that repeatedly merges the generated sub-clusters to find the real clusters.
Ci and Cj are two clusters
Relative Closeness: the absolute closeness normalized with respect to the internal closeness of the two clusters.
Internal Closeness
The internal closeness of a cluster is obtained by averaging the weights of the edges in the cluster.
Using these two measures, the clusters are merged as follows.
Merging the Clusters
If the relative inter-connectivity measure and the relative closeness measure are the same, choose inter-connectivity.
One can also use the condition: RI(Ci, Cj) ≥ T(RI) and RC(Ci, Cj) ≥ T(RC)
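For reference, the two measures are commonly defined as follows (these definitions come from the original CHAMELEON paper, not from the slide itself):
RI(Ci, Cj) = |EC{Ci,Cj}| / ( ( |EC(Ci)| + |EC(Cj)| ) / 2 )
RC(Ci, Cj) = S̄(EC{Ci,Cj}) / ( (|Ci| / (|Ci|+|Cj|)) · S̄(EC(Ci)) + (|Cj| / (|Ci|+|Cj|)) · S̄(EC(Cj)) )
where EC{Ci,Cj} is the set of edges connecting Ci and Cj, EC(Ci) is the edge cut of the min-cut bisector of Ci, and S̄ denotes the average edge weight.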
CHAMELEON Advantages: dynamic modeling allows it to adapt to the natural shapes and densities of clusters in the data.
Any Questions?
Thank You