
Presentation on

Clustering Algorithms

Presented to:

Dr. Tushar Kanti Saha


Professor
Dept. of Computer Science and Engineering
Jatiya Kabi Kazi Nazrul Islam University

Presented by:
A.K.M Mahfuzur Rahman
Roll: 231020104
Reg: 7724
Session: MS:2022-2023
Dept. of Computer Science and Engineering
Jatiya Kabi Kazi Nazrul Islam University

Overview

BIRCH

CURE

CHAMELEON

1. BIRCH

BIRCH Agenda:
 What is BIRCH?
 Data Clustering
 BIRCH Goal
 Examples 1, 2, 3, 4

BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
What is BIRCH?
 BIRCH is based on the notion of a Clustering Feature (CF) Tree.
 A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.

Data Clustering
 Partitioning of a database into clusters.
 A cluster is a closely packed group.
 A collection of data objects that are similar to one another and treated collectively as a group.

BIRCH Goals
 Make clustering decisions without scanning the whole data.
 Treat dense areas as a single cluster and reduce noise.
 Minimize running time and data scans.
 Exploit the non-uniformity of the data.
Each sub-cluster of data points in the tree is summarized by a Clustering Feature (CF), a triple of three numbers (N, LS, SS):
 N = number of items in the sub-cluster
 LS = linear (vector) sum of the data points
 SS = sum of the squared data points

Algorithm
Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk.
Phase 2: Condense the tree to a desirable size by building a smaller CF tree.
Phase 3: Global clustering.
Phase 4: Cluster refining - this step is optional and requires more passes over the data to refine the results.

$$LS=\sum_{i=1}^{n} x_i \qquad \text{and} \qquad SS=\sum_{i=1}^{n} x_i^2$$

Example-1
Consider a cluster C1 = {3, 5, 2, 8, 9, 1}. Then
CF(C1) = (6, 28, 184)
where n = 6,
LS = 3+5+2+8+9+1 = 28 and
SS = 3²+5²+2²+8²+9²+1² = 184

Example-2
Another example with 2-D objects. Given C2 = {(1,1), (2,1), (3,2)}:
CF(C2) = (3, (6,4), (14,6))
where n = 3,
LS = (1+2+3, 1+1+2) = (6, 4)
SS = (1²+2²+3², 1²+1²+2²) = (14, 6)

Example-3
Another important property of CFs is that they are additive. Consider two disjoint clusters C1 and C2 with CFs CF1 = (n1, LS1, SS1) and CF2 = (n2, LS2, SS2). The CF of the cluster formed by merging C1 and C2 is
CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2)

Let C1 = {(2,5), (3,2), (4,3)} and C2 = {(1,1), (2,1), (3,1)}. Then
CF1 = (3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)) = (3, (9,10), (29,38))
CF2 = (3, (1+2+3, 1+1+1), (1²+2²+3², 1²+1²+1²)) = (3, (6,3), (14,3))
Now, if C3 = C1 ∪ C2, then
CF3 = CF1 + CF2 = (6, (15,13), (43,41))
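CF computation and additivity can be checked with a short script. Below is a minimal Python/NumPy sketch written for these slides (the names clustering_feature and merge_cf are illustrative, not part of any BIRCH library); it reproduces the numbers of Example-3.

```python
import numpy as np

def clustering_feature(points):
    """Compute the CF triple (N, LS, SS) for a list of d-dimensional points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf_a, cf_b):
    """CFs are additive: merging two disjoint clusters just adds the triples."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

cf1 = clustering_feature([(2, 5), (3, 2), (4, 3)])   # (3, [9, 10], [29, 38])
cf2 = clustering_feature([(1, 1), (2, 1), (3, 1)])   # (3, [6, 3], [14, 3])
print(merge_cf(cf1, cf2))                            # (6, [15, 13], [43, 41])
```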

Formulas

Cluster's centroid:
$$X_0 = \frac{LS}{n}$$

Cluster's radius:
$$R = \sqrt{\frac{\sum_{i=1}^{n}(x_i - X_0)^2}{n}} = \sqrt{\frac{SS}{n} - \left(\frac{LS}{n}\right)^2}$$

Cluster's diameter:
$$D = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_j)^2}{n(n-1)}} = \sqrt{\frac{2n\,SS - 2\,LS^2}{n(n-1)}}$$
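Because centroid, radius, and diameter depend only on the CF triple, they can be computed without revisiting the raw points. A minimal Python sketch of these formulas (the function names are illustrative); the printed values match the x1, x2 step of Example-4 below.

```python
import numpy as np

def centroid(n, ls, ss):
    return ls / n

def radius(n, ls, ss):
    # R = sqrt(SS/n - (LS/n)^2), evaluated per dimension here
    return np.sqrt(ss / n - (ls / n) ** 2)

def diameter(n, ls, ss):
    # D = sqrt((2*n*SS - 2*LS^2) / (n*(n-1))), defined for n >= 2
    return np.sqrt((2 * n * ss - 2 * ls ** 2) / (n * (n - 1)))

n, ls, ss = 2, np.array([5.0, 10.0]), np.array([13.0, 52.0])   # x1=(3,4) and x2=(2,6)
print(centroid(n, ls, ss), radius(n, ls, ss))                  # [2.5 5.] [0.5 1.]
```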

Example-4
Apply BIRCH to cluster the dataset D = {(3,4), (2,6), (4,5), (4,7), (3,8), (6,2), (7,2), (7,4), (8,4), (7,9)}.
The branching factor is B = 2, the maximum number of sub-clusters at each leaf node is L = 2, and the threshold on the diameter of sub-clusters stored in the leaf nodes is T = 1.5.
• For each data point, find the radius and CF.
• Consider data point x1 = (3,4). It is alone in the feature space, so:
• Radius = 0
• Clustering Feature CF1<N, LS, SS> = <1, (3,4), (9,16)>
• Create the leaf node with data point x1 = (3,4) and branch CF1.

CF1<1, (3,4), (9,16)>
Leaf
x1 = (3,4)

For each data point, find the radius and CF.
 Consider data point x2 = (2,6):
1. Linear sum, LS = (3,4) + (2,6) = (5,10)
2. Squared sum, SS = (2²+9, 6²+16) = (13,52); N = 2
• Radius = sqrt(SS/N − (LS/N)²) = sqrt((13,52)/2 − ((5,10)/2)²) = (0.5, 1)
• R = (0.5, 1) < (T, T) → True
• So x2 = (2,6) is clustered with leaf x1 = (3,4).
3. Clustering Feature CF1<N, LS, SS> = <2, (5,10), (13,52)>

CF1<2, (5,10), (13,52)>
Leaf
x1 = (3,4)
x2 = (2,6)

For each data point, find the radius and CF.
 Consider data point x3 = (4,5):
1. Linear sum, LS = (5,10) + (4,5) = (9,15)
2. Squared sum, SS = (4²+13, 5²+52) = (29,77); N = 3
• Radius = sqrt(SS/N − (LS/N)²) = sqrt((29,77)/3 − ((9,15)/3)²) ≈ (0.82, 0.82)
• R ≈ (0.82, 0.82) < (T, T) → True
• So x3 = (4,5) is clustered with leaf x1 and x2.
3. Clustering Feature CF1<N, LS, SS> = <3, (9,15), (29,77)>

CF1<3, (9,15), (29,77)>
Leaf
x1 = (3,4)
x2 = (2,6)
x3 = (4,5)

Similarly, after inserting x4 = (4,7) and x5 = (3,8):

CF1<5, (16,30), (54,190)>
Leaf
x1 = (3,4)
x2 = (2,6)
x3 = (4,5)
x4 = (4,7)
x5 = (3,8)

 Consider data point x6 = (6,2):
1. Linear sum, LS = (16,30) + (6,2) = (22,32)
2. Squared sum, SS = (6²+54, 2²+190) = (90,194); N = 6
• Radius = sqrt(SS/N − (LS/N)²) = sqrt((90,194)/6 − ((22,32)/6)²) ≈ (1.24, 1.97)
• R ≈ (1.24, 1.97) < (T, T) → False
• So x6 = (6,2) is clustered in a different branch.
3. Clustering Feature CF2<N, LS, SS> = <1, (6,2), (36,4)>

CF1<5, (16,30), (54,190)>          CF2<1, (6,2), (36,4)>
Leaf                               Leaf
x1 = (3,4)                         x6 = (6,2)
x2 = (2,6)
x3 = (4,5)
x4 = (4,7)
x5 = (3,8)
 For data point x7 = (7,2), two branches exist: B1 for CF1 and B2 for CF2. Find which centroid x7 is closest to, then find the radius.
Centroid of CF1 = LS/N = (16,30)/5 = (3.2, 6); centroid of CF2 = LS/N = (6,2)/1 = (6,2), which is closer to x7, so CF2 is considered.
1. Linear sum, LS = (6,2) + (7,2) = (13,4)
2. Squared sum, SS = (7²+36, 2²+4) = (85,8); N = 2
• Radius = sqrt(SS/N − (LS/N)²) = sqrt((85,8)/2 − ((13,4)/2)²) = (0.5, 0)
• R = (0.5, 0) < (T, T) → True
• So x7 = (7,2) is clustered with x6.
3. Clustering Feature CF2<N, LS, SS> = <2, (13,4), (85,8)>

CF1<5, (16,30), (54,190)>          CF2<2, (13,4), (85,8)>
Leaf                               Leaf
x1 = (3,4)                         x6 = (6,2)
x2 = (2,6)                         x7 = (7,2)
x3 = (4,5)
x4 = (4,7)
x5 = (3,8)
Similarly, after inserting x8 = (7,4) and x9 = (8,4):

CF1<5, (16,30), (54,190)>          CF2<4, (28,12), (198,40)>
Leaf                               Leaf
x1 = (3,4)                         x6 = (6,2)
x2 = (2,6)                         x7 = (7,2)
x3 = (4,5)                         x8 = (7,4)
x4 = (4,7)                         x9 = (8,4)
x5 = (3,8)
 For data point x10 = (7,9), two branches exist: B1 for CF1 and B2 for CF2. Find which centroid x10 is closest to, then find the radius.
Centroid of CF1 = LS/N = (16,30)/5 = (3.2, 6); centroid of CF2 = LS/N = (28,12)/4 = (7, 3).
CF1 is closer, so CF1 is considered.
1. Linear sum, LS = (16,30) + (7,9) = (23,39)
2. Squared sum, SS = (7²+54, 9²+190) = (103,271); N = 6
• Radius = sqrt(SS/N − (LS/N)²) = sqrt((103,271)/6 − ((23,39)/6)²) ≈ (1.57, 1.71)
• R ≈ (1.57, 1.71) < (T, T) → False, and the leaf under CF1 already holds 5 points.
• So x10 = (7,9) cannot be clustered with CF1.

Since the branching factor is B = 2, the root cannot take another branch, so a new parent level must be created.

CF12<9, (44,42), (252,230)>                            CF3<1, (7,9), (49,81)>

CF1<5, (16,30), (54,190)>   CF2<4, (28,12), (198,40)>  CF3<1, (7,9), (49,81)>

Leaf                        Leaf                       Leaf
x1 = (3,4)                  x6 = (6,2)                 x10 = (7,9)
x2 = (2,6)                  x7 = (7,2)
x3 = (4,5)                  x8 = (7,4)
x4 = (4,7)                  x9 = (8,4)
x5 = (3,8)
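The per-point decisions in Example-4 reduce to one test: tentatively absorb the point into the closest sub-cluster's CF and accept only if the radius stays under the threshold. The following is a minimal Python sketch of that test, assuming the per-dimension radius comparison used on these slides (the original BIRCH paper uses a scalar radius or diameter instead); try_absorb is an illustrative name.

```python
import numpy as np

THRESHOLD = 1.5   # maximum per-dimension radius of a leaf sub-cluster, as in Example-4

def try_absorb(cf, point, threshold=THRESHOLD):
    """Tentatively add `point` to sub-cluster `cf` = (N, LS, SS).

    Returns the updated CF if the per-dimension radius stays below the
    threshold (the test applied slide by slide above), otherwise None,
    meaning the point must go to another sub-cluster or branch.
    """
    x = np.asarray(point, dtype=float)
    n, ls, ss = cf[0] + 1, cf[1] + x, cf[2] + x ** 2
    r = np.sqrt(ss / n - (ls / n) ** 2)
    return (n, ls, ss) if np.all(r < threshold) else None

cf1 = (1, np.array([3.0, 4.0]), np.array([9.0, 16.0]))           # leaf holding x1=(3,4)
print(try_absorb(cf1, (2, 6)))                                    # x2 fits: (2, [5. 10.], [13. 52.])
cf1_full = (5, np.array([16.0, 30.0]), np.array([54.0, 190.0]))   # leaf holding x1..x5
print(try_absorb(cf1_full, (6, 2)))                               # None: x6 starts a new branch
```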

2. CURE

CURE Agenda:
 What is CURE?
 Structure
 Algorithm
 Example

Fig: sample data with clusters and outliers marked.

CURE stands for Clustering Using Representatives.

What is CURE?
 CURE is a hierarchical clustering technique that adopts a middle ground between the centroid-based and the all-points extremes.
 It is useful for discovering groups and identifying interesting distributions in the underlying data.
 Instead of using a single centroid, as most data mining algorithms do, CURE uses a set of well-scattered representative points to handle clusters efficiently and to eliminate outliers.

Structure

Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data on disk → Clusters
Algorithm
Phase 1: Begin with a large dataset D consisting of n data points.
Phase 2: Randomly select a sample of c points from D, where c << n. The sample should be representative of the entire dataset.
Phase 3: Use a hierarchical clustering method (e.g., single-link, complete-link, or average-link) on the sample to form an initial set of clusters, typically until a desired number of clusters k is reached.
Phase 4: For each cluster obtained, select a fixed number r of representative points, chosen to be as far apart as possible so that they capture the shape and extent of the cluster.
Phase 5: For each cluster, move the representative points towards the mean of the cluster by a fraction α. This step reduces the influence of outliers (Phases 4-5 are sketched in code below).
Phase 6: Repeat the merging process for the remaining clusters until the desired number of clusters is achieved.
Phase 7: Assign the remaining non-sampled points in D to the nearest cluster using the representative points.
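Phases 4 and 5 (choosing well-scattered representatives and shrinking them towards the centroid by α) can be sketched as follows. This is an illustrative Python/NumPy sketch, not a full CURE implementation; the name cure_representatives and the defaults r = 4 and alpha = 0.5 are assumptions made for the example.

```python
import numpy as np

def cure_representatives(points, r=4, alpha=0.5):
    """Pick up to r well-scattered points of a cluster and shrink them
    towards the centroid by a fraction alpha (Phases 4-5 above, sketched)."""
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    # Start with the point farthest from the centroid, then greedily add the
    # point farthest from the representatives chosen so far.
    reps = [pts[np.argmax(np.linalg.norm(pts - mean, axis=1))]]
    while len(reps) < min(r, len(pts)):
        dist = np.min([np.linalg.norm(pts - rep, axis=1) for rep in reps], axis=0)
        reps.append(pts[np.argmax(dist)])
    # Shrink each representative towards the mean to damp the effect of outliers.
    return [rep + alpha * (mean - rep) for rep in reps]

print(cure_representatives([(0, 0), (0, 2), (4, 0), (4, 2), (2, 1)], r=3))
```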

Example

Advantages

 CURE is designed to process large datasets efficiently.
 CURE reduces computational complexity without significantly compromising the quality of the clustering.
 CURE is relatively straightforward to implement.
 Flexibility in cluster shapes.
Disadvantages

 Although subsampling helps reduce complexity, the initial phase of clustering a large sample can still be computationally intensive, especially for very large datasets.
 Too few representative points may not capture cluster shape accurately, while too many points increase the computational cost.
 Although CURE is designed to handle large datasets, extremely large-scale applications might still face scalability issues.

3. CHAMELEON

CHAMELEON Agenda:
 What is CHAMELEON?
 Framework of CHAMELEON
 Phases of CHAMELEON
 Advantages

What is CHAMELEON?
 CHAMELEON is a hierarchical clustering algorithm that uses dynamic modeling to decide the similarity between pairs of clusters.
 It was developed in response to observed weaknesses of two earlier hierarchical clustering algorithms, ROCK and CURE.
 In CHAMELEON, cluster similarity is assessed by how well-connected objects are within a cluster and by the proximity of clusters.

Framework

Data Set → Construct a k-NN sparse graph → Partition the graph → Final clusters

Fig: Overall framework of CHAMELEON.
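The first step of the framework, constructing the k-NN sparse graph, can be illustrated with a short script. This is a simplified Python/NumPy sketch (the name knn_graph and the inverse-distance edge weights are assumptions, not CHAMELEON's exact similarity measure); the multilevel graph partitioning step itself is not shown.

```python
import numpy as np

def knn_graph(points, k=3):
    """Build a symmetric k-nearest-neighbour graph as a dense weight matrix.

    Each point is connected to its k nearest neighbours; edge weights are
    similarities (here, inverse distances). A zero entry means no edge.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    weights = np.zeros((n, n))
    for i in range(n):
        neighbours = np.argsort(dist[i])[1:k + 1]        # skip the point itself
        weights[i, neighbours] = 1.0 / (1.0 + dist[i, neighbours])
    return np.maximum(weights, weights.T)                # make the graph symmetric

print(knn_graph([(0, 0), (0, 1), (5, 5), (5, 6)], k=1))
```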

A Two-Phase Clustering Algorithm

Phase 1: Finding initial sub-clusters. The first phase is graph partitioning, which splits the data items into a large number of sub-clusters.

Fig: An example of the bisections produced by multilevel graph partitioning algorithms on two spatial data sets (panels a and b).

Phase 2: Merging sub-clusters using a dynamic framework. An agglomerative hierarchical clustering method is used to look for genuine clusters that can be formed by merging the generated sub-clusters.

 Two different schemes have been implemented in CHAMELEON for this agglomerative step (the combined merging function is sketched after this list):
1. Merge those pairs of clusters whose relative inter-connectivity and relative closeness are both above some user-specified threshold.
2. Combine the relative inter-connectivity and relative closeness into a single function, then merge the pair of clusters that maximizes this function.
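Scheme 2's combined function is not spelled out on the slide; in the original CHAMELEON paper it is the product of the two measures, weighted by a user-specified parameter α:

$$\text{merge the pair } (C_i, C_j) \text{ that maximizes } RI(C_i, C_j) \cdot RC(C_i, C_j)^{\alpha}$$

where α > 1 gives more importance to relative closeness and α < 1 to relative inter-connectivity.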

Relative Inter-Connectivity

 Let Ci and Cj be two clusters. Then

$$RI(C_i, C_j) = \frac{|EC(C_i, C_j)|}{\tfrac{1}{2}\left(|EC(C_i)| + |EC(C_j)|\right)}$$

where the absolute inter-connectivity EC(Ci, Cj) is the sum of the weights of the edges that connect Ci with Cj, and the internal inter-connectivity EC(Ci) is the weighted sum of the edges of the cut that partitions Ci into two roughly equal parts.

Relative Closeness

 The absolute closeness of two clusters, normalized with respect to the internal closeness of the two clusters.
 The absolute closeness is the average similarity (edge weight) between the points in Ci that are connected to the points in Cj.

Internal Closeness

 The internal closeness of a cluster is the average weight of the edges inside the cluster.
 Using these quantities, the relative closeness can be written as shown below.
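For reference, the relative-closeness formula as given in the original CHAMELEON paper (it is not reproduced on the slide):

$$RC(C_i, C_j) = \frac{\bar{S}_{EC(C_i, C_j)}}{\dfrac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC(C_i)} + \dfrac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC(C_j)}}$$

where $\bar{S}_{EC(C_i, C_j)}$ is the average weight of the edges connecting $C_i$ and $C_j$, and $\bar{S}_{EC(C_i)}$ is the average weight of the edges of the internal bisection of $C_i$.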

Merging the Clusters

 If the relative inter-connectivity measure and the relative closeness measure are weighted equally, prefer inter-connectivity.
 One can also require both measures to exceed user-specified thresholds:
RI(Ci, Cj) ≥ T(RI) and RC(Ci, Cj) ≥ T(RC)

Advantages

 Adapts to the natural shapes and densities of clusters in the data.
 Can handle large datasets effectively.
 The merging and refinement processes enhance the quality of the clustering.

Any Questions?

Thank You

