Birch 09
Outline
Introduction to Clustering
Clustering
Introduction
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
A good-quality clustering can help find hidden patterns.
Main methods include partitioning (e.g., k-means), hierarchical (e.g., agglomerative methods, BIRCH), and density-based (e.g., DBSCAN).
Introduction to BIRCH
Designed for very large numerical data sets
Works when time and memory are limited.
Incremental and dynamic clustering of incoming objects.
Only one scan of the data is necessary; the whole data set is not needed in advance.
Exploits the non-uniformity of data: treats dense areas as single sub-clusters and removes outliers (noise).
Scans the database to build an in-memory CF tree, then applies a clustering algorithm to its leaf nodes.
Introduces two concepts: the clustering feature (CF) and the clustering feature tree (CF tree).
Overcomes the two difficulties of agglomerative clustering methods: scalability, and the inability to undo what was done in a previous step.
Clustering Parameters
Centroid: the Euclidean center of the cluster, $\vec{X0} = \frac{1}{N}\sum_{i=1}^{N}\vec{X}_i$.
Radius and diameter reflect the tightness of the cluster around the centroid:
$R = \left(\frac{1}{N}\sum_{i=1}^{N}\left(\vec{X}_i - \vec{X0}\right)^2\right)^{1/2}$, $\quad D = \left(\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}\left(\vec{X}_i - \vec{X}_j\right)^2}{N(N-1)}\right)^{1/2}$
Distance measures between two clusters of $N_1$ and $N_2$ points:
centroid Euclidean distance: $D0 = \left(\left(\vec{X0}_1 - \vec{X0}_2\right)^2\right)^{1/2}$
centroid Manhattan distance: $D1 = \sum_{k=1}^{d}\left|\vec{X0}_1^{(k)} - \vec{X0}_2^{(k)}\right|$
average inter-cluster distance: $D2 = \left(\frac{\sum_{i=1}^{N_1}\sum_{j=N_1+1}^{N_1+N_2}\left(\vec{X}_i - \vec{X}_j\right)^2}{N_1 N_2}\right)^{1/2}$
average intra-cluster distance: $D3 = \left(\frac{\sum_{i=1}^{N_1+N_2}\sum_{j=1}^{N_1+N_2}\left(\vec{X}_i - \vec{X}_j\right)^2}{(N_1+N_2)(N_1+N_2-1)}\right)^{1/2}$
variance increase distance: $D4$, the increase in cluster variance caused by merging the two clusters.
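To make these definitions concrete, here is a minimal NumPy sketch (the function names and the second cluster's points are illustrative, not taken from the BIRCH paper) that computes the centroid, radius, diameter, and centroid Euclidean distance D0 for small point sets:

```python
import numpy as np

def centroid(points):
    # X0 = (sum of the points) / N
    return points.mean(axis=0)

def radius(points):
    # R = sqrt( (1/N) * sum_i (X_i - X0)^2 ): average distance of members to the centroid
    return np.sqrt(((points - centroid(points)) ** 2).sum() / len(points))

def diameter(points):
    # D = sqrt( sum_i sum_j (X_i - X_j)^2 / (N (N - 1)) ): average pairwise distance
    n = len(points)
    pairwise_sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum()
    return np.sqrt(pairwise_sq / (n * (n - 1)))

def d0(cluster_a, cluster_b):
    # Centroid Euclidean distance between two clusters
    return np.linalg.norm(centroid(cluster_a) - centroid(cluster_b))

c1 = np.array([[2.0, 5.0], [3.0, 2.0], [4.0, 3.0]])   # the cluster used in the CF example below
c2 = np.array([[20.0, 20.0], [21.0, 19.0]])           # a second, made-up cluster
print(radius(c1), diameter(c1), d0(c1, c2))
```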
Clustering Feature
The BIRCH algorithm builds a dendrogram, called the clustering feature tree (CF tree), while scanning the data set. A CF is a compact summary of the points in a cluster, and the additivity theorem allows sub-clusters to be merged.
A clustering feature (CF) is a three-dimensional vector summarizing information about a cluster.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple $CF = \langle N, LS, SS \rangle$, where
$N$ is the number of points in the cluster,
$LS = \sum_{i=1}^{N} \vec{X}_i$ is the linear sum of the points,
$SS = \sum_{i=1}^{N} \vec{X}_i^{\,2}$ is the square sum of the points.
Additivity theorem: CF entries of disjoint sub-clusters can be merged consistently and incrementally. If $CF_1 = \langle N_1, LS_1, SS_1 \rangle$ and $CF_2 = \langle N_2, LS_2, SS_2 \rangle$ are the CF entries of two disjoint sub-clusters, then the CF entry of the merged sub-cluster is $CF_1 + CF_2 = \langle N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2 \rangle$.
Suppose there are three points (2,5), (3,2), and (4,3) in a cluster C1. The clustering feature of C1 is
$CF_1 = \langle 3, (2+3+4,\ 5+2+3), (2^2+3^2+4^2,\ 5^2+2^2+3^2) \rangle = \langle 3, (9, 10), (29, 38) \rangle$
If a second disjoint cluster C2 has $CF_2 = \langle 3, (35, 36), (417, 440) \rangle$, the additivity theorem gives the CF of the merged cluster:
$CF_3 = CF_1 + CF_2 = \langle 6, (44, 46), (446, 478) \rangle$
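The arithmetic above can be checked with a short Python sketch. It stores a CF as <N, LS, SS> with per-dimension sums, matching the example's numbers; the class and method names are illustrative, not an API from the BIRCH paper.

```python
import numpy as np

class CF:
    """Clustering feature CF = <N, LS, SS>, with LS and SS kept per dimension."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, np.asarray(ls, float), np.asarray(ss, float)

    @classmethod
    def from_points(cls, points):
        p = np.asarray(points, float)
        return cls(len(p), p.sum(axis=0), (p ** 2).sum(axis=0))

    def __add__(self, other):
        # Additivity theorem: CF1 + CF2 = <N1+N2, LS1+LS2, SS1+SS2>
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

cf1 = CF.from_points([(2, 5), (3, 2), (4, 3)])
print(cf1.n, cf1.ls, cf1.ss)       # 3 [ 9. 10.] [29. 38.]

cf2 = CF(3, (35, 36), (417, 440))  # the second cluster from the example
cf3 = cf1 + cf2
print(cf3.n, cf3.ls, cf3.ss)       # 6 [44. 46.] [446. 478.]
```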
CF-Tree
A CF tree has two parameters: the branching factor B (the maximum number of children of a non-leaf node) and the threshold T. Each leaf node has at most L CF entries, and each leaf entry must satisfy the threshold T (the diameter or radius of its sub-cluster must not exceed T).
CF-Tree Insertion
Recurse down from the root to find the appropriate leaf, following the "closest"-CF path with respect to a chosen distance measure (D0-D4).
Modify the leaf: if the closest CF entry can absorb the new object without violating the threshold T, update that entry; otherwise make a new CF entry in the leaf.
If there is no room for the new entry, split the leaf node and propagate the split upward, splitting parent nodes as needed.
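Below is a simplified, self-contained sketch of the leaf-level step only (find the closest entry, test whether it can absorb the point under the threshold, otherwise add a new entry); it does not implement node splitting or the full recursive tree, and the parameter values and names are illustrative.

```python
import numpy as np

T = 2.0   # threshold on the radius of a leaf entry (illustrative value)
L = 4     # maximum number of CF entries per leaf node (illustrative value)

def centroid(cf):
    # A leaf entry is stored as cf = [N, LS, SS] with per-dimension LS and SS.
    return cf[1] / cf[0]

def radius_if_added(cf, x):
    # Radius of the entry if x were absorbed, derived from the CF components:
    # R^2 = SS.sum()/N - ||LS/N||^2
    n, ls, ss = cf[0] + 1, cf[1] + x, cf[2] + x ** 2
    return np.sqrt(max(ss.sum() / n - ((ls / n) ** 2).sum(), 0.0))

def insert(leaf, x):
    x = np.asarray(x, float)
    if leaf:
        # Follow the "closest"-CF choice using the centroid Euclidean distance (D0).
        closest = min(leaf, key=lambda cf: np.linalg.norm(centroid(cf) - x))
        if radius_if_added(closest, x) <= T:
            closest[0] += 1
            closest[1] += x
            closest[2] += x ** 2
            return leaf
    if len(leaf) >= L:
        raise RuntimeError("leaf is full: a real CF tree would split this node here")
    leaf.append([1, x.copy(), x ** 2])   # start a new CF entry for x
    return leaf

leaf = []
for p in [(2, 5), (3, 2), (4, 3), (20, 20)]:
    insert(leaf, p)
print(len(leaf), "leaf entries")   # 2: the three nearby points share one CF, the outlier gets its own
```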
CF-Tree Rebuilding
If the tree runs out of memory, increase the threshold T and rebuild. With a larger threshold, each CF entry can absorb more data, so the rebuilt tree is smaller.
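As a self-contained sketch of the rebuilding idea (not the paper's exact algorithm; the entry format and names are illustrative), the old tree's leaf entries, each a CF triple, are re-inserted under the larger threshold so that nearby entries merge:

```python
import numpy as np

def radius(cf):
    # cf = [N, LS, SS] with per-dimension LS and SS; R^2 = SS.sum()/N - ||LS/N||^2
    n, ls, ss = cf
    return np.sqrt(max(ss.sum() / n - ((ls / n) ** 2).sum(), 0.0))

def merge(a, b):
    # Additivity theorem: CF1 + CF2 = <N1+N2, LS1+LS2, SS1+SS2>
    return [a[0] + b[0], a[1] + b[1], a[2] + b[2]]

def rebuild(old_leaf_entries, new_T):
    rebuilt = []
    for cf in old_leaf_entries:            # re-insert the CF entries, not the raw points
        target = min(rebuilt, default=None,
                     key=lambda e: np.linalg.norm(e[1] / e[0] - cf[1] / cf[0]))
        if target is not None and radius(merge(target, cf)) <= new_T:
            target[:] = merge(target, cf)  # the larger threshold lets old entries be absorbed
        else:
            rebuilt.append([cf[0], cf[1].copy(), cf[2].copy()])
    return rebuilt

old = [[1, np.array([2.0, 5.0]),   np.array([4.0, 25.0])],
       [2, np.array([7.0, 5.0]),   np.array([25.0, 13.0])],
       [1, np.array([20.0, 20.0]), np.array([400.0, 400.0])]]
print(len(rebuild(old, new_T=3.0)))        # 2: the two nearby entries merged under the larger T
```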
Example of BIRCH
[Figure: a new sub-cluster sc8 is inserted; the target leaf node LN1 exceeds its capacity and is split, with sub-clusters sc1-sc8 spread across leaf nodes LN1-LN3.]
If the branching factor of a non-leaf node cannot exceed 3, then the root must also be split and the height of the CF tree increases by one.
[Figure: the resulting CF tree after the root split, with sub-clusters sc1-sc8 held in the leaf nodes under the new root.]
Phase 1: Load the data into memory by building a CF tree.
Phase 2 (optional): Condense into a desirable range by building a smaller CF tree.
Phase 3: Global clustering.
Phase 4 (optional): Cluster refining.
Phase 1
Start with an initial threshold T and insert points into the tree.
If memory runs out, increase T and rebuild: re-insert the leaf entries of the old tree into the new tree and remove outliers.
The data are reduced to a summary that fits in memory, so subsequent processing occurs entirely in memory (no further I/O).
Phase 2
Optional. The number of sub-clusters produced by Phase 1 may not be suitable for the algorithms used in Phase 3, so the CF tree can be condensed further.
Phase 3
Problems remaining after Phase 1: the result depends on the input order, and splitting can place points that belong to one natural cluster into different leaf entries.
A global clustering algorithm, such as k-means or agglomerative hierarchical clustering (HC), is applied to the leaf entries of the CF tree.
Phase 1 has reduced the input data set enough that this standard algorithm can work entirely in memory.
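A minimal sketch of this step, assuming scikit-learn is available and using made-up leaf-entry values: the global algorithm clusters the leaf entries' centroids (LS / N), weighting each centroid by the number of points it summarizes.

```python
import numpy as np
from sklearn.cluster import KMeans

leaf_n  = np.array([3, 2, 4, 1])                                        # N of each leaf CF entry
leaf_ls = np.array([[9, 10], [40, 41], [4, 80], [50, 2]], dtype=float)  # LS of each leaf CF entry
centroids = leaf_ls / leaf_n[:, None]                                   # centroid = LS / N

# Phase 3: cluster the CF summaries, not the original points.
global_model = KMeans(n_clusters=2, n_init=10, random_state=0)
global_model.fit(centroids, sample_weight=leaf_n)
print(global_model.labels_)    # one global cluster label per leaf entry
```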
Phase 4
Optional. Scan the data again and assign each data point to the closest of the clusters found in Phase 3.
This redistributes the data points among the clusters more accurately than the original CF-based assignment.
The pass can be repeated for further refinement of the clusters.
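For an end-to-end picture, scikit-learn's Birch follows the same scheme: a CF tree is built in one pass (controlled by threshold and branching_factor), the resulting sub-clusters are grouped into n_clusters by a global clustering step, and predict assigns points to the nearest sub-cluster. The data and parameter values below are illustrative.

```python
import numpy as np
from sklearn.cluster import Birch

# Three illustrative Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ((0, 0), (5, 5), (0, 5))])

# threshold plays the role of T, branching_factor of B; n_clusters drives the global step.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(model.subcluster_centers_.shape)   # leaf sub-clusters found by the CF tree
print(np.bincount(labels))               # points per final cluster
```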
Experimental Results
Input parameters:
Memory (M): 5% of the data set
Disk space (R): 20% of M
Distance metric: D2
Quality metric: weighted average diameter (D)
Initial threshold (T): 0.0
Experimental Results
KMEANS clustering
DS    Time    D      # Scan
1     43.9    2.09   289
2     13.2    4.43   51
3     32.9    3.66   187
1o    33.8    1.97   197
2o    12.7    4.20   29
3o    36.0    4.35   241
BIRCH clustering
DS    Time    D      # Scan
1     11.5    1.87   2
2     10.7    1.99   2
3     11.4    3.95   2
1o    13.6    1.87   2
2o    12.1    1.99   2
3o    12.2    3.99   2
Conclusions
A CF tree is a height-balanced tree that stores the clustering features of a hierarchical clustering. Given a limited amount of main memory, BIRCH minimizes the time required for I/O. BIRCH is a clustering algorithm that scales with the number of objects while producing good-quality clusters.
Exam Questions
References
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition, pp. 408-414.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases.
Thank you