
BIRCH:

Balanced Iterative Reducing and Clustering using Hierarchies

Sugandha Saha 211CS3298

Outline
Introduction to Clustering

Main Techniques in Clustering


Hybrid Algorithm: BIRCH
Example of the BIRCH Algorithm
Conclusions


Clustering
Introduction
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

Clustering helps visualize data and guide data analysis.


A good clustering method will produce high-quality clusters with:
high intra-class similarity
low inter-class similarity



A good quality clustering can help find hidden patterns.

Main methods:
Partitioning: K-Means
Hierarchical: Agglomerative, Divisive, BIRCH, ROCK
Density-based: DBSCAN


Introduction to BIRCH
Designed for very large numerical data sets

Time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of the data is necessary
Does not need the whole data set in advance

Exploits the non-uniformity of data: treats dense areas as one subcluster and removes outliers (noise)


Two key phases:

Scans the database to build an in-memory CF tree
Applies a clustering algorithm to cluster the leaf nodes

Introduces two concepts: the clustering feature (CF) and the CF tree. Overcomes two difficulties of agglomerative clustering methods:

Scalability
Inability to undo what was done in a previous step



Clustering Parameters
Centroid: Euclidean center

Radius: average distance from member points to centroid


Diameter: average pair-wise distance within a cluster

Radius and diameter reflect the tightness of the cluster around the centroid.
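Both follow directly from the clustering feature <n, LS, SS> introduced below, so they never require revisiting the raw points. With N points, LS the per-dimension linear sum, and SS here taken as the total square sum:

$$X_0 = \frac{LS}{N}, \qquad R = \sqrt{\frac{SS}{N} - \left\lVert\frac{LS}{N}\right\rVert^2}, \qquad D = \sqrt{\frac{2N\,SS - 2\lVert LS\rVert^2}{N(N-1)}}$$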


BIRCH can measure the closeness of two clusters with any of five alternative distance metrics:

D0: centroid Euclidean distance
D1: centroid Manhattan distance
D2: average inter-cluster distance
D3: average intra-cluster distance
D4: variance increase distance
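Written out for two clusters with centroids X0₁, X0₂ and sizes N1, N2 (following the definitions in the BIRCH paper; D3 and D4 are the analogous average intra-cluster and variance-increase forms):

$$D0 = \lVert X_{01} - X_{02} \rVert_2, \qquad D1 = \lVert X_{01} - X_{02} \rVert_1, \qquad D2 = \left(\frac{\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} \lVert X_i - Y_j \rVert^2}{N_1 N_2}\right)^{1/2}$$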

Clustering Feature
The BIRCH algorithm builds a dendrogram, called a clustering feature (CF) tree, while scanning the data set. A CF is a compact summary of the points in a cluster, and the additivity theorem allows sub-clusters to be merged.

A clustering feature (CF) is a triple summarizing the information maintained about a cluster.

Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple CF = <n, LS, SS>, where:

n - the number of points in the cluster
LS - the linear sum of the n points
SS - the square sum of the n points

$$LS = \sum_{i=1}^{n} P_i \qquad\qquad SS = \sum_{i=1}^{n} P_i^{\,2}$$

Additivity theorem: allows sub-clusters to be merged consistently and incrementally. If CF1 = <N1, LS1, SS1> and CF2 = <N2, LS2, SS2> are the CF entries of two disjoint sub-clusters, then the CF entry of the merged sub-cluster is CF1 + CF2 = <N1 + N2, LS1 + LS2, SS1 + SS2>.


Suppose there are three points (2,5), (3,2), (4,3) in a cluster C1. The clustering feature of C1 is

CF1 = <3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)>
CF1 = <3, (9,10), (29,38)>

Let C2 be a second cluster with clustering feature

CF2 = <3, (35,36), (417,440)>

Then, by the additivity theorem, CF3 = CF1 + CF2:

CF3 = <3+3, (9+35, 10+36), (29+417, 38+440)>
CF3 = <6, (44,46), (446,478)>
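To make the additivity concrete, here is a minimal Python sketch that reproduces the example above; the class and method names (CF, from_points, merge) are illustrative, not from the paper, and SS is kept per dimension as in the example:

```python
import numpy as np

class CF:
    """Clustering feature <n, LS, SS> for one (sub)cluster."""
    def __init__(self, n, ls, ss):
        self.n = n                       # number of points
        self.ls = np.asarray(ls, float)  # linear sum, per dimension
        self.ss = np.asarray(ss, float)  # square sum, per dimension

    @classmethod
    def from_points(cls, points):
        pts = np.asarray(points, float)
        return cls(len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0))

    def merge(self, other):
        # additivity theorem: CF1 + CF2 = <N1+N2, LS1+LS2, SS1+SS2>
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

cf1 = CF.from_points([(2, 5), (3, 2), (4, 3)])  # <3, (9,10), (29,38)>
cf2 = CF(3, (35, 36), (417, 440))
cf3 = cf1.merge(cf2)
print(cf3.n, cf3.ls, cf3.ss)                    # 6 [44. 46.] [446. 478.]
```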

CF-Tree

Each non-leaf node has at most B entries

Each leaf node has at most L CF entries, each of which satisfies the threshold T (its radius or diameter must not exceed T)


CF-Tree Insertion

Recurse down from the root to find the appropriate leaf
Follow the "closest"-CF path, with respect to any of the distance metrics D0, ..., D4
Modify the leaf: if the closest CF entry cannot absorb the new point, make a new CF entry; if there is no room for the new entry, split the leaf node (splits can propagate up to the parent)

Traverse back up, updating the CFs on the path and splitting nodes as needed
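Below is a minimal sketch of the leaf-level step only, assuming a CF entry is stored as a mutable list [n, LS, SS] (LS and SS per-dimension vectors, as in the example above); the tree descent, node layout, and split handling are simplifications, not the paper's data structures:

```python
import numpy as np

def radius(n, ls, ss):
    # R = sqrt(sum(SS)/n - ||LS/n||^2), computed from the CF alone
    c = ls / n
    return float(np.sqrt(max(ss.sum() / n - c @ c, 0.0)))

def insert_point(leaf, point, T, L):
    """Insert one point into a leaf (a list of CF entries).
    Returns True when the leaf overflows and must be split."""
    p = np.asarray(point, float)
    if leaf:
        # follow the "closest"-CF path; here D0, centroid Euclidean distance
        entry = min(leaf, key=lambda e: np.linalg.norm(e[1] / e[0] - p))
        n, ls, ss = entry[0] + 1, entry[1] + p, entry[2] + p * p
        if radius(n, ls, ss) <= T:          # the closest CF absorbs the point
            entry[0], entry[1], entry[2] = n, ls, ss
            return False
    leaf.append([1, p.copy(), p * p])       # otherwise start a new CF entry
    return len(leaf) > L                    # caller must split this leaf

leaf = []
for pt in [(2, 5), (3, 2), (4, 3)]:
    must_split = insert_point(leaf, pt, T=1.5, L=3)
```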


CF-Tree Rebuilding
If we run out of memory, increase the threshold T
By increasing the threshold, CFs absorb more data

Rebuilding "pushes" CFs together: the larger T allows different CFs to group into one

Reducibility theorem:
Increasing T results in a CF tree no larger than the original
Rebuilding needs at most h extra pages of memory, where h is the height of the tree
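As a loose illustration only (the real algorithm rebuilds the whole tree path by path), one rebuild pass over the leaf entries can be sketched under the same [n, LS, SS] layout as above: old entries are reinserted at the larger T, so nearby CFs absorb one another and the result is no larger than the original.

```python
import numpy as np

def rebuild_leaf(entries, T):
    """Reinsert old CF entries into a fresh leaf list at threshold T."""
    rebuilt = []
    for n, ls, ss in entries:
        absorbed = False
        for e in rebuilt:
            m, mls, mss = e[0] + n, e[1] + ls, e[2] + ss  # additivity theorem
            c = mls / m
            if np.sqrt(max(mss.sum() / m - c @ c, 0.0)) <= T:
                e[0], e[1], e[2] = m, mls, mss
                absorbed = True
                break
        if not absorbed:
            rebuilt.append([n, np.asarray(ls, float), np.asarray(ss, float)])
    return rebuilt
```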


Example of BIRCH
[Figure: a CF tree whose root has entries LN1, LN2, LN3. Leaf LN1 holds subclusters sc1 and sc2, LN2 holds sc3, sc4, sc5, and LN3 holds sc6 and sc7. A new subcluster sc8 arrives and is routed to its closest leaf, LN1.]

Insertion Operation in BIRCH


If a leaf node cannot hold more than three entries (L = 3), then inserting sc8 overflows LN1, and LN1 is split.

[Figure: LN1 splits into two leaves, one holding sc8 and sc1, the other holding sc2; the root now has four entries: the two halves of LN1, plus LN2 and LN3.]

If a non-leaf node cannot hold more than three entries (B = 3), then the root itself is split and the height of the CF tree increases by one.

[Figure: the overflowing root is split into non-leaf nodes NLN1 and NLN2 under a new root; NLN1 points to the two halves of LN1, and NLN2 points to LN2 and LN3.]

BIRCH Clustering Algorithm

Phase 1: Load the data into memory by building a CF tree
Phase 2 (optional): Condense into a desirable range by building a smaller CF tree
Phase 3: Global clustering
Phase 4 (optional): Cluster refining


Phase 1
Start with an initial threshold T and insert points into the tree
If we run out of memory, increase T and rebuild:

re-insert the leaf entries from the old tree into the new tree
remove outliers

Methods for initializing and adjusting T are ad hoc. After Phase 1:

the data are reduced to fit in memory
subsequent processing occurs entirely in memory (no I/O)

Phase 2
Optional. The number of subclusters produced in Phase 1 may not be suitable for the algorithm used in Phase 3.

Shrink the tree as necessary:

remove more outliers
merge crowded subclusters



Phase 3
Problems after Phase 1:

the input order affects the results
node splits are triggered by page size rather than by natural cluster boundaries

Use the leaf nodes of the CF tree as input to a standard (global) clustering algorithm, e.g., k-means or hierarchical clustering (HC).

Phase 1 has reduced the size of the input data set enough that the standard algorithm can work entirely in memory.
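As one illustrative choice for the global step (not the procedure prescribed by the paper), the sketch below runs a weighted k-means over the leaf subcluster centroids, weighting each centroid by its point count n; ns and lss are assumed to have been collected from the CF leaves:

```python
import numpy as np

def global_cluster(ns, lss, k, iters=20, seed=0):
    """ns: (m,) point counts; lss: (m, d) linear sums. Returns k centroids."""
    ns = np.asarray(ns, float)
    sub = np.asarray(lss, float) / ns[:, None]   # subcluster centroids LS/n
    rng = np.random.default_rng(seed)
    centers = sub[rng.choice(len(sub), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(sub[:, None, :] - centers[None, :, :], axis=2)
        label = dist.argmin(axis=1)
        for j in range(k):
            w = ns[label == j]
            if w.size:                           # weighted centroid update
                centers[j] = (w[:, None] * sub[label == j]).sum(0) / w.sum()
    return centers
```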

Phase 4
Optional. Scan through the data again and assign each data point to a cluster:

choose the cluster whose centroid is closest

This redistributes the data points among the clusters more accurately than the original CF clustering.
Can be repeated for improved refinement of the clusters.
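Phase 4 then amounts to a nearest-centroid pass over the raw points; a minimal sketch, assuming `centers` comes from the Phase 3 sketch above:

```python
import numpy as np

def refine(points, centers):
    pts = np.asarray(points, float)
    dist = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)                   # cluster label per point
```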

Experimental Results
Input parameters:
Memory (M): 5% of the data set
Disk space (R): 20% of M
Distance metric: D2
Quality metric: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes



Experimental Results
KMEANS clustering:

DS   Time   D      # Scan      DS   Time   D      # Scan
1    43.9   2.09   289         1o   33.8   1.97   197
2    13.2   4.43   51          2o   12.7   4.20   29
3    32.9   3.66   187         3o   36.0   4.35   241

BIRCH clustering:

DS   Time   D      # Scan      DS   Time   D      # Scan
1    11.5   1.87   2           1o   13.6   1.87   2
2    10.7   1.99   2           2o   12.1   1.99   2
3    11.4   3.95   2           3o   12.2   3.99   2


Conclusions
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. Given a limited amount of main memory, BIRCH can minimize the time required for I/O. BIRCH is scalable in the number of objects and produces good-quality clustering of the data.

Exam Questions

What is the main limitation of BIRCH?


Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user would consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.


Exam Questions

Name the two algorithms in BIRCH clustering:


CF-Tree Insertion
CF-Tree Rebuilding

What is the purpose of Phase 4 in BIRCH?


Do additional passes over the data set and reassign each data point to its closest centroid.


References
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edition, pp. 408-414.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases".


Thank you
