Birch 09
Outline
Introduction to Clustering
Clustering
Introduction
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
A good-quality clustering can help find hidden patterns.
Main methods include partitioning (e.g., k-means), hierarchical (e.g., agglomerative methods, BIRCH), and density-based (e.g., DBSCAN).
Introduction to BIRCH
Designed for very large numerical data sets
Works when time and memory are limited.
Incremental and dynamic clustering of incoming objects.
Only one scan of the data is necessary; the whole data set is not needed in advance.
Exploits the non-uniformity of data: treats dense areas as single sub-clusters and removes outliers (noise).
Scans the database to build an in-memory CF tree, then applies a clustering algorithm to its leaf nodes.
Introduces two concepts: the clustering feature (CF) and the clustering feature tree (CF tree).
Overcomes the two difficulties of agglomerative clustering methods: scalability, and the inability to undo what was done in a previous step.
Clustering Parameters
Centroid: the Euclidean center of the cluster, $\vec{X0} = \frac{1}{N}\sum_{i=1}^{N}\vec{X}_i$.
Radius and diameter reflect the tightness of the cluster around the centroid:
$R = \left(\frac{1}{N}\sum_{i=1}^{N}\left(\vec{X}_i - \vec{X0}\right)^2\right)^{1/2}$, $\quad D = \left(\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}\left(\vec{X}_i - \vec{X}_j\right)^2}{N(N-1)}\right)^{1/2}$
Distance measures between two clusters of $N_1$ and $N_2$ points:
centroid Euclidean distance: $D0 = \left(\left(\vec{X0}_1 - \vec{X0}_2\right)^2\right)^{1/2}$
centroid Manhattan distance: $D1 = \sum_{k=1}^{d}\left|\vec{X0}_1^{(k)} - \vec{X0}_2^{(k)}\right|$
average inter-cluster distance: $D2 = \left(\frac{\sum_{i=1}^{N_1}\sum_{j=N_1+1}^{N_1+N_2}\left(\vec{X}_i - \vec{X}_j\right)^2}{N_1 N_2}\right)^{1/2}$
average intra-cluster distance: $D3 = \left(\frac{\sum_{i=1}^{N_1+N_2}\sum_{j=1}^{N_1+N_2}\left(\vec{X}_i - \vec{X}_j\right)^2}{(N_1+N_2)(N_1+N_2-1)}\right)^{1/2}$
variance increase distance: $D4$, the increase in cluster variance caused by merging the two clusters.
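To make these definitions concrete, here is a minimal NumPy sketch (the function names and the second cluster's points are illustrative, not taken from the BIRCH paper) that computes the centroid, radius, diameter, and centroid Euclidean distance D0 for small point sets:

```python
import numpy as np

def centroid(points):
    # X0 = (sum of the points) / N
    return points.mean(axis=0)

def radius(points):
    # R = sqrt( (1/N) * sum_i (X_i - X0)^2 ): average distance of members to the centroid
    return np.sqrt(((points - centroid(points)) ** 2).sum() / len(points))

def diameter(points):
    # D = sqrt( sum_i sum_j (X_i - X_j)^2 / (N (N - 1)) ): average pairwise distance
    n = len(points)
    pairwise_sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum()
    return np.sqrt(pairwise_sq / (n * (n - 1)))

def d0(cluster_a, cluster_b):
    # Centroid Euclidean distance between two clusters
    return np.linalg.norm(centroid(cluster_a) - centroid(cluster_b))

c1 = np.array([[2.0, 5.0], [3.0, 2.0], [4.0, 3.0]])   # the cluster used in the CF example below
c2 = np.array([[20.0, 20.0], [21.0, 19.0]])           # a second, made-up cluster
print(radius(c1), diameter(c1), d0(c1, c2))
```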
Clustering Feature
The BIRCH algorithm builds a dendrogram, called the clustering feature tree (CF tree), while scanning the data set. A CF is a compact summary of the points in a cluster, and the additivity theorem allows sub-clusters to be merged.
A clustering feature (CF) is a three-dimensional vector summarizing information about a cluster.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple $CF = \langle N, LS, SS \rangle$, where
$N$ is the number of points in the cluster,
$LS = \sum_{i=1}^{N} \vec{X}_i$ is the linear sum of the points,
$SS = \sum_{i=1}^{N} \vec{X}_i^{\,2}$ is the square sum of the points.
Additivity theorem: CF entries of disjoint sub-clusters can be merged consistently and incrementally. If $CF_1 = \langle N_1, LS_1, SS_1 \rangle$ and $CF_2 = \langle N_2, LS_2, SS_2 \rangle$ are the CF entries of two disjoint sub-clusters, then the CF entry of the merged sub-cluster is $CF_1 + CF_2 = \langle N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2 \rangle$.
Suppose there are three points (2,5), (3,2), and (4,3) in a cluster C1. The clustering feature of C1 is
$CF_1 = \langle 3, (2+3+4,\ 5+2+3), (2^2+3^2+4^2,\ 5^2+2^2+3^2) \rangle = \langle 3, (9, 10), (29, 38) \rangle$
If a second disjoint cluster C2 has $CF_2 = \langle 3, (35, 36), (417, 440) \rangle$, the additivity theorem gives the CF of the merged cluster:
$CF_3 = CF_1 + CF_2 = \langle 6, (44, 46), (446, 478) \rangle$
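The arithmetic above can be checked with a short Python sketch. It stores a CF as <N, LS, SS> with per-dimension sums, matching the example's numbers; the class and method names are illustrative, not an API from the BIRCH paper.

```python
import numpy as np

class CF:
    """Clustering feature CF = <N, LS, SS>, with LS and SS kept per dimension."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, np.asarray(ls, float), np.asarray(ss, float)

    @classmethod
    def from_points(cls, points):
        p = np.asarray(points, float)
        return cls(len(p), p.sum(axis=0), (p ** 2).sum(axis=0))

    def __add__(self, other):
        # Additivity theorem: CF1 + CF2 = <N1+N2, LS1+LS2, SS1+SS2>
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

cf1 = CF.from_points([(2, 5), (3, 2), (4, 3)])
print(cf1.n, cf1.ls, cf1.ss)       # 3 [ 9. 10.] [29. 38.]

cf2 = CF(3, (35, 36), (417, 440))  # the second cluster from the example
cf3 = cf1 + cf2
print(cf3.n, cf3.ls, cf3.ss)       # 6 [44. 46.] [446. 478.]
```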
CF-Tree
A CF tree has two parameters: the branching factor B (the maximum number of children of a non-leaf node) and the threshold T. Each leaf node has at most L CF entries, and each leaf entry must satisfy the threshold T (the diameter or radius of its sub-cluster must not exceed T).
CF-Tree Insertion
Recurse down from the root to find the appropriate leaf, following the "closest"-CF path with respect to a chosen distance measure (D0-D4).
Modify the leaf: if the closest CF entry can absorb the new object without violating the threshold T, update that entry; otherwise make a new CF entry in the leaf.
If there is no room for the new entry, split the leaf node and propagate the split upward, splitting parent nodes as needed.
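Below is a simplified, self-contained sketch of the leaf-level step only (find the closest entry, test whether it can absorb the point under the threshold, otherwise add a new entry); it does not implement node splitting or the full recursive tree, and the parameter values and names are illustrative.

```python
import numpy as np

T = 2.0   # threshold on the radius of a leaf entry (illustrative value)
L = 4     # maximum number of CF entries per leaf node (illustrative value)

def centroid(cf):
    # A leaf entry is stored as cf = [N, LS, SS] with per-dimension LS and SS.
    return cf[1] / cf[0]

def radius_if_added(cf, x):
    # Radius of the entry if x were absorbed, derived from the CF components:
    # R^2 = SS.sum()/N - ||LS/N||^2
    n, ls, ss = cf[0] + 1, cf[1] + x, cf[2] + x ** 2
    return np.sqrt(max(ss.sum() / n - ((ls / n) ** 2).sum(), 0.0))

def insert(leaf, x):
    x = np.asarray(x, float)
    if leaf:
        # Follow the "closest"-CF choice using the centroid Euclidean distance (D0).
        closest = min(leaf, key=lambda cf: np.linalg.norm(centroid(cf) - x))
        if radius_if_added(closest, x) <= T:
            closest[0] += 1
            closest[1] += x
            closest[2] += x ** 2
            return leaf
    if len(leaf) >= L:
        raise RuntimeError("leaf is full: a real CF tree would split this node here")
    leaf.append([1, x.copy(), x ** 2])   # start a new CF entry for x
    return leaf

leaf = []
for p in [(2, 5), (3, 2), (4, 3), (20, 20)]:
    insert(leaf, p)
print(len(leaf), "leaf entries")   # 2: the three nearby points share one CF, the outlier gets its own
```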
CF-Tree Rebuilding
If the tree runs out of memory, increase the threshold T and rebuild. With a larger threshold, each CF entry can absorb more data, so the rebuilt tree is smaller.
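As a self-contained sketch of the rebuilding idea (not the paper's exact algorithm; the entry format and names are illustrative), the old tree's leaf entries, each a CF triple, are re-inserted under the larger threshold so that nearby entries merge:

```python
import numpy as np

def radius(cf):
    # cf = [N, LS, SS] with per-dimension LS and SS; R^2 = SS.sum()/N - ||LS/N||^2
    n, ls, ss = cf
    return np.sqrt(max(ss.sum() / n - ((ls / n) ** 2).sum(), 0.0))

def merge(a, b):
    # Additivity theorem: CF1 + CF2 = <N1+N2, LS1+LS2, SS1+SS2>
    return [a[0] + b[0], a[1] + b[1], a[2] + b[2]]

def rebuild(old_leaf_entries, new_T):
    rebuilt = []
    for cf in old_leaf_entries:            # re-insert the CF entries, not the raw points
        target = min(rebuilt, default=None,
                     key=lambda e: np.linalg.norm(e[1] / e[0] - cf[1] / cf[0]))
        if target is not None and radius(merge(target, cf)) <= new_T:
            target[:] = merge(target, cf)  # the larger threshold lets old entries be absorbed
        else:
            rebuilt.append([cf[0], cf[1].copy(), cf[2].copy()])
    return rebuilt

old = [[1, np.array([2.0, 5.0]),   np.array([4.0, 25.0])],
       [2, np.array([7.0, 5.0]),   np.array([25.0, 13.0])],
       [1, np.array([20.0, 20.0]), np.array([400.0, 400.0])]]
print(len(rebuild(old, new_T=3.0)))        # 2: the two nearby entries merged under the larger T
```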
Example of BIRCH
[Figure: a new sub-cluster sc8 is inserted; the target leaf node LN1 exceeds its capacity and is split, with sub-clusters sc1-sc8 spread across leaf nodes LN1-LN3.]
If the branching factor of a non-leaf node cannot exceed 3, then the root must also be split and the height of the CF tree increases by one.
[Figure: the resulting CF tree after the root split, with sub-clusters sc1-sc8 held in the leaf nodes under the new root.]
Phase 1: Load the data into memory by building a CF tree.
Phase 2 (optional): Condense into a desirable range by building a smaller CF tree.
Phase 3: Global clustering.
Phase 4 (optional): Cluster refining.
Phase 1
Start with an initial threshold T and insert points into the tree.
If memory runs out, increase T and rebuild: re-insert the leaf entries of the old tree into the new tree and remove outliers.
The data are reduced to a summary that fits in memory, so subsequent processing occurs entirely in memory (no further I/O).
Phase 2
Optional. The number of sub-clusters produced by Phase 1 may not be suitable for the algorithms used in Phase 3, so the CF tree can be condensed further.
Phase 3
Problems remaining after Phase 1: the result depends on the input order, and splitting can place points that belong to one natural cluster into different leaf entries.
A global clustering algorithm, such as k-means or agglomerative hierarchical clustering (HC), is applied to the leaf entries of the CF tree.
Phase 1 has reduced the input data set enough that this standard algorithm can work entirely in memory.
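A minimal sketch of this step, assuming scikit-learn is available and using made-up leaf-entry values: the global algorithm clusters the leaf entries' centroids (LS / N), weighting each centroid by the number of points it summarizes.

```python
import numpy as np
from sklearn.cluster import KMeans

leaf_n  = np.array([3, 2, 4, 1])                                        # N of each leaf CF entry
leaf_ls = np.array([[9, 10], [40, 41], [4, 80], [50, 2]], dtype=float)  # LS of each leaf CF entry
centroids = leaf_ls / leaf_n[:, None]                                   # centroid = LS / N

# Phase 3: cluster the CF summaries, not the original points.
global_model = KMeans(n_clusters=2, n_init=10, random_state=0)
global_model.fit(centroids, sample_weight=leaf_n)
print(global_model.labels_)    # one global cluster label per leaf entry
```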
Phase 4
Optional. Scan the data again and assign each data point to the closest of the clusters found in Phase 3.
This redistributes the data points among the clusters more accurately than the original CF-based assignment.
The pass can be repeated for further refinement of the clusters.
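For an end-to-end picture, scikit-learn's Birch follows the same scheme: a CF tree is built in one pass (controlled by threshold and branching_factor), the resulting sub-clusters are grouped into n_clusters by a global clustering step, and predict assigns points to the nearest sub-cluster. The data and parameter values below are illustrative.

```python
import numpy as np
from sklearn.cluster import Birch

# Three illustrative Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ((0, 0), (5, 5), (0, 5))])

# threshold plays the role of T, branching_factor of B; n_clusters drives the global step.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(model.subcluster_centers_.shape)   # leaf sub-clusters found by the CF tree
print(np.bincount(labels))               # points per final cluster
```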
Experimental Results
Input parameters:
Memory (M): 5% of the data set
Disk space (R): 20% of M
Distance metric: D2
Quality metric: weighted average diameter (D)
Initial threshold (T): 0.0
Experimental Results
KMEANS clustering
DS    Time    D      # Scan
1     43.9    2.09   289
2     13.2    4.43   51
3     32.9    3.66   187
1o    33.8    1.97   197
2o    12.7    4.20   29
3o    36.0    4.35   241
BIRCH clustering
DS    Time    D      # Scan
1     11.5    1.87   2
2     10.7    1.99   2
3     11.4    3.95   2
1o    13.6    1.87   2
2o    12.1    1.99   2
3o    12.2    3.99   2
Conclusions
A CF tree is a height-balanced tree that stores the clustering features of a hierarchical clustering. Given a limited amount of main memory, BIRCH minimizes the time required for I/O. BIRCH is a clustering algorithm that scales with the number of objects while producing good-quality clusters.
Exam Questions
References
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition, pp. 408-414.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases.
Thank you