Lecture 18

Hierarchical Clustering

• Two broad approaches: bottom-up (agglomerative) and top-down (divisive).

• Agglomerative (AGNES) starts with each object as its own cluster and merges the closest pair at every step: a and b merge into ab, d and e into de, then cde, and finally abcde (steps 0 through 4).

• Divisive (DIANA) runs in the reverse direction: it starts with all objects in one cluster (abcde) and recursively splits until every object stands alone (step 4 back to step 0).

[Figure: dendrogram over objects a–e; read left to right for AGNES, right to left for DIANA.]

Hierarchical Agglomerative Clustering: Linkage Methods

• The single linkage method is based on the minimum distance, or the nearest-neighbor rule.

• The complete linkage method is based on the maximum distance, or the furthest-neighbor rule.

• In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, one drawn from each cluster (a code sketch follows the figure below).

Linkage Methods of Clustering

[Figure: three panels, each showing Cluster 1 and Cluster 2. Single linkage uses the minimum distance between the two clusters, complete linkage the maximum distance, and average linkage the average distance over all cross-cluster pairs.]
• Yet another distance between clusters is the centroid distance: the distance between the centroids (mean vectors) of the two clusters.
Dendrogram

• The single-link method can be seen as a graph-based method.
• Nodes are the data points.
• Every pair of points is joined by an edge whose cost is the distance between them.
• Single-link clustering is exactly minimum spanning tree clustering: build the MST of this graph and cut its longest edges (a sketch follows the next heading).
Minimum spanning tree clustering
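The equivalence suggests a direct implementation. Below is a minimal sketch (my own, using SciPy; not from the lecture): build the MST of the complete distance graph, delete the k - 1 heaviest MST edges, and read the clusters off as connected components.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_single_link(points, k):
    # Complete graph of pairwise Euclidean distances
    dist = squareform(pdist(points))
    # Its minimum spanning tree (n - 1 edges, as a sparse matrix)
    mst = minimum_spanning_tree(dist).tocoo()
    # Keep only the n - k lightest MST edges, cutting the k - 1 heaviest
    keep = np.argsort(mst.data)[: len(mst.data) - (k - 1)]
    forest = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=mst.shape)
    # Each connected component of the remaining forest is one cluster
    _, labels = connected_components(forest, directed=False)
    return labels

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
print(mst_single_link(points, k=2))  # e.g. [0 0 1 1]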

Single-link Vs. Complete-link

Single link is sensitive to noise and outliers, but it handles arbitrarily shaped clusters well.

AGNES (Agglomerative Nesting)

• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Uses the single-link method and the dissimilarity matrix
• Merges the pair of clusters with the least dissimilarity
• Continues merging in a non-descending fashion (merge distances never decrease)
• Eventually all nodes belong to the same cluster (see the sketch below)
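As an illustration (my own, not the lecture's code), SciPy's hierarchical clustering reproduces this AGNES-style procedure from a dissimilarity matrix:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

points = np.array([[0, 0], [0, 1], [4, 0], [4, 1], [8, 0]])
d = pdist(points)                 # condensed dissimilarity matrix
# Each row of Z records one merge; merge distances are non-descending
Z = linkage(d, method='single')
print(Z)
# Cut the hierarchy into, e.g., 2 flat clusters
print(fcluster(Z, t=2, criterion='maxclust'))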

DIANA (Divisive Analysis)

• Introduced in Kaufmann and Rousseeuw (1990)


• Implemented in statistical analysis packages, e.g., Splus
• Works in the inverse order of AGNES
• A cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster (a sketch of one split step appears below)
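To make the splitting principle concrete, here is a minimal sketch of one DIANA-style split (my own illustration; the real package differs in details): seed a splinter group with the object that has the largest average dissimilarity to the rest, then move over every object that is closer, on average, to the splinter group than to the remaining objects.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def diana_split(D):
    # One divisive split, given the cluster's pairwise distance matrix D.
    # Returns a boolean mask: True marks the splinter group.
    n = D.shape[0]
    in_splinter = np.zeros(n, dtype=bool)
    # Seed with the object farthest, on average, from everybody else
    in_splinter[np.argmax(D.sum(axis=1) / (n - 1))] = True
    moved = True
    while moved and not in_splinter.all():
        moved = False
        for i in np.where(~in_splinter)[0]:
            rest = ~in_splinter
            rest[i] = False
            if rest.any() and D[i, in_splinter].mean() < D[i, rest].mean():
                in_splinter[i] = True   # i is closer to the splinter group
                moved = True
    return in_splinter

points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])
print(diana_split(squareform(pdist(points))))  # [False False False  True  True]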

More on Hierarchical Clustering Methods

■ Major weaknesses of agglomerative clustering methods:
  ■ they do not scale well: time complexity is at least O(n²), where n is the total number of objects
  ■ they can never undo what was done previously
■ Integration of hierarchical with distance-based clustering:
  ■ BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  ■ CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
  ■ CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
■ Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
■ Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
■ Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
■ Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree
■ Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
■ Weakness: handles only numeric data, and is sensitive to the order of the data records.
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)


N: the number of data points
LS: ∑_{i=1}^{N} X_i, the linear sum of the points (component-wise)
SS: ∑_{i=1}^{N} X_i², the square sum of the points (component-wise)

For the five points below, CF = (5, (16, 30), (54, 190)); a quick check in code follows the list:

(3,4)
(2,6)
(4,5)
(4,7)
(3,8)
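A quick NumPy check (my own; it follows the slide's convention that LS and SS are taken component-wise) reproduces the CF above:

import numpy as np

# The five 2-d points from the slide
X = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N = len(X)                  # number of data points: 5
LS = X.sum(axis=0)          # linear sum: (16, 30)
SS = (X ** 2).sum(axis=0)   # square sum: (54, 190)
print(N, tuple(LS), tuple(SS))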

CF Additive Theorem
● Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2).
● If we merge C1 with C2, the CF for the merged cluster C is

  CF = CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
● Why CF?
● Summarized info for a single cluster
● Summarized info for two merged clusters comes for free, by the additive theorem, so a merge never revisits the raw points (see the sketch below)
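A minimal sketch of the additive theorem in code (my own; the CF class and names are assumptions, not the BIRCH authors' code):

from dataclasses import dataclass
import numpy as np

@dataclass
class CF:
    n: int          # number of points
    ls: np.ndarray  # linear sum
    ss: np.ndarray  # square sum

    def __add__(self, other):
        # Additive theorem: the merged cluster's CF is the entry-wise sum
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

cf1 = CF(3, np.array([9.0, 15.0]), np.array([29.0, 77.0]))   # points (3,4),(2,6),(4,5)
cf2 = CF(2, np.array([7.0, 15.0]), np.array([25.0, 113.0]))  # points (4,7),(3,8)
merged = cf1 + cf2
print(merged.n, merged.ls, merged.ss)  # 5 [16. 30.] [54. 190.]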

CF Tree

[Figure: structure of a CF tree. The root holds entries CF1, CF2, CF3, …, CF6, each with a child pointer (child1 … child6). A non-leaf node likewise holds entries CF1 … CF5 with child pointers child1 … child5. Leaf nodes hold CF entries directly (e.g., CF1, CF2, …, CF6) and are chained to neighboring leaves by prev/next pointers.]
• A CF tree is a height-balanced tree that stores the CFs in its nodes.
• Nonleaf nodes store the sums of the CFs of their children.
  – Each nonleaf entry thus summarizes the subtree rooted at its child.
• A CF tree has two parameters: the branching factor B and the threshold T.
• B is the maximum number of children a nonleaf node can have.

• Threshold T is the maximum diameter of the subclusters stored at the leaf nodes of the tree; the diameter can be computed from a subcluster's CF alone, as sketched below.
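Why this makes the threshold test cheap: the average pairwise diameter of a subcluster falls out of its CF. A minimal sketch (my own; SS is reduced to a scalar by summing its components, and diameter here means the average pairwise distance):

import numpy as np

def diameter(n, ls, ss):
    # Average pairwise distance of a subcluster, from its CF = (n, LS, SS).
    # Uses the identity: sum_{i,j} ||x_i - x_j||^2 = 2n·sum||x_i||^2 - 2·||LS||^2
    if n < 2:
        return 0.0
    d2 = (2 * n * ss.sum() - 2 * (ls @ ls)) / (n * (n - 1))
    return float(np.sqrt(max(d2, 0.0)))

# The subcluster from the earlier example: CF = (5, (16, 30), (54, 190))
print(diameter(5, np.array([16.0, 30.0]), np.array([54.0, 190.0])))  # ≈ 2.53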

• The CF tree is built incrementally.
• An object is inserted into the closest leaf entry (subcluster).
• If the diameter of the subcluster stored in the leaf node exceeds the threshold T after the insertion, the leaf node is split.
  – This can in turn split the parent node(s).
• Insertion is analogous to B+-tree insertion; a simplified sketch follows.
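A heavily simplified, flat version of the insertion rule (my own sketch; a real CF tree is height-balanced with internal nodes and node splits, which this omits). Entries are (n, LS, SS) tuples:

import numpy as np

def diameter(n, ls, ss):
    # Average pairwise distance of a subcluster, from its CF alone
    if n < 2:
        return 0.0
    d2 = (2 * n * ss.sum() - 2 * (ls @ ls)) / (n * (n - 1))
    return float(np.sqrt(max(d2, 0.0)))

def insert_point(entries, x, T):
    # Insert x into the closest leaf entry if that keeps its diameter <= T;
    # otherwise open a new subcluster (where a real tree might split a node)
    if entries:
        i = min(range(len(entries)),
                key=lambda j: np.linalg.norm(x - entries[j][1] / entries[j][0]))
        n, ls, ss = entries[i]
        cand = (n + 1, ls + x, ss + x ** 2)   # tentative absorb (CF additivity)
        if diameter(*cand) <= T:
            entries[i] = cand
            return entries
    entries.append((1, x.copy(), x ** 2))
    return entries

entries = []
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
    entries = insert_point(entries, np.array(p, dtype=float), T=3.0)
print(len(entries), "leaf entries")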
