
Data Mining: Concepts and Techniques (3rd ed.)

— Chapter 10 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
By Weka App. Do K-Means Cluster
 Open Weka -> Explorer
 Choose the iris.arff file
 Click the Cluster tab.
 Choose clusterers -> SimpleKMeans
 In the clusterer settings, change numClusters to 3 (clusters)
 In the Cluster mode section, select the “Classes to clusters evaluation” radio button
 Click the Start button.
 You can also ignore attributes before clicking Start
 You can visualize the output by right-clicking on the result list and selecting “Visualize cluster assignments” (a scripted sketch of the same experiment follows this list)
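For readers who prefer a script to the GUI, here is a minimal sketch of the same experiment in Python with scikit-learn (my choice of library; the slides drive Weka's GUI, and `n_init`/`random_state` below are illustrative values, not Weka settings):

```python
# Rough scripted analogue of the Weka K-means steps above.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# Mirror "numClusters = 3"
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# A simple "classes to clusters" style check: class counts per cluster
for c in range(3):
    print(c, [int((labels == c)[y == k].sum()) for k in range(3)])
```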
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary

Hierarchical Clustering
 Uses a distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input, but it
needs a termination condition
[Figure: agglomerative clustering (AGNES) merges objects a, b, c, d, e bottom-up over Steps 0–4 — {a,b}, {d,e}, {c,d,e}, then {a,b,c,d,e}; divisive clustering (DIANA) traverses the same hierarchy top-down, from Step 4 back to Step 0]
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Continue in a non-descending fashion: the dissimilarity at each merge never decreases
 Eventually all nodes belong to the same cluster (a minimal single-link sketch follows the figure below)
[Figure: three scatter plots on a 0–10 × 0–10 grid showing AGNES merging nearby points into progressively larger clusters]
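As an illustration (not the Splus or Weka implementation), a minimal hand-rolled single-link agglomerative loop might look like this in Python; the point data and stopping condition are invented for the example:

```python
import numpy as np

def single_link_agnes(points, k_stop=1):
    """Naive AGNES with single-link distance; O(n^3), for illustration only."""
    pts = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > k_stop:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: smallest pairwise distance between the clusters
                d = min(np.linalg.norm(pts[i] - pts[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)  # merge the least-dissimilar pair
    return clusters

print(single_link_agnes([(1, 1), (1.5, 1), (5, 5), (5, 6), (9, 9)], k_stop=2))
```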
Dendrogram: Shows How Clusters are
Merged
 Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
 A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster (see the SciPy sketch below)
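A sketch of building and cutting a dendrogram with SciPy (the library choice and the cut height are mine; the slides do not prescribe them):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5, 6], [9, 9]])
Z = linkage(X, method='single')                     # single-link, AGNES-style merges
labels = fcluster(Z, t=2.5, criterion='distance')   # cut the dendrogram at height 2.5
print(labels)  # each connected component below the cut becomes a cluster
```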
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES: start with all objects in one cluster and split
 Eventually each node forms a cluster on its own
[Figure: three scatter plots on a 0–10 × 0–10 grid showing DIANA splitting one all-inclusive cluster into progressively smaller clusters]
Distance between Clusters
 Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min{ d(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
 Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max{ d(tip, tjq) }
 Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg{ d(tip, tjq) }
 Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = d(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = d(Mi, Mj), where a medoid is a chosen, centrally located object in the cluster
(a NumPy sketch of these measures follows)
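A small NumPy sketch of the measures above, on two invented point sets Ki and Kj:

```python
import numpy as np

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[4.0, 3.0], [5.0, 3.0]])

# All pairwise distances d(t_ip, t_jq)
D = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=-1)

def medoid(K):
    # Medoid: the member with the smallest total distance to the others
    W = np.linalg.norm(K[:, None, :] - K[None, :, :], axis=-1)
    return K[W.sum(axis=1).argmin()]

print("single link  :", D.min())    # smallest pairwise distance
print("complete link:", D.max())    # largest pairwise distance
print("average      :", D.mean())   # mean pairwise distance
print("centroid     :", np.linalg.norm(Ki.mean(0) - Kj.mean(0)))
print("medoid       :", np.linalg.norm(medoid(Ki) - medoid(Kj)))
```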
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster: $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
 Radius: square root of the average distance from any point of the cluster to its centroid: $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
 Diameter: square root of the average mean squared distance between all pairs of points in the cluster: $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
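The three quantities are direct to compute; a sketch on an invented 2-D cluster:

```python
import numpy as np

T = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 4.0]])
N = len(T)

Cm = T.sum(axis=0) / N                               # centroid
Rm = np.sqrt(((T - Cm) ** 2).sum() / N)              # radius
pair = ((T[:, None, :] - T[None, :, :]) ** 2).sum()  # sum over all point pairs
Dm = np.sqrt(pair / (N * (N - 1)))                   # diameter
print(Cm, Rm, Dm)
```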
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods
 Can never undo what was done previously
 Do not scale well: time complexity of at least O(n²), where n is the total number of objects
 Integration of hierarchical & distance-based clustering
 BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
 CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
 Weakness: handles only numeric data, and is sensitive to the order of the data records
Clustering Feature Vector in BIRCH
 Clustering Feature (CF): CF = (N, LS, SS)
 N: number of data points
 LS: linear sum of the N points: $LS = \sum_{i=1}^{N} X_i$
 SS: square sum of the N points: $SS = \sum_{i=1}^{N} X_i^2$
 Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
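A small sketch of the CF bookkeeping (my own illustration; BIRCH stores these entries in a tree, not a flat object): N, LS, and SS are additive, so sub-clusters merge by summing components, and the diameter can be computed from the CF alone via the identity $\sum_{i,j}(x_i - x_j)^2 = 2N \cdot SS - 2 \cdot LS^2$ per dimension:

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS); additive across sub-clusters."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), p * p

    def add(self, other):
        # CF additivity: merging sub-clusters just sums the components
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def diameter(self):
        if self.N < 2:
            return 0.0
        num = (2 * self.N * self.SS - 2 * self.LS ** 2).sum()
        return float(np.sqrt(num / (self.N * (self.N - 1))))

# Example from the slide: should print N=5, LS=(16,30), SS=(54,190)
cf = CF((3, 4))
for p in [(2, 6), (4, 5), (4, 7), (3, 8)]:
    cf.add(CF(p))
print(cf.N, cf.LS, cf.SS, cf.diameter())
```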
CF-Tree in BIRCH
 Clustering feature:
 Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
 Registers crucial measurements for computing clusters and utilizes storage efficiently
 A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
 A nonleaf node in the tree has descendants or “children”
 The nonleaf nodes store the sums of the CFs of their children
 A CF-tree has two parameters
 Branching factor: max # of children
 Threshold: max diameter of the sub-clusters stored at the leaf nodes
The CF Tree Structure
[Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6 — a root holding entries CF1…CF6 with child pointers, non-leaf nodes holding CF1…CF5 with children, and leaf nodes holding CF entries chained together by prev/next pointers]
The BIRCH Algorithm
 Cluster diameter: $D = \sqrt{\frac{1}{n(n-1)} \sum_{i \neq j} (x_i - x_j)^2}$
 For each point in the input (a simplified sketch follows this slide)
 Find the closest leaf entry
 Add the point to the leaf entry and update the CF
 If the entry diameter > max_diameter, then split the leaf, and possibly its parents
 The algorithm is O(n)
 Concerns
 Sensitive to the insertion order of the data points
 Since the size of leaf nodes is fixed, the clusters may not be natural
 Clusters tend to be spherical given the radius and diameter measures
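A deliberately simplified sketch of the insertion step, reusing the CF class from the earlier sketch: a flat list of leaf entries stands in for the real height-balanced CF-tree, splitting is reduced to starting a new entry, and the threshold value is invented:

```python
import numpy as np
# Assumes the CF class defined in the Clustering Feature sketch above.

def birch_insert(entries, point, max_diameter=2.0):
    """Insert one point: absorb into the closest entry, or start a new one."""
    p = CF(point)
    if entries:
        # Closest leaf entry by centroid (LS / N) distance
        i = min(range(len(entries)),
                key=lambda k: np.linalg.norm(entries[k].LS / entries[k].N - p.LS))
        trial = CF(point)
        trial.add(entries[i])
        if trial.diameter() <= max_diameter:  # add point and update the CF
            entries[i].add(p)
            return entries
    entries.append(p)                         # "split": start a new entry
    return entries

entries = []
for pt in [(1, 1), (1.2, 0.9), (8, 8), (8.1, 8.2), (1.1, 1.0)]:
    birch_insert(entries, pt)
print([(e.N, e.LS.tolist()) for e in entries])
```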
CHAMELEON: Hierarchical Clustering
Using Dynamic Modeling (1999)
 CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity
and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
 Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
Overall Framework of CHAMELEON
 Data set → construct a sparse k-NN graph → partition the graph into many small sub-clusters → merge partitions → final clusters
 k-NN graph: p and q are connected if q is among the top k closest neighbors of p
 Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity
 Relative closeness: closeness of c1 and c2 over their internal closeness
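As an illustration of the first stage only (not CHAMELEON's own graph partitioner), a sparse k-NN graph can be built like this; the library choice, data, and k are mine:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3],
              [5, 5], [5.2, 5.1], [9, 0]])

# Sparse k-NN graph: an edge p -> q if q is among p's k nearest neighbors
G = kneighbors_graph(X, n_neighbors=2, mode='distance')

# Connected components of the symmetrized graph as crude sub-clusters
n, labels = connected_components(G, directed=False)
print(n, labels)
```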
CHAMELEON (Clustering Complex Objects)
[Figure: example datasets with complex, non-spherical cluster shapes that CHAMELEON identifies correctly]
Probabilistic Hierarchical Clustering
 Algorithmic hierarchical clustering
 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal not clear: heuristic, local search
 Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
 Easy to understand, same efficiency as algorithmic agglomerative clustering methods, can handle partially observed data
 In practice, assume the generative models adopt common distribution functions, e.g., the Gaussian or Bernoulli distribution, governed by parameters
Generative Model
 Given a set of 1-D points X = {x1, …, xn} for clustering analysis, assume they are generated by a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$
 The probability that a point xi ∈ X is generated by the model: $P(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$
 The likelihood that X is generated by the model: $L(\mathcal{N}(\mu, \sigma^2) : X) = \prod_{i=1}^{n} P(x_i \mid \mu, \sigma^2)$
 The task of learning the generative model: find the parameters μ and σ² that maximize the likelihood, i.e., $\mathcal{N}(\mu_0, \sigma_0^2) = \operatorname{argmax}\, L(\mathcal{N}(\mu, \sigma^2) : X)$
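For a Gaussian, the maximum-likelihood fit has the well-known closed form: μ is the sample mean and σ² the (1/n) sample variance. A sketch on an invented sample:

```python
import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.0, 1.1])   # invented 1-D sample
mu = x.mean()
sigma2 = x.var()                           # MLE uses the 1/n variance

log_likelihood = np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                        - (x - mu) ** 2 / (2 * sigma2))
print(mu, sigma2, log_likelihood)
```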
A Probabilistic Hierarchical Clustering Algorithm
 For a set of objects partitioned into m clusters C1, …, Cm, the quality can be measured by $Q(\{C_1, \ldots, C_m\}) = \prod_{i=1}^{m} P(C_i)$, where P() is the maximum likelihood
 Distance between clusters C1 and C2: $dist(C_1, C_2) = -\log \frac{P(C_1 \cup C_2)}{P(C_1)\, P(C_2)}$
 Algorithm: progressively merge points and clusters (a toy sketch follows)
Input: D = {o1, ..., on}: a data set containing n objects
Output: a hierarchy of clusters
Method:
  Create a cluster for each object: Ci = {oi}, 1 ≤ i ≤ n;
  For i = 1 to n {
    Find the pair of clusters Ci and Cj such that
      Ci, Cj = argmax_{i ≠ j} { log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) };
    If log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) > 0 then merge Ci and Cj
  }
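A toy sketch of this merge loop for 1-D points. One caveat: the pure maximum likelihood P(C) is degenerate for singleton clusters, so this sketch substitutes a Bayesian marginal likelihood (fixed within-cluster variance, Gaussian prior on the mean, both values invented), which keeps the merge criterion well-defined:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

SIGMA, TAU = 1.0, 3.0  # assumed within-cluster std and prior std on the mean

def log_P(c):
    """log P(C): marginal likelihood of cluster c with mu ~ N(0, TAU^2) and
    x | mu ~ N(mu, SIGMA^2), integrated numerically (a stand-in for the
    slide's maximum likelihood, which is undefined for singletons)."""
    c = np.asarray(c, dtype=float)
    f = lambda mu: np.prod(norm.pdf(c, mu, SIGMA)) * norm.pdf(mu, 0, TAU)
    val, _ = quad(f, -30.0, 30.0)
    return float(np.log(val))

clusters = [[x] for x in [1.0, 1.2, 0.9, 5.0, 5.3]]
while len(clusters) > 1:
    # Pick the pair maximizing log( P(Ci U Cj) / (P(Ci) P(Cj)) )
    gain, a, b = max((log_P(ci + cj) - log_P(ci) - log_P(cj), i, j)
                     for i, ci in enumerate(clusters)
                     for j, cj in enumerate(clusters) if i < j)
    if gain <= 0:
        break              # no merge improves the quality; stop
    clusters[a] += clusters.pop(b)
print(clusters)
```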
By Weka App. Do Hierarchical Cluster
 Open Weka -> Explorer
 Choose the iris.arff file
 Click the Cluster tab.
 Choose clusterers -> HierarchicalClusterer
 In the Cluster mode section, select the “Classes to clusters evaluation” radio button
 Click the Start button.
 You can visualize the output by right-clicking on the result list and selecting “Visualize tree” (a scripted sketch follows this list)
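A rough scripted analogue with scikit-learn (again my choice; the slides drive Weka's GUI, and the default ward linkage here is an assumption, not a Weka setting):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(iris.data)

# "Classes to clusters" style check: class counts inside each cluster
for c in range(3):
    print(c, [int(((labels == c) & (iris.target == k)).sum()) for k in range(3)])
```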
By Weka App. Do Hierarchical Cluster
 Open Weka -> Explorer
 Choose the contact-lenses.arff file
 Click the Cluster tab.
 Choose clusterers -> HierarchicalClusterer
 In the settings, choose “distanceFunction” and change it to “ManhattanDistance”.
 In the settings, change numClusters to 6.
 Click the Start button.
 You can visualize the output by right-clicking on the result list and selecting “Visualize tree”
 Try it with LinkType = Complete (a SciPy sketch of these settings follows this list)
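A SciPy sketch mirroring these settings: Manhattan distance with complete linkage, cut into 6 clusters (numeric random data stands in for the nominal contact-lenses attributes, which SciPy cannot consume directly):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).rand(20, 4)
Z = linkage(X, method='complete', metric='cityblock')  # Manhattan distance
labels = fcluster(Z, t=6, criterion='maxclust')        # numClusters = 6
print(labels)
```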
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
