Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 10 —
Cluster Analysis: Basic Concepts and Methods
Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input, but it
needs a termination condition
[Figure: hierarchical clustering of five objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds from Step 0 to Step 4: a and b merge into ab, d and e into de, c joins de to form cde, and finally ab and cde merge into abcde. Divisive clustering (DIANA) performs the same sequence in reverse, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., S-Plus
Uses the single-link method and the dissimilarity matrix
Merges the nodes that have the least dissimilarity
Continues in a non-descending fashion (successive merge distances never decrease)
Eventually, all nodes belong to the same cluster (see the sketch below)
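A minimal single-link agglomerative run using SciPy; the points and cluster count are invented for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points; each starts as its own cluster
points = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [5, 5]])

# 'single' linkage merges, at each step, the pair of clusters
# with the smallest minimum pairwise distance (least dissimilarity)
Z = linkage(points, method='single')

# Cut the hierarchy into 2 flat clusters (the termination condition)
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```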
[Figure: three scatter plots on a 0-10 grid showing the data points at successive stages of AGNES merging.]
Dendrogram: Shows How Clusters Are Merged
Decomposes data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster.
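A quick sketch of building and drawing a dendrogram with SciPy and matplotlib (the points and labels are made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative points; each starts as its own leaf of the tree
points = np.array([[1, 1], [2, 1], [6, 5], [7, 5], [4, 9]])

# Build the merge tree, then render it as a dendrogram
Z = linkage(points, method='single')
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
plt.ylabel('merge distance')
plt.show()
```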
DIANA (Divisive Analysis)
Works in the inverse order of AGNES: starts with all objects in one cluster and repeatedly splits until each object forms a cluster of its own.
[Figure: three scatter plots on a 0-10 grid showing the data points at successive stages of DIANA splitting.]
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \min \{ d(t_{ip}, t_{jq}) : t_{ip} \in K_i,\ t_{jq} \in K_j \}$
Complete link: largest distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \max \{ d(t_{ip}, t_{jq}) : t_{ip} \in K_i,\ t_{jq} \in K_j \}$
Average: average distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \frac{1}{|K_i|\,|K_j|} \sum_{t_{ip} \in K_i} \sum_{t_{jq} \in K_j} d(t_{ip}, t_{jq})$
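A small sketch computing all three linkage distances with SciPy; the two clusters are invented for the example:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
Ki = np.array([[0, 0], [1, 0]])
Kj = np.array([[4, 3], [5, 5]])

D = cdist(Ki, Kj)              # all pairwise distances d(t_ip, t_jq)
print('single  :', D.min())    # smallest pairwise distance
print('complete:', D.max())    # largest pairwise distance
print('average :', D.mean())   # mean over all |Ki|*|Kj| pairs
```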
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering methods
Can never undo what was done previously
Do not scale well: time complexity of at least O(n²), where
n is the total number of objects
Integration of hierarchical & distance-based clustering
BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
CHAMELEON (1999): hierarchical clustering using
dynamic modeling
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
Zhang, Ramakrishnan & Livny, SIGMOD’96
Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
Weakness: handles only numeric data and is sensitive to the order of the
data records
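scikit-learn ships a BIRCH implementation; a minimal usage sketch (the data and parameter values are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic 2-D data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2))])

# threshold bounds the radius of sub-clusters stored in the CF tree;
# branching_factor caps the number of CF entries per node
model = Birch(threshold=1.0, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(labels[:10])
```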
Clustering Feature Vector in BIRCH
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points: $LS = \sum_{i=1}^{N} X_i$
SS: square sum of the N points: $SS = \sum_{i=1}^{N} X_i^2$
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give
CF = (5, (16,30), (54,190))
[Figure: the five points plotted on a 0-10 grid]
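A tiny sketch verifying the CF triple for the five example points:

```python
import numpy as np

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N = len(points)                  # 0th moment: point count
LS = points.sum(axis=0)          # 1st moment: per-dimension linear sum
SS = (points ** 2).sum(axis=0)   # 2nd moment: per-dimension square sum

print(N, tuple(LS), tuple(SS))   # -> 5 (16, 30) (54, 190)
```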
CF-Tree in BIRCH
Clustering feature: summary of the statistics for a given subcluster (the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view); registers crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering; a non-leaf node stores the sums of the CFs of its children. It has two parameters: a branching factor (maximum number of children per node) and a threshold (maximum diameter of the sub-clusters stored at the leaf nodes)
The CF Tree Structure
[Figure: a CF tree with a root and non-leaf nodes; each node holds entries CF1, CF2, CF3, ..., and each entry CFi points to a child node (child1, child2, child3, ...).]
The BIRCH Algorithm
Cluster diameter: $D = \sqrt{\frac{1}{n(n-1)} \sum_{i \neq j} (x_i - x_j)^2}$
For each incoming point: find the closest leaf entry, add the point to it, and update its CF; if the entry's diameter then exceeds the threshold, split the leaf, and possibly its parents
The algorithm is O(n)
Concerns:
Sensitive to the insertion order of the data points
Because the size of leaf nodes is fixed, the resulting clusters may not be natural
Clusters tend to be spherical given the radius and diameter measures
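A quick numerical check of the diameter formula on a toy 1-D cluster (values invented):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])   # toy 1-D cluster
n = len(x)

# Average squared difference over the n(n-1) ordered pairs i != j
diff2 = (x[:, None] - x[None, :]) ** 2   # diagonal terms are zero
D = np.sqrt(diff2.sum() / (n * (n - 1)))
print(D)   # sqrt(12 / 6) = 1.414...
```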
CHAMELEON: Hierarchical Clustering
Using Dynamic Modeling (1999)
CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity
and closeness (proximity) between the two clusters are
high relative to the internal interconnectivity of the
clusters and the closeness of items within the clusters
Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
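CHAMELEON's own implementation isn't in standard Python libraries, but the phase-1 sparse k-NN graph is easy to sketch with scikit-learn (a hypothetical stand-in, not the authors' code):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.default_rng(2).random((100, 2))   # toy data

# Sparse graph: an edge p -> q if q is among p's k nearest neighbors;
# this graph would then be partitioned into many small sub-clusters
G = kneighbors_graph(X, n_neighbors=5, mode='distance')
print(G.shape, G.nnz)   # 100x100 sparse matrix with 100*5 edges
```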
Overall Framework of CHAMELEON
[Figure: Data Set → construct a sparse k-NN graph → partition the graph into many small sub-clusters → merge partitions → Final Clusters]
k-NN graph: p and q are connected if q is among the top k closest neighbors of p
Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity
Relative closeness: closeness of c1 and c2 over their internal closeness
CHAMELEON (Clustering Complex Objects)
[Figure: example results of CHAMELEON on datasets with complex, non-spherical cluster shapes]
Probabilistic Hierarchical Clustering
Algorithmic hierarchical clustering
Nontrivial to choose a good distance measure
Hard to handle missing attribute values
Optimization goal not clear: heuristic, local search
Probabilistic hierarchical clustering
Use probabilistic models to measure distances between clusters
Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
Easy to understand; same efficiency as algorithmic agglomerative
clustering methods; can handle partially observed data
In practice, the generative models are assumed to adopt common distribution
functions, e.g., the Gaussian or Bernoulli distribution, governed
by parameters
Generative Model
Given a set of 1-D points X = {x1, ..., xn} for clustering
analysis, assume they are generated by a Gaussian distribution:
$\mathcal{N}(\mu, \sigma^2):\; P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
The parameters $\mu$ and $\sigma^2$ are estimated by maximizing the likelihood
$L(\mathcal{N}(\mu, \sigma^2) : X) = P(X \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$
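For a single Gaussian, the maximum likelihood estimates are the sample mean and the (biased, divide-by-n) sample variance; a minimal sketch with invented data:

```python
import numpy as np

x = np.array([1.2, 0.8, 1.1, 5.0, 4.9, 5.2])   # toy 1-D points

# Maximum likelihood estimates for a single Gaussian
mu = x.mean()                      # MLE of the mean
sigma2 = ((x - mu) ** 2).mean()    # MLE of the variance (divides by n)
print(mu, sigma2)
```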
A Probabilistic Hierarchical Clustering
Algorithm
For a set of objects partitioned into m clusters C1, ..., Cm, the quality can
be measured by
$Q(\{C_1, \dots, C_m\}) = \prod_{i=1}^{m} P(C_i)$
where $P(C_i)$ is the maximum likelihood of the objects in $C_i$ under the chosen generative model. The distance between two clusters is then $dist(C_i, C_j) = -\log \frac{P(C_i \cup C_j)}{P(C_i)\,P(C_j)}$, and the algorithm iteratively merges the pair of clusters that maximizes the resulting quality
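A rough sketch of this merge criterion under a 1-D Gaussian model; the helper name log_p and the sample clusters are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def log_p(cluster):
    # Max log-likelihood of a cluster under its own fitted Gaussian
    mu, sigma = cluster.mean(), cluster.std() + 1e-9
    return norm.logpdf(cluster, mu, sigma).sum()

c1 = np.array([1.0, 1.2, 0.9])
c2 = np.array([1.1, 1.3, 0.8])

# dist(Ci, Cj) = -log( P(Ci u Cj) / (P(Ci) P(Cj)) ), in log space;
# a small (or negative) value favors merging the two clusters
dist = -(log_p(np.concatenate([c1, c2])) - log_p(c1) - log_p(c2))
print(dist)
```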
Hierarchical Clustering with the Weka App
Open Weka -> Explorer
Choose the contact-lenses.arff file
Click the Cluster tab
Choose clusterers -> HierarchicalClusterer
In the settings, choose "distanceFunction" and change it to "ManhattanDistance"
In the settings, change numClusters to 6
Click the Start button
You can visualize the output by right-clicking the entry in the result list and selecting "Visualize tree"
Try it with linkType = COMPLETE (a SciPy approximation of the same setup is sketched below)
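Outside Weka, roughly the same setup can be approximated with SciPy; note that contact-lenses has nominal attributes, so this sketch substitutes toy numeric data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy numeric data standing in for an ARFF dataset (24 rows)
X = np.random.default_rng(1).random((24, 4))

# Manhattan (cityblock) distances + complete linkage,
# cut into 6 clusters as in the Weka walkthrough
Z = linkage(pdist(X, metric='cityblock'), method='complete')
labels = fcluster(Z, t=6, criterion='maxclust')
print(labels)
```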
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary