17 GM ASAP Data Mining - Clustering
Clustering
Cluster analysis or simply clustering is the process
of partitioning a set of data objects (or observations)
into subsets. Each subset is a cluster, such that
objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of
clusters resulting from a cluster analysis can be
referred to as a clustering.
In this context, different clustering methods may
generate different clusterings on the same data set.
The same clustering method equipped with different
parameters or even
different initializations may also produce different
clusterings. Such partitioning is not performed by
humans, but by a clustering algorithm. Hence,
clustering is useful in that it can lead to the discovery
of previously unknown groups within the data.
• Cluster analysis
– What is cluster analysis?
– Requirements for cluster analysis
– Overview of basic clustering methods
• Partitioning methods
• Hierarchical methods
• Density-based and grid-based methods
• Evaluation of clustering
What is cluster analysis?
Clustering is also called data segmentation in some
applications because clustering partitions large data sets into
groups according to their similarity. Clustering can also be
used for outlier detection, where outliers (values that are
“far away” from any cluster) may be more interesting than
common cases. Applications of outlier detection include the
detection of credit card frauds and the monitoring of criminal
activities in electronic commerce.
For example, exceptional cases in credit card transactions,
such as very expensive and infrequent purchases at unusual
locations, may be of interest as possible fraudulent activities.
Data clustering is under vigorous development. Contributing
areas of research include data mining, statistics, machine
learning and deep learning, spatial database technology,
information retrieval,
Web search, biology, marketing, and many other application
areas. Owing to the huge amounts of data collected in
databases, cluster analysis has become a highly active topic
in data mining research.
What is cluster analysis?
As a branch of statistics, cluster analysis has been
extensively studied, with the main focus on distance-based cluster
analysis. Cluster analysis tools based on k-means, k-medoids, and
several other methods also have been built into many statistical
analysis software packages or systems, such as SPlus, SPSS, and
SAS. In machine learning, classification is known as supervised
learning
because the class label information is given, that is, the learning
algorithm is supervised in that it is told the class membership of
each training tuple.
Clustering is known as unsupervised learning because the
class label information is not present. For this reason, clustering is a
form of learning by observation, rather than learning by
examples. In data mining, efforts have focused on finding methods
for efficient and effective cluster analysis in large data sets. Active
themes of research focus on the scalability of clustering methods,
the effectiveness of methods for clustering complex shapes (e.g.,
nonconvex) and types of data (e.g., text, graphs, and images), high-
dimensional clustering techniques (e.g., clustering objects with
thousands or even millions of features), and methods for clustering
mixed numerical and nominal data in large data sets.
Cluster analysis
– Partitioning methods
– Hierarchical methods
– Density-based and grid-based methods
– Evaluation of clustering
Cluster Analysis
• When flying over a city, one can easily identify fields,
forests, commercial areas, and residential areas based on
their features, without anyone’s explicit “training”—This is
the power of cluster analysis
• This chapter and the next systematically study cluster
analysis methods and help answer the following:
– What are the different proximity measures for effective
clustering?
– Can we cluster a massive number of data points
efficiently?
– Can we find clusters of arbitrary shape? At multiple levels
of granularity?
– How can we judge the quality of the clusters discovered
by our system?
The Value of Cluster Analysis
• Cluster analysis
• Partitioning methods
– K-means: a centroid-based technique
– Variations of k-means
• Hierarchical methods
• Density-based and grid-based methods
• Evaluation of clustering
Partitioning Algorithms: Basic Concepts
The K-Means Clustering Method
• K-Means
– Each cluster is represented by the center of the cluster
• Given K, the number of clusters, the K-Means clustering
algorithm is outlined as follows
• Select K points as initial centroids
• Repeat
– Form K clusters by assigning each point to its closest
centroid
– Re-compute the centroids (i.e., mean point) of each
cluster
• Until the convergence criterion is satisfied (a code sketch of this loop follows)
• Different kinds of measures can be used
– Manhattan distance (L1 norm), Euclidean distance (L2
norm), Cosine similarity
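A minimal NumPy sketch of the loop above, using Euclidean (L2) distance and stopping when the centroids no longer move; the function name and defaults are illustrative, not part of the original slides.

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain K-Means with Euclidean (L2) distance on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    # Select K points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute each centroid as the mean point of its cluster
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Convergence criterion: centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids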
Example: K-Means Clustering
[Figure: K-Means iterations on a 2-D data set: assign points to the closest cluster centers, recompute the cluster centers, and repeat]
Example: K-Medoids Clustering (K = 2)
[Figure: K-Medoids iterations on a 2-D data set]
• Select the initial K medoids randomly (arbitrarily choose K objects as the initial medoids)
• Assign each remaining object to the nearest medoid
• Repeat
– Randomly select a non-medoid object, O_random
– Compute the total cost of swapping a medoid with O_random
– Swap medoid m with O_random if it improves the clustering quality
• Until the quality can no longer be improved
Example: Kernel K-Means Clustering
[Figure panels: the original data set; the result of K-Means clustering; the result of Gaussian Kernel K-Means clustering]
• The above data set cannot be clustered well by K-Means, since it contains non-convex clusters
• A Gaussian RBF kernel transformation maps the data to a kernel matrix K: for any two points x_i and x_j, K(x_i, x_j) = φ(x_i) · φ(x_j), with the Gaussian kernel K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))
• K-Means clustering is then conducted on the mapped data, generating quality clusters (a code sketch follows)
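A sketch of the kernel-based procedure described above. It assumes scikit-learn's rbf_kernel for the Gaussian kernel matrix and computes point-to-cluster distances purely from kernel entries; the function name, gamma, and iteration cap are illustrative.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel  # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), gamma = 1/(2*sigma^2)

def kernel_k_means(X, k, gamma=1.0, n_iter=100, seed=0):
    """Kernel K-Means: K-Means carried out in the feature space induced by the RBF kernel."""
    K = rbf_kernel(X, gamma=gamma)            # n x n kernel matrix
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)       # random initial assignment
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)        # empty clusters keep infinite distance
        for c in range(k):
            members = labels == c
            nc = members.sum()
            if nc == 0:
                continue
            # ||phi(x_i) - mu_c||^2 = K_ii - 2/|c| * sum_{j in c} K_ij + 1/|c|^2 * sum_{j,l in c} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / nc
                          + K[np.ix_(members, members)].sum() / nc ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels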
Outline
• Cluster analysis
• Partitioning methods
• Hierarchical methods
– Basic concepts of hierarchical clustering
– Agglomerative hierarchical clustering
– Divisive hierarchical clustering
– BIRCH: scalable hierarchical clustering using
clustering feature trees
– Probabilistic hierarchical clustering
• Density-based and grid-based methods
• Evaluation of clustering
Hierarchical Clustering: Basic Concepts
• Hierarchical clustering
– Generate a clustering hierarchy (drawn as a
dendrogram)
– Not required to specify K, the number of clusters
– More deterministic
– No iterative refinement
• Two categories of algorithms
– Agglomerative: Start with singleton clusters,
continuously merge two clusters at a time to build a
bottom-up hierarchy of clusters
– Divisive: Start with a huge macro-cluster, split it
continuously into two groups, generating a top-down
hierarchy of clusters
Agglomerative vs. Divisive Clustering
[Figure: AGNES vs. DIANA on objects a, b, c, d, e over steps 0 to 4: agglomerative (AGNES) merges bottom-up (a,b into ab; d,e into de; c,de into cde; ab,cde into abcde), while divisive (DIANA) performs the same splits top-down in reverse order]
Single Link vs. Complete Link in Hierarchical Clustering
• Single link (nearest neighbor)
– The similarity between two clusters is the similarity
between their most similar (nearest neighbor) members
– Local similarity-based: emphasizes close regions while ignoring the overall structure of the cluster
– Capable of clustering non-elliptical groups of objects
– Sensitive to noise and outliers
[Figure: two clusters Ca (with Na objects) and Cb (with Nb objects); single link measures the distance between their closest pair of members]
(A linkage-comparison code sketch follows.)
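To make the single-link vs. complete-link contrast concrete, a short sketch using SciPy's hierarchical clustering; the toy data and the cut into two flat clusters are made up for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two elongated groups of points
X = np.array([[0, 0], [0, 1], [0, 2], [5, 0], [5, 1], [5, 2]], dtype=float)

# Single link: merge clusters by the distance of their closest pair of members
single = linkage(X, method='single')
# Complete link: merge clusters by the distance of their farthest pair of members
complete = linkage(X, method='complete')

# Cut each dendrogram into two flat clusters
print(fcluster(single, t=2, criterion='maxclust'))
print(fcluster(complete, t=2, criterion='maxclust'))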
Agglomerative Clustering with
Ward’s Criterion
Divisive Clustering
• DIANA (Divisive Analysis)
– Implemented in some statistical analysis
packages, e.g., Splus
• Inverse order of AGNES: Eventually each
node forms a cluster on its own
Divisive Clustering Is a Top-down Approach
• The process starts at the root with all the points
as one cluster
• It recursively splits the higher level clusters to
build the dendrogram
• Can be considered as a global approach
• More efficient when compared with
agglomerative clustering
More on Algorithm Design for Divisive Clustering
• Choosing which cluster to split
– Check the sums of squared errors (SSE) of the clusters and choose the one with the largest value (see the sketch after this list)
• Splitting criterion: determining how to split
– One may use Ward's criterion and choose the split that gives the greatest reduction in the SSE
– For categorical data, the Gini index can be used
• Handling noise
– Use a threshold to determine the termination criterion (do not generate clusters that are too small, because they contain mainly noise)
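A small sketch of the "choose the cluster with the largest SSE" rule above; the function name is illustrative, and the selected cluster would then be split (e.g., by running 2-means on its members).

import numpy as np

def cluster_to_split(X, labels):
    """Pick the cluster with the largest sum of squared errors (SSE) as the next split candidate."""
    sse = {}
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        sse[c] = float(((members - centroid) ** 2).sum())
    # Return the label of the cluster with the largest SSE, plus all SSE values
    return max(sse, key=sse.get), sse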
Extensions to Hierarchical Clustering
• Weakness of the agglomerative & divisive
hierarchical clustering methods
– No revisit: cannot undo any merge/split decisions made
before
– Scalability bottleneck: Each merge/split needs to
examine many possible options
• Time complexity: at least O(n²), where n is the total number of objects
• Several other hierarchical clustering algorithms
– BIRCH (1996): Use CF-tree and incrementally adjust
the quality of sub-clusters
– CURE (1998): Represent a cluster using a set of well-
scattered representative points
– CHAMELEON (1999): Use graph partitioning methods
on the K-nearest neighbor graph of the data
BIRCH: A Multi-Phase Hierarchical Clustering
Method
• BIRCH (Balanced Iterative Reducing and Clustering Using
Hierarchies)
– Developed by Zhang, Ramakrishnan & Livny (SIGMOD’96)
– Influenced many new clustering methods and applications (received the 2006 SIGMOD Test of Time Award)
• Major innovation
– Integrating hierarchical clustering (initial micro-clustering phase)
and other clustering methods (at the later macro-clustering phase)
• Multi-phase hierarchical clustering
– Phase 1 (initial micro-clustering): Scan DB to build an initial CF tree,
a multi-level compression of the data to preserve the inherent
clustering structure of the data
– Phase 2 (later macro-clustering): Use an arbitrary clustering
algorithm (e.g., iterative partitioning) to cluster flexibly the leaf nodes
of the CF-tree
Clustering Feature Vector
• Clustering Feature (CF): a summary of the statistics for the given cluster, CF = (n, LS, SS), where n is the number of points, LS is their linear sum, and SS is their sum of squares
• Example: a cluster containing the five points (3,4), (2,6), (4,5), (4,7), (3,8)
– n = 5
– LS = (3+2+4+4+3, 4+6+5+7+8) = (16, 30)
– SS = (3²+2²+4²+4²+3²) + (4²+6²+5²+7²+8²) = 54 + 190 = 244
(A code sketch of the CF computation follows.)
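A short sketch of the CF computation in the example above, plus the additivity property BIRCH relies on when merging sub-clusters; the function names are illustrative.

import numpy as np

def clustering_feature(points):
    """Return CF = (n, LS, SS) for a set of d-dimensional points."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    ls = pts.sum(axis=0)            # linear sum of the points
    ss = float((pts ** 2).sum())    # sum of squares of the points
    return n, ls, ss

def merge_cf(cf_a, cf_b):
    """CF additivity: the CF of the union of two disjoint clusters is the component-wise sum."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

# The five points from the example give n = 5, LS = (16, 30), SS = 244
print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))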
Essential Measures of Cluster: Centroid, Radius
and Diameter
– Centroid x₀ = (Σᵢ xᵢ) / n: the "middle" of the cluster
– Radius R = sqrt( Σᵢ ‖xᵢ − x₀‖² / n ): the average distance from member points to the centroid
– Diameter D = sqrt( Σᵢ Σⱼ ‖xᵢ − xⱼ‖² / (n(n−1)) ): the average pairwise distance within the cluster
CF Tree: A Height-Balanced Tree Storing Clustering
Features for Hierarchical Clustering
[Figure: a CF-tree with branching factor B = 7 and maximum leaf entries L = 6; the root holds entries CF1 … CF6 with pointers to child nodes, non-leaf nodes hold entries such as CF11 … CF15 summarizing their children, and leaf nodes store CF entries and are chained by prev/next pointers]
BIRCH: A Scalable and Flexible Clustering
Method
DBSCAN: The Algorithm
[Figure: core, border, and outlier points]
• Core point: has a dense neighborhood
• Border point: belongs to a cluster, but its neighborhood is not dense
• Outlier/noise point: not in any cluster
(A code sketch follows.)
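A brief sketch using scikit-learn's DBSCAN; the toy data, eps, and min_samples values are illustrative and would need tuning for real data.

import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus one isolated point
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [4.0, 15.0]])

# eps: neighborhood radius; min_samples: points needed for a dense (core) neighborhood
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # outlier/noise points are labeled -1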
DBSCAN Is Sensitive to the Setting
of Parameters
Ack. Figures from G. Karypis, E.-H. Han, and V. Kumar, COMPUTER, 32(8),
1999
OPTICS: Ordering Points To Identify
Clustering Structure
• OPTICS (Ankerst, Breunig, Kriegel, and Sander,
SIGMOD’99)
– DBSCAN is sensitive to parameter setting
– An extension: finding clustering structure
• Observation: Given a MinPts, density-based clusters w.r.t. a higher density are completely contained in clusters w.r.t. a lower density
• Idea: Higher density points should be processed first—find
high-density clusters first
• OPTICS stores such a clustering order using two pieces of
information:
– Core distance and reachability distance
Visualization
• Since points belonging to a cluster have a
low reachability distance to their nearest
neighbor, valleys correspond to clusters
• The deeper the valley, the denser the
cluster
[Figure: reachability plot for a data set; x-axis: cluster order of the objects; y-axis: reachability distance (undefined values shown as gaps); valleys correspond to clusters]
OPTICS: An Extension from DBSCAN
• Core distance of an object p: the smallest value ε such that the ε-neighborhood of p has at least MinPts objects
– Let Nε(p) be the ε-neighborhood of p, where ε is a distance value
– Core-distanceε,MinPts(p) = Undefined, if card(Nε(p)) < MinPts; MinPts-distance(p), otherwise
OPTICS: An Extension from DBSCAN
• Reachability distance of object q from core object p: the minimum radius value that makes q density-reachable from p
– Reachability-distanceε,MinPts(p, q) = Undefined, if p is not a core object; max(core-distance(p), distance(p, q)), otherwise
(A code sketch follows.)
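A sketch using scikit-learn's OPTICS, which exposes the cluster ordering and reachability distances used to draw the reachability plot; the data and min_samples value are illustrative.

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data
optics = OPTICS(min_samples=5).fit(X)

# Reachability distances listed in cluster order: valleys in this sequence correspond to clusters
reachability_in_order = optics.reachability_[optics.ordering_]
print(reachability_in_order[:10])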
[Fragment of a grid-based clustering slide: each cell at the i-th layer stores statistical information about the cell, including count, mean, standard deviation (s), min, and max]
Criteria for extrinsic clustering evaluation (fragment):
– Rag bag better than alien: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category)
– Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces
Commonly Used Extrinsic Methods
• Matching-based methods
– Examine how well the clustering results match the ground truth in partitioning the objects in the data set
[Figure: clusters C1 (Cluster 1) and C2 (Cluster 2) compared against a ground-truth partitioning G]
• Information theory-based methods
– Compare the distribution of the clustering results and that of the
ground truth
– Information theory (e.g., entropy) used to quantify the
comparison
– Ex. Conditional entropy, normalized mutual information (NMI)
• Pairwise comparison-based methods
– Treat each group in the ground truth as a class, and then check
the pairwise consistency of the objects in the clustering results
– Ex. Four possibilities: TP, FN, FP, TN; Jaccard coefficient
Matching-Based Methods
[Table: contingency of clusters C1, C2, C3 against ground-truth groups G1, G2 (cell values not recoverable)]
Matching-Based Methods: Example
• Consider 11 objects grouped into clusters C1, C2, C3, with ground-truth groups G1, G2
[Table: contingency cell values not recoverable]
• Other methods:
– Maximum matching; F-measure
(A purity sketch follows as an illustration of a matching-based measure.)
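As an illustration of a matching-based measure, a sketch of purity (the fraction of objects that fall in their cluster's dominant ground-truth group); the labels below are made up, since the slide's actual counts are not recoverable.

import numpy as np

def purity(ground_truth, clusters):
    """Matching-based measure: sum over clusters of the size of the dominant ground-truth group, divided by n."""
    g = np.asarray(ground_truth)
    c = np.asarray(clusters)
    matched = 0
    for ci in np.unique(c):
        _, counts = np.unique(g[c == ci], return_counts=True)
        matched += counts.max()      # objects from the dominant ground-truth group in this cluster
    return matched / len(g)

# Illustrative labels for 11 objects
print(purity([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
             [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]))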
Information Theory-Based Methods (I): Conditional Entropy
• A clustering can be regarded as a compressed representation of a given set of objects
• The better the clustering results approach the ground truth, the less information is needed to recover the ground truth from the clustering
• This idea leads to the use of conditional entropy
Information Theory-Based Methods (I): Conditional Entropy
Example
• Consider 11 objects grouped into clusters C1, C2, C3, with ground-truth groups G1, G2
[Table: contingency cell values not recoverable]
• Note: conditional entropy cannot detect the issue that C1 splits the objects in G into two clusters
Information Theory-Based Methods (II)
Normalized Mutual Information (NMI)
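A hedged sketch of the two information-theoretic measures: conditional entropy H(G|C) computed from label counts, and NMI via scikit-learn's normalized_mutual_info_score; the example labels are illustrative.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def conditional_entropy(ground_truth, clusters):
    """H(G | C): expected entropy (in bits) of the ground-truth groups within each cluster."""
    g = np.asarray(ground_truth)
    c = np.asarray(clusters)
    n = len(g)
    h = 0.0
    for ci in np.unique(c):
        members = g[c == ci]
        p_c = len(members) / n
        _, counts = np.unique(members, return_counts=True)
        p_g_given_c = counts / counts.sum()
        h += p_c * -(p_g_given_c * np.log2(p_g_given_c)).sum()
    return h

ground_truth = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   # G1, G2 for 11 objects (illustrative)
clusters     = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]   # C1, C2, C3 (illustrative)
print(conditional_entropy(ground_truth, clusters))
print(normalized_mutual_info_score(ground_truth, clusters))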
Pairwise Comparison-Based
Methods: Jaccard Coefficient
• Pairwise comparison: treat each group in the ground truth as a class
• For each pair of objects (oi, oj) in D, if they are assigned to the same
cluster/group, the assignment is regarded as positive; otherwise,
negative
– Depending on the assignments, there are four possible cases: TP, FN, FP, TN
– Note: the total number of pairs of points is N = C(n, 2) = n(n−1)/2
(A pair-counting sketch follows.)
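A sketch of the pairwise comparison: counting TP, FN, FP, TN over all n(n−1)/2 pairs and computing the Jaccard coefficient TP / (TP + FN + FP); the example labels are illustrative.

from itertools import combinations

def pairwise_counts(ground_truth, clusters):
    """Count TP, FN, FP, TN over all pairs of objects."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(ground_truth)), 2):
        same_truth = ground_truth[i] == ground_truth[j]
        same_cluster = clusters[i] == clusters[j]
        if same_truth and same_cluster:
            tp += 1          # same group in ground truth, same cluster
        elif same_truth and not same_cluster:
            fn += 1          # same group, but split into different clusters
        elif not same_truth and same_cluster:
            fp += 1          # different groups, but placed in the same cluster
        else:
            tn += 1          # different groups, different clusters
    return tp, fn, fp, tn

def jaccard_coefficient(ground_truth, clusters):
    tp, fn, fp, _ = pairwise_counts(ground_truth, clusters)
    # Jaccard ignores TN: TP / (TP + FN + FP)
    return tp / (tp + fn + fp)

labels_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   # illustrative ground truth
labels_pred = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]   # illustrative clustering
print(pairwise_counts(labels_true, labels_pred))   # total pairs = 11 * 10 / 2 = 55
print(jaccard_coefficient(labels_true, labels_pred))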