ASAP Data Mining: Cluster Analysis: Basic Concepts and Methods
FOUNDATION TO DATA SCIENCE
Business Analytics

UNIT 2.2: DATA MINING


17. Discovering Patterns from Data - Clustering

Prof. Dr. George Mathew


B.Sc., B.Tech, PGDCA, PGDM, MBA, PhD
Mail ID: [email protected]
Site: https://relsoft.in/

Clustering
Cluster analysis or simply clustering is the process
of partitioning a set of data objects (or observations)
into subsets. Each subset is a cluster, such that
objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of
clusters resulting from a cluster analysis can be
referred to as a clustering.
In this context, different clustering methods may
generate different clusterings on the same data set.
The same clustering method equipped with different
parameters or even
different initializations may also produce different
clusterings. Such partitioning is not performed by
humans, but by a clustering algorithm. Hence,
clustering is useful in that it can lead to the discovery
of previously unknown groups within the data.
• Cluster analysis
– What is cluster analysis?
– Requirements for cluster analysis
– Overview of basic clustering methods
• Partitioning methods
• Hierarchical methods
• Density-based and grid-based methods
• Evaluation of clustering
Cluster analysis?
Clustering is also called data segmentation in some
applications because clustering partitions large data sets into
groups according to their similarity. Clustering can also be
used for outlier detection, where outliers (values that are
“far away” from any cluster) may be more interesting than
common cases. Applications of outlier detection include the
detection of credit card frauds and the monitoring of criminal
activities in electronic commerce.
For example, exceptional cases in credit card transactions,
such as very expensive and infrequent purchases at unusual
locations, may be of interest as possible fraudulent activities.
Data clustering is under vigorous development. Contributing
areas of research include data mining, statistics, machine
learning and deep learning, spatial database technology,
information retrieval,
Web search, biology, marketing, and many other application
areas. Owing to the huge amounts of data collected in
databases, cluster analysis has become a highly active topic
in data mining research.
Cluster analysis?
As a branch of statistics, cluster analysis has been
extensively studied, with the main focus on distance-based cluster
analysis. Cluster analysis tools based on k-means, k-medoids, and
several other methods also have been built into many statistical
analysis software packages or systems, such as SPlus, SPSS, and
SAS. In machine learning, classification is known as supervised
learning
because the class label information is given, that is, the learning
algorithm is supervised in that it is told the class membership of
each training tuple.
Clustering is known as unsupervised learning because the
class label information is not present. For this reason, clustering is a
form of learning by observation, rather than learning by
examples. In data mining, efforts have focused on finding methods
for efficient and effective cluster analysis in large data sets. Active
themes of research focus on the scalability of clustering methods,
the effectiveness of methods for clustering complex shapes (e.g.,
nonconvex) and types of data (e.g., text, graphs, and images), high-
dimensional clustering techniques (e.g., clustering objects with
thousands or even millions of features), and methods for clustering
mixed numerical and nominal data in large data sets.
Cluster analysis
Cluster analysis
– Partitioning methods
– Hierarchical methods
– Density-based and grid-based methods
– Evaluation of clustering
Cluster Analysis
• When flying over a city, one can easily identify fields,
forests, commercial areas, and residential areas based on
their features, without anyone’s explicit “training”—This is
the power of cluster analysis
• This chapter and the next systematically study cluster
analysis methods and help answer the following:
– What are the different proximity measures for effective
clustering?
– Can we cluster a massive number of data points
efficiently?
– Can we find clusters of arbitrary shape? At multiple levels
of granularity?
– How can we judge the quality of the clusters discovered
by our system?
The Value of Cluster Analysis

• What is the value of cluster analysis?


– Cluster analysis helps you partition massive data into groups
based on its features
– Cluster analysis will often help subsequent data mining
processes such as pattern discovery, classification, and
outlier analysis
• What role does cluster analysis play in the Data Mining Specialization?
– You will learn various scalable methods to find clusters from
massive data
– You will learn how to mine different kinds of clusters
effectively
– You will also learn how to evaluate the quality of the clusters
you find
– Cluster analysis will help with classification, outlier analysis,
and other data mining tasks
Broad Applications of Cluster Analysis
• Data summarization, compression, and reduction
– Examples: Image processing or vector quantization
• Collaborative filtering, recommendation systems, or
customer segmentation
– Finding like-minded users or similar products
• Dynamic trend detection
– Clustering stream data and detecting trends and patterns
• Multimedia data analysis, biological data analysis, and
social network analysis
– Examples: Clustering video/audio clips or gene/protein
sequences
• A key intermediate step for other data mining tasks
– Generating a compact summary of data for classification,
pattern discovery, and hypothesis generation and testing
– Outlier detection: Outliers are those “far away” from any
cluster
What Is Cluster Analysis?
• What is a cluster?
– A cluster is a collection of data objects which are
• Similar (or related) to one another within the same group (i.e.,
cluster)
• Dissimilar (or unrelated) to the objects in other groups (i.e.,
clusters)
• Cluster analysis (or clustering, data segmentation, …)
– Given a set of data points, partition them into a set of groups
(i.e., clusters) which are as similar as possible
• Cluster analysis is unsupervised learning (i.e., no
predefined classes)
– This contrasts with classification (i.e., supervised learning)
• Typical ways to use/apply cluster analysis
– As a stand-alone tool to get insight into data distribution, or
– As a preprocessing (or intermediate) step for other
algorithms
Cluster Analysis: Applications
• A key intermediate step for other data mining tasks
– Generating a compact summary of data for classification, pattern
discovery, hypothesis generation and testing, etc.
– Outlier detection: Outliers—those “far away” from any cluster
• Data summarization, compression, and reduction
– Ex. Image processing: Vector quantization
• Collaborative filtering, recommendation systems, or customer
segmentation
– Find like-minded users or similar products
• Dynamic trend detection
– Clustering stream data and detecting trends and patterns
• Multimedia data analysis, biological data analysis and social
network analysis
– Ex. Clustering images or video/audio clips, gene/protein sequences, etc.
Considerations for Cluster Analysis
• Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable, e.g., grouping topical
terms)
• Separation of clusters
– Exclusive (e.g., one customer belongs to only one region)
vs. non-exclusive (e.g., one document may belong to more
than one class)
• Similarity measure
– Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
• Clustering space
– Full space (often when low dimensional) vs. subspaces
(often in high-dimensional clustering)
Requirements and Challenges
• Quality
– Ability to deal with different types of attributes: Numerical,
categorical, text, multimedia, networks, and mixture of multiple
types
– Discovery of clusters with arbitrary shape
– Ability to deal with noisy data
• Scalability
– Clustering all the data instead of only on samples
– High dimensionality
– Incremental or stream clustering and insensitivity to input order
• Constraint-based clustering
– User-given preferences or constraints; domain knowledge; user
queries
• Interpretability and usability
Cluster Analysis: A Multi-Dimensional Categorization
• Technique-Centered
– Distance-based methods
– Density-based and grid-based methods
– Probabilistic and generative models
– Leveraging dimensionality reduction methods
– High-dimensional clustering
– Scalable techniques for cluster analysis
• Data Type-Centered
– Clustering numerical data, categorical data, text data,
multimedia data, time-series data, sequences, stream data,
networked data, uncertain data
• Additional Insight-Centered
– Visual insights, semi-supervised, ensemble-based, validation-
based
Typical Clustering Methodologies
• Distance-based methods
– Partitioning algorithms: K-Means, K-Medians, K-Medoids
– Hierarchical algorithms: Agglomerative vs. divisive methods
• Density-based and grid-based methods
– Density-based: Data space is explored at a high-level of granularity
and then post-processing to put together dense regions into an
arbitrary shape
– Grid-based: Individual regions of the data space are formed into a
grid-like structure
• Probabilistic and generative models: Modeling data from a
generative process
– Assume a specific form of the generative model (e.g., mixture of
Gaussians)
– Model parameters are estimated with the Expectation-Maximization
(EM) algorithm (using the available dataset, for a maximum
likelihood fit)
– Then estimate the generative probability of the underlying data
points
• High-dimensional clustering
High-dimensional Clustering
• Subspace clustering: Find clusters on various subspaces
– Bottom-up, top-down, correlation-based methods vs. δ-cluster
methods
• Dimensionality reduction: A vertical form (i.e., columns)
of clustering
– Columns are clustered; may cluster rows and columns
together (co-clustering)
– Probabilistic latent semantic indexing (PLSI) then LDA: Topic
modeling of text data
• A cluster (i.e., topic) is associated with a set of words (i.e.,
dimensions) and a set of documents (i.e., rows) simultaneously
– Nonnegative matrix factorization (NMF) (as one kind of co-
clustering)
• A nonnegative matrix A (e.g., word frequencies in documents) can be approximately factorized into two non-negative low-rank matrices U and V
– Spectral clustering: Use the spectrum of the similarity matrix
of the data to perform dimensionality reduction for clustering
in fewer dimensions
Clustering Different Types of Data (I)
• Numerical data
– Most of the earliest clustering algorithms were designed for numerical data
• Categorical data (including binary data)
– Discrete data, no natural order (e.g., sex, race, zip-code, and
market-basket)
• Text data: Popular in social media, Web, and social networks
– Features: High-dimensional, sparse, value corresponding to word
frequencies
– Methods: Combination of k-means and agglomerative; topic
modeling; co-clustering
• Multimedia data: Image, audio, video (e.g., on Flickr, YouTube)
– Multi-modal (often combined with text data)
– Contextual: Containing both behavioral and contextual attributes
• Images: Position of a pixel represents its context, value represents its
behavior
• Video and music data: Temporal ordering of records represents its meaning
Clustering Different Types of Data (II)
• Time-series data: Sensor data, stock markets, temporal tracking,
forecasting, etc.
– Data are temporally dependent
– Time: contextual attribute; data value: behavioral attribute
– Correlation-based online analysis (e.g., online clustering of stocks to find correlated stock tickers)
– Shape-based offline analysis (e.g., cluster ECG based on overall
shapes)
• Sequence data: Weblogs, biological sequences, system
command sequences
– Contextual attribute: Placement (rather than time)
– Similarity functions: Hamming distance, edit distance, longest
common subsequence
– Sequence clustering: Suffix tree; generative model (e.g., Hidden
Markov Model)
• Stream data:
– Real-time, evolution and concept drift, single pass algorithm
– Create efficient intermediate representation, e.g., micro-clustering
Clustering Different Types of Data (III)
• Graphs and homogeneous networks
– Every kind of data can be represented as a graph with similarity
values as edges
– Methods: Generative models; combinatorial algorithms (graph cuts);
spectral methods; non-negative matrix factorization methods
• Heterogeneous networks
– A network consists of multiple typed nodes and edges (e.g.,
bibliographical data)
– Clustering different typed nodes/links together (e.g., NetClus)
• Uncertain data: Noise, approximate values, multiple possible
values
– Incorporation of probabilistic information will improve the quality of
clustering
• Big data: Modern systems may store and process very big data
(e.g., weblogs)
– Ex. Google’s MapReduce framework
• Use Map function to distribute the computation across different machines
• Use Reduce function to aggregate results obtained from the Map step
User Insights and Interactions in Clustering
• Visual insights: One picture is worth a thousand words
– Human eyes: High-speed processor linking with a rich knowledge-
base
– A human can provide intuitive insights; HD-eye: visualizing HD
clusters
• Semi-supervised insights: Passing user’s insights or intention to
system
– User-seeding: A user provides a number of labeled examples,
approximately representing categories of interest
• Multi-view and ensemble-based insights
– Multi-view clustering: Multiple clusterings represent different
perspectives
– Multiple clustering results can be ensembled to provide a more
robust solution
• Validation-based insights: Evaluation of the quality of clusters
generated
– May use case studies, specific measures, or pre-existing labels
Outline

• Cluster analysis
• Partitioning methods
– K-means: a centroid-based technique
– Variations of k-means
• Hierarchical methods
• Density-based and grid-based methods
• Evaluation of clustering
Partitioning Algorithms: Basic Concepts
The K-Means Clustering Method
• K-Means
– Each cluster is represented by the center of the cluster
• Given K, the number of clusters, the K-Means clustering
algorithm is outlined as follows
• Select K points as initial centroids
• Repeat
– Form K clusters by assigning each point to its closest
centroid
– Re-compute the centroids (i.e., mean point) of each
cluster
• Until convergence criterion is satisfied
• Different kinds of measures can be used
– Manhattan distance (L1 norm), Euclidean distance (L2
norm), Cosine similarity
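To make the loop concrete, here is a minimal NumPy sketch of the K-Means procedure above, assuming Euclidean distance and an iteration cap; the function and variable names are illustrative, not from the slides.

```python
# A minimal K-Means sketch in NumPy (illustrative only).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids (random rows of X)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute the centroids (mean point) of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence criterion
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
```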
Example: K-Means Clustering

(Figure: execution of the K-Means algorithm on a small 2-D data set; starting from K = 2 randomly selected centroids, the algorithm alternates between "assign points to clusters" and "recompute cluster centers" until the assignment stops changing.)

Select K points as initial centroids
Repeat
• Form K clusters by assigning each point to its closest centroid
• Re-compute the centroids (i.e., mean point) of each cluster
Until convergence criterion is satisfied
Discussion on the K-Means Method
• Efficiency: O(tKn) where n: # of objects, K: # of clusters, and t: # of iterations
– Normally, K, t << n; thus, an efficient method
• K-means clustering often terminates at a local optimum
– Initialization can be important to find high-quality clusters
• Need to specify K, the number of clusters, in advance
– There are ways to automatically determine the “best” K
– In practice, one often runs the algorithm for a range of K values and selects the “best” one
• Sensitive to noisy data and outliers
– Variations: Using K-medians, K-medoids, etc.
• K-means is applicable only to objects in a continuous n-dimensional
space
– Using the K-modes for categorical data
• Not suitable to discover clusters with non-convex shapes
– Using density-based clustering, kernel K-means, etc.
Variations of K-Means
• Choosing better initial centroid estimates
– K-means++, Intelligent K-Means, Genetic K-
Means
• Choosing different representative
prototypes for the clusters
– K-Medoids, K-Medians, K-Modes
• Applying feature transformation
techniques
– Weighted K-Means, Kernel K-Means
Initialization of K-Means
• Different initializations may generate rather
different clustering results (some could be far from
optimal)
• Original proposal (MacQueen’67): Select K seeds
randomly
– Need to run the algorithm multiple times using
different seeds
• There are many methods proposed for better initialization of k seeds

• K-Means++ (Arthur & Vassilvitskii’07):
– The first centroid is selected at random
– Each subsequent centroid is selected with probability proportional to its squared distance from the nearest centroid chosen so far, so far-away points are favored (a weighted probability score)
– The selection continues until K centroids are obtained
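A small sketch of K-Means++-style seeding, where each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far; the code and names are my own illustration.

```python
# K-Means++ style seeding sketch (illustrative; not the slides' code).
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]          # first centroid: uniform random
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                      # weighted probability score
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```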
Example: Poor Initialization May Lead to Poor Clustering

(Figure: a rerun of K-Means on the same data points with another random selection of K centroids, again alternating "assign points to clusters" and "recompute cluster centers".)
• Rerun of K-Means using another random set of K seeds
• This run of K-Means generates a poor-quality clustering
Handling Outliers: From K-Means to K-Medoids

PAM: A Typical K-Medoids Algorithm

(Figure: execution of PAM on a small 2-D data set with K = 2.)

Select K objects arbitrarily as initial medoids
Repeat
• Object re-assignment: assign each remaining object to its nearest medoid
• Randomly select a non-medoid object O_random
• Compute the total cost of swapping a medoid O with O_random
• If the swap improves the clustering quality, swap the medoid with O_random
Until convergence criterion is satisfied
Discussion on K-Medoids
Clustering
• K-Medoids Clustering: Find representative objects (medoids) in
clusters
• PAM (Partitioning Around Medoids: Kaufmann & Rousseeuw 1987)
– Starts from an initial set of medoids, and
– Iteratively replaces one of the medoids by one of the non-medoids if it
improves the total sum of the squared errors (SSE) of the resulting
clustering
– PAM works effectively for small data sets but does not scale well for
large data sets (due to the computational complexity)
– Computational complexity: PAM: O(K(n − K)²) (quite expensive!)
• Efficiency improvements on PAM
– CLARA (Kaufmann & Rousseeuw, 1990):
• PAM on samples; O(Ks² + K(n − K)), s is the sample size
– CLARANS (Ng & Han, 1994): Randomized re-sampling, ensuring
efficiency + quality
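The swap-based idea can be sketched as follows under a simple sum-of-distances cost; this is an illustrative simplification, not the full PAM implementation.

```python
# Simplified PAM (K-Medoids) sketch: greedy medoid swaps that reduce the
# total distance of objects to their nearest medoid.
import numpy as np

def pam(X, k, max_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)

    def cost(meds):
        return D[:, meds].min(axis=1).sum()       # total distance to nearest medoid

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                        # try swapping each medoid
            for o in range(len(X)):               # with each non-medoid object
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                c = cost(trial)
                if c < best:                      # keep the swap if cost drops
                    best, medoids, improved = c, trial, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```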
K-Medians: Handling Outliers by
Computing Medians
K-Modes: Clustering Categorical
Data
Kernel K-Means Clustering
• Kernel K-Means can be used to detect non-convex clusters
– K-Means can only detect clusters that are linearly separable
• Idea: Project data onto the high-dimensional kernel space,
and then perform K-Means clustering
– Map data points in the input space onto a high-dimensional
feature space using the kernel function
– Perform K-Means on the mapped feature space
• Computational complexity is higher than K-Means
– Need to compute and store n x n kernel matrix generated from
the kernel function on the original data
• The widely studied spectral clustering can be considered as a
variant of Kernel K-Means clustering
Kernel Functions and Kernel K-Means Clustering

Example: Kernel Functions and Kernel K-Means Clustering

(Figure: sample data shown in the original space and in the kernel-induced feature space.)

Example: Kernel K-Means Clustering

(Figure: the original data set; the result of K-Means clustering; the result of Gaussian Kernel K-Means clustering.)

• The above data set cannot generate quality clusters by K-Means since it contains non-convex clusters
• The Gaussian (RBF) kernel transformation maps the data to a kernel matrix K, where for any two points x_i and x_j: K_ij = φ(x_i) · φ(x_j), with the Gaussian kernel K(X_i, X_j) = exp(−‖X_i − X_j‖² / (2σ²))
• K-Means clustering is then conducted on the mapped data, generating quality clusters
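A rough NumPy sketch of kernel K-Means with the Gaussian (RBF) kernel, using the identity ‖φ(x) − μ_c‖² = K_xx − (2/|c|) Σ_{j∈c} K_xj + (1/|c|²) Σ_{j,l∈c} K_jl so that no explicit feature map is needed; all names are illustrative.

```python
# Kernel K-Means sketch (illustrative). Distances to cluster means are
# computed implicitly through the kernel matrix, so non-convex clusters
# in the input space can often still be separated.
import numpy as np

def rbf_kernel(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_kmeans(K, k, max_iter=50, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)            # random initial assignment
    for _ in range(max_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                dist[:, c] = np.inf
                continue
            # ||phi(x) - mu_c||^2 via kernel entries only
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Usage: two concentric rings (non-convex clusters)
theta = np.random.uniform(0, 2 * np.pi, 200)
r = np.r_[np.ones(100), 3 * np.ones(100)]
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
labels = kernel_kmeans(rbf_kernel(X, sigma=0.5), k=2)
```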
Outline
• Cluster analysis
• Partitioning methods
• Hierarchical methods
– Basic concepts of hierarchical clustering
– Agglomerative hierarchical clustering
– Divisive hierarchical clustering
– BIRCH: scalable hierarchical clustering using
clustering feature trees
– Probabilistic hierarchical clustering
• Density-based and grid-based methods
• Evaluation of clustering
Hierarchical Clustering: Basic Concepts
• Hierarchical clustering
– Generate a clustering hierarchy (drawn as a
dendrogram)
– Not required to specify K, the number of clusters
– More deterministic
– No iterative refinement
• Two categories of algorithms
– Agglomerative: Start with singleton clusters,
continuously merge two clusters at a time to build a
bottom-up hierarchy of clusters
– Divisive: Start with a huge macro-cluster, split it
continuously into two groups, generating a top-down
hierarchy of clusters
Agglomerative vs. Divisive Clustering

(Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds bottom-up over steps 0-4: {a, b} → ab, {d, e} → de, {c, de} → cde, {ab, cde} → abcde. Divisive clustering (DIANA) runs the same hierarchy top-down, from abcde back to singletons, over steps 4-0.)


Dendrogram: How Clusters are Merged
• Dendrogram: Decompose a set of data objects into a tree
of clusters by multi-level nested partitioning
• A clustering of the data objects is obtained by cutting the
dendrogram at the desired level, then each connected
component forms a cluster

Hierarchical clustering generates a dendrogram (a hierarchy of clusters)
Agglomerative Clustering Algorithm
• AGNES (AGglomerative NESting)
– Use the single-link method and the dissimilarity
matrix
– Continuously merge nodes that have the least
dissimilarity
– Eventually all nodes belong to the same cluster
• Agglomerative clustering varies on different similarity measures among clusters
– Single link (nearest neighbor)
– Complete link (diameter)
– Average link (group average)
– Centroid link (centroid similarity)
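For reference, SciPy's hierarchical-clustering routines expose these linkage choices directly; a minimal usage sketch with made-up data follows.

```python
# Agglomerative clustering with different linkage criteria using SciPy.
# 'single' = nearest neighbor, 'complete' = diameter, 'average' = group
# average, 'ward' = Ward's criterion.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])

Z = linkage(X, method='single')                   # try 'complete', 'average', 'ward'
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into 2 clusters
# dendrogram(Z)  # draws the merge tree when a plotting backend is available
```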
Agglomerative Clustering Algorithm

(Figure: three snapshots of agglomerative clustering on a small 2-D data set, showing nearby points being merged into progressively larger clusters.)
Single Link vs. Complete Link in Hierarchical Clustering
• Single link (nearest neighbor)
– The similarity between two clusters is the similarity
between their most similar (nearest neighbor) members
– Local similarity-based: Emphasizing more on close
regions, ignoring the overall structure of the cluster
– Capable of clustering non-elliptical shaped group of
objects
– Sensitive to noise and outliers

• Complete link (diameter)


– The similarity between two clusters is the similarity
between their most dissimilar members
– Merge two clusters to form one with the smallest
diameter
– Nonlocal in behavior, obtaining compact shaped
clusters
– Sensitive to outliers
Agglomerative Clustering: Average vs. Centroid Links

(Figure: two clusters C_a and C_b with N_a and N_b objects; average link uses the mean pairwise distance between the clusters, centroid link the distance between their centroids.)
Agglomerative Clustering with
Ward’s Criterion
Divisive Clustering
• DIANA (Divisive Analysis)
– Implemented in some statistical analysis
packages, e.g., Splus
• Inverse order of AGNES: Eventually each
node forms a cluster on its own

(Figure: DIANA splitting one cluster of 2-D points into progressively smaller clusters, shown in three snapshots.)
Divisive Clustering Is a Top-down Approach
• The process starts at the root with all the points
as one cluster
• It recursively splits the higher level clusters to
build the dendrogram
• Can be considered as a global approach
• More efficient when compared with
agglomerative clustering

(Figure: the same top-down splitting process shown on a 2-D data set.)
More on Algorithm Design for Divisive Clustering
• Choosing which cluster to split
– Check the sums of squared errors of the clusters
and choose the one with the largest value
• Splitting criterion: Determining how to split
– One may use Ward’s criterion to chase for greater
reduction in the difference in the SSE criterion as
a result of a split
– For categorical data, Gini-index can be used
• Handling the noise
– Use a threshold to determine the termination criterion (do not generate clusters that are too small because they contain mainly noise)
Extensions to Hierarchical Clustering
• Weakness of the agglomerative & divisive
hierarchical clustering methods
– No revisit: cannot undo any merge/split decisions made
before
– Scalability bottleneck: Each merge/split needs to
examine many possible options
• Time complexity: at least O(n2), where n is the number of total
objects
• Several other hierarchical clustering algorithms
– BIRCH (1996): Use CF-tree and incrementally adjust
the quality of sub-clusters
– CURE (1998): Represent a cluster using a set of well-
scattered representative points
– CHAMELEON (1999): Use graph partitioning methods
on the K-nearest neighbor graph of the data
BIRCH: A Multi-Phase Hierarchical Clustering
Method
• BIRCH (Balanced Iterative Reducing and Clustering Using
Hierarchies)
– Developed by Zhang, Ramakrishnan & Livny (SIGMOD’96)
– Impacted many new clustering methods and applications (received the 2006 SIGMOD Test of Time Award)
• Major innovation
– Integrating hierarchical clustering (initial micro-clustering phase)
and other clustering methods (at the later macro-clustering phase)
• Multi-phase hierarchical clustering
– Phase1 (initial micro-clustering): Scan DB to build an initial CF tree,
a multi-level compression of the data to preserve the inherent
clustering structure of the data
– Phase 2 (later macro-clustering): Use an arbitrary clustering
algorithm (e.g., iterative partitioning) to cluster flexibly the leaf nodes
of the CF-tree
Clustering Feature Vector

Clustering Feature: a Summary of the Statistics for the Given Cluster

Example: five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8)

CF₁ = <N, LS, SS> = <5, (16, 30), 244>

N = 5
LS = (3+2+4+4+3, 4+6+5+7+8) = (16, 30)
SS = (3²+2²+4²+4²+3²) + (4²+6²+5²+7²+8²) = 54 + 190 = 244
Essential Measures of Cluster: Centroid, Radius and Diameter

(Figure: formulas for the centroid, radius, and diameter of a cluster, all computable from its clustering feature.)
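A tiny sketch of a clustering feature as a Python class, showing the CF additivity that BIRCH relies on and the centroid/radius derived from <N, LS, SS>; the class and method names are my own.

```python
# Clustering Feature (CF) sketch: CF = <N, LS, SS>, where LS is the linear
# sum of the points and SS the sum of their squared norms. CFs are additive,
# which lets BIRCH maintain them incrementally in the CF tree.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CF:
    n: int = 0
    ls: np.ndarray = field(default_factory=lambda: np.zeros(2))
    ss: float = 0.0

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        return CF(self.n + 1, self.ls + x, self.ss + float(x @ x))

    def merge(self, other):                       # CF additivity
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):                             # avg distance of members to centroid
        c = self.centroid()
        return float(np.sqrt(self.ss / self.n - c @ c))

# Reproduce the slide's example
cf = CF()
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
    cf = cf.add_point(p)
print(cf.n, cf.ls, cf.ss)   # 5 [16. 30.] 244.0
```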
CF Tree: A Height-Balanced Tree Storing Clustering
Features for Hierarchical Clustering

• Incremental insertion of new points (similar to B+-tree)


• For each point in the input
– Find its closest leaf entry
– Add point to leaf entry and update CF
– If entry diameter > max_diameter
• Split leaf, and possibly parents
• A CF tree has two parameters
– Branching factor: Maximum number of children
– Maximum diameter of sub-clusters stored at the leaf nodes
• A CF tree: A height-balanced tree that stores the
clustering features (CFs)
• The non-leaf nodes store sums of the CFs of their
children
CF Tree: A Height-Balanced Tree Storing Clustering Features for Hierarchical Clustering

(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF₁ … CF₆, each with a child pointer; leaf nodes hold entries CF_x1 … CF_x6 and are chained together with prev/next pointers.)
BIRCH: A Scalable and Flexible Clustering
Method

• An integration of agglomerative clustering


with other (flexible) clustering methods
• Low-level micro-clustering
– Exploring CF-feature and BIRCH tree structure
– Preserving the inherent clustering structure of
the data
• Higher-level macro-clustering
– Provide sufficient flexibility for integration with
other clustering methods
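scikit-learn's Birch estimator follows this two-phase design (CF-tree micro-clustering, then a global clustering of the leaf subclusters); a brief usage sketch with arbitrary parameter values.

```python
# Phase 1 builds the CF tree (micro-clustering); phase 2 groups the leaf
# subclusters with a global clustering step (here, into 3 clusters).
import numpy as np
from sklearn.cluster import Birch

X = np.random.randn(1000, 2)
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
# model.subcluster_centers_ holds the centroids of the CF-tree leaf subclusters
```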
BIRCH: Pros and Cons
• Strength: Good quality of clustering; linear
scalability in large/stream databases;
effective for incremental and dynamic
clustering of incoming objects
• Weaknesses
– Due to the fixed size of leaf nodes, clusters so formed may not be very natural
– Clusters tend to be spherical given the radius and diameter measures
(Figure: elongated, non-spherical clusters; images like this may give BIRCH a hard time.)
Probabilistic Hierarchical
Clustering
• Algorithmic hierarchical clustering
– Nontrivial to choose a good distance measure
– Hard to handle missing attribute values
– Optimization goal not clear: heuristic, local search
• Probabilistic hierarchical clustering
– Use probabilistic models to measure distances between clusters
– Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
– Easy to understand, same efficiency as algorithmic
agglomerative clustering method, can handle partially observed
data
• In practice, assume the generative models adopt common
distribution functions, e.g., Gaussian distribution or Bernoulli
distribution, governed by parameters
Generative Model
A Probabilistic Hierarchical
Clustering Algorithm
Example
Outline
• Cluster analysis
• Partitioning methods
• Hierarchical methods
• Density-based and grid-based methods
– DBSCAN: density-based clustering based on
connected regions with high density
– DENCLUE: clustering based on density distribution
functions
– Grid-based methods
• Evaluation of clustering
Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion),
such as density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan (only examine the local region to justify density)
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99)
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (also, grid-based)
DBSCAN: A Density-Based Spatial Clustering Algorithm
• DBSCAN (M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, KDD’96)
– Discovers clusters of arbitrary shape: Density-Based Spatial Clustering of Applications with Noise
• A density-based notion of cluster
– A cluster is defined as a maximal set of density-connected points
• Two parameters
– Eps (ε): Maximum radius of the neighborhood
– MinPts: Minimum number of points in the Eps-neighborhood of a point
• The Eps (ε)-neighborhood of a point q: N_Eps(q) = {p belongs to D | dist(p, q) ≤ Eps}
• Core point: its neighborhood is dense; Border point: in a cluster but its own neighborhood is not dense; Outlier/noise: not in any cluster
(Figure: core point q, border point p, and an outlier, with MinPts = 5 and Eps = 1 cm.)
DBSCAN: Density-Reachable and Density-Connected
• Directly density-reachable: a point p is directly density-reachable from a point q if q is a core point and p is in the Eps-neighborhood of q
• Density-reachable: p is density-reachable from q if there is a chain of points p₁ = q, …, pₙ = p such that each point is directly density-reachable from the previous one
• Density-connected: p and q are density-connected if there is a point o such that both p and q are density-reachable from o
(Figure: illustration with MinPts = 5 and Eps = 1 cm.)
DBSCAN: The Algorithm
• Core point: its Eps-neighborhood is dense (contains at least MinPts points)
• Border point: belongs to a cluster, but its own neighborhood is not dense
• Outlier/noise: does not belong to any cluster
• The algorithm grows a cluster from an arbitrary unvisited core point by collecting all points density-reachable from it, and repeats until every point has been visited
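A short usage sketch with scikit-learn's DBSCAN; the eps and min_samples values are arbitrary examples, not recommendations.

```python
# DBSCAN discovers arbitrarily shaped, density-based clusters and marks
# low-density points as noise (label -1).
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense Gaussian blobs plus some uniform background noise
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(3, 0.3, (100, 2)),
               rng.uniform(-2, 5, (20, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
```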
DBSCAN Is Sensitive to the Setting
of Parameters

Ack. Figures from G. Karypis, E.-H. Han, and V. Kumar, COMPUTER, 32(8),
1999
OPTICS: Ordering Points To Identify
Clustering Structure
• OPTICS (Ankerst, Breunig, Kriegel, and Sander,
SIGMOD’99)
– DBSCAN is sensitive to parameter setting
– An extension: finding clustering structure
• Observation: Given a MinPts, density-based clusters w.r.t. a
higher density are completely contained in clusters w.r.t. to a
lower density
• Idea: Higher density points should be processed first—find
high-density clusters first
• OPTICS stores such a clustering order using two pieces of
information:
– Core distance and reachability distance
Visualization
• Since points belonging to a cluster have a
low reachability distance to their nearest
neighbor, valleys correspond to clusters
• The deeper the valley, the denser the
cluster
(Figure: reachability plot for a dataset; the reachability distance, undefined for the first object, is plotted against the cluster order of the objects, with valleys below thresholds ε and ε′ marking clusters.)
OPTICS: An Extension from DBSCAN
• Core distance of an object p: the smallest value ε′ such that the ε′-neighborhood of p has at least MinPts objects
– Let N_ε(p) be the ε-neighborhood of p, where ε is a distance value
– Core-distance_{ε,MinPts}(p) = Undefined if card(N_ε(p)) < MinPts; MinPts-distance(p), otherwise
(Figure: reachability plot showing core distances along the cluster order of the objects.)
OPTICS: An Extension from
DBSCAN
• Reachability distance of object q from core
object p is the min. radius value that
makes q density-reachable from p
Reachability-distance_{ε,MinPts}(p, q) =
Undefined, if p is not a core object
max(core-distance(p), distance(p, q)), otherwise

• Complexity: O(N log N) (if index-based), where N: # of points
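scikit-learn's OPTICS exposes both the cluster ordering and the reachability values needed for the reachability plot; a minimal sketch with arbitrary parameters.

```python
# OPTICS computes a density-based ordering of the points; valleys in the
# reachability plot (reachability_[ordering_]) correspond to clusters.
import numpy as np
from sklearn.cluster import OPTICS

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
model = OPTICS(min_samples=5).fit(X)

ordering = model.ordering_                      # cluster order of the objects
reachability = model.reachability_[ordering]    # values for the reachability plot
labels = model.labels_                          # -1 marks noise
```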
OPTICS: Finding Hierarchically
Nested Clustering Structures
• OPTICS produces a special cluster-ordering of
the data points with respect to its density-
based clustering structure
• The cluster-ordering contains information
equivalent to the density-based clusterings
corresponding to a broad range of parameter
settings
• Good for both automatic and interactive cluster
analysis—finding intrinsic, even hierarchically
nested clustering structures
OPTICS: Finding Hierarchically
Nested Clustering Structures

Finding nested clustering structures with different parameter settings


Grid-Based Clustering Methods
• Grid-Based Clustering: Explore multi-resolution grid data structure in
clustering
– Partition the data space into a finite number of cells to form a grid
structure
– Find clusters (dense regions) from the cells in the grid structure
• Features and challenges of a typical grid-based algorithm
– Efficiency and scalability: # of cells << # of data points
– Uniformity: Uniform cells make it hard to handle highly irregular data distributions
– Locality: Limited by predefined cell sizes, borders, and the density
threshold
– Curse of dimensionality: Hard to cluster high-dimensional data
• Methods to be introduced
– STING (a STatistical INformation Grid approach) (Wang, Yang and
Muntz, VLDB’97)
– CLIQUE (Agrawal, Gehrke, Gunopulos, and Raghavan, SIGMOD’98)
• Both grid-based and subspace clustering
STING: A Statistical Information
Grid Approach
• STING (Statistical Information Grid)
(Wang, Yang and Muntz, VLDB’97)
• The spatial area is divided into rectangular cells at different levels of resolution, and these cells form a tree structure
• A cell at a high level contains a number of smaller cells of the next lower level
(Figure: the hierarchical grid structure, from the 1st layer down through the (i−1)-st and i-th layers.)
STING: A Statistical Information
Grid Approach
• Statistical information of each cell is
calculated and stored beforehand and is
used to answer queries
• Parameters of higher level cells can be easily calculated from those of lower level cells, including
– count, mean, s (standard deviation), min, max
– type of distribution: normal, uniform, etc.
(Figure: the same layered grid, from the 1st layer down through the (i−1)-st and i-th layers.)


Query Processing in STING and Its
Analysis
• To process a region query
– Start at the root and proceed to the next lower level, using the
STING index
– Calculate the likelihood that a cell is relevant to the query at
some confidence level using the statistical information of the cell
– Only children of likely relevant cells are recursively explored
– Repeat this process until the bottom layer is reached
• Advantages
– Query-independent, easy to parallelize, incremental update
– Efficiency: Complexity is O(K)
• K: # of grid cells at the lowest level, and K << N (i.e., # of data points)
• Disadvantages
– Its probabilistic nature may imply a loss of accuracy in query
processing
CLIQUE: Grid-Based Subspace
Clustering
• CLIQUE (Clustering In QUEst) (Agrawal, Gehrke, Gunopulos,
Raghavan: SIGMOD’98)
• CLIQUE is a density-based and grid-based subspace clustering
algorithm
– Grid-based: It discretizes the data space through a grid and estimates
the density by counting the number of points in a grid cell
– Density-based: A cluster is a maximal set of connected dense units in a
subspace
• A unit is dense if the fraction of total data points contained in the unit exceeds the
input model parameter
– Subspace clustering: A subspace cluster is a set of neighboring dense
cells in an arbitrary subspace. It also discovers some minimal
descriptions of the clusters
• It automatically identifies subspaces of a high dimensional data
space that allow better clustering than original space using the
Apriori principle
CLIQUE: Subspace Clustering with Apriori Pruning
• Start at 1-D space and discretize numerical
intervals in each axis into grid
• Find dense regions (clusters) in each
subspace and generate their minimal
descriptions
– Use the dense regions to find promising
candidates in 2-D space based on the Apriori
principle
– Repeat the above in level-wise manner in higher
dimensional subspaces
CLIQUE: Subspace Clustering with Apriori Pruning
Major Steps of the CLIQUE
Algorithm
• Identify subspaces that contain clusters
– Partition the data space and find the number of points that
lie inside each cell of the partition
– Identify the subspaces that contain clusters using the
Apriori principle
• Identify clusters
– Determine dense units in all subspaces of interests
– Determine connected dense units in all subspaces of
interests
• Generate minimal descriptions for the clusters
– Determine maximal regions that cover a cluster of
connected dense units for each cluster
– Determine minimal cover for each cluster
Pros and Cons of CLIQUE
• Strengths
– Automatically finds subspaces of the highest dimensionality
as long as high density clusters exist in those subspaces
– Insensitive to the order of records in input and does not
presume some canonical data distribution
– Scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
• Weaknesses
– As in all grid-based clustering approaches, the quality of
the results crucially depends on the appropriate choice of
the number and width of the partitions and grid cells
Outline
• Cluster analysis
• Partitioning methods
• Hierarchical methods
• Density-based and grid-based methods
• Evaluation of clustering
– Assessing clustering tendency
– Determining the number of clusters
– Measuring clustering quality: extrinsic methods
– Intrinsic methods
Evaluation of Clustering: Basic
Concepts
• Evaluation of clustering
– Assess the feasibility of clustering analysis on a data set
– Evaluate the quality of the results generated by a
clustering method
• Major issues on clustering assessment and validation
– Clustering tendency: assessing the suitability of
clustering: whether the data has any inherent grouping
structure
– Determining the Number of Clusters: determining for a
dataset the right number of clusters that may lead to a
good quality clustering
– Clustering quality evaluation: evaluating the quality of the
clustering results
Clustering Tendency: Whether the Data Contains Inherent Grouping Structure
• Assess the suitability of clustering
– Whether the data has any “inherent grouping structure”, i.e., non-random structure that may lead to meaningful clusters
• Determine clustering tendency or clusterability
– A hard task because there are so many different definitions of clusters
• Different definitions: Partitioning, hierarchical, density-based and graph-based
– Even fixing a type, still hard to define an appropriate null model for a
data set
• There are some clusterability assessment methods, such as
– Spatial histogram: Contrast the histogram of the data with that
generated from random samples
– Distance distribution: Compare the pairwise point distance from the data
with those from the randomly generated samples
– Hopkins Statistic: A sparse sampling test for spatial randomness
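A rough sketch of the Hopkins statistic (values near 0.5 suggest spatial randomness, values approaching 1 suggest clusterable data); this is an illustrative implementation, not code from the slides.

```python
# Hopkins statistic sketch: compares nearest-neighbor distances of random
# probe points (u) against those of sampled data points (w).
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tree = cKDTree(X)
    # u: distances from m uniform random points (in the bounding box) to the data
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = tree.query(probes, k=1)[0]
    # w: distances from m sampled data points to their nearest other data point
    sample_idx = rng.choice(n, size=m, replace=False)
    w = tree.query(X[sample_idx], k=2)[0][:, 1]   # k=2: skip the point itself
    return u.sum() / (u.sum() + w.sum())
```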
Testing Clustering Tendency: A
Spatial Histogram Approach
• Spatial Histogram Approach: Contrast the
d-dimensional histogram of the input
dataset D with the histogram generated
from random samples
– Dataset D is clusterable if the distributions of
two histograms are rather different

(a) Input dataset (b) Data generated from random samples


Testing Clustering Tendency: A
Spatial Histogram Approach
• Method outline
– Divide each dimension into equi-width bins,
count how many points lie in each cell, and
obtain the empirical joint probability mass
function (EPMF)
• Do the same for the randomly sampled
data
• Compute how much they differ using the
Kullback-Leibler (KL) divergence value
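A sketch of this spatial-histogram test under simple assumptions (equi-width bins, a uniform random reference sample over the data's bounding box); bin counts and names are illustrative.

```python
# Spatial-histogram clusterability test sketch: compare the empirical joint
# PMF (EPMF) of the data against that of a random sample via KL divergence.
import numpy as np
from scipy.stats import entropy

def spatial_histogram_kl(X, bins=8, seed=0):
    rng = np.random.default_rng(seed)
    # EPMF of the input data over an equi-width grid
    h_data, edges = np.histogramdd(X, bins=bins)
    p = (h_data.ravel() + 1e-9) / h_data.sum()
    # EPMF of a random sample drawn uniformly over the same bounding box
    R = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
    h_rand, _ = np.histogramdd(R, bins=edges)
    q = (h_rand.ravel() + 1e-9) / h_rand.sum()
    return entropy(p, q)     # KL(p || q); large values suggest clusterable data

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 6])
print(spatial_histogram_kl(X))
```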
Determining the Number of
Clusters
• The appropriate number of clusters controls
the proper granularity of cluster analysis
– Finding a good balance between compressibility
and accuracy in cluster analysis
• Two undesirable extremes
– The whole data set is one cluster: No value of
clustering
– Treating each point as a cluster: No data
summarization
Determining the Number of
Clusters
Finding the Number of Clusters: the
Elbow Method
• Use the turning point in the curve of the
sum of within cluster variance with respect
to the # of clusters
– Increasing the # of clusters can help reduce
the sum of within-cluster variance of each
cluster
– But splitting a cohesive cluster gives only a
small reduction
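A small sketch of the elbow heuristic using the within-cluster sum of squares reported by scikit-learn's KMeans (inertia_); the data and the range of K are made up.

```python
# Elbow method sketch: track the total within-cluster variance (inertia)
# against K and look for the "turning point" where improvement flattens.
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(100, 2) + c for c in [(0, 0), (6, 0), (3, 6)]])

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)        # sum of squared distances to closest centroid

for k, sse in zip(range(1, 10), inertias):
    print(k, round(sse, 1))             # with three blobs, the elbow appears near K = 3
```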
Finding K, the Number of
Clusters: A Cross Validation
Method
• Divide a given data set into m parts, and use m – 1
parts to obtain a clustering model
• Use the remaining part to test the quality of the
clustering
– For example, for each point in the test set, find the
closest centroid, and use the sum of squared distance
between all points in the test set and their closest
centroids to measure how well the model fits the test set
• For any k > 0, repeat it m times, compare the overall
quality measure w.r.t. different k’s, and find # of
clusters that fits the data the best
Measuring Clustering Quality
• Clustering Evaluation: Evaluating how good the clustering
results are
– No commonly recognized best suitable measure in practice
• Extrinsic vs. intrinsic methods: depending on whether ground
truth is used
– Ground truth: the ideal clustering built by using human experts
• Extrinsic: Supervised, employ criteria not inherent to the
dataset
– Compare a clustering against prior or expert-specified
knowledge (i.e., the ground truth) using certain clustering quality
measure
• Intrinsic: Unsupervised, criteria derived from data itself
– Evaluate the goodness of a clustering by considering how well
the clusters are separated and how compact the clusters are
(e.g., silhouette coefficient)
General Criteria for Measuring Clustering Quality with Extrinsic Methods
• Given the ground truth C_g, Q(C, C_g) is the quality measure for a clustering C
• Q(C, C_g) is good if it satisfies the following four essential criteria
– Cluster homogeneity: the purer, the better
– Cluster completeness: assign objects belonging to the same category in the ground truth to the same cluster
– Rag bag better than alien: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a “miscellaneous” or “other” category)
– Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces
(Figure: a ground-truth partitioning G₁, G₂ compared with clusters C₁, C₂.)
Commonly Used Extrinsic Methods
• Matching-based methods
– Examine how well the clustering results match the ground truth in partitioning the objects in the data set
(Figure: ground-truth partitioning G₁, G₂ vs. clusters C₁, C₂.)
• Information theory-based methods
– Compare the distribution of the clustering results and that of the
ground truth
– Information theory (e.g., entropy) used to quantify the
comparison
– Ex. Conditional entropy, normalized mutual information (NMI)
• Pairwise comparison-based methods
– Treat each group in the ground truth as a class, and then check
the pairwise consistency of the objects in the clustering results
– Ex. Four possibilities: TP, FN, FP, TN; Jaccard coefficient
Matching-Based Methods
(Figure: a contingency table relating ground-truth partitions to clusters C₁, C₂, C₃.)

Matching-Based Methods: Example
• Consider 11 objects with a ground-truth partitioning and two alternative clusterings (the full contingency tables are shown in the original figure)
• Purity for clustering 𝒞₁ = (4 + 2 + 4 + 1) / 11 = 11/11 = 1
• Purity for clustering 𝒞₂ = (2 + 3 + 1) / 11 = 6/11

• Other methods:
– maximum matching; F-measure
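Purity can be computed from the cluster-vs-ground-truth contingency matrix; a brief sketch follows, with made-up label arrays standing in for the 11-object example.

```python
# Purity = (1/n) * sum over clusters of the size of the largest
# ground-truth class inside each cluster.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    cont = contingency_matrix(labels_true, labels_pred)
    return cont.max(axis=0).sum() / cont.sum()

truth = np.array([0]*6 + [1]*5)            # hypothetical ground truth for 11 objects
pred  = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 1])
print(purity(truth, pred))
```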
Information Theory-Based Methods (I): Conditional Entropy
• A clustering can be regarded as a compressed representation of a given set of objects
• The better the clustering results approach the ground truth, the less additional information is needed
• This idea leads to the use of conditional entropy
(Figure: ground-truth partitions G₁, G₂ vs. clusters C₁, C₂, C₃.)
Example
• Consider the same 11 objects as in the purity example above
– Purity for clustering 𝒞₁ = (4 + 2 + 4 + 1) / 11 = 11/11 = 1
– Purity for clustering 𝒞₂ = (2 + 3 + 1) / 11 = 6/11
• Note: conditional entropy cannot detect the issue that 𝒞₁ splits the objects in G into two clusters
Information Theory-Based Methods (II)
Normalized Mutual Information (NMI)
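NMI is available off the shelf in scikit-learn; a minimal sketch with illustrative label arrays.

```python
# NMI compares a clustering against the ground truth on a 0..1 scale
# (1 = identical partitions up to relabeling).
from sklearn.metrics import normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 1, 2, 2]
pred  = [1, 1, 1, 0, 0, 0, 2, 2]
print(normalized_mutual_info_score(truth, pred))   # 1.0: same partition, different labels
```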
Pairwise Comparison-Based
Methods: Jaccard Coefficient
• Pairwise comparison: treat each group in the ground truth as a class
• For each pair of objects (oi, oj) in D, if they are assigned to the same
cluster/group, the assignment is regarded as positive; otherwise,
negative
– Depending on assignments, we have four possible cases:

Note: total # of pairs of points N = C(n, 2) = n(n − 1)/2

– Jaccard coefficient: Ignoring the true negatives (thus asymmetric)


– Jaccard = TP/(TP + FN + FP) [i.e., denominator ignores TN]
• Jaccard = 1 if perfect clustering
• Many other measures are based on the pairwise comparison
statistics:
– Rand statistic
– Fowlkes-Mallows measure
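A short sketch that counts the pairwise TP, FN, FP, and TN cases and derives the Jaccard coefficient (and the Rand statistic) from them; the label arrays are illustrative.

```python
# Pairwise comparison of a clustering against the ground truth:
# TP = same group in both, FN = same in truth only, FP = same in clustering
# only, TN = different in both. Jaccard ignores TN.
from itertools import combinations

def pairwise_counts(truth, pred):
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t, same_p = truth[i] == truth[j], pred[i] == pred[j]
        if same_t and same_p:       tp += 1
        elif same_t and not same_p: fn += 1
        elif same_p and not same_t: fp += 1
        else:                       tn += 1
    return tp, fn, fp, tn

truth = [0, 0, 0, 1, 1, 2, 2, 2]
pred  = [0, 0, 1, 1, 1, 2, 2, 0]
tp, fn, fp, tn = pairwise_counts(truth, pred)
jaccard = tp / (tp + fn + fp)
rand = (tp + tn) / (tp + fn + fp + tn)     # Rand statistic uses all four counts
```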
Intrinsic Methods (I): Dunn Index
Intrinsic Methods (II): Silhouette Coefficient
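The silhouette coefficient is available directly in scikit-learn; a minimal sketch with made-up data.

```python
# Silhouette coefficient: for each object, compares cohesion (mean distance
# to its own cluster) against separation (mean distance to the nearest other
# cluster); the average over all objects ranges from -1 to 1.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # close to 1 for compact, well-separated clusters
```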
