Data Mining Unit 5
Clustering is the process of grouping a set of data objects into multiple groups or clusters so that
objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.
What Is Cluster Analysis?
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are
similar to one another, yet dissimilar to objects in other clusters. The set of clusters
resulting from a cluster analysis can be referred to as a clustering.
Because a cluster is a collection of data objects that are similar to one another within the
cluster and dissimilar to objects in other clusters, a cluster of data objects can be treated
as an implicit class. In this sense, clustering is sometimes called automatic classification.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity. Clustering can also be
used for outlier detection, where outliers (values that are "far away" from any cluster)
may be more interesting than common cases.
Applications of outlier detection include the detection of credit card fraud and the
monitoring of criminal activities in electronic commerce.
For example, exceptional cases in credit card transactions, such as very expensive and
infrequent purchases, may be of interest as possible fraudulent activities.
Clustering is known as unsupervised learning because the class label information is not
present.
For this reason, clustering is a form of learning by observation, rather than learning by
examples.
Requirements for Cluster Analysis
Scalability: clustering all the data instead of only samples
Ability to deal with different types of attributes: numerical, binary, categorical, ordinal,
linked, and mixtures of these
Discovery of clusters with arbitrary shape
Requirements for domain knowledge to determine input parameters
Constraint-based clustering: the user may give inputs on constraints, and domain
knowledge is used to determine input parameters
Interpretability and usability
Others:
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Hierarchical Methods
A hierarchical clustering method works by grouping data objects into a hierarchy or "tree"
of clusters.
Representing data objects in the form of a hierarchy is useful for data summarization and
visualization.
Hierarchical clustering methods can encounter difficulties regarding the selection of
merge or split points. Such a decision is critical, because once a group of objects is
merged or split, the process at the next step will operate on the newly generated clusters.
It will neither undo what was done previously, nor perform object swapping between
clusters. Thus, merge or split decisions, if not well chosen, may lead to low-quality
clusters.
Moreover, the methods do not scale well because each decision of merge or split needs
to examine and evaluate many objects or clusters.
A promising direction for improving the clustering quality of hierarchical methods is to
integrate hierarchical clustering with other clustering techniques, resulting in multiple-
phase (or multiphase) clustering.
There are two types of hierarchical clustering:
1. Agglomerative hierarchical clustering
2. Divisive hierarchical clustering
Agglomerative hierarchical clustering
Group data objects in a bottom-up fashion.
Initially each data object is in its own cluster.
Then we merge these atomic clusters into larger and larger clusters, until all of the objects
are in a single cluster or until certain termination conditions are satisfied.
A user can specify the desired number of clusters as a termination condition.
Divisive hierarchical clustering
Groups data objects in a top-down fashion.
Initially all data objects are in one cluster.
We then subdivide the cluster into smaller and smaller clusters, until each object forms a
cluster on its own or certain termination conditions are satisfied, such as a desired number
of clusters being obtained.
AGNES & DIANA
Application of AGNES( AGglomerative NESting) and DIANA( Divisive ANAlysis) to a
data set of five objects, {a, b, c, d, e}.
AGNES-Explored
1. Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the
basic process of Johnson's (1967) hierarchical clustering is this:
2. Start by assigning each item to its own cluster, so that if you have N items, you now have
N clusters, each containing just one item. Let the distances (similarities) between the
clusters equal the distances (similarities) between the items they contain.
3. Find the closest (most similar) pair of clusters and merge them into a single cluster, so
that now you have one less cluster.
4. Compute distances (similarities) between the new cluster and each of the old clusters.
5. Repeat steps 3 and 4 until all items are clustered into a single cluster of size N.
6. Step 4 can be done in different ways, which is what distinguishes single-link from
complete-link and average-link clustering
single-link clustering
distance = shortest distance from any member of one cluster to any member of the other
cluster
complete-link clustering
distance = longest distance from any member of one cluster to any member of the other
cluster
average-link clustering
distance = average distance from any member of one cluster to any member of the other
cluster
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters),
called a dendrogram
A clustering of the data objects is obtained by cutting the dendrogram at the desired level,
then each connected component forms a cluster
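As an illustration (not part of the original notes), the following minimal Python sketch uses SciPy's hierarchical clustering; the method parameter selects single-, complete-, or average-link distances, and fcluster cuts the resulting dendrogram at a desired number of clusters. The data array X is a made-up example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D data set of five objects {a, b, c, d, e}
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# AGNES-style agglomerative clustering; method='single', 'complete', or 'average'
# chooses how the distance between clusters is computed.
Z = linkage(X, method='average')

# Cut the dendrogram so that each connected component forms a cluster (3 clusters here).
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)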
Major weakness of agglomerative clustering methods
Can never undo what was done previously
Do not scale well: time complexity of at least O(n²), where n is the number of
total objects
CF-Tree
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical
clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
(i)Branching factor: maximum number of children
(ii) Threshold: max diameter of sub-clusters stored at the leaf nodes
These two parameters implicitly control the resulting tree's size.
CF tree can be viewed as a multilevel compression of the data that tries to preserve the
inherent clustering structure of the data
A non-leaf node entry is a CF triplet and a child node link
Each non-leaf node contains a number of child nodes. The number of children that a
non-leaf node can contain is limited by a threshold called the branching factor.
A leaf node is a collection of CF triplets and links to the next and previous leaf nodes
Each leaf node contains a number of subclusters that contains a group of data instances
The diameter of a subcluster under a leaf node can not exceed a threshold.
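To make the clustering feature concrete, here is a minimal, hypothetical sketch (not from the notes) of a CF triplet CF = (N, LS, SS), i.e. the number of points, their linear sum, and their square sum, together with the additive merge and the diameter that is checked against the threshold; the class name and methods are illustrative and Euclidean data is assumed.

import numpy as np

class CF:
    """Clustering feature of a subcluster: N points, linear sum LS, square sum SS."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), float(p @ p)

    def merge(self, other):
        # CF additivity: absorbing a subcluster just adds the two CF triplets.
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

    def diameter(self):
        # Average pairwise distance among member points; compared with the threshold.
        if self.N < 2:
            return 0.0
        d2 = (2 * self.N * self.SS - 2 * float(self.LS @ self.LS)) / (self.N * (self.N - 1))
        return float(np.sqrt(max(d2, 0.0)))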
For Phase 1, the CF-tree is built dynamically as objects are inserted. Thus, the method is
incremental.
An object is inserted into the closest leaf entry (subcluster).
If the diameter of the subcluster stored in the leaf node after insertion is larger than the
threshold value, then the leaf node and possibly other nodes are split.
After the insertion of the new object, information about the object is passed toward the
root of the tree.
The size of the CF-tree can be changed by modifying the threshold.
If the size of the memory that is needed for storing the CF-tree is larger than the size of
the main memory, then a larger threshold value can be specified and the CF-tree is
rebuilt.
The rebuild process is performed by building a new tree from the leaf nodes of the old
tree. Thus, the process of rebuilding the tree is done without the necessity of rereading all
the objects or points.
This is similar to the insertion and node split in the construction of B+-trees. Therefore,
for building the tree, data has to be read just once.
Some heuristics and methods have been introduced to deal with outliers and improve the
quality of CF-trees by additional scans of the data.
Once the CF-tree is built, any clustering algorithm, such as a typical partitioning
algorithm, can be used with the CF-tree in Phase 2.
BIRCH overview
Effectiveness of BIRCH
Given a limited amount of main memory, BIRCH can minimize the time required
for I/O.
BIRCH is a scalable clustering algorithm with respect to the number of objects,
and it produces good-quality clustering of the data.
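For reference, scikit-learn provides a Birch implementation whose parameters mirror the branching factor and threshold described above; a small usage sketch with made-up data (parameter values are illustrative only):

import numpy as np
from sklearn.cluster import Birch

X = np.random.RandomState(0).rand(1000, 2)   # made-up data set

# threshold ~ maximum diameter of leaf subclusters, branching_factor ~ maximum children
# per node; n_clusters runs a global clustering (Phase 2) over the leaf CF entries.
model = Birch(threshold=0.1, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)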
Chameleon: Multiphase Hierarchical Clustering
Figure: Chameleon overall framework - the data set is partitioned into many small
subclusters, which are then merged (merge partition) into the final clusters.
To determine the pairs of most similar subclusters, Chameleon takes into account both the
interconnectivity and the closeness of the clusters.
Specifically, Chameleon determines the similarity between each pair of clusters Ci
and Cj according to their relative interconnectivity RI(Ci, Cj) and their relative
closeness RC(Ci, Cj).
Relative interconnectivity:
RI(Ci, Cj) = |EC{Ci,Cj}| / ((|EC_Ci| + |EC_Cj|) / 2)
where EC{Ci,Cj} is the edge cut, as previously defined, for a cluster containing both
Ci and Cj. Similarly, EC_Ci (or EC_Cj) is the minimum sum of the cut edges that
partition Ci (or Cj) into two roughly equal parts.
Relative closeness:
RC(Ci, Cj) = S_EC{Ci,Cj} / ((|Ci| / (|Ci| + |Cj|)) S_EC_Ci + (|Cj| / (|Ci| + |Cj|)) S_EC_Cj)
where S_EC{Ci,Cj} is the average weight of the edges that connect vertices in Ci
to vertices in Cj, and S_EC_Ci (or S_EC_Cj) is the average weight of the edges that
belong to the min-cut bisector of cluster Ci (or Cj).
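As a tiny worked sketch (illustrative only), the two measures can be computed directly once a graph-partitioning step has produced the edge-cut statistics; the function names and inputs below are assumptions, not Chameleon's actual code.

def relative_interconnectivity(ec_ij, ec_i, ec_j):
    # RI(Ci, Cj) = |EC{Ci,Cj}| / ((|EC_Ci| + |EC_Cj|) / 2)
    return ec_ij / ((ec_i + ec_j) / 2.0)

def relative_closeness(s_ij, s_i, s_j, n_i, n_j):
    # RC(Ci, Cj) = S_EC{Ci,Cj} / (|Ci|/(|Ci|+|Cj|) * S_EC_Ci + |Cj|/(|Ci|+|Cj|) * S_EC_Cj)
    return s_ij / ((n_i / (n_i + n_j)) * s_i + (n_j / (n_i + n_j)) * s_j)

# Chameleon merges the pair of subclusters that maximizes a combination of the two,
# e.g. RI(Ci, Cj) * RC(Ci, Cj) ** alpha for a user-specified alpha.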
Probabilistic hierarchical clustering
Use probabilistic models to measure distances between clusters
Generative model: Regard the set of data objects to be clustered as a sample of the
underlying data generation mechanism to be analyzed
Easy to understand, same efficiency as algorithmic agglomerative clustering
method, can handle partially observed data.
In practice, assume the generative models adopt common distributions functions,
e.g., Gaussian distribution or Bernoulli distribution, governed by parameters
Generative Model
Given a set of 1-D points X = {x1, …, xn} for clustering analysis, assume they are
generated by a Gaussian distribution N(μ, σ²):
P(x | μ, σ²) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
The task of learning the generative model is to find the parameters μ and σ² such that the
likelihood of X, L(N(μ, σ²) : X) = P(X | μ, σ²) = ∏ i=1..n P(xi | μ, σ²), is maximized.
For a set of objects partitioned into m clusters C1, …, Cm, the quality can be
measured by
Q({C1, …, Cm}) = ∏ i=1..m P(Ci)
where P(Ci) is the maximum likelihood of the objects in cluster Ci under its fitted model.
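A minimal sketch (illustrative, not from the notes) of fitting the 1-D Gaussian by maximum likelihood and scoring a partition by the product of per-cluster likelihoods, computed in log form for numerical stability:

import numpy as np

def gaussian_log_likelihood(points):
    # MLE parameters of one cluster: mu = sample mean, sigma^2 = sample variance.
    x = np.asarray(points, dtype=float)
    mu, var = x.mean(), max(x.var(), 1e-12)
    return float(np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)))

def clustering_quality(clusters):
    # log Q({C1,...,Cm}) = sum_i log P(Ci), i.e. the product of cluster likelihoods.
    return sum(gaussian_log_likelihood(c) for c in clusters)

# Made-up example: two well-separated groups score better kept apart than merged.
a, b = np.random.normal(0, 1, 50), np.random.normal(10, 1, 50)
print(clustering_quality([a, b]), clustering_quality([np.concatenate([a, b])]))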
Density-Based Methods
•To find clusters of arbitrary shape, alternatively, we can model clusters as dense regions in
the data space, separated by sparse regions. This is the main strategy behind density-based
clustering methods, which can discover clusters of nonspherical shape.
Several interesting methods:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain
of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from
pi
Density-connected
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o
such that both, p and q are density-reachable from o w.r.t. Eps and MinPts
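A minimal usage sketch with scikit-learn's DBSCAN (the data X is made up; eps and min_samples correspond to Eps and MinPts above):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(1).rand(300, 2)              # made-up data set
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
# label -1 marks noise; every other label is the id of a maximal set of
# density-connected points, i.e. one DBSCAN cluster.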
DENCLUE models the overall density of the data space as the sum of influence functions of
all objects and assigns each object to a density attractor, a local maximum of the density
function, by a hill-climbing procedure.
The hill-climbing procedure stops at step k > 0 if f(x^(k+1)) < f(x^k), and assigns x to
the density attractor x* = x^k. An object x is an outlier or noise if it converges in the
hill-climbing procedure to a local maximum x* with f(x*) < ξ, a noise threshold.
A cluster in DENCLUE is a set of density attractors X and a set of input objects C
such that each object in C is assigned to a density attractor in X, and there exists a
path between every pair of density attractors where the density is above ξ. By using
multiple density attractors connected by paths, DENCLUE can find clusters of
arbitrary shape.
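A rough, illustrative sketch (not the actual DENCLUE implementation) of a Gaussian influence function and a hill-climbing step that moves an object toward its density attractor; the bandwidth h, step count, and tolerance are assumed values.

import numpy as np

def density(x, data, h=1.0):
    # Kernel density estimate at x with Gaussian influence functions of bandwidth h.
    d2 = np.sum((data - x) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2 * h * h))))

def climb_to_attractor(x, data, h=1.0, steps=100, tol=1e-5):
    # Mean-shift style hill climbing: move x toward higher density until it stabilizes.
    x = np.asarray(x, dtype=float)
    for _ in range(steps):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * h * h))
        new_x = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(new_x - x) < tol:
            break
        x = new_x
    return x   # approximate density attractor; treat x as noise if density(x, data) < xi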
Grid-Based Methods
The grid-based clustering approach uses a multiresolution grid data structure.
It quantizes the object space into a finite number of cells that form a grid structure on
which all of the operations for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically
independent of the number of data objects, yet dependent only on the number of cells
in each dimension of the quantized space.
Finding regions
After we have obtained all the relevant cells at the final level, we need to output the
regions that satisfy the query.
We can do this using breadth-first search.
Breadth First Search
We examine cells within a certain distance from the center of the current cell.
If the average density within this small area is greater than the specified density, mark
this area.
Put the relevant cells just examined into the queue.
Take an element from the queue and repeat the same procedure, except that only relevant
cells that have not been examined before are enqueued. When the queue is empty, we have
identified one region.
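An illustrative sketch (made-up grid and threshold, hypothetical function name) of the breadth-first region growing described above, over a 2-D array of cell densities:

import numpy as np
from collections import deque

def find_region(density, start, threshold):
    # Grow one region of relevant (dense) cells from `start` using breadth-first search.
    rows, cols = density.shape
    region, seen, queue = [], {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if density[r, c] < threshold:
            continue                      # not dense enough: do not mark or expand it
        region.append((r, c))             # mark this relevant cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # nearby cells
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen:
                seen.add((nr, nc))        # enqueue only cells not examined before
                queue.append((nr, nc))
    return region                         # queue empty: one region has been identified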
CLIQUE
Consider an example where the embedding data space contains three dimensions: age,
salary, and vacation.
A 2-D cell, say in the subspace formed by age and salary, contains l points
only if the projection of this cell in every dimension, that is, age and salary,
respectively, contains at least l points.
CLIQUE performs clustering in two steps.
In the first step, CLIQUE partitions the d-dimensional data space into non
overlapping rectangular units, identifying the dense units among these.
CLIQUE finds dense cells in all of the subspaces.
The subspaces representing these dense units are intersected to form a
candidate search space in which dense units of higher dimensionality may
exist.
This approach of selecting candidates is quite similar to the Apriori candidate-generation
process.
Here it is expected that if something is dense in a higher-dimensional space, it
cannot be sparse in the lower-dimensional subspaces:
If a k-dimensional unit is dense, then so are its projections in (k-1)-
dimensional space.
Given a k-dimensional candidate dense unit, if any of its (k-1)-dimensional projection
units is not dense, then the k-dimensional unit cannot be dense.
So, we can generate candidate dense units in k-dimensional space from the
dense units found in (k-1)-dimensional space.
The resulting space searched is much smaller than the original space.
The dense units are then examined in order to determine the clusters.
Figure: Dense units found with respect to age for the dimensions salary and vacation are intersected in order
to provide a candidate search space for dense units of higher dimensionality
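A small sketch (illustrative names, Apriori-style join) of generating candidate dense units in k dimensions from the dense units found in (k-1) dimensions; representing each unit as a frozenset of (dimension, interval index) pairs is an assumption of this sketch, not CLIQUE's actual data structure.

from itertools import combinations

def candidate_units(dense_km1):
    """dense_km1: set of frozensets, each holding k-1 (dimension, interval_index) pairs."""
    candidates = set()
    for u, v in combinations(dense_km1, 2):
        merged = u | v
        if len(merged) == len(u) + 1:                  # the two units share k-2 pairs
            dims = [d for d, _ in merged]
            if len(dims) == len(set(dims)):            # no dimension appears twice
                candidates.add(frozenset(merged))
    # Prune: a k-dimensional unit can be dense only if all its (k-1)-dim projections are.
    return {c for c in candidates
            if all(frozenset(s) in dense_km1 for s in combinations(c, len(c) - 1))}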
Bi-Clustering
Ex. 1. Gene expression data analysis: a microarray data matrix records the expression
levels of genes (rows) under different samples/conditions (columns), and it can be
clustered along either dimension.
When analyzing in the gene dimension, we treat each gene as an object
and treat the samples/conditions as attributes. By mining in the gene
dimension, we may find patterns shared by multiple genes, or cluster
genes into groups. For example, we may find a group of genes that
express themselves similarly, which is highly interesting in
bioinformatics, such as in finding pathways.
When analyzing in the sample/condition dimension, we treat each
sample/condition as an object and treat the genes as attributes. In this
way, we may find patterns of samples/conditions, or cluster
samples/conditions into groups. For example, we may find the
differences in gene expression by comparing a group of tumor samples
and nontumor samples.
Ex. 2. Clustering customers and products
Another bi-clustering problem
Types of Biclusters
Let A = {a1, ..., an} be a set of genes and B = {b1, …, bm} a set of conditions
A bi-cluster: a submatrix (I, J), with I a subset of A and J a subset of B, where the genes
and conditions follow some consistent patterns
4 types of bi-clusters (ideal cases)
Bi-clusters with constant values:
for any i in I and j in J, eij = c
Bi-clusters with constant values on rows:
eij = c + αi
Also, it can be constant values on columns: eij = c + βj
Bi-clusters with coherent values (aka. pattern-based clusters)
eij = c + αi + βj
Bi-clusters with coherent evolutions on rows:
for any i1, i2 in I and j1, j2 in J, (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0
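A minimal numpy check (made-up numbers) that an ideal coherent-values bicluster eij = c + αi + βj has zero mean squared residue, the quantity that optimization-based methods such as the δ-cluster algorithm try to keep small:

import numpy as np

c = 5.0
alpha = np.array([0.0, 2.0, -1.0])          # row adjustments (alpha_i)
beta = np.array([0.0, 1.0, 3.0, -2.0])      # column adjustments (beta_j)
E = c + alpha[:, None] + beta[None, :]      # coherent-values bicluster e_ij = c + alpha_i + beta_j

# Residue e_ij - rowmean_i - colmean_j + overallmean; zero for a perfect coherent bicluster.
residue = E - E.mean(axis=1, keepdims=True) - E.mean(axis=0, keepdims=True) + E.mean()
print(np.mean(residue ** 2))                # -> 0.0 up to floating-point error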
Bi-Clustering Methods
In real data sets, such perfect biclusters rarely exist.
Methods
(i) Optimization-based methods (ii) enumeration methods
Optimization-based methods
Try to find one submatrix at a time that achieves the best significance as a bi-cluster
Due to the cost in computation, greedy search is employed to find locally optimal
bi-clusters
Ex. δ-Cluster Algorithm (Cheng and Church, ISMB’2000)
Enumeration methods
Use a tolerance threshold to specify the degree of noise allowed in the bi-clusters
to be mined
Then try to enumerate all submatrices as bi-clusters that satisfy the requirements
Ex. δ-pCluster Algorithm (H. Wang et al.’ SIGMOD’2002, MaPle: Pei et al.,
ICDM’2003)
Optimization Using the δ-Cluster Algorithm
The Cheng and Church δ-cluster algorithm measures the quality of a submatrix (I, J) by its
mean squared residue,
H(I, J) = (1 / (|I||J|)) Σ (i in I, j in J) (eij − eiJ − eIj + eIJ)²,
where eiJ is the mean of row i, eIj is the mean of column j, and eIJ is the mean of the
whole submatrix. A submatrix is a δ-bicluster if H(I, J) ≤ δ for some small threshold
δ ≥ 0. The algorithm greedily removes rows and columns that contribute most to the residue,
then adds rows and columns back as long as the δ constraint still holds, yielding one
locally optimal bi-cluster at a time.
Ex. To cluster the points in the figure, no subspace of the original space (the X or Y axis
alone) can help, since all three clusters project into overlapping regions on the X and Y
axes.
If we construct a new dimension, shown as the dashed line, the three clusters become
apparent when the points are projected onto this new dimension.
Dimensionality reduction methods
Feature selection and extraction: but these may not focus on finding the clustering
structure
Spectral clustering: combines feature extraction and clustering (i.e., uses the
spectrum of the similarity matrix of the data to perform dimensionality reduction
for clustering in fewer dimensions)
Examples: Normalized Cuts (Shi and Malik, CVPR'97 / PAMI'2000) and the Ng-Jordan-Weiss
algorithm
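A minimal usage sketch with scikit-learn's SpectralClustering (made-up data and parameter values), which performs the spectrum-based dimensionality reduction and then clusters in the reduced space:

import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.RandomState(2).rand(200, 5)   # made-up high-dimensional data
labels = SpectralClustering(n_clusters=3, affinity='nearest_neighbors',
                            n_neighbors=10, assign_labels='kmeans').fit_predict(X)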
Algorithm: SCAN for clusters on graph data.
Input: a graph G = (V, E), a similarity threshold ε, and a population threshold μ
Output: a set of clusters
Method: set all vertices in V unlabeled
for all unlabeled vertices u do
    if u is a core then
        generate a new cluster-id c
        insert all v in N_ε(u) into a queue Q
        while Q is not empty do
            w ← the first vertex in Q
            R ← the set of vertices that can be directly reached from w
            for all s in R do
                if s is unlabeled or labeled as nonmember then
                    assign the current cluster-id c to s
                endif
                if s is unlabeled then
                    insert s into queue Q
                endif
            endfor
            remove w from Q
        endwhile
    else
        label u as nonmember
    endif
endfor
for all vertices u labeled nonmember do
    if there exist x, y in Γ(u) such that x and y have different cluster-ids then
        label u as hub
    else
        label u as outlier
    endif
endfor
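For readers who prefer running code, here is a rough Python sketch of the same SCAN procedure (illustrative only; the structural similarity, function name, and default thresholds are assumptions based on the algorithm above, not a reference implementation):

from collections import deque
from math import sqrt

def scan(adj, eps=0.7, mu=2):
    """adj: dict mapping each vertex to the set of its neighbors (undirected graph)."""
    def sigma(u, v):
        # Structural similarity of u and v over their closed neighborhoods.
        nu, nv = adj[u] | {u}, adj[v] | {v}
        return len(nu & nv) / sqrt(len(nu) * len(nv))

    def eps_neighborhood(u):
        return {v for v in adj[u] | {u} if sigma(u, v) >= eps}

    label, nonmember, cluster_id = {}, set(), 0
    for u in adj:
        if u in label or u in nonmember:
            continue
        if len(eps_neighborhood(u)) >= mu:      # u is a core: start a new cluster
            cluster_id += 1
            label[u] = cluster_id
            queue = deque([u])
            while queue:
                w = queue.popleft()
                reach = eps_neighborhood(w)
                if len(reach) < mu:             # only cores spread the cluster further
                    continue
                for s in reach:
                    if s not in label:          # unlabeled or nonmember vertex
                        if s in nonmember:
                            nonmember.discard(s)    # becomes a border member
                        else:
                            queue.append(s)         # fresh vertex: expand from it later
                        label[s] = cluster_id
        else:
            nonmember.add(u)
    # Remaining nonmembers bridging two or more clusters are hubs, the rest are outliers.
    hubs, outliers = set(), set()
    for u in nonmember:
        ids = {label[v] for v in adj[u] if v in label}
        (hubs if len(ids) >= 2 else outliers).add(u)
    return label, hubs, outliers

# Tiny made-up example graph given as adjacency sets:
g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(scan(g, eps=0.6, mu=3))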
Data Mining Applications
o There exist nontrivial gaps between data mining principles and domain-specific
applications
o Retail industry
o Telecommunication industry
Financial data collected in banks and financial institutions are often relatively complete,
reliable, and of high quality
Design and construction of data warehouses for multidimensional data analysis and data
mining
o View the debt and revenue changes by month, by region, by sector, and by other
factors
o Access statistical information such as max, min, total, average, trend, etc.
Retail industry: huge amounts of data on sales, customer shopping history, etc.
Ex. 1. Design and construction of data warehouses based on the benefits of data mining
Telecommunication industry: a rapidly expanding and highly competitive industry with a
great demand for data mining
Biological data analysis: there is a tremendous number of ways that nucleotides can be
ordered and sequenced to form distinct genes
o Data cleaning and data integration methods developed in data mining will help
o Compare the frequently occurring patterns of each class (e.g., diseased and
healthy)
o Association analysis may help determine the kinds of genes that are likely to
co-occur in target samples