DM UNIT-4 Part2
Cluster Analysis: Basic Concepts and Methods- Cluster Analysis, partitioning methods,
Hierarchical Methods and evaluation of Clustering
What is a Cluster?
The given data is divided into different groups by combining similar objects into a group. This group is
nothing but a cluster.
A cluster is nothing but a collection of similar data which is grouped together.
Clustering is known as unsupervised learning because the class label information is not present. For
this reason, clustering is a form of learning by observation, rather than learning by examples.
A good clustering algorithm aims to obtain clusters whose:
• The intra-cluster similarity is high, which means the data objects within a cluster are
similar to one another.
• The inter-cluster similarity is low, which means the data objects in different clusters are
dissimilar to one another.
Requirements for Cluster Analysis:
Ability to deal with different types of attributes: Many algorithms are designed to cluster numeric (interval-
based) data. However, applications may require clustering other data types, such as binary, nominal
(categorical), and ordinal data, or mixtures of these data types. Recently, more and more applications need
clustering techniques for complex data types such as graphs, sequences, images, and documents.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on
Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical
clusters with similar size and density. However, a cluster could be of any shape. Consider sensors, for example,
which are often deployed for environment surveillance. Cluster analysis on sensor readings can detect
interesting phenomena. We may want to use clustering to find the frontier of a running forest fire, which is
often not spherical. It is important to develop algorithms that can detect clusters of arbitrary shape.
Requirements for domain knowledge to determine input parameters: Many clustering algorithms
require users to provide domain knowledge in the form of input parameters such as the desired number of
clusters. Consequently, the clustering results may be sensitive to such parameters. Parameters are often hard
to determine, especially for high-dimensionality data sets and where users have yet to grasp a deep
understanding of their data. Requiring the specification of domain knowledge not only burdens users, but
also makes the quality of clustering difficult to control.
Ability to deal with noisy data: Most real-world data sets contain outliers and/or missing, unknown, or
erroneous data. Sensor readings, for example, are often noisy—some readings may be inaccurate due to the
sensing mechanisms, and some readings may be erroneous due to interferences from surrounding transient
objects. Clustering algorithms can be sensitive to such noise and may produce poor-quality clusters. Therefore,
we need clustering methods that are robust to noise.
Incremental clustering and insensitivity to input order: In many applications, incremental updates
(representing newer data) may arrive at any time. Some clustering algorithms cannot incorporate
incremental updates into existing clustering structures and, instead, have to recompute a new clustering from
scratch. Clustering algorithms may also be sensitive to the input data order. That is, given a set of data objects,
clustering algorithms may return dramatically different clusterings depending on the order in which the objects
are presented. Incremental clustering algorithms and algorithms that are insensitive to the input order are
needed.
Capability of clustering high-dimensionality data: A data set can contain numerous dimensions or
attributes. When clustering documents, for example, each keyword can be regarded as a dimension, and
there are often thousands of keywords. Most clustering algorithms are good at handling low-dimensional data
such as data sets involving only two or three dimensions. Finding clusters of data objects in a
high-dimensional space is challenging, especially considering that such data can be very sparse and highly
skewed.
Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of
constraints. Suppose that your job is to choose the locations for a given number of new automatic teller
machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such
as the city’s rivers and highway networks and the types and number of customers per cluster. A challenging
task is to find data groups with good clustering behavior that satisfy specified constraints.
Interpretability and usability: Users want clustering results to be interpretable, comprehensible, and
usable. It is important to study how an application goal may influence the selection of clustering features and
clustering methods.
The following are orthogonal aspects with which clustering methods can be compared:
The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among
the clusters. That is, all the clusters are at the same level conceptually. Such a method is useful, for example, for
partitioning customers into groups so that each group has its own manager.
Separation of clusters: Some methods partition data objects into mutually exclusive clusters. When clustering
customers into groups so that each group is taken care of by one manager, each customer may belong to
only one group.
Similarity measure: Some methods determine the similarity between two objects by the distance between
them. Such a distance can be defined on Euclidean space. Similarity measures play a fundamental role in the
design of clustering methods.
Clustering space: Many clustering methods search for clusters within the entire given data space. These
methods are useful for low-dimensionality data sets. With high-dimensional data, however, there can be many
irrelevant attributes, which can make similarity measurements unreliable. Consequently, clusters found in the
full space are often meaningless. It’s often better to instead search for clusters within different subspaces of the
same data set. Subspace clustering discovers clusters and subspaces (often of low dimensionality) that
manifest object similarity.
Overview of Basic Clustering Methods:
The clustering methods can be classified into the following categories:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
Partitioning Method
It is used to make partitions on the data in order to form clusters.
If n partitions are made of p objects in the database, then each partition is represented by a cluster,
and n ≤ p.
The two conditions which need to be satisfied with this Partitioning Clustering Method are:
• Each object must belong to exactly one group.
• Each group must contain at least one object.
In the partitioning method, there is a technique called iterative relocation, which means an
object may be moved from one group to another to improve the partitioning.
Hierarchical clustering
Hierarchical clustering is a method of data mining that groups similar data points into clusters by
creating a hierarchical structure of the clusters.
In the agglomerative approach, identify the two clusters that are closest together and merge them;
these steps are repeated until all the clusters are merged together.
In Hierarchical Clustering, the aim is to produce a hierarchical series of nested clusters. A diagram
called a Dendrogram (a tree-like diagram that records the sequences of merges or
splits) graphically represents this hierarchy.
A grid-based method first quantizes the object space into a finite number of cells that form a grid structure,
and then performs clustering on the grid structure. STING is a typical example of a grid-based method based on
statistical information stored in grid cells. CLIQUE is a grid-based and subspace clustering algorithm.
Partitioning Methods:
• k-Means: A Centroid-Based Technique
• k-Medoids: A Representative Object-Based Technique
k-Means: A Centroid-Based Technique:
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into
k clusters, C1, ..., Ck, that is, Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j.
An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one
another but dissimilar to objects in other clusters. That is, the objective function aims for high intracluster
similarity and low intercluster similarity.
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different
clusters. Here K defines the number of pre-defined clusters that need to be created in the process, as if K=2,
there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each
data point belongs to only one group, and each group holds data points with similar properties.
To measure the distance between data points and centroid,
we can use any method such as
• Euclidean distance or
• Manhattan distance.
The quality of a partitioning is measured by the within-cluster sum of squared errors,
E = Σ_{i=1}^{k} Σ_{p∈Ci} dist(p, ci)²,
where E is the sum of the squared error for all objects in the data set; p is the point in space representing a given
object; and ci is the centroid of cluster Ci (both p and ci are multidimensional).
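For concreteness, a minimal Python sketch of this objective; it assumes the points, their cluster labels, and the centroids are given as NumPy arrays (the function name is illustrative, not from the notes):

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    # points: (n, d), labels: (n,) cluster indices, centroids: (k, d)
    diffs = points - centroids[labels]   # vector from each point p to its centroid ci
    return float(np.sum(diffs ** 2))     # E = sum over all objects of ||p - ci||^2
```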
“How does the k-means algorithm work?”
The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. It
proceeds as follows.
First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or center. For
each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the
Euclidean distance between the object and the cluster mean.
The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it computes the
new mean using the objects assigned to the cluster in the previous iteration.
All the objects are then reassigned using the updated means as the new cluster centers. The iterations continue
until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in
the previous round.
The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative
relocation
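A minimal NumPy sketch of this iterative relocation (an illustrative implementation, not the notes' own code): random objects are picked as the initial means, objects are assigned to the nearest mean, and the means are recomputed until the assignment is stable.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Minimal k-means: random initial centers, assign to nearest centroid, recompute means."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster whose mean is nearest (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable -> converged
        labels = new_labels
        # Update step: recompute each cluster mean from its currently assigned objects.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```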
Example: Clustering by k-means partitioning. Consider a set of objects located in 2-D space, as depicted in
the figure below. Let k = 2, that is, the user would like the objects to be partitioned into two clusters.
Drawbacks of k-means:
• The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very
different size.
• Moreover, it is sensitive to noise and outlier data points because a small number of such data can
substantially influence the mean value.
“How can we make the k-means algorithm more scalable?”
• One approach to making the k-means method more efficient on large data sets is to use a good-sized set
of samples in clustering.
• Another is to employ a filtering approach that uses a spatial hierarchical data index to save costs when
computing means.
• A third approach explores the microclustering idea, which first groups nearby objects into
“microclusters” and then performs k-means clustering on the microclusters.
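As a related illustration of scaling k-means (it is not the specific sampling, filtering, or microclustering methods listed above), scikit-learn's MiniBatchKMeans updates the centers from small random samples of the data on each iteration:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)
# Each iteration updates the centroids from a small random sample (mini-batch) of the data.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_.shape)  # (5, n_features)
```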
k-Medoids: A Representative Object-Based Technique:
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects
(medoids) to represent the clusters. The partitioning is then done by minimizing the sum of the dissimilarities
between each object p and its corresponding representative object, that is, the absolute-error criterion
E = Σ_{i=1}^{k} Σ_{p∈Ci} dist(p, oi),
where E is the sum of the absolute error for all objects p in the data set, and oi is the representative object of Ci.
This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute
error.
The Partitioning Around Medoids (PAM) algorithm is a popular realization of k-medoids clustering. It
tackles the problem in an iterative, greedy way. Like the k-means algorithm, the initial representative objects
(called seeds) are chosen arbitrarily. We consider whether replacing a representative object by a
non-representative object would improve the clustering quality. All the possible replacements are tried out.
The iterative process of replacing representative objects by other objects continues until the quality of the
resulting clustering cannot be improved by any replacement.
This quality is measured by a cost function of the average dissimilarity between an object and the
representative object of its cluster.
Each time a reassignment occurs, a difference in absolute error, E, is contributed to the cost function. Therefore,
the cost function calculates the difference in absolute-error value if a current representative object is replaced
by a non-representative object. The total cost of swapping is the sum of the costs incurred by all non-representative
objects. If the total cost is negative, then the current representative object, oj, is replaced (swapped) with the
non-representative object, o_random, because the actual absolute error E would be reduced. If the total cost is
positive, the current representative object, oj, is considered acceptable, and nothing is changed in that iteration.
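A compact, illustrative PAM-style sketch of the swapping logic described above, assuming Euclidean distances and a small data set (an exhaustive try-all-swaps loop, not an optimized implementation):

```python
import numpy as np

def pam(points, k, seed=0):
    """Greedy k-medoids (PAM-style): keep swapping medoids while the total absolute error decreases."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)  # pairwise distances
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))                    # arbitrary initial seeds

    def total_cost(meds):
        # Each object contributes its distance to the nearest representative object (absolute error E).
        return dist[:, meds].min(axis=1).sum()

    best = total_cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for o in range(n):                    # try every swap: medoid at position i -> non-medoid o
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(candidate)
                if cost < best:                   # negative swap cost: accept the replacement
                    best, medoids, improved = cost, candidate, True
    labels = dist[:, medoids].argmin(axis=1)
    return labels, points[medoids]
```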
Hierarchical Methods:
• Agglomerative versus Divisive Hierarchical Clustering
• Distance Measures in Algorithmic Methods
• BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees
• Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling
• Probabilistic Hierarchical Clustering
Hierarchical Methods:
A hierarchical clustering method works by grouping data objects into a hierarchy or “tree” of clusters.
Representing data objects in the form of a hierarchy is useful for data summarization and visualization.
A hierarchical method creates a hierarchical decomposition of the given set of data objects. The method can
be classified as being either agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical
decomposition is formed. To compensate for the rigidity of merge or split, the quality of hierarchical
agglomeration can be improved by analyzing object linkages at each hierarchical partitioning (e.g., in
Chameleon), or by first performing microclustering (that is, grouping objects into “microclusters”) and then
operating on the microclusters with other clustering techniques such as iterative relocation (as in BIRCH).
For example, as the manager of human resources at AllElectronics, you may organize your employees
into major groups such as executives, managers, and staff. You can further partition these groups into smaller
subgroups. For instance, the general group of staff can be further divided into subgroups of senior officers,
officers, and trainees. All these groups form a hierarchy. We can easily summarize or characterize the data that
are organized into a hierarchy, which can be used to find, say, the average salary of managers and of officers.
A hierarchical clustering method can be either agglomerative or divisive, depending on whether the
hierarchical decomposition is formed in a bottom-up (merging) or topdown (splitting) fashion. Let’s have a
closer look at these strategies.
Because exactly two clusters are merged per iteration, where each cluster contains at least one object, an
agglomerative method requires at most n iterations.
Example :
Agglomerative versus divisive hierarchical clustering. Figure shows the application of AGNES (AGglomerative
NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive
hierarchical clustering method, on a data set of five objects, {a,b,c,d, e}.
Initially, AGNES, the agglomerative method, places each object into a cluster of its own.
The clusters are then merged step-by-step according to some criterion.
For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum
Euclidean distance between any two objects from different clusters.
This is a single-linkage approach in that each cluster is represented by all the objects in the cluster, and the
similarity between two clusters is measured by the similarity of the closest pair of data points belonging to
different clusters.
The cluster-merging process repeats until all the objects are eventually merged to form one cluster.
Figure-2 : Dendrogram representation for hierarchical clustering of data objects {a,b,c,d, e}.
DIANA, the divisive method, proceeds in the contrasting way. All the objects are used to form one initial cluster.
The cluster is split according to some principle such as the maximum Euclidean distance between the closest
neighboring objects in the cluster. The cluster-splitting process repeats until, eventually, each new cluster
contains only a single object.
Dendrogram:
A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It
shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method)
step-by-step. Figure-2 shows a dendrogram for the five objects presented in Figure-1, where l = 0 shows the five
objects as singleton clusters at level 0. At l = 1, objects a and b are grouped together to form the first cluster, and
they stay together at all subsequent levels. We can also use a vertical axis to show the similarity scale between
clusters. For example, when the similarity of two groups of objects, {a,b} and {c,d, e}, is roughly 0.16, they are
merged together to form a single cluster.
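An illustrative SciPy sketch of agglomerative (AGNES-style) single-linkage clustering with a dendrogram; the five 2-D coordinates standing in for objects a–e are made up for the example:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five illustrative 2-D objects standing in for {a, b, c, d, e}.
X = np.array([[1.0, 1.0], [1.2, 1.1], [4.0, 4.0], [4.2, 4.1], [4.1, 5.0]])
Z = linkage(X, method='single')          # single linkage: distance of the closest pair between clusters
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
plt.ylabel('merge distance')             # vertical axis shows the level at which clusters are merged
plt.show()
```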
A challenge with divisive methods is how to partition a large cluster into several smaller ones. For example,
there are 2^(n−1) − 1 possible ways to partition a set of n objects into two exclusive subsets, where n is the number
of objects.
When an algorithm uses the minimum distance, dmin(Ci, Cj) = min{ |p − p′| : p ∈ Ci, p′ ∈ Cj }, to measure the
distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm.
Example :
Single versus complete linkages. Let us apply hierarchical clustering to the data set of Figure 10.8(a). Figure
10.8(b) shows the dendrogram using single linkage. Figure 10.8(c) shows the case using complete linkage,
where the edges between clusters {A,B,J,H} and {C,D,G,F,E} are omitted for ease of presentation. This example
shows that by using single linkages we can find hierarchical clusters defined by local proximity, whereas
complete linkage tends to find clusters opting for global closeness.
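To contrast the two linkages in code, scikit-learn's AgglomerativeClustering accepts linkage='single' or linkage='complete'; the data set here (two interleaved half-moons) is only an illustration of how single linkage follows local proximity while complete linkage prefers compact, globally close groups:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
single = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X)
complete = AgglomerativeClustering(n_clusters=2, linkage='complete').fit_predict(X)
# Single linkage can trace the two elongated "moons"; complete linkage may cut across them.
print(single[:10], complete[:10])
```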
BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees:
BIRCH summarizes a cluster of n d-dimensional objects x1, ..., xn by a clustering feature CF = ⟨n, LS, SS⟩, where
LS is the linear sum of the n points (Σ xi) and SS is the square sum of the n points (Σ xi²). A clustering feature is
essentially a summary of the statistics for the given cluster. Using a clustering feature, we can easily derive many
useful statistics of a cluster. For example, the cluster's centroid, x0, radius, R, and diameter, D, are
x0 = Σ_{i=1}^{n} xi / n = LS / n,
R = sqrt( Σ_{i=1}^{n} (xi − x0)² / n ),
D = sqrt( Σ_{i=1}^{n} Σ_{j=1}^{n} (xi − xj)² / (n(n − 1)) ),
all of which can be computed directly from ⟨n, LS, SS⟩.
Here, R is the average distance from member objects to the centroid, and D is the average pairwise distance
within a cluster. Both R and D reflect the tightness of the cluster around the centroid.
Summarizing a cluster using the clustering feature can avoid storing the detailed information about individual
objects or points. Instead, we only need a constant size of space to store the clustering feature. This is the key to
BIRCH efficiency in space. Moreover, clustering features are additive. That is, for two disjoint clusters, C1 and
C2, with the clustering features CF1 = ⟨n1, LS1, SS1⟩ and CF2 = ⟨n2, LS2, SS2⟩, respectively, the clustering feature
for the cluster formed by merging C1 and C2 is simply
CF1 + CF2 = ⟨n1 + n2, LS1 + LS2, SS1 + SS2⟩.
Example: Clustering feature. Suppose there are three points, (2,5), (3,2), and (4,3), in a cluster, C1. The
clustering feature of C1 is
CF1 = ⟨3, (2 + 3 + 4, 5 + 2 + 3), (2² + 3² + 4², 5² + 2² + 3²)⟩ = ⟨3, (9,10), (29,38)⟩.
Suppose that C1 is disjoint to a second cluster, C2, where CF2 = ⟨3, (35,36), (417,440)⟩. The clustering feature of a
new cluster, C3, that is formed by merging C1 and C2, is derived by adding CF1 and CF2. That is,
CF3 = CF1 + CF2 = ⟨3 + 3, (9 + 35, 10 + 36), (29 + 417, 38 + 440)⟩ = ⟨6, (44,46), (446,478)⟩.
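A small sketch that reproduces the clustering-feature arithmetic above, treating a CF as the triple (count, linear sum, square sum) so that merging two disjoint clusters is component-wise addition (function names are illustrative):

```python
import numpy as np

def clustering_feature(points):
    """CF = <n, LS, SS>: count, per-dimension linear sum, per-dimension square sum."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    """Clustering features are additive: merge two disjoint clusters by adding their CFs."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

cf1 = clustering_feature([(2, 5), (3, 2), (4, 3)])      # -> (3, [9, 10], [29, 38])
cf2 = (3, np.array([35.0, 36.0]), np.array([417.0, 440.0]))
print(merge_cf(cf1, cf2))                                # -> (6, [44, 46], [446, 478])
```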
A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
An example is shown in Figure. By definition, a nonleaf node in a tree has descendants or “children.” The nonleaf
nodes store sums of the CFs of their children, and thus summarize clustering information about their children.
Parameters of BIRCH Algorithm :
• threshold : the maximum radius of a sub-cluster in a leaf node of the CF tree; a new point is absorbed into
the closest sub-cluster only if the merged sub-cluster stays within this radius, otherwise a new sub-cluster
is started.
• branching_factor : This parameter specifies the maximum number of CF sub-clusters in each node
(internal node).
• n_clusters : The number of clusters to be returned after the entire BIRCH algorithm is complete i.e.,
number of clusters after the final clustering step. If set to None, the final clustering step is not performed
and intermediate clusters are returned.
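The parameters listed above correspond to scikit-learn's Birch estimator; a minimal usage sketch:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)
# threshold bounds the radius of a leaf sub-cluster; branching_factor bounds children per node.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = birch.fit_predict(X)
print(len(birch.subcluster_centers_), labels[:10])  # number of CF sub-clusters, first few labels
```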
Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling:
Specifically, Chameleon determines the similarity between each pair of clusters Ci and Cj according to their
relative interconnectivity, RI(Ci, Cj), and their relative closeness, RC(Ci, Cj).
The relative interconnectivity, RI(Ci, Cj), between two clusters, Ci and Cj, is defined as the absolute
interconnectivity between Ci and Cj, normalized with respect to the internal interconnectivity of the two
clusters, Ci and Cj. That is,
RI(Ci, Cj) = |EC{Ci,Cj}| / ( (|ECCi| + |ECCj|) / 2 ),
where EC{Ci,Cj} is the edge cut as previously defined for a cluster containing both Ci and Cj. Similarly, ECCi (or
ECCj) is the minimum sum of the cut edges that partition Ci (or Cj) into two roughly equal parts.
The relative closeness, RC(Ci, Cj), between a pair of clusters, Ci and Cj, is the absolute closeness between Ci
and Cj, normalized with respect to the internal closeness of the two clusters, Ci and Cj. It is defined as
RC(Ci, Cj) = SEC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|)) SECCi + (|Cj| / (|Ci| + |Cj|)) SECCj ),
where SEC{Ci,Cj} is the average weight of the edges that connect vertices in Ci to vertices in Cj, and SECCi (or
SECCj) is the average weight of the edges that belong to the min-cut bisector of cluster Ci (or Cj).
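For concreteness, a small sketch of the two ratios, assuming the edge-cut sums and average edge weights have already been computed from Chameleon's k-nearest-neighbor graph (the inputs are placeholders; this is not the graph-partitioning step itself):

```python
def relative_interconnectivity(ec_ij, ec_i, ec_j):
    """RI(Ci, Cj) = |EC{Ci,Cj}| / ((|ECCi| + |ECCj|) / 2)."""
    return ec_ij / ((ec_i + ec_j) / 2.0)

def relative_closeness(avg_ij, avg_i, avg_j, size_i, size_j):
    """RC(Ci, Cj): average connecting-edge weight normalized by the size-weighted
    average of the internal (min-cut bisector) edge weights."""
    total = size_i + size_j
    return avg_ij / ((size_i / total) * avg_i + (size_j / total) * avg_j)
```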
Probabilistic Hierarchical Clustering:
Fig: Merging clusters in probabilistic hierarchical clustering: (a) merging clusters C1 and C2 leads to an
increase in overall cluster quality, but merging clusters (b) C3 and (c) C4 does not.
A drawback of using probabilistic hierarchical clustering is that it outputs only one hierarchy with respect to a
chosen probabilistic model. It cannot handle the uncertainty of cluster hierarchies.
Evaluation of Clustering:
Clustering evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results
generated by a clustering method. The tasks include assessing clustering tendency, determining the number of
clusters, and measuring clustering quality.
Determining the number of clusters in a data set: Determining the optimal number of clusters in a data set
is a fundamental issue in partitioning clustering, such as k-means clustering, which requires the user to specify
the number of clusters k to be generated.
Measuring clustering quality: After applying a clustering method on a data set, we want to assess how good
the resulting clusters are. A number of measures can be used. Some methods measure how well the clusters fit
the data set, while others measure how well the clusters match the ground truth, if such truth is available. There
are also measures that score clusterings and thus can compare two sets of clustering results on the same data
set.
Assessing clustering tendency: The Hopkins statistic is commonly used to assess whether a data set D has a
non-uniform (clusterable) distribution. One of its steps samples n points, q1, ..., qn, uniformly from D; for each qi
(1 ≤ i ≤ n), we find the nearest neighbor of qi in D − {qi}, and let yi be the distance between qi and its nearest
neighbor in D − {qi}. That is,
yi = min{ dist(qi, v) : v ∈ D, v ≠ qi }.
These distances, together with the corresponding nearest-neighbor distances of points sampled uniformly from
the data space, are combined to compute the Hopkins statistic.
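A small sketch of this nearest-neighbor step using a k-d tree; it assumes the sampled points Q are drawn from the data set D itself, so querying with k = 2 skips each point's own zero distance:

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbor_distances(D, Q):
    """For each sampled point q in Q (taken from D), return the distance y to its
    nearest neighbor in D - {q}."""
    tree = cKDTree(np.asarray(D, dtype=float))
    dists, _ = tree.query(np.asarray(Q, dtype=float), k=2)  # closest hit is q itself (distance 0)
    return dists[:, 1]                                       # second-nearest = nearest neighbor in D - {q}
```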
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses
the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total
variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = Σ_{Pi∈Cluster1} dist(Pi, C1)² + Σ_{Pi∈Cluster2} dist(Pi, C2)² + Σ_{Pi∈Cluster3} dist(Pi, C3)²,
where dist(Pi, Cj) is the distance between data point Pi and the centroid Cj of its cluster.
To find the optimal value of clusters, the elbow method follows the below steps:
• It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).
• For each value of K, calculates the WCSS value.
• Plots a curve between calculated WCSS values and the number of clusters K.
• The sharp point of bend, where the plot looks like an arm, is taken as the best value of K.
Because the graph shows a sharp bend that looks like an elbow, this approach is known as the elbow method. The
graph for the elbow method looks like the image below:
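A typical way to run the elbow method with scikit-learn, using the fitted model's inertia_ attribute (which equals the WCSS) for K = 1 to 10:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)        # inertia_ = within-cluster sum of squares (WCSS)
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('WCSS')
plt.show()                          # the "elbow" in this curve suggests the best K
```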
Extrinsic methods compare a clustering against the ground truth, Cg. Four criteria are commonly considered.
Cluster homogeneity: This requires that the purer the clusters in a clustering are, the better the clustering.
Suppose that ground truth says that the objects in a data set, D, can belong to categories L1, ..., Ln. Consider
clustering C1, wherein a cluster C ∈ C1 contains objects from two categories Li, Lj (1 ≤ i < j ≤ n). Also consider
clustering C2, which is identical to C1 except that C is split into two clusters containing the objects in Li and Lj,
respectively. A clustering quality measure, Q, respecting cluster homogeneity should give a higher score to C2
than to C1, that is, Q(C2, Cg) > Q(C1, Cg).
Cluster completeness: Cluster completeness is the counterpart of homogeneity: if any two data
objects belong to the same category according to ground truth, then they should be assigned to the same cluster.
A clustering scores higher on completeness when objects of the same category end up in the same cluster.
Rag bag: In some situations, there are objects that cannot be merged with the objects of any regular category.
The quality of handling such objects is measured by the rag bag criterion: it is preferable to put such a
heterogeneous object into a "rag bag" (miscellaneous) category rather than into an otherwise pure cluster.
Small cluster preservation: If a small category is split into even smaller pieces, those pieces can behave like
noise in the clustering, making the small category hard to identify. The small cluster preservation criterion
therefore states that splitting a small category into pieces is not advisable, since it further decreases the quality
of the clustering.
Many clustering quality measures satisfy some of these four criteria. Here, we introduce the BCubed precision
and recall metrics, which satisfy all four criteria.
BCubed evaluates the precision and recall for every object in a clustering on a given data set according to
ground truth.
The precision of an object indicates how many other objects in the same cluster belong to the same category
as the object.
The recall of an object reflects how many objects of the same category are assigned to the same cluster.
Then, for two objects, oi and oj (1 ≤ i, j ≤ n, i ≠ j), the correctness of the relation between oi and oj in clustering
C is given by
Correctness(oi, oj) = 1 if L(oi) = L(oj) ⇔ C(oi) = C(oj), and 0 otherwise,
where L(oi) denotes the ground-truth category of oi and C(oi) denotes the cluster to which oi is assigned in C.
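A small sketch of one common formulation of BCubed precision and recall, assuming parallel arrays of ground-truth categories and cluster assignments (each object is counted together with itself):

```python
import numpy as np

def bcubed(categories, clusters):
    """BCubed precision and recall, averaged over all objects.
    categories: ground-truth labels; clusters: cluster assignments (same length)."""
    categories = np.asarray(categories)
    clusters = np.asarray(clusters)
    precisions, recalls = [], []
    for i in range(len(categories)):
        same_cluster = clusters == clusters[i]
        same_category = categories == categories[i]
        both = np.logical_and(same_cluster, same_category)
        precisions.append(both.sum() / same_cluster.sum())  # purity of o's cluster w.r.t. o's category
        recalls.append(both.sum() / same_category.sum())    # coverage of o's category by o's cluster
    return float(np.mean(precisions)), float(np.mean(recalls))

# Example: two ground-truth categories clustered into two clusters.
print(bcubed(['x', 'x', 'y', 'y'], [0, 0, 0, 1]))
```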
Intrinsic Methods:
When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering
quality. Intrinsic methods evaluate a clustering by examining how well the clusters are separated and how
compact the clusters are.
Silhouette Coefficient:
Silhouette refers to a method of interpretation and validation of consistency within clusters of data.
For a data set, D, of n objects, suppose D is partitioned into k clusters, C1, ..., Ck. For each object
o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o
belongs. Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong.
Formally, suppose o ∈ Ci (1 ≤ i ≤ k); then
a(o) = Σ_{o′∈Ci, o′≠o} dist(o, o′) / (|Ci| − 1),
b(o) = min over Cj (1 ≤ j ≤ k, j ≠ i) of { Σ_{o′∈Cj} dist(o, o′) / |Cj| },
and the silhouette coefficient of o is
s(o) = (b(o) − a(o)) / max{ a(o), b(o) }.
The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own
cluster and poorly matched to neighboring clusters.
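A short scikit-learn sketch that computes the same per-object silhouette values s(o) and their average over the data set:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
s = silhouette_samples(X, labels)          # s(o) for every object o
print(s[:5], silhouette_score(X, labels))  # first few per-object values and their average
```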