Clustering
Introduction
Cluster analysis is the best-known
descriptive data mining method. Given a
data matrix composed of n observations
(rows) and p variables (columns), the
objective of cluster analysis is to cluster
the observations into groups that are
internally homogeneous (internal
cohesion) and heterogeneous from
group to group (external separation).
Euclidean distance
Consider a data matrix containing only
quantitative (or binary) variables. If x and y are
rows from the data matrix then a function d(x, y)
is said to be a distance between two
observations if it satisfies the following
properties:
Non-negativity. d(x, y) ≥ 0, for all x and y.
Identity. d(x, y) = 0 if and only if x = y, for all x and y.
Symmetry. d(x, y) = d(y, x), for all x and y.
Triangle inequality. d(x, y) ≤ d(x, z) + d(y, z), for all x, y and z.
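For quantitative variables, the distance referred to in the heading above is the standard Euclidean distance, which for two p-dimensional rows x and y is

d(x, y) = \sqrt{\sum_{s=1}^{p} (x_s - y_s)^2},

and it satisfies all four properties listed above.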
Similarity measures
Given a finite set of observations ui ∈ U, a function
S(ui, uj) = Sij from U × U to ℝ is called an index of
similarity if it satisfies the following properties:
Non-negativity. Sij ≥ 0, for all ui, uj ∈ U.
Normalisation. Sii = 1, for all ui ∈ U.
Symmetry. Sij = Sji, for all ui, uj ∈ U.
Unlike distances, the indexes of similarity can be
applied to all kinds of variables, including qualitative
variables. They are defined with reference to the
observation indexes, rather than to the
corresponding row vectors, and they assume values
in the closed interval [0, 1], making them easy to
interpret.
In our example
Its complement to one (a dissimilarity index)
corresponds to the average of the squared Euclidean
distance between the two vectors of binary variables
associated to the observations:
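Assuming the similarity index of the example is the simple matching coefficient computed on p binary variables (the example itself is not reproduced here), this relationship can be written as

1 - S_{ij} = \frac{1}{p} \sum_{s=1}^{p} (x_{is} - x_{js})^2,

since for binary values each squared difference equals 1 exactly when the two observations disagree on variable s.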
Example
Why is Clustering
Important?
Clusters and the descriptions generated from them are useful
in several decision-making situations, such as classification
and prediction.
Taxonomy of clustering approaches
Clustering approaches fall into two broad categories:
Hard clustering
Soft clustering
Partitional Clustering
Partitional clustering algorithms
generate a hard or soft partition of
the data.
The most popular of this category of
algorithms is the k-means algorithm.
k-Means Algorithm
A simple description of the k-means algorithm is given
below.
Step 1: Select k out of the given n patterns as the initial
cluster centres. Assign each of the remaining n - k
patterns to one of the k clusters; a pattern is assigned to
its closest centre/cluster.
Step 2: Compute the cluster centres based on the
current assignment of patterns.
Step 3: Assign each of the n patterns to its closest
centre/cluster.
Step 4: If there is no change in the assignment of
patterns to clusters during two successive iterations,
then stop; else, goto Step 2.
k-Means Clustering
Algorithm
1. Choose a value of k.
2. Select k objects in an arbitrary fashion.
Use these as the initial set of k centroids.
3. Assign each of the objects to the cluster
for which it is nearest to the centroid.
4. Recalculate the centroids of the k
clusters.
5. Repeat steps 3 and 4 until the centroids
no longer move.
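The five steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming squared Euclidean distance and a random choice of the k initial centroids; the function name and the small data set are made up for the example.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, p) array of patterns, k the number of clusters."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Steps 1-2: select k of the n patterns, in an arbitrary fashion, as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each pattern to the cluster whose centroid is nearest
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop once the assignment no longer changes between iterations.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of the patterns assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Illustrative usage on made-up 2-D data.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.5]])
labels, centroids = k_means(X, k=2)
```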
Example
Progress of k-means
clustering
Popular Initialization
Methods
In his classical paper [33], MacQueen
proposed a simple initialization method
which chooses K seeds at random. This
is the simplest method and has been
widely used in the literature.
The other popular K-means initialization
methods which have been successfully
used to improve the clustering
performance are given below.
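One widely used example is the K-means++ seeding of Arthur and Vassilvitskii [1], which draws each new seed with probability proportional to its squared distance from the seeds already chosen. The sketch below illustrates that idea only; the function name is made up for the example.

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    """Sketch of K-means++-style seeding: each new seed is drawn with probability
    proportional to its squared distance from the nearest seed chosen so far."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    seeds = [X[rng.integers(len(X))]]          # first seed chosen uniformly at random
    while len(seeds) < k:
        d2 = np.min([((X - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)

# The returned array can then be used as the initial centroids of a k-means run.
```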
Variations of K-Means
The simple framework of the K-means algorithm
makes it very flexible to modify and build more
efficient algorithms on top of it. Some of the
variations proposed to the K-means algorithm are
based on
(i) Choosing different representative prototypes for
the clusters (K-medoids, K-medians, K-modes),
(ii) choosing better initial centroid estimates
(Intelligent K-means, Genetic K-means), and
(iii) applying some kind of feature transformation
technique (Weighted K-means, Kernel K-means).
K-Medoids Clustering
K-medoids is a clustering algorithm which is more
resilient to outliers compared to K-means [38]. Similar to
K-means, the goal of K-medoids is to find a clustering
solution that minimizes a predefined objective function.
The K-medoids algorithm chooses the actual data points
as the prototypes and is more robust to noise and
outliers in the data. The K-medoids algorithm aims to
minimize the absolute error criterion rather than the SSE.
Similar to the K-means clustering algorithm, the K-medoids
algorithm proceeds iteratively until each representative
object is actually the medoid of the cluster.
The basic K-medoids clustering algorithm is given below.
Algorithm: K-Medoids
Clustering
1: Select K points as the initial representative
objects.
2: repeat
3: Assign each point to the cluster with the nearest
representative object.
4: Randomly select a non-representative object xi.
5: Compute the total cost S of swapping the
representative object m with xi.
6: If S < 0, then swap m with xi to form the new set
of K representative objects.
7: until Convergence criterion is met.
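A minimal Python sketch of this swap heuristic is given below, assuming Euclidean distance as the dissimilarity; the function names and the fixed trial budget n_trials are illustrative choices, not part of the algorithm statement above.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Absolute-error criterion: sum of distances from each point to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def k_medoids(X, k, n_trials=200, seed=0):
    """Sketch of the randomized swap heuristic listed above (PAM-style)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    medoids = rng.choice(len(X), size=k, replace=False)    # step 1: initial representatives
    for _ in range(n_trials):                              # steps 2-7
        m = rng.integers(k)                                # representative considered for a swap
        candidate = rng.integers(len(X))                   # step 4: random non-representative
        if candidate in medoids:
            continue
        trial = medoids.copy()
        trial[m] = candidate
        S = total_cost(X, trial) - total_cost(X, medoids)  # step 5: cost of the swap
        if S < 0:                                          # step 6: keep improving swaps only
            medoids = trial
    return medoids
```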
K-Medians Clustering
The K-medians clustering calculates the median for
each cluster as opposed to calculating the mean of
the cluster (as done in K-means). The K-medians
clustering algorithm chooses K cluster centers that
aim to minimize the sum of a distance measure
between each point and the closest cluster center.
The distance measure used in the K-medians
algorithm is the L1-norm as opposed to the square
of the L2-norm used in the K-means algorithm. The
criterion function for the K-medians algorithm is
defined as follows:
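The expression itself is not reproduced above; a standard way of writing this objective, with med_k denoting the component-wise median of cluster C_k, is

\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mathrm{med}_k \rVert_1 .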
K-Modes Clustering
One of the major disadvantages of K-means is its inability
to deal with non-numerical attributes [51, 3]. Using
certain data transformation methods, categorical data
can be transformed into new feature spaces, and then
the K-means algorithm can be applied to this newly
transformed space to obtain the final clusters.
However, this method has proven to be very ineffective
and does not produce good clusters. It is observed that
the SSE function and the usage of the mean are not
appropriate when dealing with categorical data. Hence,
the K-modes clustering algorithm [21] has been proposed
to tackle this challenge.
Algorithm: K-Modes
Clustering
1: Select K initial modes.
2: repeat
3: Form K clusters by assigning all the
data points to the cluster with the
nearest mode using the matching
metric.
4: Recompute the modes of the
clusters.
5: until Convergence criterion is met.
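For illustration, the two ingredients of Steps 3 and 4 (the matching metric and the per-attribute mode) can be sketched as follows; the helper names and the sample records are made up for the example.

```python
from collections import Counter

def matching_distance(x, y):
    """Matching dissimilarity: the number of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Component-wise mode: the most frequent category of each attribute in the cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))

# Illustrative categorical records, made up for the example.
cluster = [("red", "small", "round"), ("red", "large", "round"), ("blue", "small", "round")]
print(mode_of(cluster))                            # ('red', 'small', 'round')
print(matching_distance(cluster[0], cluster[1]))   # 1
```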
The basic FCM (fuzzy c-means) algorithm works similarly to K-means:
it iteratively minimizes a weighted SSE objective, alternately
updating the membership weights w_ik and the centroids c_k.
This process is continued until the
convergence of centroids. As in K-means, the
FCM algorithm is sensitive to outliers and the
final solutions obtained will correspond to the
local minimum of the objective function.
There are further extensions of this algorithm
in the literature such as Rough C-means[34]
and Possibilistic C-means [30].
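For reference, the standard fuzzy c-means objective as formulated by Bezdek [4] (the fuzzifier m > 1 belongs to that formulation and is not stated in the text above) is

J_m = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik}^{m} \, \lVert x_i - c_k \rVert^2 , \qquad \text{subject to } \sum_{k=1}^{K} w_{ik} = 1 \text{ for each } i.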
Hierarchical Clustering Algorithms
Hierarchical clustering algorithms [23] were
developed to overcome some of the
disadvantages associated with flat or
partitional-based clustering methods.
Partitional methods generally require a user
predefined parameter K to obtain a clustering
solution and they are often nondeterministic in
nature.
Hierarchical algorithms were developed to build
a more deterministic and flexible mechanism
for clustering the data objects.
Hierarchical Algorithms
Hierarchical algorithms produce a nested sequence of
data partitions. The sequence can be depicted using
a tree structure that is popularly known as a
dendrogram.
The algorithms are either
divisive or
agglomerative.
Hierarchical Algorithms
Agglomerative algorithms, on the other hand, use
a bottom-up strategy.
They start with n singleton clusters when the
input data set is of size n, where each input
pattern is in a different cluster. At successive
levels, the most similar pair of clusters is merged
to reduce the size of the partition by 1.
An important property of agglomerative
algorithms is that once two patterns are placed in
the same cluster at a level, they remain in the
same cluster at all subsequent levels.
Similarly, in the divisive algorithms, once two
patterns are placed in two different clusters at a
level, they remain in different clusters at all
subsequent levels.
Divisive Hierarchical
Clustering
Divisive algorithms are either
polythetic where the division is based on more than one
feature or
monothetic when only one feature is considered at a time.
Agglomerative Clustering
There are different kinds of agglomerative
clustering methods which primarily differ from
each other in the similarity measures that they
employ.
The widely studied algorithms in this category
are the following:
single link (nearest neighbour),
complete link (diameter),
group average (average link),
centroid similarity, and
Ward's criterion (minimum variance).
Agglomerative Clustering
Typically, an agglomerative clustering algorithm goes
through the following steps:
Step 1: Compute the similarity/dissimilarity matrix
between all pairs of patterns. Initialise each cluster
with a distinct pattern.
Step 2: Find the closest pair of clusters and merge them.
Update the proximity matrix to reflect the merge.
Step 3: If all the patterns are in one cluster, stop. Else,
goto Step 2.
Single Link
In single link clustering[36, 46], the similarity of two
clusters is the similarity between their most similar
(nearest neighbor) members. This method
intuitively gives more importance to the regions
where clusters are closest, neglecting the overall
structure of the cluster.
Hence, this method falls under the category of a
local similarity-based clustering method. Because of
its local behavior, single linkage is capable of
effectively clustering non-elliptical, elongated
shaped groups of data objects.
However, one of the main drawbacks of this method
is its sensitivity to noise and outliers in the data.
Complete Link
Complete link clustering [27] measures the
similarity of two clusters as the similarity of
their most dissimilar members. This is
equivalent to choosing the cluster pair whose
merge has the smallest diameter.
As this method takes the cluster structure into
consideration it is non-local in behavior and
generally obtains compact shaped clusters.
However, similar to single link clustering, this
method is also sensitive to outliers.
Example: Agglomerative
Clustering
The single-link algorithm can be explained
with the help of the data shown in the
previous example.
The dendrogram corresponding to the
single-link algorithm is shown below.
Note that there are 8 clusters to start with,
where each cluster has one element.
The distance matrix using city-block
distance or Manhattan distance is given in
Table 9.9.
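Table 9.9 itself is not reproduced here, but the same kind of computation can be sketched with scipy's hierarchical-clustering routines; the small 2-D data set below is made up for illustration and is not the eight-point data of the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Made-up 2-D patterns, not the eight points of the example above.
X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [25, 25]])

# Step 1: pairwise dissimilarities using city-block (Manhattan) distance.
D = pdist(X, metric='cityblock')

# Steps 2-3: repeatedly merge the closest pair of clusters (single link).
Z = linkage(D, method='single')

# Cut the resulting dendrogram into, for example, three flat clusters.
print(fcluster(Z, t=3, criterion='maxclust'))
```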
Ward's Criterion
Ward's criterion [49, 50] was proposed to compute the distance between
two clusters during agglomerative clustering. This process of using
Ward's criterion for cluster merging in agglomerative clustering is also
called Ward's agglomeration.
It uses the K-means squared error criterion to determine the distance. For
any two clusters, Ca and Cb, Ward's criterion is calculated by
measuring the increase in the value of the SSE criterion for the clustering
obtained by merging them into Ca ∪ Cb.
Ward's criterion is defined as follows:
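The exact expression is not reproduced above; a common way of writing it, with N_a, N_b the cluster sizes and \mu_a, \mu_b the cluster centroids, is

d(C_a, C_b) = \frac{N_a N_b}{N_a + N_b} \, \lVert \mu_a - \mu_b \rVert^2 ,

which is exactly the increase in SSE incurred by merging C_a and C_b into C_a ∪ C_b.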
Algorithm: Agglomerative
Hierarchical Clustering
1: Compute the dissimilarity matrix between
all the data points.
2: repeat
3: Merge the two closest clusters Ca and Cb as Cab = Ca ∪ Cb.
Set the new cluster's cardinality as Nab = Na + Nb.
4: Insert a new row and column containing
the distances between the new cluster
Cab and the remaining clusters.
5: until Only one maximal cluster remains.
Density-Based Methods
Application of Cluster
Analysis
Data Reduction
Hypothesis generation and Testing
Prediction based on groups
Finding K-nearest neighbours
Outlier detection
References:
Pattern Recognition: An Algorithmic Approach, M. Narasimha Murty and V. Susheela Devi.
Applied Data Mining for Business and Industry, Paolo Giudici and Silvia Figini, 2nd Edition.
Principles of Data Mining, Max Bramer.
Data Mining: Multimedia, Soft Computing and Bioinformatics, Sushmita Mitra and Tinku Acharya.
Data Clustering: Algorithms and Applications, Charu C. Aggarwal and Chandan K. Reddy.
References
[1] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007.
[2] G. H. Ball and D. J. Hall. ISODATA, a novel method of data analysis and pattern classification. Technical report, DTIC Document, 1965.
[3] P. Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data, J. Kogan, C. Nicholas, and M. Teboulle, Eds., Springer, Berlin Heidelberg, pages 25-71, 2006.
[4] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, 1981.
[5] P. S. Bradley and U. M. Fayyad. Refining initial points for k-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning, volume 66, San Francisco, CA, USA, 1998.
[6] T. Caliński and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1):1-27, 1974.