
Clustering

Introduction
Cluster analysis is the best-known
descriptive data mining method. Given a
data matrix composed of n observations
(rows) and p variables (columns), the
objective of cluster analysis is to cluster
the observations into groups that are
internally homogeneous (internal
cohesion) and heterogeneous from
group to group (external separation).

In order to compare observations, we need to introduce the idea of a distance measure, or proximity, between them. Distances are normally used to measure the similarity or dissimilarity between two data objects Xi and Xj. The larger the similarity, the smaller the dissimilarity and hence the smaller the distance between the pair of objects.

When the variables of interest are quantitative, the indexes of proximity typically used are called distances.
If the variables are qualitative, the distance between observations can be measured by indexes of similarity.
If the data are contained in a contingency table, the chi-squared distance can also be employed.
There are also indexes of proximity that are used on a mixture of qualitative and quantitative variables.

In the numeric domain, a popular measure is the Minkowski distance. It is defined as
d(xi, xj) = ( |xi1 − xj1|^q + ... + |xin − xjn|^q )^(1/q),
where q is a positive integer and n is the number of attributes involved. If q = 1, then d is termed the Manhattan distance, while for q = 2 it is called the Euclidean distance.
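A minimal sketch (not from the source) of the Minkowski distance and its two named special cases; the vectors xi and xj are illustrative.

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two equal-length numeric sequences."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi = [1.0, 2.0, 3.0]
xj = [4.0, 0.0, 3.0]

print(minkowski(xi, xj, q=1))  # Manhattan distance: 3 + 2 + 0 = 5
print(minkowski(xi, xj, q=2))  # Euclidean distance: sqrt(9 + 4 + 0) ~= 3.606
```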

Euclidean distance
Consider a data matrix containing only
quantitative (or binary) variables. If x and y are
rows from the data matrix then a function d(x, y)
is said to be a distance between two
observations if it satisfies the following
properties:
Non-negativity: d(x, y) ≥ 0, for all x and y.
Identity: d(x, y) = 0 ⇔ x = y, for all x and y.
Symmetry: d(x, y) = d(y, x), for all x and y.
Triangle inequality: d(x, y) ≤ d(x, z) + d(y, z), for all x, y and z.

All such distances can be collected in a matrix of distances. A distance matrix is an n × n matrix whose generic element dij measures the distance between the row vectors xi and xj.
The Euclidean distance is the most commonly used distance measure. For any two units indexed by i and j, it is defined as the square root of the sum of squared differences between the corresponding vectors in the p-dimensional Euclidean space:
d(xi, xj) = sqrt( (xi1 − xj1)² + ... + (xip − xjp)² ).

The Euclidean distance can be strongly influenced by a single large difference in one dimension of the values, because the square greatly magnifies that difference.
Dimensions having different scales (e.g. some values measured in centimetres, others in metres) are often the source of these overstated differences.
To overcome this limitation, the Euclidean distance is often calculated not on the original variables but on useful transformations of them. The most common choice is to standardise the variables.
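A short sketch of standardisation before computing the Euclidean distance; the height/weight data and the variable names are illustrative, not from the source.

```python
import statistics

def standardise(columns):
    """Z-score each variable (column): subtract its mean, divide by its std. dev."""
    return [
        [(v - statistics.mean(col)) / statistics.pstdev(col) for v in col]
        for col in columns
    ]

# Two variables on very different scales: height in metres, weight in grams.
height = [1.60, 1.75, 1.82, 1.68]
weight = [62000, 80000, 91000, 70000]

z_height, z_weight = standardise([height, weight])

def euclidean(i, j):
    return ((z_height[i] - z_height[j]) ** 2 + (z_weight[i] - z_weight[j]) ** 2) ** 0.5

print(euclidean(0, 1))  # distance no longer dominated by the weight scale
```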

After standardisation, every transformed variable contributes to the calculation of the distance with equal weight. When the variables are standardised, they have zero mean and unit variance; furthermore, it can be shown that the squared Euclidean distance between two standardised observations is a decreasing function of rij, the correlation coefficient between the observations xi and xj. Thus the Euclidean distance between two observations is a function of the correlation coefficient between them.

Similarity measures
Given a finite set of observations ui ∈ U, a function S(ui, uj) = Sij from U × U to ℝ is called an index of similarity if it satisfies the following properties:
Non-negativity: Sij ≥ 0, for all ui, uj ∈ U.
Normalisation: Sii = 1, for all ui ∈ U.
Symmetry: Sij = Sji, for all ui, uj ∈ U.
Unlike distances, the indexes of similarity can be
applied to all kinds of variables, including qualitative
variables. They are defined with reference to the
observation indexes, rather than to the
corresponding row vectors, and they assume values
in the closed interval [0, 1], making them easy to
interpret.

Consider data on n visitors to a website which has P pages. Correspondingly, there are P binary variables, which assume the value 1 if the specific page has been visited, or else the value 0.
To demonstrate the application of similarity indexes, we now analyse only the data concerning the behaviour of the first two visitors (2 of the n observations) to the website, among the P = 28 web pages that they can visit.
Table 4.1 summarises the behaviour of the two visitors, treating each page as a binary variable.

CP, for co-presence, or positive matches;
CA, for co-absence, or negative matches;
PA for presence–absence and AP for absence–presence, where the first letter refers to visitor A and the second to visitor B.
Note that, of the 28 pages considered, two have been visited by both visitors, while 21 pages were visited by neither A nor B. Finally, the frequencies 4 and 1 indicate the number of pages visited by only one of the two visitors.

Russell–Rao similarity index
The Russell–Rao similarity index is a function of the co-presences and is equal to the ratio between the number of co-presences and the total number of binary variables considered, P:
S = CP / P.
From Table 4.1 we have S = 2/28 ≈ 0.07.

Jaccard similarity index
This index is the ratio between the number of co-presences and the total number of variables, excluding those that manifest co-absences:
S = CP / (P − CA).
Note that this index cannot be defined if the two visitors or, more generally, the two observations, manifest only co-absences (CA = P).
In the example above we have S = 2 / (28 − 21) = 2/7 ≈ 0.29.

Sokal–Michener similarity index
This is the ratio between the number of co-presences or co-absences and the total number of variables:
S = (CP + CA) / P.
In our example S = (2 + 21)/28 = 23/28 ≈ 0.82.
Its complement to one (a dissimilarity index) corresponds to the average of the squared Euclidean distance between the two vectors of binary variables associated with the observations. It is one of the commonly used indexes of similarity.
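A small sketch computing the three indexes from the frequencies quoted above (CP = 2, CA = 21, PA = 4, AP = 1, so P = 28 pages); nothing else is assumed.

```python
CP, CA, PA, AP = 2, 21, 4, 1
P = CP + CA + PA + AP  # 28 pages in total

russell_rao    = CP / P                 # 2/28  ~= 0.07
jaccard        = CP / (P - CA)          # 2/7   ~= 0.29
sokal_michener = (CP + CA) / P          # 23/28 ~= 0.82

print(russell_rao, jaccard, sokal_michener)
```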

Patterns to be clustered are either labelled or unlabelled. Based on this, we have:
1. Clustering algorithms which group sets of unlabelled patterns. These paradigms are so popular that clustering is often viewed as unsupervised classification of unlabelled patterns.
2. Algorithms which cluster labelled patterns. These paradigms are practically important and are called supervised clustering or labelled clustering.

The process of clustering is carried out so that patterns in the same cluster are similar in some sense and patterns in different clusters are dissimilar in a corresponding sense.
Ideally, the Euclidean distance between any two points belonging to the same cluster is smaller than that between any two points belonging to different clusters.

Let us consider a two-dimensional data set of 10 vectors given by
X1=(1,1) X2=(2,1) X3=(1,2) X4=(2,2) X5=(6,1)
X6=(7,1) X7=(6,2) X8=(6,7) X9=(7,7) X10=(7,6)

Let us say that any two points belong to the same cluster if the distance between them is less than a threshold. Specifically, in this example, we use the squared Euclidean distance to characterise the distance between the points and use a threshold of 5 units to cluster them. The squared Euclidean distance between two points xi and xj is defined as
d²(xi, xj) = (xi1 − xj1)² + ... + (xid − xjd)²,
where d is the dimensionality of the points.

It is possible to represent the squared Euclidean distances between all pairs of points using a distance matrix.
Note the following points:
1. In general, the clustering structure may not be so obvious from the distance matrix. For example, by considering the patterns in the order X1, X5, X2, X8, X3, X6, X9, X4, X7, X10, the sub-matrices corresponding to the three clusters are not easy to visualise.
2. In this example, we have used the squared Euclidean distance to denote the distance between two points. It is possible to use other distance functions as well.
3. Here, we defined each cluster to have patterns such that the distance between any two patterns in a cluster (intra-cluster distance) is less than 5 units and the distance between two points belonging to two different clusters (inter-cluster distance) is greater than 5 units.

The partition generated by clustering is either hard or soft. A hard clustering algorithm generates clusters that are non-overlapping.
Consider a set of patterns X partitioned into clusters X1, X2, ..., XC, where the ith cluster Xi is such that ∪i Xi = X and no Xi = ∅. If, in addition, Xi ∩ Xj = ∅ for all i and j, i ≠ j, then we have a hard partition.
It is possible to generate a soft partition where a pattern belongs to more than one cluster. In such a case, we will have overlapping clusters.

In general, for a set of n patterns, it is possible to show that the number of 2-partitions is 2^(n−1) − 1.
The number of partitions of n patterns into m blocks (clusters) is given by the Stirling number of the second kind, S(n, m).
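The slide's equation is not reproduced in this copy; the standard closed form of the Stirling number of the second kind is given below, and it is consistent with the 2-partition count above, since S(n, 2) = 2^(n−1) − 1.

```latex
S(n, m) \;=\; \frac{1}{m!}\sum_{k=0}^{m} (-1)^{k}\binom{m}{k}(m-k)^{n}
```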

Knowledge is implicit in clustering. So, partitioning is done based on domain knowledge or user-provided information in different forms; this could include the choice of similarity measure, the number of clusters, and threshold values. Further, because of the well-known theorem of the ugly duckling, clustering is not possible without using extra-logical information.
The theorem of the ugly duckling states that the
number of predicates shared by any two patterns is
constant when all possible predicates are used to
represent patterns. So, if we judge similarity based on
the number of predicates shared, then any two patterns
are equally similar.

A cluster of points is represented by its centroid or its medoid.
The centroid is the sample mean of the points in cluster C; it is given by
(1 / NC) Σ x, summed over all points x in C,
where NC is the number of patterns in cluster C.
The medoid is the most centrally located point in the cluster; more formally, the medoid is the point in the cluster for which the sum of the distances to the points in the cluster is minimum.
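A minimal sketch contrasting the two representatives on a small illustrative 2-D cluster (the data and function names are not from the source); the last point plays the role of an outlier, anticipating the next slide.

```python
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (9, 9)]  # last point is an outlier

def centroid(pts):
    n = len(pts)
    return tuple(sum(p[k] for p in pts) / n for k in range(len(pts[0])))

def medoid(pts):
    # point with the minimum total distance to the points in the cluster
    def total_dist(p):
        return sum(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 for q in pts)
    return min(pts, key=total_dist)

print(centroid(cluster))  # (3.0, 3.0): pulled towards the outlier
print(medoid(cluster))    # (2, 2): stays inside the original group of points
```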

A point which is far off from the other points in the cluster is an outlier.
The centroid of the data can shift without any bound based on the location of the outlier; however, the medoid does not shift beyond the boundary of the original cluster irrespective of the location of the outlier.
So, clustering algorithms that use medoids are more robust in the presence of noisy patterns or outliers.

Example

It is possible to have more than one representative per cluster; for example, four extreme points labelled e1, e2, e3, e4 can represent the cluster as shown in the figure.
It is also possible to use a logical description to characterise the cluster. For example, using a conjunction of disjunctions, we have the following description of the cluster:
(x = a1 ∨ ... ∨ a2) ∧ (y = b1 ∨ ... ∨ b3)

Why is Clustering
Important?
Clusters and their generated descriptions are useful in several
decision-making situations like
classification,
prediction, etc.

The number of cluster representatives obtained is smaller than the number of input patterns, so there is data reduction.
Clustering can also be applied to the set of features in order to achieve dimensionality reduction, which is useful in improving classification accuracy at a reduced computational cost.
One important application of clustering is in data reorganisation.
Another important application of clustering is in identifying outliers.
It is also possible to use clustering to estimate missing values in a pattern matrix.

For example, let the value of the jth feature of the ith pattern be missing. We can still cluster the patterns based on features other than the jth one. Based on this grouping, we find the cluster to which the ith pattern belongs. The missing value can then be estimated based on the jth feature values of all the other patterns belonging to that cluster.

The important steps in the clustering process are:
1. Pattern representation
2. Definition of an appropriate similarity/dissimilarity function
3. Selecting a clustering algorithm and using it to generate a partition/description of the clusters
4. Using these abstractions in decision making

At the top level, there is a distinction between hard and soft clustering paradigms, based on whether the partitions generated are overlapping or not.
Hard clustering algorithms are either hierarchical, where a nested sequence of partitions is generated, or partitional, where a single partition of the given data set is generated.
Soft clustering algorithms are based on fuzzy sets, rough sets, artificial neural networks (ANNs), or evolutionary algorithms, specifically genetic algorithms (GAs).

Taxonomy of clustering approaches
(Figure: taxonomy of clustering approaches, split into hard clustering and soft clustering.)

Partitional Clustering
Partitional clustering algorithms
generate a hard or soft partition of
the data.
The most popular of this category of
algorithms is the k-means algorithm.

k-Means Algorithm
A simple description of the k-means algorithm is given
below.
Step 1: Select k out of the given n patterns as the initial cluster centres. Assign each of the remaining n − k patterns to one of the k clusters; a pattern is assigned to its closest centre/cluster.
Step 2: Compute the cluster centres based on the
current assignment of patterns.
Step 3: Assign each of the n patterns to its closest
centre/cluster.
Step 4: If there is no change in the assignment of
patterns to clusters during two successive iterations,
then stop; else, goto Step 2.

k-Means Clustering
Algorithm
1. Choose a value of k.
2. Select k objects in an arbitrary fashion.
Use these as the initial set of k centroids.
3. Assign each of the objects to the cluster
for which it is nearest to the centroid.
4. Recalculate the centroids of the k
clusters.
5. Repeat steps 3 and 4 until the centroids
no longer move.
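A plain-Python sketch of the steps above (random initial centroids; the data set reuses the 10 patterns from the earlier example). It is a sketch, not a reference implementation.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Simple k-means following the steps above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # choose k arbitrary initial centroids
    assignment = None
    for _ in range(max_iter):
        # assign each point to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(x, centroids[c])))
            for x in points
        ]
        if new_assignment == assignment:       # stop when no assignment changes
            break
        assignment = new_assignment
        # recompute each centroid as the mean of its assigned points
        for c in range(k):
            members = [x for x, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

data = [(1, 1), (2, 1), (1, 2), (2, 2), (6, 1), (7, 1), (6, 2), (6, 7), (7, 7), (7, 6)]
centres, labels = kmeans(data, k=3)
print(centres)
print(labels)
```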

There are a variety of schemes for selecting the initial cluster centres; these include:
selecting the first k of the given n patterns,
selecting k random patterns out of the given n patterns, and
viewing the initial cluster seed selection as an optimisation problem and using a robust tool to search for the globally optimal solution.

An important property of the k-means algorithm is that it implicitly minimises the sum of squared deviations of patterns in a cluster from the centre.
More formally, if Ci is the ith cluster and μi is its centre, then the criterion function minimised by the algorithm is
J = Σi Σx∈Ci ||x − μi||².
In the clustering literature, this is called the sum-of-squared-error criterion, the within-group error sum of squares, or simply the squared-error criterion.

Example

Progress of k-means
clustering

It can be proved that the k-means algorithm will always terminate, but it does not necessarily find the best set of clusters, corresponding to minimising the value of the objective function.
The initial selection of centroids can significantly affect the result. To overcome this, the algorithm can be run several times for a given value of k, each time with a different choice of the initial k centroids, the set of clusters with the smallest value of the objective function then being taken.
The k-means algorithm does not guarantee the globally optimal partition.
The time complexity of the algorithm is O(nkdl), where n is the number of patterns, k is the number of clusters, d is the dimensionality and l is the number of iterations. The space requirement is O(kd). These features make the algorithm very attractive.
k-means can be applied in applications involving large volumes of data, for example, satellite image data. It is best to use the k-means algorithm when the clusters are hyper-spherical. It does not generate the intended partition if the partition has non-spherical clusters.

The objective function employed by K-means is called the Sum of Squared Errors (SSE) or Residual Sum of Squares (RSS).
Given a dataset D = {x1, x2, ..., xN} consisting of N points, let us denote the clustering obtained after applying K-means by C = {C1, C2, ..., Ck, ..., CK}. The SSE for this clustering is defined in the following equation, where ck is the centroid of cluster Ck. The objective is to find a clustering that minimizes the SSE score. The iterative assignment and update steps of the K-means algorithm aim to minimize the SSE score for the given set of centroids.
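The equation itself is not reproduced in this copy; with the notation defined above, the standard SSE is:

```latex
SSE(\mathcal{C}) \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^{2}
```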

Minimization of Sum of Squared Errors
K-means clustering is essentially an optimization problem with the goal of minimizing the SSE objective function.
Let us see why the mean of the data points in a cluster is chosen as the prototype representative for a cluster in the K-means algorithm.
Let us denote Ck as the kth cluster, xi as a point in Ck, and ck as the mean of the kth cluster.
We can solve for the representative of Ck which minimizes the SSE by differentiating the SSE with respect to ck and setting it equal to zero.
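A sketch of the omitted differentiation step, using the SSE written above:

```latex
\frac{\partial\, SSE}{\partial c_k}
  \;=\; \frac{\partial}{\partial c_k} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^{2}
  \;=\; -2 \sum_{x_i \in C_k} (x_i - c_k) \;=\; 0
\quad\Longrightarrow\quad
c_k \;=\; \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```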

Hence, the best representative for minimizing the SSE of a cluster is the mean of the points in the cluster. In K-means, the SSE monotonically decreases with each iteration, and this monotonically decreasing sequence eventually converges to a local minimum.

Factors Affecting K-Means
The major factors that can impact the performance of the K-means algorithm are the following:
1. Choosing the initial centroids.
2. Estimating the number of clusters K.

Popular Initialization
Methods
In his classical paper [33], MacQueen
proposed a simple initialization method
which chooses K seeds at random. This
is the simplest method and has been
widely used in the literature.
The other popular K-means initialization
methods which have been successfully
used to improve the clustering
performance are given below.

1. Hartigan and Wong [19]: Using the concept of nearest-neighbour density, this method suggests that points which are well separated and have a large number of points within their surrounding multi-dimensional sphere are good candidates for initial seeds.
The average pair-wise Euclidean distance between points, d1, is calculated using the following equation. Subsequent points are chosen in decreasing order of their density while simultaneously maintaining the separation of d1 from all previously chosen seeds.

2. Bradley and Fayyad [5]: Choose random subsamples from the data and apply K-means clustering to each of these subsamples using random seeds. The centroids from each of these subsamples are then collected, and a new dataset consisting of only these centroids is created.
This new dataset is clustered using these centroids as the initial seeds. The seed set that yields the minimum SSE is chosen as the best.

3. K-Means++ [1]: The K-means++ algorithm carefully selects the initial centroids for K-means clustering. The algorithm follows a simple probability-based approach: the first centroid is selected at random, and each subsequent centroid is selected with a weighted probability proportional to its squared distance from the nearest centroid chosen so far, so points far from the current centroids are favoured. The selection continues until we have K centroids, and then K-means clustering is performed using these centroids.
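A sketch of this seeding step under the D² weighting described above; the sample data are illustrative.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """K-means++ seeding: draw each new centroid with probability
    proportional to its squared distance to the nearest centroid chosen so far."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]             # first centroid: uniform at random
    while len(centroids) < k:
        # D(x)^2: squared distance of each point to its nearest current centroid
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids)
            for x in points
        ]
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

data = [(1, 1), (2, 1), (1, 2), (6, 1), (7, 1), (6, 7), (7, 7)]
print(kmeans_pp_init(data, k=3))
```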

Estimating the Number of Clusters
The problem of estimating the correct number of clusters (K) is one of the major challenges for K-means clustering. Several researchers have proposed methods for addressing this challenge in the literature.
1. Calinski–Harabasz Index [6]: The Calinski–Harabasz index is defined by the following equation,
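The equation is not reproduced in this copy; the usual form of the index, consistent with the description that follows, is:

```latex
CH(K) \;=\; \frac{B(K) / (K-1)}{\,W(K) / (N-K)\,}
```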

where N represents the number of data points. The number of clusters is chosen by maximizing the function given in the equation. Here B(K) and W(K) are the between-cluster and within-cluster sums of squares, respectively (with K clusters).

2. Silhouette Coefficient [26]: This is formulated by considering both the intra-cluster and inter-cluster distances.
For a given point xi, first the average of the distances to all other points in the same cluster is calculated. This value is set to ai.
Then, for each cluster that does not contain xi, the average distance of xi to all the data points in that cluster is computed; the minimum of these values is set to bi.
Using these two values, the silhouette coefficient of a point is estimated. The average of all the silhouettes in the dataset is called the average silhouette width. To evaluate the quality of a clustering, one can compute the average silhouette coefficient of all points.
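Assuming the standard definition built from the two quantities above, the silhouette of a point is:

```latex
s_i \;=\; \frac{b_i - a_i}{\max(a_i,\, b_i)}, \qquad -1 \le s_i \le 1
```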

3. Newman and Girvan [40]: In this method, the dendrogram is viewed as a graph, and a betweenness score (which will be used as a dissimilarity measure between the edges) is proposed. The procedure starts by calculating the betweenness score of all the edges in the graph. Then the edge with the highest betweenness score is removed. This is followed by recomputing the betweenness scores of the remaining edges, until the final set of connected components is obtained. The cardinality of the set derived through this process serves as a good estimate for K.

4. ISODATA [2]: ISODATA was proposed for clustering data based on the nearest centroid method. In this method, K-means is first run on the dataset to obtain the clusters. Clusters are then merged if their distance is less than a threshold or if they have fewer than a certain number of points. Similarly, a cluster is split if its within-cluster standard deviation exceeds a user-defined threshold.

Variations of K-Means
The simple framework of the K-means algorithm
makes it very flexible to modify and build more
efficient algorithms on top of it. Some of the
variations proposed to the K-means algorithm are
based on
(i) Choosing different representative prototypes for
the clusters (K-medoids, K-medians, K-modes),
(ii) choosing better initial centroid estimates
(Intelligent K-means, Genetic K-means), and
(iii) applying some kind of feature transformation
technique (Weighted K-means, Kernel K-means).

K-Medoids Clustering
K-medoids is a clustering algorithm which is more
resilient to outliers compared to K-means [38]. Similar to
K-means, the goal of K-medoids is to find a clustering
solution that minimizes a predefined objective function.
The K-medoids algorithm chooses the actual data points
as the prototypes and is more robust to noise and
outliers in the data. The K-medoids algorithm aims to
minimize the absolute error criterion rather than the SSE.
Similar to the K-means clustering algorithm, the K-medoids algorithm proceeds iteratively until each representative object is actually the medoid of the cluster.
The basic K-medoids clustering algorithm is given below.

In the K-medoids clustering algorithm, specific cases are considered where an arbitrary random point xi is used to replace a representative point m.
Following this step, the change in the membership of the points that originally belonged to m is checked. The change in membership of these points can occur in one of two ways: these points can now be closer to xi (the new representative point) or to any of the other representative points.
The cost of swapping is calculated as the absolute error criterion for K-medoids. For each reassignment operation this cost of swapping is calculated, and it contributes to the overall cost function.

Algorithm: K-Medoids
Clustering
1: Select K points as the initial representative
objects.
2: repeat
3: Assign each point to the cluster with the nearest
representative object.
4: Randomly select a non-representative object xi.
5: Compute the total cost S of swapping the
representative object m with xi.
6: If S < 0, then swap m with xi to form the new set
of K representative objects.
7: until Convergence criterion is met.

To deal with the problem of executing multiple swap operations while obtaining the final representative points for each cluster, a modification of K-medoids clustering called the Partitioning Around Medoids (PAM) algorithm was proposed [26]. This algorithm operates on the dissimilarity matrix of a given dataset. PAM minimizes the objective function by swapping all the non-medoid points and medoids iteratively until convergence.
K-medoids is more robust than K-means, but its computational complexity is higher and hence it is not suitable for large datasets.
PAM was also combined with a sampling method to propose the Clustering LARge Applications (CLARA) algorithm. CLARA considers many samples and applies PAM on each one of them to finally return the set of optimal medoids.

K-Medians Clustering
The K-medians clustering calculates the median for
each cluster as opposed to calculating the mean of
the cluster (as done in K-means). K-medians
clustering algorithm chooses K cluster centers that
aim to minimize the sum of a distance measure
between each point and the closest cluster center.
The distance measure used in the K-medians
algorithm is the L1-norm as opposed to the square
of the L2-norm used in the K-means algorithm. The
criterion function for the K-medians algorithm is
defined as follows:
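The equation itself is missing from this copy; assuming d denotes the number of attributes, a standard form consistent with the notation defined just below is:

```latex
S \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{j=1}^{d} \lvert x_{ij} - \mathrm{med}_{kj} \rvert
```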

where xij represents the jth attribute of the instance xi and medkj represents the median for the jth attribute in the kth cluster Ck. K-medians is more robust to outliers compared to K-means.
The goal of K-medians clustering is to determine those subsets of median points which minimize the cost of assignment of the data points to the nearest medians. The overall outline of the algorithm is similar to that of K-means. The two steps that are iterated until convergence are:
(i) all the data points are assigned to their nearest median, and
(ii) the medians are recomputed using the median of each individual feature.

K-Modes Clustering
One of the major disadvantages of K-means is its inability
to deal with non-numerical attributes [51, 3]. Using
certain data transformation methods, categorical data
can be transformed into new feature spaces, and then
the K-means algorithm can be applied to this newly
transformed space to obtain the final clusters.
However, this method has proven to be very ineffective
and does not produce good clusters. It is observed that
the SSE function and the usage of the mean are not
appropriate when dealing with categorical data. Hence,
the K-modes clustering algorithm [21] has been proposed
to tackle this challenge.

Algorithm: K-Modes
Clustering
1: Select K initial modes.
2: repeat
3: Form K clusters by assigning all the
data points to the cluster with the
nearest mode using the matching
metric.
4: Recompute the modes of the
clusters.
5: until Convergence criterion is met.

Fuzzy K-Means Clustering


This is also popularly known as Fuzzy C-Means
clustering.
Performing hard assignments of points to clusters is
not feasible in complex datasets where there are
overlapping clusters. To extract such overlapping
structures, a fuzzy clustering algorithm can be used.
In fuzzy C-means clustering algorithm (FCM) [12, 4],
the membership of points to different clusters can
vary from 0 to 1. The
SSE function for FCM is provided in the following
equation.

Here wxik is the membership weight of point xi belonging to Ck. This weight is used during the update step of fuzzy C-means. The weighted centroid according to the fuzzy weights for Ck is
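Neither equation is reproduced in this copy; assuming the standard fuzzy C-means forms with a fuzzifier β > 1 (β is not named in the slides), they are:

```latex
SSE(\mathcal{C}) \;=\; \sum_{k=1}^{K} \sum_{i=1}^{N} w_{x_i k}^{\beta}\, \lVert x_i - c_k \rVert^{2},
\qquad
c_k \;=\; \frac{\sum_{i=1}^{N} w_{x_i k}^{\beta}\, x_i}{\sum_{i=1}^{N} w_{x_i k}^{\beta}}
```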

The basic algorithm works similarly to K-means: the algorithm minimizes the SSE iteratively, followed by updating wxik and ck. This process is continued until the convergence of the centroids. As in K-means, the FCM algorithm is sensitive to outliers, and the final solution obtained will correspond to a local minimum of the objective function.
There are further extensions of this algorithm in the literature, such as Rough C-means [34] and Possibilistic C-means [30].

Hierarchical Clustering
Algorithms

Hierarchical Clustering
Algorithms
Hierarchical clustering algorithms [23] were
developed to overcome some of the
disadvantages associated with flat or
partitional-based clustering methods.
Partitional methods generally require a user
predefined parameter K to obtain a clustering
solution and they are often nondeterministic in
nature.
Hierarchical algorithms were developed to build
a more deterministic and flexible mechanism
for clustering the data objects.

Hierarchical Algorithms
Hierarchical algorithms produce a nested sequence of
data partitions. The sequence can be depicted using
a tree structure that is popularly known as a
dendrogram.
The algorithms are either
divisive or
agglomerative.

The former starts with a single cluster containing all the patterns; at each successive step, a cluster is split. This process continues till we end up with each pattern in its own cluster, that is, a collection of singleton clusters. Divisive algorithms use a top-down strategy for generating partitions of the data.

Hierarchical Algorithms
Agglomerative algorithms, on the other hand, use
a bottom-up strategy.
They start with n singleton clusters when the
input data set is of size n, where each input
pattern is in a different cluster. At successive
levels, the most similar pair of clusters is merged
to reduce the size of the partition by 1.
An important property of agglomerative
algorithms is that once two patterns are placed in
the same cluster at a level, they remain in the
same cluster at all subsequent levels.
Similarly, in the divisive algorithms, once two patterns are placed in two different clusters at a level, they remain in different clusters at all subsequent levels.

Divisive Hierarchical
Clustering
Divisive algorithms are either
polythetic where the division is based on more than one
feature or
monothetic when only one feature is considered at a time.

A scheme for polythetic clustering involves finding all possible 2-partitions of the data and choosing the best partition. Here, the partition with the least sum of the sample variances of the two clusters is chosen as the best. From the resulting partition, the cluster with the maximum sample variance is selected and is split into an optimal 2-partition. This process is repeated till we get singleton clusters.

The sum of the sample variances is calculated in the following way. If the patterns are split into two partitions with m patterns X1, ..., Xm in one cluster and n patterns Y1, ..., Yn in the other cluster, with the centroids of the two clusters being C1 and C2 respectively, then the sum of the sample variances will be

Here, to obtain the optimal 2-partition of a cluster of size n, 2^(n−1) − 1 2-partitions are to be considered and the best of them is chosen; so, O(2^n) effort is required to generate all possible 2-partitions and select the most appropriate partition.
It is possible to use one feature at a time to partition the given data set (monothetic clustering). In such a case, a feature direction is considered and the data is partitioned into two clusters based on the gap in the projected values along the feature direction. That is, the data set is split into two parts at a point that corresponds to the mean value of the maximum gap found among the values of the data's feature. Each of these clusters is further partitioned sequentially using the remaining features.

Example: Monothetic divisive clustering
There are 8 two-dimensional patterns. The patterns are as follows:
A = (0.5, 0.5); B = (2, 1.5); C = (2, 0.5); D = (5, 1);
E = (5.75, 1); F = (5, 3); G = (5.5, 3); H = (2, 3).

Example: Monothetic divisive clustering
(Figure: monothetic divisive clustering of the 8 patterns.)

A major problem with this approach is that, in the worst case, the initial data set is split into 2^d clusters by considering all the d features. This many clusters may not be acceptable, so an additional merging phase is required.
In such a phase, a pair of clusters is selected based on some notion of nearness between the two clusters, and they are merged into a single cluster. This process is repeated till the clusters are reduced to the required number.
A popular characteristic for finding the nearness between a pair of clusters is the distance between the centroids of the two clusters.

Algorithm: Basic Divisive Hierarchical Clustering
1: Start with the root node consisting of all the data points.
2: repeat
3: Split the parent node into two parts C1 and C2 using Bisecting K-means so as to maximize Ward's distance W(C1, C2).
4: Construct the dendrogram. Among the current leaves, choose the cluster with the highest squared error as the next one to split.
5: until Singleton leaves are obtained.

Sorting the n elements in the data and finding the maximum inter-pattern gap is of O(n log n) time complexity for each feature direction; for the d features under consideration, this effort is O(dn log n).
So, this scheme can be infeasible for large values of n and d.

Agglomerative Clustering
There are different kinds of agglomerative clustering methods which primarily differ from each other in the similarity measures that they employ.
The widely studied algorithms in this category are the following:
single link (nearest neighbour),
complete link (diameter),
group average (average link),
centroid similarity, and
Ward's criterion (minimum variance).

Agglomerative Clustering
Typically, an agglomerative clustering algorithm goes
through the following steps:
Step 1: Compute the similarity/dissimilarity matrix
between all pairs of patterns. Initialise each cluster
with a distinct pattern.
Step 2: Find the closest pair of clusters and merge them.
Update the proximity matrix to reflect the merge.
Step 3: If all the patterns are in one cluster, stop. Else,
goto Step 2.

Step 1 in the above algorithm requires O(n²) time to compute pair-wise similarities and O(n²) space to store the values, when there are n patterns in the given collection.
There are several ways of implementing the second step. Some methods require only the distance matrix, and some require the distance matrix plus the original data matrix. The following methods require only the distance matrix (they are described with reference to two clusters, C1 and C2):
Single linkage. The distance between two clusters is defined as the minimum of the n1·n2 distances between each observation of cluster C1 and each observation of cluster C2:
d(C1, C2) = min(drs), with r ∈ C1, s ∈ C2.

Complete linkage. The distance between two clusters is defined as the maximum of the n1·n2 distances between each observation of a cluster and each observation of the other cluster:
d(C1, C2) = max(drs), with r ∈ C1, s ∈ C2.
Average linkage. The distance between two clusters is defined as the arithmetic average of the n1·n2 distances between each of the observations of a cluster and each of the observations of the other cluster:
d(C1, C2) = (1 / (n1 n2)) Σ drs, summed over r ∈ C1, s ∈ C2.
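A small sketch of the three linkage distances for two illustrative 2-D clusters (the data and names C1, C2 are not from the source).

```python
def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

C1 = [(1, 1), (2, 1), (1, 2)]
C2 = [(6, 7), (7, 7), (7, 6)]

pairs = [dist(r, s) for r in C1 for s in C2]   # the n1*n2 pairwise distances

single   = min(pairs)                # single linkage
complete = max(pairs)                # complete linkage
average  = sum(pairs) / len(pairs)   # average linkage

print(single, complete, average)
```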

Single Link
In single link clustering[36, 46], the similarity of two
clusters is the similarity between their most similar
(nearest neighbor) members. This method
intuitively gives more importance to the regions
where clusters are closest, neglecting the overall
structure of the cluster.
Hence, this method falls under the category of a
local similarity-based clustering method. Because of
its local behavior, single linkage is capable of
effectively clustering non-elliptical, elongated
shaped groups of data objects.
However, one of the main drawbacks of this method
is its sensitivity to noise and outliers in the data.

Complete Link
Complete link clustering [27] measures the
similarity of two clusters as the similarity of
their most dissimilar members. This is
equivalent to choosing the cluster pair whose
merge has the smallest diameter.
As this method takes the cluster structure into
consideration it is non-local in behavior and
generally obtains compact shaped clusters.
However, similar to single link clustering, this
method is also sensitive to outliers.

Example: Agglomerative
Clustering
The single-link algorithm can be explained
with the help of the data shown in the
previous example.
The dendrogram corresponding to the
single-link algorithm is shown below.
Note that there are 8 clusters to start with,
where each cluster has one element.
The distance matrix using city-block
distance or Manhattan distance is given in
Table 9.9.


Group Averaged and Centroid Agglomerative Clustering
Group Averaged Agglomerative Clustering (GAAC) considers the similarity between all pairs of points present in both clusters and diminishes the drawbacks associated with the single and complete link methods.
Let two clusters Ca and Cb be merged, so that the resulting cluster is Cab = Ca ∪ Cb. The new centroid cab for this cluster is:
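The equation is not shown in this copy; assuming the standard cardinality-weighted average of the two centroids:

```latex
c_{ab} \;=\; \frac{N_a\, c_a + N_b\, c_b}{N_a + N_b}
```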

where Na and Nb are the cardinalities of the clusters Ca and Cb, respectively.
The similarity measure for GAAC is calculated as follows:

The distance between two clusters is the average of all the pair-wise distances between the data points in these two clusters. Hence, this measure is expensive to compute, especially when the number of data objects becomes large. Centroid-based agglomerative clustering, on the other hand, calculates the similarity between two clusters by measuring the similarity between their centroids.
The primary difference between GAAC and centroid agglomerative clustering is that GAAC considers all pairs of data objects for computing the average pair-wise similarity, whereas centroid-based agglomerative clustering uses only the centroid of the cluster to compute the similarity between two different clusters.

Ward's Criterion
Ward's criterion [49, 50] was proposed to compute the distance between two clusters during agglomerative clustering. The process of using Ward's criterion for cluster merging in agglomerative clustering is also called Ward's agglomeration.
It uses the K-means squared error criterion to determine the distance. For any two clusters Ca and Cb, Ward's criterion is calculated by measuring the increase in the value of the SSE criterion for the clustering obtained by merging them into Ca ∪ Cb.
Ward's criterion is defined as follows:
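The equation is not reproduced in this copy; a standard form, consistent with the interpretation given on the next lines, is:

```latex
W(C_a, C_b) \;=\; \frac{N_a N_b}{N_a + N_b}\, \lVert c_a - c_b \rVert^{2}
```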

So Ward's criterion can be interpreted as the squared Euclidean distance between the centroids of the merged clusters Ca and Cb, weighted by a factor that is proportional to the product of the cardinalities of the merged clusters.

Algorithm: Agglomerative Hierarchical Clustering
1: Compute the dissimilarity matrix between all the data points.
2: repeat
3: Merge the two closest clusters as Cab = Ca ∪ Cb. Set the new cluster's cardinality as Nab = Na + Nb.
4: Insert a new row and column containing the distances between the new cluster Cab and the remaining clusters.
5: until Only one maximal cluster remains.

Hierarchical clustering algorithms are computationally expensive.
The agglomerative algorithms require computation and storage of a similarity or dissimilarity matrix, which has O(n²) time and space requirements. They can be used in applications where hundreds of patterns are to be clustered. However, when the data sets are larger in size, these algorithms are not feasible because of the non-linear time and space demands.
It may not be easy to visualise a dendrogram corresponding to 1000 patterns.
Similarly, divisive algorithms require time that is exponential in the number of patterns or the number of features. So, they too do not scale up well in the context of large-scale problems involving millions of patterns.

Density-Based Methods

Partitioning and hierarchical methods are designed to find spherical-shaped clusters. They have difficulty finding clusters of arbitrary shape, such as S-shaped and oval clusters.
Density-based clustering methods can discover clusters of non-spherical shape.

DBSCAN: Density-Based Clustering Based on Connected Regions with High Density
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects that have dense neighbourhoods. It connects core objects and their neighbourhoods to form dense regions as clusters.
A user-specified parameter ε > 0 is used to specify the radius of the neighbourhood considered for every object.
The ε-neighbourhood of an object o is the space within a radius ε centred at o.

Due to the fixed neighbourhood size parameterized by ε, the density of a neighbourhood can be measured simply by the number of objects in the neighbourhood.
To determine whether a neighbourhood is dense or not, DBSCAN uses another user-specified parameter, MinPts, which specifies the density threshold of dense regions.
An object is a core object if the ε-neighbourhood of the object contains at least MinPts objects.

Given a set, D, of objects, we can identify all core objects with respect to the given parameters, ε and MinPts. The clustering task is then reduced to using core objects and their neighbourhoods to form dense regions, where the dense regions are clusters.
For a core object q and an object p, we say that p is directly density-reachable from q (with respect to ε and MinPts) if p is within the ε-neighbourhood of q. Clearly, an object p is directly density-reachable from another object q if and only if q is a core object and p is in the ε-neighbourhood of q.
Using the directly density-reachable relation, a core object can bring all objects from its ε-neighbourhood into a dense region.
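A sketch of the two definitions above (core objects and direct density-reachability); the data set, eps = 1.5 and MinPts = 3 are illustrative values, not taken from the text.

```python
D = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (20, 20)]
eps, min_pts = 1.5, 3

def eps_neighbourhood(o):
    """All objects within radius eps of o (o itself included, as is conventional)."""
    return [p for p in D if ((p[0] - o[0]) ** 2 + (p[1] - o[1]) ** 2) ** 0.5 <= eps]

core_objects = [o for o in D if len(eps_neighbourhood(o)) >= min_pts]
print(core_objects)                 # the first four points form a dense region

q = (1, 1)                          # a core object
p = (2, 2)
print(p in eps_neighbourhood(q))    # p is directly density-reachable from q
```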

Application of Cluster
Analysis

Data Reduction
Hypothesis generation and Testing
Prediction based on groups
Finding K-nearest neighbours
Outlier detection

References:
Pattern Recognition: An Algorithmic Approach, by M. Narasimha Murty and V. Susheela Devi.
Applied Data Mining for Business and Industry, by Paolo Giudici and Silvia Figini, 2nd Edition.
Principles of Data Mining, by Max Bramer.
Data Mining: Multimedia, Soft Computing and Bioinformatics, by Sushmita Mitra and Tinku Acharya.
Data Clustering: Algorithms and Applications, by Charu C. Aggarwal and Chandan K. Reddy.

References
[1] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[2] G. H. Ball and D. J. Hall. ISODATA, a novel method of data analysis and pattern classification. Technical report, DTIC Document, 1965.
[3] P. Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data, J. Kogan, C. Nicholas, and M. Teboulle, Eds., Springer, Berlin Heidelberg, pages 25–71, 2006.
[4] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, 1981.
[5] P. S. Bradley and U. M. Fayyad. Refining initial points for k-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning, volume 66. San Francisco, CA, USA, 1998.
[6] T. Caliński and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1):1–27, 1974.

[7] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
[8] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[9] T. H. Cormen. Introduction to Algorithms. MIT Press, 2001.
[10] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 551–556. ACM, 2004.
[11] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[12] J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57, 1973.
[13] C. Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the International Conference on Machine Learning (ICML), pages 147–153, 2003.
[14] D. Fisher. Optimization and simplification of hierarchical clusterings. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD), pages 118–123, 1995.
[15] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2):139–172, 1987.
[16] J. C. Gower and G. J. S. Ross. Minimum spanning trees and single linkage cluster analysis. Applied Statistics, 18(1):54–64, 1969.

[17] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In ACM SIGMOD Record, volume 27, pages 73–84. ACM, 1998.
[18] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th International Conference on Data Engineering, pages 512–521. IEEE, 1999.
[19] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108, 1979.
[20] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li. Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):657–668, 2005.

[21] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283–304, 1998.
[22] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[23] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.
[24] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, 2002.

[26] L. Kaufman, P. J. Rousseeuw, et al. Finding Groups in Data: An Introduction to Cluster Analysis, volume 39. Wiley Online Library, 1990.
[27] B. King. Step-wise clustering procedures. Journal of the American Statistical Association, 62(317):86–101, 1967.
[28] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.
[29] K. Krishna and M. N. Murty. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 29(3):433–439, 1999.
[30] R. Krishnapuram and J. M. Keller. The possibilistic C-means algorithm: Insights and recommendations. IEEE Transactions on Fuzzy Systems, 4(3):385–393, 1996.

[31] G. N. Lance and W. T. Williams. A general theory of classificatory sorting strategies II. Clustering systems. The Computer Journal, 10(3):271–277, 1967.
[32] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[33] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Berkeley, CA, USA, 1967.
[34] P. Maji and S. K. Pal. Rough set based generalized fuzzy c-means algorithm and quantitative indices. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 37(6):1529–1540, 2007.
[35] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

[36] L. L. McQuitty. Elementary linkage analysis for isolating orthogonal and oblique types and typal relevancies. Educational and Psychological Measurement, 17(2):207–229, 1957.
[37] G. W. Milligan. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46(2):187–199, 1981.
[38] B. G. Mirkin. Clustering for Data Mining: A Data Recovery Approach, volume 3. CRC Press, Boca Raton, FL, 2005.
[39] R. Mojena. Hierarchical grouping methods and stopping rules: An evaluation. The Computer Journal, 20(4):359–363, 1977.
[40] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

[41] C. F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8):1313–1325, 1995.
[42] D. Pelleg and A. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 727–734, San Francisco, CA, USA, 2000.
[43] G. Rudolph. Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks, 5(1):96–101, 1994.
[44] B. Schölkopf, A. Smola, and K. R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[45] S. Z. Selim and M. A. Ismail. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1):81–87, 1984.
[46] P. H. A. Sneath and R. R. Sokal. Numerical taxonomy. Nature, 193:855–860, 1962.
[47] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, volume 400, pages 525–526. Boston, MA, USA, 2000.
[48] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.

[49] J. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
[50] D. Wishart. An algorithm for hierarchical classifications. Biometrics, 25(1):165–170, 1969.
[51] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.
[52] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. Model-based clustering and data transformations for gene expression data. Bioinformatics, 17(10):977–987, 2001.
[53] C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 20(1):68–86, 1971.
