
Cluster Quality Based Performance Evaluation of Hierarchical Clustering Method

Nisha
Student
UIET, Panjab University
Chandigarh, India
[email protected]

Puneet Jai Kaur
Assistant Professor
UIET, Panjab University
Chandigarh, India
[email protected]
Abstract: Clustering is an important phase in data mining. A number of different clustering methods are used to perform cluster analysis: partitioning clustering, hierarchical clustering, grid-based clustering, model-based clustering, graph-based clustering, density-based clustering and so on. The hierarchical method clusters the data objects in the form of a tree known as a hierarchy, and each node in the hierarchy is known as a cluster. Hierarchical clustering can be performed in two ways: agglomerative clustering and divisive clustering. Agglomerative clustering is generally preferred. For a good cluster analysis, the quality of the clusters should be high. In this paper, we measure the quality of clusters with the help of three parameters: cohesion measurement, silhouette index and elapsed time.

Keywords: Data mining, clustering, hierarchical clustering, quality, quality parameters, cohesion, silhouette index, elapsed time.

I. INTRODUCTION

Cluster analysis is a method to detect the number of clusters in a given data set. A data set can be defined as a collection of objects or data [2]. In cluster analysis, the objects are arranged in such a way that they are similar to the objects within the same cluster and different from the objects lying outside it [2]. Clustering is unsupervised learning because it does not need any predefined clusters. Clustering is performed for a number of reasons; the two main ones are data interpretation, meaning that the data can be easily understood, and data compression, meaning that the data can be easily optimized and the data set used efficiently. The main applications of cluster analysis are detecting patterns, classification, data mining, grouping objects based on similarity or dissimilarity criteria, knowledge discovery, searching for objects and so on [3]. The major clustering techniques are classified into the following categories: partitioning clustering, hierarchical clustering, density-based clustering, grid-based clustering, model-based clustering etc. [4]. The main method discussed in this paper is hierarchical clustering, in which a number of clusters are nested in the form of a tree and at each level a cluster is obtained by the union of its sub-clusters. A method that obtains high quality clusters is always desirable; therefore the main idea of this paper is to discuss the various quality criteria used for cluster analysis. The quality criteria discussed later are cluster cohesion, silhouette index and elapsed time.

Cluster cohesion is becoming one of the most important aspects of data mining. It can be used to determine the quality of clusters. Cluster cohesion is the degree of association between the objects of a cluster, so a high value of cluster cohesion is desirable for cluster analysis [5]. A number of different metrics have been proposed over the years to measure cluster cohesion, but no consensus has yet been reached on which metric calculates cohesion best. Existing cohesion metrics are based on similarity measures between the objects of clusters [5]; that is, clustering methods group objects based on the interconnection between the objects of the class. As a result, high similarity means high cohesion between the objects of the clusters, and vice versa.

The main index for measuring the quality of a cluster in graphical form is the silhouette index. This index is preferred over all other graphical indices because the silhouette represents the clusters of a dataset in scattered form. It measures how similar an object is to the objects within its cluster compared to the objects outside the cluster [11].

One further quality criterion is the elapsed time. The elapsed time in cluster analysis is the amount of time taken to perform the clustering. A minimal elapsed time is always desirable for any clustering method; for large databases, the time taken is greater than for small datasets.
II. RELATED WORK

To improve software engineering, a number of data mining techniques can be applied, including association mining, classification, generalization, clustering, decision trees, pattern classification and so on [6]. Clustering (cluster analysis) is an important technique for improving software engineering. According to [7], clustering is defined as the grouping of the data objects in a data set in such a way that the objects in the same cluster are similar to each other but different from the objects outside the cluster. Hierarchical clustering in [8] is defined as a technique in which the data objects are grouped together to form a hierarchy known as a tree. According to [10], the time taken by all the data objects to form the final clusters is known as the elapsed time; for good cluster quality, the elapsed time should be low. In [11], cohesion is defined as the extent to which the different objects in the same cluster are associated with each other; the main motive for good quality clusters is to obtain more cohesive clusters. In [12], the silhouette index is a parameter used to plot the cluster quality in the form of scattered points. The silhouette index takes values in the range [-1, 1].
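As a minimal sketch of the silhouette width computation discussed above (the function name and the 1-D distance function are illustrative assumptions, not from the paper):

```python
def silhouette_width(i, own_cluster, other_clusters, dist):
    # a: average distance between object i and the other
    #    objects in its own cluster
    others = [p for p in own_cluster if p != i]
    a = sum(dist(i, p) for p in others) / len(others)
    # b: the minimum, over all other clusters, of the average
    #    distance between object i and that cluster's objects
    b = min(sum(dist(i, p) for p in c) / len(c) for c in other_clusters)
    # silhouette width lies in [-1, 1]
    return (b - a) / max(a, b)

dist = lambda p, q: abs(p - q)
# object 1.0 sits comfortably in its own cluster: width near +1
print(silhouette_width(1.0, [1.0, 2.0], [[8.0, 9.0]], dist))
```

A width near +1 indicates a well-placed object, near 0 a borderline one, and negative values an object that would fit better in another cluster.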
III. CLUSTERING METHOD

A. Hierarchical Clustering
Hierarchical clustering is a method in which clusters are formed in a tree, or hierarchy. Every node in the tree represents a different cluster, and the tree itself is known as a dendrogram. Hierarchical clustering can be performed in two ways, based on the splitting and merging of clusters: the divisive method and the agglomerative method.

The divisive method of hierarchical clustering is also known as the top-down approach: a large data set is given initially, and this data set is further divided into a number of smaller subsets (known as clusters) until a threshold is reached [7]. The agglomerative method works in the reverse direction of the divisive method. In this method, a number of clusters are given initially, and at each step the two most similar clusters are merged [7]. The clusters are merged together until one large cluster is formed; therefore this method is also known as the bottom-up approach.

The clusters are split or combined down to a specific level. To decide where a cluster should be split, or which two clusters should be combined, a measuring criterion known as the dissimilarity among the sets of data is required.
IV. QUALITY MEASUREMENT

To evaluate the main approach of our study, we consider three main parameters for cluster quality: cohesion measurement, silhouette index and elapsed time. These three parameters are discussed in detail below.

A. Elapsed Time
One of the important criteria in measuring the quality of clusters is the time taken to perform the cluster analysis. The total amount of time taken to form the clusters from the dataset is known as the elapsed time. The less time taken to form the clusters, the better the quality of the clusters.
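As a minimal sketch of how elapsed time can be measured (`cluster_data` below is a hypothetical stand-in for any clustering routine; only the timing pattern around it is the point):

```python
import random
import time

def cluster_data(points):
    # Stand-in for any clustering routine; sorting is just a
    # dummy workload so the timing pattern can be demonstrated.
    return sorted(points)

# a synthetic "population" of 200 rows x 3 attributes
points = [(random.random(), random.random(), random.random())
          for _ in range(200)]

start = time.perf_counter()
clusters = cluster_data(points)
elapsed = time.perf_counter() - start   # elapsed time, in seconds
print(f"elapsed time: {elapsed:.4f} s")
```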
B. Cohesion Measurement
In [11], a new measure of similarity between clusters, named cohesion, was introduced. Cohesion is a measurement criterion used to determine how well the objects of a dataset are combined to form good quality clusters. The main aim of the cohesion measurement is to determine the intra-cluster distance, which is the degree of association between the objects of a dataset within the same cluster. Therefore, a high value of cohesion within a cluster is required for good quality clusters. In our approach, the sum of squared errors is used to determine cohesion.

The sum of squared errors (SSE) is the most widely used and simplest method for cluster analysis. SSE measures the distance within the cluster as the sum of the squared distances between the objects in the same cluster and their mean:

SSE = Σi (xi - x̄)²

where x̄ is the mean of all the objects in a dataset, xi is the object whose difference from the mean is to be calculated, and the symbol Σ tells us to sum the squared differences (xi - x̄)² over all i.
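The SSE computation above can be sketched as follows (a minimal 1-D illustration; the function name is ours, not from the paper):

```python
def sse(objects):
    # Sum of squared errors: squared deviation of each object
    # from the mean of all objects (1-D values for simplicity).
    mean = sum(objects) / len(objects)
    return sum((x - mean) ** 2 for x in objects)

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sse(values))  # mean is 5.0, so SSE = 32.0
```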
C. Silhouette Index
The silhouette index is one of the most preferred quality criteria in cluster analysis. The silhouette index gives a graphical representation that shows how similar an object of a data set is to the other objects in the same cluster [12]. For each object, a value of the silhouette index known as the silhouette width is calculated; this value varies from -1 to +1. There are three cases depending on the value of the silhouette width:

- A silhouette width close to +1 means the object is in the correct cluster.
- A silhouette width close to 0 means the object could equally well belong to some other cluster.
- A silhouette width close to -1 means the object is in the wrong cluster.

The silhouette width of an object can be calculated as:

s(i) = (bi - ai) / max(ai, bi)

where ai is the average distance between object i and all the other objects in the same cluster, and bi is the minimum of the average dissimilarities between object i and the objects of every cluster other than its own. The silhouette index of a clustering is obtained by averaging s(i) over all n objects of the data set, where n is the total number of objects.

The silhouette index has an advantage over all other quality indices: it represents the clusters in a visually scattered form, and the clusters thus obtained are more accurate than with other indices.

V. EXPERIMENTAL DESIGN

Hierarchical clustering is an effective method to evaluate the clusters formed from a given dataset, as it combines a number of clusters into the form of a tree (called a dendrogram) in such a way that the sub-clusters are fundamentally similar to one another.

The first phase in making a dendrogram is to discover the distances between the objects of a dataset. When all the distances between the clusters have been discovered, the merging or splitting operation is applied to the given dataset. The agglomerative algorithm is generally preferred over the divisive algorithm. Therefore, a merge operation is applied to the two nearest sub-clusters (the sub-clusters with the minimum distance between them) to form a hub. The same procedure is then followed for the next two sub-clusters, and so on, until a single cluster is obtained. Note that once grouping begins, we work with genuine objects (e.g. a single value) and pseudo-objects that contain several genuine objects. Methods to determine the distances when dealing with pseudo-objects include centroid linkage, single linkage, complete linkage and normal (average) linkage.

A. Centroid Linkage Clustering
In centroid linkage clustering, the distance between the centers of the respective clusters is calculated, and the clusters with the minimum distance between them are combined to form a hub in the tree. The centroid distance between two clusters can be calculated as:

D(A, B) = || CA - CB ||

where D(A, B) is the distance to be calculated, CA is the center of cluster A and CB is the center of cluster B.

B. Single Linkage Clustering
Single linkage clustering is also known as the nearest neighbour technique because it defines the distance between the two closest objects of the clusters, known as the minimum distance [13]. Mathematically, the single linkage distance can be calculated as:

D(A, B) = min {d(i, j)}

where object i is in cluster A and object j is in cluster B.

C. Complete Linkage Clustering
Complete linkage clustering measures the distance between the two farthest objects of the clusters, known as the maximum cluster distance [13]. It is also known as the farthest neighbour technique. This distance can be calculated as:

D(A, B) = max {d(i, j)}

where object i is in cluster A and object j is in cluster B.

D. Normal Linkage Clustering
In normal linkage clustering, the average distance over all pairs of objects is calculated, known as the average distance. The mathematical expression is:

D(A, B) = average {d(i, j)}

where object i is in cluster A and object j is in cluster B.
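The four linkage distances described above can be sketched as follows (a minimal illustration under the definitions given; the function names are ours and Euclidean distance is assumed):

```python
from itertools import product

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_linkage(A, B):
    # minimum pairwise distance (nearest neighbour)
    return min(euclidean(i, j) for i, j in product(A, B))

def complete_linkage(A, B):
    # maximum pairwise distance (farthest neighbour)
    return max(euclidean(i, j) for i, j in product(A, B))

def normal_linkage(A, B):
    # average of all pairwise distances
    return sum(euclidean(i, j) for i, j in product(A, B)) / (len(A) * len(B))

def centroid_linkage(A, B):
    # distance between the cluster centers
    ca = [sum(c) / len(A) for c in zip(*A)]
    cb = [sum(c) / len(B) for c in zip(*B)]
    return euclidean(ca, cb)

A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(A, B))    # 3.0
print(complete_linkage(A, B))  # sqrt(17), about 4.123
```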
Algorithm 1: Hierarchical Clustering

I. Start with n sub-clusters at level L(0) = 0 and a counter C = 0.

II. Locate the nearest pair of clusters, i.e. the pair (A), (B) with the minimum distance, as indicated by D[(A),(B)] = min d[(i),(j)].

III. Increase the counter: C = C + 1. Merge clusters (A) and (B) into a single cluster, and set the level of this hierarchy to L(C) = D[(A),(B)].

IV. Update the similarity matrix by deleting the rows and columns corresponding to clusters (A) and (B) and adding a row and column corresponding to the newly formed cluster. The similarity between the new cluster (A,B) and an old cluster (k) can be calculated as D[(k),(A,B)] = min[ D[(k),(A)], D[(k),(B)] ].

V. Stop if only one cluster remains; otherwise, go to step II.
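Algorithm 1 can be sketched as a naive implementation (single linkage is assumed, matching the min rule in step IV; the names and the rebuilt cluster list, which stands in for the similarity-matrix update, are ours):

```python
def hierarchical_clustering(points, dist):
    # Agglomerative clustering following Algorithm 1: repeatedly
    # merge the two nearest clusters and record each merge level.
    clusters = [[p] for p in points]        # step I: n singleton clusters
    levels = [0.0]                          # L(0) = 0
    while len(clusters) > 1:                # step V: stop at one cluster
        # step II: locate the nearest pair of clusters (single linkage)
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # steps III-IV: merge (A) and (B); rebuilding the cluster list
        # plays the role of updating the similarity matrix
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        levels.append(d)                    # L(C) = D[(A),(B)]
    return clusters[0], levels

dist = lambda p, q: abs(p - q)
cluster, levels = hierarchical_clustering([1.0, 2.0, 8.0, 9.0], dist)
print(levels)  # [0.0, 1.0, 1.0, 6.0]
```

The recorded levels correspond to the heights at which the dendrogram's branches join.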

VI. METHODOLOGY AND RESULTS

The steps of the methodology followed in this paper are shown in Fig. 1.

Fig. 1. Steps of methodology used

The software required for our proposed work is MATLAB R2009b. To implement the results, we have taken population size as our dataset, and we performed cluster analysis on five different sizes of the same dataset. The first dataset is 100x3, where 100 represents the rows (i.e. the total number of values for each attribute) and 3 represents the total number of attributes in the dataset. The size of the dataset is then increased to 200, 300 and so on.

The next step is to evaluate all three quality parameters on these population sizes, from which the following results are obtained. According to the result analysis, the elapsed time increases with the volume of population, except for the first run. On the first run, the time taken to make the clusters is maximal because the memory locations of the clusters are also identified initially. The population volume with the minimum elapsed time is considered to yield the best quality clusters; in our result analysis, the time taken for the population volume of 200x3 is the minimum, so we can say that this volume is best suited for clustering. The second observation from our results is that the cohesion measurement alternates as the volume of population increases: cohesion is minimal at a volume of 100x3, increases at 200x3, decreases again at 300x3, and so on. We can therefore say that, in our case, cohesion depends only on the type of clusters formed by the hierarchical clustering method, not on the size or volume of the dataset. The main advantage of the cohesion measurement is that it gives us information about the association between the objects of a dataset, so a high value of cohesion results in good cluster quality. From this we observe that a population volume with 200 records is the best cluster volume for cluster analysis. The third observation is that the silhouette index for the hierarchical clustering method is the worst among most clustering methods, because it places most of the objects in wrong clusters; the only use of the silhouette index in the case of hierarchical clustering is that it gives us the relationship of an object with its neighbouring clusters and represents it in the form of scattered points. Our result analysis is shown below:

Table 1. Elapsed Time for Different Population Sizes

Cluster Volume    Elapsed Time
100x3             0.092
200x3             0.010
300x3             0.012
400x3             0.020
500x3             0.031

Table 2. Cohesion Measurement for Different Population Sizes

Cluster Volume    Cohesion Measurement
100x3             0.8210
200x3             0.8294
300x3             0.8219
400x3             0.8253
500x3             0.8215

No table is generated for the silhouette index because the silhouette index represents values only in graphical form. The corresponding graphical representations of the three parameters are shown in Fig. 2, Fig. 3 and Fig. 4.

Fig. 2. Graphical representation of Elapsed Time

Fig. 3. Graphical representation of Cohesion Measurement

Fig. 4. Silhouette plotting for different population sizes

VII. CONCLUSION

The main technique in data mining to improve software engineering is discussed in this paper. This paper explored the hierarchical clustering method to improve the quality of the clusters. The proposed work has been implemented using three quality parameters: cohesion measurement, silhouette index and elapsed time. All the existing cohesion metrics, such as LCOM (lack of cohesion metric), LCC (loose class cohesion), TCC (tight class cohesion) etc., are based on the attributes of the objects in a dataset. The cohesion measurement is directly or indirectly based on the closeness of the objects; therefore, we used cluster analysis to produce more efficient clusters of related objects in a dataset. The silhouette index is preferred over all other indices because only the silhouette index represents the quality of clusters graphically in the form of scattered data points. The elapsed time is taken as one of the parameters for identifying the quality of clusters because less time taken results in better quality clusters. So we can say that a cluster with high cohesion, zero or positive values of the silhouette index and a low elapsed time is a good quality cluster.

The achieved results show the relationship between all three parameters. The approach performed here can also be applied to various object-oriented systems to draw more general conclusions. In future work, we propose to improve the silhouette index values using another clustering method, since the silhouette index is worst for hierarchical clustering; we will then evaluate the quality parameters of that clustering method as well, and finally compare the results of both methods.

References

[1] T. Xie, S. Thummalapenta et al., "Data mining for software engineering," IEEE Computer, pp. 55-62, August 2009.
[2] N. Grira, M. Crucianu et al., "Unsupervised and Semi-supervised Clustering: a Brief Survey," INRIA Rocquencourt, B.P. 105, France, pp. 1-12, August 15, 2005.
[3] H. Wahidah, L. V. Pey et al., "Application of Data Mining Techniques for Improving Software Engineering," The 5th International Conference on Information Technology, vol. 2, pp. 1-5, 2011.
[4] R. R. Henrique, E. A. A. Ahmed, "Proposed Application of Data Mining Technique for Clustering Software Projects," INFOCOMP special edition, pp. 43-48, Jul 2010.
[5] C. Keith, A. Peter et al., "Towards Automating Class-Splitting Using Betweenness Clustering," IEEE/ACM International Conference on Automated Software Engineering, pp. 595-599, Nov 2009.
[6] K. J. Puneet, Pallavi, "Data Mining Techniques for Software Defect Prediction," International Journal of Software and Web Sciences, vol. 3, pp. 54-57, Feb 2013.
[7] J. Aastha, K. Rajneet, "Review: Comparative Study of Various Clustering Techniques in Data Mining," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, pp. 55-57, March 2013.
[8] R. Yogita, R. Harish, "A Study of Hierarchical Clustering Algorithm," International Journal of Information and Computation Technology, vol. 3, pp. 1225-1232, Nov 2013.
[9] M. Fionn, C. Pedro, "Methods of Hierarchical Clustering," pp. 1-21, May 3, 2011.
[10] K. Mandeep, K. Usvir, "Comparison Between K-mean and Hierarchical Algorithm Using Query Redirection," IJARCSSE, vol. 3, pp. 1454-1459, Jul 2013.
[11] S. Lazhar, B. Mourad et al., "Improving Class Cohesion Measurement: Towards a Novel Approach Using Hierarchical Clustering," Journal of Software Engineering and Applications, vol. 5, pp. 449-458, Jul 2012.
[12] S. Chatti, G. R. Krishna et al., "A Method to Find Optimum Number of Clusters Based on Fuzzy Silhouette on Dynamic Data Set," Procedia Computer Science, vol. 46, pp. 346-353, 2015.
[13] B. Ederson, F. G. Daniel et al., "Silhouette-Based Clustering Using an Immune Network," IEEE World Congress on Computational Intelligence, Brisbane, Australia, pp. 1-9, June 2012.
[14] P. K. Shraddha, M. Emmanuel, "Review and Comparative Study of Clustering Techniques," International Journal of Computer Science and Information Technologies, vol. 5, pp. 805-812, 2014.
[15] N. Thanh, B. Asim et al., "Automatic spike sorting by unsupervised clustering with diffusion maps and silhouettes," Neurocomputing, vol. 153, pp. 199-210, April 2015.
[16] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann/Elsevier, India.
