
2010 IEEE International Conference on Data Mining

Understanding of Internal Clustering Validation Measures

Yanchi Liu (1,2), Zhongmou Li (2), Hui Xiong (2), Xuedong Gao (1), Junjie Wu (3)

(1) School of Economics and Management, University of Science and Technology Beijing, China
    [email protected], [email protected]
(2) MSIS Department, Rutgers Business School, Rutgers University, USA
    [email protected], [email protected]
(3) School of Economics and Management, Beihang University, China
    [email protected]

Abstract—Clustering validation has long been recognized as one of the vital issues essential to the success of clustering applications. In general, clustering validation can be categorized into two classes, external clustering validation and internal clustering validation. In this paper, we focus on internal clustering validation and present a detailed study of 11 widely used internal clustering validation measures for crisp clustering. From five conventional aspects of clustering, we investigate their validation properties. Experiment results show that S_Dbw is the only internal validation measure which performs well in all five aspects, while the other measures have certain limitations in different application scenarios.

I. INTRODUCTION

Clustering, one of the most important unsupervised learning problems, is the task of dividing a set of objects into clusters such that objects within the same cluster are similar while objects in different clusters are distinct. Clustering is widely used in many fields, such as image analysis and bioinformatics. As an unsupervised learning task, it is necessary to validate the goodness of partitions after clustering; otherwise, it would be difficult to make use of different clustering results.

Clustering validation, which evaluates the goodness of clustering results [1], has long been recognized as one of the vital issues essential to the success of clustering applications [2]. External clustering validation and internal clustering validation are the two main categories of clustering validation. The main difference is whether or not external information is used for validation. An example of an external validation measure is entropy, which evaluates the "purity" of clusters based on given class labels [3].

Unlike external validation measures, which use external information not present in the data, internal validation measures rely only on information in the data and evaluate the goodness of a clustering structure without respect to external information [4]. Since external validation measures know the "true" cluster number in advance, they are mainly used for choosing an optimal clustering algorithm on a specific data set. Internal validation measures, on the other hand, can be used to choose both the best clustering algorithm and the optimal cluster number without any additional information. In practice, external information such as class labels is often unavailable, and in that situation internal validation measures are the only option for cluster validation.

In the literature, a number of internal clustering validation measures for crisp clustering have been proposed, such as CH, I, DB, SD, and S_Dbw. However, existing measures can be affected by various data characteristics. For example, noise can have a significant impact on the performance of an internal validation measure if minimum or maximum pairwise distances are used in the measure. The performance of existing measures in different situations remains largely unexamined. Therefore, we present a detailed study of 11 widely used internal validation measures, as shown in Table I. We investigate their validation properties in five different aspects: monotonicity, noise, density, subclusters, and skewed distributions. For each aspect, we generate synthetic data that represent the property well and use them for experiments. Finally, the experiment results show that S_Dbw is the only internal validation measure which performs well in all five aspects, while the other measures have certain limitations in different application scenarios, mainly in the aspects of noise and subclusters.

II. INTERNAL CLUSTERING VALIDATION MEASURES

In this section, we introduce some basic concepts of internal validation measures, as well as a suite of 11 widely used internal validation indices.

As the goal of clustering is to make objects within the same cluster similar and objects in different clusters distinct, internal validation measures are often based on the following two criteria [4][5].

I. Compactness. It measures how closely related the objects in a cluster are. One group of measures evaluates cluster compactness based on variance, where lower variance indicates better compactness. Numerous other measures estimate cluster compactness based on distance, such as maximum or average pairwise distance, and maximum or average center-based distance.
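The two criteria can be made concrete with a small sketch. The snippet below is illustrative only (toy 2-D data of our own, not from the paper): it computes a variance-based compactness and a center-distance separation, the two quantities most of the Table I indices combine.

```python
# Toy illustration (not from the paper) of the two criteria behind
# internal validation measures: compactness and separation.
from math import dist

# Two hand-made 2-D clusters (hypothetical data).
clusters = [
    [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    [(5.0, 5.0), (5.1, 4.8), (4.9, 5.2)],
]

def center(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(2))

# Variance-based compactness: mean squared distance to the cluster center.
def compactness(points):
    c = center(points)
    return sum(dist(p, c) ** 2 for p in points) / len(points)

# Center-distance separation between two clusters.
def separation(a, b):
    return dist(center(a), center(b))

comp = [compactness(c) for c in clusters]
sep = separation(*clusters)
print(comp)  # small values -> tight clusters
print(sep)   # large relative to compactness -> well separated
```

A good partition is one where compactness stays small while separation stays large; the indices in Table I differ mainly in how they combine the two.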

1550-4786/10 $26.00 © 2010 IEEE


DOI 10.1109/ICDM.2010.35
Table I
INTERNAL CLUSTERING VALIDATION MEASURES

1. Root-mean-square std dev (RMSSTD), optimal value: elbow.
   RMSSTD = { Σ_i Σ_{x∈C_i} ||x − c_i||² / [P Σ_i (n_i − 1)] }^(1/2)
2. R-squared (RS), optimal value: elbow.
   RS = ( Σ_{x∈D} ||x − c||² − Σ_i Σ_{x∈C_i} ||x − c_i||² ) / Σ_{x∈D} ||x − c||²
3. Modified Hubert Γ statistic (Γ), optimal value: elbow.
   Γ = (2 / (n(n − 1))) Σ_{x∈D} Σ_{y∈D} d(x, y) · d(c_i, c_j), where x ∈ C_i and y ∈ C_j
4. Calinski-Harabasz index (CH), optimal value: max.
   CH = [ Σ_i n_i d²(c_i, c) / (NC − 1) ] / [ Σ_i Σ_{x∈C_i} d²(x, c_i) / (n − NC) ]
5. I index (I), optimal value: max.
   I = ( (1/NC) · ( Σ_{x∈D} d(x, c) / Σ_i Σ_{x∈C_i} d(x, c_i) ) · max_{i,j} d(c_i, c_j) )^p
6. Dunn's index (D), optimal value: max.
   D = min_i { min_{j≠i} ( min_{x∈C_i, y∈C_j} d(x, y) / max_k { max_{x,y∈C_k} d(x, y) } ) }
7. Silhouette index (S), optimal value: max.
   S = (1/NC) Σ_i { (1/n_i) Σ_{x∈C_i} [b(x) − a(x)] / max[b(x), a(x)] },
   where a(x) = (1/(n_i − 1)) Σ_{y∈C_i, y≠x} d(x, y) and b(x) = min_{j≠i} [ (1/n_j) Σ_{y∈C_j} d(x, y) ]
8. Davies-Bouldin index (DB), optimal value: min.
   DB = (1/NC) Σ_i max_{j≠i} { [ (1/n_i) Σ_{x∈C_i} d(x, c_i) + (1/n_j) Σ_{x∈C_j} d(x, c_j) ] / d(c_i, c_j) }
9. Xie-Beni index (XB), optimal value: min.
   XB = [ Σ_i Σ_{x∈C_i} d²(x, c_i) ] / [ n · min_{i,j≠i} d²(c_i, c_j) ]
10. SD validity index (SD), optimal value: min.
    SD = Dis(NC_max) · Scat(NC) + Dis(NC),
    where Scat(NC) = (1/NC) Σ_i ||σ(C_i)|| / ||σ(D)|| and
    Dis(NC) = ( max_{i,j} d(c_i, c_j) / min_{i,j} d(c_i, c_j) ) Σ_i ( Σ_j d(c_i, c_j) )^(−1)
11. S_Dbw validity index (S_Dbw), optimal value: min.
    S_Dbw = Scat(NC) + Dens_bw(NC),
    where Dens_bw(NC) = (1/(NC(NC − 1))) Σ_i [ Σ_{j≠i} Σ_{x∈C_i∪C_j} f(x, u_ij) / max{ Σ_{x∈C_i} f(x, c_i), Σ_{x∈C_j} f(x, c_j) } ]

Notation — D: data set; n: number of objects in D; c: center of D; P: number of attributes of D; NC: number of clusters; C_i: the i-th cluster; n_i: number of objects in C_i; c_i: center of C_i; σ(C_i): variance vector of C_i; d(x, y): distance between x and y; ||X_i|| = (X_i^T · X_i)^(1/2).
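As a sanity check on the definitions above, the following sketch evaluates three of the Table I indices (CH, DB, and Dunn's D) on a tiny hand-made data set. This is our own plain-Python rendering of the formulas, not the authors' implementation, and the data are hypothetical.

```python
# Sketch of three Table I definitions (CH, DB, Dunn's D) in plain Python,
# on a tiny hypothetical 2-D data set of three well-separated clusters.
from math import dist
from itertools import combinations

clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
    [(0.0, 8.0), (0.0, 9.0), (1.0, 8.0)],
]

def center(pts):
    return tuple(sum(p[k] for p in pts) / len(pts) for k in range(2))

data = [p for c in clusters for p in c]
c_all = center(data)
centers = [center(c) for c in clusters]
n, NC = len(data), len(clusters)

# CH = [sum_i n_i d^2(c_i, c) / (NC-1)] / [sum_i sum_{x in C_i} d^2(x, c_i) / (n-NC)]
ssb = sum(len(c) * dist(ci, c_all) ** 2 for c, ci in zip(clusters, centers))
sse = sum(dist(x, ci) ** 2 for c, ci in zip(clusters, centers) for x in c)
CH = (ssb / (NC - 1)) / (sse / (n - NC))

# DB = (1/NC) sum_i max_{j != i} (s_i + s_j) / d(c_i, c_j),
# with s_i the average distance of C_i's members to its center.
s = [sum(dist(x, ci) for x in c) / len(c) for c, ci in zip(clusters, centers)]
DB = sum(
    max((s[i] + s[j]) / dist(centers[i], centers[j])
        for j in range(NC) if j != i)
    for i in range(NC)
) / NC

# Dunn's D = min inter-cluster point distance / max cluster diameter.
min_between = min(dist(x, y)
                  for a, b in combinations(clusters, 2)
                  for x in a for y in b)
max_diam = max(dist(x, y) for c in clusters for x, y in combinations(c, 2))
D = min_between / max_diam

print(CH, DB, D)  # high CH, low DB, high D all indicate a good partition
```

On this well-separated toy partition all three indices agree: CH is large, DB is small, and D is well above 1, matching the optimal-value column of Table I.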

II. Separation. It measures how distinct or well-separated a cluster is from other clusters. For example, the pairwise distances between cluster centers or the pairwise minimum distances between objects in different clusters are widely used as measures of separation. Measures based on density are also used in some indices.

The general procedure for determining the best partition and optimal cluster number of a set of objects using internal validation measures is as follows.
Step 1: Initialize a list of clustering algorithms which will be applied to the data set.
Step 2: For each clustering algorithm, use different combinations of parameters to get different clustering results.
Step 3: Compute the corresponding internal validation index of each partition obtained in Step 2.
Step 4: Choose the best partition and the optimal cluster number according to the criteria.

Table I shows a suite of 11 widely used internal validation measures. To the best of our knowledge, these measures represent a good coverage of the validation measures available in different fields, such as data mining, information retrieval, and machine learning. The "Definition" column gives the computation forms of the measures. Next, we briefly introduce these measures.

Most indices consider both evaluation criteria (compactness and separation) in the form of a ratio or a summation, such as DB, XB, and S_Dbw. Some indices, on the other hand, consider only one aspect, such as RMSSTD, RS, and Γ.

The Root-mean-square standard deviation (RMSSTD) is the square root of the pooled sample variance of all the attributes [6]. It measures the homogeneity of the formed clusters. R-squared (RS) is the ratio of the sum of squares between clusters to the total sum of squares of the whole data set. It measures the degree of difference between clusters [6][7]. The Modified Hubert Γ statistic (Γ) [8] evaluates the difference between clusters by counting the disagreements of pairs of data objects in two partitions.

The Calinski-Harabasz index (CH) [9] evaluates the cluster validity based on the average between- and within-cluster sum of squares. Index I (I) [1] measures separation based on the maximum distance between cluster centers, and measures compactness based on the sum of distances between objects and their cluster center. Dunn's index (D) [10] uses the minimum pairwise distance between objects in different clusters as the inter-cluster separation and the maximum diameter among all clusters as the intra-cluster compactness. These three indices take the form Index = (a · Separation)/(b · Compactness), where a and b are weights. The optimal cluster number is determined by maximizing the value of these indices.

The Silhouette index (S) [11] validates the clustering performance based on the pairwise difference of between- and within-cluster distances. The optimal cluster number is determined by maximizing the value of this index.

The Davies-Bouldin index (DB) [12] is calculated as follows. For each cluster C, the similarities between C and all other clusters are computed, and the highest value is assigned to C as its cluster similarity. The DB index is then obtained by averaging all the cluster similarities. The smaller the index, the better the clustering result; by minimizing this index, clusters are the most distinct from each other, which yields the best partition.

The Xie-Beni index (XB) [13] defines the inter-cluster separation as the minimum square distance between cluster centers, and the intra-cluster compactness as the mean square distance between each data object and its cluster center. The optimal cluster number is reached when the minimum of XB is found. Kim et al. [14] proposed the indices DB** and XB** in 2005 as improvements of DB and XB. In this paper, we use these two improved measures.

The idea of the SD index (SD) [15] is based on the concepts of the average scattering and the total separation of clusters. The first term evaluates compactness based on the variances of cluster objects, and the second term evaluates separation based on the distances between cluster centers. The value of this index is the summation of these two terms, and the optimal number of clusters can be obtained by minimizing the value of SD.

The S_Dbw index (S_Dbw) [16] takes density into account to measure the inter-cluster separation. The basic idea is that for each pair of cluster centers, at least one of their densities should be larger than the density of their midpoint. The intra-cluster compactness is the same as in SD. Similarly, the index is the summation of these two terms, and the minimum value of S_Dbw indicates the optimal cluster number.

There are some other internal validation measures in the literature [17][18][19][20]. However, some have poor performance while others are designed for data sets with specific structures. Take the Composed Density between and within clusters index (CDbw) and the Symmetry distance-based index (Sym-index) as examples. It is hard for CDbw to find the representatives for each cluster, which makes its results unstable. Also, Sym-index can only handle data sets which are internally symmetrical. As a result, we focus on the above-mentioned 11 internal validation measures in the rest of the paper, and throughout this paper we will use their acronyms.

III. UNDERSTANDING OF INTERNAL CLUSTERING VALIDATION MEASURES

In this section, we present a detailed study of the 11 internal validation measures mentioned in Section II and investigate their validation properties in different aspects, which may be helpful for index selection. Unless mentioned otherwise, we use K-means [21] (implemented by CLUTO [22]) as the clustering algorithm in the experiments.

A. The Impact of Monotonicity

The monotonicity of the internal validation indices can be evaluated by the following experiment. We apply the K-means algorithm to the data set Wellseparated and obtain clustering results for different numbers of clusters. As shown in Figure 1, Wellseparated is a synthetic data set composed of five well-separated clusters.

Figure 1. The Data Set Wellseparated

As the experiment results in Table II show, the first three indices monotonically increase or decrease as the cluster number NC increases. The remaining eight indices, on the other hand, reach their maximum or minimum value when NC equals the true cluster number. There are certain reasons for the monotonicity of the first three indices.

Table II
EXPERIMENT RESULTS OF THE IMPACT OF MONOTONICITY, TRUE NC = 5

NC  RMSSTD  RS     Γ     CH     I       D      S      DB**   SD     S_Dbw   XB**
2   28.496  0.627  2973  1683   3384    0.491  0.607  0.716  0.215  61.843  0.265
3   20.804  0.801  3678  2016   5759    0.549  0.707  0.683  0.124  0.153   0.374
4   14.829  0.899  4007  2968   11230   0.580  0.825  0.522  0.075  0.059   0.495
5   3.201   0.994  4342  52863  106163  2.234  0.913  0.122  0.045  0.004   0.254
6   3.081   0.995  4343  45641  82239   0.025  0.718  0.521  0.504  0.066   35.099
7   2.957   0.996  4344  41291  68894   0.017  0.579  0.803  0.486  0.098   35.099
8   2.834   0.996  4346  38580  58420   0.009  0.475  1.016  0.538  0.080   36.506
9   2.715   0.997  4347  36788  50259   0.010  0.391  1.168  0.553  0.113   38.008

RMSSTD = √(SSE/[P(n − NC)]), and SSE (Sum of Squared Errors) decreases as NC increases. In practice NC ≪ n, so n − NC can be viewed as a constant. Therefore, RMSSTD decreases as NC increases. We also have RS = (TSS − SSE)/TSS, where TSS (Total Sum of Squares) satisfies TSS = SSE + SSB (SSB: between-group Sum of Squares) and is constant for a given data set. Thus, RS increases as NC increases.

From the definition of Γ, only data objects in different clusters are counted in the equation. Therefore, if the data set is divided into two equal clusters, each cluster will have n/2 objects, and n²/4 pairs of distances will actually be counted. If the data set is divided into three equal clusters, each cluster will have n/3 objects, and n²/3 pairs of distances will be counted. Therefore, as the cluster number NC increases, more pairs of distances are counted, which makes Γ increase.

Looking further into these three indices, we find that they only take either separation or compactness into account (RS and Γ only consider separation, and RMSSTD only considers compactness). Because of the monotonicity property, the curves of RMSSTD, RS, and Γ are either upward or downward. It is claimed that the optimal cluster number is reached at the shift point of the curves,
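The monotonicity argument for RMSSTD and RS can be reproduced numerically. The sketch below uses toy 1-D data of our own (not the Wellseparated set) and refines a partition by repeatedly splitting the cluster with the largest SSE; such a split can never increase SSE, so RMSSTD falls and RS rises with NC, exactly as derived above.

```python
# Demonstrates the monotonicity of RMSSTD = sqrt(SSE / (P (n - NC))) and
# RS = (TSS - SSE) / TSS as NC grows. Toy 1-D data, not the paper's sets.
from math import sqrt

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2, 14.0, 14.1, 14.2]
P, n = 1, len(data)

def sse(groups):
    total = 0.0
    for g in groups:
        m = sum(g) / len(g)
        total += sum((x - m) ** 2 for x in g)
    return total

mean_all = sum(data) / n
TSS = sum((x - mean_all) ** 2 for x in data)

rmsstd_vals, rs_vals = [], []
groups = [sorted(data)]
for NC in range(1, 6):
    e = sse(groups)
    rmsstd_vals.append(sqrt(e / (P * (n - NC))))
    rs_vals.append((TSS - e) / TSS)
    # Refine: split the worst cluster at its mean (SSE can only go down).
    worst = max(groups, key=lambda g: sse([g]))
    m = sum(worst) / len(worst)
    groups.remove(worst)
    groups += [[x for x in worst if x <= m], [x for x in worst if x > m]]

print([round(v, 3) for v in rmsstd_vals])  # decreasing
print([round(v, 3) for v in rs_vals])      # increasing toward 1
```

Because both curves move monotonically regardless of the true structure, neither index has an interior optimum; only the "elbow" of the curve carries information.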

which is also known as "the elbow" [7]. However, since the judgement of the shift point is very subjective and hard to make, we will not discuss these three indices in the following sections.

B. The Impact of Noise

To evaluate the influence of noise on internal validation indices, we conduct the following experiment on the data set Wellseparated.noise. As shown in Figure 2, Wellseparated.noise is a synthetic data set formed by adding 5% noise to the data set Wellseparated. The cluster numbers selected by the indices are shown in Table III. The experiment results show that D and CH choose the wrong cluster number. From our point of view, there are certain reasons why D and CH are significantly affected by noise.

Figure 2. The Data Set Wellseparated-noise

Table III
EXPERIMENT RESULTS OF THE IMPACT OF NOISE, TRUE NC = 5

NC  CH     I      D       S      DB**   SD     S_Dbw   XB**
2   1626   3213   0.0493  0.590  0.739  0.069  20.368  0.264
3   1846   5073   0.0574  0.670  0.721  0.061  0.523   0.380
4   2554   9005   0.0844  0.783  0.560  0.050  0.087   0.444
5   10174  51530  0.0532  0.870  0.183  0.045  0.025   0.251
6   14677  48682  0.0774  0.802  0.508  0.046  0.044   0.445
7   12429  37568  0.0682  0.653  0.710  0.055  0.070   0.647
8   11593  29693  0.0692  0.626  0.863  0.109  0.052   2.404
9   11088  25191  0.0788  0.596  0.993  0.121  0.056   3.706

D uses the minimum pairwise distance between objects in different clusters (min_{x∈C_i, y∈C_j} d(x, y)) as the inter-cluster separation, and the maximum diameter among all clusters (max_k {max_{x,y∈C_k} d(x, y)}) as the intra-cluster compactness. The optimal number of clusters is obtained by maximizing the value of D. When noise is introduced, the inter-cluster separation can decrease sharply, since D uses the minimum pairwise distance, rather than the average pairwise distance, between objects in different clusters. Thus, the value of D may change dramatically, and the corresponding optimal cluster number will be influenced by the noise.

Since CH = (SSB/SSE) · ((n − NC)/(NC − 1)), and ((n − NC)/(NC − 1)) is constant for the same NC, we can focus on the SSB/SSE part. When noise is introduced, SSE increases much more significantly than SSB. Therefore, for the same NC, CH decreases under the influence of noise, which makes the value of CH unstable. As a result, the optimal cluster number will be affected by noise.

Moreover, the indices other than CH and D are also influenced by noise, though less sensitively. Comparing Table III with Table II, we can observe that the values of the other indices change more or less. If we add 20% noise to the data set Wellseparated, the optimal cluster number suggested by I also becomes incorrect. Thus, to minimize the adverse effect of noise, in practice it is always good to remove noise before clustering.

C. The Impact of Density

Data sets with varying density are challenging for many clustering algorithms. Therefore, we are very interested in whether density also affects the performance of the internal validation measures. An experiment is conducted on a synthetic data set with clusters of different density, named Differentdensity and shown in Figure 3. The results listed in Table IV show that only I suggests the wrong optimal cluster number.

Figure 3. The Data Set Differentdensity

Table IV
EXPERIMENT RESULTS OF THE IMPACT OF DENSITY, TRUE NC = 3

NC  CH    I      D       S      DB**   SD     S_Dbw  XB**
2   1172  120.1  0.0493  0.587  0.658  0.705  0.603  0.408
3   1197  104.3  0.0764  0.646  0.498  0.371  0.275  0.313
4   1122  93.5   0.0048  0.463  1.001  0.672  0.401  3.188
5   932   78.6   0.0049  0.372  1.186  0.692  0.367  3.078
6   811   59.9   0.0049  0.312  1.457  0.952  0.312  6.192
7   734   56.1   0.0026  0.278  1.688  1.192  0.298  9.082
8   657   44.8   0.0026  0.244  1.654  1.103  0.291  8.897
9   591   45.5   0.0026  0.236  1.696  1.142  0.287  8.897

The reason why I does not give the right cluster number is not easy to tell. We can observe that I keeps decreasing as the cluster number NC increases. One possible reason, by our guess, is the uniform effect of the K-means algorithm, which tends to divide objects into clusters of relatively equal size [23]. I measures compactness based on the sum of distances between objects and their cluster center. When NC is small, objects with high density are likely in the same cluster, which makes the sum of distances remain almost the same. Since most of the objects are in one cluster, the total sum does not change much either. Therefore, as NC increases, I decreases, since NC appears in the denominator.
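The sensitivity of D to noise can be seen in a few lines. The sketch below is a hypothetical illustration of our own, not the paper's experiment: a single stray point between two well-separated clusters both shrinks the minimum inter-cluster distance and inflates the maximum diameter, so Dunn's index collapses.

```python
# Sketch of the noise argument above: one stray point between two clusters
# collapses Dunn's index, because D uses the *minimum* pairwise inter-cluster
# distance and the *maximum* cluster diameter. Hypothetical data.
from math import dist
from itertools import combinations

def dunn(clusters):
    min_between = min(dist(x, y)
                      for a, b in combinations(clusters, 2)
                      for x in a for y in b)
    max_diam = max(dist(x, y) for c in clusters for x, y in combinations(c, 2))
    return min_between / max_diam

c1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
c2 = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]

clean = dunn([c1, c2])
# One noise point near the midpoint, assigned (arbitrarily) to c1:
noisy = dunn([c1 + [(5.2, 5.2)], c2])
print(clean, noisy)  # the noisy value is far smaller
```

An index built on average distances instead of extremes would barely move under the same perturbation, which is why minimum/maximum-based measures are singled out in the introduction as noise-sensitive.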

D. The Impact of Subclusters

Subclusters are clusters that are close to each other. Figure 4 shows a synthetic data set Subcluster which contains five clusters; four of them are subclusters, since they form two pairs of close clusters.

Figure 4. The Data Set Subcluster

The experiment results presented in Table V evaluate whether the internal validation measures can handle data sets with subclusters. For the data set Subcluster, D, S, DB**, SD, and XB** choose the wrong optimal cluster numbers, while CH, I, and S_Dbw suggest the correct one. Inter-cluster separation is supposed to have a sharp decrease when the cluster number changes from NC_optimal to NC_optimal + 1 [14]. However, for D, S, DB**, SD, and XB**, sharper decreases can be observed at NC < NC_optimal. The reasons are as follows.

Table V
EXPERIMENT RESULTS OF THE IMPACT OF SUBCLUSTERS, TRUE NC = 5

NC  CH     I     D       S      DB**   SD     S_Dbw  XB**
2   3474   2616  0.7410  0.736  0.445  0.156  0.207  0.378
3   7851   5008  0.7864  0.803  0.353  0.096  0.056  0.264
4   8670   5594  0.0818  0.737  0.540  0.164  0.039  1.420
5   16630  9242  0.0243  0.709  0.414  0.165  0.026  1.215
6   14310  7021  0.0243  0.587  0.723  0.522  0.063  12.538
7   12900  5745  0.0167  0.490  0.953  0.526  0.101  12.978
8   11948  4803  0.0167  0.402  1.159  0.535  0.105  14.037
9   11354  4248  0.0107  0.350  1.301  0.545  0.108  14.858

S uses the average minimum distance between clusters as the inter-cluster separation. For a data set with subclusters, this separation achieves its maximum value when subclusters close to each other are treated as one big cluster. Therefore, the wrong optimal cluster number is chosen due to subclusters. XB** uses the minimum pairwise distance between cluster centers as its measure of separation. For a data set with subclusters, this measure likewise achieves its maximum value when subclusters close to each other are treated as one big cluster. As a result, the correct cluster number cannot be found using XB**. The reasons for D, SD, and DB** are very similar to that of XB**; we do not elaborate on them here due to space limitations.

E. The Impact of Skewed Distributions

It is common for clusters in a data set to have unequal sizes. Figure 5 shows a synthetic data set Skewdistribution with a skewed distribution. It consists of one large cluster and two small ones. Since K-means has the uniform effect, which tends to divide objects into clusters of relatively equal size, it does not perform well on skewed distributed data sets [23]. To demonstrate this, we employ four widely used algorithms from four different categories: K-means (prototype-based), DBSCAN (density-based) [24], Agglo with average-link (hierarchical) [2], and Chameleon (graph-based) [25]. We apply each of them to Skewdistribution and divide the data set into three clusters, since three is the true cluster number. As shown in Figure 6, K-means performs the worst while Chameleon performs the best.

Figure 5. The Data Set Skewdistribution

An experiment is conducted on the data set Skewdistribution to evaluate the performance of the different indices on data with skewed distributions. We use Chameleon as the clustering algorithm. The experiment results listed in Table VI show that only CH cannot give the right optimal cluster number. Since CH = (TSS/SSE − 1) · ((n − NC)/(NC − 1)) and TSS is constant for a given data set, CH is essentially based on SSE, which shares the same basis as the K-means algorithm. As mentioned above, K-means cannot handle skewed distributed data sets; a similar conclusion therefore applies to CH.

Table VI
EXPERIMENT RESULTS OF THE IMPACT OF SKEWED DISTRIBUTIONS, TRUE NC = 3

NC  CH    I      D       S      DB**   SD     S_Dbw  XB**
2   788   232.3  0.0286  0.621  0.571  0.327  0.651  0.369
3   1590  417.9  0.0342  0.691  0.466  0.187  0.309  0.264
4   1714  334.5  0.0055  0.538  0.844  0.294  0.379  1.102
5   1905  282.9  0.0069  0.486  0.807  0.274  0.445  0.865
6   1886  226.7  0.0075  0.457  0.851  0.308  0.547  1.305
7   1680  187.1  0.0071  0.371  1.181  0.478  0.378  3.249
8   1745  172.9  0.0075  0.370  1.212  0.474  0.409  3.463
9   1317  125.5  0.0061  0.301  1.875  0.681  0.398  7.716

Table VII lists the validation properties of all 11 internal validation measures in the five aspects studied above, which may serve as a guide for index selection in practice. In the table, '–' stands for a property not tested, and '×' denotes a situation that cannot be handled. From Table VII we can see that S_Dbw is the only internal validation measure which performs well in all five aspects, while the other measures have certain limitations in different scenarios, mainly in the aspects of noise and subclusters.
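The rewriting of CH used in the argument above follows from TSS = SSE + SSB and can be checked numerically. The partition below is a toy example of ours, not the Skewdistribution data.

```python
# Numeric check of the identity used above: since TSS = SSE + SSB,
# CH = (SSB/SSE) * ((n-NC)/(NC-1)) equals (TSS/SSE - 1) * ((n-NC)/(NC-1)).
# Toy 1-D partition with skewed cluster sizes, not the paper's data.
clusters = [[0.0, 0.2, 0.4], [4.0, 4.2], [9.0, 9.4, 9.8, 10.2]]
data = [x for c in clusters for x in c]
n, NC = len(data), len(clusters)
mean = sum(data) / n

SSE = sum((x - sum(c) / len(c)) ** 2 for c in clusters for x in c)
SSB = sum(len(c) * (sum(c) / len(c) - mean) ** 2 for c in clusters)
TSS = sum((x - mean) ** 2 for x in data)

ch1 = (SSB / SSE) * ((n - NC) / (NC - 1))
ch2 = (TSS / SSE - 1) * ((n - NC) / (NC - 1))
print(ch1, ch2)  # the two forms agree
```

Since TSS is fixed for a given data set, maximizing either form of CH is equivalent to minimizing SSE, which is exactly the K-means objective; this is the link the argument above relies on.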

Figure 6. Clustering results on the data set Skewdistribution by different algorithms, where NC = 3: (a) K-means, (b) Agglo, (c) DBSCAN, (d) Chameleon.
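The uniform effect [23] invoked above can be traced to the SSE objective K-means minimizes. The sketch below, on hypothetical evenly spread 1-D data of our own, shows that a balanced two-way split has far lower SSE than a size-skewed one, which is why K-means gravitates toward equal-sized clusters and struggles on skewed distributions.

```python
# Illustrates the "uniform effect" [23] behind K-means' poor showing here:
# the SSE objective K-means minimizes favors equal-sized clusters. On evenly
# spread 1-D points, a balanced 2-way split has lower SSE than a skewed one.
def sse(points):
    m = sum(points) / len(points)
    return sum((x - m) ** 2 for x in points)

data = [float(i) for i in range(100)]  # one evenly spread stretch of points

def split_sse(cut):
    return sse(data[:cut]) + sse(data[cut:])

balanced = split_sse(50)
skewed = split_sse(15)   # a 15/85 split, like one small + one large cluster
print(balanced, skewed)  # the balanced SSE is much smaller
```

A true partition with one large and one small cluster therefore costs more under SSE than an artificial equal-sized cut, so SSE-driven methods (and, per the argument above, the SSE-based index CH) are biased against skewed cluster sizes.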

Table VII
OVERALL PERFORMANCE OF DIFFERENT INDICES

Index    Mono.  Noise  Dens.  Subc.  Skew Dis.
RMSSTD   ×      –      –      –      –
RS       ×      –      –      –      –
Γ        ×      –      –      –      –
CH              ×                    ×
I                      ×
D               ×             ×
S                             ×
DB**                          ×
SD                            ×
S_Dbw
XB**                          ×

('–': property not tested; '×': situation cannot be handled; blank: handled correctly.)

IV. CONCLUDING REMARKS

In this paper, we investigated the validation properties of a suite of 11 existing internal clustering validation measures for crisp clustering in five different aspects: monotonicity, noise, density, subclusters, and skewed distributions. Computational experiments on five synthetic data sets, each representing one of these aspects, were used to evaluate the 11 validation measures. As demonstrated by the experiment results, most of the existing measures have certain limitations in different application scenarios; S_Dbw is the only measure that performs well in all five aspects. The summary of the validation properties of these 11 internal validation measures in Table VII may serve as a guide for index selection in practice.

V. ACKNOWLEDGEMENTS

This research was supported in part by the National Science Foundation (NSF) via grant number CNS-0831186 and the Program for New Century Excellent Talents in University of China (NCET) via grant number NCET-05-0097.

REFERENCES

[1] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE PAMI, vol. 24, pp. 1650–1654, 2002.
[2] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.
[3] J. Wu, H. Xiong, and J. Chen, "Adapting the right measures for k-means clustering," in KDD, 2009, pp. 877–886.
[4] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. USA: Addison-Wesley Longman, Inc., 2005.
[5] Y. Zhao and G. Karypis, "Evaluation of hierarchical clustering algorithms for document datasets," in Proceedings of CIKM, 2002, pp. 515–524.
[6] S. Sharma, Applied Multivariate Techniques. New York, NY, USA: John Wiley & Sons, Inc., 1996.
[7] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On clustering validation techniques," Journal of Intelligent Information Systems, vol. 17, no. 2-3, pp. 107–145, 2001.
[8] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193–218, 1985.
[9] T. Calinski and J. Harabasz, "A dendrite method for cluster analysis," Comm. in Statistics, vol. 3, no. 1, pp. 1–27, 1974.
[10] J. Dunn, "Well separated clusters and optimal fuzzy partitions," J. Cybern., vol. 4, no. 1, pp. 95–104, 1974.
[11] P. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," J. Comput. Appl. Math., vol. 20, no. 1, pp. 53–65, 1987.
[12] D. Davies and D. Bouldin, "A cluster separation measure," IEEE PAMI, vol. 1, no. 2, pp. 224–227, 1979.
[13] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE PAMI, vol. 13, no. 8, pp. 841–847, 1991.
[14] M. Kim and R. S. Ramakrishna, "New indices for cluster validity assessment," Pattern Recogn. Lett., vol. 26, no. 15, pp. 2353–2363, 2005.
[15] M. Halkidi, M. Vazirgiannis, and Y. Batistakis, "Quality scheme assessment in the clustering process," in PKDD, London, UK, 2000, pp. 265–276.
[16] M. Halkidi and M. Vazirgiannis, "Clustering validity assessment: Finding the optimal partitioning of a data set," in ICDM, Washington, DC, USA, 2001, pp. 187–194.
[17] S. Saha and S. Bandyopadhyay, "Application of a new symmetry-based cluster validity index for satellite image segmentation," IEEE GRSL, 2002.
[18] M. Halkidi and M. Vazirgiannis, "Clustering validity assessment using multi-representatives," in SETN, 2002.
[19] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," J. Royal Statistical Society, vol. 63, no. 2, pp. 411–423, 2001.
[20] B. S. Y. Lam and H. Yan, "A new cluster validity index for data with merged clusters and different densities," in IEEE ICSMC, 2005, pp. 798–803.
[21] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of BSMSP. University of California Press, 1967, pp. 281–297.
[22] G. Karypis, CLUTO — Software for Clustering High-Dimensional Datasets, version 2.1.2, 2006.
[23] H. Xiong, J. Wu, and J. Chen, "K-means clustering versus validation measures: a data distribution perspective," in KDD, New York, NY, USA, 2006, pp. 779–784.
[24] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, 1996, pp. 226–231.
[25] G. Karypis, E.-H. S. Han, and V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling," Computer, vol. 32, no. 8, pp. 68–75, 1999.

