are presented with a comprehensive comparison of the different techniques and the effect of different parameters in the process.

The remainder of the paper is organized as follows. Section II presents an overview of data clustering and the underlying concepts. Section III presents each of the four clustering techniques in detail, along with the underlying mathematical foundations. Section IV introduces the implementation of the techniques and goes over the results of each technique, followed by a comparison of the results. A brief conclusion is presented in Section V. The MATLAB code listing of the four clustering techniques can be found in the appendix.

II. DATA CLUSTERING OVERVIEW

As mentioned earlier, data clustering is concerned with the partitioning of a data set into several groups such that the similarity within a group is larger than that among groups. This implies that the data set to be partitioned has to have an inherent grouping to some extent; otherwise, if the data is uniformly distributed, trying to find clusters will fail or will lead to artificially introduced partitions. Another problem that may arise is the overlapping of data groups. Overlapping groupings sometimes reduce the efficiency of the clustering method, and this reduction is proportional to the amount of overlap between groupings.

Usually the techniques presented in this paper are used in conjunction with other sophisticated neural or fuzzy models. In particular, most of these techniques can be used as preprocessors for determining the initial locations of radial basis functions or fuzzy if-then rules.

The common approach of all the clustering techniques presented here is to find cluster centers that represent each cluster. A cluster center indicates where the heart of each cluster is located, so that later, when the system is presented with an input vector, it can tell which cluster this vector belongs to by measuring a similarity metric between the input vector and all the cluster centers and determining which cluster is the nearest or most similar one.

Some of the clustering techniques rely on knowing the number of clusters a priori. In that case the algorithm tries to partition the data into the given number of clusters; K-means and Fuzzy C-means clustering are of that type. In other cases it is not necessary to know the number of clusters beforehand; instead, the algorithm starts by finding the first large cluster, then the second, and so on; Mountain and Subtractive clustering are of that type. In both cases a problem with a known number of clusters can be handled; however, if the number of clusters is not known, K-means and Fuzzy C-means clustering cannot be used.

Another aspect of clustering algorithms is whether they can be implemented in on-line or off-line mode. In on-line clustering, each input vector is used to update the cluster centers according to that vector's position; the system learns where the cluster centers are as new inputs are introduced. In off-line mode, the system is presented with a training data set, which is used to find the cluster centers by analyzing all the input vectors in the training set. Once the cluster centers are found, they are fixed and are used later to classify new input vectors. The techniques presented here are of the off-line type. A brief overview of the four techniques is presented here; a full detailed discussion follows in the next section.

The first technique is K-means clustering [6] (or Hard C-means clustering, as contrasted with Fuzzy C-means clustering). This technique has been applied to a variety of areas, including image and speech data compression [3, 4], data preprocessing for system modeling using radial basis function networks, and task decomposition in heterogeneous neural network architectures [5]. The algorithm relies on finding cluster centers by trying to minimize a cost function based on a dissimilarity (distance) measure.

The second technique is Fuzzy C-means clustering, which was proposed by Bezdek in 1973 [1] as an improvement over the earlier Hard C-means clustering. In this technique each data point belongs to a cluster to a degree specified by a membership grade. As in K-means clustering, Fuzzy C-means clustering relies on minimizing a cost function based on a dissimilarity measure.
The third technique is Mountain clustering, proposed by Yager and Filev [1]. This technique calculates a mountain (density) function at every possible position in the data space and chooses the position with the greatest density value as the center of the first cluster. It then destructs the effect of the first cluster's mountain function and finds the second cluster center. This process is repeated until the desired number of clusters has been found.

The fourth technique is Subtractive clustering, proposed by Chiu [1]. This technique is similar to Mountain clustering, except that instead of calculating the density function at every possible position in the data space, it uses the positions of the data points themselves to calculate the density function, thus reducing the number of calculations significantly.
III. DATA CLUSTERING TECHNIQUES

In this section a detailed discussion of each technique is presented. Implementation and results, along with implementation issues, are presented later in this paper.

A. K-means Clustering

K-means clustering partitions a set of n data vectors x_j, j = 1, ..., n, into c groups G_i, i = 1, ..., c, and finds a cluster center for each group such that a cost function of dissimilarity (distance) measure is minimized; the cost function referred to below as Equation (1) is the sum of squared Euclidean distances between each data point and its corresponding cluster center.

The partitioned groups are defined by a c \times n binary membership matrix U, where the element u_{ij} is 1 if the jth data point x_j belongs to group i, and 0 otherwise. Once the cluster centers c_i are fixed, the minimizing u_{ij} for Equation (1) can be derived as follows:

u_{ij} = \begin{cases} 1 & \text{if } \|x_j - c_i\|^2 \le \|x_j - c_k\|^2 \text{ for each } k \ne i, \\ 0 & \text{otherwise}, \end{cases} \qquad (2)

which means that x_j belongs to group i if c_i is the closest center among all centers. On the other hand, if the membership matrix is fixed, i.e. if u_{ij} is fixed, then the optimal center c_i that minimizes Equation (1) is the mean of all vectors in group i:

c_i = \frac{1}{|G_i|} \sum_{k,\, x_k \in G_i} x_k, \qquad (3)

where |G_i| is the size of G_i, or |G_i| = \sum_{j=1}^{n} u_{ij}.
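To make the alternation between Equations (2) and (3) concrete, the following is a minimal MATLAB sketch of a single K-means iteration; it assumes x is an m-by-d matrix of data vectors, c is an nc-by-d matrix of current cluster centers, and nc is the number of clusters (the variable names are illustrative; the listing actually used in this study is given in the appendix).

% One K-means iteration: assign each point to its nearest center (Eq. 2),
% then recompute every center as the mean of its group (Eq. 3).
% Assumes: x (m-by-d data), c (nc-by-d current centers), nc clusters.
[m, d] = size(x);
dist2 = zeros(m, nc);
for i = 1:nc
    dist2(:,i) = sum((x - repmat(c(i,:), m, 1)).^2, 2);  % squared distances to center i
end
[dmin, idx] = min(dist2, [], 2);                          % Eq. (2): index of nearest center per point
for i = 1:nc
    members = (idx == i);
    if any(members)
        c(i,:) = mean(x(members,:), 1);                   % Eq. (3): mean of all vectors in group i
    end
end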
B. Fuzzy C-means Clustering

Fuzzy C-means clustering (FCM) relies on the basic idea of Hard C-means clustering (HCM), with the difference that in FCM each data point belongs to a cluster to a degree given by a membership grade, while in HCM every data point either belongs to a certain cluster or it does not. FCM thus employs fuzzy partitioning, such that a given data point can belong to several groups with degrees of belongingness specified by membership grades between 0 and 1. However, FCM still uses a cost function that is to be minimized while trying to partition the data set.

The membership matrix U is allowed to have elements with values between 0 and 1. However, the summation of degrees of belongingness of a data point to all clusters is always equal to unity:

\sum_{i=1}^{c} u_{ij} = 1, \qquad j = 1, \ldots, n. \qquad (4)

The cost function for FCM is a generalization of Equation (1):

J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2}, \qquad (5)

where u_{ij} is between 0 and 1; c_i is the cluster center of fuzzy group i; d_{ij} = \|c_i - x_j\| is the Euclidean distance between the ith cluster center and the jth data point; and m \in [1, \infty) is a weighting exponent.

The necessary conditions for Equation (5) to reach its minimum are

c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}, \qquad (6)

and

u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{2/(m-1)}}. \qquad (7)

The algorithm works iteratively through the preceding two conditions until no more improvement is noticed. In a batch mode operation, FCM determines the cluster centers c_i and the membership matrix U using the following steps:

Step 1: Initialize the membership matrix U with random values between 0 and 1 such that the constraint in Equation (4) is satisfied.
Step 2: Calculate c fuzzy cluster centers c_i, i = 1, ..., c, using Equation (6).
Step 3: Compute the cost function according to Equation (5). Stop if either it is below a certain tolerance value or its improvement over the previous iteration is below a certain threshold.
Step 4: Compute a new U using Equation (7). Go to Step 2.

As in K-means clustering, the performance of FCM depends on the initial membership matrix values; it is therefore advisable to run the algorithm several times, each time starting with different values of the membership grades of the data points.
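As an illustration of Steps 1 through 4, the following is a minimal MATLAB sketch of the FCM iteration; it assumes x is an m-by-d data matrix, nc is the number of clusters, and m_exp is the weighting exponent m (the names mirror the appendix listing where possible; maxIter and tol are illustrative stopping parameters, not values from the study).

% Fuzzy C-means iteration (Steps 1-4), illustrative sketch.
% Assumes: x (m-by-d data matrix), m data points, nc clusters, m_exp weighting exponent.
maxIter = 100; tol = 1e-5;                            % illustrative stopping parameters
u = rand(nc, m);
u = u ./ repmat(sum(u,1), nc, 1);                     % Step 1: random memberships, columns sum to 1 (Eq. 4)
J = zeros(1, maxIter);
for iter = 1:maxIter
    um = u.^m_exp;
    c = (um * x) ./ repmat(sum(um,2), 1, size(x,2));  % Step 2: fuzzy cluster centers (Eq. 6)
    d = zeros(nc, m);
    for i = 1:nc
        d(i,:) = sqrt(sum((x - repmat(c(i,:), m, 1)).^2, 2))';  % Euclidean distances d_ij
    end
    J(iter) = sum(sum(um .* d.^2));                   % Step 3: cost function (Eq. 5)
    if iter > 1 && abs(J(iter) - J(iter-1)) < tol
        break;                                        % stop when improvement is below the threshold
    end
    tmp = d.^(-2/(m_exp-1));
    u = tmp ./ repmat(sum(tmp,1), nc, 1);             % Step 4: new membership matrix (Eq. 7)
end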
C. Mountain Clustering

The mountain clustering approach is a simple way to find cluster centers based on a density measure called the mountain function. This method is a simple way to find approximate cluster centers, and can be used as a preprocessor for other sophisticated clustering methods.

The first step in mountain clustering involves forming a grid on the data space, where the intersections of the grid lines constitute the potential cluster centers, denoted as a set V.

The second step entails constructing a mountain function representing a data density measure. The height of the mountain function at a point v \in V is equal to

m(v) = \sum_{i=1}^{N} \exp\!\left( -\frac{\|v - x_i\|^2}{2\sigma^2} \right), \qquad (8)

where x_i is the ith data point and \sigma is an application-specific constant.
This equation states that the data density measure at a point v is affected by all the points x_i in the data set, and that this density measure is inversely proportional to the distance between the data points x_i and the point v under consideration. The first cluster center c_1 is then selected as the point having the greatest value of the mountain function. To find the next cluster center, the mountain function is revised by subtracting a term centered at c_1; the subtracted amount eliminates the effect of the first cluster. Note that after subtraction, the new mountain function m_{new}(v) reduces to zero at v = c_1.

After subtraction, the second cluster center is selected as the point having the greatest value for the new mountain function. This process continues until a sufficient number of cluster centers is attained.
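The following MATLAB sketch illustrates the mountain construction and revision on a small 2-D grid. It assumes x is an m-by-2 matrix of normalized data points and uses the standard Yager-Filev style revision, in which a Gaussian scaled by the peak height m(c_1) is subtracted around the first center (the constant beta here plays the same role for the revision that sigma plays in Equation (8)); it is an illustration only, not the listing used for the experiments (see the appendix).

% Build a 2-D grid of candidate centers and evaluate Eq. (8) at each grid point.
% Assumes: x is an m-by-2 matrix of (normalized) data points.
sigma = 0.1; beta = 0.1;                      % same constants as in the appendix listing
g = linspace(0, 1, 10);                       % 10 grid lines per dimension
[G1, G2] = meshgrid(g, g);
V = [G1(:) G2(:)];                            % all grid intersections
M = zeros(size(V,1), 1);
for k = 1:size(V,1)
    dist2 = sum((x - repmat(V(k,:), size(x,1), 1)).^2, 2);
    M(k) = sum(exp(-dist2 / (2*sigma^2)));    % mountain function, Eq. (8)
end
[m1, i1] = max(M);
c1 = V(i1,:);                                 % first cluster center
% Revise the mountain function: subtract a Gaussian scaled by the peak height,
% so the revised function is zero at c1 (assumed standard revision step).
dist2c1 = sum((V - repmat(c1, size(V,1), 1)).^2, 2);
Mnew = M - m1 * exp(-dist2c1 / (2*beta^2));
[m2, i2] = max(Mnew);
c2 = V(i2,:);                                 % second cluster center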
D. Subtractive Clustering

The problem with the previous clustering method, Mountain clustering, is that its computation grows exponentially with the dimension of the problem, because the mountain function has to be evaluated at each grid point. Subtractive clustering solves this problem by using the data points as the candidates for cluster centers, instead of the grid points as in Mountain clustering. This means that the computation is now proportional to the problem size instead of the problem dimension. However, the actual cluster centers are not necessarily located at one of the data points; in most cases this is a good approximation, especially given the reduced computation this approach introduces.

Since each data point is a candidate for a cluster center, a density measure at data point x_i is defined as

D_i = \sum_{j=1}^{n} \exp\!\left( -\frac{\|x_i - x_j\|^2}{(r_a/2)^2} \right), \qquad (10)

where r_a is a positive constant defining a neighborhood radius, so that a data point has a high density value if it has many neighboring data points. After the first cluster center x_{c_1} is chosen as the point with the greatest density value, the density measure of each data point is revised by subtracting a term centered at x_{c_1}, where r_b is a positive constant which defines a neighborhood that has measurable reductions in density measure. Therefore, the data points near the first cluster center x_{c_1} will have a significantly reduced density measure.

After revising the density function, the next cluster center is selected as the point having the greatest density value. This process continues until a sufficient number of clusters is attained.
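A minimal MATLAB sketch of the subtractive density computation and its revision follows. It assumes x is an m-by-d matrix of normalized data points and uses a Chiu-style revision (a Gaussian with radius r_b subtracted around the first center, scaled by its density); r_a mirrors the appendix listing, while the choice of r_b here is an assumption made only for illustration.

% Density measure at every data point (Eq. 10), then revision around the
% first center so that nearby points receive a much lower density.
% Assumes: x is an m-by-d matrix of (normalized) data points.
ra = 1.0;                                     % neighborhood radius, as in the appendix listing
rb = 1.5 * ra;                                % revision radius (assumed value for illustration)
m = size(x, 1);
D = zeros(m, 1);
for i = 1:m
    dist2 = sum((x - repmat(x(i,:), m, 1)).^2, 2);
    D(i) = sum(exp(-dist2 / (ra/2)^2));       % Eq. (10)
end
[d1, i1] = max(D);
c1 = x(i1,:);                                 % first cluster center
dist2c1 = sum((x - repmat(c1, m, 1)).^2, 2);
Dnew = D - d1 * exp(-dist2c1 / (rb/2)^2);     % revised density: points near c1 are suppressed
[d2, i2] = max(Dnew);
c2 = x(i2,:);                                 % second cluster center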
IV. IMPLEMENTATION AND RESULTS

Having introduced the different clustering techniques and their basic mathematical foundations, we now turn to a discussion of these techniques on the basis of a practical study. This study involves the implementation of each of the four techniques introduced previously, and testing each of them on a set of medical data related to a heart disease diagnosis problem.

The medical data used consists of 13 input attributes related to the clinical diagnosis of heart disease, and one output attribute which indicates whether the patient is diagnosed with the disease or not. The whole data set consists of 300 cases.
The data set is partitioned into two subsets: two-thirds of the data for training, and one-third for evaluation. The number of clusters into which the data set is to be partitioned is two, i.e. patients diagnosed with the heart disease and patients not diagnosed with it. Because of the high number of dimensions in the problem (13 dimensions), no visual representation of the clusters can be presented; only 2-D or 3-D clustering problems can be visually inspected. We will therefore rely heavily on performance measures to evaluate the clustering techniques rather than on visual approaches.

As mentioned earlier, the similarity metric used to calculate the similarity between an input vector and a cluster center is the Euclidean distance. Since most similarity metrics are sensitive to large ranges of elements in the input vectors, each of the input variables must be normalized to within the unit interval [0, 1]; i.e. the data set has to be normalized to be within the unit hypercube.

Each clustering algorithm is presented with the training data set, and as a result two clusters are identified.
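Min-max scaling to the unit hypercube can be done as in the following sketch. The appendix listings call a helper named Normalize for this purpose; the version below is one plausible definition of such a helper (an assumption, not the original code), with range holding the per-column minimum and maximum of the training data.

% Scale every column of the data set to the unit interval [0,1] using the
% per-attribute minimum and maximum (one plausible form of the Normalize
% helper used in the appendix; 'range' holds [min; max] per column).
function [y, range] = Normalize(data, range)
if nargin < 2
    range = [min(data, [], 1); max(data, [], 1)];   % learn the range from the training data
end
span = range(2,:) - range(1,:);
span(span == 0) = 1;                                 % guard against constant columns
y = (data - repmat(range(1,:), size(data,1), 1)) ./ repmat(span, size(data,1), 1);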
A. K-means Clustering

The algorithm iteratively updates the membership matrix and the cluster centers using Equations (2) and (3), respectively, until no further improvement in the cost function is noticed. Since the algorithm initializes the cluster centers randomly, its performance is affected by those initial cluster centers, so several runs of the algorithm are advised to obtain better results.

Evaluating the algorithm is realized by testing the accuracy on the evaluation set. After the cluster centers are determined, the evaluation data vectors are assigned to their respective clusters according to the distance between each vector and each of the cluster centers. An error measure is then calculated; the root mean square error (RMSE) is used for this purpose. Also, an accuracy measure is calculated as the percentage of correctly classified vectors. The algorithm was tested 10 times to determine the best performance. Table 1 lists the results of those runs. Figure 1 shows a plot of the cost function over time for the best test case.

[Figure 1. K-means clustering cost function history (cost versus iteration for the best test case).]
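The evaluation procedure just described amounts to a nearest-center assignment followed by two scores. A minimal MATLAB sketch is shown below, assuming xe is the normalized evaluation matrix, c holds the two trained centers, and ev the 0/1 target labels (variable names are illustrative; the full evaluation code is in the appendix).

% Assign each evaluation vector to its nearest cluster center, then score the
% labeling with classification accuracy and RMSE.
% Assumes: xe (me-by-d evaluation data), c (2-by-d centers), ev (1-by-me targets in {0,1}).
me = size(xe, 1);
labels = zeros(1, me);
for j = 1:me
    d2 = sum((c - repmat(xe(j,:), 2, 1)).^2, 2);
    [dmin, labels(j)] = min(d2);              % index of the nearest center (1 or 2)
end
labels = labels - 1;                          % map {1,2} -> {0,1}; cluster order is arbitrary,
if mean(labels == ev) < 0.5                   % so flip the labeling if it disagrees with the targets
    labels = 1 - labels;
end
accuracy = 100 * mean(labels == ev);          % percentage of correctly classified vectors
rmse = norm(labels - ev) / sqrt(length(ev));  % root mean square error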
To further measure how accurately the identified clusters represent the actual classification of the data, a regression analysis is performed of the resultant clustering against the original classification. Performance is considered better if the regression line slope is close to 1. Figure 2 shows the regression analysis of the best test case.

[Figure 2. Regression analysis of K-means clustering: data points, the best linear fit A = 0.604 T + 0.214, and the ideal line A = T.]

As seen from the results, the best case achieved 80% accuracy and an RMSE of 0.447. This relatively moderate performance is related to the high dimensionality of the problem; having too many dimensions tends to disrupt the coupling of the data and introduces overlap in some of these dimensions, which reduces the accuracy of the clustering. It is also noticed that the cost function converges rapidly to a minimum value, as seen from the number of iterations in each test run. However, this has no effect on the accuracy measure.
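The regression analysis reported above can be reproduced with the postreg function used in the appendix listing (part of the MATLAB Neural Network Toolbox), which fits a linear regression of the clustering output against the targets and returns the slope, intercept, and correlation coefficient. A usage sketch, assuming labels holds the cluster assignments and ev the true classes:

% Linear regression of the clustering output against the original classification.
% A slope (and correlation) near 1 indicates that the identified clusters track
% the actual classes well.
[slope, intercept, corr] = postreg(labels, ev);
fprintf('slope = %.3f, r = %.3f\n', slope, corr);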
B. Fuzzy C-means Clustering

As with K-means clustering, several runs of the algorithm should be conducted to have a higher probability of getting good performance. However, the results showed no (or insignificant) variation in performance or accuracy when the algorithm was run several times.

For testing the results, every vector in the evaluation data set is assigned to one of the clusters with a certain degree of belongingness (as was done for the training set). However, because the output values we have are crisp (either 1 or 0), the evaluation-set degrees of membership are defuzzified to be tested against the actual outputs.

The same performance measures applied in K-means clustering are used here; however, only the effect of the weighting exponent m is analyzed, since the random initial membership grades have an insignificant effect on the final cluster centers. Table 2 lists the results of the tests with the effect of varying the weighting exponent m. It is noticed that very low or very high values of m reduce the accuracy; moreover, high values tend to increase the time taken by the algorithm to find the clusters. A value of 2 seems adequate for this problem since it gives good accuracy and requires fewer iterations. Figure 3 shows the accuracy and the number of iterations against the weighting exponent.
[Figure 3. Fuzzy C-means clustering performance: accuracy (%) and number of iterations versus the weighting exponent m.]

In general, the FCM technique showed no improvement over K-means clustering for this problem. Both showed similar accuracy; moreover, FCM was found to be slower than K-means because of the fuzzy calculations involved.

C. Mountain Clustering

Mountain clustering relies on dividing the data space into grid points and calculating a mountain function at every grid point. This mountain function is a representation of the density of the data at that point.

The performance of mountain clustering is severely affected by the dimension of the problem: the computation needed rises exponentially with the dimension of the input data because the mountain function has to be evaluated at each grid point in the data space. For a problem with c clusters, n dimensions, m data points, and a grid size of g per dimension, the required number of calculations is

N = m\, g^{n} + (c - 1)\, g^{n}, \qquad (12)

where the first term accounts for the first cluster and the second term for the remaining clusters. For our problem, with 13 dimensions, 200 training inputs, and a grid size of 10 per dimension, the required number of mountain function calculations is approximately 2.01 \times 10^{15}. In addition, the value of the mountain function needs to be stored at every grid point for later use in finding subsequent clusters, which requires g^{n} storage locations; for our problem this would be 10^{13} storage locations. Obviously this is impractical for a problem of this dimension.
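As a quick numerical check of Equation (12) for the values quoted above (m = 200 training inputs, g = 10, n = 13 dimensions, c = 2 clusters):

m = 200; g = 10; n = 13; c = 2;
N = m * g^n + (c - 1) * g^n    % = 2.01e15 mountain-function calculations
storage = g^n                  % = 1e13 storage locations for the grid values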
In order to be able to test this algorithm, the dimension of the problem has to be reduced to a reasonable number, e.g. 4 dimensions. This is achieved by randomly selecting 4 variables out of the original 13 input variables and performing the test on those variables. Several tests involving differently selected random variables were conducted in order to get a better understanding of the results. Table 3 lists the results of 10 test runs on randomly selected variables. The accuracy achieved ranged between 52% and 78% with an average of 70%, and an average RMSE of 0.546. Those results are quite discouraging compared to the results achieved with K-means and FCM clustering. This is due to the fact that not all of the variables of the input data contribute to the clustering process; only 4 are chosen at random to make it possible to conduct the tests. However, even with only 4 attributes, mountain clustering required far more time than any other technique during the tests, because the mountain function still has to be evaluated at every grid point.
D. Subtractive Clustering
VI. REFERENCES
Appendix
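The listings below use two helper routines, euc_dist and Normalize, whose definitions are not included in this extract. One plausible definition of euc_dist, assuming it returns the squared Euclidean distance between two row vectors (which matches how the cost functions of Equations (1) and (5) are accumulated in the listings), is given here; a possible form of Normalize is sketched in Section IV above.

function d = euc_dist(a, b)
% squared Euclidean distance between two row vectors (assumed definition)
d = sum((a - b).^2);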
% K-means (Hard C-means) Clustering
% (assumes x holds the normalized training vectors and m their count,
%  as set up by the data-loading code not shown in this listing)
nc = 2; % number of clusters = 2
p = randperm(m);
c = x(p(1:nc),:); % initialize the cluster centers to nc randomly chosen data points
% Clustering Loop
delta = 1e-5; % tolerance on the cost function improvement
n = 1000; % maximum number of iterations
iter = 1;
while (iter < n)
for i = 1:nc
for j = 1:m
d = euc_dist(x(j,:),c(i,:));
u(i,j) = 1;
for k = 1:nc
if k~=i
if euc_dist(x(j,:),c(k,:)) < d
u(i,j) = 0;
end
end
end
end
end
J(iter) = 0;
for i = 1:nc
JJ(i) = 0;
for k = 1:m
if u(i,k)==1
JJ(i) = JJ(i) + euc_dist(x(k,:),c(i,:));
end
end
J(iter) = J(iter) + JJ(i);
end
if iter > 1 && abs(J(iter) - J(iter-1)) < delta
break; % stop when the cost function improvement falls below delta
end
for i = 1:nc
sum_x = 0;
G(i) = sum(u(i,:));
for k = 1:m
if u(i,k)==1
sum_x = sum_x + x(k,:);
end
end
c(i,:) = sum_x ./ G(i);
end
iter = iter + 1;
end % while
disp('Clustering Done.');
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);
for i = 1:nc
for j = 1:m
d = euc_dist(x(j,:),c(i,:));
evu(i,j) = 1;
for k = 1:nc
if k~=i
if euc_dist(x(j,:),c(k,:)) < d
evu(i,j) = 0;
end
end
end
end
end
% Analyze results
ev = EvalSet(:,end)';
rmse(1) = norm(evu(1,:)-ev)/sqrt(length(evu(1,:)));
rmse(2) = norm(evu(2,:)-ev)/sqrt(length(evu(2,:)));
subplot(2,1,1);
if rmse(1) < rmse(2)
r = 1;
else
r = 2;
end
% count correctly classified evaluation vectors
ctr = 0;
for j = 1:m
if evu(r,j) == ev(j)
ctr = ctr + 1;
end
end
str = sprintf('Testing Set accuracy: %.2f%%', ctr*100/m);
disp(str);
[m,b,r] = postreg(evu(r,:),ev); % Regression Analysis
disp(sprintf('r = %.3f', r));
% Fuzzy C-means Clustering
% (assumes x holds the normalized training vectors and m their count)
nc = 2; % number of clusters = 2
u = rand(nc,m);
u = u ./ repmat(sum(u,1),nc,1); % random initial membership grades, columns sum to 1 (Eq. 4)
% Clustering Loop
m_exp = 12; % weighting exponent m used in this run (the tests varied it; 2 was found adequate)
prevJ = 0;
J = 0;
delta = 1e-5; % tolerance on the cost function improvement
n = 1000; % maximum number of iterations
iter = 1;
while (iter < n)
for i = 1:nc
sum_ux = 0;
sum_u = 0;
for j = 1:m
sum_ux = sum_ux + (u(i,j)^m_exp)*x(j,:);
sum_u = sum_u + (u(i,j)^m_exp);
end
c(i,:) = sum_ux ./ sum_u;
end
J(iter) = 0;
for i = 1:nc
JJ(i) = 0;
for j = 1:m
JJ(i) = JJ(i) + (u(i,j)^m_exp)*euc_dist(x(j,:),c(i,:));
end
J(iter) = J(iter) + JJ(i);
end
if abs(J(iter) - prevJ) < delta
break; % stop when the cost function improvement falls below delta (Step 3)
end
prevJ = J(iter);
for i = 1:nc
for j = 1:m
sum_d = 0;
for k = 1:nc
sum_d = sum_d + (euc_dist(c(i,:),x(j,:))/euc_dist(c(k,:),x(j,:)))^(2/(m_exp-1));
end
u(i,j) = 1/sum_d;
end
end
iter = iter + 1;
end % while
disp('Clustering Done.');
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);
for i = 1:nc
for j = 1:m
sum_d = 0;
for k = 1:nc
sum_d = sum_d + (euc_dist(c(i,:),x(j,:))/euc_dist(c(k,:),x(j,:)))^(2/(m_exp-1));
end
evu(i,j) = 1/sum_d;
end
end
% Analyze results
ev = EvalSet(:,end)';
rmse(1) = norm(evu(1,:)-ev)/sqrt(length(evu(1,:)));
rmse(2) = norm(evu(2,:)-ev)/sqrt(length(evu(2,:)));
subplot(2,1,1);
if rmse(1) < rmse(2)
r = 1;
else
r = 2;
end
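% ------------------------------------------------------------------
% Defuzzification and accuracy (illustrative sketch, not part of the
% original listing). The paper states that the evaluation memberships
% are defuzzified and compared with the crisp 0/1 targets; one way to
% do that for the better-fitting cluster r selected above:
% ------------------------------------------------------------------
labels = double(evu(r,:) >= 0.5); % defuzzify: grade >= 0.5 means the vector belongs to cluster r
ctr = sum(labels == ev); % number of correctly classified vectors
str = sprintf('Testing Set accuracy: %.2f%%', ctr*100/m);
disp(str);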
% Mountain Clustering
%------------------------------------------------------------------
% Setup the training data
%------------------------------------------------------------------
%------------------------------------------------------------------
% First: setup a grid matrix of n-dimensions (V)
% (n = the dimension of input data vectors)
% The gridding granularity is 'gr' = # of grid points per dimension
%------------------------------------------------------------------
gr = 10;
% setup the dimension vector [d1 d2 d3 .... dn]
v_dim = gr * ones([1 n]);
% setup the mountain matrix
M = zeros(v_dim);
sigma = 0.1;
%------------------------------------------------------------------
% Second: calculate the mountain function at every grid point
%------------------------------------------------------------------
% report progress
if mod(i,5000)==0
str = sprintf('vector %.0d/%.0d; M(v)=%.2f', i, cur(1,end), M(i));
disp(str);
end
end
%------------------------------------------------------------------
% Third: select the first cluster center by choosing the point
% with the greatest density value
%------------------------------------------------------------------
c(1,:) = max_v;
c1 = max_i;
str = sprintf('Cluster 1:');
disp(str);
str = sprintf('%4.1f', c(1,:));
disp(str);
str = sprintf('M=%.3f', max_m);
disp(str);
%------------------------------------------------------------------
% CLUSTER 2
%------------------------------------------------------------------
Mnew = zeros(v_dim);
max_m = 0;
max_v = 0;
beta = 0.1;
% report progress
if mod(i,5000)==0
str = sprintf('vector %.0d/%.0d; Mnew(v)=%.2f', i, cur(1,end), Mnew(i));
disp(str);
end
end
c(2,:) = max_v;
str = sprintf('Cluster 2:');
disp(str);
str = sprintf('%4.1f', c(2,:));
disp(str);
str = sprintf('M=%.3f', max_m);
disp(str);
%------------------------------------------------------------------
% Evaluation
%------------------------------------------------------------------
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);
% drop the attributes corresponding to the ones dropped in the training set
for i = 1:n_dropped
x(:,dropped(i)) = [];
end
[m,n] = size(x);
% assign each evaluation vector to the nearest of the two cluster centers
% (reconstructed from the identical evaluation loops in the other listings)
for i = 1:2
for j = 1:m
dd = euc_dist(x(j,:),c(i,:));
evu(i,j) = 1;
for k = 1:2
if k~=i
if euc_dist(x(j,:),c(k,:)) < dd
evu(i,j) = 0;
end
end
end
end
end
% Analyze results
ev = EvalSet(:,end)';
rmse(1) = norm(evu(1,:)-ev)/sqrt(length(evu(1,:)));
rmse(2) = norm(evu(2,:)-ev)/sqrt(length(evu(2,:)));
% Subtractive Clustering
%------------------------------------------------------------------
% Setup the training data
%------------------------------------------------------------------
%------------------------------------------------------------------
% First: Initialize the density matrix and some variables
%------------------------------------------------------------------
D = zeros([m 1]);
ra = 1.0;
%------------------------------------------------------------------
% Second: calculate the density function at every data point
%------------------------------------------------------------------
% report progress
if mod(i,50)==0
str = sprintf('vector %.0d/%.0d; D(v)=%.2f', i, m, D(i));
% disp(str);
end
end
%------------------------------------------------------------------
% Third: select the first cluster center by choosing the point
% with the greatest density value
%------------------------------------------------------------------
c(1,:) = max_x;
c1 = max_i;
str = sprintf('Cluster 1:');
disp(str);
str = sprintf('%4.1f', c(1,:));
disp(str);
str = sprintf('D=%.3f', max_d);
disp(str);
%------------------------------------------------------------------
% CLUSTER 2
%------------------------------------------------------------------
% report progress
if mod(i,50)==0
str = sprintf('vector %.0d/%.0d; Dnew(v)=%.2f', i, m, Dnew(i));
disp(str);
end
end
c(2,:) = max_x;
str = sprintf('Cluster 2:');
disp(str);
str = sprintf('%4.1f', c(2,:));
disp(str);
str = sprintf('D=%.3f', max_d);
disp(str);
%------------------------------------------------------------------
% Evaluation
%------------------------------------------------------------------
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);
for i = 1:2
for j = 1:m
dd = euc_dist(x(j,:),c(i,:));
evu(i,j) = 1;
for k = 1:2
if k~=i
if euc_dist(x(j,:),c(k,:)) < dd
evu(i,j) = 0;
end
end
end
end
end
% Analyze results
ev = EvalSet(:,end)';
rmse(1) = norm(evu(1,:)-ev)/sqrt(length(evu(1,:)));
rmse(2) = norm(evu(2,:)-ev)/sqrt(length(evu(2,:)));