An Efficient K-Means Clustering Algorithm

Khaled Alsabti, Syracuse University
Sanjay Ranka, University of Florida
Vineet Singh, Hitachi America, Ltd.

1997
There are two main approaches described in the literature which can be used to reduce the overall computational requirements of the k-means clustering method, especially for the distance calculations:

1. Use the information from the previous iteration to reduce the number of distance calculations. P-CLUSTER is a k-means-based clustering algorithm which exploits the fact that the changes in the assignment of patterns to clusters are relatively few after the first few iterations [7]. It uses a heuristic which determines whether the closest prototype of a pattern has changed by using a simple check. If the assignment has not changed, no further distance calculations are required. It also uses the fact that the movement of the cluster centroids is small for consecutive iterations (especially after the first few iterations).

2. Organize the prototype vectors in a suitable data structure so that finding the closest prototype for a given pattern can be done efficiently.

In our approach, the distance calculations are performed only with internal nodes (representing many patterns) and not with the patterns themselves in most cases. This approach can also be used to significantly reduce the time requirements for calculating the prototypes for the next iteration (the second for loop in Figure 1). We also expect the time requirement for the second for loop to be proportional to $k \cdot d$.

The improvements obtained using our approach are crucially dependent on obtaining good pruning methods for deriving the candidate sets for the next level. We propose to use the following strategy (a short sketch of the rule follows the list):

- For each candidate prototype, find the minimum and maximum distances to any point in the subspace.
- Find the minimum of the maximum distances; call it MinMax.
- Prune out all candidates whose minimum distance is greater than MinMax.

The above strategy guarantees that no candidate is pruned if it can potentially be closer than any other candidate prototype to a given subspace.
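As an illustration of this rule (not the authors' implementation; the function name and list-based interface are our own), the following Python sketch keeps exactly the candidates that survive the MinMax test:

def prune_candidates(min_dists, max_dists):
    """Min/max pruning rule: min_dists[j] and max_dists[j] are the minimum and
    maximum distances from candidate prototype j to any point in the box."""
    min_max = min(max_dists)  # smallest of the maximum distances (MinMax)
    # A candidate can be the closest prototype for some point in the box only if
    # its minimum distance does not exceed MinMax; all other candidates are pruned.
    return [j for j in range(len(min_dists)) if min_dists[j] <= min_max]

# Candidate 2 is pruned: even its closest approach (6.0) is farther than
# candidate 0 can ever be (at most 4.0) from any point in the box.
print(prune_candidates([1.0, 2.5, 6.0], [4.0, 5.5, 9.0]))  # -> [0, 1]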
Our algorithm is based on organizing the pattern vectors so that one can find all the patterns which are closest to a given prototype efficiently. In the first phase of the algorithm, we build a k-d tree to organize the pattern vectors. The root of such a tree represents all the patterns, while the children of the root represent subsets of the patterns completely contained in subspaces (boxes). The nodes at the lower levels represent smaller boxes. Results in [1] show that splitting along the longest dimension and choosing a midpoint-based approach for splitting is preferable. For each node of the tree, we keep the following information:

1. The number of points, $m$
2. The linear sum of the points ($LS$), i.e. $\sum_{i=1}^{m} x_i$
3. The square sum of the points ($SS$), i.e. $\sum_{i=1}^{m} \|x_i\|^2$

Let the number of dimensions be $d$ and the depth of the k-d tree be $D$. The extra time and space requirements for maintaining the above information at each node are proportional to $d$. Computing the medians at the $D$ levels takes time proportional to $n \cdot D$ [2]; these medians are needed to perform the splits.

In the second phase of the k-means algorithm, the initial prototypes are derived. Just as in the direct k-means algorithm, these initial prototypes are generated randomly or drawn from the dataset randomly.⁴

The pattern vectors are then assigned to prototypes by traversing the tree, pruning prototypes that cannot be the closest for any point of a node's box:

function TraverseTree(node, prototypes, k, d)
    Candidates = Pruning(node, prototypes, k, d)
    if |Candidates| = 1 then
        /* All the points in node belong to the alive cluster */
        Update the centroid's statistics based on the information stored in the node
        return
    if node is a leaf then
        for each point in node do
            Find the nearest prototype among Candidates
            Assign the point to it
            Update the centroid's statistics
        return
    for each child of node do
        TraverseTree(child, Candidates, |Candidates|, d)
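To make the traversal concrete, here is a small Python sketch under simplifying assumptions (the class and function names, the two-child example tree, and the use of NumPy are ours, not the paper's): each node stores the three statistics listed above, and when pruning leaves a single candidate the whole node is absorbed into that cluster's accumulators without visiting its individual points.

import numpy as np

class Node:
    """A k-d tree node over a box of patterns, carrying the summary statistics."""
    def __init__(self, points, children=()):
        self.points = points                          # (m, d) array of patterns in this box
        self.children = list(children)                # empty list for a leaf
        self.count = len(points)                      # 1. number of points m
        self.linear_sum = points.sum(axis=0)          # 2. linear sum, sum_i x_i
        self.square_sum = float((points ** 2).sum())  # 3. square sum, sum_i ||x_i||^2

def traverse(node, prototypes, candidates, sums, counts, pruning_fn):
    """Assign the patterns in `node`, accumulating per-cluster linear sums and
    counts so that the next prototypes can be computed with k*d work."""
    candidates = pruning_fn(node, prototypes, candidates)
    if len(candidates) == 1:
        # All points in this box belong to the single surviving ("alive") cluster:
        # update it from the node's stored statistics instead of its points.
        j = candidates[0]
        sums[j] += node.linear_sum
        counts[j] += node.count
        return
    if not node.children:
        # Leaf with several candidates left: fall back to direct assignment.
        for x in node.points:
            dists = [np.linalg.norm(x - prototypes[j]) for j in candidates]
            j = candidates[int(np.argmin(dists))]
            sums[j] += x
            counts[j] += 1
        return
    for child in node.children:
        traverse(child, prototypes, candidates, sums, counts, pruning_fn)

# Minimal usage with pruning disabled; the pruning rule and its distance
# computations are sketched separately.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
root = Node(pts, children=(Node(pts[:2]), Node(pts[2:])))
protos = np.array([[0.0, 0.0], [5.0, 5.0]])
sums, counts = np.zeros_like(protos), np.zeros(len(protos))
traverse(root, protos, list(range(len(protos))), sums, counts,
         pruning_fn=lambda node, prototypes, candidates: candidates)
new_prototypes = sums / counts[:, None]               # the "second for loop"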
²In [14], this kind of node information is called a Clustering Feature (CF). However, as we will see later, we use the CF in a different way.
³For the rest of the paper, patterns and points are used interchangeably.
⁴There are other approaches for this choice; however, these have not been investigated in this paper.
⁵Note that our approach is independent of the traversal strategy.
function Pruning(node, prototypes, k, d)
    Candidates = {}
    for each prototype c_j do
        Compute the minimum (mindist_j) and maximum (maxdist_j) distances
            from c_j to any point in the box representing node
    Find the minimum of the maxdist_j values; call it MinMax
    for each prototype c_j do
        if mindist_j ≤ MinMax then
            Candidates = Candidates ∪ {c_j}
    return(Candidates)

Figure 3. Pruning algorithm

Our pruning strategy is relatively simple and may miss some of the pruning opportunities. For example, the candidate shown as an x with a square around it could be pruned with a more complex pruning strategy. However, our approach is relatively inexpensive and can be shown to require time proportional to the number of candidate prototypes. Choosing a more expensive pruning algorithm may decrease the overall number of distance calculations. This may, however, be at the expense of higher overall computation time due to an offsetting increase in the cost of pruning.

The error function is $E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - c_j\|^2$.

The leaf size is an important parameter for tuning the overall performance of our algorithm. A small leaf size results in a larger cost for constructing the tree, and increases the overall cost of pruning as the pruning may have to be continued to lower levels. However, a small leaf size decreases the overall cost of the distance calculations for finding the closest prototype.

Calculating the Minimum and Maximum Distances

The pruning algorithm requires calculation of the minimum as well as the maximum distance from a given prototype to any given box. It can be easily shown that the maximum distance will be to one of the corners of the box. Let $farthest_j$ be that corner for prototype $c_j$. The coordinates of $farthest_j = (farthest_j^{\,1}, farthest_j^{\,2}, \ldots, farthest_j^{\,d})$ can be computed as follows:

$$farthest_j^{\,i} = \begin{cases} B_i^{\,l} & \text{if } |c_j^{\,i} - B_i^{\,l}| > |c_j^{\,i} - B_i^{\,u}| \\ B_i^{\,u} & \text{otherwise} \end{cases} \qquad (1)$$

where $B_i^{\,l}$ and $B_i^{\,u}$ are the lower and upper coordinates of the box along dimension $i$.

The maximum distance can be computed as follows:

$$maxdist_j = \sqrt{\sum_{i=1}^{d} \bigl(c_j^{\,i} - farthest_j^{\,i}\bigr)^2}$$
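The following sketch (our own function names; axis-aligned boxes given by lower and upper corner vectors; NumPy assumed) implements the corner rule of Equation (1) for the maximum distance, together with the usual clamping computation for the minimum distance, whose formula is not part of this excerpt:

import numpy as np

def max_dist_to_box(c, lower, upper):
    """Distance from prototype c to its farthest corner of the box [lower, upper]:
    per dimension, pick whichever of the two box coordinates is farther from c."""
    farthest = np.where(np.abs(c - lower) > np.abs(c - upper), lower, upper)
    return float(np.linalg.norm(c - farthest))

def min_dist_to_box(c, lower, upper):
    """Distance from prototype c to the closest point of the box; zero if c lies
    inside the box, otherwise the distance to c clamped onto the box."""
    closest = np.clip(c, lower, upper)
    return float(np.linalg.norm(c - closest))

# Example: a prototype to the right of the unit box [0,1] x [0,1].
c = np.array([2.0, 0.5])
lower, upper = np.zeros(2), np.ones(2)
print(min_dist_to_box(c, lower, upper))  # 1.0, reached on the face x1 = 1
print(max_dist_to_box(c, lower, upper))  # about 2.06, reached at the corner (0, 1)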
... computational time. Our improvements are substantially better. However, we note that the datasets used are different and a direct comparison may not be accurate.

The first few iterations provide some partial clustering information. This information can potentially be used to construct the tree such that the pruning is more effective. Another possibility is to add the optimizations related to incremental approaches presented in [7]. These optimizations seem to be orthogonal and can be used to further reduce the number of distance calculations.

⁶This includes the distance calculations for finding the nearest prototype and the equivalent of distance calculations for computing the new set of centroids.
Dataset   Size      Dimensionality   No. of Clusters   Characteristic   Range
DS1 100,000 2 100 Grid [-3,41]
DS2 100,000 2 100 Sine [2,632],[-29,29]
DS3 100,000 2 100 Random [-3,109],[-15,111]
R1 128k 2 16 Random [0,1]
R2 256k 2 16 Random [0,1]
R3 128k 2 128 Random [0,1]
R4 256k 2 128 Random [0,1]
R5 128k 4 16 Random [0,1]
R6 256k 4 16 Random [0,1]
R7 128k 4 128 Random [0,1]
R8 256k 4 128 Random [0,1]
R9 128k 6 16 Random [0,1]
R10 256k 6 16 Random [0,1]
R11 128k 6 128 Random [0,1]
R12 256k 6 128 Random [0,1]
Table 1. Description of the datasets. The range along each dimension is the same unless explicitly stated.
Dataset   Direct Alg. Total Time   Our Alg. Total Time   FRT     FRD      ADC
DS1       115.100                  6.830                 16.85   64.65    1.01
DS2       114.400                  7.430                 15.39   50.78    1.28
DS3       115.900                  6.520                 17.77   66.81    0.97
R1        194.400                  24.920                7.80    10.81    6.01
R2        705.745                  49.320                14.30   10.80    6.02
R3        160.400                  3.730                 43.00   133.27   0.49
R4        323.650                  5.270                 61.41   224.12   0.29
R5        302.300                  32.430                9.32    10.72    6.06
R6        606.00                   63.330                9.56    10.83    6.00
R7        297.050                  32.100                9.25    26.66    2.44
R8        448.750                  31.980                14.03   26.66    2.44
R9        408.700                  63.920                6.39    6.25     10.41
R10       822.450                  132.880               6.18    5.86     11.09
R11       291.400                  67.850                4.29    6.30     10.32
R12       585.300                  133.580               4.38    6.07     10.72

Table 3. The overall results for 50 iterations and 64 clusters.

References

[1] K. Alsabti, S. Ranka, and V. Singh. An Efficient K-Means Clustering Algorithm. http://www.cise.ufl.edu/~ranka/, 1997.
[2] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw-Hill Book Company, 1990.
[3] R. C. Dubes and A. K. Jain. Algorithms for Clustering Data. Prentice Hall, 1988.
[4] M. Ester, H. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, August 1996.
[5] M. Ester, H. Kriegel, and X. Xu. Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. Proc. of the Fourth Int'l Symposium on Large Spatial Databases, 1995.
[6] J. Garcia, J. Fdez-Valdivia, F. Cortijo, and R. Molina. Dynamic Approach for Clustering Data. Signal Processing, 44(2), 1994.
[7] D. Judd, P. McKinley, and A. Jain. Large-Scale Parallel Data Clustering. Proc. Int'l Conference on Pattern Recognition, August 1996.
[8] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[9] K. Mehrotra, C. Mohan, and S. Ranka. Elements of Artificial Neural Networks. MIT Press, 1996.
[10] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. of the 20th Int'l Conf. on Very Large Databases, Santiago, Chile, pages 144-155, 1994.
[11] V. Ramasubramanian and K. Paliwal. Fast K-Dimensional Tree Algorithms for Nearest Neighbor Search with Application to Vector Quantization Encoding. IEEE Transactions on Signal Processing, 40(3), March 1992.
[12] E. Schikuta. Grid Clustering: An Efficient Hierarchical Clustering Method for Very Large Data Sets. Proc. 13th Int'l Conference on Pattern Recognition, 2, 1996.
[13] J. White, V. Faber, and J. Saltzman. United States Patent No. 5,467,110, Nov. 1995.
[14] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proc. of the 1996 ACM SIGMOD Int'l Conf. on Management of Data, Montreal, Canada, pages 103-114, June 1996.