
Journal of Computational Mathematics and Data Science 3 (2022) 100034


An improved K-medoids clustering approach based on the crow search algorithm

Nitesh Sureja a, Bharat Chawda b, Avani Vasant a
a Department of Computer Science and Engineering, Krishna School of Emerging Technology & Applied Research, KPGU, Vadodara, India
b Department of Computer Science and Engineering, Bhailabhai & Bhikhabhai Institute of Technology, Vadodara, India

ARTICLE INFO

Keywords:
Clustering
Crow search algorithm
K-means clustering
K-medoids clustering
Nature-inspired algorithm

ABSTRACT

The K-medoids clustering algorithm is a simple yet effective algorithm that has been applied to many clustering problems. Instead of using the mean point as the centre of a cluster, K-medoids uses an actual point to represent it. The medoid is the most centrally located object of a cluster, the one with the minimum sum of distances to the other points. Because it is robust to outliers, K-medoids can represent the cluster centre correctly. However, the K-medoids algorithm is unsuitable for clustering arbitrarily shaped groups of objects and large-scale datasets, because it uses compactness rather than connectivity as its clustering criterion. An improved K-medoids algorithm based on the crow search algorithm is proposed to overcome these problems. This research uses the crow search algorithm to improve the balance between the exploration and exploitation processes of the K-medoids algorithm. A comparison of experimental results shows that the proposed algorithm performs better than its competitors.

1. Introduction

Clustering identifies groups of similar data in a dataset and is a popular method in data science. It divides a dataset into clusters, each containing objects that are very similar and relevant to one another. Clustering has applications across many domains, such as social network analysis, search result grouping, market segmentation, anomaly detection, recommendation engines, image segmentation, and medical imaging [1]. Clustering methods fall into two categories: hard clustering, in which an object can belong to only one cluster, and soft clustering, in which an object can belong to more than one cluster.
In K-medoids clustering, representative objects called medoids are used instead of centroids. A medoid is the most centrally located object in a cluster; consequently, it is significantly less sensitive to outliers and noise.

2. Related work

This section reviews the literature relevant to this study, including the K-medoids algorithm and its improvements using different techniques and parameter settings.
Among the many algorithms for K-medoids clustering, partitioning around medoids (PAM), proposed by Kaufman et al. (1990), is the most powerful. However, PAM has a drawback: it is inefficient for large datasets because of its time complexity (Han et al. 2001) [1].
Our goal is to develop an improved K-medoids clustering algorithm that is both simple and efficient. Several approaches to the same task are available in the literature. Kaufman et al. (1990) proposed CLARA, which applies partitioning around medoids (PAM) to samples of the data instead of all objects in a dataset [2].

Lucasius et al. (1993) reported that the performance of CLARA degrades as the number of clusters increases [3]. They proposed a genetic K-medoids clustering algorithm whose performance is reported to be better than CLARA's on large datasets. An approach for updating new medoids from neighbouring objects in PAM was proposed by Ng and Han (1994) [4]. An approach that maximizes the silhouette instead of minimizing the sum of distances to the closest medoid was proposed by Van der Laan et al. in 2003 [5]. Zhang et al. (2005) applied the triangular irregular network concept when calculating the total replacement cost in the swap step of PAM to reduce the computational time [6]. Li Peng et al. (2014) proposed a PAM improved with an ant colony algorithm to cluster sensor nodes within a region and solve uneven clustering [7]; this algorithm balances network energy consumption and lengthens the life cycle. Yang Teng-Fei et al. (2010) presented a hybridization of K-medoids and particle swarm optimization (QKSCO) that exploits the global search of PSO and the local search of K-medoids [8]. QKSCO produced better results than the approaches available at that time. Vivi Nur Wijayaningrum et al. (2020) presented a crow search algorithm that balances the exploration and exploitation processes in K-medoids clustering [9]; they aimed to reduce the computational burden caused by high dimensionality. Ying-ting Zhu et al. (2014) proposed an improved algorithm based on MapReduce and an optimal search of medoids to cluster big data [10]. In this approach, the properties of triangular geometry are used to reduce the number of distance calculations among data elements, which helps to find medoids quickly and lightens the computational burden of the K-medoids algorithm. Song H. et al. (2017) proposed a parallel K-medoids algorithm (PAMAE) [11]. They identified two factors, use of the entire data and global search, as responsible for good accuracy, and applied them individually in two parallel phases. PAMAE achieved accuracy comparable to the previously proposed approaches. Anton V. Ushakov et al. (2020) developed a parallel, distributed dual heuristic algorithm to remove the computational burden of the K-medoids approach [12].
Most of the algorithms above are based on partitioning around medoids (PAM), so the computational burden remains.
A. Askarzadeh proposed a nature-inspired algorithm known as the crow search algorithm (CSA) in 2016 [13]. The CSA is a population-based algorithm that imitates the social behaviour of crows.
A new hybrid clustering algorithm, the K-medoids crow search algorithm (KMCSA), is proposed in this paper to overcome the problems with K-medoids. The K-medoids algorithm is hybridized with the crow search algorithm to exploit the computational efficiency of the CSA.
The highlights of the proposed algorithm include:

• A hybrid approach is proposed for reducing the computational burden of the K-medoids clustering algorithm.
• This work introduces a crow search algorithm for balancing the exploration and exploitation process of the K-medoids
algorithm.
• The proposed approach has been tested using several real datasets available in the UCI repository and some synthetic datasets.
• The proposed approach is evaluated using purity, entropy, completeness score, F-measure, silhouette, and total within-cluster variance.

This paper is organized as follows. Related work is discussed in Section 2. The K-medoids algorithm is presented in Section 3. The crow search algorithm is described in Section 4. Section 5 presents the proposed KMCSA algorithm. Experimental results are presented and discussed in Section 6. Finally, conclusions are presented in Section 7.

3. K-medoids algorithm

The K-medoids clustering algorithm uses an actual point in the cluster as its centre, or medoid. The medoid of a cluster is its most centrally located object, the one with the smallest sum of distances to the other objects (points) [5,6]. A medoid can represent a correct cluster centre because of its robustness to outliers and noise. Partitioning around medoids (PAM) is a popular K-medoids clustering algorithm. It initially selects k objects (points) as medoids and then repeatedly moves towards better cluster representatives: all medoid/non-medoid pairs are analysed, and the clustering quality of each swap is derived. If an object (point) with a better distortion function value is found, it replaces the current best object (point), and the newly selected best objects (points) form the new, improved medoids. The algorithm minimizes the dissimilarities between objects and their reference points, that is, the sum of the dissimilarities between each object and its closest representative object in the cluster. In other words, the K-medoids algorithm minimizes an objective function known as the absolute error (E) function, given in Eq. (1) [6].

$$E = \sum_{j=1}^{k} \sum_{p \in C_j} \left| p - o_j \right| \qquad (1)$$

where $E$ is the sum of absolute error, $p$ is a data point (object) in cluster $C_j$, and $o_j$ is the representative object of $C_j$. The algorithm iterates until each representative object becomes the medoid, i.e., the most centrally located object of its cluster. On this basis, the K-medoids algorithm groups n objects into k clusters.
The K-medoids algorithm is described as follows:
Input:
Number of clusters (k)
A dataset D containing n objects
Output: k clusters, with the dissimilarity of each object to its nearest medoid minimized.
Algorithm:
1. Initialization: randomly select k of the n data points in D as the initial medoids.
2. Assignment: assign each remaining data point to its closest medoid (m).
3. Updating:

(a) Select a non-medoid object (o) randomly.
(b) Swap the medoid (m) with the selected object (o).
(c) Compute the total cost (c) of the new configuration.
(d) Keep the configuration with the lowest cost for the next step.

4. Termination: terminate if the termination criteria are satisfied; otherwise, go to step 2 and repeat.
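To make the procedure concrete, the following is a minimal sketch of the swap-based K-medoids loop described above. It assumes a precomputed pairwise distance matrix D; the function names and the exhaustive swap strategy are illustrative choices, not the authors' implementation.

```python
import numpy as np

def k_medoids(D, k, max_iter=100, seed=None):
    """Swap-based K-medoids on a precomputed n x n distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)      # step 1: random initial medoids

    def total_cost(meds):
        # Eq. (1): each point contributes its distance to the closest medoid.
        return D[:, meds].min(axis=1).sum()

    best_cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):                             # step 3: try medoid/non-medoid swaps
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = o                       # swap medoid m with object o
                cost = total_cost(candidate)
                if cost < best_cost:                    # keep the lowest-cost configuration
                    best_cost, medoids, improved = cost, candidate, True
        if not improved:                                # step 4: no swap improves, stop
            break
    labels = D[:, medoids].argmin(axis=1)               # step 2: final assignment
    return medoids, labels, best_cost
```

The exhaustive swap loop is what makes PAM costly, on the order of k(n-k)^2 cost evaluations per iteration, which is exactly the computational burden this paper sets out to reduce.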

4. Crow search algorithm

The crow search algorithm (CSA) imitates the social behaviour and interaction mechanism of crows [13] and is described below.
Each crow in the flock represents a solution and searches for food; the area explored by the crows is the environment. The flock (population) consists of $P$ crows (solutions). The position of crow $x$ at iteration $i$ is represented by the vector $v_x^{(i)} = \left[ v_{x1}^{(i)}, v_{x2}^{(i)}, \ldots, v_{xd}^{(i)} \right]$, where $d$ is the dimension of the problem. Each crow in the population has a memory, $mem$, in which it remembers the hiding place of its food. CSA uses a parameter called the awareness probability (AP) to balance intensification and diversification, which is very important for any optimization algorithm.
An objective (fitness) function is used to evaluate each crow (solution) in the flock (population). The position of a crow $x$ is updated based on the position of another crow $y$, selected randomly from the population, as crow $x$ tries to find the food hiding place of crow $y$. Two scenarios can occur during this movement.
Scenario 1: crow x finds the food place of crow y if crow y does not notice that it is being followed. Crow x then updates its position using Eq. (2):
$$v_x^{(i+1)} = v_x^{(i)} + a_x \cdot fl_x^{(i)} \cdot \left( mem_y^{(i)} - v_x^{(i)} \right) \qquad (2)$$

where $a_x$ is a random number in the interval [0, 1], $fl_x^{(i)}$ is the flight length, and $mem_y^{(i)}$ is the memory of crow $y$. Exploitation and exploration depend on the flight length ($fl$): if its value is smaller than 1, the next position of crow x lies between its current position and the memory of crow y, giving a local search; if it is greater than 1, crow x can move beyond the memory of crow y, giving a global search.
Scenario 2: if crow y notices that crow x is following it, it moves arbitrarily to dupe crow x.
The two scenarios depend on the awareness probability $AP_x^{(i)}$ and are combined as follows:
$$v_x^{(i+1)} = \begin{cases} v_x^{(i)} + a_x \cdot fl_x^{(i)} \cdot \left( mem_y^{(i)} - v_x^{(i)} \right) & \text{if } a_x \geq AP_x^{(i)} \\ \text{move to a random position} & \text{otherwise} \end{cases} \qquad (3)$$

Exploration and exploitation are balanced by the value of $AP_x^{(i)}$: larger values make random repositioning, and hence global search, more likely, while smaller values favour local search. If the fitness of the crow's new position is better than that of its memorized position, the memory is updated; otherwise, it remains unchanged. The memory update follows Eq. (4):
$$mem_x^{(i+1)} = \begin{cases} v_x^{(i+1)} & \text{if } f\left(v_x^{(i+1)}\right) \text{ is better than } f\left(mem_x^{(i)}\right) \\ mem_x^{(i)} & \text{otherwise} \end{cases} \qquad (4)$$
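For concreteness, the following is a compact sketch of the CSA loop implementing Eqs. (2)-(4) for a minimization problem. The bounds, parameter defaults, and function names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def crow_search(fitness, dim, P=20, max_iter=100, AP=0.1, fl=2.0,
                lower=0.0, upper=1.0, seed=None):
    """Minimal crow search algorithm (minimization)."""
    rng = np.random.default_rng(seed)
    v = rng.uniform(lower, upper, size=(P, dim))        # crow positions
    mem = v.copy()                                      # memorized food hiding places
    mem_fit = np.array([fitness(m) for m in mem])

    for _ in range(max_iter):
        for x in range(P):
            y = rng.integers(P)                         # crow x follows a random crow y
            if rng.random() >= AP:                      # scenario 1: y unaware, Eq. (2)
                a = rng.random()
                v[x] = v[x] + a * fl * (mem[y] - v[x])
            else:                                       # scenario 2: y aware, x is duped
                v[x] = rng.uniform(lower, upper, size=dim)
            v[x] = np.clip(v[x], lower, upper)
            f = fitness(v[x])
            if f < mem_fit[x]:                          # Eq. (4): memory update
                mem[x], mem_fit[x] = v[x].copy(), f
    best = mem_fit.argmin()
    return mem[best], mem_fit[best]
```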

5. Proposed algorithm

The K-medoids clustering algorithm is a partition-based clustering algorithm that can be applied to a wide range of clustering problems. Its main weakness is the computational burden, i.e., the time required to produce results. Nature-inspired algorithms have proved their efficacy in solving complex problems.
We have therefore hybridized the K-medoids algorithm with the crow search algorithm, exploiting the characteristics of the crow search algorithm to remove the computational burden of K-medoids. The hybridized algorithm is called KMCSA (see Fig. 1). The proposed hybrid algorithm is described as follows:

1. Initialize the parameters: initialize the number of clusters (k), population size (P), maximum iterations (miter), awareness probability (AP), flight length (fl), and the iteration counter (i = 0).

Fig. 1. Flowchart of the proposed KMCSA algorithm.

2. Initialize the population: randomly create the initial population of crows (solutions). Each solution is represented as a vector v, where $v_x^{(i)} \in [L, U]$ and $x = 1, \ldots, P$.
3. Initialize the memory: define a vector $mem_x$ holding the memory of each solution (crow), and initialize it by evaluating the objective function.
4. Evaluate the solutions: calculate the fitness value of each solution using Eq. (1) and rank the solutions accordingly.
5. Update the solutions: update each solution in the population based on the awareness probability (AP) using Eqs. (2) and (3), and execute this main loop until the termination criteria are satisfied.
6. Accept the solution: accept the new position of a crow if its fitness value is better than that of the current solution; otherwise, keep the current solution.
7. Update the memory: update the memory of each solution using Eq. (4).
8. Terminate: terminate the algorithm when the defined criteria are met and output the best solution generated.
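The paper specifies the hybrid only at the level of the steps above, so the following sketch is one plausible realization: each crow encodes a candidate set of k medoid indices, Eq. (1) serves as the fitness, and the CSA moves of Eqs. (2)-(4) drive the search. The real-valued encoding, the rounding/repair step, and all names are our illustrative assumptions.

```python
import numpy as np

def kmcsa(D, k, P=20, max_iter=100, AP=0.1, fl=2.0, seed=None):
    """Sketch of KMCSA: crow search over candidate medoid index sets."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]

    def repair(pos):
        # Map a real-valued crow position back to k distinct medoid indices.
        idx = np.clip(np.round(pos).astype(int), 0, n - 1)
        while len(set(idx)) < k:                        # resolve duplicate medoids
            dup = [i for i in range(k) if (idx[:i] == idx[i]).any()]
            idx[dup] = rng.choice(n, size=len(dup), replace=False)
        return idx

    def fitness(pos):
        meds = repair(pos)
        return D[:, meds].min(axis=1).sum()             # Eq. (1), the absolute error

    v = rng.uniform(0, n - 1, size=(P, k))              # step 2: initial population
    mem, mem_fit = v.copy(), np.array([fitness(m) for m in v])   # steps 3-4
    for _ in range(max_iter):                           # step 5: main loop
        for x in range(P):
            y = rng.integers(P)
            if rng.random() >= AP:                      # Eq. (2): move towards mem of y
                v[x] += rng.random() * fl * (mem[y] - v[x])
            else:                                       # Eq. (3): random reposition
                v[x] = rng.uniform(0, n - 1, size=k)
            f = fitness(v[x])
            if f < mem_fit[x]:                          # steps 6-7: accept, update memory
                mem[x], mem_fit[x] = v[x].copy(), f
    best = repair(mem[mem_fit.argmin()])                # step 8: best medoid set found
    return best, D[:, best].argmin(axis=1)
```

Compared with the exhaustive PAM swap, each KMCSA iteration evaluates only P candidate medoid sets, which is where the reduction in computational burden would come from.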

6. Experimental results

6.1. Datasets

Eighteen datasets, fifteen real and three artificial, are used to evaluate the proposed algorithm. Eight of the datasets are small, six are medium, and four are large. In a clustering or feature selection problem, a dataset is of small, medium,

Table 1
Dataset properties.
Dataset Instances Features Classes Domain
Small scale datasets Heart 270 13 2 Biology
moon 210 2 2 lunar
Wisconsin 699 9 2 Biology
Zoo 101 16 7 Artificial
Yeast 1484 8 10 Biology (Protein)
Vowel 990 14 2 Biology (Speech)
Square 400 2 4 Artificial
2D3C 300 2 4 Artificial
Medium scale datasets SonarEW 208 63 2 Biology
Dermatology 366 34 6 Biology
IonosphereEW 351 34 2 Electromagnetic
SpectEW 267 22 2 Biology
BreastEW 596 30 2 Biology
KrvskpEW 3196 36 2 Game
Large scale datasets Clean1 476 169 2 Biology (Speech)
Semeion 1593 257 10 Biology (Speech)
Leukaemia 72 7130 2 Biology
Colon 62 2000 2 Biology

or large scale if its number of features falls in [0, 19], [20, 49], or [50, ∞), respectively [30]. Table 1 briefly describes all the datasets used.
Square: The Square dataset has been used with many clustering algorithms. It is a two-dimensional dataset with four clusters arranged as a square; the data are generated from normal distributions N(μ, σ²) [14,15].
2D3C: This dataset consists of three clusters, two created from normal distributions and one from a uniform distribution [14,15].
All real datasets used came from the UCI repository [16].

6.2. Evaluation measures

Internal and external measures are used to evaluate the performance of KMCSA. The external measures used are purity, entropy, completeness score (CS), and F-measure; the internal measures are the silhouette and the total within-cluster variance (TWCV). We also compute and compare the convergence time (in seconds) of all algorithms.
6.2a Purity: Purity is used as a quality measure for a clustering algorithm [17,18]. It is calculated using Eq. (5) and ranges from 0 to 1, with 1 being the best value.
$$Purity = \frac{1}{n} \sum_{j=1}^{k} \max_i \left( \left| T_i \cap P_j \right| \right) \qquad (5)$$

Here, $P_j$ is the set of points assigned to cluster $j$, $k$ is the number of clusters, and $T_i$ is the set of points that truly belong to class $i$.
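As a quick illustration, purity can be computed from the true and predicted labels with a few lines of NumPy; the helper below is our own sketch, not code from the paper.

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Purity per Eq. (5): each cluster votes for its majority true class."""
    correct = 0
    for j in np.unique(labels_pred):
        members = labels_true[labels_pred == j]       # points assigned to cluster j
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()                       # max_i |T_i ∩ P_j|
    return correct / len(labels_true)

# Example: purity(np.array([0, 0, 1, 1]), np.array([1, 1, 1, 0])) -> 0.75
```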
6.2b Entropy: Entropy checks the algorithm's ability to find the semantic classes within each cluster and is calculated by Eq. (6) [17,18]. It ranges from 0 to 1, with 0 being the best value.
$$Entropy = \sum_{j=1}^{k} \frac{\left| P_j \right|}{n} E(P_j) \qquad (6)$$

Here, $E(P_j)$ is the entropy of an individual cluster, computed by Eq. (7):

$$E(P_j) = -\frac{1}{\log k} \sum_{i=1}^{k} \frac{\left| P_j \cap T_i \right|}{\left| P_j \right|} \log \left( \frac{\left| P_j \cap T_i \right|}{\left| P_j \right|} \right) \qquad (7)$$
6.2c Completeness score (CS): The completeness score is defined by Eq. (8) [18]. It ranges from 0 to 1, with 1 being the best value.
$$CS = 1 - \frac{H(P \mid T)}{H(P)} \qquad (8)$$
Here, $H(P)$ is the entropy of the clustering and $H(P \mid T)$ is the conditional entropy of the clustering given the true classes; they are calculated by Eqs. (9) and (10).
$$H(P) = -\sum_{p=1}^{|P|} \frac{n_p}{N} \log \left( \frac{n_p}{N} \right) \qquad (9)$$


Table 2
Parameter settings for the algorithms.
Algorithms Iterations Parameters
CSAK-means [20] 100 Flight length (fl), awareness probability (AP), number of clusters (k)
K-medoids [21] 100 Number of clusters (k)
Genetic K-medoids [22] 100 Crossover rate (r), mutation rate (Mr), clusters (k)
PSO K-medoids [23] 100 Inertia weight (w), c1, c2, k (clusters)
ABC [24] 100 Number of bees (m), food sources (fs), employed bees (eb), onlooker bees (ob)
ACO [25] 100 Number of ants (m), 𝛼 (pheromone concentration), 𝛽 (heuristic factor), 𝜌 (pheromone evaporation)
KMCSA 100 Flight length (fl), awareness probability (AP), number of clusters (k)

$$H(P \mid T) = -\sum_{t=1}^{|T|} \sum_{p=1}^{|P|} \frac{n_{pt}}{N} \log \left( \frac{n_{pt}}{n_p} \right) \qquad (10)$$

6.2d F-measure: The accuracy of the results is measured using the F-measure (FM), an external measure [19]. It is calculated as the harmonic mean of precision (p) and recall (r) using Eq. (11), and ranges from 0 to 1, with 1 being the best value.

$$FM = 2 \times \frac{p \times r}{p + r} \qquad (11)$$
Precision is calculated as the number of correct positives (CP) divided by the sum of the correct positives (CP) and wrong positives (WP), so its value ranges between 0 and 1. It is calculated using Eq. (12).

$$p = \frac{CP}{CP + WP} \qquad (12)$$
Recall is computed as the number of correct positives (CP) divided by the sum of the correct positives and wrong negatives (WN), using Eq. (13).

$$r = \frac{CP}{CP + WN} \qquad (13)$$
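The three quantities combine into a few lines of code; this tiny helper is our own illustration of Eqs. (11)-(13).

```python
def f_measure(cp, wp, wn):
    """F-measure from correct positives, wrong positives, wrong negatives."""
    p = cp / (cp + wp)           # Eq. (12), precision
    r = cp / (cp + wn)           # Eq. (13), recall
    return 2 * p * r / (p + r)   # Eq. (11), harmonic mean

# Example: f_measure(cp=8, wp=2, wn=4) -> precision 0.8, recall 2/3, FM ≈ 0.727
```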
6.2e Silhouette: The silhouette measures how similar an object is to its own cluster compared with the other clusters, and is computed by Eq. (14) [26,27]. It ranges from -1 to 1, with 1 being the best value.

$$Silhouette = \frac{y_i - x_i}{\max(x_i, y_i)} \qquad (14)$$
Here, $x_i$ is the average dissimilarity of object $i$ to the other objects in its own cluster, and $y_i$ is the average dissimilarity of object $i$ to the objects of the other clusters.
6.2f Total within-cluster variance (TWCV): TWCV is computed by Eq. (15) [26,27].
$$TWCV = \sum_{n=1}^{N} \sum_{f=1}^{F} p_{nf}^{2} - \sum_{k=1}^{K} \frac{1}{|p_k|} \sum_{f=1}^{F} p_{kf}^{2} \qquad (15)$$

Here, $F$ is the number of features, $p_{nf}$ is the value of feature $f$ at point $n$, $p_{kf}$ is the sum of feature $f$ over the points of cluster $k$, and $|p_k|$ is the number of points in cluster $k$. TWCV should be minimized to achieve the best results.
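A direct transcription of Eq. (15), under the interpretation above, takes only a few lines; the helper is our own sketch.

```python
import numpy as np

def twcv(X, labels):
    """Total within-cluster variance per Eq. (15) for data matrix X (N x F)."""
    total = (X ** 2).sum()                                # sum of squared feature values
    for k in np.unique(labels):
        pts = X[labels == k]
        total -= (pts.sum(axis=0) ** 2).sum() / len(pts)  # (feature sums of cluster k)^2 / |p_k|
    return total
```

For the external measures, ready-made implementations also exist; for instance, scikit-learn's metrics module provides completeness_score and silhouette_score corresponding to Eqs. (8) and (14).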

6.3. Results

We implemented the proposed algorithm in MATLAB R2015a and executed it on an Intel i3 machine with a 2.30 GHz CPU and 8 GB of RAM. Each algorithm was run 5 times, each run with 100 iterations, using the parameters listed in Table 2. The convergence times, evaluation measures, and fitness values obtained by the CSAK-means [20], K-medoids [21], Genetic K-medoids [22], PSO K-medoids [23], ABC [24], ACO [25], and KMCSA algorithms are shown in Tables 3 to 9. A comparison of the results produced by the algorithms on the datasets used is shown in Figs. 2 to 8.

6.4. Discussion

This section discusses the results achieved.

Table 3 shows the fitness values obtained by all algorithms; KMCSA outperforms all of them, followed by ACO and ABC. Table 4 shows the silhouette (SC) results; here too KMCSA performs best, followed by the ACO algorithm, except on the sonar dataset. Table 5 shows the purity results; again, KMCSA outperforms all algorithms except on the 2D3C, square, and vowel datasets. Table 6 shows the entropy results, where KMCSA produces the best values for all datasets, with ACO and ABC also performing well. Table 7 shows the completeness score (CS) results; KMCSA produces the best values for all datasets except the square dataset. Table 8 shows the F-measure results, indicating the excellent quality of the solutions produced by KMCSA for all datasets; the ACO and ABC algorithms also produce


Table 3
Results obtained for fitness function (TWCV).
Datasets CSAK-means [20] K-medoids [21] Genetic K- medoids [22] PSO K- medoids [23] ABC [24] ACO [25] KMCSA
heart 247.02 247.48 245.47 245.47 244.85 242.35 240.15
moon 235.59 235.51 233.55 232.15 231.66 230.59 229.93
wisconsin 228.76 224.15 218.35 215.95 205.45 200.41 199.79
zoo 108.87 107.65 100.10 99.25 94.65 92.11 91.66
yeast 162.55 161.88 161.63 161.45 160.78 160.48 160.41
vowel 799.23 797.59 795.59 794.38 793.37 792.97 792.25
square 35.09 34.34 33.29 32.74 31.75 30.95 30.18
2D3C 49.61 47.60 43.34 42.60 42.39 42.32 42.28
sonar 430.42 427.11 425.51 424.34 423.05 420.15 419.45
dermatology 569.81 544.35 537.07 532.32 524.35 523.50 515.25
ionosphere 42.26 32.01 28.76 27.80 25.33 24.21 22.95
spectEW 1024.05 1022.03 1017.62 1014.01 1002.45 1001.50 997.89
breastEW 160.75 159.96 158.75 158.59 157.90 157.25 154.39
krvskpEW 3400.92 3395.78 3360.56 3357.72 3351.56 3350.56 3329.43
clean1 6868.03 6864.02 6862.82 6844.15 6834.59 6832.59 6793.65
semeion 77165.26 76230.89 75728.19 74998.93 74306.42 74287.42 73947.53
leukaemia 25769.74 25756.99 25628.34 25416.66 25248.25 25155.35 25047.23
colon 2768.74 2566.11 2400.46 2350 2263.72 2127.03 2069.43

Table 4
Results obtained for SC measure.
Datasets CSAK-means [20] K-medoids [21] Genetic K- medoids [22] PSO K- medoids [23] ABC [24] ACO [25] KMCSA
heart 0.43 0.43 0.44 0.44 0.45 0.45 0.46
moon 0.46 0.46 0.47 0.49 0.51 0.51 0.53
wisconsin 0.23 0.23 0.24 0.25 0.26 0.26 0.27
zoo 0.22 0.22 0.24 0.26 0.28 0.29 0.30
yeast 0.47 0.47 0.49 0.49 0.51 0.52 0.53
vowel 0.41 0.41 0.42 0.43 0.45 0.45 0.46
square 0.21 0.2 0.2 0.23 0.25 0.24 0.25
2D3C 0.25 0.25 0.26 0.26 0.28 0.29 0.29
sonar 0.26 0.28 0.28 0.26 0.24 0.22 0.22
dermatology 0.36 0.36 0.37 0.37 0.41 0.4 0.42
ionosphere 0.36 0.35 0.37 0.38 0.41 0.42 0.43
spectEW 0.4 0.4 0.41 0.43 0.47 0.47 0.49
breastEW 0.3 0.3 0.5 0.6 0.7 0.7 0.8
krvskpEW 0.44 0.47 0.49 0.51 0.55 0.56 0.59
clean1 0.32 0.33 0.35 0.36 0.43 0.43 0.47
semeion 0.49 0.49 0.5 0.5 0.53 0.55 0.58
leukaemia 0.45 0.45 0.46 0.46 0.48 0.47 0.49
colon 0.51 0.53 0.53 0.54 0.55 0.55 0.58

Fig. 2. Comparison of fitness values.

good results. The convergence time comparison is shown in Table 9; here, KMCSA outperforms all algorithms owing to the crow search algorithm's effective exploration and exploitation of the search space.
Overall, it can be said that hybridizing K-medoids with the crow search algorithm (CSA) significantly improves the performance of the K-medoids algorithm.


Table 5
Results obtained for purity measure.
Datasets CSAK-means [20] K-medoids [21] Genetic K- medoids [22] PSO K- medoids [23] ABC [24] ACO [25] KMCSA
heart 0.64 0.68 0.71 0.71 0.71 0.75 0.77
moon 0.57 0.54 0.55 0.55 0.66 0.69 0.71
wisconsin 0.81 0.82 0.82 0.84 0.86 0.87 0.87
zoo 0.92 0.92 0.93 0.95 0.96 0.96 0.97
yeast 0.84 0.84 0.85 0.86 0.87 0.88 0.91
vowel 0.99 0.99 0.99 0.99 0.99 0.99 0.99
square 0.91 0.91 0.91 0.91 0.91 0.91 0.91
2D3C 0.96 0.98 0.98 0.98 0.98 0.98 0.98
sonar 0.87 0.88 0.88 0.89 0.89 0.89 0.90
dermatology 0.72 0.71 0.73 0.75 0.79 0.83 0.85
ionosphere 0.69 0.69 0.69 0.71 0.75 0.77 0.80
spectEW 0.79 0.79 0.81 0.81 0.83 0.83 0.85
breastEW 0.8 0.8 0.87 0.87 0.9 0.9 0.91
krvskpEW 0.52 0.52 0.54 0.55 0.57 0.58 0.6
clean1 0.57 0.57 0.59 0.59 0.6 0.62 0.63
semeion 0.39 0.39 0.45 0.45 0.47 0.47 0.48
leukaemia 0.68 0.69 0.71 0.73 0.75 0.77 0.80
colon 0.15 0.15 0.22 0.25 0.27 0.27 0.29

Table 6
Results obtained for entropy measure.
Datasets CSAK-means [20] K-medoids [21] Genetic K- medoids [22] PSO K- medoids [23] ABC [24] ACO [25] KMCSA
heart 0.92 0.88 0.86 0.86 0.86 0.84 0.82
moon 0.97 1 0.99 0.95 0.95 0.97 0.96
wisconsin 0.61 0.6 0.6 0.59 0.59 0.57 0.51
zoo 0.29 0.29 0.24 0.23 0.21 0.2 0.19
yeast 0.17 0.17 0.17 0.16 0.15 0.16 0.12
vowel 0.11 0.09 0.11 0.11 0.09 0.11 0.11
square 0.42 0.43 0.43 0.43 0.43 0.43 0.43
2D3C 0.11 0.08 0.08 0.07 0.08 0.08 0.08
sonar 0.28 0.28 0.25 0.22 0.21 0.21 0.2
dermatology 0.33 0.17 0.32 0.2 0.24 0.25 0.21
ionosphere 0.53 0.51 0.53 0.51 0.51 0.5 0.48
spectEW 0.41 0.41 0.42 0.41 0.41 0.41 0.40
breastEW 0.26 0.25 0.25 0.24 0.25 0.24 0.22
krvskpEW 0.65 0.65 0.64 0.64 0.63 0.63 0.62
clean1 0.97 0.97 0.95 0.95 0.94 0.94 0.93
semeion 0.73 0.73 0.73 0.71 0.7 0.7 0.69
leukaemia 0.86 0.86 0.85 0.84 0.81 0.81 0.80
colon 0.77 0.46 0.44 0.42 0.41 0.4 0.38

Fig. 3. Comparison of silhouette values.


Table 7
Results obtained for CS measure.
Datasets CSAK-means [20] K-medoids [21] Genetic K- medoids [22] PSO K- medoids [23] ABC [24] ACO [25] KMCSA
heart 0.08 0.12 0.14 0.14 0.14 0.15 0.16
moon 0.2 0.2 0.3 0.3 0.5 0.6 0.7
wisconsin 0.41 0.4 0.43 0.43 0.45 0.47 0.50
zoo 0.74 0.75 0.79 0.84 0.87 0.88 0.89
yeast 0.79 0.79 0.81 0.85 0.86 0.85 0.87
vowel 0.01 0.02 0.01 0.02 0.01 0.03 0.04
square 0.02 0.01 0.00 0.01 0.01 0.01 0.01
2D3C 0.89 0.92 0.01 0.02 0.83 0.87 0.89
sonar 0.6 0.59 0.62 0.66 0.66 0.69 0.71
dermatology 0.69 0.71 0.71 0.74 0.78 0.77 0.81
ionosphere 0.11 0.13 0.1 0.13 0.15 0.17 0.19
spectEW 0.1 0.1 0.07 0.1 0.15 0.15 0.18
breastEW 0.6 0.6 0.62 0.65 0.67 0.69 0.71
krvskpEW 0.01 0.01 0.03 0.05 0.07 0.09 0.1
clean1 0.02 0.02 0.03 0.05 0.07 0.09 0.1
semeion 0.25 0.27 0.27 0.29 0.31 0.33 0.35
leukaemia 0.09 0.15 0.19 0.21 0.25 0.25 0.27
colon 0.74 0.75 0.82 0.83 0.85 0.86 0.91

Table 8
Results obtained for F-measure measure.
Datasets CSAK-means [20] K-medoids [21] Genetic K- medoids [22] PSO K- medoids [23] ABC [24] ACO [25] KMCSA
heart 0.43 0.37 0.28 0.42 0.58 0.65 0.69
moon 0.47 0.5 0.47 0.48 0.69 0.73 0.78
wisconsin 0.17 0.17 0.16 0.15 0.14 0.14 0.13
zoo 0.04 0.06 0.1 0.1 0.3 0.85 0.89
yeast 0.02 0.02 0.02 0.02 0.02 0.02 0.03
vowel 0.26 0.28 0.27 0.3 0.31 0.45 0.49
square 0.28 0.31 0.32 0.3 0.33 0.34 0.38
2D3C 0.01 0.01 0.13 0.26 0.27 0.28 0.29
sonar 0.22 0.17 0.39 0.22 0.44 0.45 0.50
dermatology 0.11 0.16 0.15 0.06 0.17 0.18 0.20
ionosphere 0.15 0.17 0.23 0.22 0.2 0.17 0.20
spectEW 0.07 0.07 0.15 0.07 0.15 0.1 0.13
breastEW 0.14 0.14 0.19 0.09 0.19 0.19 0.21
krvskpEW 0.18 0.17 0.16 0.18 0.19 0.19 0.2
clean1 0.49 0.5 0.53 0.53 0.57 0.57 0.60
semeion 0.07 0.09 0.08 0.07 0.09 0.06 0.09
leukaemia 0.43 0.43 0.59 0.62 0.63 0.63 0.65
colon 0.5 0.53 0.53 0.56 0.6 0.61 0.64

Table 9
Results obtained for convergence time (in seconds).
Datasets CSAK-means [20] K-medoids [21] Genetic K- medoids [22] PSO K- medoids [23] ABC [24] ACO [25] KMCSA
heart 15.1 15.08 15.1 14.09 11.91 10.11 8.7
moon 12.99 13.64 23.3 11.83 12.74 11.31 8.3
wisconsin 75.68 75.1 73.47 72.25 72.14 72.04 65.45
zoo 46.25 48.52 44.25 43.65 40.25 37.1 30.65
yeast 6.1 6.1 5.85 5.55 4.85 4.75 3.65
vowel 151.43 152.01 150.2 151.19 145.35 141.11 129.00
square 50.29 96.06 48.62 46.04 45.95 46.14 40.15
2D3C 65.41 101.09 55.6 54.09 49.19 47.16 37.15
sonar 34.6 34.09 35.66 34.9 35.07 34.45 53.75
dermatology 21.2 22.36 21.22 22.07 21.52 21.01 19.21
ionosphere 19.8 19.21 19.33 20.76 18.56 18.66 15.55
spectEW 13.76 15.97 14.72 13.3 23.84 22.9 19.00
breastEW 30.89 31.78 32.46 29.35 30.12 31.25 24.65
krvskpEW 169.99 168.67 167.7 167.16 166.35 165.25 155.00
clean1 34.55 37.25 36.2 33.86 33.21 33.5 29.35
semeion 148.06 170.4 151.5 147.13 145.65 145.37 135.50
leukaemia 175.28 207.8 174.35 170.23 155.25 151.55 141.27
colon 146.84 136.11 137.93 136.57 133.97 133.36 119.55


Fig. 4. Comparison of purity values.

Fig. 5. Comparison of entropy values.

Fig. 6. Comparison of Completeness score values.

Fig. 7. Comparison of F-measure values.

7. Conclusion

A hybrid K-medoids algorithm based on the crow search algorithm, called KMCSA, is proposed in this paper. The results produced by KMCSA are compared with the K-medoids, CSA K-means, Genetic K-medoids, PSO K-medoids, ABC, and ACO algorithms. The absolute error criterion is used as the fitness function to evaluate the KMCSA algorithm. The results show that KMCSA outperforms the other algorithms in all aspects, although ACO and ABC compete with KMCSA on some datasets. Setting the parameters of each algorithm is somewhat tedious. In future work, this research will be extended to determine the number of clusters dynamically.


Fig. 8. Comparison of convergence time (in seconds).

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.

Acknowledgements

We thank all of our supporters. The Krishna School of Emerging Technology & Applied Research, KPGU, Vadodara, Gujarat,
India supported this study with all experimental facilities. We also thank the technicians and students for their assistance.

References

[1] Han J, Kamber M, Pei J. Data mining: Concepts and techniques. Morgan Kaufmann; 2012. https://doi.org/10.1016/C2009-0-61819-5.
[2] Kaufman L, Rousseeuw PJ. Finding groups in data: An introduction to cluster analysis. Wiley series in probability and statistics. Wiley; 1990.
[3] Lucasius CB, Dane AD, Kateman G. On k-medoid clustering of large data sets with the aid of a genetic algorithm: background, feasibility and comparison. Anal Chim Acta 1993;282(3). https://doi.org/10.1016/0003-2670(93)80130-D.
[4] Ng RT, Han J. Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th international conference on very large data bases. 1994.
[5] Van der Laan MJ, Pollard KS, Bryan J. A new partitioning around medoids algorithm. J Stat Comput Simul 2003;73(8). https://doi.org/10.1080/0094965031000136012.
[6] Zhang Q, Couloigner I. A new and efficient K-medoid algorithm for spatial clustering. In: Lecture notes in computer science, vol. 3482 (III). 2005. https://doi.org/10.1007/11424857_20.
[7] Peng L, Dong GY, Dai FF, Liu GP. A new clustering algorithm based on ACO and K-medoids optimization methods. IFAC Proc Vol (IFAC-PapersOnline) 2014;19. https://doi.org/10.3182/20140824-6-za-1003.01501.
[8] Yang TF, Zhang XP. Spatial clustering algorithm with obstacles constraints by quantum particle swarm optimization and K-medoids. In: 2010 2nd international conference on computational intelligence and natural computing, vol. 2. 2010. https://doi.org/10.1109/CINC.2010.5643776.
[9] Wijayaningrum VN, Putriwijaya NN. An improved crow search algorithm for data clustering. EMITTER Int J Eng Technol 2020;8(1). https://doi.org/10.24003/emitter.v8i1.498.
[10] Zhu YT, Wang FZ, Shan XH, Lv XY. K-medoids clustering based on MapReduce and optimal search of medoids. In: Proceedings of the 9th international conference on computer science and education. 2014. https://doi.org/10.1109/ICCSE.2014.6926527.
[11] Song H, Lee JG, Han WS. PAMAE: Parallel k-medoids clustering with high accuracy and efficiency. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. 2017. https://doi.org/10.1145/3097983.3098098.
[12] Ushakov AV, Vasilyev I. Near-optimal large-scale k-medoids clustering. Inform Sci 2021;545. https://doi.org/10.1016/j.ins.2020.08.121.
[13] Askarzadeh A. A novel metaheuristic method for solving constrained engineering optimization problems: crow search algorithm. Comput Struct 2016;169. https://doi.org/10.1016/j.compstruc.2016.03.001.
[14] Handl J, Knowles J, Dorigo M. Ant-based clustering: a comparative study of its relative performance with respect to k-means, average link and 1d-som. Des Appl Hybrid 2003.
[15] Filippone M, Camastra F, Masulli F, Rovetta S. A survey of kernel and spectral methods for clustering. Pattern Recognit 2008;41(1). https://doi.org/10.1016/j.patcog.2007.05.018.
[16] Dua D, Graff C. UCI machine learning repository: data sets. Irvine, CA: University of California, School of Information and Computer Science; 2019.
[17] Aljarah I, Ludwig SA. A new clustering approach based on glowworm swarm optimization. In: 2013 IEEE congress on evolutionary computation. 2013. https://doi.org/10.1109/CEC.2013.6557888.
[18] Rosenberg A, Hirschberg J. V-measure: A conditional entropy-based external cluster evaluation measure. In: EMNLP-CoNLL 2007, proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning. 2007.
[19] Lewis DD, Gale WA. A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. 1994. https://doi.org/10.1007/978-1-4471-2099-5_1.
[20] Lakshmi K, Visalakshi NK, Shanthi S. Data clustering using K-means based on crow search algorithm. Sadhana - Acad Proc Eng Sci 2018;43(11). https://doi.org/10.1007/s12046-018-0962-3.
[21] Park HS, Jun CH. A simple and fast algorithm for K-medoids clustering. Expert Syst Appl 2009;36(2 Part 2). https://doi.org/10.1016/j.eswa.2008.01.039.
[22] Sheng W, Liu X. A genetic k-medoids clustering algorithm. J Heuristics 2006;12(6). https://doi.org/10.1007/s10732-006-7284-z.
[23] Zhang J, Wang Y, Feng J. Parallel multi-swarm PSO based on k-medoids and uniform design. Res J Appl Sci Eng Technol 2013;5(8). https://doi.org/10.19026/rjaset.5.4699.
[24] Zhang C, Ouyang D, Ning J. An artificial bee colony approach for clustering. Expert Syst Appl 2010;37(7). https://doi.org/10.1016/j.eswa.2009.11.003.
[25] Zhang L, Cao Q. A novel ant-based clustering algorithm using the kernel method. Inform Sci 2011;181(20). https://doi.org/10.1016/j.ins.2010.11.005.
[26] Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987;20(C). https://doi.org/10.1016/0377-0427(87)90125-7.
[27] Peng P, Addam O, Elzohbi M, Özyer ST, Elhajj A, Gao S, et al. Reporting and analyzing alternative clustering solutions by employing multi-objective genetic algorithm and conducting experiments on cancer data. Knowl-Based Syst 2014;56. https://doi.org/10.1016/j.knosys.2013.11.003.
