
Received November 3, 2020, accepted November 20, 2020, date of publication November 26, 2020, date of current version December 15, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.3040745

An Improved K-Means Algorithm Based on Fuzzy Metrics

XINYU GENG, YUKUN MU, SENLIN MAO, JINCHI YE, AND LIPING ZHU
School of Computer Science, Southwest Petroleum University, Chengdu 610500, China

Corresponding author: Xinyu Geng ([email protected])

ABSTRACT The traditional K-means algorithm has been widely used in cluster analysis. However, because distance is its only constraint, the algorithm is sensitive to special data points. To address this problem, ambiguity is introduced as a new constraint condition in the K-means clustering process. A new membership equation is proposed on this basis, and a method for solving the initial cluster centers is given, reducing the risks caused by random selection of initial points. In addition, an optimized clustering algorithm with Gaussian distribution characteristics is derived by using fuzzy entropy as the cost-function constraint. Compared with the traditional clustering method, the new membership degree reflects the relationship between a point and a set more clearly, and it addresses the traditional K-means algorithm's tendency to become trapped in local convergence and its susceptibility to noise. Experimental verification shows that the new method needs fewer iterations and achieves better clustering accuracy than the compared methods, thus producing a better clustering effect.

INDEX TERMS K-means, fuzzy entropy, cluster center, membership degree, fuzzy clustering.

I. INTRODUCTION
The clustering process is the most effective classification method for people to summarize complex external information [1]. Though classification has matured considerably, there are still challenges for clustering algorithms in realizing cognition, learning, and classification under unsupervised conditions by extracting data features [2]. No model can be used universally and achieve better results, since none is a priori [3]. Data imply enormous scientific and commercial values [4], especially with the explosive growth of data production in recent years: in 2016, the global data volume reached 10 ZB and maintained an annual growth rate of more than 40% [5]. Scattered raw data, processed with data mining technology, can deliver valuable results, such as the planning of humanities and the construction of biological sciences in references [6]–[8]. This type of research is of great significance for both social development and human self-cognition and learning cognition. It can be clearly seen that clustering research on various types of data has attracted academic attention for a long time [9].

In a study on clustering problems, for a given data set, we should first make sure whether there is a clustering structure. If so, the algorithm structure should be determined. Once confirmed, three aspects should be involved to figure out whether the clustering result is reasonable or not [10]. For different types of data, there are different processing algorithms to get a good clustering effect. K-means is one of the most commonly used clustering algorithms. Featuring a simple principle, the K-means algorithm is easy to implement and can classify low-dimensional large data sets efficiently. However, the K-means algorithm also has shortcomings, such as its vulnerability to special points, the risk of local optimal solutions, distance being its only constraint, its sensitivity to initial point selection, and its exclusive suitability for clustering numerical data and data sets with convex clusters [11]. Therefore, based on the classic K-means algorithm, many new and improved algorithms have been proposed, such as the K-modes algorithm that can cluster discrete data [12], the K-means-CP algorithm based on the consistency of k nearest neighbors [13], and the density Canopy-based K-means (DCK-means) algorithm, in which the initial point is found by considering sample density [14]. The K-means algorithm and its derived algorithms have been successfully applied in recommendation systems, image processing, data mining, video recognition, and other fields [15]. At present, the classic K-means algorithm is usually optimized by combining it with other algorithms to realize the selection of clustering centers and the determination of the distance function. In reference [16], it is proposed to use the result of singular value decomposition (SVD) as the initial point of clustering to obtain better clustering effects.

(The associate editor coordinating the review of this manuscript and approving it for publication was Utku Kose.)


In reference [17], neighboring ideas are utilized, taking the overall distribution of data samples as the basis for division so as to improve clustering effects. In reference [18], the adaptive radius immune algorithm (ARIA) is combined with K-means: the immune algorithm with adaptive radius is applied to preprocess the data and generate mirror data that represent the distribution and density of the original data, improving the noise resistance and stability of clustering. In reference [19], a K-means clustering algorithm with a random shift of the center of gravity is designed to process single-color images, specifically for the isolated-point problem. In addition, new directions have emerged that combine swarm intelligence and bionic algorithms, such as particle swarm optimization K-means (PSO-Kmeans) based on particle swarms [20], the artificial bee colony K-means algorithm (ABC-Kmeans) based on bee swarms [21], and Gray Wolf K-means (GWO-Kmeans) based on gray wolf optimization [22]. Most of these optimizations still address the division of a single aspect such as distance or edge data. Among them, only reference [18] presented an application of the characteristics of data distribution, which makes effective use of data information beyond the distance between points.

After fuzzy theory appeared, Dunn et al. put forward the fuzzy c-means clustering algorithm (FCM) [23]. According to its judgement criteria, clustering is divided into hard clustering and soft clustering. Reference [24] raised a fuzzy C-means algorithm with a divergence-based kernel (FCMDK), which can handle data whose boundaries between clusters are non-linear. Chowdhary, C. L., et al. proposed a novel possibilistic exponential fuzzy c-means (PEFCM) clustering algorithm for segmenting medical images better and earlier [25]. In hard clustering, objects to be clustered are strictly classified by their exclusive nature, but the objects of practical problems are complex and may carry attributes of multiple categories, so it is necessary to come up with a soft division method for such fundamental issues. For example, FCM clustering has been widely used in fields like pattern classification and image processing [26]. Because it adds ambiguity to the membership requirement of each pixel, this type of algorithm is significantly better than traditional K-means clustering in image processing results [27]. Although the degree of membership can feed back correlation beyond the distance, it is still obtained from the distance relationship between a single point and the class, so the algorithm remains very susceptible to noise. Reference [28] hybridizes intuitionistic fuzzy sets and rough sets in combination with statistical feature extraction techniques, achieving higher accuracy than a hybrid fuzzy rough set model. In reference [29], a method is advanced to relax the restrictions of membership and improve robustness. In reference [30], the weight of the membership degree is involved to consider the influence of each dimension attribute on clustering. In reference [31], a heuristic method is offered based on the Silhouette criterion to find the number of clusters. Reference [32] designates the digital execution of a model based on intuitionistic fuzzy histogram hyperbolization and a possibilistic fuzzy c-mean clustering algorithm for early breast cancer detection. In reference [33], a clustering method combining the fuzzy c-means algorithm and an entropy-based algorithm is advised to achieve both distinct and compact effects. Reference [34] proposed a novel intuitionistic possibilistic fuzzy c-mean algorithm, in which possibilistic fuzzy c-mean and intuitionistic fuzzy c-mean are hybridized to overcome the problems of fuzzy c-mean. In reference [35], a novel fuzzy-entropy based clustering measure (FECM) is presented, in which the average symmetric fuzzy cross entropy of membership subset pairs is integrated with the average fuzzy entropy of clusters. The above methods improve clustering effects by using fuzzy entropy to enhance the effective use of information such as overall distribution characteristics. Inspired by the theory of fuzzy mathematics and by references [17] and [18], this paper proposes a fuzzy mean clustering algorithm, namely fuzzy metrics K-means (FMK), with fuzzy entropy as a constraint in addition to the distance condition. The algorithm first introduces artificial setting of the initial cluster centers to reduce the influence of noise, integrates the overall distribution structure into the membership function, and then compares the overall ambiguity of a cluster after a given point is introduced. The last step is convergence through iteration, finally realizing the clustering of the FMK-means algorithm.

II. RELATED INFORMATION
A. K-MEANS ALGORITHM
As one of the most well-known clustering algorithms, the K-means algorithm lets the user select the number of categories and bases its calculation of closeness on the Euclidean distance between points.

In the algorithm, k clusters C = {C_1, C_2, ..., C_k} are randomly selected as partitions, and the n samples of the data set D = {x_1, x_2, ..., x_n}, n > k, are each assigned to the nearest cluster. The center point of each cluster is then recalculated, and the iteration stops when the convergence condition is reached. Generally, the convergence function is defined as follows:

SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2 \qquad (1)

where SSE is the least square error of the cluster division corresponding to the sample clustering, and c_k is the center point of cluster C_k. In reference [15], the mathematical meaning of the center point is verified and deduced, with the conclusion that the best center of a cluster is the mean value of the points in it:

c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i \qquad (2)

In summary, the goal of the K-means algorithm is to minimize the clustering objective in (1).
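To make (1) and (2) concrete, the following minimal Python sketch (our own illustrative code, not the authors' implementation; all names are ours) runs the standard K-means iteration and reports the SSE:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # Plain K-means: assign each sample to the nearest center, then
        # re-average, implementing equations (1) and (2).
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
        for _ in range(max_iter):
            # dist[i, j] = Euclidean distance from sample i to center j
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # Equation (2): each center is the mean of its cluster
            # (this sketch assumes no cluster becomes empty).
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        sse = ((X - centers[labels]) ** 2).sum()  # equation (1)
        return labels, centers, sse

The random initialization on the first line of the loop setup is precisely the source of the sensitivity to special points discussed below.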


This optimization problem is NP-hard [15]. Equation (1) also reflects how closely the samples in a cluster gather around the center: the smaller the SSE value is, the higher the degree of clustering and the higher the similarity within the cluster. The algorithm is characterised by its simple method of generating the center point, its fast speed, and its good scalability.

B. MEMBERSHIP AND FUZZY METRICS
The degree of membership is the basis of fuzzy set operations, and the membership function is the key to describing fuzziness [36]. Unlike the rule of the classic set, one element in a fuzzy set can belong to multiple sets; besides, the sum of the membership degrees of an element to the different clusters is always one. For example, in the FCM algorithm, the membership is computed as follows:

u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{\frac{2}{m-1}}}, \quad \text{s.t.} \; \sum_{j=1}^{c} u_{ij} = 1 \qquad (3)
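As an illustration of (3), here is a minimal NumPy sketch of the FCM membership update (our own code, written in the standard FCM form where the sum runs over a sample's distances to all centers; dist and m are assumed inputs):

    import numpy as np

    def fcm_membership(dist, m=2.0):
        # dist: (n, c) matrix of distances d_ij from sample i to center j,
        # assumed strictly positive (a sample sitting exactly on a center
        # needs a special case). Returns an (n, c) membership matrix whose
        # rows sum to one, per the constraint in (3).
        power = 2.0 / (m - 1.0)
        # ratio[i, j, k] = (d_ij / d_ik) ** power; summing over k gives the
        # denominator of equation (3).
        ratio = (dist[:, :, None] / dist[:, None, :]) ** power
        return 1.0 / ratio.sum(axis=2)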
The relationship between a point and a set is described by the degree of membership, while the overall fuzzy degree of a set requires a new fuzzy measure. The fuzzy entropy defined on the order relationship reflects the fuzzy degree of a fuzzy partition [37]. Let F(X) denote all fuzzy sets on X. For any A, B ∈ F(X), we call A ≤_FM B if and only if Ā ≤ B̄. The following theorems hold for the ambiguity d(A) of a fuzzy set A [34], [35].

Theorem 1: Let A be a fuzzy set in the universe X and d: F(U) → [0, 1] a mapping. Then:
1) d(A) = 0 if and only if A ∈ ρ(X);
2) d(A) = 1 if and only if µ_A(x) ≡ 0.5;
3) if for any x_i ∈ U, µ_A(x_i) ≥ µ_B(x_i) ≥ 0.5 or µ_A(x_i) ≤ µ_B(x_i) ≤ 0.5, then d(A) ≤ d(B);
4) d(A) = d(∼A).

Theorem 2: If U_1 = [A_1, A_2, ..., A_c] and U_2 = [B_1, B_2, ..., B_c] are two fuzzy c-partitions on X satisfying A_j ≤_FM B_j, j = 1, 2, ..., c, then U_1 is called a distinct modification of U_2, recorded as U_1 ≺_FM U_2. Hence, among all fuzzy partitions FP(X) on X, [1/c] must exist as the largest element.

In this paper, the fuzzy entropy E_p is used as the fuzzy measure:

E_p(U : c) = \frac{4}{c^p} \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij} (1 - u_{ij}) \qquad (4)

where c is the number of divided clusters. When 1 < c < n, equation (4) satisfies the following properties:
1) 0 ≤ E_p(U : c) ≤ 4n(c−1)/c^{p+1};
2) E_p(U : c) = 0 ⇔ U is a hard partition;
3) U = [1/c] ⇔ E_p(U : c) = 4n(c−1)/c^{p+1}.

Proofs of these theorems have been given in reference [37], so they are omitted here.
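Equation (4) and its extreme cases in properties 1)–3) are easy to check numerically; a small sketch (our own, with U an n × c membership matrix):

    import numpy as np

    def fuzzy_entropy(U, p):
        # Equation (4): E_p(U : c) = (4 / c**p) * sum_ij u_ij * (1 - u_ij)
        n, c = U.shape
        return (4.0 / c**p) * np.sum(U * (1.0 - U))

    n, c, p = 4, 3, 1.0
    U_hard = np.eye(c)[[0, 1, 2, 0]]     # a hard partition: entropy 0 (property 2)
    U_fuzzy = np.full((n, c), 1.0 / c)   # the maximally fuzzy partition U = [1/c]
    assert fuzzy_entropy(U_hard, p) == 0.0
    assert np.isclose(fuzzy_entropy(U_fuzzy, p), 4 * n * (c - 1) / c**(p + 1))  # property 3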
What's more, Fan and Wu [38] proved that clustering with fuzzy entropy as a fuzzy metric is feasible. The fuzzy entropy function in reference [38] is the particular case of reference [37] where p is zero. Meanwhile, the parameter p also plays an analytical role in clustering, overcoming defects of partition coefficients and other similar clustering validity functions.

FIGURE 1. Illustration of the special points in this study.

III. OPTIMIZATION METHOD
When the K-means algorithm is applied to data processing, special-point problems such as equidistant points or noise may occur, as shown in FIGURE 1. From equation (2), it can be seen that the center point is controlled by the distances between points during iteration, so when there are special points, the clustering results of the K-means algorithm are very easily affected. Therefore, this paper proposes an improved algorithm based on fuzzy entropy (FMK-means), exploiting the fact that fuzzy entropy judges the structure of a set.

A. OPTIMIZATION DIRECTION
When processing data, the traditional K-means algorithm and its improved variants still focus on calculating the distance between points to reach the globally optimal solution; however, their clustering effects on non-convex sets are not good. At the same time, as the initial cluster centers are randomly selected, the cluster-center iteration contracts evenly instead of conforming to the characteristic direction of the data distribution. Hence, the problem of sensitivity to special points has not been solved [39]. When the membership degree is introduced, as in the fuzzy clustering FCM algorithm, more information between points is used, but the emphasis value m attached to the membership degree must be accounted for, and there is still no ideal theoretical result for this problem [40]. None of the above clustering methods can assess the clustering effect by reflecting the set's overall ambiguity and degree of information.

Therefore, in order to solve the problems of intra-cluster measurement and the initial center point, the overall feature space and an artificially given initial cluster center are needed in addition to the concept of fuzzy measurement. Entropy usually describes the order of information: the higher the order of information, the lower the fuzzy entropy, and vice versa. For example, in reference [41], a simple fuzzy entropy constraint is added to the original FCM algorithm, which can suppress noise data to a certain extent.


In reference [42], a membership equation with data distribution characteristics and the fuzzy clustering algorithm on entropy (FCMOE) with fuzzy entropy constraints emerged based on the original FCM; both have a certain improving influence on the convergence direction of the data. In reference [43], the ABC-kmeans algorithm is supplemented with a data-feature strategy, performing better in clustering. In reference [44], based on the PSO-Kmeans algorithm, a two-step method is proposed that gives priority to a better initial clustering center, finally reducing the cost.

In summary, the introduction of the fuzzy entropy function makes fuzzy entropy the coefficient of the K-means algorithm's objective function. Meanwhile, an optimized clustering algorithm with Gaussian distribution characteristics can be derived by utilizing the distribution characteristics of the data to be clustered and taking fuzzy entropy as the cost-function constraint, finally improving the anti-noise performance. Moreover, the algorithm gives a better initial center point to reduce iterations.

B. FMK-MEANS ALGORITHM DERIVATION
Plugging fuzzy entropy into equation (1) as a measure of fuzzy degree yields equation (5), in which the constraint condition is obtained through Theorem 2:

SSE = \frac{4}{c^p} \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij} (1 - u_{ij}) (x_i - c_j)^2, \quad \text{s.t.} \; p = \log_c 4(c-1) - 1 \qquad (5)

Concurrently, the distribution characteristic ω of the clustering function is introduced so that the newly added constraint condition contracts according to it:

\omega = \frac{1}{nN} \sum_{i=1}^{n} \sum_{k=1}^{N} \frac{x_{ik}}{\max x_k} \qquad (6)

where N is the number of dimensions of the elements in the cluster and ω represents the normalized distribution characteristics of the clustered data set. According to the factor ω, the fuzzy entropy is made to fit the data distribution characteristics. The KKT condition of the FMK-means algorithm with the fuzzy entropy factor can be expressed as equation (7):

S = \frac{4}{c^p} \sum_{i=1}^{n} \sum_{j=1}^{c} \left[ u_{ij} (1 - u_{ij}) (x_i - c_j)^2 + \omega u_{ij} \ln u_{ij} \right]
\text{s.t.} \; g(u_{ij}) = u_{ij} \ln u_{ij} \le 0, \quad \omega \ge 0, \quad \omega g(0) = \omega g(1) = 0 \qquad (7)

To find the extremum, Lagrange multipliers are introduced, giving the new function (8):

L = \frac{4}{c^p} \sum_{i=1}^{n} \sum_{j=1}^{c} \left[ u_{ij} (1 - u_{ij}) (x_i - c_j)^2 + \omega u_{ij} \ln u_{ij} \right] + \sum_{i=1}^{n} \lambda_i \left( 1 - \sum_{j=1}^{c} u_{ij} \right), \quad \text{s.t.} \; \sum_{j=1}^{c} u_{ij} = 1 \qquad (8)

Taking the partial derivative of (8) with respect to u_{ij}, setting it to zero, and applying constraint (7), the degree of membership can finally be obtained as equation (9):

u_{ij} = \frac{\exp\left( -\frac{4 (1 - 2 u_{ij}) (x_i - c_j)^2}{c^p \omega} \right)}{\sum_{t=1}^{c} \exp\left( -\frac{4 (1 - 2 u_{it}) (x_i - c_t)^2}{c^p \omega} \right)} \qquad (9)

Then, taking the partial derivative of (8) with respect to the cluster center c_j, the cluster center can be expressed as equation (10):

c_j = \frac{\sum_{i=1}^{n} u_{ij} (1 - u_{ij}) x_i}{\sum_{i=1}^{n} u_{ij} (1 - u_{ij})} \qquad (10)

C. INITIAL CENTER POINT SOLUTION OF THE FMK-MEANS ALGORITHM
In the traditional K-means algorithm, the initial center points are randomly allocated. If noise or edge points are used as initial points, significant interference will be caused to the result. To avoid such situations, the initial points can be given artificially, and they are required to be as close to the actual centers as possible. According to the distribution of clustering centers, the centers of different clusters must be arranged in a nearly linear way in one or more dimensions, so an approximation can be achieved through the average value. Assume that the data set X includes n N-dimensional data divided into c clusters. Then the I-th dimension (I = 1, 2, ..., N) of the center of any cluster i can be expressed as:

v_{iI} = \frac{3i}{2cn} \sum_{j=1}^{n} x_{jI} \qquad (11)

This gives the initial center points V_i = \{v_{i1}, v_{i2}, \ldots, v_{iI}, \ldots, v_{iN} \mid i = 1, 2, \ldots, c\}.
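Before moving on to the overall flow, the quantities derived above can be made concrete. The following minimal Python sketches are our own illustrative reading, not the authors' code: equation (9) is treated as a fixed-point sweep evaluated at the previous membership matrix (since u_ij appears on both sides), with the index t replacing j in the denominator; equation (6) is read as the mean of the data after each attribute is scaled by its maximum (assuming positive-valued attributes); and equation (11) is followed with the 3i/(2cn) coefficient as printed, which spreads the c initial centers linearly along the per-dimension mean.

    import numpy as np

    def distribution_factor(X):
        # Equation (6): omega = (1 / (n N)) * sum_i sum_k x_ik / max(x_k)
        n, N = X.shape
        return float(np.sum(X / X.max(axis=0)) / (n * N))

    def initial_centers(X, c):
        # Equation (11): v_iI = (3 i / (2 c n)) * sum_j x_jI for i = 1..c
        n, N = X.shape
        col_sums = X.sum(axis=0)   # sum_j x_jI for each dimension I
        return np.array([3 * i / (2 * c * n) * col_sums for i in range(1, c + 1)])

    def fmk_update(X, U_prev, centers, omega, p):
        # One FMK-means iteration: equation (9) evaluated at the previous
        # memberships and normalized over clusters, then equation (10).
        c = centers.shape[0]
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # ||x_i - c_j||^2
        logits = -4.0 * (1.0 - 2.0 * U_prev) * d2 / (c**p * omega)
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        expo = np.exp(logits)
        U = expo / expo.sum(axis=1, keepdims=True)
        w = U * (1.0 - U)                              # weights u_ij (1 - u_ij) of (10)
        centers_new = (w.T @ X) / w.sum(axis=0)[:, None]
        return U, centers_new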


D. THE FLOW OF THE FMK-MEANS ALGORITHM
The traditional K-means algorithm randomly selects the initial points without considering the distribution characteristics of the actual data, and it does not have soft clustering characteristics, which leads to a lack of robustness and stability in the clustering results. Obviously, compared with the traditional K-means algorithm, it is more reasonable for the improved algorithm to work from multiple angles rather than a single one. In addition, the improved algorithm reflects a Gaussian distribution nature and can effectively overcome the problem of sensitivity to noise. The initial points can be given automatically, and a soft clustering algorithm with data characteristics is proposed in this paper. The specific steps are as follows:

Step 1: For a given data set, the mean points of all samples are calculated by referring to equation (11). The mean point is taken as the first clustering center, denoted as V^(0). The distribution characteristic of the overall data, ω, is calculated through equation (6).

Step 2: Calculate u_ij^(0) for all samples; the membership degrees are determined according to equation (9). At the beginning, all sample points are initialized, which means all membership degrees start at zero.

Step 3: Conduct the FMK-means calculation on the given data set, iterating u_ij and c_j until the termination condition of the iteration is reached.

Step 4: Output the final clustering results.

FIGURE 2. Algorithm flow chart.

Algorithm 1 FMK-Means (Fuzzy Metrics K-Means)
Require: data set D, maximum number of iterations T, threshold ε, number of clusters c
Ensure: clustering results of the data set
1: initialize the array list;
2: compute V_j(D);
3: FOR (each cluster j ∈ K) {
4:   FOR (each dimension I ∈ N) {
5:     compute v_jI;
6:   }
7: }
8: select centers V_j;
9: PRINT(K, initial centers V);
10: WHILE (data set D != null) {
11:   compute u_ij(D);
12:   FOR (each cluster j ∈ K) {
13:     FOR (each sample i ∈ D) {
14:       compute u_ij;
15:     }
16:   }
17:   select membership u_ij;
18:   PRINT(D, initial membership u_ij);
19:   FMK-means input(D, K, initial centers V, initial membership u_ij);
20:   WHILE (new center != original center) {
21:     FOR (each center V_j ∈ V) {
22:       FOR (each sample i ∈ D) {
23:         compute SSE(u_ij, d_ij);
24:       }
25:       IF (SSE(u_ik, d_ik) == min SSE(u_ij, d_ij)) {
26:         center V_k <- sample i;
27:       }
28:     } // END FOR
29:     compute new center V_j = mean(sample i : i ∈ cluster V_j);
30:   } // END WHILE
31: }
32: PRINT(clusters V_j);
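Assembling the earlier sketches gives a hypothetical end-to-end driver for Algorithm 1 (our own illustrative composition, reusing distribution_factor, initial_centers, and fmk_update from the sketches above; p follows the constraint in (5), and c ≥ 2 is assumed):

    import numpy as np

    def fmk_means(X, c, max_iter=100, eps=1e-4):
        n, N = X.shape
        p = np.log(4 * (c - 1)) / np.log(c) - 1   # p = log_c 4(c - 1) - 1, from (5)
        omega = distribution_factor(X)            # equation (6)
        centers = initial_centers(X, c)           # equation (11), Step 1
        U = np.zeros((n, c))                      # memberships start at zero, Step 2
        for _ in range(max_iter):
            U, new_centers = fmk_update(X, U, centers, omega, p)  # (9)-(10), Step 3
            done = np.abs(new_centers - centers).max() < eps      # termination check
            centers = new_centers
            if done:
                break
        return U.argmax(axis=1), centers          # Step 4: hard labels and centers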
IV. FMK-MEANS ALGORITHM COMPLEXITY
In terms of time cost, the time complexity of the traditional K-means algorithm is O(i × c × n × m), where c is the number of clusters, i is the number of iterations required for convergence, n is the number of samples in the data set, and m is the number of attributes of each sample. If convergence occurs faster, the value of i will usually be relatively small. As long as the number of clusters c is significantly less than n, the K-means algorithm is linearly related to n.

In the optimized algorithm, the time complexity of solving the initial center points is O(n × m). In the subsequent iteration process, the time complexity of solving the membership degrees in a single iteration is O(c × n × m), and the time complexity of updating the cluster centers in each iteration is also O(c × n × m). Introducing these into equation (5), the time cost of the FMK-means algorithm per iteration is O(c × n × m); in summary, the total cost of each iteration is 2O(c × n × m). Assuming that the number of iterations is i, the total time complexity is O(i × c × n × m). Finally, the time complexity of the FMK-means algorithm can be determined to be O(kn).

In reality, considering the actual effect of the initial center points, the number of iterations of the improved algorithm will be much smaller than the number of iterations i of the traditional K-means algorithm.

V. SIMULATION EXPERIMENT OF CLUSTERING
A. EXPERIMENTAL ENVIRONMENT AND DESIGN
The experiment was conducted on a Windows 10 system with 16 GB physical memory and a 3.20 GHz CPU, using the Python 3.8 platform. Comparisons were made between the FMK-means algorithm, the K-means algorithm, the K-means++ algorithm, the FCM algorithm, the alternative fuzzy k-means (AFKM) algorithm [45], and the FCMOE algorithm. The K-means, K-means++, and FCM algorithms were implemented by calling Scikit-learn [46], [47]. The experimental data were five data sets (Iris, Balance, phoneme, ring, and HTRU) [48]–[52], whose detailed parameters are shown in TABLE 1. Since the reference data sets carry reference labels, the clustering results allow a more intuitive comparison.


TABLE 1. Attribution of data.
TABLE 2. Performance comparison of Iris clustering algorithm.
TABLE 3. Performance comparison of Balance clustering algorithm.
TABLE 4. Performance comparison of phoneme clustering algorithm.
TABLE 5. Performance comparison of ring clustering algorithm.
TABLE 6. Performance comparison of HTRU clustering algorithm.
FIGURE 3. Comparison of clustering accuracy of the six algorithms.

The parameters are t = 10000 and the threshold ε = 0.0001. Because reference [53] points out that the clustering effect is ideal when the fuzzy index is 1.5–2, m = 2 is set in FCM.

In order to verify the effect of the improved FMK-means algorithm, two sets of comparative experiments were designed. Experiment 1 compared the clustering results of the above data sets under the different clustering algorithms to verify the improvement of the proposed algorithm over the traditional algorithms; for each data group, every algorithm took the average of ten runs as its clustering result. Experiment 2 verified the anti-noise ability of the improved algorithm and the traditional algorithm by adding noise data to the sample data.
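For the scikit-learn baselines, the comparison can be reproduced along the following lines (a sketch using the stated t and ε as max_iter and tol; the accuracy metric maps each found cluster to its majority true class and assumes every cluster is non-empty; the FCM, AFKM, FCMOE, and FMK-means runs would come from separate implementations):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    for init in ("random", "k-means++"):   # plain K-means vs. K-means++
        km = KMeans(n_clusters=3, init=init, n_init=10, max_iter=10000, tol=1e-4)
        labels = km.fit_predict(X)
        # Map each cluster to its majority class to score clustering accuracy.
        acc = sum(np.bincount(y[labels == j]).max() for j in range(3)) / len(y)
        print(init, km.n_iter_, round(acc, 4))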
B. EXPERIMENTAL RESULTS AND ANALYSIS
1) ALGORITHM CLUSTERING RESULTS
The clustering performance comparison of the six algorithms is shown in the tables above. In these results, we found that the amount of data and the number of dimensions affect different ranges of the same method differently. In addition, on the same data set, the outputs of the different algorithms were also quite different. Based on the data in TABLE 2 to TABLE 6, the results of the different algorithms are plotted for the five cases of 500, 2000, 20000, 100000(8) and 100000(19) samples.

The accuracy line chart in FIGURE 3 shows that all six clustering algorithms perform best on Iris. Hard clustering performs worst in accuracy on the Balance data, because the Balance data are not continuous variables; the worst effect of soft clustering occurs on phoneme, because its different types of data are more scattered. However, by considering data distribution characteristics, the two algorithms FCMOE and FMK-means stand out for their high accuracy, which also reflects, to some extent, that purely measuring the distance between points is unreasonable.

It is not difficult to find from the comparison that the FMK-means algorithm proposed in this paper ranks best on several indicators in the performance test. On the experimental data sets, the FMK-means algorithm has the highest average accuracy, with the FCMOE algorithm close behind. The reason is that both of them supply the initial cluster centers, avoiding the problem of random initial points, and take into account both the fuzzy entropy measurement and the overall distribution characteristics of the data, hence being more comprehensive than the other algorithms. Although changing the metric in the AFKM algorithm improves the robustness of the AE metric, its clustering effect is not significantly improved compared to FCM.

To summarize, for low-dimensional small-data clustering, when the stopping condition is satisfied, the FMK-means algorithm has the fewest iterations and the highest accuracy rate.

FIGURE 4. Clustering time of the six algorithms.
FIGURE 5. Number of iterations of the six algorithms.
FIGURE 6. Data distribution of Balance.
TABLE 7. Initial cluster center comparison.
TABLE 8. Anti-noise performance.

In the case of rising data volume from Iris to phoneme, the number of iterations of the traditional algorithm increases very rapidly, but the soft clustering methods need fewer iterations and their time cost changes smoothly. The main reason is that the degree of membership reflects the relationship between multiple points better than the simple distance between two points, thereby speeding up convergence. Moreover, the FMK-means and FCMOE algorithms need fewer iterations thanks to their artificially given initial centers. Besides, when the dimensionality increases from phoneme to ring, the accuracy of the K-means algorithms decreases, but the accuracy of fuzzy clustering does not change in a similar way, and the clustering algorithms with fuzzy entropy as the constraint possess the highest accuracy. Since fuzzy entropy describes the overall fuzzy degree, and since the more specific the set is, the lower its fuzzy entropy, FMK-means achieves a higher accuracy and a lower clustering output. On the Balance data, the above algorithms show abnormal feedback results because the Balance data are not convex, as the distribution in FIGURE 6 shows. Therefore, in spite of the small data volume of this sample, the accuracy of the traditional K-means algorithm is not high. The FMK-means and FCMOE algorithms, though taking a little longer, gain a high accuracy due to using data distribution characteristics as the contraction direction. Both ring and HTRU are at the level of 100,000 samples, but the impact of their dimensionality on all algorithms is greater than that of their quantity.

Even though FCMOE requires fewer iterations than the traditional K-means algorithm and the FCM algorithm, there is still a certain gap in accuracy and convergence results between it and the FMK-means algorithm. The main reason is that in the optimized algorithm fuzzy entropy is used as the coefficient: although both introduce an adjustment factor that indicates the distribution characteristics of the data set, the FMK-means algorithm is more reliable in its reduction direction. Moreover, owing to the calculation with fuzzy entropy introduced, the output of the FMK-means algorithm is the smallest, and the resulting overall distribution is less ambiguous.

2) ANTI-NOISE ABILITY COMPARISON
To compare the anti-noise ability of the FMK-means algorithm with the other five algorithms, 10%, 15% and 20% noise data are added to the Iris data; these algorithms are then used to complete the clustering, and the anti-noise ability is checked. The experimental results are shown in TABLE 8.

According to the experimental results, the FMK-means algorithm is less affected by noise and has good anti-noise performance.
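The noise experiment can be mimicked by appending uniformly distributed points spanning the data range before clustering (a sketch with our own noise model, since the paper does not specify how the noise points were drawn):

    import numpy as np

    def add_noise(X, fraction, seed=0):
        # Append fraction * len(X) uniform random points over the data's bounding box.
        rng = np.random.default_rng(seed)
        n_noise = int(len(X) * fraction)
        noise = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_noise, X.shape[1]))
        return np.vstack([X, noise])

    # e.g. the 10%, 15% and 20% noise levels used on Iris:
    # X_noisy = add_noise(X, 0.10)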


Compared with traditional K-means algorithms, FMK-means, with artificially given initial cluster centers, effectively avoids the interference of randomly selected special points, reduces iterations, and possesses some anti-noise ability, since fuzzy entropy is introduced as a new constraint in the FMK-means algorithm. Secondly, in the FMK-means algorithm, the data distribution characteristics are also introduced into the membership degree, so that the membership function is supplemented with Gaussian distribution characteristics. This ensures that the reducing direction of the entropy conforms to that of the distribution characteristics during the algorithm's convergence, thereby reducing the impact of non-convex clustering.

The comparison of the clustering effects of the different algorithms and the analysis of the clustering results make it clear that the improved FMK-means algorithm has stronger advantages in terms of convergence speed, iterations, and clustering accuracy.

VI. CONCLUSION
This paper proposes an improved K-means algorithm based on fuzzy entropy. Fuzzy entropy, characterized by its description of a set's fuzzy degree, contributes to solving the problem that the traditional K-means algorithm is extremely sensitive to special points. Meanwhile, a new solution for the initial center points is proposed, which avoids the defects of randomly selected initial points and the risks of a special point or a local optimal solution. In the optimized algorithm, the factor ω representing the distribution characteristics of the data set is specified for the fuzzy entropy, so that the clustering solution has Gaussian distribution characteristics, which improves the accuracy and noise resistance of the clustering algorithm. However, the processing of multi-dimensional data, such as text data and image data, is not covered in this paper. Further research tasks include how to select a correct way to reduce dimensionality and how to reasonably introduce other types of ambiguity.
REFERENCES
[1] Y. Wu, ''General overview on clustering algorithms,'' Comput. Sci., vol. 42, pp. 491–499, 524, Jun. 2015.
[2] Y. Jing and J. Wang, ''Tag clustering algorithm LMMSK: Improved K-means algorithm based on latent semantic analysis,'' J. Syst. Eng. Electron., vol. 28, no. 2, pp. 374–384, Apr. 2017.
[3] S. Sambasivam and N. Theodosopoulos, ''Advanced data clustering methods of mining Web documents,'' Issues Informing Sci. Inf. Technol., vol. 3, pp. 563–579, Jan. 2006.
[4] J. Xu, K. Zheng, M.-M. Chi, Y.-Y. Zhu, X.-H. Yu, and X.-F. Zhou, ''Trajectory big data: Data, applications and techniques,'' J. Commun., vol. 36, no. 12, pp. 97–105, 2015.
[5] IDC. Accessed: Oct. 31, 2020. [Online]. Available: https://www.idc.com/
[6] P. M. Sosa, G. S. Carrazoni, R. Gonçalves, and P. B. Mello-Carpes, ''Use of Facebook groups as a strategy for continuum involvement of students with physiology after finishing a physiology course,'' Adv. Physiol. Edu., vol. 44, no. 3, pp. 358–361, Sep. 2020.
[7] F. Yu, K. Peng, and X. Zheng, ''Big data and psychology in China,'' Chin. Sci. Bull., vol. 60, nos. 5–6, pp. 520–533, Feb. 2015.
[8] K. Jahanbin and V. Rahmanian, ''Using Twitter and Web news mining to predict COVID-19 outbreak,'' Asian Pacific J. Tropical Med., vol. 13, pp. 378–380, Jul. 2020.
[9] J.-G. Sun, ''Clustering algorithms research,'' J. Softw., vol. 19, no. 1, pp. 48–61, Jun. 2008.
[10] H. Dai, Z. Chang, and N. Yu, ''Understanding data mining,'' in Introduction to Data Mining, 1st ed. Beijing, China: Tsinghua Univ. Press, 2015, pp. 1–27.
[11] Z. Huang, ''A fast clustering algorithm to cluster very large categorical data sets in data mining,'' DMKD, vol. 3, no. 8, pp. 34–39, 1997.
[12] Z. Huang, ''Extensions to the K-means algorithm for clustering large data sets with categorical values,'' Data Mining Knowl. Discovery, vol. 2, no. 3, pp. 283–304, Sep. 1998.
[13] C. Ding and X. He, ''K-nearest-neighbor consistency in data clustering: Incorporating local information into global optimization,'' in Proc. ACM Symp. Appl. Comput., Nicosia, 2004, pp. 584–589.
[14] G. Zhang, C. Zhang, and H. Zhang, ''Improved K-means algorithm based on density canopy,'' Knowl.-Based Syst., vol. 145, pp. 289–297, Apr. 2018.
[15] Z. Xianchao, ''Cluster-based partition algorithm,'' in Data Clustering, 3rd ed. Beijing, China: Science Press, 2017, pp. 37–62.
[16] D. Yueming, W. Minghui, Z. Ming, and W. Yan, ''Optimizing initial cluster centroids by SVD in K-means algorithm for Chinese text clustering,'' J. Syst. Simul., vol. 30, pp. 244–251, Oct. 2018.
[17] L. T. Z. Can, ''Nearest neighbor optimization k-means clustering algorithm,'' Comput. Sci., vol. 46, pp. 216–219, Nov. 2019.
[18] W. L. L. X. Z. Liangjun, ''Research on K-means clustering algorithm based on ARIA,'' J. Sichuan Univ. Sci. Eng. (Natural Sci. Ed.), vol. 32, pp. 65–70, Apr. 2019.
[19] Y. Xiaoli, ''Design of center random drift (CRD) K-means clustering algorithm,'' J. Changchun Univ., vol. 27, pp. 35–38, Aug. 2017.
[20] C. Guo and Y. Zang, ''Clustering algorithm based on density function and niche PSO,'' J. Syst. Eng. Electron., vol. 23, no. 3, pp. 445–452, Jun. 2012.
[21] Y. Jinping, Z. Jie, and M. Hongbiao, ''K-means clustering algorithm based on improved artificial bee colony algorithm,'' J. Comput. Appl., vol. 34, pp. 1065–1069, Jun. 2014.
[22] L. Jia-Ming, K. Li-Qun, and Y. Hong-Hong, ''K-means clustering algorithm optimized by Gray Wolf,'' Chin. Sciencepaper, vol. 14, pp. 778–782 and 807, Aug. 2019.
[23] J. C. Bezdek, R. Ehrlich, and W. Full, ''FCM: The fuzzy c-means clustering algorithm,'' Comput. Geosci., vol. 10, nos. 2–3, pp. 191–203, Jan. 1984.
[24] Y. S. Song and D. C. Park, ''Fuzzy C-means algorithm with divergence-based kernel,'' in Proc. Int. Conf. Fuzzy Syst. Knowl. Discovery, Berlin, Germany, 2006, pp. 99–108.
[25] C. L. Chowdhary and D. P. Acharjya, ''Clustering algorithm in possibilistic exponential fuzzy C-mean segmenting medical images,'' J. Biomimetics, Biomater. Biomed. Eng., vol. 30, pp. 12–23, Jan. 2017.
[26] C. Xia and F. Fang, ''Fuzzy clustering methods,'' Life Sci. Instrum., vol. 11, pp. 33–37, Dec. 2013.
[27] W. Chengmao and S. Jiamei, ''Adaptive robust picture fuzzy clustering segmentation algorithm,'' J. Huazhong Univ. Sci. Technol. (Natural Sci. Ed.), vol. 47, pp. 115–120, Sep. 2019.
[28] C. L. Chowdhary and D. P. Acharjya, ''A hybrid scheme for breast cancer detection using intuitionistic fuzzy rough set technique,'' Int. J. Healthcare Inf. Syst. Informat., vol. 11, no. 2, pp. 38–61, Apr. 2016.
[29] V. Cherkassky and F. Mulier, ''Support vector machines,'' in Learning from Data: Concepts, Theory, and Methods, vol. 1, 9th ed. New York, NY, USA: Wiley, 2007, pp. 413–417.
[30] J. Li, X. B. Gao, and L. C. Jiao, ''A new feature weighted fuzzy clustering algorithm,'' Acta Electronica Sinica, vol. 34, pp. 89–92, Jan. 2006.
[31] C. Palanisamy and S. Selvan, ''Efficient subspace clustering for higher dimensional data using fuzzy entropy,'' J. Syst. Sci. Syst. Eng., vol. 18, no. 1, pp. 95–110, Mar. 2009.
[32] C. L. Chowdhary and D. P. Acharjya, ''Breast cancer detection using intuitionistic fuzzy histogram hyperbolization and possibilitic fuzzy C-mean clustering algorithms with texture feature based classification on mammography images,'' in Proc. Int. Conf. Adv. Inf. Commun. Technol. Comput., 16th ed. Bikaner, India: ACM Press, 2016, pp. 1–6.
[33] V. Dey, D. K. Pratihar, and G. L. Datta, ''Genetic algorithm-tuned entropy-based fuzzy C-means algorithm for obtaining distinct and compact clusters,'' Fuzzy Optim. Decis. Making, vol. 10, no. 2, pp. 153–166, Jun. 2011.
[34] C. L. Chowdhary and D. P. Acharjya, ''Segmentation of mammograms using a novel intuitionistic possibilistic fuzzy C-mean clustering algorithm,'' Nature Inspired Comput., vol. 652, pp. 75–82, Jan. 2018.
[35] H. T. Hong and Yonghong, ''Automatic pattern recognition of ECG signals using entropy-based adaptive dimensionality reduction and clustering,'' Appl. Soft Comput., vol. 55, pp. 238–252, Jun. 2017.
[36] J. Zejun, ''Fuzzy set,'' in Theory and Methods of Fuzzy Mathematics. Beijing, China: Publishing House of Electronics Industry, 2015, pp. 23–27.
[37] M. X. Qing and Sun, ''A new clustering effectiveness function: Fuzzy entropy of fuzzy partition,'' CAAI Trans. Intell. Syst., vol. 10, pp. 75–80, Jan. 2015.


[38] F. Jiulun and W. Chengmao, ''Clustering validity function based on fuzzy entropy,'' Pattern Recognit. Artif. Intell., vol. 14, no. 4, pp. 390–394, Dec. 2001.
[39] Z. Liu and Y. Hu, ''Research on FCM clustering optimization algorithm for self-adaptive bacterial foraging,'' Mod. Electron. Technique, vol. 43, pp. 144–148, Mar. 2020.
[40] X. J. Gao and Pei, ''A study of weighting exponent m in a fuzzy C-means algorithm,'' Acta Electronica Sinica, vol. 28, pp. 80–83, Apr. 2000.
[41] T. Chaira, ''A novel intuitionistic fuzzy c means clustering algorithm and its application to medical images,'' Appl. Soft Comput., vol. 11, no. 2, pp. 1711–1717, Mar. 2011.
[42] S. Liao, J. Zhang, and A. Liu, ''Fuzzy C-means clustering algorithm by using fuzzy entropy constraint,'' J. Chin. Comput. Syst., vol. 35, pp. 189–193, Feb. 2014.
[43] U. H. Atasever, ''A novel unsupervised change detection approach based on reconstruction independent component analysis and ABC-kmeans clustering for environmental monitoring,'' Environ. Monitor. Assessment, vol. 191, no. 7, pp. 1–11, Jun. 2019.
[44] C. Ibrahim and I. Mougharbel, ''Two stages K-means and PSO-based method for optimal allocation of multiple parallel DRPs application & deployment,'' IET Smart Grid, vol. 3, no. 2, pp. 216–225, May 2019.
[45] T. Hu, ''Discussion of improving fuzzy K-means clustering,'' M.S. thesis, School of Data and Comput. Sci., Sun Yat-Sen Univ., Guangzhou, China, 2010.
[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, ''Scikit-learn: Machine learning in Python,'' J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Oct. 2011.
[47] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, ''API design for machine learning software: Experiences from the scikit-learn project,'' in Proc. ECML PKDD Workshop, Lang. Data Mining Mach. Learn., 2013, pp. 108–122.
[48] Iris. UCI Machine Learning Repository. Accessed: Oct. 31, 2020. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Iris
[49] Balance Scale. UCI Machine Learning Repository. Accessed: Oct. 31, 2020. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Balance+Scale
[50] Phoneme. DataHub. Accessed: Oct. 31, 2020. [Online]. Available: https://datahub.io/machine-learning/phoneme
[51] Ring. KEEL. Accessed: Oct. 31, 2020. [Online]. Available: https://sci2s.ugr.es/keel/datasets.php
[52] R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, and J. D. Knowles, ''Fifty years of pulsar candidate selection: From simple filters to a new principled real-time classification approach,'' Monthly Notices Roy. Astronomical Soc., vol. 459, no. 1, pp. 1104–1123, Apr. 2016.
[53] N. R. Pal and J. C. Bezdek, ''On cluster validity for the fuzzy c-means model,'' IEEE Trans. Fuzzy Syst., vol. 3, no. 3, pp. 370–379, Aug. 1995.

XINYU GENG is currently a Professor with the School of Computer Science, Southwest Petroleum University, Chengdu, China. His main research interests include data mining and artificial neural networks.

YUKUN MU received the bachelor's degree in network engineering from Southwest Petroleum University, Chengdu, China, where he is currently pursuing the M.S. degree in computer science and technology. His research interests include fuzzy mathematics and data mining.

SENLIN MAO received the bachelor's degree in city underground space engineering from Southwest Petroleum University, Chengdu, China, where he is currently pursuing the M.S. degree in computer science and technology. His research interests include grey system theory, data mining, and artificial neural networks.

JINCHI YE received the bachelor's degree in information management and information system from Southwest Petroleum University, Chengdu, China, where he is currently pursuing the M.S. degree in computer science and technology. His research interests include grey system theory, data mining, and artificial neural networks.

LIPING ZHU received the bachelor's degree in computer science and technology from Sichuan Normal University, Chengdu, China, where she is currently pursuing the M.S. degree in engineering management. Her research interests include data mining and artificial neural networks.
