Imbalanced Data Classification Method Based on LSSASMOTE
ABSTRACT Imbalanced data exist extensively in the real world, and the classification of imbalanced data is a hot topic in machine learning. To classify imbalanced data more effectively, an oversampling method named LSSASMOTE is proposed in this paper. First, the kernel parameter and penalty parameter of the support vector machine (SVM) are optimized using the Levy sparrow search algorithm (LSSA), with a correspondingly designed fitness function. Then, the SMOTE sampling rate is incorporated into the optimization process, and LSSA iteration is used to select the best combination of SVM parameters and SMOTE sampling rate. In addition, Tomek links are used to remove noise from the oversampled data. On this basis, the LSSASMOTE+SVM classification model is constructed to classify the imbalanced data. Eight of the datasets used in the experiments were obtained from UCI and KEEL, and the other three datasets were created manually. The experimental results confirm that the model can effectively improve the classification accuracy of imbalanced data and can serve as a new imbalanced data classification method.
INDEX TERMS Imbalanced data, machine learning, sparrow search algorithm, support vector machine,
oversampling.
Addressing class imbalance and class overlap in a rational manner can benefit both the classification process and the application of imbalanced data.
α represents a random number in (0, 1); Q represents a normally distributed random number; R2 represents a uniform random number in (0, 1); and ST represents the alarm (safety) threshold, with a value range of [0.5, 1.0]. R2 < ST indicates that the current environment is safe and the sparrows can find food; R2 > ST indicates that a predator is approaching and the sparrow issues an alert, at which point all the sparrows fly to a safe place to feed.

The primary function of the follower is to follow the discoverer. The simplified position-update formula is shown in (6):

$$x_{i,d}^{t+1}=\begin{cases}Q\cdot\exp\!\left(\dfrac{xw_{i,d}^{t}-x_{i,d}^{t}}{i^{2}}\right), & i>\dfrac{n}{2}\\[2ex]xb_{i,d}^{t}+\dfrac{1}{D}\displaystyle\sum_{d=1}^{D}\left(\mathrm{rand}\{-1,1\}\cdot\left|xb_{i,d}^{t}-x_{i,d}^{t}\right|\right), & i\le\dfrac{n}{2}\end{cases}\tag{6}$$
In formula (6), xw represents the worst position of the sparrows in the population and xb represents the best position. i > n/2 indicates that the i-th follower is hungry and flies elsewhere for food; i ≤ n/2 indicates that the follower moves to and stays near the optimal position, so its deviation from the optimal position becomes smaller. Some followers also act as scouts to help the discoverers find food. When the scouts detect danger, they immediately abandon their existing food and move to a new place, as shown in (7):

$$x_{i,d}^{t+1}=\begin{cases}xb_{i,d}^{t}+\beta\left|x_{i,d}^{t}-xb_{i,d}^{t}\right|, & f_i\neq f_g\\[1ex]x_{i,d}^{t}+K\left(\dfrac{\left|x_{i,d}^{t}-xw_{i,d}^{t}\right|}{(f_i-f_w)+\varepsilon}\right), & f_i=f_g\end{cases}\tag{7}$$

β represents a normally distributed random number; K represents a random number in [−1, 1]; ε is a very small non-zero constant; fg and fw represent the best and worst fitness values, respectively; and fi represents the individual fitness value of a sparrow. When fi = fg, the sparrow needs to change its position quickly and fly toward another sparrow to avoid danger.
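To make the updates concrete, the following is a minimal NumPy sketch of the follower update (6) and the scout update (7); the population layout, the higher-is-better fitness convention, and the function names are illustrative assumptions rather than the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def follower_update(X, fit):
    """Follower step per formula (6); X is (n, D) positions, fit is (n,) fitness (higher is better)."""
    n, D = X.shape
    xw, xb = X[np.argmin(fit)], X[np.argmax(fit)]      # worst and best positions
    X_new = X.copy()
    for i in range(1, n + 1):                          # sparrow indices 1..n as in (6)
        if i > n / 2:                                  # hungry follower flies elsewhere for food
            Q = rng.standard_normal()                  # normally distributed random number
            X_new[i - 1] = Q * np.exp((xw - X[i - 1]) / i ** 2)
        else:                                          # move toward and stay near the best position
            step = np.mean(rng.choice([-1.0, 1.0], D) * np.abs(xb - X[i - 1]))
            X_new[i - 1] = xb + step
    return X_new

def scout_update(x_i, xb, xw, f_i, f_g, f_w, eps=1e-50):
    """Scout step per formula (7) for one sparrow at position x_i with fitness f_i."""
    if f_i != f_g:                                     # edge sparrow: move toward the best position
        beta = rng.standard_normal()                   # normally distributed random number
        return xb + beta * np.abs(x_i - xb)
    K = rng.uniform(-1.0, 1.0)                         # random number in [-1, 1]
    return x_i + K * (np.abs(x_i - xw) / ((f_i - f_w) + eps))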
IV. LSSASMOTE+SVM CLASSIFICATION MODEL

A. LSSASMOTE ALGORITHM
The sparrow search algorithm suffers from a lack of diversity and weak local optimization ability in the late iterations. Ma and Zhu [27] introduced a Levy flight disturbance mechanism to enhance the optimization performance of SSA, which alleviates the optimization problem in high-dimensional spaces to some extent.

For calculating the search path L(λ) of the Levy flight, the formula for the simulated Levy flight path [28] is generally used, as shown in (8):

$$s=\frac{u}{|v|^{1/\beta}}\tag{8}$$

In formula (8), s refers to the flight path L(λ); the value range of the parameter β is 0 < β < 2, generally taken as β = 1.5; u and v are normally distributed random numbers obeying the distributions shown in (9); and the standard deviations σu and σv of the corresponding normal distributions are calculated as shown in (10):

$$\begin{cases}u\sim N(0,\sigma_u^{2})\\ v\sim N(0,\sigma_v^{2})\end{cases}\tag{9}$$

$$\sigma_u=\left[\frac{\Gamma(1+\beta)\,\sin(\pi\beta/2)}{\Gamma\!\big((1+\beta)/2\big)\,\beta\,2^{(\beta-1)/2}}\right]^{1/\beta},\qquad \sigma_v=1\tag{10}$$
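For reference, a compact Python sketch of the simulated Levy step in (8)-(10) follows; the function name and vectorized form are illustrative assumptions.

import numpy as np
from math import gamma, pi, sin

def levy_step(dim, beta=1.5, rng=None):
    """Draw one Levy flight step s = u / |v|^(1/beta) per formulas (8)-(10)."""
    if rng is None:
        rng = np.random.default_rng()
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2) /
               (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size=dim)    # u ~ N(0, sigma_u^2), formula (9)
    v = rng.normal(0.0, 1.0, size=dim)        # v ~ N(0, sigma_v^2) with sigma_v = 1, formula (10)
    return u / np.abs(v) ** (1 / beta)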
Firstly, in the individual selection, an inertia weight factor is adopted in this paper, and the roulette method is used to select sparrow individuals for the Levy flight mutation, as shown in (11):

f = 1 − iter/Max_iter   (11)

In formula (11), f is the inertia weight factor, iter ∈ {1, 2, . . . , Max_iter}, and Max_iter denotes the number of iterations of the sparrow search. If a random number rand satisfies rand > f, roulette wheel selection is applied and Levy flight mutation is performed on the selected sparrow individuals.

Secondly, the improved sparrow search algorithm is used to select the parameters of the SVM, assign different weights to different categories of samples, and reduce the dimensionality of the samples. At the same time, the SMOTE sampling rate is included in the optimization process. Through the LSSA algorithm, the problem of obtaining the optimal parameters is transformed into the problem of finding the maximum value of a function, defined as:

maximize: y = f(X), X = (x1, x2, . . . , xD)   (12)

In formula (12), f(X) is the fitness function, that is, the prediction accuracy on the minority class samples, and X refers to the position of a sparrow in the D-dimensional space of the initial population.

The SMOTE sampling rate in this algorithm is defined as:

Zmin < round(Zi) < Zmax,  i = 1, 2, . . . , M   (13)

In formula (13), Zmin and Zmax are the minimum and maximum values of the sampling ratio Zi of the minority class samples, respectively, and are determined by the number of minority class samples; Zi takes a value within this interval and is rounded by the round() function; M refers to the dimension of the decision space, that is, the number of samples in the minority class.
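For concreteness, the two search spaces in (12) and (13) can be encoded as sparrow positions as sketched below; all bound values are illustrative assumptions (the paper derives Zmin and Zmax from the minority-class size but does not list them).

import numpy as np

rng = np.random.default_rng(0)

# Stage 1, formula (12): a sparrow position X = (C, g) for the SVM parameters.
lb, ub = np.array([1e-3, 1e-4]), np.array([1e3, 1e1])   # assumed bounds for C and g
X = lb + rng.random(2) * (ub - lb)                       # one random sparrow position

# Stage 2, formula (13): one sampling ratio Z_i per minority-class sample (M of them).
M = 50                                                   # example minority-class size
Z_min, Z_max = 1, 5                                      # assumed bounds on the ratio
Z = np.round(rng.uniform(Z_min, Z_max, size=M))          # rounded ratios, Zmin < round(Zi) < Zmax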
B. DESIGN OF THE FITNESS FUNCTION
In the improved sparrow search algorithm, the individual position of a sparrow is related to its fitness value, and the complexity of the fitness function directly affects the efficiency of the algorithm. Here, the split data set is predicted and the accuracy of the prediction result is used as the fitness value. The function is constructed as shown in (14):

fitness = acc(validation(X)) + acc(train(X))   (14)

Combined with formula (12), validation in formula (14) represents the classification labels of the validation set; train, the classification labels of the training set; X, the position of a sparrow in the D-dimensional space of the initial population; acc, the accuracy of the prediction result; and fitness, the fitness value, where a higher value indicates a better individual position of the sparrow. The pseudo-code of the fitness-value solving process is shown in Algorithm 1:
Algorithm 1 Fitness Solution Algorithm
Input: population array X, validation set validation, training set Train
Output: fitness
1: Classifier ← SVM.SVC(C ← X[0], kernel, gamma ← X[1])  {construct the SVM classifier from the parameters in X and train it}
2: Train ← Classifier.Predict(Train)
3: validation ← Classifier.Predict(validation)  {calculate the training-set and validation-set prediction labels}
4: Fun(0) ← acc(validation) + acc(Train)  {calculate the initial fitness value Fun(0)}
5: pop ← X.shape(0)  {number of individuals in the population}
6: for i ← 0 to pop − 1 do
7:   fitness[i] ← Fun(X[i, :])  {calculate the fitness value of each individual}
8: end for
9: fitness ← Sort(fitness)  {sort the fitness values and select the best one}
10: return fitness
C. LSSASMOTE+SVM CLASSIFICATION MODEL
Based on the characteristics of the imbalanced data, the LSSASMOTE+SVM classification model selects the best combination of the SMOTE sampling rate, the SVM penalty parameter, and the SVM kernel parameter, and mitigates the influence of noise among samples after sampling. This classification model not only obtains a better-balanced data set but also improves the classification accuracy on imbalanced data. The steps to build the model are as follows (Steps 6 and 7 are sketched in code after this list):

Step 1: Initialize the sparrow population and set the parameters of the LSSA algorithm, which consist of the population size pop, dimension dim, maximum iteration number MaxIter, lower boundary lb, and upper boundary ub.
Step 2: Take the penalty parameter C and the kernel parameter g of the SVM as the individual position of the sparrow, learn the training set, and construct the fitness function fitness by taking the classification accuracy as the fitness value of the individual position.
Step 3: Calculate and sort the fitness values using the LSSA algorithm, iterate up to MaxIter iterations, select individuals to mutate using the roulette selection method, and choose the best combination of parameters C and g.
Step 4: Create a new sparrow population based on the number of minority class samples in each dataset.
Step 5: Take the SMOTE sampling rate as the individual position of the sparrow, combine it with the optimized SVM algorithm, and select the best sampling rate according to the fitness function.
Step 6: Perform oversampling with the selected sampling ratio and use Tomek links for noise processing to obtain a data set with balanced sample categories.
Step 7: On the balanced data set, apply the selected parameter combination C and g to establish the LSSASMOTE+SVM classification model.

In the LSSASMOTE+SVM classification model, the SMOTE sampling rate and the parameters of the SVM are the individual positions of the sparrows. Therefore, selecting the optimal parameters amounts to solving for the optimal individual positions of the sparrows. The pseudo-code of the solution process is shown in Algorithm 2.
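Under the assumption that the resampling and cleaning stages map onto imbalanced-learn's SMOTE and TomekLinks (the paper names the techniques, not this library), Steps 6 and 7 could be sketched as follows; C_best, g_best, and rate_best stand in for the LSSA-selected values, and X_train, y_train are the training split described in Section V-A.

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.svm import SVC

# Placeholders for the parameter combination selected by LSSA in Steps 2-5.
C_best, g_best, rate_best = 10.0, 0.1, 0.8     # illustrative values only

# Step 6: oversample the minority class, then remove Tomek links as noise cleaning.
X_bal, y_bal = SMOTE(sampling_strategy=rate_best).fit_resample(X_train, y_train)
X_bal, y_bal = TomekLinks().fit_resample(X_bal, y_bal)

# Step 7: train the final classifier on the balanced data with the selected C and g.
model = SVC(C=C_best, kernel="rbf", gamma=g_best).fit(X_bal, y_bal)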
V. EXPERIMENTAL

A. DATA SET PREPARATION
To verify the validity of the LSSASMOTE+SVM model, experimental analyses based on eight imbalanced datasets from UCI and KEEL are conducted in this paper, and the structural characteristics of the datasets are listed. In addition, this paper uses three manually created datasets generated by make_classification in sklearn; the performance of the model is further examined by controlling the degree of imbalance and the proportion of noisy data. The proportion of noise in simulated data 1 (Sim1) is 10%, in simulated data 2 (Sim2) is 20%, and in simulated data 3 (Sim3) is 30%. See Table (2) for an example.

To ensure the consistency of the sample imbalance rate between the validation and training sets, the data set was divided into a 70% training set, a 15% test set, and a 15% validation set. To fully validate the classification effect of the algorithm and reduce randomness, the metric results reported below are averages over five rounds of stratified cross-validation.
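The simulated datasets can be reproduced in spirit as follows; the sample size, feature count, class weights, and the use of flip_y for label noise are assumptions, since the paper specifies only the noise proportions of Sim1-Sim3.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Sim1/Sim2/Sim3 differ only in the injected label-noise proportion (10%/20%/30%).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           flip_y=0.10, random_state=0)   # flip_y=0.20 / 0.30 for Sim2 / Sim3

# 70% training, 15% validation, 15% test, stratified to keep the imbalance rate consistent.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)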
B. EVALUATION INDICATORS
The classification accuracy rate is usually taken as the evaluation indicator by the traditional SVM model. However, the accuracy rate is suitable for evaluating balanced datasets. In the classification results of imbalanced datasets, the accuracy on the minority class samples may be very low, so the overall classification accuracy cannot represent the classification results. Therefore, F_measure and G_mean were used as evaluation indicators; both are calculated from the confusion matrix [29]. See Table (3) for an example.
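Both indicators follow directly from the confusion matrix of Table (3); a minimal sketch for the binary case, treating the minority class as positive:

import numpy as np
from sklearn.metrics import confusion_matrix

def f_measure_g_mean(y_true, y_pred):
    """F_measure and G_mean from the binary confusion matrix (minority class = positive)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # true-positive rate (sensitivity)
    specificity = tn / (tn + fp)            # true-negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = np.sqrt(recall * specificity)
    return f_measure, g_mean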
C. EXPERIMENTAL RESULTS
When the data sets are the same, the sampling multiplier of each sampling method is set to three. The LSSASMOTE+SVM model was compared with the SMOTE+SVM model [8], the SSMOTE+SVM model [30], the LD-SMOTE+SVM model [31], the L-SMOTE+SVM model [32], and the FTL-SMOTE+mixed-kernel SVM model [15]. The SVM parameter C in the above algorithms uses the default regularization value of 1.0, and the parameter g uses the default value auto.

The G_mean and F_measure values of each classification model on the different datasets are shown in Table (4).

TABLE 4. Comparison of the LSSASMOTE+SVM model with other models.

According to the experimental results shown in Table (4), the LSSASMOTE+SVM model removes the effects of noise during its process, resulting in better classification results on the three artificially created simulated datasets. However, as the percentage of noise increases, the classification performance of the model in this paper shows a decreasing trend.

To further assess the classification effectiveness of the LSSASMOTE+SVM model and its superiority over other classifiers, the model was compared experimentally with the unoptimized SVM, Bayes [33], C4.5 [34], Random Forest [35], and AdaBoost [36] using the eleven datasets presented in Table (2). Both the LSSASMOTE+SVM model and the other classifiers used the dataset after LSSASMOTE optimization. The experimental results are presented in Table (5).

TABLE 5. Comparison of the LSSASMOTE-optimized SVM classifier with other classifiers.

Based on the experimental results, it can be concluded that the LSSA-SVM model outperforms most of the other classifiers on the majority of datasets, although in some cases the Random Forest and C4.5 algorithms also exhibit good performance. Therefore, the proposed LSSASMOTE+SVM model not only optimizes imbalanced datasets at the data level but also selects a suitable combination of parameters for the SVM classifier at the algorithm level, resulting in more accurate classification of imbalanced data.

During the experiments, it was found that the LSSASMOTE+SVM model has a longer running time than the other classification models. Some of the dataset runtimes are shown in Table (6).

TABLE 6. The running time of each algorithm on a subset of the data sets.

Table (6) displays three datasets with different imbalance rates, and it is evident that the LSSASMOTE+SVM algorithm requires more computational time. This is because the LSSA algorithm employs multiple sparrow individuals for the search process, along with a significant number of random perturbations and local search strategies, which require more computational resources and time. Furthermore, the LSSA algorithm in this paper optimizes the parameter search for both the sampling and classification algorithms, which also contributes to the higher running time.

VI. CONCLUSION
The LSSASMOTE+SVM classification model was proposed by combining the improved sparrow search algorithm with the SMOTE and SVM algorithms. The experiments confirm the feasibility of the proposed model. LSSA selects the best combination of parameters for this model, reduces the blindness in selecting the sampling rate of the SMOTE algorithm, and obtains a better-balanced and more stable data set. This parameter combination is conducive to improving the classification accuracy of the SVM classifier when dealing with imbalanced data. Overall, the LSSASMOTE+SVM classification model proposed in this paper has a good classification effect on imbalanced data. However, more effort is still required to reduce the running time and improve the classification accuracy of the LSSASMOTE algorithm. In addition, there is still room for improvement on multi-class problems, and further research is needed on the time complexity and boundary partitioning.

REFERENCES
[1] J. Wang and J. Yan, "Classification algorithm based on undersampling and cost-sensitiveness for unbalanced data," Comput. Appl., vol. 41, no. 1, pp. 48-52, 2021.
[2] C. Tian and L. Zhou, "Credit assessment method based on majority weight minority oversampling technique and random forest," Comput. Appl., vol. 39, no. 6, pp. 1707-1712, 2019.
[3] B. Yang, L. Shi, G. Chi, and Y. Dong, "Design and application of credit rating model based on BPNN-LDAMCE based on unbalanced data," J. Quant. Tech. Econ., vol. 39, no. 3, pp. 152-169, 2022.
[4] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2999-3007.
[5] P. Iba and W. Langley, "Experiments with easyensemble," in Proc. Int. Conf. Artif. Intell. (ICAI), 2000, pp. 9-16.
[6] M. S. Santos, P. H. Abreu, N. Japkowicz, A. Fernández, C. Soares, S. Wilk, and J. Santos, "On the joint-effect of class imbalance and overlap: A critical review," Artif. Intell. Rev., vol. 55, no. 8, pp. 6207-6275, Dec. 2022.
[7] J. Chen and Z. Zheng, "Over-sampling method on imbalanced data based on WKmeans and smote," Comput. Eng. Appl., vol. 57, no. 23, pp. 106-112, 2021.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321-357, Jun. 2002.
[9] D. Meng and Y. Li, "An imbalanced learning method by combining SMOTE with center offset factor," Appl. Soft Comput., vol. 120, May 2022, Art. no. 108618.
[10] B. Krawczyk, C. Bellinger, R. Corizzo, and N. Japkowicz, "Undersampling with support vectors for multi-class imbalanced data classification," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2021, pp. 1-7.
[11] H. Yudan, G. Qing, C. Zhihua, and Y. Lei, "Classification method for imbalance dataset based on genetic algorithm improved synthetic minority over-sampling technique," J. Comput. Appl., vol. 35, no. 1, pp. 121-124, 2015.
[12] L. Wang, M. Han, X. Li, N. Zhang, and H. Cheng, "Review of classification methods for unbalanced data sets," Comput. Eng. Appl., vol. 57, no. 22, pp. 42-52, 2021.
[13] V. Vapnik, "Statistical learning theory," Ann. Inst. Stat. Math., vol. 55, no. 2, pp. 371-389, Aug. 2003.
[14] Z. Sun, G. Wang, M. Gao, L. Gao, and A. Jiang, "Research on excess location method of sealed electronic equipment based on parameter optimization support vector machine," Electron. Meas. Instrum., vol. 35, no. 8, pp. 162-174, 2021.
[15] K. Luo and G. Wang, "Research on imbalanced data classification based on L-SMOTE and SVM," Comput. Eng. Appl., vol. 55, no. 17, pp. 55-61, 2019.
[16] H. Ma and M. Zhu, "IgwoSMOTE: An over sampling method based on improved gray wolf algorithm for SVM imbalanced data classification," Comput. Eng. Sci., vol. 44, no. 6, pp. 1133-1140, 2022.
[17] R. Xiao, Z. Feng, and J. Wang, "Concept discrimination, research progress and application analysis of swarm intelligence," J. Nanchang Inst. Technol., vol. 41, no. 1, pp. 1-21, 2022.
[18] L. Zhang, Y. Zhang, and G. Song, "LSSA-based feature extraction and classification of hyperspectral images with small training samples," Remote Sens. Lett., vol. 8, no. 7, pp. 625-634, 2017.
[19] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA, USA: Addison-Wesley, 1989.
[20] M. Dorigo and L. M. Gambardella, "Ant colony system: A cooperative learning approach to the traveling salesman problem," IEEE Trans. Evol. Comput., vol. 1, no. 1, pp. 53-66, Aug. 1997.
[21] S. Das and P. N. Suganthan, "Differential evolution: A survey of the state-of-the-art," IEEE Trans. Evol. Comput., vol. 15, no. 1, pp. 4-31, Feb. 2011.
[22] I. Tomek, "Two modifications of CNN," IEEE Trans. Syst., Man, Cybern., vol. SMC-6, no. 11, pp. 769-772, Nov. 1976.
[23] X. Fan and L. Cui, "Antitumor drug target prediction method based on network attribute and its application," Data Anal. Knowl. Discovery, vol. 2, no. 12, pp. 98-108, 2018.
[24] T.-B. Du, G.-H. Shen, Z.-Q. Huang, Y.-S. Yu, and D.-X. Wu, "Automatic traceability link recovery via active learning," Frontiers Inf. Technol. Electron. Eng., vol. 21, no. 8, pp. 1217-1225, Aug. 2020.
[25] L. Cai, G. Li, J. Fang, and L. Yu, "Research on imbalanced data clustering mining for urban hot spots," Comput. Sci., vol. 46, no. 8, pp. 16-22, 2018.
[26] J. Xue and B. Shen, "A novel swarm intelligence optimization approach: Sparrow search algorithm," Syst. Sci. Control Eng., vol. 8, no. 1, pp. 22-34, Jan. 2020.
[27] W. Ma and X. Zhu, "Sparrow search algorithm based on Levy flight disturbance strategy," J. Appl. Sci., vol. 40, no. 1, pp. 116-130, 2022.
[28] R. N. Mantegna, "Fast, accurate algorithm for numerical simulation of Levy stable stochastic processes," Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 49, no. 5, pp. 4677-4683, May 1994.
[29] A. Arshad, S. Riaz, and L. Jiao, "Semi-supervised deep fuzzy C-mean clustering for imbalanced multi-class classification," IEEE Access, vol. 7, pp. 28100-28112, 2019.
[30] C. Wang, Z. Pan, L. Dong, and C. Ma, "Research on classification for imbalanced dataset based on improved smote," Comput. Eng. Appl., vol. 49, no. 2, pp. 184-187, 2013.
[31] X. Wen, J. Chen, W. Jing, and K. Xu, "Research on optimization of classification model for imbalanced data set," Comput. Eng., vol. 44, no. 4, pp. 268-273, 2018.
[32] B. Yi, J. Zhu, and J. Li, "Imbalanced data classification on micro-credit company customer credit risk assessment using improved smote support vector machine," Chin. J. Manag. Sci., vol. 24, no. 3, pp. 24-30, 2016.
[33] D. Barber, Bayesian Reasoning and Machine Learning. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[34] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann, 1993.
[35] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5-32, 2001.
[36] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, pp. 119-139, Aug. 1995.

ZHI WANG was born in 1998. He is currently pursuing a degree with the School of Computer and Control Engineering, Yantai University, Yantai, Shandong, China. His main research interest includes imbalanced data.

QICHENG LIU was born in 1970. He received the Ph.D. degree. He is currently a Professor with the School of Computer and Control Engineering, Yantai University, Yantai, Shandong, China. His main research interests include big data, intelligent information processing, and data mining.