
Received 4 March 2023, accepted 20 March 2023, date of publication 27 March 2023, date of current version 4 April 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3262460

Imbalanced Data Classification Method Based on LSSASMOTE
ZHI WANG AND QICHENG LIU
School of Computer and Control Engineering, Yantai University, Yantai, Shandong 264000, China
Corresponding author: Qicheng Liu ([email protected])
This work was supported by the National Natural Science Foundation of China under Grant 62272405.

ABSTRACT Imbalanced data exist extensively in the real world, and the classification of imbalanced data is a hot topic in machine learning. To classify imbalanced data more effectively, an oversampling method named LSSASMOTE is proposed in this paper. First, the kernel function parameter and penalty parameter of the support vector machine (SVM) were optimized using the Lévy sparrow search algorithm (LSSA), and a corresponding fitness function was designed. Then, the SMOTE sampling rate was incorporated into the optimization process, and LSSA iterations were used to select the best combination of SVM parameters and SMOTE sampling rate. In addition, the oversampled samples were denoised with Tomek links. On this basis, the LSSASMOTE+SVM classification model was constructed to classify the imbalanced data. Eight of the datasets used in the experiments were obtained from UCI and KEEL, and the other three were created manually. The experimental results confirm that the model can effectively improve the classification accuracy of imbalanced data and can serve as a new imbalanced data classification method.

INDEX TERMS Imbalanced data, machine learning, sparrow search algorithm, support vector machine,
oversampling.

The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.

I. INTRODUCTION
Imbalanced datasets refer to two sets of instances with significant imbalance and asymmetry. The class with the larger amount of data in the dataset is called the majority class, while the one with the smaller amount of data is called the minority class [1]. Imbalanced data arise in various applications, such as medical diagnosis, garbage detection, and credit risk identification [2]. When classifying data, most classification algorithms learn models by minimizing the overall misclassification rate without considering the differences in sample sizes between categories. In imbalanced data, where the minority classes have few samples, the classifier may prefer to assign samples to the majority class, leading to decision boundaries biased toward the majority class. This bias can have negative consequences in practical applications. For example, in identifying bank credit risk, the number of customers with bad credit is much lower than the number of customers with good credit, and customers wrongly classified as having good credit can cause financial losses to the bank loan business [3]. Therefore, in some practical situations, it is very important to classify minority classes accurately.

There are various approaches to address the imbalanced data problem, such as resampling, algorithm tuning, ensemble learning, and deep learning. These methods differ in their technical means and implementation, and some newer techniques have further innovated and improved on them. For instance, the SMOTE algorithm balances data by synthesizing minority-class samples, the Focal Loss method [4] improves classification accuracy by reducing the weight of easily classified samples, and the EasyEnsemble algorithm [5] splits the data into multiple subsets and trains a classifier on each subset to enhance overall classification performance.

Moreover, solving class imbalance often gives rise to the class overlap problem [6], where the boundaries between different classes become blurred, making them challenging to differentiate. Methods to address this problem include feature selection, feature extraction, ensemble learning, and anomaly detection. These methods offer significant help in the application of imbalanced data and in the classification process. Addressing class imbalance and class overlap in a rational manner can be beneficial for the classification process and the application of imbalanced data.

II. RELATED WORK
Existing imbalanced data classification solutions are broadly divided into deep and shallow models. Deep models are neural network-based models that automatically learn feature representations suited to imbalanced data and can take the weight differences between categories into account during learning. The main focus of this paper is the use of shallow models for handling imbalanced data; these typically refer to traditional machine learning models that rely on algorithm tuning and data preprocessing [7].

At the data level, the dataset is mainly processed with oversampling and undersampling methods, among which the synthetic minority oversampling technique (SMOTE) [8] is a classic oversampling method. It replicates minority samples based on the principle of random linear interpolation, thereby increasing the number of minority samples and improving the classification accuracy for the minority class. However, since SMOTE fails to consider the distribution of adjacent samples and is blind in the selection of the sampling magnification, it risks introducing noise instances and overfitting. Various attempts have been made to improve the SMOTE algorithm with respect to these two problems. For example, Meng and Li [9] proposed a method combining a center offset factor with SMOTE, which first removes noise using Tomek links, then calculates the center offset factor to select sparsely distributed minority-class samples, and combines these samples with SMOTE to generate better minority samples, thus improving classification performance on imbalanced datasets. Krawczyk et al. [10] proposed an undersampling method using support vector machine optimization to improve computing time and classification accuracy. Huo et al. [11] proposed the GASMOTE oversampling method, which selects the SMOTE sampling rate using a genetic algorithm, giving the sampling rate a certain flexibility; however, its classification accuracy still needs improvement.

At the algorithm level, representative algorithms include cost-sensitive random forests and support vector machines [12]. The support vector machine (SVM) [13], proposed by Vapnik, is characterized by a solid theoretical foundation, simple implementation, good generalization, and excellent classification performance. However, when SVM deals with imbalanced classification problems, its accuracy suffers from misclassification and inseparability. Misclassification occurs because the sample distribution of the classes is imbalanced: the minority class is much smaller than the majority class, which makes the classification hyperplane tilt toward the majority class. In addition, the values of the related SVM parameters are also significant in the classification process [14]. For example, the penalty parameter C will take a higher value for minority-class samples during imbalanced classification, which may deviate from the probability distribution of the initial data. Therefore, different solutions have been proposed to improve the classification performance of SVM. Luo and Wang [15] proposed the FTL-SMOTE algorithm and introduced a hybrid kernel function to improve SVM, which effectively improves classification on imbalanced data. Ma and Zhu [16] proposed the IGWOSMOTE algorithm, which combines the gray wolf algorithm with SMOTE to reduce the blindness of the SMOTE sampling rate and improves SVM through the gray wolf algorithm, significantly improving overall classification accuracy. However, the gray wolf algorithm easily gets stuck in a local optimum, which affects the oversampling result. The above studies improve the SMOTE and SVM algorithms, but they still need to be optimized in terms of parameter selection and classification accuracy.

In this paper, the problems encountered in imbalanced data classification are considered together, taking swarm intelligence optimization algorithms [17] as inspiration. An oversampling method based on the Lévy sparrow search algorithm (LSSA) [18] is therefore proposed. The sparrow search algorithm performs better than genetic algorithms (GA) [19], ant colony algorithms (ACO) [20], and differential evolution algorithms (DE) [21] on function optimization problems, as shown in Table 1. LSSA is a sparrow search algorithm augmented with Lévy flight, which achieves better search efficiency and results through techniques such as random walks and pruning. Based on the excellent performance of LSSA, this paper first introduces LSSA to optimize the kernel function and penalty parameters of SVM and designs a corresponding fitness function. Second, the SMOTE sampling rate is included in the optimization process, and LSSA is used to iteratively select the best combination of SVM parameters and SMOTE sampling rate. Finally, Tomek links [22] are used to denoise the oversampled samples and resolve the class overlap problem, building the LSSASMOTE+SVM classification model.

TABLE 1. Number of iterations of the optimization algorithm.

III. RELATED THEORIES
A. TOMEK LINK
Tomek Link is a method for solving the class overlap problem by effectively identifying and removing noisy points between adjacent classes, thus improving the performance of the classifier.

The core idea of the Tomek Link algorithm is to find pairs of sample points from adjacent classes that are each other's nearest neighbors yet carry different labels; such a pair forms a Tomek link and marks noisy points that need to be removed. In Figure 1, samples A and B belong to different categories. The nearest neighbor of A is B, and the nearest neighbor of B is A, so A and B form a Tomek link, and the entire Tomek link is deleted. As shown in Figure 2, the boundaries between the samples then become more apparent, the noise samples are removed, the classification difficulty is reduced, and the classification accuracy is improved.

FIGURE 1. Before deleting Tomek links.

FIGURE 2. After deleting Tomek links.
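To make the procedure concrete, the following sketch (our own illustration rather than the paper's code; it assumes scikit-learn's NearestNeighbors) detects mutual-nearest-neighbor pairs with different labels and drops both endpoints, matching the deletion of the entire Tomek link described above:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_links(X, y):
    """Delete both endpoints of every Tomek link (a hedged sketch)."""
    # Nearest neighbor of each sample, excluding the sample itself.
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    nearest = idx[:, 1]

    to_drop = set()
    for i, j in enumerate(nearest):
        # Mutual nearest neighbors with different labels form a Tomek link.
        if nearest[j] == i and y[i] != y[j]:
            to_drop.update((i, int(j)))

    keep = np.array([i for i in range(len(y)) if i not in to_drop], dtype=int)
    return X[keep], y[keep]
```

For comparison, the imbalanced-learn library ships a ready-made TomekLinks resampler, although by default it removes only the majority-class endpoint of each link.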
can be a finder and follower, and all sparrows can detect
B. SUPPORT VECTOR MACHINES
The support vector machine (SVM), a model for small and medium-sized data samples, maps the feature vectors of the samples to points in space and draws a separating line to distinguish the two types of points, dividing the plane as shown in (1):

$$\min \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{m}\varepsilon_{i} \quad \text{s.t.}\; y_{i}(\omega^{T}x_{i}+b) \geq 1-\varepsilon_{i},\; i=1,2,\cdots,m \tag{1}$$

In formula (1), ω represents the normal vector of the hyperplane; x_i represents the training sample; y_i represents the sample category; b represents the threshold of sample training; C represents the penalty parameter; and ε_i represents the slack variable.

For nonlinear cases, the kernel function k(x_i, x_j) is introduced to map the samples from the low-dimensional space to a high-dimensional space, so that the samples become separable there [23], as shown in (2):

$$f(x) = \sum_{i=1}^{m}\alpha_{i}y_{i}k(x_{i},x_{j}) + b \tag{2}$$

The radial basis function (RBF) kernel is generally adopted for k(x_i, x_j), as shown in (3):

$$k(x_{i},x_{j}) = \exp\left(-g\,\|x_{i}-x_{j}\|^{2}\right) \tag{3}$$

According to formulas (1) to (3), the penalty parameter C and the kernel parameter g of the SVM should be optimized. A swarm intelligence algorithm can therefore be used to select the optimal SVM parameters and improve its classification performance.
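For reference, the short sketch below (our own example with placeholder values, not the paper's settings) shows how these two parameters map onto scikit-learn's SVC, where g corresponds to the gamma argument of the RBF kernel:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A small imbalanced toy problem (9:1) standing in for a real dataset.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# C is the penalty parameter from (1); gamma is the kernel width g from (3).
# The values below are placeholders; in this paper they are selected by LSSA.
clf = SVC(kernel="rbf", C=10.0, gamma=0.5).fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
```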
C. SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE
The basic idea of the synthetic minority oversampling technique (SMOTE) is to add new samples to the dataset along the intervals between minority samples, so that the numbers of positive and negative samples become balanced [24]. The basic flow of the algorithm is as follows:
(1) For each minority sample z, the distances to all other minority samples are calculated using the Euclidean distance to obtain its k nearest neighbors.
(2) Given a sampling magnification N, samples z_n are randomly selected from the k neighbors of the minority sample z [25].
(3) A new sample is constructed from z_n and z, as shown in (4):

$$z_{new} = z + rand(0,1)\cdot|z-z_{n}| \tag{4}$$
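The three steps translate directly into code. The sketch below is our own minimal implementation; note that classical SMOTE [8] interpolates with (z_n − z), whereas formula (4) as printed uses the absolute difference, which the sketch follows:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(Z_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (a hedged sketch).

    Z_min : minority-class samples, shape (n, d), with n > k.
    """
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Z_min)
    _, idx = nn.kneighbors(Z_min)          # step (1): k nearest neighbors of each z
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(Z_min))       # pick a minority sample z
        j = rng.choice(idx[i, 1:])         # step (2): one of its k neighbors z_n
        z, zn = Z_min[i], Z_min[j]
        # Step (3), formula (4): z_new = z + rand(0,1) * |z - z_n|.
        synthetic.append(z + rng.random() * np.abs(z - zn))
    return np.asarray(synthetic)
```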

D. SPARROW SEARCH ALGORITHM
The sparrow search algorithm (SSA) [26] is a swarm intelligence optimization algorithm that mimics sparrow foraging behavior. Each sparrow has a location attribute indicating the place where it finds food. At the same time, every sparrow can act as a discoverer or a follower, and all sparrows can detect danger and give warnings. The position of each sparrow in the D-dimensional space is X = (x_1, x_2, ..., x_D), with fitness value f_i = f(x_1, x_2, ..., x_D).

The position update of the discoverer is shown in (5):

$$x_{i,d}^{t+1} = \begin{cases} x_{i,d}^{t}\cdot\exp\left(\dfrac{-i}{\alpha\cdot iter_{max}}\right), & R_{2} < ST \\[2mm] x_{i,d}^{t} + Q\cdot D, & R_{2} \geq ST \end{cases} \tag{5}$$

In formula (5), x_{i,d}^{t+1} represents the position of the i-th individual of generation t+1 of the population in dimension d; α represents a random number in (0,1); Q represents a normally distributed random number; R_2 represents a uniform random number in (0,1); and ST represents the alarm threshold (safety value), with a value range of [0.5, 1.0]. R_2 < ST indicates that the current environment is safe and the sparrows can forage; R_2 ≥ ST indicates that a predator is approaching and the sparrow issues an alert, at which time all the sparrows fly to a safe place to feed.
The primary function of the follower is to follow the discoverer. The simplified position update is shown in (6):

$$x_{i,d}^{t+1} = \begin{cases} Q\cdot\exp\left(\dfrac{xw_{i,d}^{t}-x_{i,d}^{t}}{i^{2}}\right), & i > \dfrac{n}{2} \\[2mm] xb_{i,d}^{t} + \dfrac{1}{D}\sum_{d=1}^{D}\left(rand\{-1,1\}\cdot\left|xb_{i,d}^{t}-x_{i,d}^{t}\right|\right), & i \leq \dfrac{n}{2} \end{cases} \tag{6}$$

Here xw represents the worst position of the sparrows in the population and xb represents the best position. i > n/2 indicates that the i-th follower is hungry and flies elsewhere for food; i ≤ n/2 indicates that the follower moves toward the best position and stays near it, so its deviation from the optimal position becomes smaller.

Some followers also act as scouts to help discoverers find food. When the scouts detect danger, they immediately abandon their existing food and move to a new place, as shown in (7):

$$x_{i,d}^{t+1} = \begin{cases} x_{i,d}^{t} + \beta\left|x_{i,d}^{t}-xb_{i,d}^{t}\right|, & f_{i} \neq f_{g} \\[2mm] x_{i,d}^{t} + K\left(\dfrac{\left|x_{i,d}^{t}-xw_{i,d}^{t}\right|}{(f_{i}-f_{w})+\varepsilon}\right), & f_{i} = f_{g} \end{cases} \tag{7}$$

In formula (7), β represents a normally distributed random number; K represents a random number in the range [-1, 1]; ε represents a very small non-zero number; f_g and f_w represent the best and worst fitness values, respectively; and f_i represents the individual fitness value of a sparrow. When f_i = f_g, the sparrow needs to change its position quickly and fly toward another sparrow to avoid danger.
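For readers who prefer code, the following sketch is our own paraphrase of formulas (5) to (7) for one SSA generation; the bounds handling, the 20% scout fraction, and the larger-is-better fitness convention are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def ssa_step(X, fit, max_iter, pd=0.7, st=0.8):
    """One generation of the sparrow search algorithm (a hedged sketch).

    X   : positions, shape (n, D), sorted so that X[0] is the best sparrow.
    fit : fitness values matching X (larger is better in this sketch).
    """
    n, D = X.shape
    n_prod = int(n * pd)                   # number of discoverers/producers
    xb, xw = X[0].copy(), X[-1].copy()     # best and worst positions
    R2 = rng.random()                      # alarm value

    for i in range(n_prod):                # discoverers, formula (5)
        if R2 < st:
            X[i] = X[i] * np.exp(-(i + 1) / (rng.random() * max_iter))
        else:
            X[i] = X[i] + rng.normal() * np.ones(D)

    for i in range(n_prod, n):             # followers, formula (6)
        if i > n / 2:
            X[i] = rng.normal() * np.exp((xw - X[i]) / (i + 1) ** 2)
        else:
            signs = rng.choice([-1.0, 1.0], size=D)
            X[i] = xb + np.mean(signs * np.abs(xb - X[i]))

    fg, fw = fit[0], fit[-1]               # danger-aware sparrows, formula (7)
    for i in rng.choice(n, size=max(1, n // 5), replace=False):
        if fit[i] != fg:
            X[i] = X[i] + rng.normal() * np.abs(X[i] - xb)
        else:
            K = rng.uniform(-1.0, 1.0)
            X[i] = X[i] + K * np.abs(X[i] - xw) / (fit[i] - fw + 1e-12)
    return X
```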
IV. LSSASMOTE+SVM CLASSIFICATION MODEL
A. LSSASMOTE ALGORITHM
The sparrow search algorithm suffers from a lack of diversity and weak local optimization ability in late iterations. Ma and Zhu [27] introduced a Lévy flight disturbance mechanism to enhance the optimization performance of SSA, which solves the optimization problem in high-dimensional spaces to some extent.

For the search path L(λ) of the Lévy flight, the standard simulation formula [28] is generally used, as shown in (8):

$$s = \frac{u}{|v|^{1/\beta}} \tag{8}$$

In formula (8), s refers to the flight path L(λ); the parameter β satisfies 0 < β < 2, generally taking β = 1.5; and u and v are normally distributed random numbers obeying (9):

$$u \sim N(0,\sigma_{u}^{2}), \quad v \sim N(0,\sigma_{v}^{2}) \tag{9}$$

The standard deviations σ_u and σ_v of the corresponding normal distributions are computed as shown in (10):

$$\sigma_{u} = \left\{\frac{\Gamma(1+\beta)\,\sin(\pi\beta/2)}{\Gamma[(1+\beta)/2]\,\beta\,2^{(\beta-1)/2}}\right\}^{1/\beta}, \quad \sigma_{v} = 1 \tag{10}$$

First, for individual selection, an inertia weight factor is adopted in this paper, and the roulette method is used to select sparrow individuals for the Lévy flight mutation, as shown in (11):

$$f = 1 - iter/Max\_iter \tag{11}$$

In formula (11), f is the inertia weight factor, iter ∈ {1, 2, ..., Max_iter}, and Max_iter denotes the number of iterations of the sparrow search. If a random number rand > f, roulette wheel selection is used and Lévy flight mutation is performed on the selected sparrow individuals.

Second, the improved sparrow search algorithm is used to select the SVM parameters, assign different weights to different categories of samples, and reduce the sample dimensionality. At the same time, the value of the SMOTE sampling rate is included in the optimization process. Through the LSSA algorithm, the problem of obtaining the optimal parameters is transformed into the problem of maximizing a function, defined as:

$$\text{maximize}:\; y = f(X), \quad X = (x_{1}, x_{2}, \ldots, x_{D}) \tag{12}$$

In formula (12), f(X) is the fitness function, that is, the prediction accuracy on samples of the minority class, and X refers to the position of a sparrow in the D-dimensional space of the initial population.

The SMOTE sampling rate in this algorithm is defined as:

$$Z_{min} < round(Z_{i}) < Z_{max}, \quad i = 1, 2, \ldots, M \tag{13}$$

In formula (13), Z_min and Z_max are the minimum and maximum values of the sampling ratio Z_i of the minority-class samples, respectively, determined by the number of minority-class samples; Z_i takes a value within this interval and is rounded by the round() function; M refers to the dimension of the decision space, that is, the number of minority-class samples.
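The Mantegna simulation of formulas (8) to (10) and the inertia-weight gate of formula (11) can be sketched as follows (our own illustration):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def levy_step(dim, beta=1.5):
    """Lévy-distributed step via Mantegna's algorithm, formulas (8)-(10)."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
               ) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size=dim)   # u ~ N(0, sigma_u^2)
    v = rng.normal(0.0, 1.0, size=dim)       # v ~ N(0, sigma_v^2), sigma_v = 1
    return u / np.abs(v) ** (1 / beta)       # s = u / |v|^(1/beta), formula (8)

def selected_for_mutation(iteration, max_iter):
    """Formula (11): the inertia weight f shrinks over time, so later
    iterations send more individuals through the roulette/Lévy mutation."""
    f = 1 - iteration / max_iter
    return rng.random() > f
```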



B. DESIGN OF THE FITNESS FUNCTION
In the improved sparrow search algorithm, the individual position of a sparrow is tied to its fitness value, and the complexity of the fitness function directly affects the efficiency of the algorithm. Here, the split datasets are predicted and the accuracy of the prediction results is used as the fitness value. The function is constructed as shown in (14):

$$fitness = acc(validation(X)) + acc(train(X)) \tag{14}$$

Combined with formula (12), validation in formula (14) represents the classification labels of the validation set; train, the classification labels of the training set; X, the position of a sparrow in the D-dimensional space of the initial population; acc, the accuracy of the prediction result; and fitness, the fitness value, where a higher value indicates a better individual sparrow position. The pseudo-code of the fitness solving process is shown in Algorithm 1.
Algorithm 1 Fitness Solution Algorithm
Input: population array X, validation set validation, training set Train
Output: fitness
1: Classifier ← SVM.SVC(C ← X[0], kernel, gamma ← X[1])  {train the SVM classifier with the parameters encoded in X}
2: Train ← Classifier.Predict(Train)
3: validation ← Classifier.Predict(validation)  {compute the training-set and validation-set prediction labels}
4: Fun(0) ← acc(validation) + acc(Train)  {calculate the initial fitness value Fun(0)}
5: pop ← X.shape(0)
6: for i ← 0 to pop by 1 do
7:   fitness[i] ← Fun(X[i, :])  {calculate the fitness value of each individual}
8: end for
9: fitness ← Sort(fitness)  {sort the fitness values and select the best one}
10: return fitness
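A runnable counterpart of Algorithm 1 and formula (14) might look like this (the function names and data-passing convention are our own assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def fitness(position, X_train, y_train, X_val, y_val):
    """Fitness of one sparrow position per formula (14) (a hedged sketch).

    position[0] encodes the SVM penalty parameter C and position[1]
    the RBF kernel parameter g, mirroring Algorithm 1.
    """
    clf = SVC(kernel="rbf", C=position[0], gamma=position[1])
    clf.fit(X_train, y_train)
    # fitness = acc(validation(X)) + acc(train(X)), formula (14)
    return (accuracy_score(y_val, clf.predict(X_val))
            + accuracy_score(y_train, clf.predict(X_train)))

def population_fitness(X_pop, data):
    """Evaluate and sort a whole population, as in lines 5-10 of Algorithm 1."""
    scores = np.array([fitness(p, *data) for p in X_pop])
    order = np.argsort(scores)[::-1]       # best (largest) fitness first
    return X_pop[order], scores[order]
```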
C. LSSASMOTE+SVM CLASSIFICATION MODEL
Based on the characteristics of the imbalanced data, the LSSASMOTE+SVM classification model selects the best combination of the SMOTE sampling rate, the SVM penalty parameter, and the kernel parameter, and mitigates the influence of noise among samples after oversampling. This classification model not only obtains better-balanced data but also improves the classification accuracy on imbalanced data. The steps to build the model are as follows (a compact code sketch of the assembled pipeline is given at the end of this subsection):
Step 1: Initialize the sparrow population and set the parameters of the LSSA algorithm: population size pop, dimension dim, maximum iteration number MaxIter, lower boundary lb, and upper boundary ub.
Step 2: Take the penalty parameter C and the kernel parameter g of the SVM as the individual position of a sparrow, learn on the training set, and construct the fitness function fitness by taking the classification accuracy as the fitness value of the sparrow's position.
Step 3: Calculate and sort the fitness values using the LSSA algorithm, iterate up to MaxIter times, select individuals to mutate using roulette selection, and choose the best combination of parameters C and g.
Step 4: Create a new sparrow population based on the number of minority-class samples in the different datasets.
Step 5: Take the sampling rate of the SMOTE algorithm as the individual position of a sparrow, combine it with the optimized SVM algorithm, and select the best sampling rate according to the fitness function.
Step 6: Perform oversampling with the selected sampling rate and use Tomek links for noise processing to obtain a dataset with balanced sample categories.
Step 7: With the balanced dataset and the selected parameter combination C and g, establish the LSSASMOTE+SVM classification model.
In the LSSASMOTE+SVM classification model, the SMOTE sampling rate and the SVM parameters are the individual positions of the sparrows, so selecting the optimal parameters amounts to solving for the optimal individual sparrow positions. The pseudo-code of the solution process is shown in Algorithm 2.
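The sketch promised above wires Steps 1 to 7 together using the imbalanced-learn and scikit-learn APIs; the constants stand in for the (C, g, sampling rate) triple that the LSSA search of Algorithms 1 and 2 would return, so this is an illustration rather than the authors' implementation:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=500, weights=[0.9, 0.1],
                                        random_state=0)

# Steps 1-5 would run LSSA (Algorithms 1 and 2) to pick these three values;
# the constants below are placeholders standing in for the search result.
C, g, rate = 10.0, 0.5, 0.6

# Step 6: oversample the minority class, then clean noise with Tomek links.
X_bal, y_bal = SMOTE(sampling_strategy=rate,
                     random_state=0).fit_resample(X_train, y_train)
X_bal, y_bal = TomekLinks().fit_resample(X_bal, y_bal)

# Step 7: train the final classifier with the selected parameter combination.
model = SVC(kernel="rbf", C=C, gamma=g).fit(X_bal, y_bal)
```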

V. EXPERIMENTAL
A. DATA SET PREPARATION
To verify the validity of the LSSASMOTE+SVM model, experimental analyses based on eight imbalanced datasets from UCI and KEEL are conducted in this paper, and the structural characteristics of the datasets are listed. In addition, three manually created datasets generated by make_classification in sklearn are used. The performance of the model is further demonstrated by controlling the degree of imbalance and the proportion of noisy data: the proportion of noise in simulated data 1 (Sim1) is 10%, in simulated data 2 (Sim2) 20%, and in simulated data 3 (Sim3) 30%. See Table 2 for an overview.

TABLE 2. Data from ten experiments created by UCI and KEEL and manually.

To keep the sample imbalance rate consistent between the validation and training sets, the dataset was divided into 70% training set, 15% test set, and 15% validation set. To fully validate the classification effect of the algorithm and reduce randomness, the metric results below are averages obtained over five rounds of stratified cross-validation.
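Under these assumptions, the split can be sketched as follows (the seed and toy data are our own choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 70/15/15 split; stratify keeps the imbalance rate consistent across subsets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)
```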

Algorithm 2 Optimal Sparrow Individual Position
Input: population size pop, dimension dim, maximum iteration number MaxIter, population array X, fitness
Output: optimal individual position GbestPosition
1: ST ← 0.6  {warning value}
2: PD ← 0.7  {ratio of discoverers}
3: SD ← 0.2  {proportion of danger-aware sparrows}
4: GbestPosition ← X
5: for i ← 1 to MaxIter by 1 do
6:   X ← PDUpdate(X, ST, dim)
7:   X ← JDUpdate(X, PD, dim)
8:   X ← SDUpdate(X, SD, dim, fitness)  {update sparrow positions by formulas (5) to (7)}
9:   GbestPosition ← X
10:  factor ← 1 − i/MaxIter  {inertia factor}
11:  for j ← 1 to pop by 1 do
12:    if random() > factor then
13:      L ← Levy(dim)  {Lévy flight mutation on the selected individual}
14:      ds ← L · (X[j, :] − GbestPosition[0, :])
15:      Temp ← X[j, :] + ds
16:      fitnew ← Fun(Temp[0, :])  {calculate the new fitness value}
17:    end if
18:  end for
19: end for
20: fitness ← Sort(fitness)  {sort the new fitness values}
21: X ← SortPosition(X)  {population sorting}
22: GbestPosition[0, :] ← copy(X[0, :])  {update the optimal individual position}
23: return GbestPosition
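Lines 10 to 18 of the listing correspond to the sketch below; the greedy acceptance of an improved position is our own assumption, since the listing only computes the new fitness before the final sort:

```python
import numpy as np

rng = np.random.default_rng(0)

def levy_mutation(X_pop, gbest, fit_fn, i, max_iter, levy_step):
    """Roulette-gated Lévy mutation, lines 10-18 of Algorithm 2 (a sketch).

    X_pop : population, shape (pop, dim); gbest : best position so far.
    levy_step : function drawing a Lévy step, e.g. the Mantegna sketch above.
    """
    factor = 1 - i / max_iter                    # inertia factor, formula (11)
    for j in range(len(X_pop)):
        if rng.random() > factor:                # mutate more as iterations grow
            L = levy_step(X_pop.shape[1])
            trial = X_pop[j] + L * (X_pop[j] - gbest)
            # Greedy acceptance is our assumption, not stated in the listing.
            if fit_fn(trial) > fit_fn(X_pop[j]):
                X_pop[j] = trial
    return X_pop
```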
the larger G_mean and F_measure indicate that the model is
B. EVALUATION INDICATORS
The classification accuracy rate is usually taken as the evaluation indicator by the traditional SVM model. However, the accuracy rate is suitable for evaluating balanced datasets. In the classification results of imbalanced datasets, the accuracy on the few minority samples may be very low even when the overall accuracy is high, so the classification accuracy alone cannot represent the classification results. Therefore, F_measure and G_mean were used as evaluation indicators, both calculated from a confusion matrix [29]; see Table 3.

TABLE 3. Confusion matrix.

According to Table 3, the following evaluation indicators are readily calculated.
The recall rate of the majority samples is shown in (15):

$$rr_{TP} = \frac{TP}{TP+FN} \times 100\% \tag{15}$$

The recall rate of the minority samples is shown in (16):

$$rr_{TN} = \frac{TN}{FN+TN} \times 100\% \tag{16}$$

The precision of the minority samples is shown in (17):

$$pr_{TN} = \frac{TN}{FP+TN} \times 100\% \tag{17}$$

The G_mean value is shown in (18):

$$G\_mean = \sqrt{rr_{TN} \times rr_{TP}} \times 100\% \tag{18}$$

The F_measure value is shown in (19):

$$F\_measure = \frac{2\,rr_{TN} \times pr_{TN}}{rr_{TN}+pr_{TN}} \times 100\% \tag{19}$$

The G_mean value considers the recall rates of both classes and grows only when both recall rates are large, so a larger G_mean indicates a stronger ability of the model to classify both types of samples; G_mean thus reveals the overall performance of the model. The F_measure value comprehensively considers the precision and the recall rate of the minority samples, so a larger F_measure indicates a more accurate classification of the minority class. Therefore, larger G_mean and F_measure values indicate that the model is more effective in classifying imbalanced data, and these two metrics are mainly used in the experimental part of this paper to evaluate classification performance.
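Because the exact cell layout of Table 3 is not recoverable here, the sketch below implements formulas (15) to (19) literally and adopts scikit-learn's confusion-matrix ordering as an assumption; the TN/FP/FN/TP mapping may need to be swapped to match the paper's Table 3:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def gmean_fmeasure(y_true, y_pred):
    """G_mean and F_measure per formulas (15)-(19) (a hedged sketch)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    rr_tp = tp / (tp + fn)        # recall of the majority samples, formula (15)
    rr_tn = tn / (fn + tn)        # recall of the minority samples, formula (16)
    pr_tn = tn / (fp + tn)        # precision of the minority samples, formula (17)
    g_mean = np.sqrt(rr_tn * rr_tp)                    # formula (18)
    f_measure = 2 * rr_tn * pr_tn / (rr_tn + pr_tn)    # formula (19)
    return g_mean, f_measure
```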

C. EXPERIMENTAL RESULTS
With identical datasets, the sampling multiplier of each sampling method is set to three. The LSSASMOTE+SVM model is compared with the SMOTE+SVM model [8], the SSMOTE+SVM model [30], the LD-SMOTE+SVM model [31], the L-SMOTE+SVM model [32], and the FTL-SMOTE+mixed-kernel SVM model [15]. The SVM parameter C in these baselines uses the default regularization value of 1.0, and the parameter g uses the default value auto.

TABLE 4. Comparison of the LSSASMOTE+SVM model with other models.

TABLE 5. Comparison of LSSASMOTE optimized SVM classifier with other classifiers.

The G_mean and F_measure values of each classification model on the different datasets are shown in Table 4. According to these experimental results, the F_measure values of the LSSASMOTE+SVM classification model on the Yeast3, Ecoli2, Blood, Glass0, and Pima datasets are better than those of the other models. Moreover, the G_mean value also achieves favorable results on six datasets, namely Yeast3, Ecoli2, Blood, Seeds, Pima, and Ionosphere, and the average G_mean across all eight datasets is 9.11% higher than that of the basic SMOTE+SVM model.

The LSSASMOTE+SVM model removes the effects of noise during its process, so the model obtains better classification results on the three artificially created simulated datasets. However, as the percentage of noise increases, the classification performance of the model shows a decreasing trend.

To further assess the classification effectiveness of the LSSASMOTE+SVM model and its superiority over other classifiers, the model was compared experimentally with unoptimized SVM, Bayes [33], C4.5 [34], random forest [35], and AdaBoost [36] on the eleven datasets presented in Table 2. Both LSSASMOTE+SVM and the other classifiers used the dataset after LSSA-SMOTE optimization. The experimental results are presented in Table 5. Based on these results, the LSSA-SVM model outperforms most of the other classifiers on the majority of datasets, although in some cases the random forest and C4.5 algorithms also exhibit good performance. Therefore, the proposed LSSASMOTE+SVM model not only optimizes imbalanced datasets at the data level but also selects a suitable parameter combination for the SVM classifier at the algorithm level, resulting in more accurate classification of imbalanced data.

During the experiments, it was found that the LSSASMOTE+SVM model has a longer running time than the other classification models. Some of the dataset runtimes are shown in Table 6, which displays three datasets with different imbalance rates; it is evident that the LSSASMOTE+SVM algorithm requires more computational time. This is because the LSSA algorithm employs multiple sparrow individuals in the search process, along with a significant number of random perturbations and local search strategies, which demand more computational resources and time. Furthermore, the LSSA algorithm in this paper performs parameter search for both the sampling and the classification algorithm, which also contributes to the higher running time.

TABLE 6. The running time of each algorithm in a partial data set.

VI. CONCLUSION
The LSSASMOTE+SVM classification model was proposed by combining the improved sparrow search algorithm with the SMOTE and SVM algorithms. The experiments confirm the feasibility of the proposed model. LSSA selects the best combination of parameters for this model, reduces the blindness of the SMOTE sampling-rate selection, and obtains a better-balanced and more stable dataset. This parameter combination is conducive to improving the classification accuracy of the SVM classifier on imbalanced data. Overall, the LSSASMOTE+SVM classification model proposed in this paper achieves a good classification effect on imbalanced data. However, more effort is still required to reduce the running time and improve the classification accuracy of the LSSASMOTE algorithm. In addition, there is still room for improvement on multi-class classification problems, and further research is needed on time complexity and boundary partitioning.

REFERENCES
[1] J. Wang and J. Yan, "Classification algorithm based on undersampling and cost-sensitiveness for unbalanced data," Comput. Appl., vol. 41, no. 1, pp. 48–52, 2021.
[2] C. Tian and L. Zhou, "Credit assessment method based on majority weight minority oversampling technique and random forest," Comput. Appl., vol. 39, no. 6, pp. 1707–1712, 2019.
[3] B. Yang, L. Shi, G. Chi, and Y. Dong, "Design and application of credit rating model based on BPNN-LDAMCE based on unbalanced data," J. Quant. Tech. Econ., vol. 39, no. 3, pp. 152–169, 2022.
[4] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2999–3007.
[5] P. Iba and W. Langley, "Experiments with easyensemble," in Proc. Int. Conf. Artif. Intell. (ICAI), 2000, pp. 9–16.
[6] M. S. Santos, P. H. Abreu, N. Japkowicz, A. Fernández, C. Soares, S. Wilk, and J. Santos, "On the joint-effect of class imbalance and overlap: A critical review," Artif. Intell. Rev., vol. 55, no. 8, pp. 6207–6275, Dec. 2022.
[7] J. Chen and Z. Zheng, "Over-sampling method on imbalanced data based on WKmeans and smote," Comput. Eng. Appl., vol. 57, no. 23, pp. 106–112, 2021.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, Jun. 2002.
[9] D. Meng and Y. Li, "An imbalanced learning method by combining SMOTE with center offset factor," Appl. Soft Comput., vol. 120, May 2022, Art. no. 108618.
[10] B. Krawczyk, C. Bellinger, R. Corizzo, and N. Japkowicz, "Undersampling with support vectors for multi-class imbalanced data classification," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2021, pp. 1–7.
[11] H. Yudan, G. Qing, C. Zhihua, and Y. Lei, "Classification method for imbalance dataset based on genetic algorithm improved synthetic minority over-sampling technique," J. Comput. Appl., vol. 35, no. 1, pp. 121–124, 2015.
[12] L. Wang, M. Han, X. Li, N. Zhang, and H. Cheng, "Review of classification methods for unbalanced data sets," Comput. Eng. Appl., vol. 57, no. 22, pp. 42–52, 2021.
[13] V. Vapnik, "Statistical learning theory," Ann. Inst. Stat. Math., vol. 55, no. 2, pp. 371–389, Aug. 2003.
[14] Z. Sun, G. Wang, M. Gao, L. Gao, and A. Jiang, "Research on excess location method of sealed electronic equipment based on parameter optimization support vector machine," Electron. Meas. Instrum., vol. 35, no. 8, pp. 162–174, 2021.
[15] K. Luo and G. Wang, "Research on imbalanced data classification based on L-SMOTE and SVM," Comput. Eng. Appl., vol. 55, no. 17, pp. 55–61, 2019.


[16] H. Ma and M. Zhu, "IgwoSMOTE: An over sampling method based on improved gray wolf algorithm for SVM imbalanced data classification," Comput. Eng. Sci., vol. 44, no. 6, pp. 1133–1140, 2022.
[17] R. Xiao, Z. Feng, and J. Wang, "Concept discrimination, research progress and application analysis of swarm intelligence," J. Nanchang Inst. Technol., vol. 41, no. 1, pp. 1–21, 2022.
[18] L. Zhang, Y. Zhang, and G. Song, "LSSA-based feature extraction and classification of hyperspectral images with small training samples," Remote Sens. Lett., vol. 8, no. 7, pp. 625–634, 2017.
[19] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA, USA: Addison-Wesley, 1989.
[20] M. Dorigo and L. M. Gambardella, "Ant colony system: A cooperative learning approach to the traveling salesman problem," IEEE Trans. Evol. Comput., vol. 1, no. 1, pp. 53–66, Apr. 1997.
[21] S. Das and P. N. Suganthan, "Differential evolution: A survey of the state-of-the-art," IEEE Trans. Evol. Comput., vol. 15, no. 1, pp. 4–31, Feb. 2011.
[22] I. Tomek, "Two modifications of CNN," IEEE Trans. Syst., Man, Cybern., vol. SMC-6, no. 11, pp. 769–772, Nov. 1976.
[23] X. Fan and L. Cui, "Antitumor drug target prediction method based on network attribute and its application," Data Anal. Knowl. Discovery, vol. 2, no. 12, pp. 98–108, 2018.
[24] T.-B. Du, G.-H. Shen, Z.-Q. Huang, Y.-S. Yu, and D.-X. Wu, "Automatic traceability link recovery via active learning," Frontiers Inf. Technol. Electron. Eng., vol. 21, no. 8, pp. 1217–1225, Aug. 2020.
[25] L. Cai, G. Li, J. Fang, and L. Yu, "Research on imbalanced data clustering mining for urban hot spots," Comput. Sci., vol. 46, no. 8, pp. 16–22, 2018.
[26] J. Xue and B. Shen, "A novel swarm intelligence optimization approach: Sparrow search algorithm," Syst. Sci. Control Eng., vol. 8, no. 1, pp. 22–34, Jan. 2020.
[27] W. Ma and X. Zhu, "Sparrow search algorithm based on Levy flight disturbance strategy," J. Appl. Sci., vol. 40, no. 1, pp. 116–130, 2022.
[28] R. N. Mantegna, "Fast, accurate algorithm for numerical simulation of Levy stable stochastic processes," Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 49, no. 5, pp. 4677–4683, May 1994.
[29] A. Arshad, S. Riaz, and L. Jiao, "Semi-supervised deep fuzzy C-mean clustering for imbalanced multi-class classification," IEEE Access, vol. 7, pp. 28100–28112, 2019.
[30] C. Wang, Z. Pan, L. Dong, and C. Ma, "Research on classification for imbalanced dataset based on improved smote," Comput. Eng. Appl., vol. 49, no. 2, pp. 184–187, 2013.
[31] X. Wen, J. Chen, W. Jing, and K. Xu, "Research on optimization of classification model for imbalanced data set," Comput. Eng., vol. 44, no. 4, pp. 268–273, 2018.
[32] B. Yi, J. Zhu, and J. Li, "Imbalanced data classification on micro-credit company customer credit risk assessment using improved smote support vector machine," Chin. J. Manag. Sci., vol. 24, no. 3, pp. 24–30, 2016.
[33] D. Barber, Bayesian Reasoning and Machine Learning. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[34] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann, 1993.
[35] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[36] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, Aug. 1997.

ZHI WANG was born in 1998. He is currently pursuing the degree with the School of Computer and Control Engineering, Yantai University, Yantai, Shandong, China. His main research interest includes imbalanced data.

QICHENG LIU was born in 1970. He received the Ph.D. degree. He is currently a Professor with the School of Computer and Control Engineering, Yantai University, Yantai, Shandong, China. His main research interests include big data, intelligent information processing, and data mining.
