Article
Application of Natural Language Processing and Machine
Learning Boosted with Swarm Intelligence for Spam
Email Filtering
Nebojsa Bacanin 1 , Miodrag Zivkovic 1 , Catalin Stoean 2,3, * , Milos Antonijevic 1 , Stefana Janicijevic 1 ,
Marko Sarac 1 and Ivana Strumberger 1
1 Faculty of Informatics and Computing, Singidunum University, Danijelova 32, 11010 Belgrade, Serbia
2 Human Language Technologies Center, Faculty of Mathematics and Computer Science,
University of Bucharest, Academiei 14, 010014 Bucharest, Romania
3 Department of Computer Science, Faculty of Sciences, University of Craiova, A.I.Cuza, 13,
200585 Craiova, Romania
* Correspondence: [email protected]; Tel.: +40-726-300-376
Abstract: Spam represents a genuine irritation for email users, since it often disturbs them during their work or free time. Machine learning approaches are commonly utilized as the engine of spam detection solutions, as they are efficient and usually exhibit a high degree of classification accuracy. Nevertheless, it sometimes happens that good messages are labeled as spam and, more often, some spam emails enter the inbox as good ones. This manuscript proposes a novel email spam detection approach by combining machine learning models with an enhanced sine cosine swarm intelligence algorithm to counter the deficiencies of the existing techniques. The introduced novel sine cosine algorithm was adopted for training logistic regression and for tuning XGBoost models as part of the hybrid machine learning-metaheuristics framework. The developed framework has been validated on two public high-dimensional spam benchmark datasets (CSDMC2010 and TurkishEmail), and the extensive experiments conducted have shown that the model successfully deals with high-dimensional data. The comparative analysis with other cutting-edge spam detection models, also based on metaheuristics, has shown that the proposed hybrid method obtains superior performance in terms of accuracy, precision, recall, f1 score, and other relevant classification metrics. Additionally, the empirically established superiority of the proposed method is validated using rigorous statistical tests.

Keywords: machine learning; spam detection; natural language processing; metaheuristics algorithm; swarm intelligence; artificial intelligence; sine cosine algorithm; optimization; classification

Citation: Bacanin, N.; Zivkovic, M.; Stoean, C.; Antonijevic, M.; Janicijevic, S.; Sarac, M.; Strumberger, I. Application of Natural Language Processing and Machine Learning Boosted with Swarm Intelligence for Spam Email Filtering. Mathematics 2022, 10, 4173. https://doi.org/10.3390/math10224173

Academic Editor: Alfonso Mateos Caballero
exists to properly implement the model and to tune the control parameters appropriately for
the method to deliver the expected accuracy level for each particular classification problem.
The XGBoost hyperparameters' tuning refers to the choice of appropriate values for the parameters of the approach, and as the model must be tuned for each individual problem that needs to be solved, it deserves adequate attention. This task belongs to the class of NP-hard problems. Traditionally, deterministic techniques cannot be applied, as they
would require an impractical amount of time to find the solution. Metaheuristic algorithms,
on the other hand, belong to a group of stochastic techniques and have been utilized
to address a variety of optimization tasks in different domains, with significant success.
Metaheuristic approaches can be used to address NP-hard problems that are otherwise
considered impossible to solve by utilizing conventional methods [30].
A prominent subclass of the metaheuristic algorithms is swarm intelligence [31]. Al-
gorithms that fall into this group are inspired by the natural processes of animal social
behavior, mathematically modeled through the actions performed by the individuals in
the population. A variety of efficient optimization algorithms have been developed by
modeling the complex behaviors of groups of wolves, ants, bees, fireflies, whales, and so on,
in the shape of the gray wolf optimizer (GWO) [32], ant colony optimization (ACO) [33], artificial bee colony (ABC) [34], firefly algorithm (FA) [35], and whale optimization algorithm (WOA) [36]. Notable exceptions are the group of algorithms influenced by the mathematical
laws and properties of the specific functions, where the most famous representatives are the
arithmetic optimization algorithms (AOA) [37] and the sine cosine algorithm (SCA) [38],
the latter one being used in this paper as well.
The SCA algorithm, proposed by [38] in 2016, gained significant popularity with
researchers, and has been proven to be a powerful and efficient optimizer. It was inspired
by the mathematical properties shown by sine and cosine functions that are used to guide
the search of the algorithm. Promising results on benchmark functions under test condi-
tions made it an attractive option for scientists in various domains; however, extensive
simulations have shown that there is enough room for additional enhancements of the
basic implementation. This research proposes an enhanced version of the SCA algorithm,
which was named diversity-oriented SCA (DOSCA), and implemented with the goal to
address the known deficiencies of the basic SCA.
The goal of this manuscript is to employ the implemented DOSCA within two ML
models, and to apply them for the spam classification problem, similarly as it was presented
in [4]. DOSCA was used to tune the hyperparameters of the XGBoost model, and to train the logistic regression (LR) model. The most important
contributions of this research can be summarized in the following way:
• Develop a novel SCA-based algorithm that specifically targets the known drawbacks
of the basic SCA implementation.
• Utilize the novel SCA algorithm to train the LR model and to tune the hyperparameters
of the XGBoost model.
• Evaluate the proposed models against two benchmark spam detection datasets.
The evaluation methodology relies on comparing the results of several methods based on measures such as accuracy, precision, recall, and f1 score. Two datasets have been used in the experiments, CSDMC2010 and the Turkish Email Dataset. The Turkish language is very challenging for the classification of spam emails, since this language has a more complex semantic structure than English. As always in data mining, data preprocessing plays an important role, since adequate preparation determines the effectiveness of the applied algorithms. Both datasets have been tackled with the 500 and 1000 most representative features, as described in Section 4. After preprocessing, the models have been evaluated on these two datasets in terms of classification accuracy and recall, since the false negative (FN) rate was especially targeted. The quality of the results of the proposed models was superior when compared to the outputs obtained using other methods.
The rest of the paper is structured as follows: Section 2 brings forward the background
on the spam filtering systems and metaheuristics optimization. Section 3 describes the
basic SCA metaheuristics, highlights its deficiencies, and proposes an improved version
of the algorithm. The whole of Section 4 is dedicated to preprocessing, as it is a separate
problem that falls into the natural language processing (NLP) domain.
intrusion detection [62], fake news detection [63], cloud computing workflow planning [64],
sensor networks tuning [65], and many others.
According to the famous no free lunch theorem, no single algorithm can be the best for all optimization problems. Therefore, recent research is focused on the enhancement of the existing algorithms through modification and hybridization, and on the implementation of novel, more efficient options.
One of the models for spam detection systems in social networks utilizing artificial
neural network machine learning techniques enhanced with the artificial bee colony opti-
mization algorithm is presented in [69]. In [70], the community-inspired firefly algorithm
for spam detection is proposed for searching for the features that provide good performance
for SVM, KNN, and Random forest classifiers. The effectiveness of the proposed method
is validated by comparing the results with the other existing feature selection methods,
with the results showing improved performance in terms of accuracy, false positive rate,
and F1-measure on two benchmark Twitter spam datasets. The authors in [71] compare
five bio-inspired optimization techniques in combination with the k-Nearest Neighbours
machine learning approach for the same dataset from the UCI repository. The presented
results show 100% mean values for accuracy, precision, recall, and F1-measure, by applying
different algorithms when Manhattan distance is used in KNN for classifying the emails as
spam or legitimate, while the approaches provide different results for different distance
metrics. For example, the grasshopper optimization algorithm provides the highest average
accuracy and whale optimization algorithm provides the highest average value of precision
and F1-measure.
\[
p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \tag{2}
\]
Predictive analytics and classification frequently use logistic regression. Based on a
given dataset of independent variables, logistic regression calculates the likelihood that an
event will occur. Since the result is a probability, the dependent variable’s range is limited
to 0 and 1. In logistic regression, the odds—that is, the probability of success divided by the
probability of failure—are transformed using the logit formula. The Equations (3) and (4)
are used to represent this logistic function, which is sometimes referred to as the log odds
or the natural logarithm of the odds.
\[
\operatorname{logit} p = \sigma^{-1}(p) = \ln \frac{p}{1-p} \quad \text{for } p \in (0, 1) \tag{3}
\]

\[
P_i = \operatorname{Prob}(y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + \varepsilon)}}
\;\Rightarrow\;
\ln\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + \varepsilon \tag{4}
\]
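The logistic function and its log-odds inverse from Equations (2)-(4) can be sketched in a few lines of Python. This is an illustrative sketch only; the function names (`sigmoid`, `predict_proba`, `logit`) are ours and not taken from the paper.

```python
import math

def sigmoid(z):
    """Logistic function of Equation (2): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, betas):
    """P(y = 1 | x) for the multivariate model of Equation (4)."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

def logit(p):
    """Log odds of Equation (3): the inverse of the logistic function."""
    return math.log(p / (1.0 - p))
```

Applying `logit` to a predicted probability recovers the linear predictor, which is exactly the identity expressed by Equation (4).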
2.4.2. XGBoost
The XGBoost model recently became very popular in the scientific community, as it is
generally capable of delivering good overall results. Fundamentally, XGBoost builds an ensemble of decision trees through gradient boosting and applies it to the classification task.
There is an objective function given by Equation (5), where l specifies the loss of
the t-th round and const refers to the constants, while Ω denotes the regularization term
obtained by Equation (6). Within the latter, γ and λ specify the control parameters. f t is the
loss function in iteration t, y is the real value, and ŷ is the predicted (expected) value.
\[
obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{t-1} + f_t(x_i)\right) + \Omega(f_t) + const \tag{5}
\]

\[
\Omega(f_t) = \gamma \cdot T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{6}
\]
Next, the first derivative is obtained with respect to Equation (8) and the second
derivative is defined by Equation (9).
The combination of Equations (6), (8), and (9) allows the forming of Equation (7);
and after determining the derivative, the loss function is determined by Equation (10),
while the weight values are obtained by Equation (11).
\[
obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum g_i\right)^2}{\sum h_i + \lambda} + \gamma \cdot T \tag{10}
\]

\[
w_j^{*} = -\frac{\sum g_i}{\sum h_i + \lambda} \tag{11}
\]
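The closed-form leaf weight and tree objective of Equations (10) and (11) can be computed directly from the per-instance gradients and Hessians. The sketch below is illustrative; the function names are ours, and the leaf grouping is assumed to be given.

```python
def optimal_leaf_weight(grads, hess, lam):
    """w_j* from Equation (11): gradients and Hessians of the instances
    falling into leaf j, with regularization strength lambda."""
    return -sum(grads) / (sum(hess) + lam)

def tree_objective(leaves, lam, gamma):
    """obj* from Equation (10); `leaves` is a list of (grads, hess)
    pairs, one per leaf, and T is the number of leaves."""
    total = 0.0
    for grads, hess in leaves:
        total -= 0.5 * sum(grads) ** 2 / (sum(hess) + lam)
    return total + gamma * len(leaves)
```

A smaller (more negative) `tree_objective` indicates a better tree structure, which is how XGBoost scores candidate splits.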
The flexibility of the XGBoost model allows it to be completely adapted to every
particular problem; however, it also means that its hyperparameters must be tuned for
every given task individually. There are many parameters that may be considered for
fine-tuning, and we restricted ourselves to the ones enumerated below (further information
about them and their possible values is offered within Section 5):
• Learning rate,
• Minimum sum of instance weight (hessian) needed in a child,
• Subsample ratio of the training instances,
• Subsample ratio of columns when constructing each tree,
• The maximum depth of a tree,
• The minimum loss reduction required to make a further partition on a leaf node of
the tree.
The hyperparameter optimization task refers to the problem of choosing the optimal
values for the particular problem that needs to be solved, and it is regarded as an NP-hard
challenge. If this task is executed manually, through trials and errors, it would require an
impractical amount of time and resources to be solved.
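In a metaheuristic-driven tuning setup such as the one described here, each candidate solution is typically a real-valued vector that must be decoded into the six hyperparameters listed above. The sketch below shows one plausible decoding; the search ranges are illustrative assumptions, not the paper's exact bounds (those are given in Section 5).

```python
# Assumed search ranges, for illustration only; the paper's exact
# bounds are specified in Section 5.
PARAM_BOUNDS = {
    "learning_rate":    (0.01, 0.9),
    "min_child_weight": (1.0, 10.0),
    "subsample":        (0.5, 1.0),
    "colsample_bytree": (0.5, 1.0),
    "max_depth":        (3, 10),     # integer-valued parameter
    "gamma":            (0.0, 0.8),
}

def decode(solution):
    """Map a metaheuristic solution vector in [0, 1]^6 to XGBoost
    keyword arguments, rounding the integer-valued max_depth."""
    params = {}
    for value, (name, (lo, hi)) in zip(solution, PARAM_BOUNDS.items()):
        p = lo + value * (hi - lo)
        params[name] = round(p) if name == "max_depth" else p
    return params
```

The decoded dictionary can then be passed to an XGBoost classifier, and the resulting validation score serves as the fitness of the solution vector.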
3. Proposed Method
This section first gives details of the SCA metaheuristics. Afterward, the observed
shortcomings of its basic version are elaborated. Finally, details of the proposed method
that overcomes the deficiencies of the basic SCA are provided.
During the execution of the algorithm, the search process is controlled by the four random control parameters r1-r4, which impact the positions of the current and the best solutions. The balance between exploration and exploitation is necessary for efficient convergence to the global optimal solution, and it is granted by changing the functions' range in an adaptive way.
The characteristic of both the sine and cosine functions is that they exhibit a cyclic
pattern, allowing for repositioning in the proximity of the solution, and therefore, granting
exploitation. The changing of the range of both functions allows for a search outside of the
dedicated destinations. Additionally, each individual is required not to overlap its position
with other individuals in the population.
To improve the quality of randomness, the control parameter r2 is being produced in
the range [0, 2π ], therefore guaranteeing the exploration. The balance of the exploration
and exploitation processes is controlled by the Equation (15), where t denotes the current
round of execution and T denotes the maximum number of rounds in a single run, while a
represents a constant value.
\[
r_1 = a - t \frac{a}{T} \tag{15}
\]
It is also important to mention that some of the recent research papers [75] noted
that large domains of the search space could be covered if the quasi-reflection-based learning (QRL) mechanism is applied to the initial solutions produced by Equation (16).
This procedure generates a quasi-reflexive-opposite component (x_j^{qr}) per each individual's parameter j (x_j), according to Equation (17), where the rnd procedure is used to choose pseudo-random values within the range \left[\frac{lb_j + ub_j}{2}, x_j\right]:

\[
x_j^{qr} = rnd\left(\frac{lb_j + ub_j}{2}, x_j\right) \tag{17}
\]
The proposed initialization mechanism that utilizes QRL does not increase the com-
plexity of the algorithm with respect to the fitness function evaluations (FFEs), as at first,
only N/2 solutions are initialized, where N represents the count of the solutions in the
population. This mechanism is given in Algorithm 1.
The proposed procedure provides better diversity in the initial population, which
subsequently leads to a boost to the search phase, as can be observed in the quality of the
results, in the experimental Results section.
\[
D_j^{p} = \frac{1}{m} \sum_{i=1}^{m} \left| x_{ij} - \bar{x}_j \right| \tag{19}
\]

\[
D^{p} = \frac{1}{n} \sum_{j=1}^{n} D_j^{p} \tag{20}
\]
In every run of the algorithm, during the initialization phase, where the solutions are
produced by utilizing the standard method given by Equation (16), the diversity of the
population is large. Nevertheless, the diversity will gradually decrease in later rounds
of the run, as the algorithm begins to converge in the proximity of the potential solution.
The L1 measure is utilized to control the population’s diversity, together with the dynamic
threshold parameter Dt .
The suggested diversity-oriented method operates as follows: during the initialization phase, the initial values of Dt (Dt0 ) are obtained. For every following round, the conditional
statement D P < Dt is evaluated. In case the condition has been evaluated as satisfied,
the population’s diversity is not satisfactory, and nrs of the worst individuals are replaced
with random solutions. Therefore, nrs is an additional control parameter that defines the
quantity of the individuals to be removed and reinitialized with new ones. The equation
that is used to calculate Dt0 is given by Equation (21).
\[
D_{t0} = \sum_{j=1}^{n} \frac{ub_j - lb_j}{2 \cdot n} \tag{21}
\]
This presumes that the majority of the individuals would be produced in the proximity of the mean of the upper and lower parameter boundaries, as per Equation (16) and
Algorithm 1. Additionally, for rounds where it can be anticipated that the population is
converging into the direction of the more promising areas, and moving away from the
starting value Dt = Dt0 , the Dt is being reduced according to Equation (22), where iter
and iter + 1 denote the ongoing and following rounds of execution, respectively; further, T
determines the maximal repetition count for the execution. That being the case, with every following round, the value of Dt will be dynamically decreased towards the last round, regardless of D^P.
\[
D_{t,iter+1} = D_{t,iter} - D_{t,iter} \cdot \frac{iter}{T} \tag{22}
\]
\[
tf(t, d) = \frac{f_d(t)}{\max_{w \in d} f_d(w)} \tag{24}
\]

\[
idf(t, D) = \ln \frac{|D|}{|\{d \in D : t \in d\}|} \tag{25}
\]

\[
tf\text{-}idf(t, d, D) = tf(t, d) \cdot idf(t, D) \tag{26}
\]
Feature selection boosts a classifier’s success by removing non-discriminatory phrases
that are present in almost all classifications. It takes a lot of computation time and resources
to train a classification model utilizing all the terms. Reducing computing complexity
and getting rid of superfluous characteristics are the goals. Each email is represented by
a numerical feature vector. The t f -id f procedure [78] proved to be successful for feature
selection, and it is utilized in the current work as well.
After the t f -id f vectors are computed (Equation (26)), we normalize the vectors with
the Euclidean (L2) norm, which is a basic technique, as in Equation (27) [79]. Long docu-
ments make the terms appear more important than they actually are, since their likelihood
of occurrence increases. By taking the document length into account, the normalization
seeks to avoid this bias in lengthy papers. By applying normalizing approaches to t f -id f
vectors, Amayri and Bouguila [80] found that using the L2-norm produced better re-
sults than using the L1-norm, and that normalization can help defend against assaults on
sparse data.
\[
v_i^{norm} = \frac{v_i}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}} \quad (i = 1 \ldots N) \tag{27}
\]
The total weight of a term is calculated by summing the corresponding normalized t f -id f weights of the term across all documents. The overall weights are then sorted in descending order, and the feature set with the highest overall weights is chosen: after sorting the weights of all terms descendingly, the first half of the items is selected out of the total. Each document is then represented as a feature vector made up of the normalized t f -id f weights of these terms. The feature set (S = t1 , . . . , tn ) was formed according to these critical terms, so that the feature selection determines the final set.
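The whole pipeline of Equations (24)-(27) plus the overall-weight ranking can be sketched in plain Python. This is an illustrative sketch; the function names are ours, and a production system would more likely use a library vectorizer.

```python
import math

def tf(term, doc):
    """Equation (24): count of term in doc, scaled by the count of
    the most frequent term in that doc."""
    counts = {}
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    return counts.get(term, 0) / max(counts.values())

def idf(term, docs):
    """Equation (25): log of total documents over documents
    containing the term."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def l2_normalize(vec):
    """Equation (27): divide each component by the Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def top_k_features(docs, vocab, k):
    """Sum each term's normalized tf-idf weight across all documents
    (Equation (26)) and keep the k terms with the largest totals."""
    rows = [l2_normalize([tf(t, d) * idf(t, docs) for t in vocab])
            for d in docs]
    totals = {t: sum(row[i] for row in rows)
              for i, t in enumerate(vocab)}
    return sorted(totals, key=totals.get, reverse=True)[:k]
```

Selecting k = 500 or k = 1000 with such a ranking mirrors the two feature-set sizes used in the experiments.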
Figure 1. WordCloud function outputs for English spam (top left), non-spam emails (top right), and the whole dataframe (bottom).
WordCloud function outputs for Turkish spam emails, non-spam emails, and the whole dataframe are shown in Figure 2.
The class distribution for both the English and Turkish datasets is shown in Figure 3.
The English dataset contains around 68% non-spam and 32% spam emails, while the Turkish dataset comprises around 60% non-spam and 40% spam emails.
The procedure of feature selection was established according to the previous data preprocessing steps: cleaning, tokenization, stemming, stop word removal, pruning, bag of words, and tf-idf vectorization. The tf-idf vectorization produces weighted values for every word, sorted in descending order. The first 500 words from the ordered set are selected, and a data frame is created for training with 500 features. The data frame for training with 1000 features is established in a similar manner. The weighted values for both datasets are shown in Figure 4.
Figure 2. WordCloud function outputs for Turkish spam (top left), non-spam emails (top right), and the whole dataframe (bottom).