
mathematics

Article
Application of Natural Language Processing and Machine
Learning Boosted with Swarm Intelligence for Spam
Email Filtering
Nebojsa Bacanin 1, Miodrag Zivkovic 1, Catalin Stoean 2,3,*, Milos Antonijevic 1, Stefana Janicijevic 1, Marko Sarac 1 and Ivana Strumberger 1

1 Faculty of Informatics and Computing, Singidunum University, Danijelova 32, 11010 Belgrade, Serbia
2 Human Language Technologies Center, Faculty of Mathematics and Computer Science,
University of Bucharest, Academiei 14, 010014 Bucharest, Romania
3 Department of Computer Science, Faculty of Sciences, University of Craiova, A.I.Cuza, 13,
200585 Craiova, Romania
* Correspondence: [email protected]; Tel.: +40-726-300-376

Abstract: Spam represents a genuine irritation for email users, since it often disturbs them during their work or free time. Machine learning approaches are commonly utilized as the engine of spam detection solutions, as they are efficient and usually exhibit a high degree of classification accuracy. Nevertheless, it sometimes happens that good messages are labeled as spam and, more often, some spam emails enter the inbox as good ones. This manuscript proposes a novel email spam detection approach by combining machine learning models with an enhanced sine cosine swarm intelligence algorithm to counter the deficiencies of the existing techniques. The introduced novel sine cosine algorithm was adopted for training logistic regression and for tuning XGBoost models as part of the hybrid machine learning-metaheuristics framework. The developed framework has been validated on two public high-dimensional spam benchmark datasets (CSDMC2010 and TurkishEmail), and the extensive experiments conducted have shown that the model successfully deals with high-dimensional data. The comparative analysis with other cutting-edge spam detection models, also based on metaheuristics, has shown that the proposed hybrid method obtains superior performance in terms of accuracy, precision, recall, F1 score, and other relevant classification metrics. Additionally, the empirically established superiority of the proposed method is validated using rigorous statistical tests.

Keywords: machine learning; spam detection; natural language processing; metaheuristics algorithm; swarm intelligence; artificial intelligence; sine cosine algorithm; optimization; classification

MSC: 62H30; 62M45; 62M20; 62K25; 90C26

Citation: Bacanin, N.; Zivkovic, M.; Stoean, C.; Antonijevic, M.; Janicijevic, S.; Sarac, M.; Strumberger, I. Application of Natural Language Processing and Machine Learning Boosted with Swarm Intelligence for Spam Email Filtering. Mathematics 2022, 10, 4173. https://doi.org/10.3390/math10224173

Academic Editor: Alfonso Mateos Caballero

Received: 22 September 2022; Accepted: 4 November 2022; Published: 8 November 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

An important amount of unsolicited emails comes from legitimate e-commerce businesses that encourage users to buy their products. These are not particularly dangerous for the email recipients, although they are distracting and most people would prefer that these messages are sent directly to their spam folder. However, the most dangerous ones are the phishing messages that target sensitive information from the users, such as usernames, passwords, card numbers, or PINs, or that ask directly for a sum of money with the promise that a lot more will be paid back later [1]. The scammers also took advantage of the uncertainty and fear of the population around COVID-19, sending messages regarding financial assistance, offering protective equipment, and often impersonating various medical bodies, with the users being rather susceptible to the topic [2]. Phishing emails also target the employees of large companies, aiming to create breaches that can later be exploited. According to a recent report from IBM


(https://www.ibm.com/security/data-breach, accessed on 17 September 2022), the global


average total cost of a data breach was USD 4.35 million in 2022, an increase from USD
4.24 million in 2021, while stolen and compromised credentials are responsible for 19% of
breaches, and phishing, for 16%.
Spam messages do not only burden email users, but they also use a high amount of
network bandwidth, occupy disk space, and often include malware products in attach-
ments. A general classification of anti-spam approaches partitions these methods into
two expansive categories: static and dynamic [3]. Filtering strategies based on specified
whitelists and blacklists are examples of stationary techniques. However, the malware
software mentioned above sometimes transforms the receiver device into a machine that
further sends spam without the user being aware of this information [4]. Hence, static
methods prove to be insufficient since they are based on a list of IPs or email addresses that
are often used as senders, but spammers utilize the newly infected devices as transmitters.
In order to determine whether an email is spam, dynamic algorithms typically take into
account the content of the message, utilizing text modeling techniques developed with
statistical or machine-learning methodologies.
Different algorithms approach tasks in various ways, and to differing degrees of
success. Artificial intelligence is a viable solution for issues in the constantly evolving field
of network security, due to its capacity to learn and to adapt to a changing environment.
Traditional approaches such as firewalls, blacklists, and others are still used, but their
effectiveness must be constantly monitored and maintained. Researchers have tried to
enhance current models and increase network security as a whole by applying AI to
these issues.
In the last decade, considerable advances in the domain of Artificial Intelligence (AI)
have resulted in the widespread adoption of machine learning (ML) algorithms in a wide
spectrum of different industries. Business [5], finance [6], healthcare [7], and other
fields [8] are relying on AI for everyday tasks and operations. Consequently, the amount
and the variety of solutions that AI offers to real-world challenges is rapidly increasing.
The most recent applications of the AI models include network security and intrusion
detection [9–11], phishing [12,13], IoT networks botnets discovery [14–16], and many more.
Spam detection is still a particularly open issue that has been addressed several times
with ML approaches, as can be seen from the recent literature [4,17–19]. This problem
falls within the domain of classical classification tasks. However, with respect to the no
free lunch theorem, a universal solution does not exist, and it is always possible to find a
method that will bring better results. Inspired by the research provided in [4], this research
proposes two approaches, a logistic regression model trained by a metaheuristic algorithm
(the same setup as in [4]), and an XGBoost model tuned using the same metaheuristic.
Logistic regression (LR) [20] is an approach that is very well capable of determining the
connections between the features and the specific outcomes, and consequently, is frequently
used for classification tasks. It is considered to be the baseline ML method in natural
language processing. LR can be used either for binary or multi-class classification problems.
It is a very simple and efficient model that allows for fast classification, which is extremely
useful in real-time applications, and it has already been used to tackle the spam filtering
task [21]. Notwithstanding the obvious advantages of the LR, the shortcoming of this
approach is that it commonly utilizes stochastic gradient descent (SGD) for training, which
can lead to premature convergence towards poor local optima. Similarly as in [4], this paper
proposes avoiding this problem through training with the help of a metaheuristic algorithm.
Concerning classification tasks, ensemble learning models typically outperform single
model techniques. The XGBoost model has been established as a powerful optimizer, and
it is very popular among the research community for solving various difficult challenges.
As a result, the XGBoost model has been utilized in a variety of application domains, which
include healthcare [22–25], finance [26,27], and many others [28,29]. Notwithstanding the
respectable level of performance that the XGBoost model is capable of, a challenge still
exists to properly implement the model and to tune the control parameters appropriately for
the method to deliver the expected accuracy level for each particular classification problem.
The XGBoost hyperparameters’ tuning refers to the choice of the appropriate weights
for the parameters of the approach, and as the model must be tuned for each individual
problem that needs to be solved, it deserves adequate attention. The task belongs to the
NP-hard challenges. Traditional, deterministic techniques cannot be applied, as they
would require an impractical amount of time to find the solution. Metaheuristic algorithms,
on the other hand, belong to a group of stochastic techniques and have been utilized
to address a variety of optimization tasks in different domains, with significant success.
Metaheuristic approaches can be used to address NP-hard problems that are otherwise
considered impossible to solve by utilizing conventional methods [30].
A prominent subclass of the metaheuristic algorithms is swarm intelligence [31]. Al-
gorithms that fall into this group are inspired by the natural processes of animal social
behavior, mathematically modeled through the actions performed by the individuals in
the population. A variety of efficient optimization algorithms have been developed by
modeling the complex behaviors of groups of wolves, ants, bees, fireflies, whales, and so on,
in the shape of the gray wolf optimizer (GWO) [32], ant colony optimization (ACO) [33], ar-
tificial bee colony(ABC) [34], firefly algorithm (FA), [35] and whale optimization algorithm
(WOA) [36]. Notable exceptions are the group of algorithms influenced by the mathematical
laws and properties of the specific functions, where the most famous representatives are the
arithmetic optimization algorithms (AOA) [37] and the sine cosine algorithm (SCA) [38],
the latter one being used in this paper as well.
The SCA algorithm, proposed by [38] in 2016, gained significant popularity with
researchers, and has been proven to be a powerful and efficient optimizer. It was inspired
by the mathematical properties shown by sine and cosine functions that are used to guide
the search of the algorithm. Promising results on benchmark functions under test condi-
tions made it an attractive option for scientists in various domains; however, extensive
simulations have shown that there is enough room for additional enhancements of the
basic implementation. This research proposes an enhanced version of the SCA algorithm,
which was named diversity-oriented SCA (DOSCA), and implemented with the goal to
address the known deficiencies of the basic SCA.
The goal of this manuscript is to employ the implemented DOSCA within two ML
models, and to apply them for the spam classification problem, similarly as it was presented
in [4]. DOSCA was used to optimize the input weight and hidden neurons bias values of
the XGBoost model, and to train the logistic regression (LR) model. The most important
contributions of this research can be summarized in the following way:
• Develop a novel SCA-based algorithm that specifically targets the known drawbacks
of the basic SCA implementation.
• Utilize the novel SCA algorithm to train the LR model and to tune the hyperparameters
of the XGBoost model.
• Evaluate the proposed models against two benchmark spam detection datasets.
The evaluation methodology involves comparing the results of several methods based on measures such as accuracy, precision, recall, and F1 score. Two datasets have been used in the experiments, CSDMC2010 and the Turkish Email Dataset. The Turkish language is very challenging for the classification of spam emails, since this language has a more complex semantic structure than English. As always in data mining, data preprocessing plays an important role, since adequate preparation enables the algorithms to produce effective results. Both datasets have been tackled with 500 and 1000 of the most representative features, as described in Section 4. After preprocessing, the models have been evaluated on these two datasets in terms of classification accuracy and recall, since the false negative (FN) rate was especially targeted. The quality of the results of the proposed models was superior when compared to the outputs obtained using other methods.
The rest of the paper is structured as follows: Section 2 brings forward the background
on the spam filtering systems and metaheuristics optimization. Section 3 describes the
basic SCA metaheuristics, highlights its deficiencies, and proposes an improved version
of the algorithm. The whole of Section 4 is dedicated to preprocessing, as it is a separate
problem that falls into the natural language processing (NLP) domain.

2. Background and Literature Review


This section first provides an overview of the most relevant spam detection approaches
from the recent literature. Afterwards, a brief summary of the metaheuristics optimization
algorithms is provided. Finally, this section gives a survey of hybrid ML and swarm
intelligence approaches that have been applied to the spam detection problem.

2.1. Spam Detection


Modern spam filtering systems typically include a ML model that is used for classi-
fication. The three most commonly applied ML models include logistic regression (LR),
extreme learning machine (ELM), and XGBoost.
LR classifiers are based on a logistic function that models dependent dichotomous
variables, and it is based on the supposition that each pair of components in a feature
vector is independent [39]. Because of its simplicity, quick convergence, and accurate
interpretation, LR is frequently employed in spam filtering. The algorithm is linear with a
nonlinear transform on output.
ELM is an ML model that gained popularity within the academic community re-
cently, although it was introduced in 2004 by [40]. ELM operates with single-hidden layer
feed-forward neural networks (SLFNs). This approach has been able to achieve lever-
aged generalization performance in comparison to other feed-forward neural network
approaches, while providing excellent learning speed and high efficiency. ELM classifiers are used in classification, regression and clustering, as well as sparse approximation, compression and feature learning. Typically, they are created using a single layer or several layers
of hidden nodes, with the hidden nodes’ settings left untuned. Olatunji [41] emphasized
that on a comparative scale based on accuracy, SVM outperformed ELM. However, ELM
fared substantially better than SVM in terms of the speed of operation.
The XGBoost algorithm is utilizing the additive training method for the optimization
of the objective function [42], meaning that every step of the optimization task depends on
the result from the preceding step.
Dedeturk et al. [4] concluded that LR is sensitive to equal feature weights, and that it
obtains optimal weight and bias values that may converge to local minima. The LR classifier
behaves stably after tf-idf feature extraction, and it yields good accuracy when a dataset
is preprocessed with the feature reduction technique. The study published in [43] used
the pre-trained bidirectional encoder representation from a transformer (BERT) and LR
algorithm to categorize ham or spam emails in order to solve the NLP email classification
challenge. The LR algorithm produced the best classification results in training and test
datasets. Another study [44] compared accuracy, recall, and precision for many classifiers, and only LR achieved equal percentages for all three measures across all classifiers.
Goodman and Yih [45] displayed a straightforward LR model that was improved using an
online gradient descent approach. In comparison to the best published generative approach,
they claimed that their model can give outcomes that are competitive.
Lucay [46] suggested a hybrid approach, explaining that the weights between the
input and hidden layers are given random values by the ELM, while the hidden layer’s
biases are fixed during training. The online sequential-ELM performs well in terms of
solution quality and execution speed. ELM is an excellent method for imbalanced datasets,
and email datasets very often have such a form. Roul [47] proposed a method for text
mining search where they used Multilayer ELM, since the procedure overcame limitations
in classic backpropagation neural networks. They saved training time, since the procedure did not require the fine-tuning of hyperparameters, and it converged to a global optimum without using kernel techniques for feature separation.
The next advanced spam filtering technique is the XGBoost technique, which is an
ensemble method that uses decision trees with a gradient-boosting framework. It is used
usually for predictions on unstructured data such as images and text. The main pillars in
the XGBoost procedure are parallel processing, tree-pruning, handling missing values, and
using regularization to avoid overfitting and bias. Ismail et al. [48] reported that an Extreme
Gradient Boosting-based spam detection model has improved, although, as far as they are
aware, it is receiving little attention for spam email detection issues. Anitha et al. [49] re-
ported increased accuracy in spam detection based on the XGBoost classifier. They explored
the proposed system’s performance using a more comprehensive range of experimental
metrics and better accuracy (95%) when compared to other classifiers. Pandey et al. [50]
suggested the XGBoost method, which chooses the most crucial characteristics for effective
phishing website detection using a feature selection strategy. Both trustworthy and fraudu-
lent websites follow a specific pattern. The outcomes of the machine learning classifiers
will serve as the foundation for the phishing detection model. This method was applied to
evaluate phishing websites. In comparison to the AdaBoost and Gradient boosting machine
learning methods, XGBoost performs better.
Generally, despite the fact that these classifiers are simple to use, are generalizable,
and are reasonably effective, they frequently have drawbacks such as the ones indicated by
Dedeturk et al. [4]:
• the dimensionality curse,
• significant expenses for computation,
• misclassification rates,
• responsiveness to feature weights,
• modest operation speeds for practical applications,
• overfitting or getting stuck in local minima.
The effectiveness of these classifiers is also influenced by the nature of the problem
and a few specified criteria. The novel spam filter methodology proposed in this paper
aims to identify spam in emails in light of the shortcomings of the current spam detec-
tion techniques.

2.2. Metaheuristics Optimization


Traditional, deterministic algorithms are not suitable for solving NP-hard tasks, as they
would require an impractical amount of time and resources to find the solution. Meta-
heuristics algorithms, on the other hand, provide satisfying solutions (not guaranteed to
be the best, but good enough) in a reasonable time. One of the most prominent groups of
metaheuristics optimization algorithms is swarm intelligence, where the algorithms are
inspired by behavior expressed by different types of animals and processes found in nature.
Some of the most famous nature-inspired methods include the ABC [34] metaheuristics,
which are frequently utilized to optimize the performance of neural networks [51]. Another
notable example is the WOA [36], which models the unique fishing techniques of humpback
whales. WOA is extremely famous due to interesting search patterns, and it was utilized to
address many real-world challenges with great success [52]. Another popular metaheuris-
tics, the GWO [32], was modeled to mimic the hunting techniques exhibited by a pack of
gray wolves, and has also been established as a very powerful optimizer with numerous
applications such as [53]. The original FA metaheuristics [35] is also well-known and is
capable of superior performance due to its powerful search, and it was recently used in a
wide range of domains, including credit card fraud detection [54], medical diagnostics [55],
neural networks optimization [56], plants classification task [57], and many others.
Recent publications show the successful combination of neural networks and swarm
intelligence metaheuristics, and also other successful application domains. The most no-
table contemporary applications of metaheuristics optimization include COVID-19 MRI
classification and illness severity prediction [58], computer-assisted tumor MRI classifica-
tion [59], feature selection task [60], cryptocurrencies trends estimation [61], security and
intrusion detection [62], fake news detection [63], cloud computing workflow planning [64],
sensor networks tuning [65], and many others.
According to the famous no free lunch theorem, the perfect single solution that is
the best for all optimization problems cannot be proven to exist. Therefore, recent re-
search is focused on the enhancement of the existing algorithms through modification and
hybridization, and the implementation of novel, more efficient options.

2.3. Hybrid Machine Learning and Metaheuristics Approaches to Spam Detection


Table 1 briefly presents a comparative view of various studies that hybridized ML approaches with metaheuristics in various manners. All the entries from the table
are subsequently described in the following paragraphs.
Various combinations of machine learning and metaheuristics techniques are consid-
ered in the spam detection process. For instance, in [66], an integrated approach of the
Naive Bayes (NB) algorithm is used for the learning and classification of email content, and
the Particle Swarm Optimization model for the global optimization of the parameters of the
machine learning algorithm is used for email spam detection. The Naive Bayes classifier,
together with the binary firefly algorithm, is proposed in [67]. In this case, metaheuristics
are used for decreasing the dimensionality of features and enhancing the accuracy of the
email spam classification process. The obtained results show that the proposed approach
achieved an accuracy of 95.14% on the SpamBase dataset used. Another example of facing a
feature selection (FS) problem in ML is presented in [68]. In the described experiments, the
k-Nearest Neighbours (KNN) classifier is used to define the target function of the FS chal-
lenge, which is solved by the proposed hybrid model of whale optimization and the flower
pollination algorithm based on the concept of opposition-based learning. The authors in [4]
proposed a new spam detection model based on the logistic regression (LR) classification
algorithm combined with artificial bee colony metaheuristics. Experiments are conducted
on the Enron, CSDMC2010, and TurkishEmail datasets, and are compared with the support
vector machine (SVM), LR, and NB classifiers. The research conclusion is that, in terms of
classification accuracy, the hybrid model outperforms other spam detection techniques.

Table 1. Overview of the selected ML and metaheuristics for spam detection.

ML Approach              | Metaheuristics                                                | Application                             | Ref.
Naive Bayes              | Particle swarm optimization                                   | Parameter tuning                        | [66]
Naive Bayes              | Binary firefly algorithm                                      | Feature selection                       | [67]
k-Nearest Neighbours     | Whale optimization and flower pollination algorithm           | Feature selection                       | [68]
Logistic regression      | Artificial bee colony                                         | Determine weight and bias values of LR  | [4]
Neural network           | Artificial bee colony                                         | Feature selection                       | [69]
SVM, KNN, Random forest  | Firefly algorithm                                             | Feature selection and combination       | [70]
k-Nearest Neighbours     | Grey wolf optimization, firefly optimization, chicken swarm optimization, grasshopper optimization, and whale optimization | Parameter tuning | [71]

One of the models for spam detection systems in social networks utilizing artificial
neural network machine learning techniques enhanced with the artificial bee colony opti-
mization algorithm is presented in [69]. In [70], the community-inspired firefly algorithm
for spam detection is proposed for searching for the features that provide good performance
for SVM, KNN, and Random forest classifiers. The effectiveness of the proposed method
is validated by comparing the results with the other existing feature selection methods,
with the results showing improved performance in terms of accuracy, false positive rate,
and F1-measure on two benchmark Twitter spam datasets. The authors in [71] compare
five bio-inspired optimization techniques in combination with the k-Nearest Neighbours
machine learning approach for the same dataset from the UCI repository. The presented
results show 100% mean values for accuracy, precision, recall, and F1-measure, by applying
different algorithms when Manhattan distance is used in KNN for classifying the emails as
spam or legitimate, while the approaches provide different results for different distance
metrics. For example, the grasshopper optimization algorithm provides the highest average
accuracy and whale optimization algorithm provides the highest average value of precision
and F1-measure.

2.4. Text Mining Models


This section introduces the text mining models used in this research. A brief overview
of the logistic regression is given first, and is followed by a description of the XGBoost model.

2.4.1. Logistic Regression


The approach models the probability of an event. It belongs to a class of ML models with a statistical base. The log-odds of the event is a linear combination of one or more independent variables ("predictors"). The logistic model belongs to the field of regression analysis, since the objective is to estimate the parameters (the coefficients in the linear combination). The independent variables in binary logistic regression can each be either binary (two classes, coded by an indicator variable) or continuous variables. The dependent variable in binary logistic regression is binary, coded by an indicator variable, with the two values labeled true/false or 1/0. The associated probability can range from 0 to 1. Because it is difficult to determine the precise line dividing the two classes in a linear model, some points from the classes in logistic regression suffer from random decisions.
The logistic function is of the form (1), where µ is a location parameter (the midpoint of the curve, where p(µ) = 1/2) and s is a scale parameter. This expression may be rewritten as (2), where β0 = −µ/s is known as the intercept (it is the vertical intercept or y-intercept of the line y = β0 + β1 x), and β1 = 1/s (inverse scale or rate parameter): these are the y-intercept and slope of the log-odds as a function of x. Conversely, µ = −β0/β1 and s = 1/β1.

p(x) = \frac{1}{1 + e^{-(x - \mu)/s}}    (1)

p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}    (2)
Predictive analytics and classification frequently use logistic regression. Based on a given dataset of independent variables, logistic regression calculates the likelihood that an event will occur. Since the result is a probability, the dependent variable's range is limited to 0 and 1. In logistic regression, the odds, that is, the probability of success divided by the probability of failure, are transformed using the logit formula. Equations (3) and (4) are used to represent this logistic function, which is sometimes referred to as the log odds or the natural logarithm of the odds.

\mathrm{logit}\, p = \sigma^{-1}(p) = \ln \frac{p}{1 - p} \quad \text{for } p \in (0, 1)    (3)

P_i = \mathrm{Prob}(y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + \varepsilon)}} \;\Rightarrow\; \ln\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + \varepsilon    (4)
Regularization can be used to train a model to better generalize to unseen data, preventing the algorithm from overfitting the training dataset. A regression model that uses the L1 norm for regularization is called Lasso regression, and a model that employs the L2 norm is known as Ridge regression. If the (L2) regularization hyperparameter is equal to 0, then overfitting occurs easily, and if it is very large, then it will add too much weight, which will lead to underfitting. We find the best hyperparameter using cross-validation (CV). CV is a decision-aiding method that compares metrics from different samples by reserving a portion of the data for use in model estimation. For the LR, we used 90% of the data for training and the remaining 10% as the test set. For XGBoost, we used 80% for training and 20% for testing. The reason for this is that the LR is trained by the swarm, and we needed a bigger portion of the dataset for this task, while the XGBoost only has its parameter values tuned. These proportions for the separation were established during pre-experimental setup via trial and error.
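A minimal sketch of the data splits described above, assuming the tf-idf feature matrix X and the label vector y have already been built (the placeholder data, stratification, and random seed are assumptions, not details from the original code):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 500)            # placeholder tf-idf matrix (e.g., 500 features)
y = np.random.randint(0, 2, 1000)        # placeholder spam/ham labels

# 90/10 split for the swarm-trained logistic regression
X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

# 80/20 split for the XGBoost hyperparameter-tuning experiments
X_train_xgb, X_test_xgb, y_train_xgb, y_test_xgb = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)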

2.4.2. XGBoost
The XGBoost model recently became very popular in the scientific community, as it is
generally capable of delivering good overall results. Fundamentally, XGBoost deals with
the resolution of the linear classification task.
There is an objective function given by Equation (5), where l specifies the loss of
the t-th round and const refers to the constants, while Ω denotes the regularization term
obtained by Equation (6). Within the latter, γ and λ specify the control parameters. f t is the
loss function in iteration t, y is the real value, and ŷ is the predicted (expected) value.
obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{t-1} + f_t(x_i)\right) + \Omega(f_t) + const    (5)

\Omega(f_t) = \gamma \cdot T_t + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2    (6)

Since this is a minimization optimization problem that needs to be solved, gradient boosted trees and Taylor approximation methods are applied. Both methods imply a linearization of the function that could be treated as a linear classification. The Taylor approximation approach is a transformation to a simple function around a single point that is obtained from the previous step t − 1. After the second-order Taylor approximation, the loss function in Equation (7) is obtained, where const refers to the constants, while Ω denotes the regularization term in Equation (6).
A further enhancement of optimization is achieved during the training process, where
every iteration is dependent on the result achieved in the preceding one.
obj^{(t)} = \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{t-1}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + const    (7)

Next, the first derivative is obtained with respect to Equation (8) and the second
derivative is defined by Equation (9).

g_i = \partial_{\hat{y}^{t-1}} \, l(y_i, \hat{y}_i^{t-1})    (8)

h_i = \partial^2_{\hat{y}^{t-1}} \, l(y_i, \hat{y}_i^{t-1})    (9)

The combination of Equations (6), (8), and (9) allows the forming of Equation (7);
and after determining the derivative, the loss function is determined by Equation (10),
while the weight values are obtained by Equation (11).

obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum g_i\right)^2}{\sum h_i + \lambda} + \gamma \cdot T    (10)

w_j^{*} = -\frac{\sum g_i}{\sum h_i + \lambda}    (11)
The flexibility of the XGBoost model allows it to be completely adapted to every
particular problem; however, it also means that its hyperparameters must be tuned for
every given task individually. There are many parameters that may be considered for
fine-tuning, and we restricted ourselves to the ones enumerated below (further information
about them and their possible values is offered within Section 5):
• Learning rate,
• Minimum sum of instance weight (hessian) needed in a child,
• Subsample ratio of the training instances,
• Subsample ratio of columns when constructing each tree,
• The maximum depth of a tree,
• The minimum loss reduction required to make a further partition on a leaf node of
the tree.
The hyperparameter optimization task refers to the problem of choosing the optimal
values for the particular problem that needs to be solved, and it is regarded as an NP-hard
challenge. If this task is executed manually, through trial and error, it would require an
impractical amount of time and resources to be solved.

3. Proposed Method
This section first gives details of the SCA metaheuristics. Afterward, the observed
shortcomings of its basic version are elaborated. Finally, details of the proposed method
that overcomes the deficiencies of the basic SCA are provided.

3.1. The Original SCA Method


The SCA algorithm belongs to a novel group of optimization metaheuristics, inspired
by the mathematical properties of trigonometric functions [38]. Sine and cosine functions
are responsible for updating the positions of the solutions within the populace, making
them oscillate in the proximity of the best solution. As both functions are returning values
in the range [−1, 1], they make sure that the solutions fluctuate. At the beginning of the
algorithm, during the initialization phase, a defined number of candidate solutions are
generated in an arbitrary fashion within the limits of the search domain. The exploration
and exploitation processes are driven by random adjustable control variables during the
algorithm’s execution.
The procedure of updating the individuals’ positions (which, in our particular case,
encode the values for the parameters considered for tuning) is executed in every round
by utilizing Equations (12) and (13), as defined by [38], where X_i^t and X_i^{t+1} represent the current individual's location in the i-th dimension at the t-th and (t+1)-th rounds; r_1 to r_3 are pseudo-randomly produced control values, while P_i^* denotes the destination point's
location (the latest best approximation of the optimal value) within the i-th dimension.

X_i^{t+1} = X_i^t + r_1 \cdot \sin(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^t \right|    (12)

X_i^{t+1} = X_i^t + r_1 \cdot \cos(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^t \right|    (13)


The fourth control variable r4 is utilized to control the search mechanism by switching
between these two equations, as given by the Equation (14), where r4 is a randomly
produced value in the range [0, 1]. The novel values of pseudo-random control parameters
r1−4 are produced for each part of every individual within the populace.
X_i^{t+1} = \begin{cases} X_i^t + r_1 \cdot \sin(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^t \right|, & r_4 < 0.5 \\ X_i^t + r_1 \cdot \cos(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^t \right|, & r_4 \geq 0.5 \end{cases}    (14)
During the execution of the algorithm, the search process is controlled by the four arbitrary control parameters r_1 to r_4, which impact the positions of the current and the best solutions. The balance among the solutions is necessary for efficient convergence to the global optimal solution, and this is granted by adaptively changing the functions' range.
The characteristic of both the sine and cosine functions is that they exhibit a cyclic
pattern, allowing for repositioning in the proximity of the solution, and therefore, granting
exploitation. The changing of the range of both functions allows for a search outside of the
dedicated destinations. Additionally, each individual is required not to overlap its position
with other individuals in the population.
To improve the quality of randomness, the control parameter r2 is being produced in
the range [0, 2π ], therefore guaranteeing the exploration. The balance of the exploration
and exploitation processes is controlled by the Equation (15), where t denotes the current
round of execution and T denotes the maximum number of rounds in a single run, while a
represents a constant value.
r_1 = a - t \cdot \frac{a}{T}    (15)
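The following Python sketch (not the authors' implementation; the population shapes, the constant a = 2, and the r_3 range [0, 2] follow the original SCA paper and are assumptions here) shows how Equations (12)-(15) drive one SCA position update.

import numpy as np

def sca_step(X, best, t, T, a=2.0):
    """One SCA iteration: X is the (N, D) population, best is the current destination point P*."""
    r1 = a - t * (a / T)                                      # Equation (15): decreasing amplitude
    N, D = X.shape
    r2 = np.random.uniform(0, 2 * np.pi, (N, D))
    r3 = np.random.uniform(0, 2, (N, D))
    r4 = np.random.uniform(0, 1, (N, D))
    sin_move = X + r1 * np.sin(r2) * np.abs(r3 * best - X)    # Equation (12)
    cos_move = X + r1 * np.cos(r2) * np.abs(r3 * best - X)    # Equation (13)
    return np.where(r4 < 0.5, sin_move, cos_move)             # Equation (14): switch by r4

# Example: update a population of 40 six-dimensional solutions at round t = 10 of T = 500
X = np.random.uniform(-6, 6, (40, 6))
best = X[0].copy()
X = sca_step(X, best, t=10, T=500)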

3.2. Limitation of Basic SCA and Proposed Improvements


The original version of SCA metaheuristics is known to be relatively simple as it does
not include many control parameters, but it is still capable of achieving an outstanding
level of performance for both the bound-constrained and constrained benchmarks [38]. It
has also been employed for tackling numerous real-world challenges recently [72].
Although the original SCA exhibits excellent exploitation and exploration capabilities,
extensive experiments on both the benchmark functions and practical problems have
empirically shown that in some runs of the basic algorithm, the convergence to the optimal
search region happens in later rounds of execution, not leaving enough iterations for the
algorithm to execute a fine-grained exploitation. The reason for this behavior is found in the
fundamental search equation (Equation (14)), which executes the sine and cosine functions,
and it directs the search in the direction of the most recent approximation of the optimal
(P_i^*) for every individual's parameter i. Consequently, although the original SCA's search
performs an exploitation very efficiently, there is still enough room for enhancements.

3.3. Diversity-Oriented Sine Cosine Algorithm (DOSCA)


As discussed earlier, the basic implementation of the SCA algorithm tends to converge
to the sub-optimal solutions in the early rounds of some runs, having a significant impact
on the algorithm’s performance level, as the correct search region is found late and there is
no time for fine-tuned exploitation. Researchers have recently proposed several options to
tackle this drawback, such as [58,59,73,74].
This paper suggests an improvement of the SCA through a definition of the proper
population diversity procedure during the initialization phase. Moreover, the proposed
improved SCA tries to keep the population diversity over the course of the complete
execution of the algorithm by incorporating two additional procedures:
• A new initialization procedure to provide the best possible start of the run.
• A system that keeps population diversity control during the entire run.

3.3.1. A Novel Initialization Procedure


The initialization procedure used to produce the starting population of individuals for
this research is provided in Equation (16), where x_{i,j} denotes the j-th variable of the i-th individual, and lb_j and ub_j determine the lower and upper limits for variable j, respectively. Moreover, ψ denotes a pseudo-random value drawn from the normal distribution in the range [0, 1].

x_{i,j} = lb_j + \psi \cdot (ub_j - lb_j)    (16)
It is also important to mention that some recent research papers [75] noted that large domains of the search space can be mapped if the quasi-reflection-based learning (QRL) mechanism is applied to the initial solutions produced by Equation (16).
This procedure generates a quasi-reflexive-opposite component x_j^{qr} for each individual's parameter j (x_j), according to Equation (17), where the rnd procedure is used to choose pseudo-random values within the range [(lb_j + ub_j)/2, x_j].

X_j^{qr} = \mathrm{rnd}\left(\frac{lb_j + ub_j}{2},\; x_j\right)    (17)

The proposed initialization mechanism that utilizes QRL does not increase the com-
plexity of the algorithm with respect to the fitness function evaluations (FFEs), as at first,
only N/2 solutions are initialized, where N represents the count of the solutions in the
population. This mechanism is given in Algorithm 1.

Algorithm 1 QRL-based initialization procedure pseudo-code


Step 1: Produce population Pinit of N/2 solutions by applying Equation (16)
Step 2: Produce QRL population Pqr based on Pinit by applying Equation (17)
Step 3: Produce the starting population P by uniting Pinit and Pqr (P ∪ Pqr )
Step 4: Obtain fitness value for every individual from P
Step 5: Sort all individuals within P in terms of fitness value

The proposed procedure provides better diversity in the initial population, which
subsequently leads to a boost to the search phase, as can be observed in the quality of the
results, in the experimental Results section.
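A compact Python sketch of the QRL-based initialization from Algorithm 1 (a re-implementation following Equations (16) and (17), not the authors' code; the toy fitness function and the minimization assumption are illustrative):

import numpy as np

def qrl_initialize(N, lb, ub, fitness):
    """Algorithm 1: N/2 random solutions (Eq. (16)) plus their quasi-reflexive opposites (Eq. (17))."""
    D = len(lb)
    # Step 1: N/2 solutions sampled inside [lb, ub]
    P_init = lb + np.random.rand(N // 2, D) * (ub - lb)
    # Step 2: quasi-reflexive opposites, drawn between the domain midpoint and the original solution
    mid = (lb + ub) / 2.0
    low = np.minimum(mid, P_init)
    high = np.maximum(mid, P_init)
    P_qr = low + np.random.rand(N // 2, D) * (high - low)
    # Steps 3-5: unite both halves, evaluate, and sort by fitness (minimization assumed)
    P = np.vstack([P_init, P_qr])
    order = np.argsort([fitness(x) for x in P])
    return P[order]

# Illustrative usage with a toy sphere fitness
lb, ub = np.full(6, -6.0), np.full(6, 6.0)
pop = qrl_initialize(40, lb, ub, fitness=lambda x: np.sum(x ** 2))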

3.3.2. Procedure for Keeping the Population Diversity


The procedure for population diversification can be used for monitoring the conver-
gence and/or divergence speed throughout the search phase, as discussed by [76]. This
research utilizes the population diversity measure, called the L1 norm proposed in [76],
which comprises diversities over two properties—the elements (individuals) and the di-
mensionality of the problem. As discussed in [76], the dimension-related measure of the L1
norm brings important information regarding the algorithm’s search process.
Let m denote the count of individuals in the population, and n the count of dimensions. The L1 norm expression is formulated as defined in Equations (18)–(20), whereby x̄_j represents the average location of the solutions in dimension j; D_j^p denotes the solutions' position diversity vector as the L1 norm, while D^p denotes the diversity value as a scalar, for the entire populace.

\bar{x}_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij}    (18)

D_j^p = \frac{1}{m} \sum_{i=1}^{m} \left| x_{ij} - \bar{x}_j \right|    (19)

D^p = \frac{1}{n} \sum_{j=1}^{n} D_j^p    (20)

In every run of the algorithm, during the initialization phase, where the solutions are
produced by utilizing the standard method given by Equation (16), the diversity of the
population is large. Nevertheless, the diversity will gradually decrease in later rounds
of the run, as the algorithm begins to converge in the proximity of the potential solution.
The L1 measure is utilized to control the population’s diversity, together with the dynamic
threshold parameter Dt .
The suggested diversity-oriented method operates as follows—during the initialization
phase, the initial values of Dt (Dt0 ) are obtained. For every following round, the conditional
statement D P < Dt is evaluated. In case the condition has been evaluated as satisfied,
the population’s diversity is not satisfactory, and nrs of the worst individuals are replaced
with random solutions. Therefore, nrs is an additional control parameter that defines the
quantity of the individuals to be removed and reinitialized with new ones. The equation
that is used to calculate Dt0 is given by Equation (21).
D_t^0 = \sum_{j=1}^{n} \frac{ub_j - lb_j}{2 \cdot n}    (21)

This presumes that the majority of the individuals would be produced in the proximity of the mean of the upper and lower parameter boundaries, as per Equation (16) and Algorithm 1. Additionally, for rounds where it can be anticipated that the population is converging in the direction of the more promising areas, and moving away from the starting value D_t = D_t^0, the D_t is reduced according to Equation (22), where iter and iter + 1 denote the ongoing and following rounds of execution, respectively; further, T determines the maximal repetition count for the execution. That being the case, with every following round, the value of D_t will be dynamically decreased towards the last round, regardless of D^P.

D_{t,iter+1} = D_{t,iter} - D_{t,iter} \cdot \frac{iter}{T}    (22)
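A Python sketch of the diversity-control mechanism (a re-implementation of Equations (18)-(22), not the authors' code; identifying the nrs worst solutions by sorted fitness under a minimization objective is an assumption):

import numpy as np

def population_diversity(X):
    """L1-norm diversity of an (m, n) population: Equations (18)-(20)."""
    mean_pos = X.mean(axis=0)                       # Eq. (18)
    D_j = np.abs(X - mean_pos).mean(axis=0)         # Eq. (19)
    return D_j.mean()                               # Eq. (20)

def initial_threshold(lb, ub):
    """Starting threshold D_t^0: Equation (21)."""
    n = len(lb)
    return np.sum((ub - lb) / (2.0 * n))

def diversity_control(X, fitness_values, Dt, lb, ub, nrs):
    """If diversity drops below Dt, reinitialize the nrs worst solutions using Eq. (16)."""
    if population_diversity(X) < Dt:
        worst = np.argsort(fitness_values)[-nrs:]   # minimization assumed: largest fitness is worst
        X[worst] = lb + np.random.rand(nrs, len(lb)) * (ub - lb)
    return X

def decay_threshold(Dt, it, T):
    """Dynamic threshold decrease per round: Equation (22)."""
    return Dt - Dt * (it / T)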

3.3.3. The Inner Workings and Complexity of the Proposed Method


With regard to the described modifications, a novel algorithm derived from SCA has
been introduced, and is called the diversity-oriented SCA (DOSCA). The pseudo-code for
the novel DOSCA method is presented in Algorithm 2.

Algorithm 2 The DOSCA pseudo-code


Initialize a population of N solutions according to Algorithm 1
Initialize SCA control parameters r1, r2, r3, and r4
Determine values of Dt0 and Dt
Evaluate each of the solutions with respect to the objective function value
while t < T do
Update r1, r2, r3, and r4
Update the position of individuals using Equation (14)
Calculate the value of D P
if (D P < Dt ) then
Replace worst nrs individuals with new ones produced by Equation (16)
end if
Assess the population
Find the current best solution
Increment the iteration counter t
Update Dt by applying Equation (22)
end while
return the best (optimal) solution determined so far

4. Employed Datasets and Data Preprocessing


In this section, the utilized data preprocessing technique is explained first. Afterwards,
the datasets that were used in the research are described in detail.
4.1. Data Preprocessing


Let us define class c j ∈ C = {c1 , . . . , c|C| } as a text classification corresponding to
a document di ∈ D = {d1 , . . . , d| D| }, where C and D are collections of categories and
documents, respectively. An appropriate format should be used to express a message using
preprocessing techniques before it is delivered to a text modeling method. The ground
base of text mining is represented by preprocessing procedures that convert words into understandable vectors. The first sub-procedures in text mining algorithms are stemming, lemmatization, tokenization, pruning, and stop word removal, followed at the end by feature selection and feature extraction. Feature selection is very important for reducing the dimensionality of the final data set.
A function ϕ : D × C → { T, F } returns true (T) if a document di is assigned to a
class c j , and returns false (F) otherwise. In spam detection, we are given a set of emails
D = {d1 , d2 , . . . , dm }, and two classes C = {cspam , clegitimate }. The objective function is to
assign every email to one of the two classes accurately and precisely. This paper presents a novel spam filtering method built around an advanced swarm intelligence metaheuristic, which is an effective optimization method for real-time applications, used during training to prevent convergence to subpar local minima. Swarm intelligence algorithms are inspired by the collective behavior of groups of animals found in nature.
In this study, we assessed the efficacy of the algorithms for spam email filtering using
two publicly available datasets. One is the CSDMC2010 spam corpus, a well-known dataset
for the ICONIP 2010 data mining competition. There are a total of 4327 English email
messages, of which 1378 or 31.85% are classified as spam, and 2949 or 68.15% as legitimate
emails. The labels of the emails are contained in the file SPAMTrain.label, where 0 denotes
a HAM and 1 represents SPAM. The testing dataset contains 4292 messages without known
class labels.
The TurkishEmail dataset is the second one, and it consists of Turkish emails, where
400 or 50% of them are marked as spam; the other half are marked as non-spam emails [77].
Consequently, we define a set of emails D = {d1 , d2 , . . . , dm } and two classes
C = {c-spam, c-non-spam}. The objective function is to relate an email di ∈ D to one
of the two classes. In the algorithm schema, the major steps of the research approach are
listed. The non-informational html tags are removed before the text, and the topic of a
given email is taken from the English and Turkish databases. Tokenization is used; all letters are changed to lowercase and the strings are separated into lists of split strings. The aim
is to develop substrings that are free of punctuation and stop words, with the exception
of the exclamation points, which are often used in spam emails. Two distinct stemming
libraries are employed for the two sets, since Turkish and English word morphologies and
structures differ significantly from one another: the subject is detailed in Section 4.2.
Machine-learning classifiers are applied to the data after the preprocessing step, on the reduced vector representation, and the classifier maps documents to classes, as shown in Equation (23).

\phi(\vec{d}_i, c'_j) = \begin{cases} 1, & \text{if } \vec{d}_i \in c'_j \\ 0, & \text{otherwise} \end{cases}    (23)
Each document di ∈ D is represented by a vector of terms t1 , t2 , . . . , t| B| . t f (t, d)
represents the number of occurrences of the term t in the document d divided by the total
number of terms in d, and it is computed as in (24), where f d (t) denotes the frequency of
the term t in document d. id f (t, D ) verifies the term against the total number of documents,
as in Equation (25), by applying a logarithm to the fraction between the total number of
documents and the number of documents that include that specific term. Finally, t f -id f
(Equation (26)) considers the frequency of a term in a document, but its importance
becomes reduced if t appears in many other documents.

tf(t, d) = \frac{f_d(t)}{\max_{w \in d} f_d(w)}    (24)

idf(t, D) = \ln \frac{|D|}{\left| \{ d \in D : t \in d \} \right|}    (25)

tf\text{-}idf(t, d, D) = tf(t, d) \cdot idf(t, D)    (26)
Feature selection boosts a classifier’s success by removing non-discriminatory phrases
that are present in almost all classifications. It takes a lot of computation time and resources
to train a classification model utilizing all the terms. Reducing computing complexity
and getting rid of superfluous characteristics are the goals. Each email is represented by
a numerical feature vector. The t f -id f procedure [78] proved to be successful for feature
selection, and it is utilized in the current work as well.
After the t f -id f vectors are computed (Equation (26)), we normalize the vectors with
the Euclidean (L2) norm, which is a basic technique, as in Equation (27) [79]. Long docu-
ments make the terms appear more important than they actually are, since their likelihood
of occurrence increases. By taking the document length into account, the normalization
seeks to avoid this bias in lengthy papers. By applying normalizing approaches to t f -id f
vectors, Amayri and Bouguila [80] found that using the L2-norm produced better re-
sults than using the L1-norm, and that normalization can help defend against assaults on
sparse data.
v^{i}_{norm} = \frac{v^{i}}{\sqrt{(v^{i}_1)^2 + (v^{i}_2)^2 + \cdots + (v^{i}_{n_i})^2}} \quad (i = 1 \ldots N)    (27)
The total weight of a phrase is calculated by summing the corresponding normalized
tf-idf weights of the term across all documents. We sorted the overall weights in descending order. We then chose the feature set with the highest overall weight. The criterion for the highest overall weight starts with sorting the weights in descending order for all terms, and then the first half of the items is selected out of the total. Each document was intended to be represented as a feature vector made up of the normalized tf-idf weights of these terms. The feature set (S = t_1, ..., t_n) was formed according to the critical terms, so that the feature selection determines the final set.
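The pipeline described by Equations (24)-(27) can be sketched in Python as follows (a simplified re-implementation for illustration, not the authors' code; the toy corpus is a placeholder, and keeping the k heaviest terms mirrors the selection described above):

import numpy as np
from collections import Counter

def tfidf_matrix(docs):
    """Build the tf-idf matrix following Eqs. (24)-(26) and L2-normalize rows (Eq. (27))."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: j for j, t in enumerate(vocab)}
    n_docs, n_terms = len(docs), len(vocab)
    tf = np.zeros((n_docs, n_terms))
    for i, d in enumerate(docs):
        counts = Counter(d)
        max_f = max(counts.values())
        for t, f in counts.items():
            tf[i, index[t]] = f / max_f                      # Eq. (24): max-normalized term frequency
    df = np.count_nonzero(tf > 0, axis=0)
    idf = np.log(n_docs / df)                                # Eq. (25)
    tfidf = tf * idf                                         # Eq. (26)
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    return tfidf / np.where(norms == 0, 1, norms), vocab     # Eq. (27): L2 row normalization

def select_top_features(tfidf, vocab, k):
    """Rank terms by their summed normalized weights and keep the k heaviest ones."""
    total_weight = tfidf.sum(axis=0)
    top = np.argsort(total_weight)[::-1][:k]
    return tfidf[:, top], [vocab[j] for j in top]

# Illustrative usage on already tokenized/stemmed emails
docs = [["free", "money", "offer"], ["meeting", "tomorrow"], ["free", "offer", "click"]]
X, vocab = tfidf_matrix(docs)
X_top, selected_terms = select_top_features(X, vocab, k=2)   # k would be 500 or 1000 in this study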

4.2. Dataset Details and Basic Exploratory Data Analysis


We provide an exploratory analysis for both datasets. According to the counting
of distinct terms, there are 82,148 and 25,650 different terms for the CSDMC2010 and
TurkishEmail datasets, respectively. A further direction of exploratory analysis is the
evaluation of sparsity and class imbalance. The latter is often presented through an imbalance ratio. This ratio is calculated by dividing the number of emails that are not spam by the number of spam messages. The TurkishEmail and CSDMC2010 datasets have
imbalance ratios of 1.51 and 2.14, respectively. The TurkishEmail is a more balanced set
than CSDMC2010, but they are both rather unbalanced. For the Turkish dataset, in [4] the
dataset was perfectly balanced, but according to our preprocessing, the ratio between the
two classes was different.
For the evaluation of the sparsity, the feature vector size was 1000, and the degrees
of sparsity of the TurkishEmail and of the CSDMC2010 datasets were 90.02% and 90.48%,
respectively. Consequently, the two datasets were scarce, as shown by the sparsity per-
centages. This is a brief explanation of the ground truth about the CSDMC2010 and
TurkishEmail datasets. An exploratory analysis in Table 2 represents the main frequency
statistics of the trained bundles.
Table 2. Frequency statistics for the two datasets.

                                                                  CSDMC2010 Dataset   TurkishEmail Dataset
len(tokens)                                                       1,574,504           289,598
len(distinct tokens)                                              90,392              46,529
len(not stopwords)                                                47,385              30,808
len(tokens when the string consists of alphabetic characters)    47,533              30,861
len(stemming)                                                     35,652              20,195
len(dictionary tfidf for vectorization)                           35,617              20,138
Shape(dataframe)                                                  (4327, 35,617)      (826, 20,138)

As a stemmer for the CSDMC2010 dataset, PorterStemmer is used, since this is a commonly used Python function for these types of data. As a stemmer for the TurkishEmail dataset, TurkishStemmer is used. It is available as a Python library and is very often used in machine learning and natural language processing applications (https://kandi.openweaver.com/python/otuncelli/turkish-stemmer-python (accessed on 17 July 2022)). A complexity test that has been applied is the Fraction of Borderline Points. This measure was proposed by Friedman for determining whether two multivariate samples come from the same distribution. A percentage of points is given while connecting two opposite classes from the training samples. This measure counts the number of boundary points, where each of these points is a case related to a different class. The final value is normalized, so the result ranges from 0 to 1. A result near 0 implies that the data are separable, and a value close to 1 indicates that the data are not separable. The measure is sensitive to imbalanced data and to the separability of the classes.
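A minimal sketch of how the two stemmers could be applied per language (illustrative only; the class and package names are assumed from the nltk and turkish-stemmer-python libraries referenced above, and the sample tokens are made up):

from nltk.stem import PorterStemmer          # English stemmer used for CSDMC2010
from TurkishStemmer import TurkishStemmer    # Turkish stemmer used for TurkishEmail

english_stemmer = PorterStemmer()
turkish_stemmer = TurkishStemmer()

english_tokens = ["winning", "offers", "limited"]
turkish_tokens = ["teknolojisi", "kurumlar"]

english_stems = [english_stemmer.stem(t) for t in english_tokens]
turkish_stems = [turkish_stemmer.stem(t) for t in turkish_tokens]
print(english_stems, turkish_stems)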
The WordCloud outputs for English spam emails, non-spam emails, and the whole dataframe are shown in Figure 1.

Figure 1. WordCloud outputs for English spam (top left), non-spam emails (top right), and the whole dataframe (bottom).

The WordCloud outputs for Turkish spam emails, non-spam emails, and the whole dataframe are shown in Figure 2.
The class distribution for both the English and Turkish datasets is shown in Figure 3.
The English dataset contains around 68% of non-spam and 32% of spam emails, while the
Turkish dataset comprises around 60% non-spam, and 40% of spam emails, respectively.
The procedure of feature selection was established according to the previous data preprocessing steps: cleaning, tokenization, stemming, stop word removal, pruning, bag of words, and tf-idf vectorization. The tf-idf vectorization leads to weighted values for every word, sorted in descending order. The first 500 words from the ordered set are selected, and a data frame is created for training with 500 features. The data frame for training with 1000 features is established in a similar manner. The weighted values for both datasets are shown in Figure 4.
Figure 2. WordCloud outputs for Turkish spam (top left), non-spam emails (top right), and the whole dataframe (bottom).

Figure 3. Class distribution for English and Turkish datasets.


Figure 4. Weights for English and Turkish datasets.

5. Experimental Setup and Results


This section describes the experimental setup for both the LR and XGBoost experi-
ments. Later on, the experimental outcomes are given and discussed.

5.1. Basic Experimental Setup


The introduced DOSCA algorithm has been utilized to train the LR model and to optimize the XGBoost hyperparameters for the case of spam detection. Flat swarm encoding has been employed, where each individual in the population comprises all the parameters being optimized. Since the LR and XGBoost models are different in nature, they were tested with separate global experimental control parameters, described below.
For the LR experiments, DOSCA is used to perform the training, and each solution represents the LR coefficients and intercept; therefore, the solution length D is determined as D = n_f + 1, where n_f is the number of features (coefficients) and the intercept accounts for one additional position. The coefficients' boundaries were determined empirically for the English and Turkish datasets with 500 and 1000 features (detailed descriptions of both datasets are given in Section 4): the referred paper [4] set the boundaries to [−8, 8], while this research utilizes a range of [−6, 6] for both datasets. The intercept boundaries were not disclosed in [4], but were established empirically for the purpose of this research through a trial and error process, and were set to [0, 1] and [−1, 1] for the English and Turkish datasets, respectively. All observed variables in this scenario are continuous.
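A minimal sketch of how a candidate solution can be decoded into an LR model and assigned a fitness value is shown below; the function name is illustrative, and the direct sigmoid computation stands in for whatever LR implementation the framework actually uses.

import numpy as np

def decode_and_score_lr(solution, X_train, y_train):
    """Interpret a candidate solution as LR parameters and return its
    fitness, i.e., the classification error rate on the training set."""
    coefficients = np.asarray(solution[:-1])        # first n_f positions
    intercept = solution[-1]                        # last position
    logits = X_train @ coefficients + intercept
    probabilities = 1.0 / (1.0 + np.exp(-logits))   # sigmoid
    predictions = (probabilities >= 0.5).astype(int)
    return np.mean(predictions != y_train)          # error rate in [0, 1]

# boundaries used when DOSCA generates/updates solutions (English dataset, 500 features)
n_f = 500
lower = np.array([-6.0] * n_f + [0.0])   # coefficients in [-6, 6], intercept in [0, 1]
upper = np.array([6.0] * n_f + [1.0])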
In the case of the LR experiments, specific LR parameters (such as C, regularization, and so on) were set to the default values from the scikit-learn library, since the focus of this experiment was on model training. All algorithms were tested with 40 individuals in the population, and the maximum number of iterations was set to T = 500; the paper [4] that inspired the LR experiments utilized a greater number of iterations and solutions. In this way, all algorithms were treated equally in terms of fitness function evaluations (FFEs), computed as N + N · T. However, FA was tested with 20 solutions in the population, since the worst-case complexity of FA is N² · T, while the average is N²/2 · T. Each algorithm was executed in 15 independent runs (R = 15).
For the XGBoost simulations, the solution vector's length was determined by the number of hyperparameters being optimized, in this case D = 6, since a total of six parameters are tuned by DOSCA. The XGBoost hyperparameters being tuned and their respective constraints are:
• Learning rate (η), boundaries: [0.1, 0.9], type: continuous,
• min_child_weight, boundaries: [0, 10], type: continuous,
• subsample, boundaries: [0.01, 1], type: continuous,
• colsample_bytree, boundaries: [0.01, 1], type: continuous,
• max_depth, boundaries: [3, 10], type: integer,
• gamma, boundaries: [0, 0.5], type: continuous.


All other XGBoost hyperparameters were set to the default values from the scikit-learn library.
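A minimal sketch of how a six-dimensional candidate solution could be decoded into an XGBoost configuration and scored is shown below; the XGBClassifier wrapper and the function name are illustrative assumptions rather than the authors' exact implementation.

import numpy as np
from xgboost import XGBClassifier

def decode_and_score_xgb(solution, X_train, y_train):
    """Map a candidate vector onto the six tuned hyperparameters and
    return the training-set classification error rate as the fitness."""
    model = XGBClassifier(
        learning_rate=float(solution[0]),       # [0.1, 0.9]
        min_child_weight=float(solution[1]),    # [0, 10]
        subsample=float(solution[2]),           # [0.01, 1]
        colsample_bytree=float(solution[3]),    # [0.01, 1]
        max_depth=int(round(solution[4])),      # [3, 10], integer
        gamma=float(solution[5]),               # [0, 0.5]
    )
    model.fit(X_train, y_train)
    return np.mean(model.predict(X_train) != y_train)   # error rate to minimize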
In the case of the XGBoost experiment, the algorithms are executed with 10 solutions in the population and 30 iterations, over 15 independent runs. The FA algorithm is tested with N = 5 solutions in the population, in order to provide the same number of FFEs.
The fitness of a swarm individual (a simple classification error rate) is calculated on the training set in both experiments, the one with LR and the one with XGBoost. After all iterations in a run are completed, the best individual (the one with the best fitness on the training set) is validated against the testing set, and this represents the final result of the run.
The performance of the suggested DOSCA algorithm with respect to convergence speed and general optimization capabilities was evaluated. The experimental outcomes were compared with the results obtained by seven other state-of-the-art metaheuristic algorithms employed in the same experimental setup, namely the original implementation of SCA [38], ABC [34], FA [35], BA [81], HHO [82], SNS [83], and TLB [84]. These competitor metaheuristics were implemented separately by the authors for comparison, and their control parameter setups were taken from the original publications.
For an easier tracking of the results, the following acronyms were used—for the LR
experiments, all metaheuristics were assigned the LR prefix (for example, LR-DOSCA, LR-
ABC, etc.), and for the XGBoost experiments, the XGB prefix was assigned (XGB-DOSCA,
XGB-HHO, etc.).
The workflow of the proposed simulation is provided in Figure 5.

Figure 5. Workflow of the conducted simulations: text preprocessing of the input datasets (CSDMC2010 and Turkish), an 80%/20% train/test split, DOSCA population initialization, the SCA search process with the population diversity mechanism, evaluation of the solutions on the training set over T iterations, and evaluation of the best solution of each of the R runs on the test set, with metrics calculated for all runs and the best solution visualized.



5.2. Obtained Results and Comparative Analysis


Table 3 depicts the overall metrics for the LR experiments achieved by all the metaheuristic algorithms, on the English 500 and 1000, and on the Turkish 500 and 1000 datasets, respectively. It can be seen that the proposed LR-DOSCA metaheuristic approach obtained the best results (best metric) in all four experiments.

Table 3. Overall metrics for LR results in terms of classification error. Best results in each row are
in bold.

Method LR-DOSCA LR-SCA LR-ABC LR-FA LR-BA LR-HHO LR-SNS LR-TLB


English 500
Best 0.030023 0.036952 0.039261 0.036952 0.032333 0.039261 0.034642 0.036952
Worst 0.043880 0.046189 0.043880 0.041570 0.043880 0.043880 0.053118 0.046189
Mean 0.036952 0.041570 0.041570 0.039261 0.036374 0.040993 0.043303 0.040993
Median 0.036952 0.041570 0.041570 0.039261 0.034642 0.040416 0.042725 0.040416
Std 0.005164 0.003652 0.002309 0.001633 0.004726 0.001915 0.007000 0.003416
Var 0.000027 0.000013 0.000005 0.000003 0.000022 0.000004 0.000049 0.000012
English 1000
Best 0.025404 0.041570 0.034642 0.034642 0.039261 0.036952 0.039261 0.034642
Worst 0.050808 0.046189 0.055427 0.050808 0.048499 0.046189 0.046189 0.041570
Mean 0.040416 0.043303 0.043303 0.042148 0.044457 0.042148 0.043303 0.039261
Median 0.042725 0.042725 0.041570 0.041570 0.045035 0.042725 0.043880 0.040416
Std 0.009310 0.001915 0.008226 0.005745 0.004123 0.004123 0.002517 0.002829
Var 0.000087 0.000004 0.000068 0.000033 0.000017 0.000017 0.000006 0.000008
Turkish 500
Best 0 0.024096 0.012048 0.024096 0.012048 0.012048 0.012048 0.024096
Worst 0.036145 0.048193 0.048193 0.036145 0.036145 0.036145 0.048193 0.048193
Mean 0.024096 0.036145 0.030120 0.033133 0.024096 0.027108 0.033133 0.033133
Median 0.030120 0.036145 0.030120 0.036145 0.024096 0.030120 0.036145 0.030120
Std 0.014756 0.008519 0.013470 0.005217 0.012048 0.009990 0.013129 0.009990
Var 0.000218 0.000073 0.000181 0.000027 0.000145 0.000100 0.000172 0.000100
Turkish 1000
Best 0 0.012048 0.012048 0.012048 0.012048 0.024096 0.024096 0.024096
Worst 0.036145 0.024096 0.036145 0.036145 0.036145 0.060241 0.048193 0.024096
Mean 0.018072 0.015060 0.024096 0.024096 0.027108 0.039157 0.039157 0.024096
Median 0.018072 0.012048 0.024096 0.024096 0.030120 0.036145 0.042169 0.024096
Std 0.013470 0.005217 0.008519 0.008519 0.009990 0.015651 0.009990 0.000000
Var 0.000181 0.000027 0.000073 0.000073 0.000100 0.000245 0.000100 0.000000

Detailed metrics for the best run in the case of the English and Turkish datasets are given in Tables 4 and 5. Here, it is possible to see the superiority of the LR-DOSCA model, which is most obvious in the case of the Turkish dataset, where the algorithm achieved 100% accuracy.

Table 4. Detailed metrics for LR results and English dataset. Best results in each row are in bold.

LR-DOSCA LR-SCA LR-ABC LR-FA LR-BA LR-HHO LR-SNS LR-TLB


English 500
Accuracy (%) 96.9977 96.3048 96.0739 96.3048 96.7667 96.0739 96.5358 96.3048
Precision 0 0.970000 0.957377 0.966443 0.963455 0.966777 0.976027 0.963576 0.963455
Precision 1 0.969925 0.976563 0.948148 0.962121 0.969697 0.929078 0.969466 0.962121
M.Avg. Precision 0.969976 0.963492 0.960612 0.963030 0.967708 0.961064 0.965453 0.963030
Recall 0 0.986441 0.989831 0.976271 0.983051 0.986441 0.966102 0.986441 0.983051
Recall 1 0.934783 0.905797 0.927536 0.920290 0.927536 0.949275 0.920290 0.920290
M.Avg. Recall 0.969977 0.963048 0.960739 0.963048 0.967667 0.960739 0.965358 0.963048
F1 Score 0 0.978151 0.973333 0.971332 0.973154 0.976510 0.971039 0.974874 0.973154
F1 Score 1 0.952030 0.939850 0.937729 0.940741 0.948148 0.939068 0.944238 0.940741
M.Avg. F1 Score 0.969826 0.962662 0.960623 0.962824 0.967471 0.960850 0.965110 0.962824
English 1000
Accuracy (%) 97.4596 95.8430 96.5358 96.5358 96.0739 96.3048 96.0739 96.5358
Precision 0 0.973333 0.960133 0.969799 0.963576 0.957237 0.966555 0.960265 0.963576
Precision 1 0.977444 0.954545 0.955556 0.969466 0.968992 0.955224 0.961832 0.969466
M.Avg. Precision 0.974643 0.958352 0.965259 0.965453 0.960983 0.962944 0.960764 0.965453
Recall 0 0.989831 0.979661 0.979661 0.986441 0.986441 0.979661 0.983051 0.986441
Recall 1 0.942029 0.913043 0.934783 0.920290 0.905797 0.927536 0.913043 0.920290
M.Avg. Recall 0.974596 0.958430 0.965358 0.965358 0.960739 0.963048 0.960739 0.965358
F1 Score 0 0.981513 0.969799 0.974705 0.974874 0.971619 0.973064 0.971524 0.974874
F1 Score 1 0.959410 0.933333 0.945055 0.944238 0.936330 0.941176 0.936803 0.944238
M.Avg. F1 Score 0.974468 0.958177 0.965255 0.965110 0.960372 0.962901 0.960458 0.965110

Convergence graphs for objective function (error rate), box plots, and violin diagrams
for all observed methods in the case of LR training with the English dataset with 500 and
1000 features, respectively, are given in Figure 6, while those for the Turkish dataset (500
and 1000 features) are given in Figure 7.

Table 5. Detailed metrics for LR results and Turkish dataset. Best results in each row are in bold.

LR-DOSCA LR-SCA LR-ABC LR-FA LR-BA LR-HHO LR-SNS LR-TLB


Turkish 500
Accuracy (%) 100 97.5904 98.7952 97.5904 98.7952 98.7952 98.7952 97.5904
Precision 0 1.00000 0.980000 1.00000 1.00000 0.980392 1.0000 0.980392 0.980000
Precision 1 1.00000 0.969697 0.970588 0.942857 1.00000 0.970588 1.00000 0.969697
M.Avg. Precision 1.00000 0.975904 0.988306 0.977281 0.988188 0.988306 0.988188 0.975904
Recall 0 1.00000 0.980000 0.980000 0.960000 1.00000 0.980000 1.00000 0.980000
Recall 1 1.00000 0.969697 1.00000 1.00000 0.969697 1.00000 0.969697 0.969697
M.Avg. Recall 1.00000 0.975904 0.987952 0.975904 0.987952 0.987952 0.987952 0.975904
F1 Score 0 1.00000 0.980000 0.989899 0.979592 0.990099 0.989899 0.990099 0.980000
F1 Score 1 1.00000 0.969697 0.985075 0.970588 0.984615 0.985075 0.984615 0.969697
M.Avg. F1 Score 1.00000 0.975904 0.987981 0.976012 0.987919 0.987981 0.987919 0.975904
Turkish 1000
Accuracy (%) 100 98.7952 98.7952 98.7952 98.7952 97.5904 97.5904 97.5904
Precision 0 1.00000 0.980392 1.00000 1.00000 1.00000 0.961538 0.980000 0.961538
Precision 1 1.00000 1.00000 0.970588 0.970588 0.970588 1.00000 0.969697 1.00000
M.Avg. Precision 1.00000 0.988188 0.988306 0.988306 0.988306 0.976830 0.975904 0.97683
Recall 0 1.00000 1.00000 0.980000 0.980000 0.980000 1.00000 0.980000 1.00000
Recall 1 1.00000 0.969697 1.00000 1.00000 1.00000 0.939394 0.969697 0.939394
M.Avg. Recall 1.00000 0.987952 0.987952 0.987952 0.987952 0.975904 0.975904 0.975904
F1 Score 0 1.00000 0.990099 0.989899 0.989899 0.989899 0.980392 0.980000 0.980392
F1 Score 1 1.00000 0.984615 0.985075 0.985075 0.985075 0.968750 0.969697 0.968750
M.Avg. F1 Score 1.00000 0.987919 0.987981 0.987981 0.987981 0.975763 0.975904 0.975763

Table 6 presents the overall metrics for the XGBoost experiments achieved by all of the metaheuristic algorithms, on the English 500 and 1000, and the Turkish 500 and 1000 datasets, respectively. It can be seen that the proposed XGBoost-DOSCA metaheuristic approach obtained superior best results in three of the four experiments; in the case of the English 1000 dataset, its best result matches that of several competitors and is slightly exceeded by XGB-TLB.

Table 6. Overall metrics for XGBoost results in terms of classification error. Best results in each row
are in bold.

Method X-DOSCA X-SCA X-ABC X-FA X-BA X-HHO X-SNS X-TLB


English 500
Best 0.013857 0.017321 0.017321 0.015012 0.018476 0.019630 0.020785 0.018476
Worst 0.019630 0.019630 0.023095 0.021940 0.023095 0.020785 0.028868 0.020785
Mean 0.018245 0.019169 0.020554 0.018014 0.019861 0.020092 0.022633 0.020092
Median 0.019630 0.019630 0.020785 0.017321 0.018476 0.019630 0.020785 0.020785
Std 0.002239 0.000924 0.001848 0.002263 0.001848 0.000566 0.003150 0.000924
Var 0.000005 0.000001 0.000003 0.000005 0.000003 0.000000 0.000010 0.000001
English 1000
Best 0.013857 0.013857 0.013857 0.015012 0.017321 0.013857 0.016166 0.012702
Worst 0.016166 0.019630 0.020785 0.021940 0.021940 0.018476 0.020785 0.018476
Mean 0.015012 0.017090 0.018707 0.018245 0.018938 0.016166 0.018245 0.016166
Median 0.015012 0.017321 0.019630 0.018476 0.018476 0.016166 0.018476 0.016166
Std 0.000730 0.001987 0.002572 0.002466 0.001728 0.001633 0.001848 0.001932
Var 0.000001 0.000004 0.000007 0.000006 0.000003 0.000003 0.000003 0.000004
Turkish 500
Best 0.066667 0.078788 0.084848 0.090909 0.084848 0.072727 0.084848 0.090909
Worst 0.090909 0.096970 0.109091 0.103030 0.109091 0.096970 0.103030 0.109091
Mean 0.080000 0.089697 0.093333 0.094545 0.095758 0.084848 0.094545 0.096970
Median 0.084848 0.090909 0.090909 0.090909 0.096970 0.084848 0.096970 0.096970
Std 0.008907 0.005938 0.009071 0.004848 0.008040 0.008571 0.006181 0.006639
Var 0.000079 0.000035 0.000082 0.000024 0.000065 0.000073 0.000038 0.000044
Turkish 1000
Best 0.066667 0.090909 0.084848 0.084848 0.078788 0.072727 0.090909 0.084848
Worst 0.084848 0.103030 0.096970 0.090909 0.096970 0.090909 0.103030 0.103030
Mean 0.076364 0.093333 0.092121 0.088485 0.090909 0.077576 0.096970 0.093333
Median 0.072727 0.090909 0.096970 0.090909 0.090909 0.072727 0.096970 0.090909
Std 0.007273 0.004848 0.005938 0.002969 0.006639 0.007068 0.003833 0.006181
Var 0.000053 0.000024 0.000035 0.000009 0.000044 0.000050 0.000015 0.000038

Figure 6. Convergence graphs, box plots, and violin diagrams for all observed methods and LR on
the English dataset (500 and 1000 features).

Figure 7. Convergence graphs, box plots, and violin diagrams for all observed methods and LR on
the Turkish dataset (500 and 1000 features).

Detailed metrics for the best run in the case of the English and Turkish datasets are given in Tables 7 and 8. Here, it is possible to see the superiority of the XGBoost-DOSCA model, which is most obvious in the case of the Turkish 500 and 1000 datasets, where the algorithm achieved a significant advantage over the other methods. The only case where the proposed method did not reach the best result is English 1000, where the accuracy difference is marginal (on the order of a tenth of a percentage point). This additionally indicates that the proposed method is not the most accurate for all test cases, as the no free lunch theorem also suggests. The statistical tests in the subsequent section indicate, however, that overall, the proposed technique leads to the most accurate results.
To better visualize the results of the XGBoost method and all observed metaheuristics, the box plots and violin diagrams for the English 500 and 1000 datasets are given in Figure 8. The convergence graphs, box plots, and violin diagrams for the Turkish 500 and 1000 datasets are given in Figure 9.

Table 7. Detailed metrics for XGBoost results and English dataset. Best results in each row are in bold.

X-DOSCA X-SCA X-ABC X-FA X-BA X-HHO X-SNS X-TLB


English 500
Accuracy (%) 98.6143 98.2679 98.2679 98.4988 98.1524 98.037 97.9215 98.1524
Precision 0 0.984899 0.981575 0.983193 0.983250 0.979933 0.978297 0.981481 0.981544
Precision 1 0.988889 0.985130 0.981550 0.988848 0.985075 0.985019 0.974265 0.981481
M.Avg. Precision 0.986171 0.982708 0.982669 0.985034 0.981572 0.980439 0.979181 0.981524
Recall 0 0.994915 0.993220 0.991525 0.994915 0.993220 0.993220 0.988136 0.991525
Recall 1 0.967391 0.960145 0.963768 0.963768 0.956522 0.952899 0.960145 0.960145
M.Avg. Recall 0.986143 0.982679 0.982679 0.984988 0.981524 0.980370 0.979215 0.981524
F1 Score 0 0.989882 0.987363 0.987342 0.989048 0.986532 0.985702 0.984797 0.986509
F1 Score 1 0.978022 0.972477 0.972578 0.976147 0.970588 0.968692 0.967153 0.970696
M.Avg. F1 Score 0.986102 0.982619 0.982636 0.984936 0.981451 0.980281 0.979174 0.981469
English 1000
Accuracy (%) 98.6143 98.6143 98.6143 98.4988 98.2679 98.6143 98.3834 98.7298
Precision 0 0.986532 0.984899 0.986532 0.983250 0.986464 0.986532 0.980000 0.986555
Precision 1 0.985294 0.988889 0.985294 0.988848 0.974545 0.985294 0.992481 0.988930
M.Avg. Precision 0.986137 0.986171 0.986137 0.985034 0.982665 0.986137 0.983978 0.987312
Recall 0 0.993220 0.994915 0.993220 0.994915 0.988136 0.993220 0.996610 0.994915
Recall 1 0.971014 0.967391 0.971014 0.963768 0.971014 0.971014 0.956522 0.971014
M.Avg. Recall 0.986143 0.986143 0.986143 0.984988 0.982679 0.986143 0.983834 0.987298
F1 Score 0 0.989865 0.989882 0.989865 0.989048 0.987299 0.989865 0.988235 0.990717
F1 Score 1 0.978102 0.978022 0.978102 0.976147 0.972777 0.978102 0.974170 0.979890
M.Avg. F1 Score 0.986116 0.986102 0.986116 0.984936 0.982671 0.986116 0.983753 0.987267

Table 8. Detailed metrics for XGBoost results and Turkish dataset. Best results in each row are in bold.

X-DOSCA X-SCA X-ABC X-FA X-BA X-HHO X-SNS X-TLB


Turkish 500
Accuracy (%) 93.3333 92.1212 91.5152 53.3333 91.5152 92.7273 91.5152 90.9091
Precision 0 0.923077 0.921569 0.897196 0.610000 0.920792 0.922330 0.920792 0.903846
Precision 1 0.950820 0.920635 0.948276 0.415385 0.906250 0.935484 0.906250 0.918033
M.Avg. Precision 0.934174 0.921195 0.917628 0.532154 0.914975 0.927592 0.914975 0.909521
Recall 0 0.969697 0.949495 0.969697 0.616162 0.939394 0.959596 0.939394 0.949495
Recall 1 0.878788 0.878788 0.833333 0.409091 0.878788 0.878788 0.878788 0.848485
M.Avg. Recall 0.933333 0.921212 0.915152 0.533333 0.915152 0.927273 0.915152 0.909091
F1 Score 0 0.945813 0.935323 0.932039 0.613065 0.930000 0.940594 0.930000 0.926108
F1 Score 1 0.913386 0.899225 0.887097 0.412214 0.892308 0.906250 0.892308 0.881890
M.Avg. F1 Score 0.932842 0.920884 0.914062 0.532725 0.914923 0.926856 0.914923 0.908421
Turkish 1000
Accuracy (%) 93.3333 90.9091 91.5152 91.5152 92.1212 92.7273 90.9091 91.5152
Precision 0 0.915094 0.896226 0.912621 0.889908 0.913462 0.914286 0.896226 0.912621
Precision 1 0.966102 0.932203 0.919355 0.964286 0.934426 0.950000 0.932203 0.919355
M.Avg. Precision 0.935497 0.910617 0.915315 0.919659 0.921847 0.928571 0.910617 0.915315
Recall 0 0.979798 0.959596 0.949495 0.979798 0.959596 0.969697 0.959596 0.949495
Recall 1 0.863636 0.833333 0.863636 0.818182 0.863636 0.863636 0.833333 0.863636
M.Avg. Recall 0.933333 0.909091 0.915152 0.915152 0.921212 0.927273 0.909091 0.915152
F1 Score 0 0.946341 0.926829 0.930693 0.932692 0.935961 0.941176 0.926829 0.930693
F1 Score 1 0.912000 0.880000 0.890625 0.885246 0.897638 0.904762 0.880000 0.890625
M.Avg. F1 Score 0.932605 0.908098 0.914666 0.913714 0.920631 0.926611 0.908098 0.914666


Figure 8. Box plots and violin diagrams for error rate for all observed methods and XGBoost on
English dataset (500 and 1000 features).

Figure 9. Convergence graphs, box plots, and violin diagrams for error rate for all observed methods
and XGBoost on Turkish dataset (500 and 1000 features).

Finally, to better visualize the performance of both the LR-DOSCA and XGBoost-DOSCA methods, confusion matrices, precision-recall (PR) curves, and receiver operating characteristic one-vs-one (ROC OvO) curves are shown in Figure 10.

Figure 10. The LR-DOSCA and XGBoost-DOSCA visualizations of the obtained confusion matrices, PR curves, and ROC OvO curves for selected cases: LR-DOSCA on English 500 and Turkish 1000 features, and XGBoost-DOSCA on English 1000 and Turkish 500 features.

5.3. Validation of the DOSCA Improvements via Statistical Tests


In order to confirm the improved performance of the DOSCA algorithm compared to its opponents, further statistical analyses are necessary to show whether or not the obtained improvements are statistically significant. According to the relevant literature [85–87], statistical tests in this case can be executed by taking the mean values of the measured objectives over multiple independent runs to construct a results sample for each approach. A possible disadvantage of this method occurs if the measured variable has outliers or does not follow a normal distribution, which can lead to false or misleading conclusions. The usage of an average objective function value for the purpose of statistical tests is still an open issue [87].
Therefore, to check whether or not the usage of the mean values is safe, we used the objective function (classification error rate) of each run to construct a data sample for each method–problem instance pair. Further, the Shapiro–Wilk test for single-problem analysis [88] was conducted, and for each method–problem pair, all generated p-values were larger than the threshold α = 0.05, implying that the data samples come from a normal distribution. Consequently, it was concluded that the mean values can be used for further analysis.

Later, according to [89], we checked the conditions for the safe use of parametric tests, which include independence, normality, and homoscedasticity of the variances of the data. With a unique pseudo-random number seed as a starting point, each run was executed independently, which means that the condition of independence was satisfied. The Shapiro–Wilk test [88] for multiple-problem analysis was then used again to check the fulfillment of the normality condition, and these results are shown in Table 9.

Table 9. Shapiro–Wilk test results for multiple methods multiple problem analysis.

DOSCA SCA ABC FA BA HHO SNS TLB


p-value 0.015682 0.016325 0.012733 0.025307 0.030842 0.013288 0.029549 0.035672

Finally, to check homoscedasticity based on means, Levene's test [90] is employed, and a p-value of 0.55 is obtained, which yields the conclusion that homoscedasticity is satisfied. However, as can be seen from Table 9, all of the obtained p-values from the Shapiro–Wilk test are smaller than α = 0.05, which means that the normality condition for the safe use of parametric tests is not satisfied, and so we proceeded with non-parametric tests. In the following non-parametric tests, the DOSCA method proposed in this research is established as the control method.
In order to verify the significance of the proposed DOSCA performance over the other algorithms, the Friedman test [91,92], a two-way analysis of variance by ranks, was employed. The results of this test are presented in Table 10. Furthermore, the Friedman aligned test was also conducted, and these findings are shown in Table 11.
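A minimal sketch of how these checks can be reproduced with SciPy is given below; the error matrix is a random placeholder standing in for the per-run samples collected in this study, so only the function calls, not the values, mirror the procedure.

import numpy as np
from scipy import stats

# per-problem mean error rates, one column per algorithm,
# one row per method-problem instance (placeholder values)
rng = np.random.default_rng(0)
errors = rng.uniform(0.01, 0.05, size=(8, 8))   # 8 problems x 8 algorithms

# Shapiro-Wilk normality check for one algorithm's sample
w, p_shapiro = stats.shapiro(errors[:, 0])

# Levene's test for homoscedasticity (equal variances) based on means
levene_stat, p_levene = stats.levene(*[errors[:, j] for j in range(8)], center="mean")

# Friedman test: two-way analysis of variance by ranks over the 8 algorithms
chi2, p_friedman = stats.friedmanchisquare(*[errors[:, j] for j in range(8)])

print(p_shapiro, p_levene, p_friedman)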

Table 10. Friedman statistical test results.

Functions DOSCA SCA ABC FA BA HHO SNS TLB


LR English 500 2 6.5 6.5 3 1 4.5 8 4.5
LR English 1000 1 5.5 5.5 2.5 8 2.5 5.5 5.5
LR Turkish 500 1.5 8 4 6 1.5 3 6 6
LR Turkish 1000 2 1 4 4 6 7.5 7.5 4
XGBoost English 500 2 3 7 1 4 5.5 8 5.5
XGBoost English 1000 1 4 7 5.5 8 2.5 5.5 2.5
XGBoost Turkish 500 1 3 4 5.5 7 2 5.5 8
XGBoost Turkish 1000 1 6.5 5 3 4 2 8 6.5
Average Ranking 1.44 4.69 5.38 3.81 4.94 3.69 6.75 5.31
Rank 1 4 7 3 5 2 8 6

Table 11. Friedman aligned statistical test results.

Functions DOSCA SCA ABC FA BA HHO SNS TLB


LR English 500 10 43.5 43.5 22 9 38.5 53 38.5
LR English 1000 12 33.5 33.5 24.5 46 24.5 33.5 33.5
LR Turkish 500 7.5 61 28 51 7.5 11 51 51
LR Turkish 1000 5 2 15 15 37 63.5 63.5 15
XGBoost English 500 18 23 36 17 29 30.5 49 30.5
XGBoost English 1000 13 26 42 40.5 45 20.5 40.5 20.5
XGBoost Turkish 500 3 19 47 54.5 57 6 54.5 60
XGBoost Turkish 1000 1 58.5 56 27 48 4 62 58.5
Average Ranking 8.69 33.31 37.63 31.44 34.81 24.81 50.88 38.44
Rank 1 4 6 3 5 2 8 7

From the findings shown in Table 10, the presented DOSCA method statistically outperformed the other algorithms to which it was compared, achieving an average rank value of 1.44. The second-best result belongs to the HHO method, with an obtained average rank of 3.69. The original SCA method accomplished an average ranking of 4.69, which provides proof of the superiority of the proposed DOSCA over the original method. Moreover, the Friedman statistic (χ²r = 21.97) is greater than the χ² critical value with seven degrees of freedom (14.067) at a significance level of α = 0.05, and the Friedman p-value is 1.89 × 10−13, inferring that significant differences in results between the different methods exist. Consequently, it is possible to reject the null hypothesis (H0) and state that the performance obtained by the proposed DOSCA was significantly different from that of the other competitors. Similar conclusions can be derived from the Friedman aligned test results.
As research [93] indicates that the Iman and Davenport test [94] can give more precise results than the χ² statistic, this test was performed as well. The result of this test is 4.85, which is significantly larger than the critical value of the F-distribution (2.20). Additionally, the Iman and Davenport p-value is 4.43 × 10−2, which is smaller than the level of significance. From all this, it can be concluded that this test also rejects H0.
Since both tests rejected the null hypothesis, the non-parametric post hoc Holm's step-down procedure was applied. The outcomes of the test are presented in Table 12. In the table, the compared methods are sorted according to their p-values in ascending order (corresponding to rank) and evaluated against α/(k − i), where k denotes the degrees of freedom (k = 7 in this research) and i is the position of the algorithm in the sorted order. In this experiment, α values of 0.05 and 0.1 are used. The results shown in Table 12 clearly indicate that the suggested DOSCA significantly outperformed all compared methods at both significance levels.
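A minimal sketch of this step-down comparison is given below; the p-values are the ones reported in Table 12, while the function itself is an illustrative reconstruction rather than the authors' code.

def holm_step_down(p_values, alpha, k=7):
    """Compare the ascending-sorted p-values against alpha / (k - i);
    once a comparison fails, all remaining hypotheses are retained."""
    decisions = []
    rejecting = True
    for i, p in enumerate(sorted(p_values)):
        threshold = alpha / (k - i)
        rejecting = rejecting and (p < threshold)
        decisions.append((p, threshold, rejecting))   # True = reject H0
    return decisions

# p-values of DOSCA vs. SNS, ABC, TLB, BA, SCA, FA, HHO from Table 12
p_vals = [7.20e-6, 6.52e-4, 7.78e-4, 2.13e-3, 3.98e-3, 2.62e-2, 3.31e-2]
for p, thr, reject in holm_step_down(p_vals, alpha=0.1):   # repeated for alpha = 0.05
    print(f"p = {p:.2e}   threshold = {thr:.6f}   reject H0: {reject}")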

Table 12. Holm’s step-down procedure statistical test results.

Comparison p_Values Ranking Alpha = 0.05 Alpha = 0.1 H1 H2


DOSCA vs. SNS 7.20 × 10−6 0 0.007143 0.014286 1 1
DOSCA vs. ABC 6.52 × 10−4 1 0.008333 0.016667 1 1
DOSCA vs. TLB 7.78 × 10−4 2 0.010000 0.020000 1 1
DOSCA vs. BA 2.13 × 10−3 3 0.012500 0.025000 1 1
DOSCA vs. SCA 3.98 × 10−3 4 0.016667 0.033333 1 1
DOSCA vs. FA 2.62 × 10−2 5 0.025000 0.050000 1 1
DOSCA vs. HHO 3.31 × 10−2 6 0.050000 0.100000 1 1

6. Conclusions
This manuscript presented an innovative version of the SCA metaheuristics, implemented in such a way as to tackle the drawbacks of the original SCA variant. The novel algorithm was given the name diversity-oriented SCA (DOSCA), and it was implemented as part of a hybrid machine learning framework. The suggested DOSCA was employed for LR training and for XGBoost hyperparameter optimization.
The goal of the presented research is to try to further enhance spam email filtering
techniques based on intelligent algorithms. Therefore, both models were evaluated on two
benchmark spam email datasets, CSDMC2010 for the English language and one dataset for
the Turkish language.
The experimental outcomes of the DOSCA algorithm were put into comparison with
seven other metaheuristics implemented in the same experimental framework. The ob-
tained simulation results, backed up by the executed statistical tests, clearly suggest that
LR-DOSCA and XGB-DOSCA obtained a superior accuracy level when compared to other
methods that were included in the comparative analysis.
Future experiments in this domain will aim to further test the suggested models on more real-world datasets, with the goal of building confidence in the models before implementing them in real-world systems that deal with spam detection and the overall security of the Internet, as well as of other networks that use email services.

Author Contributions: Conceptualization, M.Z., N.B. and C.S.; methodology, N.B., C.S. and S.J.;
software, N.B. and M.Z.; validation, M.A., I.S. and M.S.; formal analysis, M.Z.; investigation, C.S.,
N.B. and S.J.; resources, N.B., M.S., I.S. and C.S.; data curation, M.Z., M.A. and N.B.; writing—original
draft preparation, I.S., M.A. and S.J.; writing—review and editing, C.S., M.Z. and N.B.; visualization,
N.B., M.A. and M.Z.; supervision, N.B.; project administration, M.Z. and N.B.; funding acquisition,
N.B. and C.S. All authors have read and agreed to the published version of the manuscript.
Funding: Catalin Stoean was supported by a grant of the Romanian Ministry of Education and
Research, CCCDI—UEFISCDI, project number 411PED/2020, code PN-III-P2-2.1-PED-2019-2271,
within PNCDI III.
Data Availability Statement: Not applicable.
Conflicts of Interest: All authors declare no conflict of interest.

References
1. Ripa, S.P.; Islam, F.; Arifuzzaman, M. The Emergence Threat of Phishing Attack and The Detection Techniques Using Machine
Learning Models. In Proceedings of the 2021 International Conference on Automation, Control and Mechatronics for Industry 4.0
(ACMI), Rajshahi, Bangladesh, 8–9 July 2021; pp. 1–6. [CrossRef]
2. Rameem Zahra, S.; Ahsan Chishti, M.; Iqbal Baba, A.; Wu, F. Detecting Covid-19 chaos driven phishing/malicious URL attacks
by a fuzzy logic and data mining based intelligence system. Egypt. Inform. J. 2022, 23, 197–214. [CrossRef]
3. Özgür, L.; Güngör, T.; Gürgen, F. Adaptive anti-spam filtering for agglutinative languages: A special case for Turkish. Pattern
Recognit. Lett. 2004, 25, 1819–1831. [CrossRef]
4. Dedeturk, B.K.; Akay, B. Spam filtering using a logistic regression model trained by an artificial bee colony algorithm. Appl. Soft
Comput. 2020, 91, 106229. [CrossRef]
5. Akerkar, R. Artificial Intelligence for Business; Springer: Cham, Switzerland, 2019.
6. Buchanan, B. Artificial Intelligence in Finance; The Alan Turing Institute: London, UK, 2019.
7. Hamet, P.; Tremblay, J. Artificial intelligence in medicine. Metabolism 2017, 69, S36–S40. [CrossRef] [PubMed]
8. Dias, R.; Torkamani, A. Artificial intelligence in clinical and genomic diagnostics. Genome Med. 2019, 11, 1–12. [CrossRef]
[PubMed]
9. Ahmad, Z.; Shahid Khan, A.; Wai Shiang, C.; Abdullah, J.; Ahmad, F. Network intrusion detection system: A systematic study of
machine learning and deep learning approaches. Trans. Emerg. Telecommun. Technol. 2021, 32, e4150. [CrossRef]
10. Almomani, O.; Almaiah, M.A.; Alsaaidah, A.; Smadi, S.; Mohammad, A.H.; Althunibat, A. Machine learning classifiers for
network intrusion detection system: Comparative study. In Proceedings of the 2021 International Conference on Information
Technology (ICIT), Amman, Jordan, 14–15 July 2021; pp. 440–445.
11. Saba, T.; Sadad, T.; Rehman, A.; Mehmood, Z.; Javaid, Q. Intrusion detection system through advance machine learning for the
internet of things networks. IT Prof. 2021, 23, 58–64. [CrossRef]
12. Tang, L.; Mahmoud, Q.H. A survey of machine learning-based solutions for phishing website detection. Mach. Learn. Knowl. Extr.
2021, 3, 672–694. [CrossRef]
13. Gandotra, E.; Gupta, D. An efficient approach for phishing detection using machine learning. In Multimedia Security; Springer:
Singapore, 2021; pp. 239–253.
14. Doshi, R.; Apthorpe, N.; Feamster, N. Machine learning ddos detection for consumer internet of things devices. In Proceedings
of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; pp. 29–35.
15. Injadat, M.; Moubayed, A.; Shami, A. Detecting botnet attacks in IoT environments: An optimized machine learning approach. In
Proceedings of the 2020 32nd International Conference on Microelectronics (ICM), Aqaba, Jordan, 14–17 December 2020; pp. 1–4.
16. Soe, Y.N.; Feng, Y.; Santosa, P.I.; Hartanto, R.; Sakurai, K. Machine learning-based IoT-botnet attack detection with sequential
architecture. Sensors 2020, 20, 4372. [CrossRef]
17. Rao, S.; Verma, A.K.; Bhatia, T. A review on social spam detection: Challenges, open issues, and future directions. Expert Syst.
Appl. 2021, 186, 115742. [CrossRef]
18. Ahmed, N.; Amin, R.; Aldabbas, H.; Koundal, D.; Alouffi, B.; Shah, T. Machine learning techniques for spam detection in email
and IoT platforms: Analysis and research challenges. Secur. Commun. Netw. 2022, 2022, 1862888. [CrossRef]
19. Hossain, F.; Uddin, M.N.; Halder, R.K. Analysis of optimized machine learning and deep learning techniques for spam detection.
In Proceedings of the 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), Toronto, ON,
Canada, 21–24 April 2021; pp. 1–7.
20. Jurafsky, D.; Martin, J.H. Speech and Language Processing; Prentice Hall: Upper Saddle River, NJ, USA, 2014; Volume 3.
21. Han, Y.; Yang, M.; Qi, H.; He, X.; Li, S. The Improved Logistic Regression Models for Spam Filtering. In Proceedings of the 2009
International Conference on Asian Language Processing, Singapore, 7–9 December 2009; pp. 314–317.
22. Kabiraj, S.; Raihan, M.; Alvi, N.; Afrin, M.; Akter, L.; Sohagi, S.A.; Podder, E. Breast cancer risk prediction using XGBoost
and random forest algorithm. In Proceedings of the 2020 11th International Conference on Computing, Communication and
Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–4.

23. Li, M.; Fu, X.; Li, D. Diabetes prediction based on XGBoost algorithm. In IOP Conference Series: Materials Science and Engineering;
IOP Publishing: Bristol, UK, 2020; Volume 768, p. 072093.
24. Ryu, S.E.; Shin, D.H.; Chung, K. Prediction model of dementia risk based on XGBoost using derived variable extraction and
hyper parameter optimization. IEEE Access 2020, 8, 177708–177720. [CrossRef]
25. Ogunleye, A.; Wang, Q.G. XGBoost model for chronic kidney disease diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019,
17, 2131–2140. [CrossRef] [PubMed]
26. Nobre, J.; Neves, R.F. Combining principal component analysis, discrete wavelet transform and XGBoost to trade in the financial
markets. Expert Syst. Appl. 2019, 125, 181–194. [CrossRef]
27. Wang, Y.; Guo, Y. Forecasting method of stock market volatility in time series data based on mixed model of ARIMA and XGBoost.
China Commun. 2020, 17, 205–221. [CrossRef]
28. Shi, X.; Li, Q.; Qi, Y.; Huang, T.; Li, J. An accident prediction approach based on XGBoost. In Proceedings of the 2017 12th
International Conference on Intelligent Systems and Knowledge Engineering (ISKE), Nanjing, China, 24–26 December 2017;
pp. 1–7.
29. Zhang, S.; Zhang, D.; Qiao, J.; Wang, X.; Zhang, Z. Preventive control for power system transient security based on XGBoost and
DCOPF with consideration of model interpretability. CSEE J. Power Energy Syst. 2020, 7, 279–294.
30. Abdel-Basset, M.; Abdel-Fatah, L.; Sangaiah, A.K. Metaheuristic algorithms: A comprehensive review. In Computational
Intelligence for Multimedia Big Data on the Cloud with Engineering Applications; Elsevier: Amsterdam, The Netherlands, 2018;
pp. 185–231.
31. Blum, C.; Li, X. Swarm intelligence in optimization. In Swarm intelligence; Springer: Berlin/Heidelberg, Germany, 2008; pp. 43–85.
32. Mirjalili, S.; Mirjalili, S.M.; Lewis, A. Grey wolf optimizer. Adv. Eng. Softw. 2014, 69, 46–61. [CrossRef]
33. Dorigo, M.; Birattari, M.; Stutzle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39. [CrossRef]
34. Karaboga, D. Artificial bee colony algorithm. Scholarpedia 2010, 5, 6915. [CrossRef]
35. Yang, X.S. Firefly algorithms for multimodal optimization. In International Symposium on Stochastic Algorithms; Springer:
Berlin/Heidelberg, Germany, 2009; pp. 169–178.
36. Mirjalili, S.; Lewis, A. The whale optimization algorithm. Adv. Eng. Softw. 2016, 95, 51–67. [CrossRef]
37. Abualigah, L.; Diabat, A.; Mirjalili, S.; Abd Elaziz, M.; Gandomi, A.H. The arithmetic optimization algorithm. Comput. Methods
Appl. Mech. Eng. 2021, 376, 113609. [CrossRef]
38. Mirjalili, S. SCA: A Sine Cosine Algorithm for solving optimization problems. Knowl.-Based Syst. 2016, 96, 120–133. [CrossRef]
39. Maulud, D.; Abdulazeez, A.M. A review on linear regression comprehensive in machine learning. J. Appl. Sci. Technol. Trends
2020, 1, 140–147. [CrossRef]
40. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: A new learning scheme of feedforward neural networks. In
Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), Budapest, Hungary,
25–29 July 2004; Volume 2, pp. 985–990. [CrossRef]
41. Olatunji, S.O. Extreme Learning machines and Support Vector Machines models for email spam detection. In Proceedings of the
2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), Windsor, ON, Canada, 30 April–3 May
2017; pp. 1–6.
42. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K. Xgboost: Extreme gradient boosting. R Package Version
0.4-2 2015, 1, 1–4.
43. Guo, Y.; Mustafaoglu, Z.; Koundal, D. Spam Detection Using Bidirectional Transformers and Machine Learning Classifier
Algorithms. J. Comput. Cogn. Eng. 2022, 1–5. [CrossRef]
44. Vanaja, P.; Kumari, M.V. Machine Learning based Optimization for Efficient Detection of Email Spam. Available online:
http://positifreview.com/gallery/33-june2022.pdf (accessed on 10 July 2022).
45. Goodman, J.; Yih, W.T. Online Discriminative Spam Filter Training. In Proceedings of the CEAS 2006—Third Conference on
Email and AntiSpam, Mountain View, CA, USA, 27–28 July 2006; pp. 1–4.
46. Lucay, F.A. Accelerating Global Sensitivity Analysis via Supervised Machine Learning Tools: Case Studies for Mineral Processing
Models. Minerals 2022, 12, 750. [CrossRef]
47. Roul, R.K. Impact of multilayer ELM feature mapping technique on supervised and semi-supervised learning algorithms. Soft
Comput. 2022, 26, 423–437. [CrossRef]
48. Mustapha, I.B.; Hasan, S.; Olatunji, S.O.; Shamsuddin, S.M.; Kazeem, A. Effective Email Spam Detection System using Extreme
Gradient Boosting. arXiv 2020, arXiv:2012.14430.
49. Anitha, P.; Rao, C.G.; Babu, D.S. Email Spam Filtering Using Machine Learning Based Xgboost Classifier Method. Turk. J. Comput.
Math. Educ. 2021, 12, 2182–2190.
50. Pandey, M.K.; Singh, M.K.; Pal, S.; Tiwari, B. Measure the Performance by Analysis of Different Boosting Algorithms on Various
Patterns of Phishing Datasets. 2022. Available online: https://doi.org/10.21203/rs.3.rs-1794002/v2 (accessed on 14 July 2022).
51. Cuk, A.; Bezdan, T.; Bacanin, N.; Zivkovic, M.; Venkatachalam, K.; Rashid, T.A.; Devi, V.K. Feedforward multi-layer perceptron
training by hybridized method between genetic algorithm and artificial bee colony. In Data Science and Data Analytics: Opportunities
and Challenges; Chapman and Hall/CRC: Boca Raton, FL, USA, 2021; p. 279.

52. Strumberger, I.; Bezdan, T.; Ivanovic, M.; Jovanovic, L. Improving Energy Usage in Wireless Sensor Networks by Whale
Optimization Algorithm. In Proceedings of the 2021 29th Telecommunications Forum (TELFOR), Belgrade, Serbia, 23–24
November 2021; pp. 1–4.
53. Zivkovic, M.; Bacanin, N.; Zivkovic, T.; Strumberger, I.; Tuba, E.; Tuba, M. Enhanced grey wolf algorithm for energy efficient
wireless sensor networks. In Proceedings of the 2020 Zooming Innovation in Consumer Technologies Conference (ZINC), Online,
26–27 May 2020; pp. 87–92.
54. Jovanovic, D.; Antonijevic, M.; Stankovic, M.; Zivkovic, M.; Tanaskovic, M.; Bacanin, N. Tuning Machine Learning Models Using
a Group Search Firefly Algorithm for Credit Card Fraud Detection. Mathematics 2022, 10, 2272. [CrossRef]
55. Tair, M.; Bacanin, N.; Zivkovic, M.; Venkatachalam, K. A Chaotic Oppositional Whale Optimisation Algorithm with Firefly Search
for Medical Diagnostics. Comput. Mater. Contin 2022, 72, 959–982. [CrossRef]
56. Bacanin, N.; Stoean, R.; Zivkovic, M.; Petrovic, A.; Rashid, T.A.; Bezdan, T. Performance of a novel chaotic firefly algorithm
with enhanced exploration for tackling global optimization problems: Application for dropout regularization. Mathematics 2021,
9, 2705. [CrossRef]
57. Bacanin, N.; Zivkovic, M.; Sarac, M.; Petrovic, A.; Strumberger, I.; Antonijevic, M.; Petrovic, A.; Venkatachalam, K. A Novel
Multiswarm Firefly Algorithm: An Application for Plant Classification. In International Conference on Intelligent and Fuzzy Systems;
Springer: Cham, Switzerland, 2022; pp. 1007–1016.
58. Zivkovic, M.; Petrovic, A.; Bacanin, N.; Milosevic, S.; Veljic, V.; Vesic, A. The COVID-19 Images Classification by MobileNetV3
and Enhanced Sine Cosine Metaheuristics. In Mobile Computing and Sustainable Informatics; Springer: Singapore, 2022; pp. 937–950.
59. Bacanin, N.; Zivkovic, M.; Al-Turjman, F.; Venkatachalam, K.; Trojovskỳ, P.; Strumberger, I.; Bezdan, T. Hybridized sine cosine
algorithm with convolutional neural networks dropout regularization application. Sci. Rep. 2022, 12, 6302. [CrossRef] [PubMed]
60. Zivkovic, M.; Stoean, C.; Chhabra, A.; Budimirovic, N.; Petrovic, A.; Bacanin, N. Novel improved salp swarm algorithm: An
application for feature selection. Sensors 2022, 22, 1711. [CrossRef] [PubMed]
61. Salb, M.; Zivkovic, M.; Bacanin, N.; Chhabra, A.; Suresh, M. Support Vector Machine Performance Improvements for Cryp-
tocurrency Value Forecasting by Enhanced Sine Cosine Algorithm. In Computer Vision and Robotics; Springer: Singapore, 2022;
pp. 527–536.
62. Zivkovic, M.; Jovanovic, L.; Ivanovic, M.; Bacanin, N.; Strumberger, I.; Joseph, P.M. XGBoost Hyperparameters Tuning by
Fitness-Dependent Optimizer for Network Intrusion Detection. In Communication and Intelligent Systems; Springer: Singapore,
2022; pp. 947–962.
63. Zivkovic, M.; Stoean, C.; Petrovic, A.; Bacanin, N.; Strumberger, I.; Zivkovic, T. A Novel Method for COVID-19 Pandemic
Information Fake News Detection Based on the Arithmetic Optimization Algorithm. In Proceedings of the 2021 23rd International
Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), Timisoara, Romania, 7–10 December
2021; pp. 259–266.
64. Bacanin, N.; Zivkovic, M.; Bezdan, T.; Venkatachalam, K.; Abouhawwash, M. Modified firefly algorithm for workflow scheduling
in cloud-edge environment. Neural Comput. Appl. 2022, 34, 9043–9068. [CrossRef]
65. Bacanin, N.; Antonijevic, M.; Bezdan, T.; Zivkovic, M.; Rashid, T.A. Wireless Sensor Networks Localization by Improved Whale
Optimization Algorithm. In 2nd International Conference on Artificial Intelligence: Advances and Applications; Springer: Singapore,
2022; pp. 769–783.
66. Agarwal, K.; Kumar, T. Email Spam Detection Using Integrated Approach of Naïve Bayes and Particle Swarm Optimization. In
Proceedings of the 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai,
India, 14–15 June 2018; pp. 685–690. [CrossRef]
67. Ahmed, B. Wrapper Feature Selection Approach Based on Binary Firefly Algorithm for Spam E-mail Filtering. J. Soft Comput.
Data Min. 2020, 1, 44–52.
68. Mohammadzadeh, H.; Gharehchopogh, F.S. A novel hybrid whale optimization algorithm with flower pollination algorithm for
feature selection: Case study Email spam detection. Comput. Intell. 2021, 37, 176–209.
69. Singh, A.; Chahal, N.; Singh, S.; Gupta, S.K. Spam Detection using ANN and ABC Algorithm. In Proceedings of the 2021 11th
International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 28–29 January 2021;
pp. 164–168. [CrossRef]
70. Elakkiya, E.; Selvakumar, S.; Velusamy, R.L. CIFAS: Community Inspired Firefly Algorithm with fuzzy cross-entropy for feature
selection in Twitter Spam detection. In Proceedings of the 2020 11th International Conference on Computing, Communication and
Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–7. [CrossRef]
71. Batra, J.; Jain, R.; Tikkiwal, V.A.; Chakraborty, A. A comprehensive study of spam detection in e-mails using bio-inspired
optimization techniques. Int. J. Inf. Manag. Data Insights 2021, 1, 100006. [CrossRef]
72. Gabis, A.B.; Meraihi, Y.; Mirjalili, S.; Ramdane-Cherif, A. A comprehensive survey of sine cosine algorithm: Variants and
applications. Artif. Intell. Rev. 2021, 54, 5469–5540. [CrossRef]
73. Wu, S.; Mao, P.; Li, R.; Cai, Z.; Heidari, A.A.; Xia, J.; Chen, H.; Mafarja, M.; Turabieh, H.; Chen, X. Evolving fuzzy k-nearest
neighbors using an enhanced sine cosine algorithm: Case study of lupus nephritis. Comput. Biol. Med. 2021, 135, 104582.
[CrossRef]
74. Gupta, S. Enhanced sine cosine algorithm with crossover: A comparative study and empirical analysis. Expert Syst. Appl. 2022,
198, 116856. [CrossRef]

75. Rahnamayan, S.; Tizhoosh, H.R.; Salama, M.M.A. Quasi-oppositional Differential Evolution. In Proceedings of the 2007 IEEE
Congress on Evolutionary Computation, Singapore, 25–28 September 2007; pp. 2229–2236. [CrossRef]
76. Cheng, S.; Shi, Y. Diversity control in particle swarm optimization. In Proceedings of the 2011 IEEE Symposium on Swarm
Intelligence, Paris, France, 11–15 April 2011; pp. 1–9.
77. Ergin, S.; Sora Gunal, E.; Yigit, H.; Aydin, R. Turkish anti-spam filtering using binary and probabilistic models. Glob. J. Technol.
2012, 1, 1007–1012.
78. Barushka, A.; Hajek, P. Spam filtering using integrated distribution-based balancing approach and regularized deep neural
networks. Appl. Intell. 2018, 48, 3538–3556. [CrossRef]
79. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.;
et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
80. Amayri, O.; Bouguila, N. A study of spam filtering using support vector machines. Artif. Intell. Rev. 2010, 34, 73–108. [CrossRef]
81. Yang, X.S. Bat algorithm for multi-objective optimisation. Int. J. Bio-Inspired Comput. 2011, 3, 267–274. [CrossRef]
82. Heidari, A.A.; Mirjalili, S.; Faris, H.; Aljarah, I.; Mafarja, M.; Chen, H. Harris hawks optimization: Algorithm and applications.
Future Gener. Comput. Syst. 2019, 97, 849–872. [CrossRef]
83. Talatahari, S.; Bayzidi, H.; Saraee, M. Social network search for global optimization. IEEE Access 2021, 9, 92815–92863. [CrossRef]
84. Rao, R.V.; Savsani, V.J.; Vakharia, D. Teaching–learning-based optimization: A novel method for constrained mechanical design
optimization problems. Comput.-Aided Des. 2011, 43, 303–315. [CrossRef]
85. Derrac, J.; García, S.; Molina, D.; Herrera, F. A practical tutorial on the use of nonparametric statistical tests as a methodology for
comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 2011, 1, 3–18. [CrossRef]
86. García, S.; Molina, D.; Lozano, M.; Herrera, F. A study on the use of non-parametric tests for analyzing the evolutionary algorithms’
behaviour: A case study on the CEC’2005 special session on real parameter optimization. J. Heuristics 2009, 15, 617–644. [CrossRef]
87. Eftimov, T.; Korošec, P.; Seljak, B.K. Disadvantages of statistical comparison of stochastic optimization algorithms. In Proceedings
of the Bioinspired Optimizaiton Methods and their Applications, BIOMA 2016, Bled, Slovenia, 18–20 May 2016; pp. 105–118.
88. Shapiro, S.S.; Francia, R. An approximate analysis of variance test for normality. J. Am. Stat. Assoc. 1972, 67, 215–216. [CrossRef]
89. LaTorre, A.; Molina, D.; Osaba, E.; Poyatos, J.; Del Ser, J.; Herrera, F. A prescription of methodological guidelines for comparing
bio-inspired optimization algorithms. Swarm Evol. Comput. 2021, 67, 100973. [CrossRef]
90. Glass, G.V. Testing homogeneity of variances. Am. Educ. Res. J. 1966, 3, 187–190. [CrossRef]
91. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937,
32, 675–701. [CrossRef]
92. Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92.
[CrossRef]
93. Sheskin, D.J. Handbook of Parametric and Nonparametric Statistical Procedures; Chapman and Hall/CRC: Boca Raton, FL, USA, 2020.
94. Iman, R.L.; Davenport, J.M. Approximations of the critical region of the fbietkan statistic. Commun. Stat.-Theory Methods 1980,
9, 571–595. [CrossRef]