15, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3118573
ABSTRACT A pattern matching method (signature-based) is widely used in basic network intrusion
detection systems (IDS). A more robust method is to use a machine learning classifier to detect anomalies
and unseen attacks. However, a single machine learning classifier is unlikely to be able to accurately detect
all types of attacks, especially uncommon attacks e.g., Remote2Local (R2L) and User2Root (U2R) due to
a large difference in the patterns of attacks. Thus, a hybrid approach offers more promising performance.
In this paper, we proposed a Double-Layered Hybrid Approach (DLHA) designed specifically to address
the aforementioned problem. We studied common characteristics of different attack categories by creating
Principal Component Analysis (PCA) variables that maximize variance from each attack type, and found
that R2L and U2R attacks have similar behaviour to normal users. DLHA deploys Naive Bayes classifier
as Layer 1 to detect DoS and Probe, and adopts SVM as Layer 2 to distinguish R2L and U2R from normal
instances. We compared our work with other published research articles using the NSL-KDD data set. The
experimental results suggest that DLHA outperforms several existing state-of-the-art IDS techniques, and
is significantly better than any single machine learning classifier by large margins. DLHA also displays an
outstanding performance in detecting rare attacks by obtaining a detection rate of 96.67% and 100% from
R2L and U2R respectively.
INDEX TERMS Correlation feature selection, double-layered hybrid approach, machine learning, Naive
Bayes, intrusion detection system, network security, NSL-KDD, SVM.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
138432 VOLUME 9, 2021
T. Wisanwanichthan, M. Thammawichai: DLHA for Network IDS Using Combined Naive Bayes and SVM
a signature-based (misuse) method and an anomaly-based method. While the signature-based method is able to detect only known malicious activities but not novel ones, an anomaly-based method offers a better solution that is capable of detecting unknown attacks, including potential zero-day exploits. It works by observing deviations from normal traffic patterns [2]. A signature-based IDS, e.g., Snort [6], works by matching the target traffic against pre-defined signatures; in this way, it is very accurate in finding known threats. However, it is ineffective against unknown threats [2]. Thus, advanced techniques for anomaly-based IDS need to be explored [7]. Even though anomaly-based IDS usually produce high false alarm rates [2], they have gained widespread acceptance amongst the IDS research community [8], [9]. One of the best options in the domain is to use a Machine Learning (ML) approach to create an effective model that recognizes the patterns of intruders [1], [8], [9].

Various machine learning techniques have been explored and implemented to build an anomaly-based IDS [10]–[18]. There are two ML techniques that are widely implemented in the IDS field: 1) supervised learning, which creates a mapping function based on pre-defined input-output pairs, and 2) unsupervised learning, which allows a model to discover internal relationships by itself. Supervised ML is the most widely used technique in IDS, for example, Support Vector Machine (SVM) [10], [16], [19], [20], Decision Tree (DT) [21], [22], K-Nearest Neighbors (KNN) [17], [23], and Naive Bayes (NB) [18], [24]–[26]. Unsupervised ML mostly refers to clustering algorithms such as K-Means [27].

The key challenge in building an efficient IDS is the selection of relevant features in the case of multiple attack categories. Moreover, there will likely be many attack types in networks for ML to learn. Thus, Feature Selection (FS) is a crucial process to eliminate uninformative attributes and noise. FS is one of the primary factors to enhance accuracy in IDS [13], [28]. Thus, many IDS researchers try to explore the best feature selection methods to extract a subset of relevant features in order to boost classification results [29], such as using Local Search Algorithm with K-Means [13], Genetic Algorithm (GA) [19], [28], Particle Swarm Optimization (PSO) [28], Ant Colony Algorithm [28], [30], and Correlation Coefficient [31]–[33]. In the past years, Artificial Neural Networks (ANN) and Deep Learning (DL) have been successfully applied to deal with complex patterns, especially in image and language processing. There are studies that applied ANN to the IDS problem, such as Convolutional Neural Networks (CNN) [34]–[36] and Recurrent Neural Networks (RNN) [37].

Every ML algorithm has its own capability. One can precisely detect a specific type of attack, while others are not accurate at it [38], [39]. Techniques that combine two or more learning algorithms have been recently proposed due to superior performance in detecting various attacks [39]. The ensemble method is a popular learning algorithm for IDS, which usually offers a better result than a single estimator [11]. An ensemble learning technique is the process where multiple base classifiers are combined to achieve better predictive capability, for example, Random Forest (RF) [14], [40].

In the past years, another approach that has been largely adopted in the IDS research community is a hybrid approach. A hybrid approach, in general, refers to a method that combines two or more learning techniques, e.g., using a signature-based method with an anomaly-based method [41]–[43], or an anomaly-based method with another anomaly-based method, for example, unsupervised ML with supervised ML [38], or supervised ML with supervised ML [28], [44]–[46]. The main concept behind the hybrid approach is to exploit the advantages of each learning technique by combining the strong points of different single classifiers in order to improve the overall detection rate. It is also an effective technique to reduce the bias towards more frequent attacks that results from data set imbalance [46]. Therefore, the hybrid approach is a promising technique to address the major concerns in IDS research.

However, there are three key problems in previous studies. (I) Many works, e.g., [37], [47], only focused on using a single machine learning model to detect all attack types. A single classifier is unlikely to outperform a hybrid approach. (II) Low-frequency attacks are not well detected due to a severe imbalance of classes in the training data set, which results in bias in ML models [48]. (III) Relevant features for a specific type of attack may not be necessary for other attacks due to a vast difference in attack behaviours [49], [50].

In order to address the above problems, our contributions to the cyber security domain are as follows: (I) We proposed a Double-Layered Hybrid Approach (DLHA) that is better than a single ML classifier and the ensemble method. The proposed approach is composed of two layers that work in a cascading manner, where the first layer is to detect DoS and Probe, and the second layer is to detect R2L and U2R. (II) We performed data analysis using PCA and found that DoS and Probe are more distinct from the rest, and R2L and U2R behave similarly to normal traffic patterns. The findings inspired us to design DLHA. Contributions (I) and (II) are exclusively dedicated to demonstrating the effectiveness of implementing a hybrid approach, as opposed to using one classifier as mentioned in problem (I). (III) The uniqueness of our approach is that we divided the NSL-KDD training data set into two groups i.e., 1) Group 1 that contains all classes, and 2) Group 2 that contains only R2L, U2R, and Normal classes. These were used to separately train the two classifiers in order to have a dedicated classifier for detecting rare attacks i.e., R2L and U2R amongst normal connections. The group-divided strategy allows the algorithm to focus on low-frequency attacks at the second layer to address problem (II). (IV) We presented Intersectional Correlated Feature Selection (ICFS) using correlation coefficients. It selected commonly important features from different attack types within the subgroups in order to mitigate problem (III). (V) We conducted an evaluation of our proposed approach
to show that DLHA yields higher detection rates on both overall performance and low-frequency-attack performances compared to many other existing state-of-the-art methods. (VI) We showed that DLHA is highly competitive as a hybrid method, and it has a substantially superior performance to the traditional single ML techniques.

The rest of the paper is organized as follows. In Section II, related works on anomaly-based IDS are provided. Data analysis on NSL-KDD is explained and shown in Section III. The conceptual framework of our proposed DLHA is illustrated, and the combined NB and SVM detection system is introduced in Section IV. Section V explains the performance analysis of DLHA as well as presents an extensive comparison of our results to other anomaly-based IDS techniques. A conclusion is provided in Section VI.

II. RELATED WORK
Numerous anomaly-based IDS nowadays implement a hybrid ML model, as it leads to better performance and enhanced efficiency [1], [39]. Chi-square feature selection with a multi-class SVM model was proposed in [51]. Chi-square was used to calculate statistical significance on each feature, and then the low-rank features were removed. The number of features decreased from 41 to 31 during the feature selection process. Then, hyperparameter tuning was performed for RBF-kernel SVM to obtain the best combination of parameters i.e., C and gamma. The model led to an outstanding result, but the authors did not perform an evaluation on KDDTest+. Yao et al. [39] proposed a Hybrid Multi-Level data mining framework using hybrid feature selection. The authors performed several experiments to choose the best ML algorithms to detect each class of attack. The final detection system consisted of four different classifiers, which were: 1. Linear SVM to detect DoS, 2. ANN with logistic activation function to detect Probe, 3. ANN with relu activation function to detect R2L, and 4. ANN with identity activation function to detect U2R. The hybrid framework resulted in a superb performance, but the framework could be cumbersome for a real-time IDS as it consisted of four classifiers. The data fusion method performed better than using a single classifier alone by integrating multiple different classifiers and predicting at the last step. It allows flexibility of data pre-processing by using different feature selection methods. However, the use of different classifiers for different data sources resulted in longer computational time both in training and testing processes [52].

GA-SVM, which implemented a Genetic Algorithm (GA) combined with SVM, was introduced in [53]. The genetic algorithm was used as a feature reduction technique to reduce features from 45 to 10 based on three priorities. The GA applied crossover and variation to generate the optimal subsets of features used in training by SVM. An efficient anomaly-based IDS hybrid model was proposed in [54]. The authors used a voting algorithm with information gain to filter out irrelevant features. The designed hybrid classifier algorithm utilized ensemble representing J48, Meta Tagging, RandomTree, REPTree, AdaBoostM1, DecisionStump, and Naive Bayes. This method claimed to address the high false negative rate. Jiang et al. [34] proposed a combined hybrid sampling with a Deep Hierarchical Network model. The model was tasked to balance the class distribution by initially employing One-Side Selection (OSS) to reduce samples in the majority classes, then using Synthetic Minority Over-sampling Technique (SMOTE) to increase the samples in the minority classes. The deep hierarchical network model worked based on spatial feature extraction with Convolution Neural Network (CNN) and temporal feature extraction with Bi-directional Long Short-Term Memory (BiLSTM). The model accurately detected the under-represented classes as a result of a hybrid sampling technique.

Biswas et al. [55] proposed hybrid feature selection with neural network and K-Means clustering. It applied PCA to K-Means clustering, which specified five clusters as per the number of classes. Each cluster was trained and evaluated by aggregating the results from different ANN functions i.e., feed forward neural network algorithm. Mazini et al. [56] proposed a new hybrid anomaly-based IDS framework to improve detection rates using Artificial Bee Colony (ABC) as a feature selection technique and the AdaBoost algorithm as a classifier. The authors implemented an ABC meta-algorithm to select the best subset of relevant features and deployed AdaBoost.M2 to detect multi-class attacks. An IDS based on Naive Bayes Classifier (NBC) using Bayesian probability was presented in [57]. The NBC calculated probabilities of any attack occurrence and the TCP normal traffic based on the Bayesian network. The authors performed a score map analysis to select the features that boost detection rates. The results of NBC improved the detection rate of R2L attacks. Çavuşoğlu [58] introduced a new hybrid IDS, which used a combination of different classifiers and feature selection techniques according to each type of attack. The authors performed CfsSubsetEval and WrapperSubsetEval feature selection according to protocol types on the different feature selection algorithms. The proposed IDS works in a multi-level manner by having four different techniques for each attack class i.e., RF to detect DoS, Stacking method with RF, J48, and KNN to detect R2L, RF to detect Probe, and J48 and NB to classify normal traffic and U2R.

Hwang et al. [59] presented a three-tier architecture IDS approach by implementing a blacklist, whitelist, and SVM. The first tier was to filter out the known attacks, the second tier was to classify normal connections, and the last tier was to detect anomalies from the rest of the connections. The authors claimed that the method was efficient and flexible as not all connections were passed through every tier. Pajouh et al. [60] proposed a Two-layer Dimension reduction and Two-tier Classification model (TDTC) to focus on detecting malicious activities i.e., R2L and U2R. The authors’ framework utilized two dimensionality reduction techniques: PCA and Linear Discriminant Analysis (LDA). After PCA, LDA was applied with labels to transform data into lower dimensions in order to have as few dimensions as possible
to suit the IoT environment. At the two-tier classification system, NB and Certainty Factor of the KNN algorithm were deployed. Tama et al. [28] presented a Two-Stage Ensemble (TSE-IDS) model that performed three feature selection algorithms i.e., Particle Swarm Optimization (PSO), Ant Colony Algorithm (ACO), and Genetic Algorithm (GA). The features were selected based on the performance of the pruning tree classifier (REPT). The two-stage meta classifier was proposed using rotation forest and bagging to perform the majority voting at the end. The predictive features, as a result of the three feature selection algorithms, were used in training. Then, a 10-fold CV was used to measure average accuracy in the training set at the validation stage. The results suggested that a hybrid approach performed relatively better than single ML classifiers.

Alfantookh [61] introduced Denial of Service Intelligent Detection (DoSID), which used feed forward ANN with the backpropagation algorithm to detect DoS attacks. The author presented the Grey Area that used the distribution concept and conducted experiments to evaluate different parameter sets to select the best configurations for ANN, such as the number of training epochs. The experimental results displayed a capability to detect unknown attacks that had never been seen in the training process, as well as an improvement in false negative rates. A two-tier classifier with LDA feature selection was introduced in [4]. The model was trained on the training data set that applied SMOTE to make the data set more balanced in terms of the ratio between anomalies and attack records. The NB and KNN classification algorithms were employed in the proposed IDS system. Compared to other papers, it achieved a high detection rate on uncommon attacks such as R2L and U2R.

Baykara and Das [62] proposed a hybrid honeypot based real-time intrusion detection and prevention system. The system was developed by utilizing low and high interaction honeypots to reduce installation, configuration, maintenance and management cost. The approach led to a considerable drop in the false positive rate, which benefited real-time enterprise network monitoring. An adaptive ensemble ML IDS framework was presented in [11]. The authors proposed a MultiTree algorithm to deal with skewed class distribution in the training set. It adjusted the proportion of the training data set in order to reduce bias towards over-represented classes. The authors evaluated multiple classifiers to select the base classifiers including Decision Tree, Random Forest, KNN, and Deep Neural Networks. In the end, adaptive majority voting was used to make a final prediction. However, the results indicated a high false alarm rate, especially on Probe attacks. A hybrid approach using a two-step binary classification method was demonstrated in [46]. The authors designed the first step to be an ensemble algorithm by deploying several binary classifiers with one aggregation function to predict the exact class of the connection. The second step was based on the outcome of the first step by performing the KNN algorithm to predict its class when the first step failed to confirm a certain class. This hybrid approach accomplished a satisfactory performance in detecting rare attacks i.e., R2L and U2R.

Hoz et al. [63] proposed a hybrid framework using PCA, Fisher Discriminant Ratio (FDR), and Probabilistic Self-Organizing Maps (PSOMs). PCA was used to extract meaningful components from all data attributes, and FDR was used as a feature selection method to retain informative features. The PSOMs algorithm was used to detect anomalous instances. A fuzzy anomaly-based IDS with Content-Centric Networks was introduced in [64]. The approach hybridized the PSO and K-Means algorithm to optimize the proper number of clusters obtained from performing K-Means. At the classification stage, the fuzzy algorithm was deployed to distinguish abnormal connections from normal connections. Auto-Encoder (AE) intelligent IDS was proposed in [65]. The authors performed feature selection by removing features in which more than 80% of the values are zero. The remaining features, combined with the features resulting from one-hot encoding, were used as feature vectors. The AE was trained in an unsupervised manner using the Scaled Conjugate Gradient method (SCG) for 100 epochs. The authors tested the model with several shallow ANN such as Multi-Layer Perceptron (MLP) and deep ANN such as LSTM.

Recurrent Neural Network (RNN) based IDS was introduced in [37]. The authors implemented one-hot encoding and optimized parameters by adjusting hidden nodes and the learning rate. The model performed well on frequent attacks but not on uncommon attacks because no extra work was done to address the data set imbalance. A honeypot-based intrusion detection and prevention system combined with software-defined switching was presented in [66]. The system was evaluated in a simulation environment, where the results indicated a reduced false alarm rate. The honeypot server, which worked alongside the intrusion detection system, produced signatures of potential zero-day attacks that helped the anomaly-based IDS to detect future unseen attacks more precisely. Gogoi et al. [38] proposed a Multi-Level Hybrid (MLH-IDS) data mining technique. It has three levels where it utilized a supervised ML CatSub+ as the first level to classify DoS and Probe, an unsupervised ML K-point algorithm as the second level to detect normal traffic, and an outlier-based classifier GBBK as the third level to classify R2L and U2R. MLH-IDS produced excellent results as a hybrid technique in detecting all types of attacks using NSL-KDD. However, its real performance remains unclear because the authors marked the attacks that exist in KDDTest+, but not in KDDTrain+, as unknown in the testing process.

Bostani and Sheikhan [67] proposed a graph-based ML framework based on a modified Optimum-path Forest model (OPF). In the framework, the authors used K-Means to partition the original NSL-KDD data set into K different training subsets, which are used in the training process of OPFs. The concept of centrality and prestige in social network analysis was employed in a pruning module to extract the most predictive samples from the subsets obtained by implementing K-Means to accelerate the OPF stage. Instead of using the full
features, Mohammadi et al. [33] proposed a group-based feature selection, which was called Feature Grouping based on Linear Correlation Coefficient (FGLCC) combined with Cuttlefish Algorithm (CFA) on clustering of different groups. FGLCC measured linear correlation coefficients from features and classes to select the maximum correlation in order to reduce computational cost in a large sample size. The algorithm improved the accuracy and the detection rate of IDS. Pervez and Farid [47] developed an anomaly-based IDS using SVM with the proposed feature selection algorithm. The feature selection algorithm kept removing one input feature, then built a classifier to test if a new subset of features led to better classification accuracy. The best classification accuracy was obtained by using 41 features, where it achieved 98.96% from a 10-fold CV in KDDTrain+. However, it experienced a major drop in the accuracy down to 82.37% when tested with KDDTest+.

Considering past related works, the key difference amongst hybrid approaches is feature selection. While many methods perform feature selection based on the most relevant features to all attacks, the better alternative is to perform feature selection on a specific attack type. For example, a hybrid feature selection for each hybrid level was used in [39]. Another major difference is the hybrid design. In [39], [58], the authors employed four classifiers to detect each type of attack, which led to better performance but a slower process. On the other hand, Pajouh et al. [60] presented a two-tier hybrid IDS using two classifiers with optimal features derived from PCA and LDA. However, the two-tier IDS was inefficient in R2L detection. Thus, past papers have fallen short in both effective feature selection and efficient hybrid IDS design. Table 1 highlights key differences and a summary of the closest related works to our study that proposed a hybrid approach. The summary explains feature selection, ML algorithm, evaluation criteria, and the main contribution, including our work.

III. DATA ANALYSIS
A. DATA SET DESCRIPTION
KDD99 [70] was the most widely used data set in evaluating anomaly-based IDS approaches [71]; it captured TCP dump data from the DARPA98 off-line intrusion detection evaluation program. However, KDD99 has numerous inherent problems. Hence, the NSL-KDD data set [3] is instead utilized in this paper. The NSL-KDD was proposed in 2009 to address the skewed and disproportionate distribution of the KDD99 data set [3]. The advantages and improvements that the NSL-KDD holds over the outdated KDD99 are that a huge number of redundant/duplicated data are removed. Also, selected instances are well represented i.e., the numbers of attacks and normal instances are not very distinct, and the difficulty levels of attacks are evenly distributed in the training and testing sets. This results in more reliable classification results when comparing anomaly-based methods using different ML techniques [1], [3], [72]. In addition, it also alleviates bias in the evaluation stage, which originally caused a higher detection rate towards frequent attacks [3]. Therefore, NSL-KDD is the standardized data set used by a number of network IDS researchers [1], [28], [34], [65], [72]–[74]. In this paper, we only consider three data sets, which are KDDTrain+, KDDTrain+_20Percent, and KDDTest+. KDDTrain+_20Percent is a subset of KDDTrain+, which contains 20% of instances with the same distribution ratio of classes. The reason behind the selection of the three data sets is that we can perform an extensive evaluation of our algorithm using KDDTest+ that contains 17 unseen attack classes. The training is done by utilizing the full sample size in the KDDTrain+ data set first, then a comparatively smaller size i.e., the KDDTrain+_20Percent data set, in order to observe the difference in performance when the training data are relatively smaller. According to NSL-KDD, there are four main categories of attacks as shown in Table 2.

The NSL-KDD consists of five classes i.e., DoS, Probe, R2L, U2R, and Normal. The detailed distributions of the five classes in KDDTrain+ and KDDTrain+_20Percent, and in KDDTest+, are displayed in Table 3 and Table 4 respectively. Although the NSL-KDD is an updated version of KDD99, it still suffers from an inherited uneven class distribution within the data sets. For example, in the training data set normal records take the highest share amongst all instances, at about 53.46%, followed by DoS (36.46%) and Probe (9.25%), while R2L (0.79%) and U2R (0.04%) sample data are very scarce. The problem is that if a single model is deployed, it will not be able to detect R2L and U2R effectively owing to the model’s bias [72]. R2L and U2R attacks, used by hackers, are more harmful than DoS and Probe [4].

Furthermore, it is also evident that the discrepancy of the numbers of R2L between training and testing is very high i.e., R2L takes up to 22.48% of all attacks in testing data, but only 1.70% in training data. Hence, in order to enhance overall IDS performance, R2L attacks need to be well detected. It is worth noting that the testing data set (KDDTest+) contains 17 additional unseen minor classes of attacks, which do not appear in the training data set i.e., apache2, httptunnel, mailbomb, mscan, named, processtable, ps, saint, sendmail, snmpgetattack, snmpguess, sqlattack, udpstorm, worm, xlock, xsnoop, and xterm. This makes it more challenging and realistic to assess our hybrid approach against both known and unknown categories of attacks. However, there are two minor classes of attacks that appear in the training data, but they are absent in the testing data set i.e., spy and warezclient.

B. CLASS DISTRIBUTION ANALYSIS
Each instance in the NSL-KDD contains 41 features as displayed in Table 5. The features can be divided into four categories which are: 1. Intrinsic features (feature 1 to 9) derived from the header of the packets, 2. Content features (feature 10 to 22) contain original packet payloads, 3. Time-based
features (feature 23 to 31) extracted from 2-second interval traffic connection records, and 4. Host-based features (feature 32 to 41) are similar to time-based features but include all series of connections instead of a 2-second interval. These features are beneficial to assess attacks that operate longer than the two-second time span. Of the 41 features, 38 are numerical and 3 are categorical, namely protocol_type, service, and flag.

TABLE 1. Summary of the closest related works that proposed a hybrid approach.
TABLE 2. Major categories of attacks in the NSL-KDD data set.

To perform data analysis on training data, we first implemented data pre-processing by assigning numerical label tags from [Normal, DoS, Probe, R2L, U2R] to [0, 1, 2, 3, 4] respectively. Then, we performed one-hot encoding on those categorical features. One-hot encoding is a powerful tool used to maintain predictive information when converting a categorical feature to numerical features. However, it assumes
TABLE 3. 5-Class distribution in KDDTrain+ and KDDTrain+_20Percent.
TABLE 4. 5-Class distribution in KDDTest+ (* are attack categories that do not appear in the training data).
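The pre-processing described above (label tagging followed by one-hot encoding of protocol_type, service, and flag) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in this work. The toy records, the helper names `one_hot_encode` and `tag_labels`, and the assumption that each record already carries one of the five category names (rather than a raw attack name) are all hypothetical.

```python
# Illustrative sketch only: label tagging and one-hot encoding on toy records.
LABEL_TAGS = {"Normal": 0, "DoS": 1, "Probe": 2, "R2L": 3, "U2R": 4}
CATEGORICAL = ("protocol_type", "service", "flag")


def one_hot_encode(records, categorical=CATEGORICAL):
    """Replace each categorical field with one 0/1 column per observed value."""
    # Vocabulary of every categorical feature, sorted for a stable column order.
    vocab = {name: sorted({r[name] for r in records}) for name in categorical}
    encoded = []
    for r in records:
        # Numerical features are copied unchanged.
        row = {k: v for k, v in r.items() if k not in categorical}
        for name in categorical:
            for value in vocab[name]:
                row[f"{name}={value}"] = 1 if r[name] == value else 0
        encoded.append(row)
    return encoded


def tag_labels(categories):
    """Map the five class names to the tags 0..4."""
    return [LABEL_TAGS[c] for c in categories]


# Hypothetical toy records; real NSL-KDD rows carry 41 features.
records = [
    {"duration": 0, "protocol_type": "tcp", "service": "http", "flag": "SF"},
    {"duration": 2, "protocol_type": "udp", "service": "domain_u", "flag": "SF"},
    {"duration": 0, "protocol_type": "tcp", "service": "private", "flag": "S0"},
]
categories = ["Normal", "Normal", "DoS"]

encoded = one_hot_encode(records)
labels = tag_labels(categories)
```

Each categorical feature expands into as many binary columns as it has distinct values, so the feature count after encoding depends on the vocabulary observed in the data.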
defined as:

$$C = \begin{bmatrix} \sigma(d_1, d_1) & \cdots & \sigma(d_1, d_n) \\ \vdots & \ddots & \vdots \\ \sigma(d_n, d_1) & \cdots & \sigma(d_n, d_n) \end{bmatrix}$$

hence, it can be computed by:

$$C = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)\left(X_i - \bar{X}\right)^{T}}{n - 1} \tag{2}$$

Following this, we calculated eigenvalues and eigenvectors, $Av = \lambda v$, corresponding to the computed covariance matrix. We then ranked the eigenvectors by eigenvalue, so that the eigenvector with the highest eigenvalue is the first principal component, and so on. Thus, $d'$ is the number of dimensions, sorted in descending order, obtained from implementing PCA. For the purpose of illustration, we chose two as the number of principal components in order to be able to plot the instances separated by classes on a two-dimensional graph. We performed a scatter plot of the two-dimensional PCA analysis on training data as visualized in Fig 1.

In Fig 1, we labelled DoS as orange, Probe as green, R2L as yellow, U2R as red, and Normal as blue. In the top graph, we excluded Normal. Obviously, most DoS and Probe
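The covariance and eigen-decomposition steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the code used in this work: the function name `pca_project` and the synthetic input matrix are hypothetical, and `numpy.linalg.eigh` is chosen here only because the covariance matrix is symmetric.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Covariance matrix as in (2): summed outer products divided by n - 1.
    C = centered.T @ centered / (X.shape[0] - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # Rank eigenvectors by eigenvalue, largest first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order[:n_components]]
    return centered @ components, eigenvalues


# Hypothetical synthetic data: 200 samples, 5 features with different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
Z, eigenvalues = pca_project(X, n_components=2)
```

The two columns of `Z` are the coordinates one would pass to a scatter plot, coloured by class, to obtain a two-dimensional view of the kind described for Fig 1.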
FIGURE 2. A conceptual framework of DLHA anomaly-based IDS.

To handle this problem, we presented ICFS. An example of the ICFS is illustrated in Fig 3.

At this process, we performed feature selection on the two groups using Pearson Correlation Coefficient (PCC). PCC is a bivariate analysis that measures the linear relationship between two random variables, and ranks the features by importance. This method has low computational complexity, and it is scalable for high dimensional data. For numerical features, Pearson’s correlation coefficients are used to calculate how much two variables vary together [78]. It is equal to the covariance divided by the product of their standard deviations. Let X be a random vector with n instances, where n is the number of samples, $\sigma_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, and $\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$.

Let $F$ be features $\{F_1, F_2, \ldots, F_n\}$ in training data. In Group 1, we assigned DoS and Probe as 1 and the rest as 0. Let $F_{(DOS)} = \{F_1, F_2, \ldots, F_i\}$ be the features between DoS and the rest, which have PCC greater than 0.1. Let $F_{(Probe)} = \{F_1, F_2, \ldots, F_j\}$ be the features between Probe and the rest, which have PCC greater than 0.1. $F_{(DOS)}$ are predictive features to classify DoS from the rest, and $F_{(Probe)}$ are predictive features to classify Probe from the rest. Therefore, $F_{(DOS)} \cap F_{(Probe)}$ are common predictive features to classify DoS and Probe from the rest. As a result, $F_{(DOS)} \cap F_{(Probe)}$ are the selected features for Group 1. We implemented the same for Group 2 but with a 0.01 threshold because most features are not correlated. In Group 2, R2L and U2R were labelled as 1, and normal records were labelled as 0. Then, PCC was calculated between R2L and Normal as well as U2R and Normal. Consequently, $F_{(R2L)} \cap F_{(U2R)}$ are the selected features for Group 2. The main aim of ICFS is to remove obviously uncorrelated features from the groups. After the ICFS was completed, we normalized the data to be in the
range [0,100] as their standard deviations were fairly small. Normalization can be done using the formula in (4).

$$x'_{ij} = \frac{x_{ij} - \min(x)_j}{\max(x)_j - \min(x)_j}, \quad i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n \tag{4}$$

Afterwards, we performed one-hot encoding and PCA, in that order. PCA was used to extract meaningful variance from high-dimensional data and turn it into uncorrelated, linearly transformed, lower-dimensional data. To build an efficient IDS, we use as few features as possible. Thus, we selected the lowest number of components that retains 95% of the variance. We performed the data transform individually for each group since the instances are different. This resulted in differences in the selected features, the scaling coefficients, and the number of principal components. Hence, we have two types of data transforms. One-hot encoding and PCA implementation details are presented in Section II. Following the data transform, data balancing in the training set is critical in order to prevent bias towards the overwhelming records. Notably, we have R2L+U2R = 1,047 instances and Normal = 67,343 instances, so the ratio of Normal to R2L+U2R is approximately 64:1. To prevent bias, downsampling of the majority class is required. Specifically, 1,047 normal instances were randomly selected in order to make the ratio 1:1. Since the class ratio in Group 1 is not high, downsampling was not necessary there.

2) TRAINING AND VALIDATION
The training and validation steps are vital. Naive Bayes (NB) is selected as the classifier for Group 1. Support Vector Machine (SVM) is selected as the classifier for Group 2.

a: NAIVE BAYES CLASSIFIER (NBC)
NBC is a simple yet powerful probabilistic estimator based on applying Bayes' theorem with the assumption that the considered attributes are independent of one another, meaning that each feature influences the result independently [79]. In our proposed method, the NBC's task is to detect DoS and Probe. To serve this goal, DoS and Probe attacks are labelled as 1, and the rest are 0. Let $y = \{y_1, y_2\} = \{\text{Rest}, \text{DoS/Probe}\}$, and let $x$ be a dependent feature vector in the data such that $x = \{x_1, x_2, \ldots, x_n\}$. Bayes' theorem can be written as follows:

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)} \tag{5}$$

where $P(y)$ is the prior probability, $P(x_1, x_2, \ldots, x_n \mid y)$ is the likelihood of a given dependent vector relative to its class, and $P(x_1, x_2, \ldots, x_n)$ is the marginal likelihood or evidence. $P(y \mid x_1, x_2, \ldots, x_n)$ is the posterior probability of $y$ given that $(x_1, x_2, \ldots, x_n)$ has occurred. With the conditional assumption that every feature is independent of the others, it can be defined as:

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}$$

where $n$ is the number of features after data transform 1. Since $P(x_1, x_2, \ldots, x_n)$ is constant for all classes, the NBC has the following classification expression:

$$y' = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y) \tag{6}$$

As the NBC implements the Gaussian algorithm for classification, $P(x_i \mid y)$ is assumed to be Gaussian as follows:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

Despite the feature-wise independence assumption being violated almost all the time in real-world applications, the NBC has demonstrated outstanding classification results in the IDS problem [18]. It has been shown to be efficient in detecting frequent DDoS attacks [25]. NBC's computational complexity is O(cf), where c is the number of classes and f is the number of features. As the dimensions are reduced in the data transform process, NBC is suitable for dealing with a large number of connections.

b: SUPPORT VECTOR MACHINE (SVM)
SVM is one of the most popular supervised ML algorithms in classification tasks. It was initially proposed in [80], [81] to deal with linear and non-linear optimization problems. SVM creates the best hyperplane in a high-dimensional space in order to separate two classes with the maximum margin between them. It has also been applied to the intrusion detection research area [19], [20], [82]. It provides flexibility in implementation by allowing a choice of kernels, e.g., linear and radial basis function (RBF). Since RBF is a non-linear support vector classifier (SVC) kernel, it is especially effective in dealing with data that share complex boundaries [10], i.e., classifying R2L and U2R from normal connections.
For any given training vector pairs of connection-class $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where $x_i \in \mathbb{R}^n$ and $y \in \{1, -1\}^n$, in which 1 corresponds to a positive class and -1 corresponds to a negative class, SVM requires a solution to the following problem:

$$\begin{aligned} \min_{w, b, \zeta} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \\ \text{subject to} \quad & y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \zeta_i, \\ & \zeta_i \geq 0, \; i = 1, \ldots, n \end{aligned} \tag{7}$$

In this equation, the margin between the two classes is maximized by minimizing $w^T w = \|w\|^2$. $C$ is the penalty strength that controls misclassified samples at a distance $\zeta_i$ from the correct margin boundary, which corresponds to the constraint $y_i (w^T \phi(x_i) + b) \geq 1 - \zeta_i$. The decision function output for any sample $x$ is defined as:

$$\sum_{i \in SV} y_i \alpha_i K(x_i, x) + b \tag{8}$$
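As a minimal sketch of the decision function in (8), the following Python snippet evaluates $\sum_{i \in SV} y_i \alpha_i K(x_i, x) + b$ for a kernel passed in as a function. The function names, the support vectors, the coefficients $\alpha_i$, and the bias are all made-up illustrative values, not parameters of the trained model, and the tie-breaking rule at a score of exactly 0 is an arbitrary assumption.

```python
import math

def linear_kernel(u, v):
    """Inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=0.1):
    """exp(-gamma * ||u - v||^2), with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def decision_function(x, support_vectors, labels, alphas, bias, kernel):
    """Evaluate sum over support vectors of y_i * alpha_i * K(x_i, x), plus b."""
    return sum(
        y_i * a_i * kernel(x_i, x)
        for x_i, y_i, a_i in zip(support_vectors, labels, alphas)
    ) + bias

# Hypothetical toy model: two support vectors in a 2-D feature space.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]      # y_i in {1, -1}
alphas = [0.5, 0.5]   # dual coefficients alpha_i (illustrative only)
bias = 0.0

score = decision_function(
    (2.0, 0.0), support_vectors, labels, alphas, bias, linear_kernel
)
# Ties at exactly 0 go to the positive class here (an arbitrary choice).
predicted_class = 1 if score >= 0 else -1
```

Swapping `linear_kernel` for `rbf_kernel` changes only how similarity to each support vector is measured; the weighted sum and the bias term stay the same.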
The sign of the decision function in (8) gives the predicted class. The SVC kernels chosen for validation in this study are linear and RBF. The linear kernel is expressed as:

$$K(x, y) = \langle x, y \rangle$$

The RBF kernel is defined as:

$$K(x, y) = e^{-\gamma \|x - y\|^2}, \quad \gamma > 0$$

It has never been confirmed whether a non-linear RBF kernel always performs better than its linear counterpart in this task. We therefore selected both linear and RBF kernels for the parameter adjustment to observe the R2L and U2R boundary. In order to avoid data leakage and overfitting to the data set, we performed SVM hyperparameter tuning using 10-fold stratified cross validation within the training sets only, i.e., KDDTrain+ and KDDTrain+_20Percent. Stratified cross validation is the process of splitting data into folds such that each fold has the same proportion of class labels as the other folds. The parameters of concern are C and gamma. C, which is available for both linear and RBF kernels, is a regularization parameter that adds a penalty for each misclassified instance. The RBF gamma controls the distance of influence of a single training sample. The parameter sets are as follows. Linear: C = 0.01, 0.1, 1, 10, 100, 1000. RBF: C = 0.1, 1, 10, 100 and gamma = 0.01, 0.1, 1, 10. Consequently, we have six parameter sets for the linear kernel and 16 parameter sets for the RBF kernel.
SVM was implemented using LIBSVM [83]. The machine ran Ubuntu 20.04 LTS on an Intel Core i9-9900 at 3.10 GHz with 32 GB of RAM.

B. DLHA ALGORITHM
Real-time traffic classification using DLHA is displayed in Fig. 4. DLHA is proposed to improve the overall detection rate, and especially the detection rate of rare attacks that are more hostile, i.e., R2L and U2R in this study. It is also designed to be an efficient real-time IDS since ICFS and PCA reduce the data dimensions as much as possible. The DLHA algorithm works as follows: the network connection packets are captured and sent through the Data Transformation 1 process, then the transformed data are passed to Layer 1, which is NBC, to determine whether the connection is DoS/Probe or not. If the prediction is negative, the connection is highly unlikely to be DoS or Probe, and the second layer is activated. The original data are sent through the Data Transformation 2 process. Then the transformed data are passed to Layer 2, which is SVM, to determine whether the connection is R2L, U2R, or normal. If the prediction is negative, this connection is expected to be normal. If either of the two classifiers predicts positive, the connection is terminated and marked as an anomaly. Since DoS and Probe attacks are more likely to occur, it is computationally efficient for this framework to detect DoS and Probe first, then R2L and U2R. The DLHA algorithm is given in Algorithm 1.

Algorithm 1 DLHA Algorithm
Input: X = {f1, f2, ..., f40} // 40 attributes captured
Output: y ∈ {0, 1}
while DLHA IDS is running do
    // for every network connection Xi
    after performing data transform 1, represent Xi as Xt1
    if Layer1 predicts Xt1 as 1 then
        y ← 1
        return y
    else
        Layer 2 is activated
        after performing data transform 2, represent Xi as Xt2
        if Layer2 predicts Xt2 as 1 then
            y ← 1
            return y
        else
            y ← 0
            return y
        end if
    end if
end while
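The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the two-layer cascade, not the authors' implementation: the function name `dlha_predict`, the callable-based interface, and the stand-in transforms and classifiers are all hypothetical, and the simple threshold rules below merely take the place of the trained NBC and SVM.

```python
from typing import Callable, Sequence

Vector = Sequence[float]

def dlha_predict(
    connection: Vector,
    transform_1: Callable[[Vector], Vector],
    layer_1: Callable[[Vector], int],
    transform_2: Callable[[Vector], Vector],
    layer_2: Callable[[Vector], int],
) -> int:
    """Return 1 (anomaly) if either layer flags the connection, else 0."""
    # Layer 1: DoS/Probe detector applied to the Group 1 transform.
    if layer_1(transform_1(connection)) == 1:
        return 1
    # Layer 2 runs only when Layer 1 is negative; it receives the original
    # connection passed through the Group 2 transform.
    if layer_2(transform_2(connection)) == 1:
        return 1
    return 0

# Hypothetical stand-ins so the sketch runs without trained models.
def identity(v: Vector) -> Vector:
    return v

def flag_high_count(v: Vector) -> int:   # pretend Layer 1 (NBC)
    return 1 if v[0] > 100 else 0

def flag_shell_use(v: Vector) -> int:    # pretend Layer 2 (SVM)
    return 1 if v[1] > 0 else 0

connections = [(250.0, 0.0), (3.0, 1.0), (3.0, 0.0)]
results = [
    dlha_predict(c, identity, flag_high_count, identity, flag_shell_use)
    for c in connections
]
```

With these stand-ins, the first connection is flagged by Layer 1, the second only by Layer 2, and the third passes both layers. The early return mirrors the point made above: Layer 2 and its separate transform are evaluated only for connections that Layer 1 did not flag.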
As our 2-classifier hybrid approach is dedicated to maximizing the detection rates of R2L and U2R attacks, there are a few continuing costs of operation as a trade-off. Firstly, the time spent on attack detection increases because the decision process becomes more complex: two negative predictions are required to confirm that a connection is safe. Additionally, performing a data transformation for each layer leads to higher resource consumption. Powerful machines are recommended for this approach to avoid traffic bottlenecks. Importantly, machine learning approaches rely on quality data to establish a reliable model. Collecting attack signatures, e.g., using a honeypot strategy, would be beneficial for a long-term IDS implementation [62].

V. EVALUATION AND RESULT
To evaluate the performance of our proposed DLHA, we conducted experiments using the two training data sets KDDTrain+ and KDDTrain+_20Percent in order to analyze the framework on a large sample size and a small sample size. To measure generalization of the model, training and validation were only implemented using training data as described in Section IV. Thus, the testing data in KDDTest+ are left unseen.

A. EVALUATION METRICS
There are five metrics presented in this work, i.e., 1) Accuracy, 2) F1 Score, 3) Precision, 4) Detection Rate (Recall), and 5) False Alarm Rate. The four measures used to calculate the metrics are as follows: True Positive (TP) = correctly predicted attacks, True Negative (TN) = correctly predicted normal instances, False Positive (FP) = incorrectly predicted attacks, and False Negative (FN) = incorrectly predicted normal instances.
1. Accuracy is the overall percentage of correct classifications. However, it is unreliable for imbalanced data sets, particularly for the IDS problem. It can be computed as:

$$\frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

2. F1 Score is the harmonic mean of precision and recall. It can be computed as:

$$2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN} \tag{10}$$

3. Precision is the classification ability to correctly detect attacks out of the total positive predictions. It can be computed as:

$$\frac{TP}{TP + FP} \tag{11}$$

4. Detection Rate (Recall) is the classification ability to correctly predict attacks from actual attacks. It can be computed as:

$$\frac{TP}{TP + FN} \tag{12}$$

5. False Alarm Rate (FAR) is the proportion of normal instances wrongly predicted as attacks. FAR indicates overestimation that unnecessarily requires human intervention. It can be computed as:

$$\frac{FP}{FP + TN} \tag{13}$$

In this work, we mainly focused on Detection Rate (DR). DR is critical because it indicates how many attacks the model can identify out of the total number of actual attacks.

B. EXPERIMENTAL RESULT
At the training stage, we re-created the training data into 2 groups as mentioned previously. Then, we conducted the ICFS. The correlated features between DoS and the rest, $\{F_1, F_2, \ldots, F_i\}$, are [8, 12, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41] (see Table 5). The correlated features between Probe and the rest, $\{F_1, F_2, \ldots, F_j\}$, are [1, 12, 23, 24, 25, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 40, 41]. Therefore, the intersect features of DoS/Probe are [12, 23, 25, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 40, 41]. In Group 2, the correlated features between R2L and Normal, $\{F_1, F_2, \ldots, F_k\}$, are [1, 5, 6, 9, 10, 11, 12, 14, 18, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]. The correlated features between U2R and Normal, $\{F_1, F_2, \ldots, F_l\}$, are [9, 10, 12, 14, 17, 18, 24, 31, 32, 33, 36, 37]. Hence, the intersect features of R2L/U2R are [9, 10, 12, 14, 18, 24, 31, 32, 33, 36, 37]. This is reasonable, e.g., count is commonly high in DoS and Probe attacks, and num_shells is commonly relevant to R2L and U2R patterns.

FIGURE 5. Cumulative explained variance against the number of principal components measured in both groups to select the optimal number of dimensions.

After that, normalization and one-hot encoding were performed, in that order. PCA is the last step in the Data Transform process. 95% of cumulative variance was chosen as the threshold. The cumulative variance against the number of principal components is visualized in Fig. 5. It indicated that 28 is the suitable number of components in Group 1, which
represented 95.07% of variance. 13 is the selected number of components in Group 2, which constituted 96.55% variance.
Then, downsampling was carried out on the frequent records, i.e., Normal on Group 2, to keep a 1:1 ratio between anomaly and normal. At the last step, we performed hyperparameter tuning on Group 2 with a series of linear and RBF kernel parameters. The same set of parameters was also implemented on the comparatively smaller data set, i.e., KDDTrain+_20Percent, to evaluate a variety of different configurations with a primary performance boost based on the stratified 10-fold cross validation method. The results are shown in Fig. 6 and Fig. 7 respectively. Our main goal is to maximize the detection rates of the model in order to prevent losses caused by intruders. Accordingly, each box-and-whisker plot measured the detection rates as a result of each testing fold from 10 folds. The horizontal line in the box indicates the median detection rate value, and the + specifies the average detection rate of the 10 scores.

FIGURE 6. Box-and-whisker plots present mean, median, range, and quartile distribution of the detection rates from different parameters for SVM 10-fold CV in KDDTrain+.

FIGURE 7. Box-and-whisker plots present mean, median, range, and quartile distribution of the detection rates from different parameters for SVM 10-fold CV in KDDTrain+_20Percent.

The first experiment used KDDTrain+ in training. We attempted to select the best parameters to classify R2L and U2R attacks out of normal instances. Fig. 6 indicated that the linear kernel performed well on lower C and dropped its performance on higher C. The RBF kernel performed comparatively better in most combinations of parameters. There is an exception when C is equal to 0.1 and gamma is equal to 10, where the SVM performance is significantly lowered. It is evident that the higher the gamma value is, when C is equal to 0.1, the more the detection rate dropped. Additionally, when C is equal to or greater than 1, the performances are relatively consistent, as seen in configurations 10-21. The highest detection rate is located at configuration 6, where C equals 0.1 and gamma equals 0.01. It accomplished an acceptable average detection rate of 0.9943 with STD = 0.0061 and 0.1337 in FAR.
The second experiment used KDDTrain+_20Percent in training. In Fig. 7, we observed a small difference where the configurations in linear kernel performed moderately better compared to its previous evaluation. Most configurations of linear kernel performed worse when the data set becomes larger, as shown in Fig. 6. Noticeably, the same pattern is confirmed in the smaller data set: the RBF kernel has a similar performance in configurations 10-21. It performed best when C is equal to 0.1 and gamma is approximately 0.01 or 0.1. The performance dramatically dropped when C is equal to 0.1 and gamma is equal to 10, falling below 0.6 in the detection rates on some testing folds. The highest detection rate is attained in configuration 7, where C equals 0.1 and gamma equals 0.1, with an average detection rate of 0.9864 with STD = 0.0291 and 0.1136 in FAR. The results intuitively suggested that in order to detect R2L and U2R accurately, the penalty on misclassified samples should not be high (low C), and a single training instance should
not have too much influence on the decision boundary (low gamma).
To evaluate our framework on the two experiments, we tested DLHA on the unseen data, i.e., KDDTest+, using the procedure explained in Algorithm 1 and the best parameters derived from the CV process. Our proposed framework presented outstanding classification results, achieving 88.97% in accuracy, 90.57% in F1 score, 88.17% in precision, and 93.11% in detection rate with 11.82% of false alarm rate by using KDDTrain+ in training. The framework was also proven effective in a comparatively smaller data set, i.e., using only 20% of all samples (KDDTrain+_20Percent) in training, where it obtained acceptable results, these being 87.55% in accuracy, 89.19% in F1 score, 88.17% in precision, and 90.24% in detection rate with 11.83% of false alarm rate.
Then, we conducted a detailed analysis of our results to explore the detection rates of each class as shown in Fig. 8. It was found that our proposed method, from using KDDTrain+ in training, has the detection rates of 92.4% on DoS (6,893 out of 7,460), 90.87% on Probe (2,200 out of 2,421), 96.67% on R2L (2,789 out of 2,885), and 100% on U2R (67 out of 67). When using KDDTrain+_20Percent in training, it has the detection rates of 92.84% on DoS (6,926 out of 7,460), 89.88% on Probe (2,176 out of 2,421), 83.6% on R2L (2,421 out of 2,885), and 100% on U2R (67 out of 67). Therefore, it is demonstrated that our proposed DLHA accomplished its objective in maintaining great detection rates on DoS and Probe, and showed excellent performance in detecting 96.67% on R2L and 100% on U2R in KDDTest+.

FIGURE 8. Detection rates of major attack categories.

In addition, the time measurement was also presented as displayed in Fig. 9. The presented numbers were the average of 10 runs on the desktop machine. It was apparent that the time used for training in the KDDTrain+_20Percent was only one-third of the full data set as it contains only 20% of all training data. The testing time is similar on both training sets, where approximately 2.5 seconds were spent classifying 22,544 instances, or in other words, ≈ 9,000 instances were successfully classified in one second.

FIGURE 9. Training and testing time of DLHA in seconds.

One of the most important areas we highlighted in this study is how successful our approach is in detecting additional attack categories in KDDTest+, the attack categories that are absent in the training data set. There are 12,833 attacks in KDDTest+: 9,083 belong to known attack categories, and 3,750 are in unseen attack categories. DLHA, using KDDTrain+ in training, achieved detection rates of 94.01% (8,539 out of 9,083) from known attack categories, and 90.90% (3,411 out of 3,750) from unseen attack categories. DLHA, using KDDTrain+_20Percent in training, achieved detection rates of 89.81% (8,157 out of 9,083) from known attack categories, and 91.28% (3,423 out of 3,750) from unseen attack categories. From the results, DLHA performed outstandingly well in detecting both known and unknown attack categories. DLHA trained on KDDTrain+_20Percent gained a slightly higher detection rate on unseen attack categories. However, DLHA detected 94.01% of attacks from known attack categories when the total samples were used in training due to a greater number of samples per category in KDDTrain+ compared to KDDTrain+_20Percent.
It is worth mentioning that there are a number of existing works that previously studied anomaly-based IDS using a refined version of the KDD99, i.e., NSL-KDD [1], the same data set we considered in this study. However, some scholars presented their results from implementing a cross validation method, a holdout method, or using a portion of the KDD99 data set, which are not sufficiently reliable in the context of IDS research, i.e., achieved over 99-100% in accuracy or detection rate [28]. In this study, we used KDDTrain+ and KDDTrain+_20Percent in the training and validation steps, and only used KDDTest+ in testing. Therefore, we only compared our results to the studies that take a similar approach, i.e., using the KDDTest+ in testing.

TABLE 6. Performance comparison on accuracy, F1 score, precision, detection rate, and false alarm rate with other anomaly-based IDS approaches (only compared to the studies that performed evaluation in the original KDDTest+).

In order to objectively evaluate our proposed framework on wider impacts, we conducted an extensive comparison of our results to other publicly published IDS research papers as shown in Table 6. It is acknowledged that our framework is highly competitive in the field. Evidently, DLHA obtains the highest F1 Score and DR. However, the obvious downside of our model is a relatively high FAR because we attempt to maximize the detection rate. The no.22-26 results are
derived from the original NSL-KDD article, which are set as a baseline. Any models that perform worse than the baseline are considered substandard. Our DLHA has considerably higher accuracy than the best baseline single machine learning classifier, NB Tree, by 6.95 percentage points, and by 11.56 percentage points compared to Multi-Layer Perceptron. Furthermore, [37], [47] developed the single machine learning classifier models, SVM and RNN, to detect all attack types. Their accuracy scores were 82.37% and 81.29% respectively, indicating no improvement over the baseline, while most hybrid methods performed better than the baseline. In addition, we compared our detection rates of the major attack categories to other studies as displayed in Table 7. The comparison indicates that DLHA is not the best algorithm to detect DoS or Probe, as our results attain approximately 90-92%, while others show superior outcomes. However, our model can accurately detect every type of attack compared to others that exhibit undesirable detection scores on R2L and U2R. Our model clearly outperforms all other methods by reaching the detection rates of 96.67% in R2L and 100% in U2R.

TABLE 7. Comparative detection rates of major attack categories.

VI. CONCLUSION
Rule-based IDS methods are not sufficient for the new era of rapidly-growing internet connections worldwide. Anomaly-based IDS approaches using machine learning offer a promising performance, but usually suffer from bias towards frequent attacks as well as underestimation of rare threats. Single machine learning models are not accurate in detecting all types of attacks, which results in a low detection rate, particularly on infrequent attacks. Thus, the IDS problem requires a hybrid solution.
This paper proposed an algorithm called a Double-Layered Hybrid Approach (DLHA) to tackle unsatisfactory performance on rare attacks, which also gives rise to an improved overall detection rate. An Intersectional Correlated Feature Selection (ICFS) was presented as part of DLHA to exclude
commonly irrelevant features on the subgroups to reduce dimensions and accelerate the whole framework for real-time practice. The detection part consists of two layers. The first layer utilized NBC to classify DoS and Probe attacks from all connections. The second layer adopted SVM with RBF kernel to detect R2L and U2R attacks among normal traffic, which is a more difficult task. Hyperparameter tuning is paramount: C and gamma on SVM were optimized as they are the primary factors to accurately detect attacks that share a similar pattern to normal connections, i.e., R2L and U2R.
Our proposed DLHA was evaluated on the NSL-KDD data set. It achieved exceptional results with an overall detection rate of 93.11%, a 96.67% detection rate on R2L, and 100% on U2R. The execution time and F1 score have proven its enhanced efficiency and capability for broader applications.
Our experimental results demonstrated how successful and effective the hybrid IDS approach is by using two different classifiers with ICFS. Since we avoided overfitting and data leakage by implementing hyperparameter tuning on 10-fold CV using training data, we concluded that our DLHA offers a generalized model with a class-topping performance in detecting uncommon but more dangerous attacks. This approach is suitable for a real-time IDS and aims to secure critical network environments. Possible future work is the application of this approach to a data set or network environment that categorizes attacks differently, e.g., having more than four types of attacks.

REFERENCES
[1] U. S. Musa, M. Chhabra, A. Ali, and M. Kaur, “Intrusion detection system using machine learning techniques: A review,” in Proc. Int. Conf. Smart Electron. Commun. (ICOSEC), Sep. 2020, pp. 149–155.
[2] H.-J. Liao, C.-H. R. Lin, Y.-C. Lin, and K.-Y. Tung, “Intrusion detection system: A comprehensive review,” J. Netw. Comput. Appl., vol. 36, no. 1, pp. 16–24, 2013.
[3] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD CUP 99 data set,” in Proc. IEEE Symp. Comput. Intell. Secur. Defense Appl., Jul. 2009, pp. 1–6.
[4] H. H. Pajouh, R. Javidan, R. Khayami, A. Dehghantanha, and K.-K. R. Choo, “A two-layer dimension reduction and two-tier classification model for anomaly-based intrusion detection in IoT backbone networks,” IEEE Trans. Emerg. Topics Comput., vol. 7, no. 2, pp. 314–323, Apr./Jun. 2019.
[5] R. Das and M. Z. Gündüz, “Analysis of cyber-attacks in IoT-based critical infrastructures,” Int. J. Inf. Secur. Sci., vol. 8, no. 4, pp. 122–133, 2020.
[6] M. Roesch, “Snort: Lightweight intrusion detection for networks,” Lisa, vol. 99, pp. 229–238, Jun. 1999.
[7] R. Lippmann, D. Fried, I. Graf, J. Haines, K. Kendall, D. McClung, D. Weber, S. Webster, D. Wyschogrod, R. Cunningham, and M. Zissman, “Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation,” in Proc. DARPA Inf. Survivability Conf. Expo., vol. 2, 2000, pp. 12–26.
[8] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1153–1176, 2nd Quart., 2016.
[9] F. Salo, M. Injadat, A. B. Nassif, A. Shami, and A. Essex, “Data mining techniques in intrusion detection systems: A systematic literature review,” IEEE Access, vol. 6, pp. 56046–56058, 2018.
[10] I. Ahmad, M. Basheri, M. J. Iqbal, and A. Raheem, “Performance comparison of support vector machine, random forest, and extreme learning machine for intrusion detection,” IEEE Access, vol. 6, pp. 33789–33795, 2018.
[11] X. Gao, C. Shan, C. Hu, Z. Niu, and Z. Liu, “An adaptive ensemble machine learning model for intrusion detection,” IEEE Access, vol. 7, pp. 82512–82521, 2019.
[12] J. Gao, S. Chai, B. Zhang, and Y. Xia, “Research on network intrusion detection based on incremental extreme learning machine and adaptive principal component analysis,” Energies, vol. 12, no. 7, p. 1223, Mar. 2019.
[13] S.-H. Kang and K. J. Kim, “A feature selection approach to find optimal feature subsets for the network intrusion detection system,” Cluster Comput., vol. 19, no. 1, pp. 325–333, Mar. 2016.
[14] J. Zhang, M. Zulkernine, and A. Haque, “Random-forests-based network intrusion detection systems,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 5, pp. 649–659, Sep. 2008.
[15] R. Vinayakumar, M. Alazab, K. Soman, P. Poornachandran, A. Al-Nemrat, and S. Venkatraman, “Deep learning approach for intelligent intrusion detection system,” IEEE Access, vol. 7, pp. 41525–41550, 2019.
[16] J. Mill and A. Inoue, “Support vector classifiers and network intrusion detection,” in Proc. IEEE Int. Conf. Fuzzy Syst., Jul. 2004, pp. 407–410.
[17] W. Li, P. Yi, Y. Wu, L. Pan, and J. Li, “A new intrusion detection system based on KNN classification algorithm in wireless sensor network,” J. Electr. Comput. Eng., vol. 2014, pp. 1–8, Jun. 2014.
[18] B. Zhang, Z. Liu, Y. Jia, J. Ren, and X. Zhao, “Network intrusion detection method based on PCA and Bayes algorithm,” Secur. Commun. Netw., vol. 2018, pp. 1–11, Nov. 2018.
[19] P. Tao, Z. Sun, and Z. Sun, “An improved intrusion detection algorithm based on GA and SVM,” IEEE Access, vol. 6, pp. 13624–13631, 2018.
[20] Y. Zhang, Q. Yang, S. Lambotharan, K. Kyriakopoulos, I. Ghafir, and B. AsSadhan, “Anomaly-based network intrusion detection using SVM,” in Proc. 11th Int. Conf. Wireless Commun. Signal Process. (WCSP), Oct. 2019, pp. 1–6.
[21] M. Li, “Application of CART decision tree combined with PCA algorithm in intrusion detection,” in Proc. 8th IEEE Int. Conf. Softw. Eng. Service Sci. (ICSESS), Nov. 2017, pp. 38–41.
[22] S. Sahu and B. M. Mehtre, “Network intrusion detection system using J48 decision tree,” in Proc. Int. Conf. Adv. Comput., Commun. Informat. (ICACCI), Aug. 2015, pp. 2023–2026.
[23] Y. Liao and V. Vemuri, “Use of K-nearest neighbor classifier for intrusion detection,” Comput. Secur., vol. 21, no. 5, pp. 439–448, Oct. 2002.
[24] M. Panda and M. R. Patra, “Network intrusion detection using Naive Bayes,” Int. J. Comput. Sci. Netw. Secur., vol. 7, no. 12, pp. 258–263, 2007.
[25] R. F. Fouladi, C. E. Kayatas, and E. Anarim, “Frequency based DDoS attack detection approach using naive Bayes classification,” in Proc. 39th Int. Conf. Telecommun. Signal Process. (TSP), Jun. 2016, pp. 104–107.
[26] S. Mukherjee and N. Sharma, “Intrusion detection using naive Bayes classifier with feature reduction,” Proc. Technol., vol. 4, pp. 119–128, Feb. 2012.
[27] J. V. Anand Sukumar, I. Pranav, M. Neetish, and J. Narayanan, “Network intrusion detection using improved genetic k-means algorithm,” in Proc. Int. Conf. Adv. Comput., Commun. Informat. (ICACCI), Sep. 2018, pp. 2441–2446.
[28] B. A. Tama, M. Comuzzi, and K. Rhee, “TSE-IDS: A two-stage classifier ensemble for intelligent anomaly-based intrusion detection system,” IEEE Access, vol. 7, pp. 94497–94507, 2019.
[29] S. Khalid, T. Khalil, and S. Nasreen, “A survey of feature selection and feature extraction techniques in machine learning,” in Proc. Sci. Inf. Conf., Aug. 2014, pp. 372–378.
[30] W. Feng, Q. Zhang, G. Hu, and J. X. Huang, “Mining network data for intrusion detection through combining SVMs with ant colony networks,” Future Generat. Comput. Syst., vol. 37, pp. 127–140, Jul. 2014.
[31] M. Ektefa, S. Memar, F. Sidi, and L. S. Affendey, “Intrusion detection using data mining techniques,” in Proc. Int. Conf. Inf. Retr. Knowl. Manage. (CAMP), Mar. 2010, pp. 200–203.
[32] H. F. Eid, A. E. Hassanien, T.-H. Kim, and S. Banerjee, “Linear correlation-based feature selection for network intrusion detection model,” in Proc. Int. Conf. Secur. Inf. Commun. Netw. Berlin, Germany: Springer, 2013, pp. 240–248.
[33] S. Mohammadi, H. Mirvaziri, M. Ghazizadeh-Ahsaee, and H. Karimipour, “Cyber intrusion detection by combined feature selection algorithm,” J. Inf. Secur. Appl., vol. 44, pp. 80–88, Feb. 2019.
[34] K. Jiang, W. Wang, A. Wang, and H. Wu, “Network intrusion detection combined hybrid sampling with deep hierarchical network,” IEEE Access, vol. 8, pp. 32464–32476, 2020.
[35] L. Mohammadpour, T. C. Ling, C. S. Liew, and C. Y. Chong, “A convolutional neural network for network intrusion detection system,” Proc. Asia–Pacific Adv. Netw., vol. 46, Aug. 2018, pp. 50–55.
[36] Y. Ding and Y. Zhai, “Intrusion detection system for NSL-KDD dataset using convolutional neural networks,” in Proc. 2nd Int. Conf. Comput. Sci. Artif. Intell. (CSAI), 2018, pp. 81–85.
[37] C. Yin, Y. Zhu, J. Fei, and X. He, “A deep learning approach for intrusion detection using recurrent neural networks,” IEEE Access, vol. 5, pp. 21954–21961, 2017.
[38] P. Gogoi, D. K. Bhattacharyya, B. Borah, and J. K. Kalita, “MLH-IDS: A multi-level hybrid intrusion detection method,” Comput. J., vol. 57, no. 4, pp. 602–623, Apr. 2014.
[39] H. Yao, Q. Wang, L. Wang, P. Zhang, M. Li, and Y. Liu, “An intrusion detection framework based on hybrid multi-level data mining,” Int. J. Parallel Program., vol. 47, no. 4, pp. 740–758, Aug. 2019.
[40] N. B. Nanda and A. Parikh, “Hybrid approach for network intrusion detection system using random forest classifier and rough set theory for rules generation,” in Proc. Int. Conf. Adv. Informat. Comput. Res. Singapore: Springer, 2019, pp. 274–287.
[41] P. Singh and M. Venkatesan, “Hybrid approach for intrusion detection system,” in Proc. Int. Conf. Current Trends towards Converging Technol. (ICCTCT), Mar. 2018, pp. 1–5.
[42] V. Hajisalem and S. Babaie, “A hybrid intrusion detection system based on ABC-AFS algorithm for misuse and anomaly detection,” Comput. Netw., vol. 136, pp. 37–50, May 2018.
[43] E. Kim and S. Kim, “A novel anomaly detection system based on HFR-MLR method,” in Mobile, Ubiquitous, and Intelligent Computing. Berlin, Germany: Springer, 2014, pp. 279–286.
[60] H. H. Pajouh, G. Dastghaibyfard, and S. Hashemi, “Two-tier network anomaly detection model: A machine learning approach,” J. Intell. Inf. Syst., vol. 48, no. 1, pp. 61–74, 2017.
[61] A. A. Alfantookh, “DoS attacks intelligent detection using neural networks,” J. King Saud Univ. Comput. Inf. Sci., vol. 18, pp. 31–51, 2006.
[62] M. Baykara and R. Das, “A novel honeypot based security approach for real-time intrusion detection and prevention systems,” J. Inf. Secur. Appl., vol. 41, pp. 103–116, Aug. 2018.
[63] E. De la Hoz, E. De La Hoz, A. Ortiz, J. Ortega, and B. Prieto, “PCA filtering and probabilistic SOM for network intrusion detection,” Neurocomputing, vol. 164, pp. 71–81, Sep. 2015.
[64] A. Karami and M. Guerrero-Zapata, “A fuzzy anomaly detection system based on hybrid PSO-Kmeans algorithm in content-centric networks,” Neurocomputing, vol. 149, pp. 1253–1269, Feb. 2015.
[65] C. Ieracitano, A. Adeel, F. C. Morabito, and A. Hussain, “A novel statistical analysis and autoencoder driven intelligent intrusion detection approach,” Neurocomputing, vol. 387, pp. 51–62, Apr. 2020.
[66] M. Baykara and R. Daş, “SoftSwitch: A centralized honeypot-based security approach using software-defined switching for secure management of VLAN networks,” Turkish J. Electr. Eng. Comput. Sci., vol. 27, no. 5, pp. 3309–3325, Sep. 2019.
[67] H. Bostani and M. Sheikhan, “Modification of supervised OPF-based intrusion detection systems using unsupervised learning and social network concept,” Pattern Recognit., vol. 62, pp. 56–72, Feb. 2017.
[44] P.-J. Chuang and S.-H. Li, ‘‘Network intrusion detection using hybrid [68] A. Golrang, A. M. Golrang, S. Yildirim Yayilgan, and O. Elezaj, ‘‘A novel
machine learning,’’ in Proc. Int. Conf. Fuzzy Theory Appl. (iFUZZY), hybrid IDS based on modified NSGAII-ANN and random forest,’’ Elec-
Nov. 2019, pp. 1–5. tronics, vol. 9, no. 4, p. 577, Mar. 2020.
[45] J. Esmaily, R. Moradinezhad, and J. Ghasemi, ‘‘Intrusion detection system [69] C. Liu, Z. Gu, and J. Wang, ‘‘A hybrid intrusion detection system based on
based on multi-layer perceptron neural networks and decision tree,’’ in scalable K-means+ random forest and deep learning,’’ IEEE Access, vol. 9,
Proc. 7th Conf. Inf. Knowl. Technol. (IKT), May 2015, pp. 1–5. pp. 75729–75740, 2021.
[46] L. Li, Y. Yu, S. Bai, Y. Hou, and X. Chen, ‘‘An effective two-step intru- [70] S. Stolfo. (1999). KDD-99 Dataset. [Online]. Available: http://www.
sion detection approach based on binary classification and κ-NN,’’ IEEE kdd.ics.uci.edu/databases/kddcup99/kddcup99.htmlkddcup99.html
Access, vol. 6, pp. 12060–12073, 2018. [71] A. Özgür and H. Erdem, ‘‘A review of KDD99 dataset usage in intrusion
[47] M. S. Pervez and D. M. Farid, ‘‘Feature selection and intrusion classifi- detection and machine learning between 2010 and 2015,’’ PeerJ Preprints,
cation in NSL-KDD cup 99 dataset employing SVMs,’’ in Proc. 8th Int. vol. 4, Apr. 2016, Art. no. e1954v1.
Conf. Softw., Knowl., Inf. Manage. Appl. (SKIMA), Dec. 2014, pp. 1–6. [72] S. Revathi and A. Malathi, ‘‘A detailed analysis on NSL-KDD dataset using
[48] H. He and E. A. Garcia, ‘‘Learning from imbalanced data,’’ IEEE Trans. various machine learning techniques for intrusion detection,’’ Int. J. Eng.
Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009. Res. Technol., vol. 2, no. 12, pp. 1848–1853, 2013.
[49] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, ‘‘A deep learning approach [73] T. Su, H. Sun, J. Zhu, S. Wang, and Y. Li, ‘‘BAT: Deep learning methods
for network intrusion detection system,’’ EAI Endorsed Trans. Secur. Saf., on network intrusion detection using NSL-KDD dataset,’’ IEEE Access,
vol. 3, no. 9, p. e2, 2016. vol. 8, pp. 29575–29585, 2020.
[50] A. A. Olusola, A. S. Oladele, and D. O. Abosede, ‘‘Analysis of KDD’99 [74] N. Chouhan, A. Khan, and H.-U.-R. Khan, ‘‘Network anomaly detec-
intrusion detection dataset for selection of relevance features,’’ in Proc. tion using channel boosted and residual learning based deep con-
World Congr. Eng. Comput. Sci. (WCECS), San Francisco, CA, USA, volutional neural network,’’ Appl. Soft Comput., vol. 83, Oct. 2019,
vol. 1, Oct. 2010, pp. 20–22. Art. no. 105612.
[51] I. S. Thaseen and C. A. Kumar, ‘‘Intrusion detection model using fusion [75] L. Dhanabal and S. P. Shantharajah, ‘‘A study on NSL-KDD dataset for
of chi-square feature selection and multi class SVM,’’ J. King Saud Univ.- intrusion detection system based on classification algorithms,’’ Int. J. Adv.
Comput. Inf. Sci., vol. 29, no. 4, pp. 462–472, 2016. Res. Comput. Commun. Eng., vol. 4, no. 6, pp. 446–452, 2015.
[52] G. Li, Z. Yan, Y. Fu, and H. Chen, ‘‘Data fusion for network intrusion detec- [76] C. Seger, ‘‘An investigation of categorical variable encoding techniques
tion: A review,’’ Secur. Commun. Netw., vol. 2018, pp. 1–16, May 2018. in machine learning: Binary versus one-hot and feature hashing,’’ School
[53] B. Aslahi-Shahri, R. Rahmani, M. Chizari, A. Maralani, M. Eslami, Elect. Eng. Comput. Sci., KTH Roy. Inst. Technol., Stockholm, Sweden,
M. J. Golkar, and A. Ebrahimi, ‘‘A hybrid method consisting of GA and Tech. Rep., 2018.
SVM for intrusion detection system,’’ Neural Comput. Appl., vol. 27, no. 6, [77] H. Abdi and L. J. Williams, ‘‘Principal component analysis,’’ Wiley Inter-
pp. 1669–1676, 2016. discipl. Rev., Comput. Statist., vol. 2, no. 4, pp. 433–459, 2010.
[54] S. Aljawarneh, M. Aldwairi, and M. B. Yassein, ‘‘Anomaly-based intrusion [78] J. Benesty, J. Chen, Y. Huang, and I. Cohen, ‘‘Pearson correlation coeffi-
detection system through feature selection analysis and building hybrid cient,’’ in Noise Reduction in Speech Processing. Springer, 2009, pp. 1–4.
efficient model,’’ J. Comput. Sci., vol. 25, pp. 152–160, Mar. 2018. [79] H. Zhang, ‘‘The optimality of Naive Bayes,’’ AA, vol. 1, no. 2, p. 3, 2004.
[55] N. A. Biswas, F. M. Shah, W. M. Tammi, and S. Chakraborty, ‘‘FP-ANK: [80] B. E. Boser, I. M. Guyon, and V. N. Vapnik, ‘‘A training algorithm for
An improvised intrusion detection system with hybridization of neural optimal margin classifiers,’’ in Proc. 5th Annu. Workshop Comput. Learn.
network and K-means clustering over feature selection by PCA,’’ in Proc. Theory (COLT), 1992, pp. 144–152.
18th Int. Conf. Comput. Inf. Technol. (ICCIT), Dec. 2015, pp. 317–322. [81] C. Cortes and V. Vapnik, ‘‘Support-vector networks,’’ Mach. Learn.,
[56] M. Mazini, B. Shirazi, and I. Mahdavi, ‘‘Anomaly network-based intru- vol. 20, no. 3, pp. 273–297, 1995.
sion detection system using a reliable hybrid artificial bee colony and [82] J. Jha and L. Ragha, ‘‘Intrusion detection system using support vector
AdaBoost algorithms,’’ J. King Saud Univ. Comput. Inf. Sci., vol. 31, no. 4, machine,’’ Int. J. Appl. Inf. Syst., vol. 3, pp. 25–30, Jun. 2013.
pp. 541–553, Oct. 2019. [83] C. C. Chang and C. J. Lin, ‘‘LIBSVM: A library for support vector
[57] H. Altwaijry, ‘‘Bayesian based intrusion detection system,’’ in IAENG machines,’’ ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27,
Transactions on Engineering Technologies. Dordrecht, The Netherlands: 2011.
Springer, 2013, pp. 29–44. [84] M. Al-Qatf, Y. Lasheng, M. Al-Habib, and K. Al-Sabahi, ‘‘Deep learning
[58] Ü. Çavuşoğlu, ‘‘A new hybrid approach for intrusion detection using approach combining sparse autoencoder with SVM for network intrusion
machine learning methods,’’ Appl. Intell., vol. 49, no. 7, pp. 2735–2761, detection,’’ IEEE Access, vol. 6, pp. 52843–52856, 2018.
2019. [85] N. K. Kanakarajan and K. Muniasamy, ‘‘Improving the accuracy of intru-
[59] T. S. Hwang, T.-J. Lee, and Y.-J. Lee, ‘‘A three-tier IDS via data min- sion detection using GAR-forest with feature selection,’’ in Proc. 4th Int.
ing approach,’’ in Proc. 3rd Annu. ACM Workshop Mining Netw. Data Conf. Frontiers Intell. Comput., Theory Appl. (FICTA). New Delhi, India:
(MineNet), 2007, pp. 1–6. Springer, 2016, pp. 539–547.
[86] P. Kromer, J. Platos, V. Snasel, and A. Abraham, ‘‘Fuzzy classification by evolutionary algorithms,’’ in Proc. IEEE Int. Conf. Syst., Man, Cybern., Oct. 2011, pp. 313–318.
[87] H. Benaddi, K. Ibrahimi, and A. Benslimane, ‘‘Improving the intrusion detection system for NSL-KDD dataset based on PCA-fuzzy clustering-KNN,’’ in Proc. 6th Int. Conf. Wireless Netw. Mobile Commun. (WINCOM), Oct. 2018, pp. 1–6.
[88] S.-J. Horng, M.-Y. Su, Y.-H. Chen, T.-W. Kao, R.-J. Chen, J.-L. Lai, and C. D. Perkasa, ‘‘A novel intrusion detection system based on hierarchical clustering and support vector machines,’’ Expert Syst. Appl., vol. 38, no. 1, pp. 306–313, 2011.
[89] C. Guo, Y. Ping, N. Liu, and S.-S. Luo, ‘‘A two-level hybrid approach for intrusion detection,’’ Neurocomputing, vol. 214, pp. 391–400, Nov. 2016.
[90] P. Bedi, N. Gupta, and V. Jindal, ‘‘Siam-IDS: Handling class imbalance problem in intrusion detection systems using Siamese neural network,’’ Proc. Comput. Sci., vol. 171, pp. 780–789, Jan. 2020.

MASON THAMMAWICHAI received the B.S. degree in computer engineering from the University of Wisconsin-Madison, USA, the M.Sc. degree in avionic system from the University of Sheffield, U.K., and the Ph.D. degree from Imperial College London, in 2016. He has been a Faculty Member with the Graduate School of Navaminda Kasatriyadhiraj Royal Air Force Academy, since 2016. From 2017 to 2021, he was an Assistant Professor. Since 2021, he has been an Associate Professor. His research interests include optimization, optimal control, UAV, swarm robots, intelligence systems, machine learning, deep learning, and cyber security.