Inductive Intrusion Detection in Flow-Based Network Data Using One-Class Support Vector Machines
Abstract—Despite extensive research effort, ordinary anomaly detection systems still suffer from serious drawbacks such as high false alarm rates due to the enormous variety of network traffic. Also, increasingly fast network speeds pose performance problems to systems based on deep packet inspection. In this paper, we address these problems by proposing a novel inductive network intrusion detection system. The system operates on lightweight network flows and uses One-Class Support Vector Machines for analysis. In contrast to traditional anomaly detection systems, the system is trained with malicious rather than with benign network data. The system is suited for the load of large-scale networks and is less affected by the typical problems of ordinary anomaly detection systems.

Evaluations yielded satisfying results which indicate that the proposed approach is interesting for further research and complements traditional signature-based intrusion detection systems well.

Keywords—network intrusion detection; machine learning; support vector machine; netflow

I. INTRODUCTION

In the area of network intrusion detection, the research community usually focuses on either misuse or anomaly detection systems. While the former are meant to detect precisely specified attack signatures, the latter are supposed to detect patterns deviating from normal network operations. Both concepts exhibit disadvantages. Misuse detection systems struggle with steadily increasing network speeds and growing sets of attack signatures, while anomaly detection systems suffer from high false alarm rates and a lack of representative training data, to name just a few.

The network intrusion detection system (NIDS) proposed in this paper embraces concepts of both worlds. It operates on network flows rather than on entire network packets. Incoming flows are analysed using One-Class Support Vector Machines (OC-SVM). The learning algorithm is trained solely with malicious rather than with benign data. By this means, the NIDS is supposed to recognize previously learned attacks, including attack variations, instead of detecting anomalies. The proposed concept is entitled "inductive NIDS".

The remainder of this paper is organized as follows. Section II gives a brief overview of similar contributions. Section III introduces the proposed approach. Section IV describes the process of model and feature selection while Section V evaluates the proposed approach and discusses the results. Finally, Section VI provides a conclusion and discusses future work.

II. RELATED WORK

Gao and Chen designed and developed a flow-based intrusion detection system in [1]. Karasaridis et al. [2], Shahrestani et al. [3] and Livadas et al. [4] proposed concepts for the detection of botnets in network flows. In [5], Sperotto et al. provided a comprehensive survey of current research in the domain of flow-based network intrusion detection.

A sound evaluation of a NIDS is a nontrivial task and requires high-quality training and testing sets. Unfortunately, the de facto standard is still the DARPA data set created by Lippmann et al. [6]. Despite its severe weaknesses and the critique published by McHugh [7], it is still in use. The KDD Cup '99 data set can be regarded as another popular data set [8]. Sperotto et al. contributed the first labeled flow-based data set [9] intended for evaluating and training network intrusion detection systems; this data set is used in this paper.

Finally, in [10] Gates and Taylor examine whether the established anomaly detection paradigms are still valid. Similar work is contributed by Sommer and Paxson in [11], who point out why anomaly detection systems are hardly used in operational networks.

III. PROPOSED APPROACH

We decided to train the NIDS with malicious network data only. This approach is diametrically opposed to the way ordinary anomaly detection systems work. As pointed out in [11], machine learning methods perform better at recognizing inliers than at detecting outliers. Further advantages of this approach are discussed in Section V-B.

In short, the proposed NIDS receives network flows and analyzes them with an OC-SVM. The following two sections briefly introduce the concepts behind network flows and OC-SVMs, respectively.

A. Network flows

A set of unidirectional network packets sharing certain characteristics forms a so-called network flow.
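To make the approach concrete, the following minimal sketch shows how a flow record could be encoded as a numeric feature vector (using the seven attributes later described in Section III-D) and how a One-Class SVM could be trained on malicious flows only. The code is illustrative: scikit-learn and the made-up flow records are assumptions, not the authors' implementation, and the parameter values anticipate the results of Section IV.

```python
# Minimal sketch of the proposed approach: encode flows as feature
# vectors and train a One-Class SVM on malicious flows only.
# Assumptions: scikit-learn as SVM implementation, made-up flow records;
# nu and gamma anticipate the values selected in Section IV.
import numpy as np
from sklearn.svm import OneClassSVM

# The seven flow attributes described later in Section III-D.
FEATURES = ["packets", "octets", "duration",
            "src_port", "dst_port", "tcp_flags", "ip_protocol"]

def flow_to_vector(flow):
    """Encode one flow record as a numeric feature vector."""
    return np.array([flow[f] for f in FEATURES], dtype=float)

malicious_flows = [  # hypothetical stand-ins for the malicious data set
    {"packets": 3, "octets": 180, "duration": 0.1, "src_port": 51234,
     "dst_port": 22, "tcp_flags": 2, "ip_protocol": 6},
    {"packets": 5, "octets": 320, "duration": 0.4, "src_port": 41022,
     "dst_port": 22, "tcp_flags": 18, "ip_protocol": 6},
]
X_train = np.array([flow_to_vector(f) for f in malicious_flows])

clf = OneClassSVM(kernel="rbf", nu=0.02, gamma=0.26).fit(X_train)

# +1: the flow lies inside the learned (malicious) class -> alert.
# -1: the flow is an outlier w.r.t. the malicious class -> no alert.
print(clf.predict(X_train))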
Fig. 1. The high-level view on the proposed approach for an inductive NIDS.

Fig. 3. The attack types of the original training set.

Fig. 4. Partitioning of the malicious and the benign data set.
1) The IP addresses of the original data set were discarded since they had been anonymized. Furthermore, all time information was condensed into a single attribute entitled "duration". In the end, the following flow attributes remained in the reduced training set: packets/flow, octets/flow, duration per flow, source port, destination port, TCP flags and IP protocol.
2) The deletion of irrelevant flows followed these rules: First of all, all 5,968 unlabeled flows of the original data set (i.e. flows whose nature could not be verified) were deleted. Next, all flows belonging to protocols other than SSH and HTTP were deleted; 215,123 flows were affected by this deletion.
3) Random sampling was necessary because the intermediate data set after steps 1) and 2) still contained the vast amount of almost 13 million flows. Training on that many flows would require an unacceptably high training time. The random sampling process selected every flow of the intermediate data set with a probability of 1/600. In the end, this step yielded a total of 22,924 flows. These flows represent the reduced training set which will be used for model and feature selection in Section IV (a sketch of the whole reduction follows this list).
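For concreteness, the three reduction steps can be pictured as the following Python sketch. The record format and helper names are hypothetical and this is not the authors' tooling; in particular, identifying SSH and HTTP flows by destination port is an assumption.

```python
# Hypothetical sketch of the training set reduction (Section III-D):
# 1) keep seven attributes, 2) drop unlabeled/irrelevant flows,
# 3) sample surviving flows with probability 1/600.
import random

SSH, HTTP = 22, 80  # assumed: protocols identified via destination port

def reduce_training_set(flows, sample_prob=1 / 600, seed=0):
    rng = random.Random(seed)
    reduced = []
    for flow in flows:
        if flow.get("label") is None:             # step 2: unlabeled flow
            continue
        if flow["dst_port"] not in (SSH, HTTP):   # step 2: other protocol
            continue
        if rng.random() >= sample_prob:           # step 3: random sampling
            continue
        reduced.append({                          # step 1: seven attributes
            "packets": flow["packets"], "octets": flow["octets"],
            "duration": flow["end"] - flow["start"],
            "src_port": flow["src_port"], "dst_port": flow["dst_port"],
            "tcp_flags": flow["tcp_flags"], "ip_protocol": flow["ip_protocol"],
        })
    return reduced
```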
E. The benign data set

A data set holding only benign flows was created from scratch. The purpose of this set is to aid in the process of model and feature selection (see Section IV). In the end, it is also used to evaluate the NIDS (see Section V). The set was created by manually generating network traffic inside a virtual machine based on grml Linux [15]. All the network traffic was captured with tcpdump [16] and then converted to flow format using softflowd [17]. The benign data set comprises flows belonging to the following protocols: HTTP, SSH, DNS, ICMP and FTP. Overall, the set consists of 1,904 flows.
F. Data set partitioning

Prior to the process of model and feature selection, the malicious and the benign data set were each further divided into a so-called testing and validation set. The partitioning is illustrated in Figure 4.

The two validation sets are only used for model and feature selection whereas the testing sets are used for the final evaluation described in Section V. One-third of the malicious set and half of the benign set were randomly sampled and added to the respective testing set. The remainder of each set was chosen to be the respective validation set.

IV. MODEL AND FEATURE SELECTION

The sole purpose of model and feature selection is the optimization of the classifier's classification capability, i.e., the best possible adaptation of the classifier to the classification problem at hand.

Feature selection is the process of selecting a subset of features out of the original feature set. Within the scope of this paper, this means that the original feature set F, consisting of the seven flow attributes described in Section III-D, is to be reduced to a subset S where S ⊆ F.

Model selection or optimization is the process of finding parameters for the respective machine learning method which lead to optimal classification results. OC-SVMs require the parameter ν, {ν ∈ R | 0 < ν ≤ 1}, which controls the fraction of outliers in the training set. For ν, we tested the values {0.001, 0.201, 0.401, 0.601, 0.801}. This linearly increasing set was chosen to approximately cover the possible range for ν. Furthermore, a radial basis function (RBF) kernel was chosen for the OC-SVM due to its general applicability [18]. The RBF kernel requires a parameter γ, {γ ∈ R | 0 ≤ γ}, which specifies the width of the RBF kernel. For γ, the values {0.1, 0.3, 0.5, 1, 2, 5, 10} are tested. This set increases nonlinearly and is also supposed to cover a small but realistic range for γ.

So, roughly speaking, model selection is the task of determining the best tradeoff between ν and γ for the OC-SVM.

A. Joint optimization

Model and feature selection was regarded as one joint rather than two independent optimization problems. As pointed out in [19] and [20], this methodology can lead to better results.

As argued by the authors of [11], limiting the false alarm rate of an anomaly detection system should have top priority. Hence, our proposed NIDS was optimized with respect to its false alarm rate.

The small search space enables the use of the popular grid search method for joint optimization. In the scope of this paper, the grid spans three dimensions (a sketch of the resulting search follows this list):

1) The feature subset, of which 2^7 - 1 = 127 are possible. The exponent stands for the 7 feature candidates, i.e. flow attributes.
2) The model parameter ν, for which 5 values are tested as discussed in the previous section.
3) The model parameter γ, for which 7 values are tested as discussed in the previous section.
TABLE I
RESULTS OF THE COARSE GRAINED OPTIMIZATION.

False alarm rate | Miss rate | γ   | ν     | Feature subset
0%               | 22.53807% | 0.1 | 0.201 | SP, DP, TF, PR
0%               | 22.53807% | 0.1 | 0.201 | SP, DP, TF
0%               | 22.61685% | 0.3 | 0.201 | SP, DP, TF, PR
0%               | 22.61685% | 0.3 | 0.201 | SP, DP, TF
0%               | 22.66281% | 0.1 | 0.201 | P, SP, DP, TF, PR

TABLE II
RESULTS OF THE FINE GRAINED OPTIMIZATION.

False alarm rate | Miss rate | γ    | ν
0%               | 4.71376%  | 0.26 | 0.02
0%               | 4.72689%  | 0.25 | 0.02
0%               | 4.74658%  | 0.29 | 0.02
0%               | 4.75315%  | 0.27 | 0.02
0%               | 4.76628%  | 0.28 | 0.02
The grid search is meant to determine the best feature subset as well as the best ν/γ-combination. The optimization is performed in two consecutive steps:

1) First, "coarse grained optimization" is performed over the full grid described above.
2) Afterwards, "fine grained optimization" is performed. This step builds upon the results of the coarse grained optimization and further explores the surrounding region of the best ν/γ-combination to achieve even better results (a small sketch of these ranges follows this list).
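As a concrete illustration of step 2, the fine grained ranges can be generated as follows. This is merely a NumPy sketch; the bounds and the step size of 0.01 are taken from Section IV-C below.

```python
# Sketch of step 2: the fine grained ranges explored around the best
# coarse grained point (nu = 0.201, gamma = 0.1), cf. Section IV-C.
import numpy as np

nu_fine = np.arange(1, 40) / 100     # {0.01, 0.02, ..., 0.39}, 39 values
gamma_fine = np.arange(1, 30) / 100  # {0.01, 0.02, ..., 0.29}, 29 values
assert len(nu_fine) * len(gamma_fine) == 1131  # 29 * 39 grid points
```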
B. Coarse grained optimization

The best five results of the coarse grained optimization with respect to the false alarm rate are listed in Table I. The abbreviations in the last column are as follows: P = packets per flow, SP = source port, DP = destination port, TF = TCP flags and PR = IP protocol.

The results reveal that none of the best five combinations causes any false alarms. The best one scores only marginally better than the remaining four. For all five points, the model parameter ν is 0.201. The parameter γ, on the other hand, exhibits some variance in the range between 0.1 and 0.3.

Since all of the five points yield an identical false alarm rate, the point with the lowest miss rate is chosen for further optimization. In fact, the first two points of Table I have identical error rates. The difference is that one of these points contains the IP protocol feature whereas the other does not. We chose the one including this feature since we believe that the additional feature enhances the generalization capability of the OC-SVM.

C. Fine grained optimization

The coarse grained optimization resulted in the point characterized by γ = 0.1, ν = 0.201 and the feature subset holding source port, destination port, TCP flags and IP protocol. As already mentioned, this feature subset is not changed anymore. Hence, only ν and γ remain for further optimization.

The fine grained testing range for ν and γ is chosen to lie between the nearest coarse grained values. Since the previous optimization came up with ν = 0.201 and γ = 0.1, the set to be further explored starts with 0.01 for both ν and γ. The set ends with 0.39 for ν and 0.29 for γ. For both parameters, a step size of 0.01 is chosen. In the end, for γ the following set is tested: {0.01, 0.02, ..., 0.28, 0.29}, whereas for ν the set contains {0.01, 0.02, ..., 0.38, 0.39}. With 39 values in the set for ν and 29 values in the set for γ, the fine grained optimization has to test 29 · 39 = 1,131 points. Again, an 8-fold cross-validation is executed for each point.¹

The best five results of the fine grained optimization are listed in Table II. All results yield a miss rate of around 4.7% with a false alarm rate of still 0%. The parameter ν is 0.02 for all results whereas γ varies between 0.25 and 0.29. The fine grained optimization was thus able to significantly lower the error rates of the coarse grained optimization: the miss rate was originally around 22.5% and could be lowered to 4.7%.

Figure 5 illustrates how the miss rate (on the Z-axis) varies during the fine grained optimization. One can clearly see the correlation between ν and the miss rate. The parameter γ, on the other hand, hardly influences the miss rate.

Fig. 5. Relationship between the varying parameters ν and γ and the miss rate during the fine grained grid search.

¹ The fold size was chosen to be a multiple of the four CPU cores.

V. EVALUATION AND DISCUSSION

After model and feature selection determined the best set of features and model parameters, it is crucial to test the final model by making use of the dedicated testing sets created in Section III-F.

A. Testing set prediction

For the purpose of testing, the OC-SVM is trained with the malicious training set (see Figure 4), an RBF kernel and the parameters determined in Section IV.
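The following sketch illustrates this testing step under the same assumptions as the previous listings (scikit-learn, hypothetical input arrays). The parameters correspond to the best point of Table II, and the inputs are assumed to be already reduced to the selected feature subset (source port, destination port, TCP flags, IP protocol).

```python
# Sketch of the final test (Section V-A); not the authors' code.
import numpy as np
from sklearn.svm import OneClassSVM

def predict_testing_sets(X_mal_train, X_mal_test, X_ben_test):
    # Best point of the fine grained optimization: gamma = 0.26, nu = 0.02.
    clf = OneClassSVM(kernel="rbf", nu=0.02, gamma=0.26).fit(X_mal_train)
    for name, X in (("malicious", X_mal_test), ("benign", X_ben_test)):
        pred = clf.predict(X)            # +1 = in-class, -1 = outlier
        in_class = int(np.sum(pred == 1))
        print(f"{name}: {in_class} of {len(X)} flows predicted in-class "
              f"({100.0 * in_class / len(X):.2f}%)")
```

In-class predictions count as detected attacks on the malicious set and as false alarms on the benign set, which is how the figures of Table III are obtained.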
TABLE III
PREDICTION OF THE TESTING SETS.

         | Benign set           | Malicious set
Type     | Predicted     Actual | Predicted          Actual
In-class | 0 (0%)        0      | 7,540 (98.07492%)  7,688
Outlier  | 942 (100%)    942    | 148 (1.92507%)     0

VI. CONCLUSION

Future work includes evaluating the system with multiple representative and realistic data sets collected "in the wild". The source code and the prepared data sets are available upon email request to the authors.

ACKNOWLEDGMENT

We want to thank the anonymous reviewers who provided us with many helpful comments.