Intrusion Detection Using Self-Training Support Vector Machines
Prateek
[Roll: 109CS0130]

Thesis submitted in partial fulfillment of the requirements
for the degree of
Bachelor of Technology
in
Computer Science and Engineering

Supervisor: Dr. S. K. Jena

May, 2013
Certificate
This is to certify that the work in the thesis entitled Intrusion Detection Using
Self-Training Support Vector Machines by Prateek is a record of an original
research work carried out under my supervision and guidance in partial fulfillment of
the requirements for the award of the degree of Bachelor of Technology in Computer
Science and Engineering.
To the best of my knowledge, the matter embodied in the thesis has not been
submitted for any degree or academic award elsewhere.
Dr. S. K. Jena
Professor
Department of Computer Science and Engineering, NIT Rourkela
Acknowledgment
I also take this opportunity to express a deep sense of gratitude to Sweta, my sis-
ter, for her support and motivation which helped me in completing this task through
its various stages.
Lastly, I thank Almighty, my parents and friends for their constant encouragement
without which this assignment would not have been possible.
Prateek
Abstract
Data mining techniques provide efficient methods for the development of IDS. The
idea behind using data mining techniques is that they can automate the process of
creating traffic models from reference data and thereby eliminate the need for
laborious manual intervention. Such systems are capable of detecting not only known
attacks but also their variations.
Existing IDS technologies are, on the basis of detection methodology, broadly
classified as misuse (or signature based) detection systems and anomaly detection
based systems. The idea behind misuse detection consists of comparing network
traffic against a model describing known intrusions. The anomaly detection method
is based on the analysis of profiles that represent normal traffic behavior.
Semi-supervised systems for anomaly detection would reduce the demands of the
training process by reducing the amount of labeled training data required. A
Self-Training Support Vector Machine based detection algorithm is presented in this
thesis. In the past, self-training of SVMs has been successfully used for reducing
the size of the labeled training set in other domains. A similar method was
implemented, and the results of the simulation performed on the KDD Cup 99 dataset
for intrusion detection show a reduction of up to 90% in the size of the labeled
training set required as compared to supervised learning techniques.
Contents
Certificate ii
Acknowledgement iii
Abstract iv
1 Introduction 1
1.1 Intrusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Intrusion Detection System . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Architecture of an IDS . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Classification of IDS . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4.1 Detection Methodology based classification . . . . . . . . . . . 4
1.4.2 Data Source based classification . . . . . . . . . . . . . . . . . 5
1.5 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Objective and Scope of Work . . . . . . . . . . . . . . . . . . . . . . 8
1.8 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Proposed Work 10
2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Maximal Margin Hyperplanes . . . . . . . . . . . . . . . . . . 11
2.2.2 Linear SVM for Separable Case . . . . . . . . . . . . . . . . . 12
2.2.3 Linear SVM for Non Separable Case . . . . . . . . . . . . . . 14
2.2.4 Non-Linear SVM and Kernel Functions . . . . . . . . . . . . . 14
2.3 Self-Training: A Semi-Supervised Learning Technique . . . . . . . . . 15
2.4 Intrusion Detection Using Self-Training SVM . . . . . . . . . . . . . . 15
2.4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Bibliography 25
List of Figures
3.1 Procedure for Simulation of Intrusion Detection on the KDD ’99 Data
Set Using Self Training SVM . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Self Training SVM with a Labeled Training Set of Size 500 and Unla-
beled Training Set ( Self-Training Set) of Size 5K . . . . . . . . . . . 22
3.3 Self Training SVM with a Labeled Training Set of Size 5K and Unla-
beled Training Set ( Self-Training Set) of Size 25 K . . . . . . . . . . 22
3.4 Comparison of Standard SVM and Self-Training SVM . . . . . . . . . 23
Chapter 1
Introduction
1.1 Intrusion
Intrusion is generally defined as a successful attack on a network or system. In a
technical report on the practice of intrusion detection [1], Allen et al. have
defined an attack as “An action conducted by one adversary, the intruder, against
another adversary, the victim. The intruder carries out an attack with a specific objective in
mind. From the perspective of an administrator responsible for maintaining a system,
an attack is a set of one or more events that may have one or more security conse-
quences. From the perspective of an intruder, an attack is a mechanism to fulfill an
objective.”
By its very definition, an intrusion is a subjective phenomenon and its presence or
absence can be perceived differently by different observers. An attacker would deem
an attack to be successful if he is able to achieve the objectives with which the attack
was initiated. From the viewpoint of the victim, an attack is considered successful if
it has consequences for him. It is important to note that an attack, though successful
from the victim’s perspective may still be unsuccessful from the intruder’s perspective.
For the purpose of detection, usually the victim’s perspective is considered.
Some common examples of intrusions at the network level include Denial of
Service (DoS) attacks, packet sniffing and remote login. Trojans and spyware
are some of the mechanisms by which system level intrusions are achieved.
1.2 Intrusion Detection System

1.3 Architecture of an IDS

Data Acquisition Module This module is used in the data collection phase. In the
case of a Network Intrusion Detection System (NIDS), the source of the data
can be the raw frames from the network or information from upper protocol
layers such as the IP or UDP. In the case of host based detection system, source
of data are the audit logs maintained by the operating system.
Feature Generator This module is responsible for extracting a set of selected fea-
tures from the data acquired by the acquisition module. Features can be clas-
sified as low-level and high-level features. A low-level feature can be directly
extracted from captured data whereas some deductions are required to be per-
formed to extract the high-level features. Considering the example of a network
based IDS, the source IP and destination IP of network packets would be the
low level features whereas information such as number of failed login attempts
would be classified as high level features. Sometimes features are categorized
based on the source of data as well.
Incident Detector This is the core of an IDS. This is the module that processes
the data generated by the Feature Generator and identifies intrusions. Intrusion
detection methodologies are generally classified as misuse detection and anomaly
detection. Misuse detection systems have definitions of attacks and they match
the input data against those definitions. Upon a successful match, the activity
is flagged as an intrusion.
Traffic Model Generator This module contains the reference data with which the
Incident Detector compares the data acquired by the acquisition modules and
processed by the feature generator. The source of data of the Traffic Model Gen-
erator could be non-automated(coming from human knowledge) or automated
(coming from automated knowledge gathering process).
Response Management Upon receiving an alert from the incident detector, this
module initiates actions in response to a possible intrusion.
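The data flow between the modules described above can be sketched as a simple pipeline. The function names below are illustrative stand-ins, not part of any IDS implementation discussed in this thesis:

```python
# Illustrative sketch of the IDS data flow described above:
# Data Acquisition -> Feature Generator -> Incident Detector -> Response.
def run_ids(raw_records, extract_features, is_intrusion, respond):
    """Push each acquired record through the detection pipeline."""
    alerts = 0
    for record in raw_records:               # output of the Data Acquisition Module
        features = extract_features(record)  # Feature Generator
        if is_intrusion(features):           # Incident Detector (consults the traffic model)
            respond(record)                  # Response Management
            alerts += 1
    return alerts
```

In this picture, `is_intrusion` is where the reference data produced by the Traffic Model Generator would be consulted.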
1.4 Classification of IDS

1.4.1 Detection Methodology based classification
Stateful Protocol Analysis Based System This methodology is based on the
assumption that the IDS can know and trace the protocol states. Though the SPA
process seems similar to the anomaly detection methodology, they are basically
different. SPA depends on vendor-developed generic profiles for specific
protocols, whereas anomaly detection uses preloaded network or host specific
profiles. Generally, the network protocol models in SPA are based on protocol
standards from international standards organizations, e.g., the IETF. SPA is also
known as Specification-based Detection.
Hybrid Most existing IDSs use multiple methodologies to improve the accuracy of
detection. For example, Signature Detection and Anomaly Detection are used
as complementary methods as they provide a mixture of improved accuracy and
ability to detect unknown attacks.
1.4.2 Data Source based classification

Network Based Intrusion Detection This class of IDS acquires its data from the
raw frames from the network or information from upper protocol layers such as
the IP or UDP. Analysis is then performed on the network logs and consequently
the detection occurs at the network level.
Host Based Intrusion Detection In the case of a host based detection system, the
sources of data are the audit logs maintained by the operating system. System call
logs and file system logs are the commonly used sources of data. This class of IDS
detects intrusions occurring on a particular host device.
1.5 Literature Review
Apart from the issues related to the requirement of a high level of human
interaction, other problems with Intrusion Detection Systems have been discussed by
Catania et al. [2]. Lack of model adjustment information, proper traffic feature
identification, lack of resource consumption information and lack of public network
traffic data-sets have been mentioned as some of the important issues. Patcha et
al. [6] have given a review of open problems in anomaly detection based IDS. High
computation complexity, noise in audit data, high false positive rates, the lack of
a recent standard data-set, the inability of an IDS to defend itself from attacks,
the precise definition of normal behaviour and the inability of an IDS to analyze
encrypted packets have been cited as the prominent problems with these systems.
1.6 Motivation
As discussed earlier, with the recent advances in the field of software exploits and
the lowering of the skills required for launching a successful attack, the problem
of detecting intrusions effectively and accurately is becoming more and more
challenging. This is severely compounded by the fact that misuse detection based
systems cannot suffice to meet present needs, because the number of zero-day
exploits is on the rise, and most anomaly detection systems suffer from high false
alarm rates.
Further to this, both misuse and anomaly based systems require a significant amount
of labeled data for the development of the traffic models used by the incident detector.
Labeling of data is extremely difficult, time consuming and costly. The extensive
manual intervention required in the process makes it slow, and consequently the
existing systems have not been able to scale according to the increasing demands of
the networks. Hence the need for an anomaly based detection system which would
significantly reduce the requirements of labeled data has been felt.
Data Mining is the process of automatically discovering useful information in large
data repositories [7]. It includes methods like Classification, Clustering, Anomaly De-
tection and Association Analysis and it can help in automating the process of finding
novel and useful patterns that might otherwise remain unknown. These techniques
also provide capabilities to predict the outcome of future observations. Considering
these traits of the data mining techniques, it was felt that application of data mining
to the problem of intrusion detection would be a suitable course of research to tackle
the current issues with the problem domain.
1.7 Objective and Scope of Work

1. To study the performance of various existing data mining based intrusion
detection systems and compare their accuracy and efficiency.
2. To develop intrusion detection systems which may overcome some of the drawbacks
of the existing systems.
For the purpose of this research, network based detection systems have been con-
sidered. However, the same could be applied to the problem of host based detection
systems with minor modifications. The current effort was concentrated on the analy-
sis and development of only the Traffic Model Generator and Incident Detector. The
other components of IDS, such as the Traffic Data Acquisition Module or the Re-
sponse Management Module were not considered. This was done to concentrate on
the core features of the intrusion detection process.
Chapter 2
Proposed Work
2.1 Problem Formulation

The anomaly detection approach for intrusion detection is generally based on the
following assumptions:
• Records contained in the training set belong mostly to normal traffic data, with
the number of records pertaining to intrusions being comparatively small.
2.2 Support Vector Machines

The SVM method gives a very high accuracy rate in a large number of problem
domains and is highly suited to high-dimensional data.
For the purpose of illustration, let us consider a data set that is linearly
separable. Given a set of labeled training data, we can find a hyperplane that
completely separates the points belonging to the two classes. This is called the
decision boundary. An infinite number of such decision boundaries are possible
(fig 2.1). The margin of a decision boundary refers to the shortest distance
between the closest points on either side of it (fig 2.2). It is evident by
intuition, and has been mathematically proven [8], that the decision hyperplane
with the maximal margin provides a better generalization error. Support vectors
are the training samples lying on the margins of the decision boundary, and the
process of training the SVM involves finding these support vectors.

The decision boundary of a linear SVM can be written as
w · x + b = 0 (2.1)
Here w and b are the parameters of the SVM, and the training process is concerned
with determining them from the training examples. For any two points xa and xb
lying on the decision boundary, we have

w · xa + b = 0, (2.2)
w · xb + b = 0 (2.3)

Subtracting the two equations gives

w · (xa − xb) = 0, (2.4)

which shows that w is normal to the decision boundary.
Accordingly, for a test point z, we have

y = 1 if w · z + b > 0,
y = −1 if w · z + b < 0.
Considering two parallel hyperplanes bi1 and bi2 such that they pass through the
points closest to the decision boundary on each side of it, the margin is given by

d = 2 / ‖w‖ (2.9)
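The classification rule and the margin formula can be checked numerically. The hyperplane parameters below are made up purely for illustration:

```python
# Linear decision rule y = sign(w . z + b) and margin d = 2 / ||w||,
# for an illustrative hyperplane w = (3, 4), b = -5 (values are made up).
import math

w = (3.0, 4.0)
b = -5.0

def classify(z):
    """Return +1 or -1 according to the side of the hyperplane z falls on."""
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score > 0 else -1

margin = 2.0 / math.hypot(*w)  # ||w|| = 5, so the margin d = 0.4
```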
The problem of training an SVM is that of optimizing the above equation, which
translates to the determination of the model parameters w and b from the training
examples. This is a convex optimization problem, and it is solved in its dual
formulation using the Lagrange multiplier method.
To adapt the formulation of the decision boundary presented for the separable case,
we need to adopt the soft margin [7] approach. A slack variable ξ is introduced as
the penalty for deviating from the hard decision boundary. It is an estimate of the
error for a particular training example. The modified formulation is given as:

w · xi + b ≥ 1 − ξi if yi = 1,
w · xi + b ≤ −1 + ξi if yi = −1,
where ∀i : ξi ≥ 0
Considering the change in the formulation, the modified objective function is given
as:

f(w) = ‖w‖² / 2 + C Σ_{i=1}^{N} ξi^k (2.11)
where C and k are user defined parameters. A large value of C penalizes the
residual (slack) errors heavily and pushes the model towards a firm decision
boundary, whereas a small value of C tolerates larger training errors in exchange
for a wider margin. For most cases, the value of the parameter k is assumed to be 1.
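The trade-off controlled by C can be seen by evaluating the objective function 2.11 directly; the weight vector and slack values below are made up for illustration, with k = 1 as assumed above:

```python
# Evaluating the soft-margin objective f(w) = ||w||^2 / 2 + C * sum(xi_i^k)
# for illustrative w and slack values (k = 1, as assumed in the text).
def objective(w, slacks, C, k=1):
    norm_sq = sum(wi * wi for wi in w)
    return norm_sq / 2.0 + C * sum(x ** k for x in slacks)
```

With w = (3, 4) and slacks (0.5, 0.2), increasing C makes the slack term dominate the objective, which is exactly the pressure towards a firmer boundary.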
Cases where the decision boundary is non-linear require the data in the original
space x to be transformed to a new feature space φ(x). This transformation is
brought about by the transformation function φ, which is chosen so that the
decision boundary in the transformed space is a linear one.

In most cases the determination of the actual transformation function is difficult,
and it is not required. A manipulation called the Kernel Trick [7] is applied to
compute the similarities in the transformed space using the attributes in the
original feature space.
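The kernel trick can be illustrated with the quadratic kernel K(u, v) = (u · v)², a textbook example (not the kernel used later in this thesis): its value equals a dot product in an explicitly transformed space φ(x) = (x1², √2·x1·x2, x2²), yet φ never has to be computed:

```python
# Kernel trick demo: K(u, v) = (u . v)^2 equals phi(u) . phi(v) for the
# explicit quadratic map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so similarity
# in the transformed space is computed from the original attributes alone.
import math

def kernel(u, v):
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def phi(x):
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))
```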
2.3 Self-Training: A Semi-Supervised Learning Technique

2.4 Intrusion Detection Using Self-Training SVM

2.4.1 Algorithm
The formulation for a standard SVM for a binary classification problem is given as

min (1/2)‖w‖² + C Σ_{i=1}^{N} ξi (2.12)

subject to yi (w · xi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., N,

where xi ∈ R^n is the feature vector of the i-th training example, yi ∈ {−1, 1} is
the class label of xi, and C > 0 is a regularization constant. The pseudo code for
the Self-Training wrapper algorithm is given below:
Algorithm 1 Self-Training-SVM
Input: FI (the labeled set), FT (the unlabeled set) and σ0

The last trained SVM is considered as the final classification model. The proof of
convergence of the algorithm is given in Li et al. [12].
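Since the body of Algorithm 1 is not reproduced above, the following is a sketch of the general shape of such a self-training wrapper: train on the labeled set, pseudo-label the unlabeled set, retrain on the union, and repeat until the pseudo-labels stop changing. A nearest-centroid classifier stands in for the SVM so the example is self-contained; in the actual implementation the LIBSVM routines play that role, and the stopping rule involving σ0 from Li et al. is not modeled here.

```python
# Sketch of the self-training wrapper loop (the general shape of Algorithm 1).
# A nearest-centroid classifier stands in for the SVM base learner.
import numpy as np

def fit_centroids(X, y):
    """Stand-in 'training': one centroid per class label."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    """Assign each row of X to the label of its nearest centroid."""
    labels = np.array(sorted(centroids))
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in labels])
    return labels[np.argmin(dists, axis=0)]

def self_train(X_lab, y_lab, X_unl, max_iter=20):
    """Train, pseudo-label the unlabeled set, retrain on the union, and
    repeat until the pseudo-labels stabilize (cf. convergence in Li et al.)."""
    model = fit_centroids(X_lab, y_lab)
    y_pseudo = predict(model, X_unl)
    for _ in range(max_iter):
        model = fit_centroids(np.vstack([X_lab, X_unl]),
                              np.concatenate([y_lab, y_pseudo]))
        y_new = predict(model, X_unl)
        if np.array_equal(y_new, y_pseudo):  # labels stabilized: converged
            break
        y_pseudo = y_new
    return model, y_pseudo
```

The final model returned corresponds to "the last trained SVM" of the text.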
Chapter 3

Simulation and Results

3.1 Data-Set
The KDD Cup 1999 Dataset [13] was used for the purpose of this simulation. In 1998,
MIT Lincoln Labs prepared a data set under the DARPA Intrusion Detection Evaluation
Program [14]. The Third International Knowledge Discovery and Data Mining Tools
Contest, which was held along with the Fifth International Conference on Knowledge
Discovery and Data Mining, used a version of the DARPA Intrusion Detection Data
Set. The data set, generated from the raw TCP dump data, had 41 features.
3.1.1 Features
• High Level Traffic Features Some of the features were high level traffic
features computed using a two-second time window. Examples include the number of
connections to the same host as the current connection in the past two seconds,
and the percentage of connections to the same service.
3.1.2 Attacks
The training set contained 24 known attacks, whereas the testing set contained an
additional set of 13 novel attacks. Additionally, the probability distribution of
the test data was different from that of the training data. This was done to make
the simulation more realistic. The attacks simulated fall under the following four
categories:
• Probing
• Denial of Service (DoS)
• User to Root (U2R)
• Remote to Local (R2L)

3.2 A LIBSVM Based Implementation

During this phase, two data sets are extracted from the KDD Cup ’99 Training Set,
which consists of over 4 lakh records. The first set, FI, is a set of labeled
records and is used to train the initial SVM. The second set, FT, is the set of
unlabeled records and is used to retrain the SVM model during the iterations of
Algorithm 1. All 41 features of KDD Cup ’99 were used in the simulation.
For the purpose of this simulation, the size of FI was taken to be much smaller
than that of FT so that the efficiency of the proposed scheme in reducing the require-
ment of labeled data may be properly tested.
The KDD Cup ’99 Test set consisting of over 3 lakh records was used as the
independent test set.
Additionally, the original data sets were scaled and converted to the libsvm format
by using the data mining software Weka [15].
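The scaling and conversion step can be illustrated as follows. LIBSVM reads one record per line in a sparse text format, `label index:value ...`; the min–max scaling shown is a common choice, though the exact scaling applied through Weka is not specified above:

```python
# Sketch: scaling a feature column to [0, 1] and writing a record in the
# LIBSVM sparse text format ("label index:value ..."), as done via Weka.
def scale_column(values):
    """Min-max scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def to_libsvm_line(label, features):
    """Format one record; zero-valued features are omitted (sparse format)."""
    parts = [str(label)]
    parts += [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join(parts)
```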
3.2.2 Self-Training
The wrapper code based on Algorithm 1 called the respective LIBSVM routines for
SVM model training and class prediction. LIBSVM [16], developed by Chang et al.,
is a library for Support Vector Machines and can be easily integrated with C or
Java code. Its binaries can be called from virtually any language capable of
executing a system call.
An RBF kernel, exp(−γ‖u − v‖²), was used for training a cost based SVM, and the
training parameters (C and γ) can be determined either by a grid search or by the
model selection algorithm given in Li et al. [12].

A detailed illustration of the simulation process is given in figure 3.1.
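The RBF kernel above can be written down directly; γ is the kernel width parameter that, together with C, is tuned by the grid search (the values in the sketch are arbitrary):

```python
# The RBF kernel used for training: K(u, v) = exp(-gamma * ||u - v||^2).
# Larger gamma makes the similarity fall off faster with distance.
import math

def rbf_kernel(u, v, gamma):
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)
```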
3.3 Results
The simulation was run with various sizes of the labeled and unlabeled set, where
the maximum ratio between the labeled and unlabeled set was maintained to be 1:10.
This ratio was decided on an empirical observation of results obtained by Li et. al.
[12].
It was observed that the minimum size of labeled training set required for
effective Self-Training was around 500 records. For labeled sets having very few
examples, e.g. 50–60 records, the overall accuracy of detection either did not
change or in some cases was reduced from its original value. This may be explained
by the fact that with limited labeled points, the initial decision boundary may not
be accurate, and upon use of the model on the unlabeled set, the points belonging
to the set may be classified incorrectly. This may further lead to a reduction in
the overall detection accuracy.
Figure 3.1: Procedure for Simulation of Intrusion Detection on the KDD ’99 Data Set
Using Self Training SVM
Results obtained for a labeled set of 500 records with an unlabeled set of 5000
records are presented in figure 3.2. Results for another simulation, with a labeled
set of 5000 records and an unlabeled set of 25000 records, are given in figure 3.3.

It can be inferred from the results that the Self-Training process as given in
Algorithm 1 converges, and for the given examples it converges quickly (after
around 6 iterations in both cases).
The degree of improvement in the detection accuracy with the iterations of the
Self-Training algorithm depends on the sizes of the labeled and unlabeled training
sets. This can be inferred from the fact that after 6 iterations, the change in the
detection accuracy for the simulation with the 5000-record labeled set is almost
double that of the simulation with the 500-record labeled set. This observation is
also reaffirmed by the fact that for very small labeled training sets, there was
virtually no positive improvement in the detection accuracy.
The results also show that the overall accuracy is most sensitive to the size of
the labeled set. In the case of the simulation with 500 labeled records, the final
detection accuracy was around 75.5%, whereas for the simulation with 5000 labeled
records, it was found to be around 86%.
Finally, the results validate the hypothesis that Self-Training can be used for
reduction of the labeled training set size in the domain of Intrusion Detection as
well. A reduction of up to 90% has been achieved in the number of labeled training
examples required. A comparison of the performance of the standard SVM and the
Self-Training SVM is given in figure 3.4.
Figure 3.2: Self Training SVM with a Labeled Training Set of Size 500 and Unlabeled
Training Set ( Self-Training Set) of Size 5K
Figure 3.3: Self Training SVM with a Labeled Training Set of Size 5K and Unlabeled
Training Set ( Self-Training Set) of Size 25 K
Chapter 4

Conclusion
A new method for Intrusion Detection under the Semi-Supervised Learning paradigm
has been presented and evaluated in this thesis. The correctness of the algorithm
and its effectiveness for the Intrusion Detection problem domain have been verified
by simulation on the standard KDD Cup 1999 dataset. Further, the given algorithm
achieves good results in reducing the requirement of labeled training data. In the
simulations run for the purpose of this thesis, a reduction of up to 90% was
achieved. This value may vary from case to case, depending upon the composition of
the labeled training set.
The work presented in this thesis may be extended to the case of host based
intrusion detection. The performance of this method may also be compared with that
of other supervised learning approaches. Additionally, the application of the
Self-Training scheme to other classification techniques used in intrusion
detection, such as the Bayesian Belief Network, can be worked upon.
Bibliography
[1] Julia Allen, Alan Christie, William Fithen, John McHugh, Jed Pickel, and Ed Stoner. State of
the practice of intrusion detection technologies. Technical report, Carnegie Mellon University,
2001.
[2] Carlos A. Catania and Carlos Garcia Garino. Automatic network intrusion detection - current
techniques and open issues. Computers and Electrical Engineering, 2012.
[3] Wun-Hwa Chen, Sheng-Hsun Hsu, and Hwang-Pin Shen. Application of SVM and ANN for
intrusion detection. Computers and Operations Research, 2005.
[4] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Sal Stolfo. A geometric
framework for unsupervised anomaly detection: detecting intrusions in unlabeled data. Advances
in Information Security, 2002.
[5] Hung-Jen Liao, Kuang-Yuan Tung, Chun-Hung Richard Lin, and Ying-Chih Lin. Intrusion
detection system - a comprehensive review. Journal of Network and Computer Applications,
2013.
[6] Animesh Patcha and Jung-Min Park. An overview of anomaly detection techniques- existing
solutions and latest technological trends. Computer Networks, 2007.
[7] Pang-Ning Tan, Vipin Kumar, and Michael Steinbach. Introduction to Data Mining. Pearson,
2006.
[8] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for
optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational
Learning Theory, pages 144–152. ACM, 1992.
[9] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien, editors. Semi-Supervised Learning,
chapter Introduction to Semi-Supervised Learning. MIT Press, 2006.
[10] Jun Cai Huang, Feng Bi Wang, Huan Zhang Mao, and Ming Tian Zhou. A self-training semi-
supervised support vector machine method for recognizing transcription start sites. Interna-
tional Conference on Apperceiving Computing and Intelligence Analysis (ICACIA), 2010.
[11] Ujjwal Maulik and Debasis Chakraborty. A self-trained ensemble with semi-supervised SVM: an
application to pixel classification of remote sensing imagery. Pattern Recognition Letters, 2011.
[12] Yuanqing Li, Cuntai Guan, Huiqi Li, and Zhengyang Chin. A self-training semi-supervised SVM
algorithm and its application in an EEG-based brain computer interface speller system. Pattern
Recognition Letters, 2008.
[13] KDD Cup 99 data set, 1999. Data set available at http://kdd.ics.uci.edu/databases/
kddcup99/kddcup99.html.
[14] DARPA intrusion detection evaluation, 1998. Data set available at http://www.ll.mit.edu/
mission/communications/cyber/CSTcorpora/ideval/data/1998data.html.
[15] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H.
Witten. The weka data mining software: An update. SIGKDD Explorations, 11, 2009. Software
available at http://www.cs.waikato.ac.nz/ml/weka.
[16] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.