Performances of Machine Learning Algorithms For Bi
1Mukrimah Nawir, 1Amiza Amir, 1Ong Bi Lynn, 1Naimah Yaakob, 2R. Badlishah Ahmad
1 Embedded, Network, and Advanced Computing Research Cluster (ENAC), School of Computer and
Communication Engineering, Universiti Malaysia Perlis (UniMAP), Pauh, Perlis
2 Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin (UniSZA), 22200,
Besut, Terengganu
Abstract. The rapid growth of technology exposes networked systems to various attacks,
because data are frequently exchanged over the Internet and large-scale data must be
handled. Moreover, network anomaly detection using machine learning is hampered by the
scarcity of publicly available labelled network datasets, which has led many researchers to
keep using the most common network dataset (KDDCup99), even though it is no longer
relevant for evaluating machine learning (ML) classification algorithms. Several issues
regarding the available labelled network datasets are discussed in this paper. The aim of
this paper is to build a network anomaly detection system using machine learning algorithms
that are efficient, effective and fast. The findings show that the AODE algorithm performs
well in terms of accuracy and processing time for binary classification on the UNSW-NB15
dataset.
1. Introduction
Security-based anomalies are abnormal or malicious behaviours in which a network or
system deviates from its usual functionality. This wide-ranging issue in network security
calls for defence or protection tools, either software- or hardware-based, that keep the
system secure from compromise by attackers. For instance, while users communicate or
transfer data, they might be compromised by hackers who intend to steal the information.
Machine Learning (ML) is one of the efficient, modern techniques that make it possible to
monitor patterns in the complex system environments present today [1].
However, network anomaly detection systems using machine learning face difficulty when it
comes to datasets. Several issues with the network anomaly detection datasets currently
available for evaluation are discussed in [2]. Most datasets do not fulfil the requirements
for network security. Very few labelled datasets are publicly available [3], and conflicts
arise because some of these datasets are specific to a certain environment (limited
flexibility), have low availability, and lack ground truth, as stated in [4].
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
1st International Conference on Big Data and Cloud Computing (ICoBiC) 2017 IOP Publishing
IOP Conf. Series: Journal of Physics: Conf. Series 1018 (2018)
012015 doi:10.1088/1742-6596/1018/1/012015
Hence, in this paper we construct a network anomaly detection system that is capable of
managing large-scale data that are frequently exchanged over the Internet by applying
machine learning (ML) algorithms to it. The challenges and issues of network anomaly
detection datasets motivate this paper to investigate the UNSW-NB15 dataset. This dataset
is a hybrid of real normal traffic and synthetic network attack data, as presented in [5].
In addition, the UNSW-NB15 dataset overcomes problems that researchers face with other
datasets: it represents complex patterns and has a balanced data distribution between the
training and testing sets. A further advantage of the UNSW-NB15 dataset is its high
efficiency, because the features of its instances are collected from the payload through to
the header information [6].
The paper is organized as follows. Section 2 presents existing work and the current
issues of labelled network datasets. Section 3 describes the experimental set-up in detail.
Section 4 discusses the findings of the experiments. Section 5 concludes the paper.
2. Related Works
Numerous network anomaly detection datasets have been released by various communities,
such as NSL-KDD (an improvement of the KDDCup99 dataset), MAWILab [7], the MoMe Cluster,
the Cooperative Association for Internet Data Analysis (CAIDA) and RIPE, and they have
been comprehensively criticised by many researchers. For instance, the CAIDA datasets are
not effective for network anomaly detection because their data resources might be removed
during the simulation and they are not adaptable to every environment.
Several works have addressed classification to determine the behaviour of data in a
system. The authors of [8] compared the UNSW-NB15 and KDDCup99 datasets, measuring the
accuracy and False Alarm Rate (FAR) of five different ML algorithms, and found that DT is
an efficient algorithm for classification. Our paper differs in that it investigates the
accuracy and other measures for binary classes (Normal (1) and Attack (0)).
In addition, the most famous datasets used by many researchers, the DARPA datasets
(KDDCup1999), are perceived to be skewed and to bias classification because the training
set contains repeated data instances [5][9][10]. Hence, these datasets are impractical for
network anomaly detection systems. Moreover, because the data in a system are dynamic, new
attacks might appear in the network. These old-fashioned network anomaly detection datasets
are also not comprehensive enough to represent modern normal behaviours and contemporary
synthesized attacks, as the UNSW-NB15 dataset does [11].
The latest related work is presented in [12], which states that the datasets contain
irrelevant and redundant attributes. The authors believe that feature selection results in
faster processing and higher classifier accuracy. They applied an ML technique (Random
Forest) to the datasets. This motivates the present paper to observe the time taken to
train and test the model while varying the size of the data.
3. Experiment Setting
In our experiment, the operating system is Ubuntu (software version 13.10-0 ubuntu 4.1) and
the WEKA tool [13] is run on an Intel Xeon(R) CPU E3-1270 v5 @ 3.60GHz x 8 with 16GB RAM.
ML algorithms are applied to the UNSW-NB15 dataset and evaluated on their classification
rate and processing time, in order to build a highly efficient and fast network anomaly
detection system.
According to Fig. 1, binary classification for a network anomaly detection system on the
UNSW-NB15 dataset involves four stages (preparation of the dataset, training and testing,
building the classifier model, and performance measurement). The experiment begins by loading
the network dataset needed for classification. Once the dataset is ready, the data undergo
the training/testing stage. Before the testing stage, a classifier model must be built to
serve as the decision engine. Finally, the results of the experiments are analysed.
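The four stages described above can be sketched in a few lines of Python. This is only an illustration and not the WEKA set-up used in the experiments: the toy records, the feature names (`proto`, `service`) and the helper names are assumptions, and a small categorical Naive Bayes with Laplace smoothing stands in for the decision engine. The label coding (1 = Normal, 0 = Attack) follows the text.

```python
import math
from collections import Counter, defaultdict

# Stage 1: prepare a (hypothetical) labelled dataset; 1 = Normal, 0 = Attack.
DATA = [
    ({"proto": "tcp", "service": "http"}, 1),
    ({"proto": "tcp", "service": "http"}, 1),
    ({"proto": "udp", "service": "dns"}, 1),
    ({"proto": "tcp", "service": "ftp"}, 0),
    ({"proto": "udp", "service": "-"}, 0),
    ({"proto": "tcp", "service": "-"}, 0),
    ({"proto": "tcp", "service": "http"}, 1),
    ({"proto": "udp", "service": "-"}, 0),
]

def split(data, train_fraction=0.75):
    """Stage 2: split the records into a training and a testing set."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

def build_model(train):
    """Stage 3: build the classifier (categorical Naive Bayes counts)."""
    class_counts = Counter(label for _, label in train)
    value_counts = defaultdict(Counter)   # (feature, label) -> value counts
    values = defaultdict(set)             # feature -> distinct values seen
    for features, label in train:
        for name, value in features.items():
            value_counts[(name, label)][value] += 1
            values[name].add(value)
    return class_counts, value_counts, values

def predict(model, features):
    """Return the label with the highest Laplace-smoothed log posterior."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            seen = value_counts[(name, label)][value]
            score += math.log((seen + 1) / (count + len(values[name])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, test):
    """Stage 4: measure the fraction of correctly classified test records."""
    hits = sum(predict(model, features) == label for features, label in test)
    return hits / len(test)

train, test = split(DATA)
model = build_model(train)
print(accuracy(model, test))
```

Keeping the four stages as separate functions mirrors the flow of the text: the classifier is built from the training split only, and the testing split is touched only when the performance measure is computed.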
Although the Naive Bayes algorithm requires only a small amount of time (0.79s) to classify
the data instances of the given dataset (UNSW-NB15), it is not comparable to the AODE and
BN algorithms in terms of classification rate. Accuracy alone is not enough to measure the
performance of these three ML algorithms; further performance measures (i.e. True Positive
Rate (TP Rate), False Positive Rate (FP Rate), Precision (Prec), Recall) for the binary data
(Class 1 = Normal and Class 0 = Attacks) of the UNSW-NB15 dataset also need to be
investigated, and they are tabulated in Table 2.
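As a reminder of how these measures relate to one another, the sketch below derives them for one class from the actual and predicted labels. The label lists are made-up illustrative values and not results from this paper, and `binary_metrics` is a hypothetical helper name; the label coding (1 = Normal, 0 = Attack) follows the text.

```python
def binary_metrics(actual, predicted, positive):
    """Compute per-class measures, treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "tp_rate": tp / (tp + fn) if tp + fn else 0.0,    # same value as recall
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }

# Illustrative labels only: 1 = Normal, 0 = Attack.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

for cls in (1, 0):
    print(cls, binary_metrics(actual, predicted, positive=cls))
```

Reporting the measures per class, as the loop does, shows why accuracy alone can mislead: the accuracy is the same for both classes, while the TP rate, FP rate and precision differ.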
4.2. Effect of Time Taken against the Varying Size of Training Data
The time taken to build the classifier model is measured because we aim for an efficient
and effective network anomaly detection system with fast classification. Figure 3 shows the
effect of varying the number of training data on the time taken. The square points mark the
time taken to build the classifier, whereas the cross points mark the time taken to test the
model.
The findings show that the NB algorithm is the fastest to build a binary classification model
for the network anomaly detection system: as the size of the training data increases, the time
taken to train ranges from 0.77s to 0.94s, and testing the model requires only 1.03s to 2.16s.
As mentioned above, however, the accuracy of this algorithm is the lowest of the three ML
algorithms. Moreover, the NB algorithm becomes faster at testing the model as the number of
training data increases.
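Measurements of this kind can be collected with a small timing harness. The sketch below is an assumption about how such timings might be taken and not the procedure used in WEKA: `time_build_and_test`, the toy frequency-count model and the synthetic records are all hypothetical, and the training sizes are scaled down from those in the experiment.

```python
import random
import time
from collections import Counter

def build(train):
    """Toy stand-in for model building: count labels per feature value."""
    counts = {}
    for value, label in train:
        counts.setdefault(value, Counter())[label] += 1
    return counts

def classify(model, value):
    """Predict the most frequent label seen for this value (default: 0)."""
    seen = model.get(value)
    return seen.most_common(1)[0][0] if seen else 0

def time_build_and_test(train, test):
    """Return (build_seconds, test_seconds, accuracy) for one training size."""
    start = time.perf_counter()
    model = build(train)
    build_seconds = time.perf_counter() - start

    start = time.perf_counter()
    hits = sum(classify(model, value) == label for value, label in test)
    test_seconds = time.perf_counter() - start
    return build_seconds, test_seconds, hits / len(test)

rng = random.Random(0)  # fixed seed so the synthetic data is repeatable

def make_records(n):
    """Synthetic rule: values below 5 are Attack (0), the rest are Normal (1)."""
    return [(v, 0 if v < 5 else 1) for v in (rng.randrange(10) for _ in range(n))]

test_set = make_records(1000)
for size in (1000, 2000, 4000):
    b, t, acc = time_build_and_test(make_records(size), test_set)
    print(f"train={size:5d}  build={b:.4f}s  test={t:.4f}s  accuracy={acc:.2f}")
```

Timing the build and test phases separately, against a fixed test set, corresponds to the two series plotted in Figure 3 (time to build the classifier and time to test the model).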
Even though the AODE algorithm takes longer to train than NB, it is faster at testing the
built model and classifies the data instances more robustly. Moreover, the training time of
AODE is linear with respect to the size of the training data, and the model can be learned
incrementally. For instance, with 50k training data, the time taken for training and testing
differs between the two algorithms in that AODE needs double the time of the NB algorithm.
For the BN algorithm, a large number of training data led to a low time taken to train as
well as to test the model. This classifier is considered slower during the training process.
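The linear, incremental training behaviour of AODE noted above follows from the fact that training only updates frequency tables. The sketch below is a minimal illustration for categorical features and not WEKA's implementation: the class name, the Laplace smoothing and the toy records are assumptions, and the usual minimum-frequency threshold on parent attributes is omitted for brevity.

```python
from collections import Counter

class AODE:
    """Minimal averaged one-dependence estimator for categorical features."""

    def __init__(self, labels=(0, 1)):
        self.labels = labels
        self.n = 0
        self.pair = Counter()     # (label, i, value_i) -> count
        self.triple = Counter()   # (label, i, value_i, j, value_j) -> count
        self.values = {}          # attribute index -> set of seen values

    def update(self, x, label):
        """Add one instance; the cost depends only on the number of attributes."""
        self.n += 1
        for i, xi in enumerate(x):
            self.values.setdefault(i, set()).add(xi)
            self.pair[(label, i, xi)] += 1
            for j, xj in enumerate(x):
                if j != i:
                    self.triple[(label, i, xi, j, xj)] += 1

    def scores(self, x):
        """For each label y: sum over parents i of P(y, x_i) * prod_j P(x_j | y, x_i)."""
        result = {}
        for y in self.labels:
            total = 0.0
            for i, xi in enumerate(x):
                vi = len(self.values.get(i, ())) or 1
                parent = self.pair[(y, i, xi)]
                joint = (parent + 1) / (self.n + len(self.labels) * vi)
                for j, xj in enumerate(x):
                    if j != i:
                        vj = len(self.values.get(j, ())) or 1
                        joint *= (self.triple[(y, i, xi, j, xj)] + 1) / (parent + vj)
                total += joint
            result[y] = total
        return result

    def predict(self, x):
        s = self.scores(x)
        return max(s, key=s.get)

# Hypothetical (proto, service) records; 1 = Normal, 0 = Attack.
stream = [(("tcp", "http"), 1), (("tcp", "http"), 1), (("udp", "dns"), 1),
          (("tcp", "ftp"), 0), (("udp", "-"), 0), (("tcp", "-"), 0)]
model = AODE()
for x, y in stream:
    model.update(x, y)   # instances can keep arriving one at a time
print(model.predict(("tcp", "http")), model.predict(("udp", "-")))
```

Because `update` never revisits earlier instances, the total training cost grows linearly with the number of instances, and new traffic can be folded into the model without retraining from scratch.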
5. Conclusion
In conclusion, the UNSW-NB15 dataset is a public network dataset that is relevant for
network anomaly detection because its patterns are complex and represent modern normal
behaviours as well as contemporary synthesized attacks. The evaluation made in this paper
using ML algorithms (Bayesian group) found that the AODE algorithm is the best-performing,
effective, efficient, and fastest classifier for binary classification in network anomaly
detection on the labelled network dataset (UNSW-NB15).
Acknowledgments
The research reported in this paper is supported by the Research Acculturation Grant Scheme
(RAGS). The authors would also like to express gratitude to the Malaysian Ministry of Higher
Education (MOHE) and Universiti Malaysia Perlis for the facilities provided.
References
[1] B.D. Kumar, K.J. Kumar, Network anomaly detection: A machine learning, CRC Press, 2013.
[2] S. Ali, S. Hadi, T. Mahbod, G.A. Ali, Toward developing a systematic approach to generate
benchmark datasets for intrusion detection, Computers & Security, vol. 31, no. 3,
pp. 357-374, 2012, Elsevier.
[3] G. Folino, F.S. Pisani, P. Sabatino, A Distributed Intrusion Detection Framework Based on
Evolved Specialized Ensembles of Classifiers, Applications of Evolutionary Computation:
19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 - April 1,
2016, Proceedings, Part I, Springer International Publishing, pp. 315-331, 2016.
[4] R. Koch, M. Golling, G.D. Rodosek, 11 - Towards Comparability of Intrusion Detection
Systems: New Data Sets, TNC2014 - TNC2014. (n.d.). Retrieved August 11, 2017, from
https://tnc2014.terena.org/core/poster/13.
[5] N. Moustafa, J. Slay, UNSW-NB15: a comprehensive data set for network intrusion detection
systems (UNSW-NB15 network data set). 2015 Military Communications and Information
Systems Conference (MilCIS), pp. 1-6, 2015. https://doi.org/10.1109/MilCIS.2015.7348942.
[6] N. Moustafa, J. Slay, A hybrid feature selection for network intrusion detection systems:
Central points and association rules, pp. 5-13, 2015.
[7] F. Romain, B. Pierre, A. Patrice, F. Kensuke, MAWILab: Combining Diverse Anomaly
Detectors for Automated Anomaly Labeling and Performance Benchmarking, ACM
CoNEXT '10, p. 12, December 2010.
[8] N. Moustafa, J. Slay, The evaluation of Network Anomaly Detection Systems: Statistical
analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set,
Information Security Journal: A Global Perspective, vol. 25, pp. 18-31, 2016.
[9] F. Gumus, C. Sakar, Z. Erdem, O. Kursun, Online Naive Bayes classification for network
intrusion detection, ASONAM 2014 - Proceedings of the 2014 IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining, pp. 670-674, 2014.
[10] O. Osanaiye, H. Cai, K.K.R. Choo, A. Dehghantanha, Z. Xu, M. Dlodlo, Ensemble-based multi
filter feature selection method for DDoS detection in cloud computing, EURASIP Journal on
Wireless Communications and Networking, pp. 130-139, 2016.
[11] D. G. Mogal, S. R. Ghungrad, B. B. Bhusare, A Review on High Ranked Features based NIDS,
IJARCCE, 6(3), pp. 349-353, 2017. https://doi.org/10.17148/IJARCCE.2017.6380.
[12] J. Tharmin, Z. Shahrzad, Feature selection in UNSW-NB15 and KDDCUP'99 datasets,
Industrial Electronics (ISIE), 2017 IEEE 26th International Symposium on, pp. 1881-1886,
IEEE, 2017.
[13] Weka 3 - Data Mining with Open Source Machine Learning in Java, (n.d.), Retrieved July 11,
2017, from http://www.cs.waikato.ac.nz/ml/weka/.
[14] The UNSW-NB15 data set description,
https://www.unsw.adfa.edu.au/australian-centre-forcybersecurity/cybersecurity/ADFA-NB15-Datasets/,
2 March 2016, Accessed: Sept 2017.
[15] N. Moustafa, J. Slay, The UNSW-NB15 data set description, 2 March 2016.
https://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-NB15-Datasets/.