IOP with vivek
Email: [email protected]
Abstract. During the current pandemic, much work has been automated using Internet of
Things (IoT) devices. The security of IoT devices is a major concern because they can easily be
hacked by third parties, and attackers use these hacked devices to interrupt vital ongoing
operations. Thus, the demand for an efficient attack identification system has increased in
the last few years. The present research aims to identify modern distributed denial-of-service
(DDoS) attacks. To this end, a recently introduced, openly available dataset (CICDDoS 2019)
was used. The attacks in the dataset were identified using two machine learning methods,
i.e. the light gradient boosting method (LGBM) and extreme gradient boosting (XGBoost).
These methods were selected because of their superior prediction ability on high volumes of
data in less time than other methods require. The accuracies achieved by LGBM and XGBoost
were 94.88% and 94.89% in 30 and 229 seconds (s), respectively.
1. Introduction
Distributed denial-of-service (DDoS) attacks have become an unavoidable security issue these days [1].
DDoS attacks obstruct devices involved in communication networks. The devices may be completely
blocked or partially stop working while under attack. The first DDoS attack, which immobilised the
oldest internet service provider, Panix, for several days, was discovered in 1996 [2]. Attacks became
common after a few years, and according to the Cisco Annual Internet Report, their number will rise
to as many as 15 million by 2023 [3]. Thus, a systematic solution to DDoS attacks is needed.
The execution of DDoS attacks is shown in Figure 1. The attacker converts different vulnerable devices
into bots. These bots send voluminous requests to the target server, which results in network congestion,
causing all machines connected to the server to stop responding.
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
ICE4CT2021 IOP Publishing
Journal of Physics: Conference Series 2312 (2022) 012082 doi:10.1088/1742-6596/2312/1/012082
This approach requires IP addresses to be updated regularly. It also fails under IP spoofing.
The second approach used virtual machine (VM) technology, installed at each sensor, with the sensor
data collected by applying the Dempster-Shafer theory at the front end [6]. This approach produces fewer
false positives, but it cannot detect unknown attacks. According to Arbor Networks [7], DDoS attack
activity during the first quarter of 2015 shows that attack duration has shortened, but the impact is very
high, with a size of 1.25 Gbps. In 2015, the majority of these attacks leveraged reflection amplification
techniques that use the Simple Service Discovery Protocol (SSDP), with 26 Gbps, and the Network Time
Protocol (NTP), with 51 Gbps.
The statistical approaches that were applied to identify DDoS attacks were discussed in [8]. Parametric
[9] and non-parametric [10] methods were applied. Multivariate correlation analyses were used to model
traffic behaviour and spectral analysis was used to handle large data in the application of the parametric
statistical method. A Markov chain, regression analysis and time series analysis were done to predict
future attacks when applying the non-parametric statistical method.
The volume of DDoS attacks is growing very fast [8]; thus, computer modelling of these attacks has been
strongly recommended. Researchers started applying machine learning (ML) to model and identify
attack patterns. The advantage of ML models is that they learn from data and predict with high accuracy
[11]. Popular ML models used to identify DDoS attacks are support vector machines (SVM), decision
trees (DT), random forests (RF), K-nearest neighbours (KNN), Naïve Bayes (NB) and neural networks (NN)
[12]. The ML methods applied in various research are shown in Table 1.
Since CIC-DDoS2019* is the most recent dataset available on the Web, the current research is conducted
on it using efficient ML methods, namely the light gradient boosting method (LGBM) and extreme
gradient boosting (XGBoost) [23]. As the volume of DDoS attack data is growing faster, adaptive
boosting is not considered a suitable solution to this problem because it is slow and
sensitive to noise [21]. LGBM utilises less memory and makes predictions with high accuracy.
XGBoost is known for high predictive accuracy because it utilises both L1 and L2
regularisation and implements parallel processing. The workflow of the current research is shown in
Figure 2.
[Figure 2. Workflow of the current research. Blocks recoverable from the figure: dataset CSV files (2019), clean data, encoding of labels, train data (80%), test data (20%), cross-validation, algorithm comparison, predicted output.]
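The 80%/20% train-test partition named in the workflow can be illustrated with a minimal sketch. It assumes a shuffled split with a fixed seed; the paper does not state how the partition was drawn, and the function name `train_test_split` is a hypothetical helper written for this illustration, not a library call.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    rows: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the rows reproducibly and cut them into train and test parts.

    Assumption: a random shuffle precedes the cut; the source only gives
    the 80%/20% proportions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = train_test_split(list(range(100)))
    print(len(train), len(test))
```

For a multiclass problem such as this one, a stratified split (preserving the share of each attack type in both parts) may be preferable; the sketch above does not do that.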
2. Boosting Algorithms
Boosting algorithms [21] were designed in 1999 with the aim of improving the accuracy of ML algorithms.
These are tree-based ensemble algorithms that can be applied to data that do not follow any particular
distribution. They are designed to handle mixed data types. Different gradient boosting
algorithms are discussed in [22]. The XGBoost and LGBM methods were used in this research because of
their strong learning capability and fast processing. Descriptions of the methods are presented in this
section.
2.1 XGBoost
XGBoost [23] finds the best split in trees using histograms. A histogram groups the values of a
feature into a fixed number of bins. Thus, in histogram-based methods the splitting is done on bin
boundaries rather than on individual feature values. The method executes faster because features are
binned before the construction of the tree. The evolution of the XGBoost method is shown in Figure 3.
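The idea of histogram-based split finding can be illustrated with a simplified sketch. It is not the XGBoost implementation: it assumes equal-width bins (XGBoost itself builds bins from quantile sketches), a single feature, and the standard second-order gain G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - G^2/(H + lambda), where G and H are sums of gradients and Hessians. The function name and parameters are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def best_histogram_split(
    values: Sequence[float],
    grads: Sequence[float],
    hess: Sequence[float],
    n_bins: int = 4,
    lam: float = 1.0,
) -> Tuple[Optional[float], float]:
    """Return (threshold, gain) of the best split on one binned feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return None, 0.0  # constant feature: nothing to split on

    # Step 1: bin the feature once, accumulating gradient statistics per bin.
    g_hist: List[float] = [0.0] * n_bins
    h_hist: List[float] = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        g_hist[b] += g
        h_hist[b] += h

    def score(g: float, h: float) -> float:
        return g * g / (h + lam)

    # Step 2: scan the n_bins - 1 bin boundaries instead of every value.
    g_total, h_total = sum(g_hist), sum(h_hist)
    best_thr: Optional[float] = None
    best_gain = 0.0
    g_left = h_left = 0.0
    for b in range(n_bins - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        gain = (
            score(g_left, h_left)
            + score(g_total - g_left, h_total - h_left)
            - score(g_total, h_total)
        )
        if gain > best_gain:
            best_gain, best_thr = gain, lo + (b + 1) * width
    return best_thr, best_gain


if __name__ == "__main__":
    x = [1, 2, 3, 4, 10, 11, 12, 13]
    g = [-1.0] * 4 + [1.0] * 4  # toy gradients: two well-separated groups
    print(best_histogram_split(x, g, [1.0] * 8))
```

The speed-up comes from the second loop: the number of candidate splits depends on the number of bins, not on the number of instances.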
Table 2 shows that the 12 DDoS attacks on the training day are NTP, DNS, LDAP, MSSQL, NetBIOS,
SNMP, SSDP, UDP, UDP-Lag, WebDDoS, SYN Flood (SYN) and TFTP, and the seven attacks on the test
day are PortMap, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag and SYN.
All CSV files present in both attack categories consist of 88 attributes collected by CICFlowMeter [25].
The attributes selected and deleted after pre-processing are shown in Figure 5(a) and (b), respectively.
Figure 6. Attacks under training and their number of instances in each CSV. Labels recoverable from the figure: SSDP (2,611,374), SNMP (5,161,377), LDAP (2,181,542), DNS (5,074,413).
With the available hardware resources, the ML models could be applied to approximately
1,500,000 instances. Thus, the dataset was reconstructed by sampling. For the reflection attacks,
every 9th instance of MSSQL, 10th instance of LDAP, 20th instance of NetBIOS, 10th instance of SSDP,
25th instance of DNS, 25th instance of SNMP, 6th instance of NTP and 5th instance of TFTP was taken.
For the exploitation attacks, every 3rd instance of SYN, every 3rd instance of UDP flood, and UDP-Lag
were sampled from the 500,000 exploitation attacks. These sampling steps were decided by checking the
number of instances present in the original files. The different attacks in this experiment were encoded
as follows: LDAP – 1, DNS – 2, MSSQL – 3, NetBIOS – 4, NTP – 5, SNMP – 6, SSDP – 7, UDP – 8,
SYN – 9, TFTP – 10 and UDP-Lag – 11.
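The sampling and encoding steps described above can be sketched as follows. The sampling steps and label codes are taken from the text; the helper names are hypothetical, and the choice to start counting at the n-th row (rather than the first) is an assumption, since the paper does not specify the offset. UDP-Lag is left out of the step table because its sampling step is not stated unambiguously.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")

# "Every n-th instance" per attack file, as listed in the text.
SAMPLING_STEP: Dict[str, int] = {
    "MSSQL": 9, "LDAP": 10, "NetBIOS": 20, "SSDP": 10,
    "DNS": 25, "SNMP": 25, "NTP": 6, "TFTP": 5,
    "SYN": 3, "UDP": 3,
}

# Integer codes assigned to the attack labels in this experiment.
LABEL_CODE: Dict[str, int] = {
    "LDAP": 1, "DNS": 2, "MSSQL": 3, "NetBIOS": 4, "NTP": 5, "SNMP": 6,
    "SSDP": 7, "UDP": 8, "SYN": 9, "TFTP": 10, "UDP-Lag": 11,
}


def take_every_nth(rows: Sequence[T], step: int) -> List[T]:
    """Keep rows n, 2n, 3n, ... (1-based); the offset is an assumption."""
    return list(rows[step - 1::step])


def encode_labels(labels: Sequence[str]) -> List[int]:
    """Map attack names to the integer codes used for training."""
    return [LABEL_CODE[name] for name in labels]


if __name__ == "__main__":
    mssql_rows = list(range(1, 28))  # stand-in for the rows of one CSV file
    print(take_every_nth(mssql_rows, SAMPLING_STEP["MSSQL"]))
    print(encode_labels(["LDAP", "TFTP", "UDP-Lag"]))
```

Systematic sampling of this kind keeps the temporal spread of each file while reducing its size by a fixed factor, which is why a larger step is used for the larger files.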
The reconstructed dataset was pre-processed using the steps mentioned in Section 3.2. The files of
the different reflection and exploitation attacks were merged, and the attacks in each class
were identified by two ML algorithms, i.e. LGBM and XGBoost. XGBoost was found to
be expensive in terms of time. LGBM performed the best, with high accuracy and the least time required,
as shown in Table 3.
Figure 7. Attacks under test and their number of instances in each CSV. Labels recoverable from the figure: MSSQL (5,763,061), SYN Flood (4,284,751), PortMap (186,960), NetBIOS (3,454,578).
However, the attack types had to be balanced in this experiment because PortMap and UDP-Lag had
far fewer instances than the other attack types. This balancing was accomplished
by taking every 25th instance of MSSQL, 10th instance of LDAP, 20th instance of SYN, 15th instance of
NetBIOS, 15th instance of UDP and all instances of PortMap. The number of UDP-Lag cases was only
1,873; thus, it was oversampled to 200,000. The different attacks in this experiment were encoded as
follows: LDAP – 1, MSSQL – 2, NetBIOS – 3, PortMap – 4, SYN – 5, UDP – 6, UDP-Lag – 7. The
accuracies obtained after experiment 2 are shown in Table 4.
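The oversampling of the minority class can be sketched as below. The paper does not state which oversampling technique was used; random duplication with replacement is assumed here, and the function name `oversample` is hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def oversample(rows: Sequence[T], target_size: int, seed: int = 0) -> List[T]:
    """Grow a minority class to target_size by duplicating random rows.

    Assumption: sampling with replacement; every original row is kept.
    """
    original = list(rows)
    if not original:
        raise ValueError("cannot oversample an empty class")
    if target_size <= len(original):
        return original
    rng = random.Random(seed)
    extra = rng.choices(original, k=target_size - len(original))
    return original + extra


if __name__ == "__main__":
    udp_lag = list(range(1873))  # stand-in for the 1,873 UDP-Lag rows
    balanced = oversample(udp_lag, 200_000)
    print(len(balanced))
```

Because duplication adds no new information, each original row appears on average about 107 times after this step, so a classifier can score well on the duplicated rows without generalising to unseen instances of the class.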
It was discovered that accuracy was quite poor without cross-validation but very high with cross-
validation. Thus, a classification report for LGBM was generated, shown in Figure 8.
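For reference, k-fold cross-validation (the conclusions report the 10-fold variant) can be sketched as follows. The sketch uses contiguous, unshuffled folds and a majority-class stand-in model; both are assumptions made for illustration, and neither the helper names nor the stand-in model come from the paper, which used LGBM and XGBoost.

```python
from collections import Counter
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_accuracy(
    fit: Callable[[list, list], object],
    predict: Callable[[object, list], list],
    X: Sequence,
    y: Sequence,
    k: int = 10,
) -> float:
    """Mean accuracy over k folds; each row is tested exactly once."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Stand-in model (hypothetical): always predicts the most frequent class.
def fit_majority(X: list, y: list) -> object:
    return Counter(y).most_common(1)[0][0]


def predict_majority(model: object, X: list) -> list:
    return [model] * len(X)


if __name__ == "__main__":
    X = list(range(25))
    y = [1] * 20 + [2] * 5
    print(cross_val_accuracy(fit_majority, predict_majority, X, y, k=10))
```

Note that cross-validation scores rows drawn from the same file as the training rows, whereas the train-day/test-day evaluation scores traffic captured on a different day; the two accuracies therefore measure different things.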
According to the classification report, the precision, recall and F1 score for UDP-Lag were significantly
lower than for others. In addition, the number of UDP-Lag instances in the test dataset was just 1,873;
this is approximately 0.05% of the training data shown in Figure 9.
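The per-class quantities in a classification report follow directly from counts of true positives (TP), false positives (FP) and false negatives (FN): precision = TP/(TP+FP), recall = TP/(TP+FN) and F1 = 2PR/(P+R). A minimal sketch is given below; the function name `per_class_report` and the toy labels are illustrative only and do not reproduce the paper's figures.

```python
from typing import Dict, Hashable, Sequence


def per_class_report(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[Hashable, Dict[str, float]]:
    """Compute precision, recall, F1 and support for every class label."""
    report: Dict[Hashable, Dict[str, float]] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": float(tp + fn),
        }
    return report


if __name__ == "__main__":
    # Toy labels using the test-day codes (1 = LDAP, 7 = UDP-Lag).
    truth = [1, 1, 1, 7, 7]
    predicted = [1, 1, 7, 7, 1]
    for label, row in per_class_report(truth, predicted).items():
        print(label, row)
```

Reporting these values per class, rather than a single overall accuracy, is what exposes a weak class such as UDP-Lag when it has very little support.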
Even after up-sampling, the expected accuracies were not attained, due to the small amount of test data.
As a result, a new experiment was run without UDP-Lag; the accuracies obtained without UDP-Lag are
presented in Table 6, and the classification report is shown in Figure 9.
The above experiments show the suitability of the LGBM and XGBoost methods for the CIC-DDoS
2019 dataset. Both methods achieved more than 80% accuracy in less time than traditional ML methods in
most of the experiments. LGBM executed the task in less than 5 minutes in most of the experiments,
whereas traditional ML methods take 45-75 minutes for the same size of data (see experiment II
in Table 4, Sec. 3.4). A comparison with existing research on the CIC-DDoS 2019 dataset is
presented in Table 7.
In Table 7 it is observed that all previous research focussed on accuracy; execution time had never been
measured. In addition, no previous study had examined training vs test cases. In the present study, the
models were trained with day 1 attacks in the dataset (presented as training data) and tested with day 2
attacks (presented as test data in the dataset). The LGBM model presented in the current research
predicted the attacks with very high accuracy, 94.88% without cross-validation and 99.2% with cross-
validation, in 0.30 and 0.49 mins, respectively.
A ResNet pretrained network, which has a very complex neural network architecture particularly suited
to image processing applications, was used in [18]. ResNet generally takes a long time to converge due
to its complex architecture, and it requires special hardware to execute [28]. The J48 classifier was used in
[19] for the identification of individual attacks, providing very high accuracy. Only five attacks were
considered in this research. Binary classification, which produces higher accuracy levels than
multiclass classification [29], was performed in [20, 26]. Furthermore, AdaBoost and deep learning
techniques have been reported to be slower than LGBM and XGBoost [30].
Thus, a fast and efficient solution to DDoS attacks is presented in the current research, with the novelty
that the models have been tested on unseen data.
4. Conclusions
With the increased volume of DDoS attacks, the problem of attack identification is getting more complex
by the day. The presented work proposes a time-efficient solution for reflection, exploitation and test
data. As a special case, five DDoS attacks in MSSQL, LDAP, SYN, NetBIOS and UDP present in the
training were recognised from corresponding similar attacks in the test data with an accuracy of 94.88%
and 94.89% by LGBM and XGBoost in 0.30 and 3.48 mins, respectively. ML models generally are very
prone to overfitting. To handle that situation, 10-fold cross-validation was applied, which improved
accuracy to 99.2% in 0.48 minutes by LGBM.
The present work can be applied to real data collected from IoT devices. A limitation that was found in
the present work is that all instances present in the dataset cannot be processed, even with the use of
high-end machines.
Acknowledgement: Our sincere thanks to Samsung R&D, Bangalore and Birla Institute of Technology,
Mesra, Ranchi, for providing us with the opportunity to work on the SRIB prism project. We would like
to thank Mr Prem Abhishek and Mr Bimal Gupta, Samsung R&D, Bangalore for their valuable
suggestions and support, enabling us to carry out this research efficiently.
References:
[1] Dalmazo BL, Marques JA, Costa LR, Bonfim MS, Carvalho RN, da Silva AS, Fernandes S,
Bordim JL, Alchieri E, Schaeffer‐Filho A, Paschoal Gaspary L. A systematic review on
distributed denial of service attack defense mechanisms in programmable networks.
International Journal of Network Management. 2021 May 24:e2163.
[2] Wani S, Imthiyas M, Almohamedh H, M Alhamed K, Almotairi S, Gulzar Y. Distributed Denial
of Service (DDoS) Mitigation Using Blockchain—A Comprehensive Insight. Symmetry. 2021
Feb;13(2):227.
[3] Malathy B, Krieshaanthiny N, Chitra B. Cloud-Based Enhanced Storage System Using Android
Technology. INTI JOURNAL. 2021;2021(01).
[4] Chen YW, Sheu JP, Kuo YC, Van Cuong N. Design and implementation of IoT DDoS attacks
detection system based on machine learning. In 2020 European Conference on Networks and
Communications (EuCNC) 2020 Jun 15 (pp. 122-127). IEEE.
[5] Ramachandran A, Feamster N, Vempala S. Filtering spam with behavioral blacklisting. In
Proceedings of the 14th ACM conference on computer and communications security 2007 Oct
28 (pp. 342-351).
[6] Bakshi A, Dujodwala YB. Securing cloud from DDoS attacks using intrusion detection system
in virtual machine. In 2010 Second International Conference on Communication Software and
Networks 2010 Feb 26 (pp. 260-264). IEEE.
[7] Arbor networks detects largest ever DDoS attack in Q1 2015 DDoS report. In: Arbor Networks
(2015). http://www.arbornetworks.com/arbor-networks-detects-largest-ever-ddosattack-in-q1-2015-ddos-report
[8] Khalaf BA, Mostafa SA, Mustapha A, Mohammed MA, Abduallah WM. Comprehensive
review of artificial intelligence and statistical approaches in distributed denial of service attack
and defence methods. IEEE Access. 2019 Apr 16;7:51691-713.
[9] Tan Z, Jamdagni A, He X, Nanda P, Liu RP. A system for denial-of-service attack detection
based on multivariate correlation analysis. IEEE transactions on parallel and distributed
systems. 2013 May 23;25(2):447-56.
[10] Saranya R, Kannan SS, Sundaram SM. Integrated quantum flow and hidden Markov chain
approach for resisting DDoS attack and C-Worm. Cluster Computing. 2019 Nov;22(6):14299-
310.
[11] Attaran M, Deb P. Machine learning: the new big thing for competitive advantage. International
Journal of Knowledge Engineering and Data Mining. 2018;5(4):277-305.
[12] Tuan TA, Long HV, Son LH, Kumar R, Priyadarshini I, Son NT. Performance evaluation of
Botnet DDoS attack detection using machine learning. Evolutionary Intelligence. 2020
Jun;13(2):283-94.
[13] Divekar A, Parekh M, Savla V, Mishra R, Shirole M. Benchmarking datasets for anomaly-based
network intrusion detection: KDD CUP 99 alternatives. In 2018 IEEE 3rd International
Conference on Computing, Communication and Security (ICCCS) 2018 Oct 25 (pp. 1-8). IEEE.
[14] Prasad M, Tripathi S, Dahal K. An efficient feature selection based Bayesian and Rough set
approach for intrusion detection. Applied Soft Computing. 2020 Feb 1;87:105980.
[15] Meidan Y, Sachidananda V, Peng H, Sagron R, Elovici Y, Shabtai A. A novel approach for
detecting vulnerable IoT devices connected behind a home NAT. Computers & Security. 2020
Oct 1;97:101968.
[16] Oo MM, Kamolphiwong S, Kamolphiwong T, Vasupongayya S. Analysis of Features Dataset
for DDoS Detection by using ASVM Method on Software Defined Networking. International
Journal of Networked and Distributed Computing. 2020 Apr;8(2):86-93.
[17] Stiawan D, Idris MY, Bamhdi AM, Budiarto R. CICIDS-2017 dataset feature analysis with
information gain for anomaly detection. IEEE Access. 2020 Jul 16;8:132911-21.
[18] Hussain F, Abbas SG, Husnain M, Fayyaz UU, Shahzad F, Shah GA. IoT DoS and DDoS Attack
Detection using ResNet. In 2020 IEEE 23rd International Multitopic Conference (INMIC) 2020
Nov 5 (pp. 1-6). IEEE.
[19] Kshirsagar D, Kumar S. A feature reduction based reflected and exploited DDoS attacks
detection system. Journal of Ambient Intelligence and Humanized Computing. 2021 Jan 28:1-
3.
[20] Maranhão JP, da Costa JP, Javidi E, de Andrade CA, de Sousa Jr RT. Tensor based framework
for Distributed Denial of Service attack detection. Journal of Network and Computer
Applications. 2021 Jan 15;174:102894.
[21] Schapire RE. A brief introduction to boosting. In IJCAI 1999 Jul 31 (Vol. 99, pp. 1401-1406).
[22] Bentéjac C, Csörgő A, Martínez-Muñoz G. A comparative analysis of gradient boosting
algorithms. Artificial Intelligence Review. 2021 Mar;54(3):1937-67.
[23] Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H. Xgboost: extreme gradient boosting.
R package version 0.4-2. 2015 Aug 1;1(4):1-4.
[24] Sharafaldin I, Lashkari AH, Hakak S, Ghorbani AA. Developing realistic distributed denial of
service (DDoS) attack dataset and taxonomy. In 2019 International Carnahan Conference on
Security Technology (ICCST) 2019 Oct 1 (pp. 1-8). IEEE.
[25] Lashkari AH, Zang Y, Owhuo G, Mamun MS, Gil GD. CICFlowMeter.
[26] Cil AE, Yildiz K, Buldu A. Detection of DDoS attacks with feed forward based deep neural
network model. Expert Systems with Applications. 2021 May 1;169:114520.
[27] Odumuyiwa V, Alabi R. DDOS Detection on Internet of Things Using Unsupervised
Algorithms. Journal of Cyber Security and Mobility. 2021 May 27:569-92.
[28] Sundar KS, Bonta LR, Baruah PK, Sankara SS. Evaluating training time of Inception-v3 and
Resnet-50,101 models using TensorFlow across CPU and GPU. In 2018 Second International
Conference on Electronics, Communication and Aerospace Technology (ICECA) 2018 Mar 29
(pp. 1964-1968). IEEE.
[29] Lorena AC, De Carvalho AC, Gama JM. A review on the combination of binary classifiers in
multiclass problems. Artificial Intelligence Review. 2008 Dec;30(1):19-37.
[30] Shahraki A, Abbasi M, Haugen Ø. Boosting algorithms for network intrusion detection: A
comparative evaluation of Real AdaBoost, Gentle AdaBoost and Modest AdaBoost.
Engineering Applications of Artificial Intelligence. 2020 Sep 1;94:103770.