Experimental Analysis of Decision Tree Classifier in Intrusion Detection
Volume 11, Issue 7, July 2020, pp. 869-880, Article ID: IJARET_11_07_085
Available online at https://iaeme.com/Home/issue/IJARET?Volume=11&Issue=7
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
DOI: 10.34218/IJARET.11.7.2020.085
ABSTRACT
Machine learning is a continuously developing field that differs from traditional
computational approaches. Machine learning algorithms permit computers to train on
data inputs. It is essential to analyze large amounts of data and extract useful knowledge
from them. In this paper we explore supervised machine learning algorithms for
intrusion detection. An intrusion detection system (IDS) is a system that monitors
network traffic for harmful activities. Decision trees are employed to visually illustrate
decisions and inform decision making. Throughout the experiments, we compare the
performance of the decision tree with other supervised machine learning classifiers,
evaluating performance metrics such as accuracy, precision, recall, and F1 score. The
ROC AUC score is also compared with accuracy. The decision tree reaches a high
accuracy of 98.7% on the KDD Cup 99 dataset.
Keywords: Decision tree classifier, KDD Cup 99, supervised machine learning classifiers,
intrusion detection system
Cite this Article: Ajeesha M I and D Francis Xavier Christopher, Experimental Analysis of
Decision Tree Classifier in Intrusion Detection, International Journal of Advanced Research in
Engineering and Technology (IJARET), 11(7), 2020, pp. 869-880.
https://iaeme.com/Home/issue/IJARET?Volume=11&Issue=7
1. INTRODUCTION
The last decade has seen rapid advancements in machine learning techniques, empowering
automation and prediction at scales never imagined before. This has led researchers and
engineers to envision new applications for these techniques. The aim of machine learning is
to make sense of the structure of data and fit that data into models that can be understood and
employed by people. Machine learning algorithms use computational methods to learn
information directly from data without relying on a predetermined equation as a model.
Machine learning uses two main types of techniques: supervised learning and unsupervised
learning.
2. LITERATURE SURVEY
An intrusion detection system is a software application that detects network intrusions using
machine learning algorithms. The decision tree outperforms other classifiers with respect to
accuracy, time, and precision [6].
Heterogeneous data analysis technology is expected to play a significant role in almost all
domains. Integrating two or more algorithms to combine their strengths would be more useful
for processing heterogeneous data [13]. Figure 2 represents the machine learning classifiers.
Machine learning techniques can make an impact in the domain of cybersecurity. The authors
examined the security challenges that remain and shared an overview of the conceptualization,
understanding, modeling, and thinking about cybersecurity data science [6].
Network intrusion detection systems based on ML and DL methods are surveyed to provide
new researchers with up-to-date knowledge, recent trends, and the progress of the field. A
systematic approach is taken for the selection of relevant articles in the field of AI-based
NIDS [4].
An approach is proposed to find the best classification algorithm for applying machine learning
to intrusion detection. The J48 algorithm shows the highest classification accuracy with the
lowest error rate [21].
An intelligent intrusion detection system based on stacking is developed, using a DT-RFE
algorithm to select fewer features. This model can improve and optimize the dataset and
increase resource utilization by deleting uncorrelated and redundant records [17].
The growth of smart methods is required to fight complex new smart systems. The authors
presented a deep neural network for intrusion detection in IoT networks. The results show at
least 90% accuracy on each dataset [16].
The network intrusion detection system is the most widely used defense technology in the field
of network security. A two-level approach is used to increase the efficiency of the intrusion
detection parameters. In Level 1, basic supervised/unsupervised learning algorithms are
compared; in Level 2, the results from Level 1 are trained with a deep learning Artificial Neural
Network (ANN), and parameters such as accuracy, precision, recall, false alarm rate, and
F-score are compared [3].
To cope with these cybersecurity problems, one must deal with certain machine learning
challenges. Methods that generate labels by pivoting address the common problem of the lack
of labels in cybersecurity [8].
Karatas et al. compared the performance of different ML algorithms using the up-to-date
benchmark dataset CSE-CIC-IDS2018. They addressed the dataset imbalance problem by
reducing the imbalance ratio with the Synthetic Minority Oversampling Technique (SMOTE),
which improved the detection rate for minority-class attacks.
A two-stage anomaly-based network intrusion detection process is applied to the UNSW-NB15
dataset, using Recursive Feature Elimination and Random Forests, among other techniques, to
select the best dataset features for machine learning. The performance of Decision Trees (C5.0),
Naïve Bayes, and a multinomial Support Vector Machine is evaluated. C5.0 yields the highest
accuracy (74%) and F1 score (86%), and the two-stage hybrid classification improves accuracy
by up to 12%, achieving a multi-classification accuracy of 86.04% [17].
3. METHODOLOGY
3.1. Machine Learning Classifiers for Intrusion Detection
3.1.1. Naive Bayes
Naive Bayes is a supervised machine learning classification method that assumes the principle
of class-conditional independence from Bayes' theorem. The assumption is that the presence of
one feature does not affect the presence of another in the probability of the specified output,
and each predictor has an equal effect on that result. The three types of Naive Bayes classifiers
are Multinomial Naive Bayes, Bernoulli Naive Bayes, and Gaussian Naive Bayes. Applications
of the classifier include text classification, spam identification, and recommendation systems.
The Naive Bayes classifier is a kind of probabilistic graphical model: it represents conditional
dependence by organizing dependencies along the edges of a directed graph, treats all nodes not
connected by an edge as conditionally independent, and makes use of this fact in the construction
of the directed acyclic graph. The foremost function of the algorithm is to classify data into
specified categories. Following Bayes' theorem, it treats every category as mutually exclusive
and the predictors as independent of each other. It works well with large datasets.
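As a minimal sketch of how such a classifier can be built in practice (the dataset, split, and settings below are illustrative assumptions, not the paper's experimental setup), Gaussian Naive Bayes is available in scikit-learn:

# Minimal Gaussian Naive Bayes sketch with scikit-learn; the data and split are placeholders.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)                  # stand-in data, not KDD Cup 99
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()                               # assumes Gaussian-distributed features per class
model.fit(X_train, y_train)                        # estimates per-class feature means and variances
print(model.score(X_test, y_test))                 # mean accuracy on the held-out split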
AdaBoost commonly uses one-level decision trees as weak learners; when such trees are used for
classification, they are usually called decision stumps. AdaBoost can be used in combination with
different types of learning algorithms to enhance performance. The output of the other
learning algorithms ('weak learners') is combined into a weighted sum that constitutes the final
output of the boosted classifier. The AdaBoost classifier is adaptive in the sense that the
successive weak learners are adjusted in favor of those instances misclassified by previous
classifiers. The individual learners can be weak; provided the performance of each one is
slightly better than random guessing, the final model can be proven to converge to a strong
learner.
An AdaBoost classifier with decision trees as weak learners is frequently referred to as an
excellent out-of-the-box classifier. With decision tree learning, information collected at each
stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed
into the tree-growing algorithm, so that later trees tend to focus on harder-to-classify examples.
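A brief hedged sketch of this setup, assuming scikit-learn and synthetic stand-in data (not the paper's experiment):

# AdaBoost with depth-1 decision trees ("decision stumps") as weak learners; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)            # one split per tree: a decision stump
ada = AdaBoostClassifier(stump, n_estimators=100, random_state=0)
ada.fit(X_train, y_train)                              # later stumps reweight misclassified samples
print(ada.score(X_test, y_test))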
3.2. Dataset
Machine learning usually works with two datasets: a training dataset and a test dataset.
Evaluation datasets play a critical role in validating any IDS approach and in assessing the
proposed model's ability to detect intrusive behavior. Due to privacy issues, the datasets used
for network packet analysis in commercial products are not easily accessible. There are some
publicly available standard datasets, such as DARPA, KDD, NSL-KDD, and ADFA-LD.
Existing datasets used for building and comparatively evaluating IDS are analyzed, including
their features and limitations. The most important and tedious step in setting out with machine
learning models is obtaining reliable data. We use the KDD Cup 1999 data to build predictive
models capable of differentiating between intrusions (attacks) and normal connections. It is a
standard dataset containing 4,898,431 instances with 41 attributes. Each connection is labeled
as either normal or as an attack, with exactly one specific attack type, and consists of about
100 bytes. Attacks fall into four main groups:
• DOS: denial-of-service
• R2L: unauthorized access from a remote machine
• U2R: unauthorized access to local root privileges
• probing: surveillance and other probing
Each group has various attacks, and there are a total of 21 types of attacks.
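The paper does not spell out how the dataset was loaded; as one hedged option, scikit-learn ships a fetcher for KDD Cup 99 (the 10% subset and the binary normal-vs-attack relabeling below are illustrative choices, not the paper's stated procedure):

# Load KDD Cup 99 via scikit-learn's built-in fetcher; percent10=True uses the common 10% subset.
import numpy as np
from sklearn.datasets import fetch_kddcup99

kdd = fetch_kddcup99(percent10=True)
X, y = kdd.data, kdd.target                # 41 attributes per connection; labels such as b'normal.'
print(X.shape)

# Collapse the specific attack names into a binary normal-vs-attack target for illustration
y_binary = np.where(y == b'normal.', 0, 1)
print(np.bincount(y_binary))               # counts of normal vs attack connections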
3.3. Implementation
We implement the IDS using Python and its extensive libraries. The machine learning
algorithms discussed here are all available in the scikit-learn library. The growing development
of deep learning frameworks such as TensorFlow, PyTorch, and Keras for this language has
recently increased Python's popularity. With its readable syntax and ability to be used as a
scripting language, Python proves powerful and straightforward both for pre-processing data
and for working with data directly. The scikit-learn machine learning library is built on top of
several existing Python packages, namely NumPy, SciPy, and Matplotlib.
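As a hedged sketch of the kind of training step this implies (the encoding of the three symbolic KDD Cup 99 attributes, the binary target, and the tree settings are assumptions, not the paper's exact configuration):

# Encode the symbolic KDD Cup 99 attributes and fit a scikit-learn decision tree.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

kdd = fetch_kddcup99(percent10=True)
X = kdd.data                                     # object array: 38 numeric + 3 symbolic columns
y = np.where(kdd.target == b'normal.', 0, 1)     # illustrative binary target: normal vs attack

categorical = [1, 2, 3]                          # protocol_type, service, flag are symbolic
encoder = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), categorical)],
    remainder="passthrough",
)
model = make_pipeline(encoder, DecisionTreeClassifier(random_state=0))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))               # held-out accuracy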
Accuracy is the most commonly used performance measure; it is the ratio of correctly predicted
observations to the total observations. The higher the accuracy, the better the model. Accuracy
is a good measure for symmetric datasets, where the numbers of false positives and false
negatives are almost the same; otherwise, we have to consider other parameters to evaluate the
performance of the model. Accuracy immediately indicates whether a model is being trained
correctly, but it does not give detailed information about its performance on the problem. Using
accuracy as the main performance metric does not work well when there is severe class imbalance.
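For example, with scikit-learn the definition above reduces to a one-line computation (the labels are a toy example, not experimental output):

# Accuracy = correct predictions / all predictions, on toy labels
from sklearn.metrics import accuracy_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
print(accuracy_score(y_true, y_pred))    # 6 of 8 predictions are correct -> 0.75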
4.1.1. Precision
Precision is the number of true positives divided by the number of true positives plus false
positives. Precision evaluates how precise a model is when predicting positive values: it is the
percentage of positive predictions that are correct.
4.1.2. Recall
Recall measures the percentage of actual positives a model correctly identifies; it is the metric
to focus on when false negatives are costly. The numerator is the number of true positives, i.e.,
the positives the model correctly identified. The denominator is the number of true positives
plus the number of positives incorrectly predicted as negative (false negatives) by the model.
4.1.3. F1 Score
The F1 score is the weighted average of precision and recall, so it takes both false positives and
false negatives into account. It is not as intuitive as accuracy, but F1 is usually more useful when
there is an uneven class distribution. Accuracy works best when the numbers of false positives
and false negatives are similar; if they are very different, it is better to consider both precision
and recall.
The F1 score is the harmonic mean of precision and recall, so it gives an integrated view of
these two metrics. It is highest when precision equals recall.
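A small illustration of these three metrics on the same toy labels used above (illustrative only):

# Precision, recall, and F1 on toy labels: TP = 3, FP = 1, FN = 1
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)      # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)         # TP / (TP + FN) = 3/4
print(p, r, f1_score(y_true, y_pred))    # harmonic mean 2*p*r / (p + r) = 0.75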
Accuracy examines the fraction of correctly assigned positive and negative classes. If the
problem is highly imbalanced, we can get a very high accuracy score simply by predicting that
all observations belong to the majority class.
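A hedged toy example of this pitfall, which also shows why ROC AUC is compared against accuracy (the 1% attack rate is an illustrative assumption):

# A degenerate model that never flags the minority class: accuracy looks good, ROC AUC does not.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 990 + [1] * 10)        # 1% minority (attack) class
scores = np.zeros(1000)                        # the model always outputs score 0
print(accuracy_score(y_true, scores > 0.5))    # 0.99 accuracy, yet no attack is detected
print(roc_auc_score(y_true, scores))           # 0.5: no ability to rank attacks above normal traffic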
Naive Bayes is a supervised learning method, so the data must be labeled. Naive Bayes is a
linear classifier and is highly accurate when applied to big data, but it works well only if the
decision boundary is linear, elliptic, or parabolic, and it outperforms decision trees only in rare
cases. Decision trees are convenient to use for a small number of classes and are simple to
explain and understand. Decision trees are better at identifying the most significant dimensions,
handling missing values, and dealing with outliers. Although over-fitting is a major problem
with decision trees, it can be mitigated by using boosted trees or random forests. In many
situations, boosting or random forests can result in trees outperforming either Bayes or K-NN.
Unlike Bayes and K-NN, decision trees can work directly from a table of data without any prior
design work. Naive Bayes can perform quite well, and it does not overfit nearly as much, so
there is no need to prune or post-process the model. The figure below shows a graphical
representation of the decision tree and Naive Bayes classifiers.
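Complementing that comparison, a hedged sketch of how the two classifiers can be contrasted empirically (restricted to the numeric KDD Cup 99 attributes and default hyperparameters for brevity; not the paper's exact protocol):

# Cross-validated accuracy of a decision tree versus Gaussian Naive Bayes on KDD Cup 99 (10% subset)
import numpy as np
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

kdd = fetch_kddcup99(percent10=True)
numeric = [i for i in range(41) if i not in (1, 2, 3)]     # drop the symbolic columns for simplicity
X = kdd.data[:, numeric].astype(float)
y = np.where(kdd.target == b'normal.', 0, 1)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("naive bayes", GaussianNB())]:
    print(name, cross_val_score(clf, X, y, cv=3).mean())   # default scoring: accuracy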
4. CONCLUSIONS
With the rapid growth of machine learning technologies, in this paper we have discussed how
the decision tree is applicable to intrusion detection systems. Evaluating various performance
metrics shows that the decision tree performs well compared with other classifiers. The IDS
should provide the most effective solution based on the requirements. Here we discussed the
accuracy, precision, recall, F1 score, and ROC AUC score of the classifiers. Based on these
evaluation metrics, we can conclude that the decision tree performs well as an intrusion
detector. Machine learning is a field that is continuously being innovated; thus, it is important
to keep in mind that algorithms, methods, and approaches will continue to change.
REFERENCES
[1] Ajeesha M I, Dr. D Francis Xavier Christopher, Supervised Machine Learning Techniques For
Intrusion Detection, IJRAR, Volume 6, Issue 2, June 2019.
[2] An Introduction to Machine Learning | DigitalOcean
[3] B Ida Seraphim, Shreya Palit, Kaustubh Srivastava, E Poovammal, Implementation of Machine
Learning Techniques applied to the Network Intrusion Detection System, International Journal
of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-8 Issue-5, June
2019.
[4] Dr. Poornima Nataraja, Bharathi Ramesh, Machine Learning Algorithms For Heterogeneous
Data: A Comparative Study, International Journal of Computer Engineering & Technology
(IJCET) Volume 10, Issue 3, May-June 2019
[5] Evaluation of Machine Learning Algorithms for Intrusion Detection System | by Cuelogic
Technologies | Cuelogic Technologies | Medium
[6] Harsh H. Patel, Purvi Prajapati, Study and Analysis of Decision Tree Based Classification
Algorithms, International Journal of Computer Sciences and Engineering.
[7] HUSPI, Quick Introduction to Machine Learning Algorithms for Beginners - HUSPI
[8] Idan Amit, John Matherly, Machine Learning in Cyber-Security - Problems, Challenges and
Data Sets.
[9] Iqbal H. Sarker, Yoosef B. Abushark, IntruDTree: A Machine Learning Based Cyber Security
Intrusion Detection Model, Symmetry 2020, 12, 754; doi:10.3390/sym12050754
[10] jakubczakon, F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should
You Choose?, neptune-ai/blog-binary-classification-metrics
[11] Jesse Davis, Mark Goadrich, The Relationship Between Precision-Recall and ROC Curves,
Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006.
[12] Machine learning Algorithms. Introduction | by Mohammad Daoud | Medium
[13] Mohammed Tabash, Mohamed Abd Allah, and Bella Tawfik, Intrusion Detection Model Using
Naive Bayes and Deep Learning Technique, The International Arab Journal of Information
Technology, Vol. 17, No. 2, March 2020.
[14] Nahla Ben Amor, Salem Benferhat, Zied Elouedi, Naive Bayes vs Decision Trees in Intrusion
Detection Systems, 2004 ACM Symposium on Applied Computing.
[15] Richard Power. 1999 CSI/FBI computer crime and security survey. Computer Security Journal,
Volume XV (2), 1999.
[16] Sarika Choudharya, Nishtha Kesswani, Analysis of KDD-Cup’99, NSL-KDD and UNSW-NB15
Datasets using Deep Learning in IoT, International Conference on Computational Intelligence
and Data Science (ICCIDS 2019)
[17] Souhail Meftah, Tajjeeddine Rachidi and Nasser Assem, Network Based Intrusion Detection
Using the UNSW-NB15 Dataset, International Journal of Computing and Digital Systems, ISSN
(2210-142X)
[18] Tahir Mehmood and Helmi B Md Rais, Machine Learning Algorithms In Context Of Intrusion
Detection, 2016 3rd International Conference On Computer And Information Sciences
(ICCOINS)
[19] Wenjuan Lian, Guoqing Nie, Bin Ji, An Intrusion Detection Method Based on Decision Tree-
Recursive Feature Elimination in Ensemble Learning, Hindawi, Volume 2020, Article ID
2835023.
[20] What Is Machine Learning? How It Works, Techniques & Applications - MATLAB & Simulink
(mathworks.com)
[21] Yogendra Kumar Jain and Upendra, An efficient intrusion detection based on decision tree
classifier using feature reduction, International journal of scientific and research publications,
Volume 2, Issue 2, January 2012.
[22] Zeeshan Ahmad, Adnan Shahid Khan, Cheah Wai Shiang, Network intrusion detection system:
A systematic study of machine learning and deep learning approaches, DOI: 10.1002/ett.4150