
Intrusion Detection System Using Unsupervised ML Algorithms: School of Information Technology and Engineering

This document describes a project to create an intrusion detection system using unsupervised machine learning algorithms. It will use K-Means and Gaussian mixture models with random forest classifiers trained on Spark clusters. The methodology involves feature scaling, selection using attribute ratio, training the K-Means and Gaussian models on Spark, and categorizing network data into different attack types for training and testing. One-hot encoding and feature elimination are also discussed.

Uploaded by

haggele haggele

School of Information Technology and Engineering

A PROJECT ON
INTRUSION DETECTION SYSTEM USING
UNSUPERVISED ML ALGORITHMS

Technical Answers for Real World Problems


(ITE3999)

Faculty: Prof. DAPHNE LOPEZ

Submitted by:
Aditya Kumar (18BIT0235)
Ritvik Gupta (18BIT0218)

ABSTRACT
With the advent of vast amounts of information and technology, all forms of
businesses around the world are becoming increasingly data-driven.
Companies collect and deal with data of high velocity, variety and volume.
This also opens up various loopholes in the systems developed
for working with such large amounts of data.

In this project, we attempt to tackle the problem of intrusions in digital
systems by creating an Intrusion Detection System using Unsupervised
Machine Learning Algorithms.

Traditional Intrusion Detection Systems already exist; however, they
detect intrusions based only on network signatures and flags that were
previously classified. By using Machine Learning, we are able to use the
data from previous attacks and dynamically detect new intrusion patterns
in real time by analyzing that data.

METHODOLOGY

We use K-Means and Gaussian Mixture Models, along with Random Forest
Classifiers, and train our machine learning algorithms on local-memory
Spark clusters.

• Feature scaling will be done to adjust for widely varying data.

• Feature selection is also applied by using Attribute-Ratio.

• K-Means and Gaussian Mixture are used for training.

• We will also use PySpark to optimize our operations and divide
processing into different batches using pipelines.
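
The clustering step listed above can be illustrated with a minimal sketch of K-Means (Lloyd's algorithm). It is written in plain Python rather than PySpark so that it runs without a cluster. The function name `kmeans`, the toy points and the choice of the first k points as initial centroids are assumptions made for illustration, not the project's actual code.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    # Assumption: use the first k points as initial centroids to keep the
    # sketch deterministic; real implementations use random or k-means++ init.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Toy data: two well-separated groups of (already scaled) feature vectors.
toy_points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
centroids, assignments = kmeans(toy_points, k=2)
```

In an intrusion detection setting, each resulting cluster would then be inspected or labelled to decide whether it mostly contains normal traffic or attacks.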

CATEGORIZING DATA
Here, we import the data into our project and categorize it on the basis
of the different types of attacks into 5 general categories, i.e. Normal,
DoS, R2L, U2R and Probe.
We also categorize the data into Normal or Attack categories. We use Pipeline
functions, which take as arguments the Transformers and Estimators that
are custom defined in our code, to make the categories.
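
As a rough sketch of this labelling step, the snippet below maps raw labels to the five general categories and then to Normal or Attack. It is plain Python, not the custom Pipeline stages used in the project. The attack names assume the KDD Cup 99 / NSL-KDD naming convention (the dataset is not named here), the mapping is deliberately partial, and the function names are hypothetical.

```python
# Assumption: attack names follow the KDD Cup 99 / NSL-KDD convention;
# the mapping below is partial and for illustration only.
ATTACK_CATEGORY = {
    "normal": "Normal",
    "neptune": "DoS", "smurf": "DoS", "back": "DoS", "teardrop": "DoS",
    "satan": "Probe", "ipsweep": "Probe", "portsweep": "Probe", "nmap": "Probe",
    "guess_passwd": "R2L", "ftp_write": "R2L", "warezclient": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R",
}


def five_class_label(raw_label):
    """Map a raw connection label to one of the five general categories."""
    # Unknown labels raise KeyError so that unmapped attacks are noticed.
    return ATTACK_CATEGORY[raw_label]


def binary_label(raw_label):
    """Collapse the five categories into Normal versus Attack."""
    return "Normal" if five_class_label(raw_label) == "Normal" else "Attack"


raw_labels = ["normal", "neptune", "satan", "rootkit", "guess_passwd"]
five_class = [five_class_label(label) for label in raw_labels]
two_class = [binary_label(label) for label in raw_labels]
```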
TRAIN and TEST DATA OUTPUT
Train and Test data frames after applying transform Pipelines.
ONE-HOT ENCODING
Most machine learning algorithms cannot operate on label data directly.
They require all input variables and output variables to be numeric.

This means that categorical data must be converted to a numerical form.
This applies to the output variable as well when it is categorical.
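
A minimal sketch of one-hot encoding in plain Python is shown below. The function name `one_hot_encode` and the example protocol column are hypothetical; in the project itself this step would be handled by the Spark pipeline.

```python
def one_hot_encode(values):
    """Return (categories, encoded rows) for a list of categorical values."""
    # Sort the distinct values so the column order is reproducible.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is "hot" per row
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical column, e.g. a network protocol field.
protocols = ["tcp", "udp", "icmp", "tcp"]
categories, encoded = one_hot_encode(protocols)
```

Each distinct value becomes its own 0/1 column, so no artificial ordering is introduced between the categories.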
FEATURE ELIMINATION USING ATTRIBUTE RATIO
Feature selection is also known as attribute selection or variable
selection. It helps in selecting the most appropriate features among those
available. Feature selection can be performed manually or automatically.
Importance:
• Features may be expensive to obtain, thus feature selection
is helpful.
• It helps in improving the accuracy of the model.
• It also reduces the time required to train the model.
• It discards garbage data.

References: https://www.naun.org/main/UPress/cc/2014/a102019-106.pdf
http://www.wseas.us/e-library/conferences/2013/Nanjing/ACCIS/ACCIS-30.pdf
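
The sketch below shows one way the attribute ratio of a numeric feature could be computed. It assumes, as we understand the method from the references above, that the ratio for a class is the mean of the feature within that class divided by its overall mean, and that the attribute ratio is the largest such value across classes. Binary features are not handled, and the feature names, toy values and threshold are hypothetical.

```python
from collections import defaultdict


def attribute_ratio(values, labels):
    """Attribute ratio of one numeric feature: the maximum over classes of
    (class mean / overall mean)."""
    overall_mean = sum(values) / len(values)
    # Assumption: features are non-negative (e.g. already scaled to [0, 1]),
    # so a zero overall mean means the feature is zero everywhere.
    if overall_mean == 0:
        return 0.0
    by_class = defaultdict(list)
    for value, label in zip(values, labels):
        by_class[label].append(value)
    return max(
        (sum(vals) / len(vals)) / overall_mean for vals in by_class.values()
    )


labels = ["Normal", "Normal", "DoS", "DoS"]
features = {
    "feat_a": [0.0, 0.0, 1.0, 1.0],  # differs strongly between classes
    "feat_b": [0.5, 0.5, 0.5, 0.5],  # identical in every class
}
ratios = {name: attribute_ratio(vals, labels) for name, vals in features.items()}

# Hypothetical threshold: keep features whose ratio exceeds it.
THRESHOLD = 1.5
selected = [name for name, ratio in ratios.items() if ratio > THRESHOLD]
```

A feature whose mean is the same in every class gets a ratio of 1 and is dropped here, while a feature concentrated in one class gets a higher ratio and is kept.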
FEATURE SCALING
A machine learning algorithm sees only numbers. If there is a vast difference
in range between features, say some ranging in the thousands and others in
the tens, it makes the underlying assumption that the higher-ranging numbers
are more important. These larger numbers then play a more decisive role while
training the model, and hence the model becomes biased towards those
features. To fix this, we scale the data.
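
The effect can be seen in a small sketch using min-max scaling, which maps each feature to the range [0, 1]. The choice of min-max scaling is an assumption, since the scaler is not specified here, and the feature names and values are hypothetical.

```python
import math


def min_max_scale(column):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical features: one in the thousands, one in the tens.
src_bytes = [1000.0, 5000.0, 9000.0]
failed_logins = [10.0, 90.0, 50.0]

scaled_bytes = min_max_scale(src_bytes)
scaled_logins = min_max_scale(failed_logins)

# Distance between the first two records before and after scaling.
raw_distance = math.dist(
    (src_bytes[0], failed_logins[0]), (src_bytes[1], failed_logins[1])
)
scaled_distance = math.dist(
    (scaled_bytes[0], scaled_logins[0]), (scaled_bytes[1], scaled_logins[1])
)
```

Before scaling, the distance between the two records is almost entirely determined by the larger feature. After scaling, both features contribute on a comparable footing, which matters for distance-based methods such as K-Means.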
