Logistic Regression
Abstract—Today's world is rapidly moving towards digitization. In this context, protecting and safeguarding digital resources is crucial for a large organization or a country. Digital resources are attacked and virtually brought down using malware. One strategy to defend against malware is to search for a pattern inside it. These patterns become the signature of a malware and are deployed into a security system for detection. But traditional signature generation techniques fail against polymorphic malware, which changes its form after every infection. In this paper, we propose a defense system which uses logistic regression with ANOVA F-Test and Snort IDS to thwart polymorphic malware. Logistic regression with ANOVA F-Test achieved 97.7% accuracy.

Index Terms—Polymorphic Malware, Machine Learning, ANOVA F-Test.

I. INTRODUCTION

Human lives are hugely intertwined with the web. It makes people smart, connected and updated at lightning speed. But there are organizations and individuals who, for their personal motives, target the web in various ways. They attack others' systems to steal, alter or completely destroy data anonymously. These types of attacks are termed cyber attacks. Cyber attacks have become quite prevalent and dangerous these days. They hamper the growth of the economy and the functioning of large organizations and countries.

According to a study by McAfee, MyDoom is a spam-mailing malware that caused the largest economic damage of all time; its estimated damage is about $38 billion. Alarmingly, a report by security company AV-Test states that the total number of malware has doubled over the past four years. This is illustrated in the following plot.

To detect a normal malware, the longest common content in it is found and deployed into a security system as the malware signature. The next time an attacker sends the same malware in a packet, the security system checks the packets for known signatures; if they are found, those packets are dropped immediately. To evade this type of detection system, attackers employ various techniques to ensure that no trace of longest common content is fingerprinted and used as a signature. One successful evasion technique is deploying polymorphic malware into the victim's system. A malware which exhibits a different form after every infection is called a polymorphic malware. Because it changes form after every infection, the signature initially found by a security system will have changed. If a security system searches for the same signature, it will fail. For this reason, attackers frequently use polymorphic malware to intrude into a vulnerable system.

To thwart this type of malware, it is important to understand its structure. A typical polymorphic malware contains an encrypted payload and a decryption routine. The encrypted payload contains the malicious instructions in encrypted form; it usually looks like junk data. This encrypted payload is usually appended to a decryption module.

[Figure: Structure of Polymorphic Malware]

Once this type of malware is executed on the victim's system, control is given to a decryptor, which decrypts the encrypted payload. The decrypted payload contains the malicious instructions. These are finally executed to compromise the vulnerable system.
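The encrypted payload plus decryptor layout described above can be illustrated with a harmless toy. The sketch below is not taken from the paper: the stub bytes, the one-byte XOR "encryption", and every function name are assumptions made for illustration. It shows that two variants carrying the same payload under different keys decode to the same content while sharing very few raw bytes, which is what defeats a longest-common-content signature.

```python
from difflib import SequenceMatcher

STUB = b"DECODER-STUB"              # hypothetical placeholder for the decryption routine
PAYLOAD = b"harmless demo payload"  # benign stand-in for the instructions


def xor_encode(data: bytes, key: int) -> bytes:
    """XOR every byte with a one-byte key; applying it twice restores the data."""
    return bytes(b ^ key for b in data)


def build_variant(key: int) -> bytes:
    """Lay out one variant as: stub, key byte, encoded payload."""
    return STUB + bytes([key]) + xor_encode(PAYLOAD, key)


def decode_variant(variant: bytes) -> bytes:
    """Mimic the decryptor: read the key after the stub, then decode the rest."""
    key = variant[len(STUB)]
    return xor_encode(variant[len(STUB) + 1:], key)


def longest_common_run(a: bytes, b: bytes) -> bytes:
    """Longest contiguous byte run shared by a and b, a naive content signature."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


variant_1 = build_variant(0x21)
variant_2 = build_variant(0x5A)
shared = longest_common_run(variant_1, variant_2)
print(decode_variant(variant_1) == decode_variant(variant_2))  # same decoded content
print(len(shared), len(variant_1))  # how little of each variant is shared
```

In this toy the only long shared run is the fixed stub, so a content signature would have to be built on the decryption routine alone. An encoder whose decoder stub also changes between variants removes even that anchor.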
Apart from encryption, there are many other techniques to deploy polymorphism. A few of them are:
• Garbage-code insertion is a technique where garbage instructions are inserted into a malware after every infection. For example, inserting many nop instructions after every infection makes it difficult for a security system to compare two instances of the same malware.
• Instruction-substitution employs polymorphism by replacing a piece of code with an equivalent but different one.
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 05,2024 at 11:44:45 UTC from IEEE Xplore. Restrictions apply.
2017 International Conference on Innovations in information Embedded and Communication Systems (ICIIECS)
• Code-transposition exhibits polymorphism by changing the execution order using jumps.
• Register-reassignment deploys polymorphism into a malware payload by simple reassignment of registers.

In this paper, we propose an efficient machine learning algorithm using ANOVA F-Test to end the menace of these malware. In addition, the features selected by ANOVA F-Test are deployed into Snort IDS to build an effective defense system. Our method achieved 97.7% accuracy.

II. BACKGROUND AND RELATED WORK

Using machine learning, researchers build models which capture the complex patterns of malware and benign files. These models are then used against unknown malware. In this section we discuss various machine learning techniques used by researchers to detect malware.

Machine learning techniques to detect malware were first introduced by Schultz et al. [1]. To classify malware, three different static features were used: strings, DLL function calls and n-byte sequences. They used the Ripper and Naive Bayes algorithms.

Kolter et al. [2] used different classification algorithms such as SVM, Naive Bayes and boosted versions of decision trees. They concluded that boosted decision trees give the best results compared to the others.

Nataraj et al. [3] proposed a k-nearest neighbor classifier for malware classification. They used image processing to visualize a malware binary as a gray-scale image and used these images to classify malware. This method was very fast in classification. But an attacker can beat such systems by inserting a lot of redundant data into the malware, because this technique classifies based on global image features.

Kong et al. [4] classified malware based on the call graph of each malware. Worms are clustered depending on their call graphs, i.e., worms belonging to the same family are clustered together. Then an individual classifier such as kNN is used to learn each family of malware. Finally, an ensemble classifier is used to classify a new variant of malware.

Tian et al. [5] used machine learning techniques in the WEKA library for malware classification. They used function length and frequency of bytes in malware as features. Function length is the number of bytes in the malware code. They observed that function length and byte frequency combined with other features give a scalable and fast classifier.

Santos et al. [6] developed a semi-supervised machine learning algorithm. It is very difficult to label all the malware being generated over the internet; in this context supervised machine learning algorithms are not effective. From this point of view they proposed a semi-supervised algorithm, LLGC (Learning with Local and Global Consistency), which learns from both labeled and unlabeled data. They studied the effect of the number of labeled instances on the model's accuracy. Their goal was to deliver a high-precision model with very few labeled instances.

Siddiqui et al. [7] built a machine learning algorithm which uses variable-length instruction sequences to detect worms. They used decision tree and random forest models for classification.

Firdausi et al. [8] analysed malware samples in Anubis, a sandbox environment. A sparse vector model was generated from the analysis reports and used for classification. About 220 malicious and 250 benign samples were used, and the performance of 5 different classifiers was compared. The best performance was achieved by the J48 decision tree.

III. DATA SET DESCRIPTION & EXPERIMENTAL SETUP

We generated polymorphic malware using the msfvenom framework in Kali Linux. This framework has a plethora of tools, such as polymorphic encoders, malicious payloads and no-op generators, which can be integrated for exploitation. Attackers use this platform as an express lane to attack a victim.

To generate malware in the msfvenom framework, we first have to choose a malicious payload; to make it polymorphic, different types of encoders are available. For our experiments, we chose the windows/adduser payload. Below is the image of a new user created by this malware on a Windows XP system:

In the msfvenom framework, we have different polymorphic encoders to simulate polymorphic malware. Encoders in this framework use (semi) direct injection, i.e., they directly change the AddressOfEntryPoint of the PE to the malcode's entry point. In the case of polymorphic malware, the AddressOfEntryPoint of the PE is changed to a decoder stub. This is illustrated in the diagram below:

We used the shikata-ga-nai encoder, which is rated excellent in this framework. As pointed out by [9], this encoder is special because the decoder stub uses code reordering and substitution to generate polymorphism in the malware. Additionally, the decoder stub itself is self-modifying, thereby evading security systems. With different numbers of iterations, we encoded the windows/adduser payload and generated 300 polymorphic malware executables.

To further test the detectability of these malware, we uploaded a few of them to the VirusTotal website; they were not detected by some of the anti-virus software, a few of them commercial. About 300 executables from our Windows system were collected as benign files for our experiments. So, from a total of 600 executables we chose 70% as our training set and 30% as our test set.
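As a rough illustration of the 70/30 split described above and of the ANOVA F-Test scoring the paper builds on, the following sketch uses only the Python standard library. It is an assumption-laden outline, not the authors' code: the function names, the per-class (stratified) split, the random seed, and the toy feature values are all hypothetical, and the feature representation of the executables is not given in this section.

```python
import random
from statistics import fmean


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across the classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = fmean(values)
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# 300 polymorphic (label 1) and 300 benign (label 0) executables, as in the text.
labels = [1] * 300 + [0] * 300
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 420 180

# Toy feature that separates the two classes well; the values are made up.
toy_values = [1, 2, 3, 4, 5, 6]
toy_labels = [0, 0, 0, 1, 1, 1]
print(anova_f(toy_values, toy_labels))  # 13.5
```

A feature with a large F value has class means that differ strongly relative to the spread inside each class. Ranking features by this score on the training indices only, then keeping the top ones, is one common way to perform the kind of selection the paper attributes to ANOVA F-Test.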
These graphs depict the increase in Accuracy, Precision, Recall and F1-Score values with ANOVA F-Test.

VI. CONCLUSION

In this paper, we developed a defense system which uses logistic regression with ANOVA F-Test and Snort IDS to thwart polymorphic malware. We simulated an attacker exploiting a vulnerable system in our lab. A Kali Linux system was used as the attacker system and a Windows XP system as the vulnerable one. Using the msfvenom framework in Kali Linux, we generated polymorphic malware. We used logistic regression with ANOVA F-Test and deployed the significant features into Snort IDS to detect polymorphic malware. This defense setup on our vulnerable Windows XP system proved to be effective in thwarting polymorphic malware. Our machine learning model's detection capabilities were greatly improved by ANOVA F-Test.

VII. ACKNOWLEDGMENT

Our work is dedicated to Bhagawan Sri Sathya Sai Baba, Founder Chancellor of Sri Sathya Sai Institute of Higher Learning.

REFERENCES
[1] Matthew G Schultz, Eleazar Eskin, F Zadok, and Salvatore J Stolfo, “Data mining methods for detection of new malicious executables,” in Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on. IEEE, 2001, pp. 38–49.
[2] Jeremy Z Kolter and Marcus A Maloof, “Learning to detect malicious executables in the wild,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004, pp. 470–478.
[3] Lakshmanan Nataraj, S Karthikeyan, Gregoire Jacob, and BS Manjunath, “Malware images: visualization and automatic classification,” in Proceedings of the 8th international symposium on visualization for cyber security. ACM, 2011, p. 4.
[4] Deguang Kong and Guanhua Yan, “Discriminant malware distance learning on structural information for automated malware classification,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013, pp. 1357–1365.
[5] Ronghua Tian, Lynn Margaret Batten, and SC Versteeg, “Function length as a tool for malware classification,” in Malicious and Unwanted Software, 2008. MALWARE 2008. 3rd International Conference on. IEEE, 2008, pp. 69–76.
[6] Igor Santos, Javier Nieves, and Pablo G Bringas, “Semi-supervised learning for unknown malware detection,” in International Symposium on Distributed Computing and Artificial Intelligence. Springer, 2011, pp. 415–422.
[7] Muazzam Siddiqui, Morgan C Wang, and Joohan Lee, “Detecting internet worms using data mining techniques,” Journal of Systemics, Cybernetics and Informatics, vol. 6, no. 6, pp. 48–53, 2008.
[8] Ivan Firdausi, Alva Erwin, Anto Satriyo Nugroho, et al., “Analysis of machine learning techniques used in behavior-based malware detection,” in Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on. IEEE, 2010, pp. 201–203.
[9] Ryan Farley and Xinyuan Wang, “Codext: Automatic extraction of obfuscated attack code from memory dump,” in International Conference on Information Security. Springer, 2014, pp. 502–514.
[10] Angela Orebaugh, Gilbert Ramirez, Jay Beale, and Joshua Wright, Wireshark & Ethereal Network Protocol Analyzer Toolkit, Syngress Publishing, 2007.
[11] Mohssen MZE Mohammed, H Anthony Chan, Neco Ventura, Mohsin Hashim, and Izzeldin Amin, “A modified knuth-morris-pratt algorithm for zero-day polymorphic worms detection,” in Security and Management, 2009, pp. 652–657.
[12] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay, “Scikit-learn: Machine learning in python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.