A New Malware Detection Model using
A New Malware Detection Model using
04) 24
Abstract
Machine learning is a set of procedures that make the computer capable to learn without being
expressly programmed. In other words, the machine-learning algorithm detects and validates the
principles that support the information. With this information, the algorithm can ’consult’ the
properties of previously unrecognized samples. At the detection of malware, a previously unrec-
ognized sample may be a new file to the program. Its subtle assets can be malware or benign.
The set of customized terms in terms of data is called a model. Machine learning has many broad
approaches that are needed in solving one approach. These methods have different capabilities and
different appropriate functions. In this work, we have used advanced machine learning algorithms
and use the most accurate algorithm in the proposed model to detect a file is malware or not.
Keywords: Cyber Security; Data Set; Malware Detection; Machine Learning; Prediction Model
1 Introduction
Being an emerging field of computer science, machine learning aims to enable computers to learn from
data instead of explicit programming. To use PB-level data to make selections and perform what is
impossible or best carried out in a certain situation, where the task for us humans is complex and time-
consuming [15, 18]. Malware is an imminent threat faced by companies and users every day. Whether
it’s phishing emails or exploits in the entire browser, coupled with multiple evasion methods and other
security vulnerabilities, today’s defence systems can no longer compete with them [7, 9].
Using of Internet has increased dramatically over the past few years as today’s community has been
relying on worldwide communication excessively. Simultaneously, it is being utilized extensively by
hackers and others with criminal intent, and with the emergence of the black market where they can
buy or use malicious resources. This gives robust motivation for attackers to modify and escalate the
complexity of the malicious program to reduce the possibility of success of antivirus programs which
leads to the new execution of similar malware that can spread out of control, again and again [16].
According to AV-TEST report, roughly 350,000 new malicious samples are being recorded daily. This
makes several problems for processing a big volume of unframed data acquired from the malware test
I.J. of Electronics and Information Engineering, Vol.13, No.1, PP.24-32, Mar. 2020 (DOI: 10.6636/IJEIE.202103 13(1).04) 25
which creates difficulties for anti-virus companies to identify every attack and upgrade the software at
the right time to restrain the penetration and spread [14].
Analyzing malware can provide incorrect data in many cases [25]. When obfuscating methods are
used for analyzing the difficulty is increased further, so it will be difficult to see what kind of malware
it is. The alternative for this is carrying out a rigorous analysis of malicious software’s behaviour that
can be a daunting job when we have to examine the increasing and growing figure of the new malicious
sample. Because of that issue, it is, consequently, necessary to enhance an extensive framework where
many malicious samples can be examined vigorously [22, 26].
New malware which appears every day is believed to be the most altered versions of the previous one
using enlightened replication method. An enormous number of samples have been used. Having a large
number of data sets contributes to the prophetic ability and dependability of the built model that gives
a presentable outcome. For this work, we will develop an ml-based prediction model, which can be used
as a preliminary filter system for all types of malware that may be classified from the unusual malicious
sample. This provides the chance to pass over the static analysis on previously identified samples and
concentrate particularly on testing the unusual sample that can increase the detection rate of anti-virus
software [1].
3 Problem Description
Malicious software, which is also called malware is one of the most serious problems in the cyber world
today. Malicious software created by hackers is mainly polymorphic or metamorphic that can alter
its code as it spreads [12, 21, 27]. Besides, the diversity and volume of their differentness undeniably
undermine the efficiency of traditional techniques that often use signature-based strategies and are
unable to identify previously unknown malware. Diversity of families with malware shares common
patterns of behaviour based on their origin and purpose. Behavioural or acquired behavioural patterns
can be utilized to identify the undefined malware sample from their identified family’s behaviour using
machine learning algorithms [5, 11, 24].
4 Working Model
In this work, we have used 5 different types of algorithms for predicting and comparing their results (See
Figure 1). The used machine learning algorithms are Random Forest [17], Decision Tree [28], Adaboost,
Gaussian Naive Bayes (GNB), Gradient Boosting [29]. To test the file of the model, the characteristics
of the given file need to be extracted. The ML model is used to predict the class of a given file based on
the trained model. The total dataset size is 50000 which has 50 independent variables and 1 dependable
variable. In the dataset, 80% is the training set and 20% is the testing dataset.
5 ML Algorithms
5.1 Random Forest
Random Forest [13] is a well-known supervised learning algorithm. It builds multiple decision trees on
different subsets of a given dataset and counts the common result to enhance the prediction accuracy.
The relationship between the number of trees in the forest and its outcome: if the number of trees is
larger, the prediction will be more accurate. For both Classification and Regression machine learning
problems, Random forest can be used. It is an ensemble model which means, it combines various
classifiers to resolve the complex issue and enhance the model performance. Random forest algorithm
counts the prediction of every tree instead of depending on a single tree and depend on the highest
predicted one, it provides the final prediction.
5.2 AdaBoost
It is an ensemble model that works particularly well with decision trees. The model key learns from
past mistakes like misclassified data pinots. The learning process of Adaboost is, adding the weight to
the misclassification data points as shown in (Algorithm 1).
AdaBoost algorithm predicts the new prediction by adding weight and multiplying a prediction of
every tree. The final prediction made by the forest as a whole is decided by the group of decision trees
with the greatest total.
5: Update the weight of the decision tree, so the next tree can learn from the previous decision tree’s
mistakes.
6: Create a new dataset.
7: Repeat Steps 2 to 6 up to the number of trees we set for training.
8: Make the final prediction using the forest of decision trees.
from the previous training dataset. In this algorithm, we start at the root of the tree for predicting a
record’s class label. Based on the comparison between the values of the root attribute and the record
attribute, we follow the branch depend on that value and proceed to the following node.
6 Implementation
6.1 Data Pre-processing
Data that is collected from the real-world is often noisy, missing values, and remains in an unusable
format that cannot be used in the machine learning model directly. It is a necessary task to refine the
data and make it useable for the machine learning model, which in turn enhances the precision and
efficiency of the machine learning model. It involves the following steps shown in (Algorithm 3).
At last finding out the important features form the sample to perform the workflow. Best algorithm
with max accuracy is Random Forest and Gaussian Naive Bayes (GNB) has the least accuracy. The
main cause for Naive Bayes’ has the least accuracy rate is that the notion of conditional independence
among the attributes in this classifier is weak and is rarely suited to real problems except in cases where
attributes are extracted from independent systems. Implicitly, Naive Bayes considers that all attributes
are completely independent. It’s almost impossible for us to find a set of completely independent
predictors in real life. Besides, if a component has not been identified in the training data set and
which is available on a phase variable, the algorithm will assume a probability of 0 (zero) and cannot
make a prediction, which is commonly known as zero frequency. Even for Gaussian equations, Naive
I.J. of Electronics and Information Engineering, Vol.13, No.1, PP.24-32, Mar. 2020 (DOI: 10.6636/IJEIE.202103 13(1).04) 29
95
90
Percentage
Random Forest
85 AdaBoost
Gradient Boosting
80 Decision Trees
Gaussian Naive Bayes
75
70
65
Algorithms
Bayes carries the weakness of the conditional independence of the attributes. On the other hand, the
random forest is a better algorithm to produce a forecasting model, for both classification and regression
problems. The default hyperparameters feature of this model gives excellent results and the system
works well to prevent overfitting. Furthermore, a good indicator of the importance is that it provides
facilities.
7 Conclusions
In this work, our prime focus was to develop a machine learning model which can identify malicious
samples as truthfully as possible, with having a zero false-positive rate. To achieve it, we have preferred
different machine learning algorithms and got the highest accuracy rate for the Random forest algorithm.
After that, we used the random forest algorithm for malware detection and our proposed model has
successfully detected the malware sample. We were almost near to our target, even though not having
a zero false-positive rate still. Our model false-positive rate was 0.48992% and the false-negative rate
was 1.008782%. So, to be a segment of a competitive business outcome, various alternative strategies
must be added to this model. According to our observation, machine learning-based malware detection
model is also the best technology along with the conventional detection techniques like anti-virus. It is
very important to note that to enhance the malware detection rate, the machine learning models are
highly essential and also result in better accuracy.
References
[1] Mouhammd Al-Kasassbeh, Safaa Mohammed, Mohammad Alauthman, and Ammar Almomani,
“Feature selection using a machine learning to classify a malware,” in Handbook of Computer
Networks and Cyber Security, pp. 889–904, Springer, 2020.
[2] Abdullah A. Al-khatib and Waleed A. Hammood, “Mobile malware and defending systems: Com-
parison study,” International Journal of Electronics and Information Engineering, vol. 6, no. 2,
pp. 116–123, 2017.
[3] K. Dharmarajan, A. S. Arunachalam, S. Vaishnavi Sree, “Malware detection and classification
using random forest and adaboost algorithms,” International Journal of Innovative Technology
and Exploring Engineering, vol. 8, pp. 2863–2868, 2019.
[4] Ahmed Bensaoud, Nawaf Abudawaood, and Jugal Kalita, “Classifying malware images with con-
volutional neural network models,” arXiv preprint arXiv:2010.16108, 2020.
[5] William Blanc, Lina G Hashem, Karim O Elish, and MJ Hussain Almohri, “Identifying android
malware families using android-oriented metrics,” in 2019 IEEE International Conference on Big
Data (Big Data’19), pp. 4708–4713, IEEE, 2019.
[6] Sen Chen, Minhui Xue, Lingling Fan, Shuang Hao, Lihua Xu, Haojin Zhu, and Bo Li, “Automated
poisoning attacks and defenses in malware detection systems: An adversarial machine learning
approach,” Computers & Security, vol. 73, pp. 326–344, 2018.
[7] Tushaar Gangavarapu, C. D. Jaidhar, and Bhabesh Chanduka, “Applicability of machine learning
in spam and phishing email filtering: Review and approaches,” Artificial Intelligence Review, pp. 1–
63, 2020.
[8] Daniel Gibert, Carles Mateu, and Jordi Planes, “The rise of machine learning for detection and
classification of malware: Research developments, trends and challenges,” Journal of Network and
Computer Applications, vol. 153, p. 102526, 2020.
I.J. of Electronics and Information Engineering, Vol.13, No.1, PP.24-32, Mar. 2020 (DOI: 10.6636/IJEIE.202103 13(1).04) 31
[9] Nikhil Govil, Kunal Agarwal, Ashi Bansal, and Astha Varshney, “A machine learning based spam
detection mechanism,” in Fourth International Conference on Computing Methodologies and Com-
munication (ICCMC’20), pp. 954–957, IEEE, 2020.
[10] Anand Handa, Ashu Sharma, and Sandeep K Shukla, “Machine learning in cybersecurity: A
review,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 9, no. 4,
pp. e1306, 2019.
[11] Manabu Hirano and Ryotaro Kobayashi, “Machine learning based ransomware detection using
storage access patterns obtained from live-forensic hypervisor,” in Sixth International Conference
on Internet of Things: Systems, Management and Security (IOTSMS’19), pp. 1–6, IEEE, 2019.
[12] Li-Chin Huang, Chun-Hsien Chang, and Min-Shiang Hwang, “Research on malware detection and
classification based on artificial intelligence,” International Journal of Network Security, vol. 22,
no. 5, pp. 717–727, 2020.
[13] Areeba Irshad, Ritesh Maurya, Malay Kishore Dutta, Radim Burget, and Vaclav Uher, “Feature
optimization for run time analysis of malware in windows operating system using machine learn-
ing approach,” in 42nd International Conference on Telecommunications and Signal Processing
(TSP’19), pp. 255–260, IEEE, 2019.
[14] Nitesh Kumar, Subhasis Mukhopadhyay, Mugdha Gupta, Anand Handa, and Sandeep K Shukla,
“Malware classification using early stage behavioral analysis,” in 14th Asia Joint Conference on
Information Security (AsiaJCIS’19), pp. 16–23, IEEE, 2019.
[15] Ori Or-Meir, Nir Nissim, Yuval Elovici, and Lior Rokach, “Dynamic malware analysis in the modern
era state of the art survey,” ACM Computing Surveys, vol. 52, no. 5, pp. 1–48, 2019.
[16] R Mohanasundaram P HarshaLatha, “Classification of malware detection using machine learning
algorithms: A survey,” International Journal of Scientific & Technology Research, vol. 9, no. 2,
pp. 1796–1802, 2020.
[17] S Abijah Roseline and S Geetha, “Intelligent malware detection using oblique random forest
paradigm,” in International Conference on Advances in Computing, Communications and Infor-
matics (ICACCI’18), pp. 330–336, IEEE, 2018.
[18] Asaf Shabtai, Yuval Fledel, and Yuval Elovici, “Automated static code analysis for classifying
android applications using machine learning,” in International Conference on Computational In-
telligence and Security, pp. 329–333, IEEE, 2010.
[19] Sanjay Sharma, C Rama Krishna, and Sanjay K Sahay, “Detection of advanced malware by machine
learning techniques,” in Soft Computing: Theories and Applications, pp. 333–342, Springer, 2019.
[20] Latika Singh and Markus Hofmann, “Dynamic behavior analysis of android applications for mal-
ware detection,” in International Conference on Intelligent Communication and Computational
Techniques (ICCT’17), pp. 1–7, IEEE, 2017.
[21] Ming-Yang Su, Jer-Yuan Chang, and Kek-Tung Fung, “Android malware detection approaches in
combination with static and dynamic features.,” International Journal of Network Security, vol. 21,
no. 6, pp. 1031–1041, 2019.
[22] Zheng-Zhi Tang, Xuewen Zeng, Zhichuan Guo, and Mangu Song, “Malware traffic classification
based on recurrence quantification analysis.,” International Journal of Network Security, vol. 22,
no. 3, pp. 449–459, 2020.
[23] Daniele Ucci, Leonardo Aniello, and Roberto Baldoni, “Survey of machine learning techniques for
malware analysis,” Computers & Security, vol. 81, pp. 123–147, 2019.
[24] Matúš Uchnár and Peter Fecil’ak, “Behavioral malware analysis algorithm comparison,” in IEEE
17th World Symposium on Applied Machine Intelligence and Informatics (SAMI’19), pp. 397–400,
IEEE, 2019.
[25] R Vinayakumar, Mamoun Alazab, KP Soman, Prabaharan Poornachandran, and Sitalakshmi
Venkatraman, “Robust intelligent malware detection using deep learning,” IEEE Access, vol. 7,
pp. 46717–46738, 2019.
I.J. of Electronics and Information Engineering, Vol.13, No.1, PP.24-32, Mar. 2020 (DOI: 10.6636/IJEIE.202103 13(1).04) 32
[26] Di Xue, Jingmei Li, Tu Lv, Weifei Wu, and Jiaxiang Wang, “Malware classification using proba-
bility scoring and machine learning,” IEEE Access, vol. 7, pp. 91641–91656, 2019.
[27] Hanqing Zhang, Senlin Luo, Yifei Zhang, and Limin Pan, “An efficient android malware detection
system based on method-level behavioral semantic analysis,” IEEE Access, vol. 7, pp. 69246–69256,
2019.
[28] Boyou Zhou, Anmol Gupta, Rasoul Jahanshahi, Manuel Egele, and Ajay Joshi, “Hardware perfor-
mance counters can detect malware: Myth or fact?,” in Proceedings of the 2018 on Asia Conference
on Computer and Communications Security, pp. 457–468, 2018.
[29] Qingguo Zhou, Fang Feng, Zebang Shen, Rui Zhou, Meng-Yen Hsieh, and Kuan-Ching Li, “A novel
approach for mobile malware classification and detection in android systems,” Multimedia Tools
and Applications, vol. 78, no. 3, pp. 3529–3552, 2019.
Biography
Anik Dewanje is currently pursuing a bachelor’s degree program in computer science and engineering
in specialization with information security at Vellore Institute of Technology, Vellore, India.
Dr Kakelli Anil Kumar is currently working as an associate professor in the school of computer
science and engineering at Vellore Institute of Technology, Vellore, India.