Behavior-based features model for malware detection
J Comput Virol Hack Tech
DOI 10.1007/s11416-015-0244-0
ORIGINAL PAPER
malware used by AV software. Of course, modern AV software also depends on a heuristic engine component that detects unknown malware instances based on a set of rules [6]. A signature-based detection technique matches a previously generated set of signatures against suspicious samples. A signature is a sequence of bytes at specific locations within the executable, a regular expression, a hash value of binary data, or any other format created by a malware analyst that should accurately identify malware instances. This approach has at least three major drawbacks [7]. First, signatures are commonly created by humans; this is an error-prone task and can lead to a signature that falsely flags a benign program. Second, depending on signatures of previously analyzed malware inherently prevents the detection of unknown malware for which no signatures yet exist. Finally, malware samples use obfuscation techniques such as packing, polymorphism, and metamorphism [8] to evade signature-based detection, since signatures are sensitive to the smallest changes in the malware binary image.
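To make the signature formats above concrete, the following is a minimal sketch (not taken from the paper or from any AV product) of how a scanner might match the three kinds of signatures mentioned: a byte pattern at a fixed offset, a regular expression, and a hash of the binary. All signature values and names here are hypothetical placeholders.

```python
import hashlib
import re

# Hypothetical example signatures illustrating the three formats discussed above.
BYTE_SIGNATURES = [(0x1F0, bytes.fromhex("deadbeef"))]        # (offset, expected bytes)
REGEX_SIGNATURES = [re.compile(rb"cmd\.exe /c .{0,40}\.vbs")]  # pattern over raw bytes
HASH_SIGNATURES = {"0" * 64}                                   # placeholder SHA-256 digest

def matches_signature(path):
    """Return True if the file matches any byte, regex, or hash signature."""
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() in HASH_SIGNATURES:
        return True
    for offset, pattern in BYTE_SIGNATURES:
        if data[offset:offset + len(pattern)] == pattern:
            return True
    return any(rx.search(data) for rx in REGEX_SIGNATURES)
```

Even this toy matcher shows why the approach is brittle: flipping a single byte of the payload defeats the hash and byte-pattern rules, which is exactly the sensitivity the obfuscation techniques above exploit.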
On the contrary, behavior-based techniques assume that malware can be detected by observing the malicious behaviors it exhibits during runtime [9]. They do not suffer from the limitations of signature-based techniques, since they make their decision after observing malware actions rather than looking for previously known signatures. Therefore, they are effective in detecting malware variants that share similar behaviors yet have different structures. Nonetheless, they suffer from a high false positive rate, where a benign program is falsely classified as malicious. Additionally, they can be evaded by mimicry attacks, which reconstruct the malicious behavior so that it appears legitimate. For example, a Trojan could inject its code into a web-browser application so that it gains the privileges of the web browser; its communication with the attacker can then pass through the firewall successfully, since it appears to come from the web-browser application rather than from another suspect process.

1.2 Malware analysis techniques

The signature-based and behavior-based detection techniques depend on a variety of malware analysis techniques. Malware analysis is the art of dissecting malware to understand how it works, how to identify it, and how to defeat or eliminate it. While malware appears in many different forms, three common techniques exist for malware analysis: static analysis, dynamic analysis, and hybrid analysis [10].

Static analysis is a passive method in the sense that the malware sample is not executed; instead, it is inspected using tools such as disassemblers and executable analyzers. Static analysis has many advantages. First, it is considered a safe analysis method, since the malware is not executed and there is less chance of infecting the analysis machine. Second, disassembling the malware provides information about all possible execution paths that might be taken by the malware. However, a packed malware sample poses quite a challenge for static analysis, since considerable experience and skill are required to figure out the unpacking routine and extract the real payload [11].

On the other hand, dynamic analysis [9] is considered an active method. It involves executing the malware and monitoring its actions and impacts on the system. Unlike static analysis, it provides information only about the execution path that actually runs. The malware sample is analyzed within a controlled environment such as a virtual machine (VM) [9]. Dynamic analysis has a considerable time overhead compared to static analysis. Additionally, the increased usage of VMs in dynamic analysis has inspired malware authors to incorporate additional code to detect the VM presence [12]. Hence, once the malware sample detects a VM, it can either infect the host machine by exploiting vulnerabilities found in the VM or change its execution path and turn into a passive process without any malicious impact on the system [12].

In this paper, we aim to provide a behavior-based features model used by an AV filtering tool to cope with the increasing release rate of malware variants. The proposed model describes the malicious actions exhibited by malware during runtime. It is extracted by performing dynamic analysis on a relatively recent malware dataset. We employ an API hooking library [13] to log information about API calls and their parameters; we then further process these API calls into sets of sequences that share a common semantic purpose. After that, the sequences are analyzed by a set of heuristic functions to infer representative semantic features, which we refer to as actions.

The contributions of the paper are as follows:

– We provide a new processing approach on the raw information gathered by API call hooking and produce a set of actions representing the malicious behaviors exhibited.
– We demonstrate the semantic value provided by actions and their insight to help malware analysts.
– We assess actions as a feasible features model and employ various classification techniques to evaluate its accuracy.

The rest of this paper is organized as follows. Section 2 provides a review of related techniques for extracting malware detection features, while we describe the process of extracting actions in Section 3. In Section 4, we evaluate the efficiency and value of actions through various experiments. Finally, conclusions and limitations are presented in Section 5.
2 Related work

In this section, we cover research efforts that claim to detect malware variants. We group the research techniques into three categories, namely statistical-based, graph-based, and structural-based.

2.1 Statistical-based techniques

Wong and Stamp [8] proposed a technique to detect metamorphic malware based on the hidden Markov model (HMM) and provided a benchmark used in other studies on metamorphic malware, such as Canfora et al. [14], Kalbhor et al. [15], Lin and Stamp [16], Musale et al. [17], and Shanmugam et al. [18]. They analyzed the degree of metamorphism produced by different malware generators, such as G2, MPCGEN, NGVCK, and VCL32, by training an HMM on the opcode sequences of the metamorphic malware samples.

Annachhatre et al. [19] used HMM analysis to detect certain challenging classes of malware. In their research, they considered the related problem of classifying malware variants based on HMMs. More than 8,000 malware variants were scored against these models and separated into clusters based on the resulting scores. They observed that the clustering results could be used to group the malware samples into their appropriate families with good accuracy.

Faruki et al. [20] used API call-grams to detect malware. An API call-gram captures the sequence in which API calls are made in a program. First, a call graph is generated from the disassembled instructions of a binary program. This call graph is converted to call-grams, which become the input to a pattern matching engine.
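As an illustration of the call-gram idea described by Faruki et al. [20], the sketch below builds fixed-length n-grams over an ordered API call sequence. The example trace and the gram size of 3 are assumptions made for illustration, not values taken from that work.

```python
from collections import Counter

def call_grams(api_calls, n=3):
    """Slide a window of size n over the ordered API call sequence
    and count each resulting call-gram."""
    grams = zip(*(api_calls[i:] for i in range(n)))
    return Counter(grams)

# Hypothetical API call sequence recovered from a call-graph traversal.
trace = ["CreateFileW", "WriteFile", "CloseHandle",
         "CreateProcessW", "WriteProcessMemory", "ResumeThread"]
print(call_grams(trace, n=3))
```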
2.2 Graph-based techniques

Park et al. [21] proposed a method to construct a common behavioral graph representing the execution behavior of a family of malware instances. The method generates one common behavioral graph by clustering a set of individual behavioral graphs, which represent kernel objects and their attributes based on system call traces. The resulting common behavioral graph has a common path, called HotPath, which is observed in all the malware instances of the same family. The derived common behavioral graph is highly scalable regardless of newly added instances. It is also robust against system call attacks.

Eskandari and Hashemi [22] proposed a technique that uses a control flow graph (CFG) to represent the control structure and the semantic aspects of a malware sample. The extracted CFG is annotated with API calls only, rather than assembly instructions. This new representation model is referred to as API-CFG. Finally, they converted the resulting graphs into binary feature vectors and trained various classification techniques on them.

2.3 Structural-based techniques

Eskandari et al. [5] presented a novel approach that utilizes machine learning techniques and takes advantage of hybrid analysis methods in order to improve the accuracy of the malware analysis procedure while keeping its speed at a reasonable point. They called their approach HDM-Analyzer, which stands for Hybrid Analyzer based on Data Mining techniques. They used dynamic analysis to extract API call sequences during the execution of a malware sample; meanwhile, they used static analysis to extract an Enriched Control Flow Graph (ECFG) that incorporates information about API calls. After the extraction of features, they used a matching engine that combines the features obtained by dynamic analysis with the corresponding ECFG, so that each conditional jump receives a label according to the dynamic information. At this point, a machine learning algorithm is employed to build a learning model from the labeled nodes of the ECFG. This learning model is used by HDM-Analyzer at scanning time to analyze unknown executable files.

Islam et al. [23] proposed a classification approach based on features extracted from static and dynamic analysis. During static analysis, they extracted Function Length Frequency (FLF) and Printable String Information (PSI) vectors. The FLF feature is based on counting the number of functions in different length ranges, or bins; they derived a vector interpretation of an executable file based on the number of bins chosen and where the function lengths lie across the bins. For PSI, they extracted all printable strings from the malware samples to create a global list of strings. Then, for each sample, they reported the count of distinct strings, followed by a binary report on the presence of each string in the global list, where a 1 represents the fact that the string is present and a 0 that it is not. On the other hand, during dynamic analysis they extracted API features such as API function names and parameters from the log files; they again constructed a global list and used a binary vector where a 1 indicates that an API function in the global list is called by the sample, and a 0 otherwise.
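The global-list encoding used for PSI, and similarly for the dynamic API features of Islam et al. [23], can be sketched as follows. The global list and the per-sample strings below are made-up placeholders, and the exact feature layout in [23] may differ.

```python
def binary_feature_vector(sample_strings, global_list):
    """Encode a sample as [count of distinct strings] followed by
    one 0/1 entry per string in the global list."""
    present = set(sample_strings)
    return [len(present)] + [1 if s in present else 0 for s in global_list]

# Hypothetical global list built from all samples, and one sample's strings.
global_list = ["kernel32.dll", "cmd.exe", "http://", "RegSetValueExW"]
sample = ["cmd.exe", "http://", "cmd.exe"]
print(binary_feature_vector(sample, global_list))  # [2, 0, 1, 1, 0]
```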
3 The actions model

In this section, we demonstrate the dynamic analysis technique used to extract the actions model, as outlined in Fig. 1. Each malware sample passes through three stages: API extraction, sequence extraction, and action extraction, which we discuss below.
The output is referred to as the API-Trace, which is a log file with each line formatted as (l, a, r, p1, ..., pn), where l is the line number, a is the API function name, r is the return value of the API, and pi is the ith parameter's value. Figure 2 shows an example of the output produced by this stage.
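A minimal sketch of reading such a trace is shown below. The on-disk layout (one comma-separated record per line, in the (l, a, r, p1, ..., pn) order) and the field names are assumptions made purely for illustration, since the concrete file format produced by the hooking tool is not specified here.

```python
from collections import namedtuple

ApiCall = namedtuple("ApiCall", ["line", "name", "ret", "params"])

def parse_api_trace(path):
    """Parse an API-Trace file assuming one comma-separated record
    (l, a, r, p1, ..., pn) per line."""
    calls = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            fields = [x.strip() for x in raw.rstrip("\n").split(",")]
            if len(fields) < 3:
                continue  # skip malformed lines
            line_no, name, ret, *params = fields
            calls.append(ApiCall(int(line_no), name, ret, params))
    return calls
```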
We use the handle value to identify data dependence among API calls of the top three categories, while API calls of the last category are represented individually in separate sequences, as shown in Fig. 3. We refer to the collection of all extracted sequences as the Sequence-Trace.
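The grouping step can be sketched as below: calls that share a handle value are chained into one sequence, and calls without a shared handle each form their own sequence. Which parameter carries the handle, and the category test, are caller-supplied placeholders here, since those details depend on per-API metadata not reproduced in this excerpt.

```python
from collections import defaultdict

def build_sequence_trace(calls, handle_of, is_handle_based):
    """Group API calls into sequences keyed by the handle they operate on.
    `handle_of(call)` extracts the relevant handle value (or None) and
    `is_handle_based(call)` says whether the call belongs to the
    handle-bearing categories; both are hypothetical helpers."""
    by_handle = defaultdict(list)
    standalone = []
    for call in calls:
        handle = handle_of(call) if is_handle_based(call) else None
        if handle is not None:
            by_handle[handle].append(call)
        else:
            standalone.append([call])  # one sequence per independent call
    return list(by_handle.values()) + standalone
```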
To extract actions from sequence-traces, we use a set of heuristic functions that infer unique actions. A heuristic function selects a sequence of API calls based on their API category. In Table 1, we list some of the actions produced by the heuristic functions. Actions consist of fields with a semantic value that describe the behavior of a sample. The collection of actions for a given sample is referred to as the Action-Trace, where each action is formatted as (N, Fi = Vi, ...), such that N is the action name and Fi and Vi are the ith field and value, respectively. For example, the output of this stage after processing the sequence-trace extracted above is shown in Fig. 4.
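As a concrete illustration of what a heuristic function over a sequence might look like (this is not one of the actual functions behind Table 1), the sketch below scans a file-related sequence for a file creation whose path is later passed to CreateProcessW, and emits a hypothetical "DropAndExecute" action with named fields.

```python
def infer_drop_and_execute(sequence):
    """Hypothetical heuristic: if a sequence creates a file and the same path
    is later passed to CreateProcessW, emit a DropAndExecute action.
    Each call is expected to expose .name and .params as in the parsed trace."""
    dropped_path = None
    for call in sequence:
        if call.name == "CreateFileW" and call.params:
            dropped_path = call.params[0]          # lpFileName
        elif call.name == "CreateProcessW" and call.params:
            if dropped_path and call.params[0] == dropped_path:
                return ("DropAndExecute", {"path": dropped_path})
    return None

def extract_action_trace(sequence_trace, heuristics):
    """Run every heuristic function over every sequence and collect actions."""
    return [a for seq in sequence_trace for h in heuristics
            if (a := h(seq)) is not None]
```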
The unique value of the proposed actions model is the high-level insight it gives to the malware analyst compared to techniques based on n-grams, such as API n-grams [20], or techniques based on CFGs, such as the API-CFG proposed in [22]. This insight is very helpful for an AV malware analyst when it comes to filtering thousands of submitted samples, as it gives a brief report on the actions exhibited by a sample during its runtime. Moreover, malware analysts can design new heuristic functions based on their expertise to infer additional complex actions.

In this section, we describe the dataset used during the experiments and the extraction of actions. Then, we present an evaluation of the proposed features model's performance and provide insights gained from the experiments. Finally, we compare our work with a recent technique discussed in the related work.

4.1 Dataset

In this research, we have separate datasets for malware and benign samples. The malware dataset has a diverse number of malware families for different malware types, and each family is represented by an equal number of different variants.

We downloaded 9993 samples from VirusSign [24] in the period from October 14, 2013 to March 2, 2014. However, the obtained samples are labeled by MD5 hash values, which do not provide any information about their malware family. We scanned the samples with AV software to identify their malware families. Then, we selected 2000 samples covering 50 different malware families, such that each family is represented by 40 malware variants. A partial list of malware families along with their types is given in Table 2.
Table 2 Part of the malware families in the dataset
Backdoor: Androm, Bifrose, DarkKomet, Hupigon, Kelihos, Zegost
Trojan: Buzus, Graftor, Sirefef, Urusay, Vundo, ZBot
Virus: Alman, Chir, Elkern, Jadtre, Neshta, Sality
Worm: Ratab, Allaple, Darkbot, Fesber, Mydoom, Vofbus

3. False positive (FP) is the number of benign samples incorrectly classified as malicious.
4. False negative (FN) is the number of malware samples incorrectly classified as benign.

These terms are used to define four performance comparison criteria between DT, RF, and SVM. The first criterion is Sensitivity, which measures the proportion of actual positives (malware samples) that are correctly identified:

Sensitivity = TP / (TP + FN)   (1)
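For reference, the sketch below computes sensitivity together with the other criteria reported later (specificity and accuracy) from raw confusion-matrix counts. It is a generic illustration of Eq. (1) and of the standard definitions of those measures, not code from the paper; the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix criteria; the sensitivity term is Eq. (1)."""
    sensitivity = tp / (tp + fn)                 # true positive rate, Eq. (1)
    specificity = tn / (tn + fp)                 # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"sensitivity": sensitivity,
            "specificity": specificity,
            "accuracy": accuracy}

# Hypothetical counts for one test split.
print(classification_metrics(tp=1946, tn=579, fp=21, fn=54))
```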
Table 3 Classifier results for the proposed features model
Algorithm  Sensitivity  Specificity  Accuracy  AUC
DT   97.3 %   96.53 %  97.19 %  97.65 %
RF   97.19 %  96.35 %  96.84 %  99.48 %
SVM  92.28 %  96.35 %  93.98 %  98.55 %

Fig. 5 ROC curves of RF, SVM, and DT classifiers

We used the Orange machine learning toolbox [28] to train and test the classifiers. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Figure 6 outlines the workflow between the different widgets employed to train and test the DT, RF, and SVM classifiers. First, we loaded the binary vectors of all samples into a data table. Then, we filtered the features by selecting only those with an information gain above a certain threshold. The threshold value was obtained after carrying out several experiments to achieve the highest classification accuracy; we used 0.03 as the threshold. The feature selection resulted in a reduced dataset that we fed to the DT, RF, and SVM classifiers. The default parameters of the classifiers were used without changes, and 10-fold cross validation was used to validate the accuracy achieved by the classifiers. After training and testing the classifiers, we report the evaluation results in Table 3; the ROC curves for each classifier are shown in Fig. 5.

Fig. 6 Work-flow between widgets of the Orange toolbox to train and test RF, SVM, and DT and perform ROC analysis
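The experiments above were run with Orange widgets (Fig. 6). Purely to make the steps concrete, the following scikit-learn sketch performs the same sequence: filter binary feature vectors by an information-gain-style score (approximated here with mutual information) at the 0.03 threshold quoted above, then run 10-fold cross validation with the three classifiers. Function and variable names are ours, and X/y stand for the 0/1 action vectors and labels as NumPy arrays.

```python
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def evaluate(X, y, threshold=0.03):
    """Filter features by mutual information, then 10-fold cross-validate."""
    gain = mutual_info_classif(X, y, discrete_features=True)
    X_reduced = X[:, gain > threshold]
    results = {}
    for name, clf in [("DT", DecisionTreeClassifier()),
                      ("RF", RandomForestClassifier()),
                      ("SVM", SVC())]:
        scores = cross_val_score(clf, X_reduced, y, cv=10, scoring="accuracy")
        results[name] = scores.mean()
    return results
```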
During the experiments, we investigated the reasons for false positives and false negatives. An example of a false positive is the Virtual-CD setup program. It exhibited many actions similar to those of malware samples; for example, it installed a service driver in a way similar to rootkit behavior and set an auto-start extensibility point in the registry.

On the other hand, one of the IRCBot malware samples escaped detection due to an inherent drawback of dynamic analysis. Precisely, it did not show its malicious actions because it detected virtual machine artifacts.

We have implemented the technique of [19] with 50 hidden Markov models (HMMs), one for each malware family in our malware dataset. The opcode sequences were extracted using the ObjDump tool. We used the following parameters: the number of states N = 2, the number of distinct symbols S = 827 (i.e., the number of different opcodes), the number of iterations = 800, and likelihood = 0.001. We used a vector that holds the scores from all HMMs as the feature vector. We noticed that, since we trained the HMMs only on malware families, benign samples have low scores on all models, while malware samples have at least one model that produces a high score. The evaluation results are given in Table 4.
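A sketch of the per-family scoring used in that comparison is shown below, using the discrete-emission model of the hmmlearn package (CategoricalHMM in recent releases; older releases call it MultinomialHMM) as a stand-in implementation. The two-state and 800-iteration settings mirror the parameters quoted above, and the quoted likelihood value of 0.001 is mapped to hmmlearn's convergence tolerance as an assumption; everything else (function names, data layout) is hypothetical.

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM

def train_family_models(family_opcode_seqs, n_states=2, n_iter=800, tol=0.001):
    """Train one HMM per malware family on its integer-coded opcode sequences.
    `family_opcode_seqs` maps family name -> list of opcode index sequences."""
    models = {}
    for family, seqs in family_opcode_seqs.items():
        X = np.concatenate([np.asarray(s).reshape(-1, 1) for s in seqs])
        lengths = [len(s) for s in seqs]
        m = CategoricalHMM(n_components=n_states, n_iter=n_iter, tol=tol)
        models[family] = m.fit(X, lengths)
    return models

def score_vector(models, opcode_seq):
    """Log-likelihood of one sample's opcode sequence under every family model."""
    X = np.asarray(opcode_seq).reshape(-1, 1)
    return [models[f].score(X) for f in sorted(models)]
```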
Table 4 Classification results for the HMM-based features model [19]
Algorithm  Sensitivity  Specificity  Accuracy  AUC
DT   97.6 %   97.63 %  96.89 %  97.72 %
RF   97.1 %   96.3 %   96.14 %  97.68 %
SVM  93.17 %  95.45 %  94.8 %   95.55 %

While the results of this state-of-the-art technique and our proposed features model are comparable in terms of performance, there are some key advantages to using actions. First, actions provide helpful semantic insight into malware behavior for the malware researcher. Second, the proposed technique for extracting actions is extensible, as it relies on a set of heuristic functions that can be improved by the technical expertise of the malware researcher; this leads to extracting complex new actions exhibited by evolved malware samples. Finally, actions are easier for non-experts to understand than statistical HMMs, as indicated previously in Fig. 4.

5 Conclusion

In this paper, we proposed a behavior-based features model to describe the malicious actions exhibited by malware during runtime. The proposed features model is referred to as actions; it is created by performing dynamic analysis over a relatively recent malware dataset. In the dynamic analysis, we used an API hooking technique to trace the API calls invoked by a malware sample; we then further processed the traced API calls into groups of semantically dependent API calls. These API sequences are further processed by a set of heuristic functions that extract the actions.

The unique value of actions is the high-level insight they give to the malware analyst. More precisely, actions describe the malicious behaviors of a sample with a better semantic value than API n-gram based techniques. Additionally, malware analysts can utilize their technical knowledge by adding more heuristic functions to extract additional actions. During the experiments, we assessed the actions as features for classifying malware and benign programs. Based on the experimental results, the classifiers achieved high classification accuracy rates.

5.1 Limitations

The technique used to extract the actions has some limitations inherent to dynamic analysis methods, which observe only a partial behavior of a malware sample. In other words, it is not suitable for malware samples that depend on external events, such as receiving a remote command, or that wait for a specific time to trigger their malicious actions. Additionally, the dynamic analysis fails against malware samples that check for the existence of virtual machine artifacts [12] and become a passive process or simply terminate themselves, resulting in clean actions. Finally, the extraction technique is not fully automatic, since it requires the support of a malware analyst to create the set of heuristic functions.

5.2 Future work

Future work on the actions includes refining them by utilizing data flow dependence, which should provide more insight into malware behavior. Additionally, the tool that extracts API call information can be modified to work outside the virtual machine environment; this approach provides a more robust solution against malware samples that check for the existence of the mentioned tool.

References

1. Fossi, M., Egan, G., Haley, K., Johnson, E., Mack, T., Adams, T., Blackbird, J., Low, M.K., Mazurek, D., McKinney, D., et al.: Symantec internet security threat report trends for 2010, vol. 16 (2011)
2. Gennari, J., French, D.: Defining malware families based on analyst insights. In: 2011 IEEE International Conference on Technologies for Homeland Security (HST), pp. 396–401. IEEE (2011)
3. Mairh, A., Barik, D., Verma, K., Jena, D.: Honeypot in network security: a survey. In: Proceedings of the 2011 International Conference on Communication, Computing & Security, pp. 600–605. ACM (2011)
4. Kiemt, H., Thuy, N.T., Quang, T.M.N.: A machine learning approach to anti-virus system (artificial intelligence i). IPSJ SIG Notes. ICS 2004(125), 61–65 (2004)
5. Eskandari, M., Khorshidpour, Z., Hashemi, S.: HDM-Analyser: a hybrid analysis approach based on data mining techniques for malware detection. J. Comput. Virol. Hacking Tech. 9(2), 77–93 (2013)
6. Kaspersky: Heuristic analysis in anti-virus. http://support.kaspersky.com/8641 (2013). Accessed 1 April 2015
7. Moser, A., Kruegel, C., Kirda, E.: Limits of static analysis for malware detection. In: Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007), pp. 421–430 (2007)
8. Wong, W., Stamp, M.: Hunting for metamorphic engines. J. Comput. Virol. 2(3), 211–229 (2006)
9. Egele, M., Scholte, T., Kirda, E., Kruegel, C.: A survey on automated dynamic malware-analysis techniques and tools. ACM Comput. Surv. (CSUR) 44(2), 6 (2012)
10. Sikorski, M., Honig, A.: Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software. No Starch Press (2012)
11. Cesare, S., Xiang, Y., Zhou, W.: Malwise – an effective and efficient classification system for packed and polymorphic malware. IEEE Trans. Comput. 62(6), 1193–1206 (2013)
12. Lindorfer, M., Kolbitsch, C., Comparetti, P.M.: Detecting environment-sensitive malware. In: Recent Advances in Intrusion Detection, pp. 338–357. Springer (2011)
13. Nektra Advanced Computing: Deviare API hook. http://www.nektra.com/products/deviare-api-hook-windows/ (2015). Accessed 1 April 2015
14. Canfora, G., Iannaccone, A.N., Visaggio, C.A.: Static analysis for the detection of metamorphic computer viruses using repeated-instructions counting heuristics. J. Comput. Virol. Hacking Tech. 10(1), 11–27 (2014)
15. Kalbhor, A., Austin, T.H., Filiol, E., Josse, S., Stamp, M.: Dueling hidden Markov models for virus analysis. J. Comput. Virol. Hacking Tech. 11, 1–16 (2014)
16. Lin, D., Stamp, M.: Hunting for undetectable metamorphic viruses. J. Comput. Virol. 7(3), 201–214 (2011)
17. Musale, M., Austin, T.H., Stamp, M.: Hunting for metamorphic JavaScript malware. J. Comput. Virol. Hacking Tech. 1–14 (2014)
18. Shanmugam, G., Low, R.M., Stamp, M.: Simple substitution distance and metamorphic detection. J. Comput. Virol. Hacking Tech. 9(3), 159–170 (2013)
19. Annachhatre, C., Austin, T.H., Stamp, M.: Hidden Markov models for malware classification. J. Comput. Virol. Hacking Tech. 1–15 (2014)
20. Faruki, P., Laxmi, V., Gaur, M.S., Vinod, P.: Mining control flow graph as API call-grams to detect portable executable malware. In: Proceedings of the Fifth International Conference on Security of Information and Networks, pp. 130–137. ACM (2012)
21. Park, Y., Reeves, D.S., Stamp, M.: Deriving common malware behavior through graph clustering. Comput. Secur. 39, 419–430 (2013)
22. Eskandari, M., Hashemi, S.: A graph mining approach for detecting unknown malwares. J. Vis. Lang. Comput. 23(3), 154–162 (2012)
23. Islam, R., Tian, R., Batten, L.M., Versteeg, S.: Classification of malware based on integrated static and dynamic features. J. Netw. Comput. Appl. 36(2), 646–656 (2013)
24. VirusSign: Malware research and data center. http://www.VirusSign.com (2015). Accessed 1 April 2015
25. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
26. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
27. Safavian, S.R., Landgrebe, D.: A survey of decision tree classifier methodology (1990)
28. Demšar, J., Curk, T., Erjavec, A., Gorup, Č., Hočevar, T., Milutinovič, M., Možina, M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar, J., Žitnik, M., Zupan, B.: Orange: data mining toolbox in Python. J. Mach. Learn. Res. 14, 2349–2353 (2013)