FAMD: A Fast Multifeature Android Malware Detection Framework, Design, and Implementation
ABSTRACT With Android's dominant position among current smartphone operating systems, an increasing
number of malware applications pose a great threat to user privacy and security. Classification algorithms
that use a single feature usually have weak detection performance. Although using multiple features can
improve detection, increasing the number of features raises the requirements on the operating
environment and consumes more time. We propose a fast Android malware detection framework based on the
combination of multiple features: FAMD (Fast Android Malware Detector). First, we extract permissions
and Dalvik opcode sequences from samples to construct the original feature set. Second, the Dalvik opcodes
are preprocessed with the N-Gram technique, and the FCBF (Fast Correlation-Based Filter) algorithm,
which is based on symmetrical uncertainty, is employed to reduce feature dimensionality. Finally, the
dimensionality-reduced features are input into the CatBoost classifier for malware detection and family
classification. The dataset DS-1, which we collected, and the baseline dataset Drebin were used in the
experiments. The results show that the combined features effectively improve malware detection accuracy,
which reaches 97.40% on the Drebin dataset, and that malware family classification accuracy reaches 97.38%.
Compared with other state-of-the-art works, our framework achieves higher accuracy and lower time consumption.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 194729
H. Bai et al.: FAMD: Fast Multifeature Android Malware Detection Framework, Design, and Implementation
The other is the detection methods, which use different machine learning methods or combinations of methods as classifiers, such as SVM (Support Vector Machine), KNN (K-Nearest Neighbor), RF (Random Forest), and deep learning methods, to identify the different behavior patterns and establish detection systems. The purpose of these studies is to improve the accuracy of malware detection with the hope that the methods are effective in practice.
To achieve the above purpose, we propose a fast Android malware detection framework, FAMD, that combines multiple features and uses a classification technique to detect malware and classify malware families. It uses permissions and Dalvik opcodes as classification features and further uses the FCBF algorithm to process the features to construct low-dimensional feature vectors. Finally, CatBoost, a machine learning framework based on the gradient boosting decision tree, is used as the classifier to perform the classification of malware. The main contributions of this paper are as follows.
• We propose a fast Android malware detection framework, FAMD, which includes three parts: constructing a malware detection feature set, preprocessing the features for dimensionality reduction, and performing malware detection and family classification on the processed features. The purpose is to improve the accuracy of malware detection while reducing the feature dimensions.
• In terms of feature preprocessing, because the sequences of Dalvik opcodes are segmented by the N-Gram method, the feature dimension is high. We use the FCBF algorithm to reduce the dimension of the features from 2467 to 500.
• CatBoost is adopted as the classifier for the first time in Android malware detection and family classification. Compared with other GBDT-based methods, CatBoost can solve the problems of gradient bias and prediction shift, thus reducing the occurrence of overfitting and improving the classification accuracy and the generalization ability of the model.
The rest of the paper is organized as follows. Section II introduces related research on Android malware detection. Section III presents the framework FAMD and gives the implementation of each part of the framework. Then, Section IV provides more details of our framework implementation and discusses FAMD's evaluation results. Finally, Section V concludes the paper.

II. RELATED WORK
A. STATIC ANALYSIS AND DYNAMIC ANALYSIS OF MALWARE
The current research on Android malware detection can be divided into static analysis and dynamic analysis from the perspective of feature extraction. Static analysis refers to the analysis of the source code or the analysis of the features extracted from the source code. This method can analyze a program's source code without the application being executed. Static analysis includes decompilation, reverse analysis, pattern matching, and static system call analysis. The advantages of static analysis are low resource consumption, fast detection, and low real-time requirements, and the disadvantage is that the detection accuracy is relatively low.
The static analysis method is the most commonly used method in current research. Enck et al. [3] designed a set of security rules that use a signature-based approach to detect the application being evaluated. Saracino et al. [4] proposed a host-based malware detection system for Android devices called MADAM, which simultaneously analyzes and correlates features at the kernel level, application level, user level, and package level to detect and prevent malicious behavior. Kim et al. [5] proposed a framework that uses permissions, strings, API calls, and other features to reflect the characteristics of applications from various aspects. Their feature vector generation method consists of an existence-based method and a similarity-based method, and these are very effective in distinguishing between malware and benign applications, even though malware has many properties that are similar to those of benign applications. In addition, Zhang et al. [6] keep the abstracted API calls of function methods to form a set of abstracted API call transactions and calculate the confidence of association rules between the abstracted API calls. They combine this with machine learning to identify the different behavior patterns and establish a detection system. The framework of MaMaDroid [7] models the sequence obtained from the API call graph as a Markov chain to detect malware from the perspective of behavior.
Dynamic analysis covers a family of methods based on analyzing the runtime behavior of an application. It is usually necessary to run the application in a specific environment to monitor the application's access to the network, system calls, files and memory, information access patterns, and processing behaviors. Dynamic analysis judges the maliciousness of an application by analyzing whether the abovementioned behaviors are normal. The advantage of dynamic analysis is that it is not affected by code obfuscation and encryption and can analyze an application based on its malware-like behavior. However, it consumes more system resources and requires analysts with high technical capabilities, which is not conducive to large-scale application in testing.
In 2014, Enck et al. [8] proposed a dynamic malware detection tool, TaintDroid, which labeled a variety of sensitive data and then monitored the flow path of these tainted sensitive data in a sandbox environment in real time to determine whether the application leaked private data. RansomProber [9] can infer whether the user initiated the file encryption operation by analyzing the user interface widgets of related activities and the user's finger movement coordinates, and it is effective in detecting encrypting ransomware. Cai et al. [10] used a variety of dynamic features based on method calls and inter-component communication (ICC) intents to achieve better robustness than static analysis and dynamic analysis that depends on system calls. Yerima et al. [11] proposed and investigated approaches based on stateful event generation and provided
much better code coverage, which leads to more accurate machine learning-based malware detection.
Some studies also combine static and dynamic features. Yuan et al. [12] used the requested permissions, suspicious API calls, and dynamic behaviors, with a total of 202 features, to build a complete deep learning model. Tam et al. [13] proposed CopperDroid, an automatic dynamic analysis system based on VMI (virtual machine introspection), which can capture operations initiated in Java and in native code execution to reconstruct the behavior of Android malware.

B. FEATURES OF ANDROID MALWARE DETECTION
1) PERMISSION FEATURES
According to the Android mechanism, every Android application runs in a limited-access sandbox. If an application needs to use resources or information outside of its own sandbox, the application has to request the appropriate permissions. Therefore, malware can be found by viewing the permissions declared in the AndroidManifest.xml file. Permission features can be divided into two types: official permissions and custom permissions. There are 166 official permissions defined by Android [14], such as android.permission.INTERNET, which allows applications to open network sockets. All developers can request these permissions. By defining custom permissions, an application can share its resources and capabilities with other applications. For example, if a developer wants to prevent certain users from launching an activity in an application, the developer can define custom permissions to achieve this. After the permissions are defined, they can be referenced as part of the component definition.
Android's permission features can reflect the behavior of the application in a certain sense. Sanz et al. [15] used the permissions as features and combined them with machine learning algorithms to detect Android malware. Wang et al. [16] systematically analyzed the risk of each individual permission and the risk of a group of collaborative permissions by employing machine learning techniques. Talha et al. [17] implemented a permission-based Android malware detection system, APK Auditor, which can achieve 88% accuracy and 92.5% specificity. Li et al. [18] proposed a multi-level data pruning method, SIGPID, which includes negative-rate permission sorting, association-rule permission mining, and support-based permission sorting to extract significant permissions strategically. When using SVM as the classifier, they can achieve over 90% precision, recall, accuracy, and F-measure.

2) DALVIK OPCODE FEATURES
Dalvik is a virtual machine that was used to run Android applications in early Android systems. Every time it runs, it dynamically interprets a part of the Dalvik bytecode as machine code. After Android 5.0, the Dalvik virtual machine (DVM) was replaced by Android Runtime (ART), but the compilation method of the underlying opcode is still compatible. These opcodes can reflect the behavior pattern of an application to a certain extent by means of the underlying machine code, so they are often used as static analysis features.
Jerome et al. [19] used N-Gram-based opcodes as features to detect malware and classify malware families. McLaughlin et al. [20] proposed using a deep convolutional neural network to automatically learn from the original opcode sequence, thereby eliminating the need for manually designed malware features. Zhang et al. [21] extracted several global topology features from the Dalvik opcode graph of each sample to represent malware. This method achieves better detection efficiency and robustness. Pektaş and Acarman [22] extracted the instruction call graph from a malicious application and derived an instruction call sequence to represent Android malware. The accuracy of the proposed malware detection method reached 91.42%. The model proposed by Egitmen et al. [23] extracts skip-gram-based features from the instruction sequence of an application and generates a word embedding vector for each unique opcode to obtain a high-level representation of the opcode sequence, which is used as the input feature of the detection model.

3) OTHER FEATURES
In addition to permissions and Dalvik opcodes, Android malware detection features also include API calls [24]–[26], control flow graphs (CFGs) [27], component information, and hardware information. For example, the API getDeviceId() can be used to access sensitive data and obtain the user's device ID. Therefore, studying an application's API calls is also an effective way to detect its maliciousness. Zhang et al. [28] represented opcodes with a bi-gram model and represented API calls with a frequency vector. Then, they used principal component analysis to optimize the representations and to improve the convergence speed.
Since attackers may specifically evade detection by avoiding certain permissions or API calls, employing a single kind of feature in malware detection may affect the results. Some works used combined features to detect malware. Arp et al. [29] performed extensive static analysis and collected as many application features as possible. These features are embedded in a joint vector space, in which typical patterns that represent malware can be identified automatically. ICCDetector [30] uses captured interactions between components within an application or across application boundaries as features to detect malware. Alazab et al. [31] combined requested permissions and API calls. Compared with benign applications, malicious applications call a different set of APIs. Malware usually requests dangerous permissions to access sensitive data more frequently than benign applications.
The purpose of adopting multiple types of features is to improve the detection effect. However, the combination of multiple features will increase the feature dimensions, making the classifier consume much time in operation and detect inefficiently. Therefore, reducing the
time consumed in malware detection is also a focus of research. Applying appropriate feature selection methods to reduce the features' dimensions is a solution to this problem.

III. DESIGN AND IMPLEMENTATION OF FAMD
A. THE FRAMEWORK OF FAMD
FAMD is a fast Android malware detection framework based on multifeature combination. We combine the permission features and Dalvik opcode features from different levels of the operating system. To deal with the high dimensionality problem that emerges after feature combination, a feature selection method is used to reduce the dimensionality, thereby reducing the classification cost and making detection fast. Specifically, the FAMD framework is divided into four parts: Android application collection, feature extraction and preprocessing, feature selection, and malware detection and family classification, as shown in Fig. 1.
• Android application collection. The applications in this work are collected from an open source dataset and third-party markets. The collected samples are filtered by an antivirus engine to ensure that malicious and benign samples are correctly labeled. The details will be introduced in the experiment section.
• Feature extraction and preprocessing. We use decompilation tools to extract permissions and original opcode sequences from the AndroidManifest.xml file and the classes.dex file. Based on the N-Gram method, opcode subsequences of a specific length are extracted from the original opcode sequence, and the feature vector of each sample is constructed in combination with the permission features. Finally, we construct the feature matrix with each application as a row and each of the extracted features as a column.
• Feature selection. Since the constructed feature vectors have high dimensionality, which results in high computational cost and overfitting, we employ feature selection techniques to reduce dimensionality. The FCBF algorithm is used to weight the features and construct the feature subset. The parameters and the number of features in the subset are decided by experiments.
• Malware detection and family classification. After distinguishing the malicious samples from benign ones, dividing malware into families is important for analyzing the behaviors of malware. We use CatBoost, a machine learning algorithm based on the gradient boosted decision tree, as the classifier to detect malicious samples and classify the malware families. Evaluation metrics such as accuracy, precision, TPR, and FPR are used to verify the effectiveness and performance of the framework.

B. FEATURE EXTRACTION AND PREPROCESSING
1) EXTRACTION OF PERMISSIONS
The purpose of setting permissions is to protect the privacy of Android users. Android applications must apply for permission to access sensitive user data (such as contacts and text messages) and certain system functions (such as camera and Internet). Depending on the function, the system may automatically grant permissions, or the user may be prompted to approve the request. Android divides permissions into four protection levels [32], which affect whether runtime permission requests are required.
• Normal permission. This category of permissions covers situations in which the application needs to access data or resources outside its sandbox. These situations pose little risk to a user's privacy or the operation of other applications.
• Dangerous permission. Contrary to normal permissions, if an application acquires this type of permission, the user's private data will be exposed to the risk of tampering.
• Signature permission. This type of permission is only open to applications with the same signature. Even if other applications know this open data interface and also register the permission in the AndroidManifest.xml file, they still cannot access the corresponding data because their application signatures differ.
• SignatureOrSystem permission. This permission category is similar to signature permission, but it requires not only the same signature but also similar system-level applications. This type of permission is only used for pre-installed applications developed by mobile phone manufacturers.

2) PROCESSING OF DALVIK OPCODE
N-Gram [33] is a method based on statistical language models. It performs a sliding window operation of size N on the content of the text, forming a sequence of byte fragments of length N. Each byte segment is called a Gram. The frequency of occurrence of all Grams is counted and filtered according to a preset threshold to form a key Gram list, which is the feature vector space of this text, and each element in the Gram list is a feature vector dimension.
The N-Gram model is based on the following hypothesis: the appearance of the Nth word is only related to the previous N − 1 words and is not related to any other words. The probability of an entire sentence occurring is the product of the probability of each word occurring. These probabilities can be obtained by directly counting the number of simultaneous occurrences of N words in the corpus.
In malware detection, the N-Gram method is often used to process malicious code. The N-Gram features are usually extracted from the application opcode sequences. N is usually set to 2, 3, or 4.
The current Dalvik instruction set [34] contains 230 instructions, including the "Move" instruction, "Invoke" instruction, "Return" instruction, and so on. Existing studies have shown that methods based on N-Grams face exponential growth in the number of unique N-Grams as the value of N increases. Therefore, in this paper, we simplify the opcodes by removing the irrelevant instructions, retaining only the seven core instruction sets, and removing the operands. The seven instruction sets, M, R, G, I, T, P, and V, represent seven types of instructions: move, return, jump, judge, read data, store data, and call methods, respectively. The instructions are classified and described in Table 1.
TABLE 1. Dalvik instruction mapping table.
According to the above mapping, we use the N-Gram method to segment the opcode sequence extracted from the applications. Take the original opcode sequence "move-object/from16, iget-object, invoke-virtual, goto, move-object/from16" as an example. "move-object/from16" corresponds to the "M" instruction, "iget-object" corresponds to the "T" instruction, "invoke-virtual" corresponds to the "V" instruction, and "goto" corresponds to the "G" instruction. Therefore, the sequence is simplified as "MTVGM". Then the 3-Gram features of the sequence are {MTV}, {TVG}, {VGM}, the 4-Gram features are {MTVG}, {TVGM}, and the 5-Gram feature is {MTVGM}.

3) CONSTRUCTION OF FEATURE VECTORS
Androguard [35] is a Python-based Android analysis tool that can analyze an Android file structure through decompilation and extract static features. All permissions and opcode sequences of each application can be extracted from the AndroidManifest.xml file and the classes.dex file through Androguard. In this work, in order to limit the dimensionality of feature vectors and ensure the generality of extracted features, only the 166 official permissions are extracted, without considering custom permissions. For the extracted Dalvik opcode sequence, opcode subsequences of a specific length are extracted according to the above mapping table. These features constitute the initial feature set.
The feature set is converted to numeric form in the following way to construct feature vectors. Given an Android application $a$, suppose the feature set constructed from all applications contains $n$ features and is represented by $S$; then, the feature vector of application $a$ is given by equation (1).

$$V_a = \{v_1, v_2, \ldots, v_n\}, \qquad v_i = \begin{cases} 1, & v_i \in a \text{ and } v_i \in S,\ 1 \le i \le n; \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

Therefore, the feature vector can be expressed as $V_a = \{0, 1, 0, 0, 1, 0, 1, \ldots, 1\}$, where 1 indicates that the feature is included in the application and 0 indicates that the feature is not included. For sample labels, 1 represents malware, and -1 represents benign.

C. FEATURE SELECTION BASED ON FCBF
Feature selection is the process of selecting a subset of M features from a set of N features, where M ≤ N. The purpose of feature selection is to remove the redundant or irrelevant features from a set of features to reduce the dimensionality.
According to the execution process of the feature selection algorithm, feature selection can be divided into 3 categories: Filter methods rely on the general characteristics of the training data to select features independently of any classifier.
Wrapper methods use the classifier as a black box and the classifier performance as the objective function to evaluate the variable subset. Embedded methods aim to reduce the computation time that wrapper methods spend reclassifying different subsets; their main approach is to incorporate feature selection into the training process.
FCBF is a fast correlation-based filter algorithm proposed by Yu and Liu [36] in 2003 (see also Senliol et al. [37]). It has a wide range of applications in speech recognition [38], network traffic classification [39], and other fields because of its fast calculation. The FCBF algorithm employs symmetrical uncertainty ($SU$) to measure the correlation between two features. The theoretical basis is that if the $SU$ of feature $X$ and target $Y$ is high, and the $SU$ of other features and target $Y$ is low, then feature $X$ is more important and has a higher weight. When the value of $SU$ between $X$ and $Y$ is 1, $X$ and $Y$ are completely correlated; in other words, if $X \to Y$, then $Y \to X$. When the value of $SU$ is 0, $X$ and $Y$ are completely independent.
The $SU$ uses entropy and conditional entropy to calculate the correlation of features. The entropy of $X$ is:

$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) \tag{2}$$

and the entropy of $X$ after observing values of another variable $Y$ is defined as:

$$H(X|Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i|y_j) \log_2 P(x_i|y_j) \tag{3}$$

where $P(x_i)$ is the prior probability of each value of $X$, and $P(x_i|y_j)$ is the posterior probability of $X$ given the values of $Y$. $IG(X|Y)$ represents the information gain:

$$IG(X|Y) = H(X) - H(X|Y) \tag{4}$$

Then, $SU(X, Y)$ between $X$ and $Y$ is:

$$SU(X, Y) = \frac{2\, IG(X|Y)}{H(X) + H(Y)} \tag{5}$$

An example illustrating the process of the FCBF algorithm is described in the following 7 steps.
Step 1: Calculate the symmetrical uncertainty $SU_{X_i,Y}$ between each feature $X_i$ and target $Y$.
Step 2: Set a threshold $\delta$. If $SU_{X_i,Y} > \delta$, add $X_i$ to the feature set $S_{list}$, and arrange the features in descending order of their $SU_{X_i,Y}$ values. Suppose that six features $X_1, \ldots, X_6$ are obtained here, where $SU_{X_1,Y}$ is the maximum and $SU_{X_6,Y}$ is the minimum.
Step 3: Select feature $X_1$ (the first feature in $S_{list}$), which has the maximum value of $SU_{X_i,Y}$, as the main feature $X_m$.
Step 4: Select the features $X_2, X_3, X_4, X_5, X_6$ in $S_{list}$ whose symmetrical uncertainty $SU_{X_i,Y}$ is less than that of the main feature, $SU_{X_m,Y}$. For each of them, calculate the symmetrical uncertainty $SU_{X_i,X_m}$ between the feature and $X_m$, and the symmetrical uncertainty $SU_{X_i,Y}$ between the feature and the category $Y$.
Step 5: If $SU_{X_i,X_m} > SU_{X_i,Y}$, the feature is considered redundant and is removed from $S_{list}$. Here, we assume that features $X_2$ and $X_4$ are removed.
Step 6: Add $X_1$ to the feature subset $S_{sub}$, and choose $X_3$ as the main feature $X_m$ among the remaining features in $S_{list}$.
Step 7: If $S_{list}$ is not empty, repeat the process of Step 5; suppose that we remove $X_6$ and add $X_5$ to the feature subset $S_{sub}$. If $S_{list}$ is empty, $S_{sub}$ is the final subset and the selection stops.
After the above process, we get the final feature subset $S_{sub}$. Compared with other algorithms, one of the advantages of the FCBF algorithm is the ability to remove redundant features. For two mutually redundant features $X_1$ and $X_2$, suppose that $X_1$ has a higher correlation with target $Y$. After calculation, feature $X_1$, with the higher correlation with category $Y$, is retained, and $X_2$, with the lower correlation, is removed. At the same time, the more relevant $X_1$ can be used to filter other features. For a dataset with $N$ features and $M$ instances, the time complexity is $O(MN \log N)$, so it is a fast filter-based feature selection algorithm. We sort the features produced by the FCBF algorithm in descending order and then select a certain number of them to form the required feature subset.

D. MALWARE DETECTION AND FAMILY CLASSIFICATION
After the FCBF algorithm is used for feature selection, the constructed feature subset is processed by the classification algorithm to detect the maliciousness of each sample. CatBoost [40], [41] is a machine learning library open-sourced by Yandex in 2017. This algorithm is similar to XGBoost [42] and LightGBM [43] and is an improved algorithm based on the framework of the gradient boosting decision tree (GBDT) algorithm. CatBoost is built on oblivious trees; it has few parameters, supports categorical variables, and achieves high accuracy. Compared with other GBDT-based algorithms, it can process categorical features efficiently and reasonably. In addition, it can also handle the gradient bias and prediction shift problems and improve the algorithm's accuracy and generalization ability. The key methods of the CatBoost algorithm concern two aspects: dealing with categorical features and ordered boosting.
We usually need to process categorical features before building a model. Suppose we have a dataset $D = \{(X_i, Y_i)\}$, $i = 1, 2, \ldots, n$. $X_i = (x_{i,1}, \ldots, x_{i,m})$ is a vector with $m$ features, including numerical features and categorical features, and $Y_i \in \mathbb{R}$ is the label. The most common way to deal with categorical features in GBDT is to replace them with the average label value of the samples in the same category. In the decision tree, the label average value is used as the criterion for node splitting. This method is called greedy target-based statistics, and it is expressed by the formula below, where $[\cdot]$ denotes Iverson brackets, i.e., $[x_{j,k} = x_{i,k}]$ equals 1 if $x_{j,k} = x_{i,k}$ and 0 otherwise. This procedure obviously leads to overfitting.

$$\frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]} \tag{6}$$
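The greedy target statistic of formula (6) can be illustrated with a minimal Python sketch. The function name `greedy_target_statistic` and the toy column are hypothetical and chosen only for illustration; this is not CatBoost's internal implementation. The example shows why the procedure overfits: a category that occurs only once is encoded as exactly its own label.

```python
from collections import defaultdict


def greedy_target_statistic(categories, labels):
    """Encode each categorical value as the mean label of all samples
    that share that value, as in formula (6)."""
    label_sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in zip(categories, labels):
        label_sums[category] += label
        counts[category] += 1
    return [label_sums[c] / counts[c] for c in categories]


# Hypothetical toy column: one categorical feature k and binary labels.
categories = ["A", "A", "B", "C", "A"]
labels = [1, 0, 1, 0, 1]

encoded = greedy_target_statistic(categories, labels)
for category, label, value in zip(categories, labels, encoded):
    print(category, label, round(value, 3))
```

Categories "B" and "C" each occur once, so their encoded values (1.0 and 0.0) are simply their own labels; the encoding leaks the target into the feature, which is the source of the overfitting noted above.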
CatBoost uses a more efficient strategy that reduces version is 3.7.6, and the main libraries used include Numpy,
overfitting and uses the whole dataset for training. Let Pandas, and Skfeature.
σ = (σ 1 , . . . , σ n ) as the permutation, xσp ,k is substituted
with (7). B. EVALUATION METRICS
Pp−1 The evaluation metrics are defined as follows. True positive
j=1 [xσj ,k = xσp ,k ]Y j + a · P (TP): the number of samples that are actually positive and
Pp−1 (7) predicted positive. False Positive (FP): the number of samples
j=1 [xσj ,k = xσp ,k ] + a that are actually negative but predicted positive. False Nega-
tive (FN): the number of samples that are actually positive
We also add a prior value P and a parameter a > 0, which is
but predicted negative. True Negative (TN): the number of
the weight of the prior. Adding a prior is a common practice
samples that are actually negative and predicted negative.
and helps to reduce the noise obtained from low-frequency
categories.
1) TPR
Prediction shift is often a problem that plagues modeling.
The percentage of samples correctly identified as positive out
In each iteration of GDBT, the loss function uses the same
of the total positive samples.
dataset to obtain the gradient of the current model and then
trains to obtain the base classifier. However, it will lead to TP
TPR = (8)
gradient bias and overfitting. CatBoost replaces the gradi- TP + FN
ent estimation method in traditional algorithms with ordered
2) FPR
boosting, reducing the deviation of gradient estimation and
improving the model's generalization ability. The principle of ranking improvement is as follows. Suppose that the samples Xi are ordered by a random permutation σ. To obtain an unbiased gradient estimate, CatBoost trains a separate model Mi for each sample Xi, where Mi is trained on a training set that does not contain Xi. Model Mi is then used to estimate the gradient of the sample, and finally the base learner is trained on these gradients to obtain the final model.

IV. EXPERIMENTS AND EVALUATIONS
In this section, we discuss the parameter settings and classification results of the presented FAMD framework across six groups of experiments. The parameter settings include N-Gram selection and FCBF algorithm parameter selection. In the classification experiments, we compare the malware detection results with those of other classifiers and discuss the distributions of key features. The proposed method is compared with other state-of-the-art works, and we also evaluate the family classification results.

A. DATASETS AND EXPERIMENTAL ENVIRONMENT
The experiments use two datasets: (1) The Drebin dataset, which contains 5,560 malicious samples and 5,666 benign samples. It is widely used as a benchmark dataset and is used here to compare FAMD with similar works. (2) The DS-1 dataset, which was collected in this work and contains a total of 25,737 applications, of which 12,989 are malicious samples and 12,748 are benign samples. The maximum size of a benign sample is 1.16 GB, while the minimum is 8 KB. The maximum size of a malicious sample is 31.3 MB, and the minimum is 11 KB. We collected all of the benign samples from third-party markets and used VirusTotal [44] to check each benign sample for maliciousness, so as to construct a training dataset that is as pure as possible.
The experiments use a Dell PowerEdge 720 server with an Intel Xeon E5-2603 CPU and 64 GB RAM. The Python

The percentage of samples wrongly identified as positive out of the total negative samples.
$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{9}$$

3) ACCURACY
The percentage of correctly classified samples out of the total number of samples.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$

4) PRECISION
The percentage of correctly predicted positive samples out of the total predicted positive samples.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$

5) F1-SCORE
The combination of the precision and recall metrics, which serves as a compromise between them. The best F1-score equals 1, while the worst score is 0.
$$\mathrm{F1\text{-}Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$

6) ROC CURVE
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. It illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

7) AUC
The AUC is the area under the ROC curve, and its value can be used to intuitively evaluate the quality of the classifier. The closer the AUC is to 1.0, the better the detection method. When the AUC is equal to 0.5, the classifier has no application value.
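The metric definitions above can be checked on a small example. The following is a minimal sketch, not the evaluation code used in this work: the function names and the toy labels are illustrative assumptions, recall is assumed to be the usual TPR = TP / (TP + FN), and the AUC is computed through its standard pairwise interpretation (the fraction of positive/negative pairs in which the positive sample receives the higher score, ties counted as one half).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, with 1 = malicious (positive)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            if truth == 1:
                tp += 1
            else:
                fp += 1
        else:
            if truth == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(y_true, y_pred):
    """Compute FPR, accuracy, precision, recall and F1 from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # assumed standard TPR definition
    return {
        "fpr": _ratio(fp, fp + tn),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


# Toy data (hypothetical): 4 malicious and 4 benign samples.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

print(detection_metrics(y_true, y_pred))
print(auc(y_true, scores))
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be the usual choice for datasets of the size of DS-1.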
TABLE 2. The number of sample failures while extracting different features.
TABLE 3. Accuracy comparison of different N values.
C. EXPERIMENTAL RESULTS
1) N-GRAM SETTING
For N-Gram features built from Dalvik opcodes, the value of N affects two aspects: the classification accuracy and the number of features. We use the DS-1 dataset to choose the segmentation length of the Dalvik opcode sequences. For each N in [2, 3, 4, 5], the N-Gram opcode sequences of the corresponding length are extracted. The extracted features are input into the CatBoost classifier, and 10-fold cross-validation is used to find the most appropriate value of N.
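To make the role of N concrete, the sketch below slides a window of length N over an opcode sequence and counts the resulting N-Grams. It is an illustration under stated assumptions rather than the extraction pipeline of this work: the function names and the sample opcode sequence are hypothetical, and whether the final features are counts or binary presence flags is not specified here.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return every contiguous run of n opcodes as a tuple (sliding window, step 1)."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def ngram_counts(opcodes, n):
    """Count how often each distinct N-Gram occurs in one opcode sequence."""
    return Counter(opcode_ngrams(opcodes, n))


# Hypothetical opcode sequence of a single method.
sample = [
    "const/4", "invoke-virtual", "move-result", "if-eqz",
    "invoke-virtual", "move-result", "return-void",
]

for n in (2, 3, 4, 5):
    counts = ngram_counts(sample, n)
    print(n, len(opcode_ngrams(sample, n)), len(counts))
```

A sequence shorter than N yields no N-Grams at all, which is one way a small sample can fail to produce opcode features. Across a whole dataset, the number of distinct N-Grams tends to grow quickly with N, which is the dimensionality cost discussed in this section.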
Due to the diverse designs of Android applications, and especially because malware applications deliberately evade certain features, a single kind of feature, such as permissions, cannot be extracted from some APK files. As a result, many samples are ignored in classification. We compare the number of samples whose extraction failed when extracting features individually and when extracting combined features on the DS-1 dataset, as shown in Table 2.

It can be seen from Table 2 that when the permission features are combined with N-Gram opcodes, features can be extracted from most of the samples, which is better than extracting a single kind of feature. Permission features are more difficult to extract from benign samples. Since we extracted and discussed only official Android permissions in this work, this may be related to some samples that employ more custom permissions. The extraction of N-Gram opcode features shows the opposite pattern: extraction fails more often for malware, such as the sample (SHA-1: 8d2795c2e790c54b401fd52eb56279f6af0a07fb), which is small in size and performs malicious behaviors through calling permissions.

We compare the accuracy and the number of constructed features for different N-Gram lengths of Dalvik opcodes in Table 3. As the value of N increases, the classification accuracy improves, but the gains become smaller. At the same time, the number of features grows markedly, leading to increased computational cost.

We combine the extracted permission features with the Dalvik opcode features and apply different values of N; the results are shown in Table 4.

The combination of the two kinds of features achieves better accuracy than either kind alone. The best accuracy is 96.21% with "Permission with 5-Gram", and the second-best result is 95.84% with "Permission with 4-Gram".

According to the results in Table 3 and Table 4, and considering both accuracy and feature dimensionality, we set the value of N to 4 and employ "Permission with 4-Gram" as the features in the following experiments.

2) FCBF ALGORITHM PARAMETER SETTING
Since the original feature set has high dimensionality, we use the FCBF algorithm to perform feature selection and construct an appropriate feature subset. We vary the threshold δ over the range [0.005, 0.03] in steps of 0.005, and the number of features over the range 100 to 500. The result is shown in Fig. 2.

FIGURE 2. The accuracy of the FCBF algorithm using different parameters.

From Fig. 2, the detection results improve as the number of features increases. The best accuracy is achieved when the threshold δ is set to 0.005 and the number of features is set to 500; these are the FCBF parameters chosen in this work.

3) COMPARISON WITH OTHER CLASSIFIERS
The DS-1 dataset is used as experimental data, with 70% of the data used as the training set and the rest as the test
FIGURE 4. Classification results of top 20 malware families in Drebin dataset.

FIGURE 5. Confusion matrix of the top 20 Drebin malware families.

The overall accuracy of malware family classification is 97.38%, which shows the effectiveness and feasibility of

V. CONCLUSION
The number of applications that can be classified as malware continues to increase, and new types of malware and camouflage techniques are constantly emerging. Effectively detecting malware in a relatively short time is therefore of considerable significance to third-party application markets and users. How to improve the detection accuracy and reduce the detection time remain problems to be solved.
We present a fast Android malware detection framework, FAMD, which combines permission features and Dalvik opcode features from different operation levels to construct feature vectors. To reduce the feature dimensionality and time complexity of the method, the FCBF algorithm is employed for feature selection. CatBoost, a classifier proposed in recent years, is employed in this work to conduct malware detection and family classification.
In the experiments, we segment the opcodes with 4-Gram and vectorize the features combined with permissions. With CatBoost as the classifier, the result achieves an accuracy of 97.40% in malware detection and 97.38% in family classification. Compared with other state-of-the-art works, FAMD performs better overall in accuracy and time consumption. The experiments also show a clear difference in the distribution of certain key features between malicious applications and benign applications.
Since CatBoost is a supervised learning framework, this work is inadequate in detecting newly emerging malicious applications, which we aim to improve in future work.

REFERENCES
[1] (2020). Smartphone Market Share. [Online]. Available: https://www.idc.com/promo/smartphone-market-share/os
[2] (2020). Mobile Malware Evolution 2019. [Online]. Available: https://securelist.com/mobile-malware-evolution-2019/96280
[3] W. Enck, M. Ongtang, and P. McDaniel, "On lightweight mobile phone application certification," in Proc. 16th ACM Conf. Comput. Commun. Secur. (CCS), 2009, pp. 235–245.
[4] A. Saracino, D. Sgandurra, G. Dini, and F. Martinelli, "MADAM: Effective and efficient behavior-based Android malware detection and prevention," IEEE Trans. Dependable Secure Comput., vol. 15, no. 1, pp. 83–97, Jan. 2018.
[5] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, "A multimodal deep learning method for Android malware detection using various features," IEEE Trans. Inf. Forensics Security, vol. 14, no. 3, pp. 773–788, Mar. 2019.
[6] H. Zhang, S. Luo, Y. Zhang, and L. Pan, "An efficient Android malware detection system based on method-level behavioral semantic analysis," IEEE Access, vol. 7, pp. 69246–69256, 2019.
[7] L. Onwuzurike, E. Mariconti, P. Andriotis, E. De Cristofaro, G. Ross, and G. Stringhini, "MaMaDroid: Detecting Android malware by building Markov chains of behavioral models (extended version)," ACM Trans. Privacy Secur., vol. 22, no. 2, pp. 1–34, 2019.
[8] W. Enck, P. Gilbert, S. Han, V. Tendulkar, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth, "TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones," ACM Trans. Comput. Syst., vol. 32, no. 2, pp. 1–29, Jun. 2014.
[9] J. Chen, C. Wang, Z. Zhao, K. Chen, R. Du, and G.-J. Ahn, "Uncovering the face of Android ransomware: Characterization and real-time detection," IEEE Trans. Inf. Forensics Security, vol. 13, no. 5, pp. 1286–1300, May 2018.
[10] H. Cai, N. Meng, B. Ryder, and D. Yao, "DroidCat: Effective Android malware detection and categorization via app-level profiling," IEEE Trans. Inf. Forensics Security, vol. 14, no. 6, pp. 1455–1470, Jun. 2019.
[11] S. Y. Yerima, M. K. Alzaylaee, and S. Sezer, "Machine learning-based dynamic analysis of Android apps with improved code coverage," EURASIP J. Inf. Secur., vol. 2019, no. 1, p. 4, Dec. 2019.
[12] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue, "Droid-sec: Deep learning in Android malware detection," in Proc. ACM Conf. SIGCOMM, 2014, pp. 371–372.
[13] K. Tam, S. J. Khan, A. Fattori, and L. Cavallaro, "CopperDroid: Automatic reconstruction of Android malware behaviors," in Proc. Netw. Distrib. Syst. Secur. Symp., 2015, pp. 1–15.
[14] (2020). Android Permission. [Online]. Available: https://developer.android.com/reference/android/Manifest.permission
[15] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, J. Nieves, P. G. Bringas, and G. Á. Marañón, "Mama: Manifest analysis for malware detection in Android," Cybern. Syst., vol. 44, nos. 6–7, pp. 469–488, Oct. 2013.
[16] W. Wang, X. Wang, D. Feng, J. Liu, Z. Han, and X. Zhang, "Exploring permission-induced risk in Android applications for malicious application detection," IEEE Trans. Inf. Forensics Security, vol. 9, no. 11, pp. 1869–1882, Nov. 2014.
[17] K. A. Talha, D. I. Alper, and C. Aydin, "APK auditor: Permission-based Android malware detection system," Digit. Invest., vol. 13, pp. 1–14, Jun. 2015.
[18] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-an, and H. Ye, "Significant permission identification for machine-learning-based Android malware detection," IEEE Trans. Ind. Informat., vol. 14, no. 7, pp. 3216–3225, Jul. 2018.
[19] Q. Jerome, K. Allix, R. State, and T. Engel, "Using opcode-sequences to detect malicious Android applications," in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2014, pp. 914–919.
[20] N. McLaughlin, J. M. D. Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, Y. Safaei, E. Trickel, Z. Zhao, A. Doupé, and G. J. Ahn, "Deep Android malware detection," in Proc. 7th ACM Conf. Data Appl. Secur. Privacy, 2017, pp. 301–308.
[21] J. Zhang, Z. Qin, K. Zhang, H. Yin, and J. Zou, "Dalvik opcode graph based Android malware variants detection using global topology features," IEEE Access, vol. 6, pp. 51964–51974, 2018.
[22] A. Pektaş and T. Acarman, "Learning to detect Android malware via opcode sequences," Neurocomputing, vol. 396, pp. 599–608, Jul. 2020.
[23] A. Egitmen, I. Bulut, R. C. Aygun, A. B. Gunduz, O. Seyrekbasan, and A. G. Yavuz, "Combat mobile evasive malware via skip-gram-based malware detection," Secur. Commun. Netw., vol. 2020, pp. 1–10, Apr. 2020.
[24] Y. Aafer, W. Du, and H. Yin, "Droidapiminer: Mining api-level features for robust malware detection in Android," in Proc. Int. Conf. Secur. Privacy Commun. Syst. Cham, Switzerland: Springer, Sep. 2013, pp. 86–103.
[25] L. Cen, C. S. Gates, L. Si, and N. Li, "A probabilistic discriminative model for Android malware detection with decompiled source code," IEEE Trans. Dependable Secure Comput., vol. 12, no. 4, pp. 400–412, Jul. 2015.
[26] S. Hou, Y. Ye, Y. Song, and M. Abdulhayoglu, "HinDroid: An intelligent Android malware detection system based on structured heterogeneous information network," in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2017, pp. 1507–1515.
[27] Z. Ma, H. Ge, Y. Liu, M. Zhao, and J. Ma, "A combination method for Android malware detection based on control flow graphs and machine learning algorithms," IEEE Access, vol. 7, pp. 21235–21245, 2019.
[28] J. Zhang, Z. Qin, H. Yin, L. Ou, and K. Zhang, "A feature-hybrid malware variants detection using CNN based opcode embedding and BPNN based API embedding," Comput. Secur., vol. 84, pp. 376–392, Jul. 2019.
[29] D. Arp, M. Spreitzenbarth, C. Siemens, M. Hübner, H. Gascon, and K. Rieck, "Drebin: Effective and explainable detection of Android malware in your pocket," in Proc. Netw. Distrib. Syst. Secur. Symp., vol. 14, 2014, pp. 23–26.
[30] K. Xu, Y. Li, and R. H. Deng, "ICCDetector: ICC-based malware detection on Android," IEEE Trans. Inf. Forensics Security, vol. 11, no. 6, pp. 1252–1264, Jun. 2016.
[31] M. Alazab, M. Alazab, A. Shalaginov, A. Mesleh, and A. Awajan, "Intelligent mobile malware detection using permission requests and API calls," Future Gener. Comput. Syst., vol. 107, pp. 509–521, Jun. 2020.
[32] Permissions Overview. Accessed: Oct. 27, 2020. [Online]. Available: https://developer.android.google.cn/guide/topics/permissions/overview
[33] W. B. Cavnar and J. M. Trenkle, "N-gram-based text categorization," in Proc. 3rd Annu. Symp. Document Anal. Inf. Retr., vol. 161175, 1994, pp. 1–14.
[34] Dalvik Bytecode. Accessed: Oct. 27, 2020. [Online]. Available: https://source.android.com/devices/tech/dalvik/dalvik-bytecode
[35] Androguard. Accessed: Oct. 27, 2020. [Online]. Available: https://github.com/androguard/androguard
[36] L. Yu and H. Liu, "Feature selection for high-dimensional data: A fast correlation-based filter solution," in Proc. 20th Int. Conf. Mach. Learn. (ICML), 2003, pp. 856–863.
[37] B. Senliol, G. Gulgezen, L. Yu, and Z. Cataltepe, "Fast correlation based filter (FCBF) with a different search strategy," in Proc. 23rd Int. Symp. Comput. Inf. Sci., Oct. 2008, pp. 1–4.
[38] D. Gharavian, M. Sheikhan, A. Nazerieh, and S. Garoucy, "Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network," Neural Comput. Appl., vol. 21, no. 8, pp. 2115–2126, Nov. 2012.
[39] A. W. Moore and D. Zuev, "Internet traffic classification using Bayesian analysis techniques," in Proc. ACM SIGMETRICS Int. Conf. Meas. Modeling Comput. Syst. (SIGMETRICS), 2005, pp. 50–60.
[40] A. V. Dorogush, V. Ershov, and A. Gulin, "CatBoost: Gradient boosting with categorical features support," 2018, arXiv:1810.11363. [Online]. Available: http://arxiv.org/abs/1810.11363
[41] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, "Catboost: Unbiased boosting with categorical features," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 6638–6648.
[42] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 785–794.
[43] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "Lightgbm: A highly efficient gradient boosting decision tree," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3146–3154.
[44] Virustotal. Accessed: Oct. 27, 2020. [Online]. Available: https://www.virustotal.com/gui/home/upload
[45] G. Canfora, A. De Lorenzo, E. Medvet, F. Mercaldo, and C. A. Visaggio, "Effectiveness of opcode ngrams for detection of multi family Android malware," in Proc. 10th Int. Conf. Availability, Rel. Secur., Aug. 2015, pp. 333–340.
[46] K. Riad and L. Ke, "RoughDroid: Operative scheme for functional Android malware detection," Secur. Commun. Netw., vol. 2018, pp. 1–10, Sep. 2018.
HONGPENG BAI received the B.S. degree from Liaoning Shihua University, in 2016. He is currently pursuing the master's degree in computer technology with the Changchun University of Science and Technology, China. His research interests include machine learning and malware detection.

XIAOQIANG DI received the B.S. degree in computer science and technology and the M.S. and Ph.D. degrees in communication and information systems from the Changchun University of Science and Technology, in 2002, 2007, and 2014, respectively. He was a Visiting Scholar with the Norwegian University of Science and Technology, Norway, from August 2012 to August 2013. He is currently a Professor and a Ph.D. Supervisor with the Changchun University of Science and Technology. His major research interests include network information security and integrated networks.