
Received August 31, 2020, accepted October 15, 2020, date of publication October 22, 2020, date of current version November 6, 2020.


Digital Object Identifier 10.1109/ACCESS.2020.3033026

FAMD: A Fast Multifeature Android Malware Detection Framework, Design, and Implementation
HONGPENG BAI1,2, NANNAN XIE1,2, XIAOQIANG DI1,2,3, AND QING YE1
1 School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
2 Jilin Province Key Laboratory of Network and Information Security, Changchun 130022, China
3 Information Center, Changchun University of Science and Technology, Changchun 130022, China

Corresponding author: Nannan Xie ([email protected])


This work was supported in part by the 13th Five-Year Science and Technology Research Project of the Education Department of Jilin
Province under Grant JJKH20200794KJ, in part by the Innovation Fund of Changchun University of Science and Technology under Grant
XJJLG-2018-09, and in part by the Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of
Education, Jilin University, under Grant 93K172018K05.

ABSTRACT With Android's dominant position in the current smartphone OS market, an increasing number of malware applications pose a great threat to user privacy and security. Classification algorithms that use a single feature usually have weak detection performance. Although the use of multiple features can improve the detection effect, increasing the number of features raises the requirements on the operating environment and consumes more time. We propose a fast Android malware detection framework based on the combination of multiple features: FAMD (Fast Android Malware Detector). First, we extract permissions and Dalvik opcode sequences from samples to construct the original feature set. Second, the Dalvik opcodes are preprocessed with the N-Gram technique, and the FCBF (Fast Correlation-Based Filter) algorithm based on symmetrical uncertainty is employed to reduce the feature dimensionality. Finally, the dimensionality-reduced features are input into the CatBoost classifier for malware detection and family classification. The dataset DS-1, which we collected, and the baseline dataset Drebin were used in the experiments. The results show that the combined features can effectively improve the detection accuracy of malware, which reaches 97.40% on the Drebin dataset, and the malware family classification accuracy can achieve 97.38%. Compared with other state-of-the-art works, our framework achieves higher accuracy and lower time consumption.

INDEX TERMS Android malware, CatBoost, Dalvik opcode, malware detection.

I. INTRODUCTION
In the past ten years, advancements in mobile internet technology have changed the lifestyles of countless users and have also brought tremendous changes to the procedures used in various industries, such as governments and enterprises. However, a series of security risks have arisen in mobile internet technology. Malware applications hidden in smart terminals leak information, plant Trojan horses, push advertising, and pose threats to user privacy. International Data Corporation (IDC) [1] estimates that Android's smartphone market share will hover around 86% in 2020. In 2019, Kaspersky's report [2] showed that 3,503,952 malicious installation packages were found in its mobile terminal products. The number of attacks on mobile devices increased by 50% in 2019, from 40,386 in 2018 to 67,500 in 2019. In addition to the spyware and Trojans of traditional network security, the usage of stalkerware on mobile devices is growing. Due to the large number of Android malware samples, their fast update speed, and the constant emergence of new types of malware, it remains challenging to detect malware effectively, reduce the detection time, and improve the detection efficiency.

Android malware detection research mainly includes two aspects. One is the detection features, which include requested permissions, API calls, Dalvik opcodes, and inter-component communication. Different features or combined features are employed to detect malicious applications.

The associate editor coordinating the review of this manuscript and approving it for publication was Roberto Pietrantuono.


The other is the detection methods, which use different machine learning methods or combinations of methods as classifiers, such as SVM (Support Vector Machine), KNN (K-Nearest Neighbor), RF (Random Forest), and deep learning methods, to identify the different behavior patterns and establish detection systems. The purpose of these studies is to improve the accuracy of malware detection, with the hope that the methods are effective in practice.

In order to achieve the above purpose, we propose a fast Android malware detection framework, FAMD, that combines multiple features and uses a classification technique to detect malware and classify malware families. It uses permissions and Dalvik opcodes as classification features and further uses the FCBF algorithm to process the features to construct low-dimensional feature vectors. Finally, the machine learning framework CatBoost, based on the gradient boosting decision tree, is used as the classifier to perform the classification of malware. The main contributions of this paper are as follows.

• We propose a fast Android malware detection framework, FAMD, which includes three parts: constructing a malware detection feature set, preprocessing the features for dimensionality reduction, and performing malware detection and family classification on the processed features. The purpose is to improve the accuracy of malware detection while reducing the feature dimensions.

• In terms of feature preprocessing, because the sequences of Dalvik opcodes are segmented by the N-Gram method, the feature dimension is high. We use the FCBF algorithm to reduce the dimension of the features from 2467 to 500.

• CatBoost is adopted as the classifier for the first time in Android malware detection and family classification. Compared with other GBDT-based methods, CatBoost can solve the problems of gradient bias and prediction shift, thus reducing the occurrence of over-fitting and improving the classification accuracy and the generalization ability of the model.

The rest of the paper is organized as follows. Section II introduces related research on Android malware detection. Section III presents the framework FAMD and gives the implementation of each part of the framework. Then, Section IV provides more details of our framework implementation and discusses FAMD's evaluation results. Finally, Section V concludes the paper.

II. RELATED WORK
A. STATIC ANALYSIS AND DYNAMIC ANALYSIS OF MALWARE
The current research on Android malware detection can be divided into static analysis and dynamic analysis from the perspective of feature extraction. Static analysis refers to the analysis of the source code or of the features extracted from the source code. This method can analyze a program's source code without the application being executed. Static analysis includes decompilation, reverse analysis, pattern matching, and static system call analysis. The advantages of static analysis are low resource consumption, fast detection, and low real-time requirements, and the disadvantage is that the detection accuracy is relatively low.

The static analysis method is the most commonly used method in current research. Enck et al. [3] designed a set of security rules that use a signature-based approach to detect the application being evaluated. Saracino et al. [4] proposed a host-based malware detection system for Android devices called MADAM, which simultaneously analyzes and correlates features at the kernel level, application level, user level, and package level to detect and prevent malicious behavior. Kim et al. [5] proposed a framework that uses permissions, strings, API calls, and other features to reflect the various characteristics of applications from various aspects. Their feature vector generation method consists of an existence-based method and a similarity-based method, and these are very effective in distinguishing between malware and benign applications, even though malware has many properties that are similar to those of benign applications. In addition, Zhang et al. [6] keep the abstracted API calls of function methods to form a set of abstracted API call transactions, calculate the confidence of association rules between the abstracted API calls, and then combine machine learning to identify the different behavior patterns and establish a detection system. The framework of MaMaDroid [7] constructs the sequences obtained from the API call graph as a Markov chain to detect malware from the perspective of behavior.

Dynamic analysis covers a family of methods based on analyzing the runtime behavior of an application. It is usually necessary to run the application in a specific environment to monitor the application's access to the network, system calls, files and memory, information access patterns, and processing behaviors. Dynamic analysis judges the maliciousness of an application by analyzing whether the abovementioned behaviors are normal. The advantage of dynamic analysis is that it is not affected by code obfuscation and encryption and can analyze an application based on its malware-like behavior. However, it consumes more system resources and requires analysts with high technical capabilities, which is not conducive to large-scale testing.

In 2014, Enck et al. [8] proposed a dynamic malware detection tool, TaintDroid, which labels a variety of sensitive data and then monitors the flow paths of these tainted sensitive data in a sandbox environment in real time to determine whether the application has malicious behaviors such as privacy data leakage. RansomProber [9] can infer whether the user initiated a file encryption operation by analyzing the user interface widgets of related activities and the user's finger movement coordinates, and has a good effect in detecting encrypted ransomware. Cai et al. [10] used a variety of dynamic features based on method calls and inter-component communication (ICC) intents to achieve better robustness than static analysis and dynamic analysis that depend on system calls. Yerima et al. [11] proposed and investigated approaches based on stateful event generation and provided


much better code coverage, which leads to more accurate machine learning-based malware detection.

There are also some studies that combine static and dynamic features. Yuan et al. [12] used the requested permissions, suspicious API calls, and dynamic behaviors, with a total of 202 features, to build a complete deep learning model. Tam et al. [13] proposed CopperDroid, which can capture operations initiated in Java and in native code execution to reconstruct the behavior of Android malware based on an automatic dynamic analysis system built on VMI (virtual machine introspection).

B. FEATURES OF ANDROID MALWARE DETECTION
1) PERMISSION FEATURES
According to the Android mechanism, every Android application runs in a limited-access sandbox. If an application needs to use resources or information outside of its own sandbox, the application has to request the appropriate permissions. Therefore, malware can be found by viewing the permissions declared in the AndroidManifest.XML file. Permission features can be divided into two types: official permissions and custom permissions. There are 166 official permissions defined by Android [14], such as android.permission.INTERNET, which allows applications to open network sockets. All developers can request these permissions. By defining custom permissions, an application can share its resources and capabilities with other applications. For example, if a developer wants to prevent certain users from launching an activity in an application, the developer can define custom permissions to achieve this. After the permissions are defined, they can be referenced as part of the component definition.

Android's permission features can reflect the behavior of the application in a certain sense. Sanz et al. [15] used the permissions as features and combined them with machine learning algorithms to detect Android malware. Wang et al. [16] systematically analyzed the risk of each individual permission and the risk of a group of collaborative permissions by employing machine learning techniques. Talha et al. [17] implemented a permission-based Android malware detection system, APK Auditor, which can achieve 88% accuracy and 92.5% specificity. Li et al. [18] proposed a multi-level data pruning method, SIGPID, which includes negative-rate permission sorting, association-rule permission mining, and support-based permission sorting to extract significant permissions strategically. When using SVM as the classifier, they achieve over 90% precision, recall, accuracy, and F-measure.

2) DALVIK OPCODE FEATURES
Dalvik is a virtual machine that was used to run Android applications in early Android systems. Every time it runs, it dynamically interprets a part of the Dalvik bytecode as machine code. After Android 5.0, the Dalvik virtual machine (DVM) was replaced by Android Runtime (ART), but the compilation method of the underlying opcode is still compatible. These opcodes can reflect the behavior pattern of an application to a certain extent by means of the underlying machine code, so they are often used as static analysis features.

Jerome et al. [19] used N-Gram-based opcodes as features to detect malware and classify malware families. McLaughlin et al. [20] proposed using a deep convolutional neural network to automatically learn from the original opcode sequence, thereby eliminating the need for manually designed malware features. Zhang et al. [21] extracted several global topology features from the Dalvik opcode graph of each sample to represent malware. This method achieves better detection efficiency and robustness. Pektaş and Acarman [22] extracted the instruction call graph from a malicious application and derived an instruction call sequence to represent Android malware. The accuracy of the proposed malware detection method reached 91.42%. The model proposed by Egitmen et al. [23] extracts skip-gram-based features from the instruction sequence of an application; a word embedding vector is generated for each unique opcode to realize a high-level representation of the opcode sequence, which is used as the input feature of the detection model.

3) OTHER FEATURES
In addition to permissions and Dalvik opcodes, Android malware detection features also include API calls [24]–[26], control flow graphs (CFGs) [27], component information, and hardware information. For example, the API getDeviceId() can be used to access sensitive data and obtain the user's device ID. Therefore, studying an application's API calls is also an effective method to detect its maliciousness. Zhang et al. [28] represented opcodes with a bi-gram model and represented API calls with a frequency vector. Then, they used principal component analysis to optimize the representations and to improve the convergence speed.

Since attackers may specifically evade detection by avoiding certain permissions or API calls, employing a single kind of feature in malware detection may affect the results. Some works used combined features to detect malware. Arp et al. [29] performed extensive static analysis and collected as many application features as possible. These features are embedded in a joint vector space, which can automatically identify typical patterns that represent malware. ICCDetector [30] uses captured interactions between components within an application or across application boundaries as features to detect malware. Alazab et al. [31] combined requested permissions and API calls. Compared with benign applications, malicious applications call a different set of APIs, and malware usually requests dangerous permissions to access sensitive data more frequently than benign applications.

The purpose of adopting multiple types of features is to improve the detection effect. However, the combination of multiple features will increase the feature dimensions, making the classifier consume much time in the operation process and reducing detection efficiency. Therefore, reducing the


FIGURE 1. Framework of FAMD.

time consumed in malware detection is also a focus of research. Applying appropriate feature selection methods to reduce the features' dimensions is a solution to this problem.

III. DESIGN AND IMPLEMENTATION OF FAMD
A. THE FRAMEWORK OF FAMD
FAMD is a fast Android malware detection framework based on multifeature combination. We combine the permission features and Dalvik opcode features from different levels of the operating system. To deal with the high dimensionality problem that emerges after feature combination, a feature selection method is used to reduce the dimensionality, thereby reducing the classification cost and achieving the purpose of being fast. Specifically, the FAMD framework is divided into four parts: Android application collection, feature extraction and preprocessing, feature selection, and malware detection and family classification, as shown in Fig. 1.

• Android application collection. The applications in this work are collected from an open-source dataset and third-party markets. The collected samples are filtered by an antivirus engine to ensure the purity of the malicious and benign sets. The details will be introduced in the experiment section.

• Feature extraction and preprocessing. We use decompilation tools to extract permissions and original opcode sequences from the AndroidManifest.XML file and the classes.dex file. Based on the N-Gram method, opcode subsequences of a specific length are extracted from the original opcode sequence, and the feature vector of each sample is constructed in combination with the permission features. Finally, we construct the feature matrix with each application as a row and each of the extracted features as a column.

• Feature selection. Since the constructed feature vectors have high dimensionality, which will result in high computational cost and overfitting, we employ feature selection techniques to reduce the dimensionality. The FCBF algorithm is used to weight the features and construct the feature subset. The parameters and the number of features in the subset are decided by experiments.

• Malware detection and family classification. After distinguishing the malicious samples from benign ones, dividing malware into families is important to analyze the behaviors of malware. We use a machine learning algorithm based on the gradient boosted decision tree, CatBoost, as the classifier to detect malicious samples and classify the malware families. Evaluation metrics such as accuracy, precision, TPR, and FPR are used to verify the effectiveness and performance of the framework.

B. FEATURE EXTRACTION AND PREPROCESSING
1) EXTRACTION OF PERMISSIONS
The purpose of setting permissions is to protect the privacy of Android users. Android applications must apply for permission to access sensitive user data (such as contacts and text messages) and certain system functions (such as the camera and the Internet). Depending on the function, the system may automatically grant permissions, or the user may be prompted to approve the request. Android divides permissions into four protection levels [32], which affect whether runtime permission requests are required.

• Normal permission. This category of permissions covers situations in which the application needs to access data or resources outside its sandbox. These situations pose little risk to a user's privacy or the operation of other applications.


• Dangerous permission. Contrary to normal permissions, if an application acquires this type of permission, the user's private data will be exposed to the risk of tampering.

• Signature permission. This type of permission is only open to applications with the same signature. Even if other applications know this open data interface and also register the permissions in the AndroidManifest.XML file, they still cannot access the corresponding data due to different application signatures.

• SignatureOrSystem permission. This permission category is similar to signature permission, but it not only requires the same signature but also requires similar system-level applications. This type of permission is only used for prefabricated applications developed by general mobile phone manufacturers.

2) PROCESSING OF DALVIK OPCODE
N-Gram [33] is a method based on statistical language models. It performs a sliding window operation of size N on the content of the text, forming a sequence of byte fragments of length N. Each byte fragment is called a Gram. The frequency of occurrence of all Grams is counted and filtered according to a preset threshold to form a key Gram list, which is the feature vector space of this text, and each element in the Gram list is a feature vector dimension.

The N-Gram model is based on the following hypothesis: the Nth word's appearance is only related to the previous N − 1 words and is not related to any other words. The probability of an entire sentence occurring is the product of the probabilities of each word occurring. These probabilities can be obtained by directly counting the number of simultaneous occurrences of N words in the corpus.

In malware detection, the N-Gram method is often used to process malicious code. The N-Gram features are usually extracted from the application opcode sequences. N is usually valued at 2, 3, and 4.

TABLE 1. Dalvik instruction mapping table.

The current Dalvik instruction set [34] contains 230 instructions, including the ''Move'' instruction, the ''Invoke'' instruction, the ''Return'' instruction, and so on. Existing studies have shown that methods based on N-Grams face the prospect of exponential growth in the number of unique N-Grams as the value of N increases. Therefore, in this paper, we simplify the opcodes by removing the irrelevant instructions, retaining only the seven core instruction sets, and removing the operands. The seven instruction sets, M, R, G, I, T, P, and V, represent seven types of instructions: move, return, jump, judge, read data, store data, and call methods, respectively. The instructions are classified and described in Table 1.

According to the above mapping, we use the N-Gram to segment the opcode sequences extracted from the applications. The original opcode sequence ''move-object/from16, iget-object, invoke-virtual, goto, move-object/from16'' is taken as an example. The ''move-object/from16'' instruction corresponds to the ''M'' symbol, ''iget-object'' corresponds to the ''T'' symbol, ''invoke-virtual'' corresponds to the ''V'' symbol, and ''goto'' corresponds to the ''G'' symbol. Therefore, the sequence is simplified as ''MTVGM''. Then the 3-Gram features of the sequence are {MTV}, {TVG}, {VGM}, the 4-Gram features are {MTVG}, {TVGM}, and the 5-Gram feature is {MTVGM}.

3) CONSTRUCTION OF FEATURE VECTORS
Androguard [35] is a Python-based Android analysis tool that can analyze an Android file structure through decompilation and extract static features. All permissions and opcode sequences of each application can be extracted from the AndroidManifest.XML file and the classes.dex file through Androguard. In this work, in order to limit the dimensionality of the feature vectors and ensure the generality of the extracted features, only the 166 official permissions are extracted, without considering custom permissions. For the extracted Dalvik opcode sequence, according to the above mapping table, an opcode sequence of a specific length is extracted. These features constitute the initial feature set.

The feature set is numerically encoded in the following way to construct feature vectors. Assuming an Android application a, the feature set constructed from all applications contains n features, and the feature set is represented by S; then, the feature vector of application a is represented by equation (1).

V_a = \{v_1, v_2, \ldots, v_n\}, \quad v_i = \begin{cases} 1, & v_i \in a \text{ and } v_i \in S,\ 1 \le i \le n; \\ 0, & \text{otherwise} \end{cases}   (1)

Therefore, the feature vector can be expressed as V_a = {0, 1, 0, 0, 1, 0, 1, . . . , 1}, where 1 indicates that the feature is included in the application and 0 indicates that the feature is not included. For the sample labels, 1 represents malware, and -1 represents benign.


C. FEATURE SELECTION BASED ON FCBF
Feature selection is the process of selecting a subset of M features from N features while meeting the condition M ≤ N. The purpose of feature selection is to remove the redundant or irrelevant features from a set of features to reduce the dimensionality.

According to the execution process of the feature selection algorithm, feature selection can be divided into three categories. Filter methods rely on the general characteristics of the training data to select features independently of any classifier. Wrapper methods use the classifier as a black box and the classifier performance as the objective function to evaluate the feature subset. Embedded methods aim to reduce the computation time spent on reclassifying different subsets, which is required in wrapper methods; the main approach is to incorporate the feature selection as part of the training process.

FCBF is a fast correlation-based filter algorithm proposed by Yu and Liu [36] in 2003 and later extended by Senliol et al. [37]. It has a wide range of applications in speech recognition [38], network traffic classification [39], and other fields because of its fast calculation. The FCBF algorithm employs symmetrical uncertainty (SU) to measure the correlation between two features. The theoretical basis is that if the SU of feature X and target Y is high, and the SU of other features and target Y is low, then feature X is more important and has a higher weight. When the value of SU between two features is 1, it means that X and Y are completely correlated; in other words, if X → Y, then Y → X. When the value of SU is 0, it means that X and Y are completely independent.

The SU uses entropy and conditional entropy to calculate the correlation of features. The entropy of X is:

H(X) = -\sum_i P(x_i) \log_2 P(x_i)   (2)

and the entropy of X after observing the values of another variable Y is defined as:

H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)   (3)

where P(x_i) is the prior probability of each value of X, and P(x_i|y_j) is the posterior probability of X given the values of Y. IG(X|Y) represents the information gain:

IG(X|Y) = H(X) - H(X|Y)   (4)

Then, SU(X, Y) between X and Y is:

SU(X, Y) = \frac{2\, IG(X, Y)}{H(X) + H(Y)}   (5)

An example illustrating the process of the FCBF algorithm is described in the following 7 steps.
Step 1: Calculate the symmetrical uncertainty SU_{X_i,Y} between each feature X_i and the target Y.
Step 2: Set a threshold δ; if SU_{X_i,Y} > δ, add X_i to the feature set S_list and arrange the features in descending order according to their SU_{X_i,Y} values. Suppose that six features X_1, ..., X_6 are obtained here, with SU_{X_1,Y} being the maximum value and SU_{X_6,Y} the minimum value.
Step 3: Select feature X_1 (the first feature in S_list), with the maximum value of SU_{X_i,Y}, as the main feature X_m.
Step 4: Select the features X_2, X_3, X_4, X_5, X_6, whose symmetrical uncertainty (SU_{X_i,Y}) is less than that of the main feature (SU_{X_m,Y}) in S_list. Calculate the symmetrical uncertainty (SU_{X_i,X_m}) between each such feature and X_m, and the symmetrical uncertainty (SU_{X_i,Y}) between the feature and the category Y.
Step 5: If SU_{X_i,X_m} > SU_{X_i,Y}, then this feature is proven to be a redundant feature, and the feature is removed from S_list. Here, we assume that features X_2 and X_4 are removed.
Step 6: Add X_1 to the feature subset S_sub, and choose X_3 as the main feature X_m among the remaining features in S_list.
Step 7: If S_list is not null, repeat the process from Step 5; suppose that we remove X_6 and add X_5 to the feature subset S_sub. If S_list is null, S_sub is the terminal subset and the selection stops.

After the above process, we get the final feature subset S_sub. Compared with other algorithms, one of the advantages of the FCBF algorithm is the ability to remove redundant features. For two features X_1 and X_2 with mutual redundancy, suppose that X_1 has a higher correlation with the target Y. After calculation, feature X_1, with the higher correlation with category Y, is retained, and X_2, with the lower correlation, will be removed. At the same time, the more relevant X_1 can be used to filter other features. For a dataset with N features and M instances, the time complexity is O(MN log N), so it is a fast filtering feature selection algorithm. For the features generated by the FCBF algorithm, we sort them in descending order and then select a certain number of features to form the subset of required features.

D. MALWARE DETECTION AND FAMILY CLASSIFICATION
After using the FCBF algorithm for feature selection, the constructed feature subset will be processed by the classification algorithm, and the maliciousness of the sample will be detected. CatBoost [40], [41] is a machine learning library open-sourced by Yandex in 2017. This algorithm is similar to XGBoost [42] and LightGBM [43] and is an improved algorithm based on the framework of the gradient boosting decision tree (GBDT) algorithm. CatBoost is based on the oblivious trees algorithm, with few parameters, support for categorical variables, and high accuracy. Compared with other GBDT-based algorithms, it can process categorical features efficiently and reasonably. In addition, it can also handle the gradient bias and prediction shift problems and improve the algorithm's accuracy and generalization ability. The CatBoost algorithm mainly proposes key methods in two aspects: dealing with categorical features and ordered boosting.

We usually need to process categorical features before building a model. Suppose we have a dataset D = (X_i, Y_i), i = 1, 2, ..., n, where X_i = (x_{i,1}, ..., x_{i,m}) is a vector with m features, including numerical features and categorical features, and Y_i ∈ R is the label. The most common way to deal with categorical features in GBDT is to replace them with the average values of the labels corresponding to the categorical features. In the decision tree, the label average value will be used as the criterion for node splitting. This method is called greedy target-based statistics, and it is expressed by the formula below, where [·] denotes the Iverson bracket, i.e., [x_{j,k} = x_{i,k}] equals 1 if x_{j,k} = x_{i,k} and 0 otherwise. This procedure obviously leads to overfitting.

\frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] \cdot Y_j}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}   (6)
(SUX i ,Y ) between the feature and the category Y . j=1 [x j ,k = x i ,k ]


CatBoost uses a more efficient strategy that reduces overfitting and uses the whole dataset for training. Let σ = (σ_1, ..., σ_n) be a permutation; then x_{σ_p,k} is substituted with (7).

\frac{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}]\, Y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}] + a}   (7)

We also add a prior value P and a parameter a > 0, which is the weight of the prior. Adding a prior is a common practice and helps to reduce the noise obtained from low-frequency categories.

Prediction shift is often a problem that plagues modeling. In each iteration of GBDT, the loss function uses the same dataset to obtain the gradient of the current model and then trains to obtain the base classifier. However, this leads to gradient bias and overfitting. CatBoost replaces the gradient estimation method of traditional algorithms with ordered boosting, reducing the deviation of the gradient estimation and improving the model's generalization ability. The principle of ordered boosting is as follows. Suppose that the samples X_i are sorted by a random permutation σ. To obtain an unbiased gradient estimation, CatBoost trains a separate model M_i for each sample X_i, where model M_i is obtained by training on a set that does not contain sample X_i. Then, model M_i is used to estimate the gradient of the sample, and finally, this gradient is used to train the base learner and obtain the final model.

IV. EXPERIMENTS AND EVALUATIONS
In this section, we discuss the parameter settings and classification results of the presented FAMD framework in six different parts of the experiments. The parameter settings include the N-Gram selection and the FCBF algorithm parameter selection. In the classification, we compare the malware detection results with other classifiers, and the key feature distributions are also discussed in this part. The proposed method is compared with other state-of-the-art works, and we also evaluate the family classification results.

A. DATASETS AND EXPERIMENTAL ENVIRONMENT
The experiment uses two datasets. (1) The Drebin dataset, which contains 5,560 malicious samples and 5,666 benign samples. It is widely used as a benchmark dataset and is used to compare FAMD with other similar works. (2) The DS-1 dataset. It was collected by this work and contains a total of 25,737 applications, of which 12,989 are malicious samples and 12,748 are benign samples. The maximum size of a benign sample is 1.16 GB, while the minimum size is 8 KB. The maximum size of a malicious sample is 31.3 MB and the minimum size is 11 KB. We collected all of the benign samples from third-party markets and used VirusTotal [44] to check the maliciousness of each benign sample in order to construct a training dataset as pure as possible.

The experiments use a Dell PowerEdge 720 server with an Intel Xeon E5-2603 CPU and 64 GB RAM. The Python version is 3.7.6, and the main libraries used include NumPy, Pandas, and Skfeature.

B. EVALUATION METRICS
The evaluation metrics are defined as follows. True positive (TP): the number of samples that are actually positive and predicted positive. False positive (FP): the number of samples that are actually negative but predicted positive. False negative (FN): the number of samples that are actually positive but predicted negative. True negative (TN): the number of samples that are actually negative and predicted negative.

1) TPR
The percentage of samples correctly identified as positive out of the total positive samples.

TPR = \frac{TP}{TP + FN}   (8)

2) FPR
The percentage of samples wrongly identified as positive out of the total negative samples.

FPR = \frac{FP}{FP + TN}   (9)

3) ACCURACY
The percentage of correctly classified samples out of the total number of samples.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}   (10)

4) PRECISION
The percentage of correctly predicted positive samples out of the total predicted positive samples.

Precision = \frac{TP}{TP + FP}   (11)

5) F1-SCORE
The combination of the precision and recall metrics that serves as a compromise between them. The best F1-score equals 1, while the worst score is 0.

F1\text{-}Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}   (12)

6) ROC CURVE
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various values and threshold settings. It illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

7) AUC
The area under the ROC curve is the AUC, and its value can be used to intuitively evaluate the quality of the classifier. The closer the AUC is to 1.0, the better the detection method will be. When it is equal to 0.5, the classifier has no application value.
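For reference, definitions (8)–(12) translate directly into code. The small sketch below is our illustration; the counts in the example call are made up and are not results from the paper.

def classification_metrics(tp, fp, fn, tn):
    """Evaluation metrics of equations (8)-(12) computed from the four counts."""
    tpr = tp / (tp + fn)                          # (8), also called recall
    fpr = fp / (fp + tn)                          # (9)
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # (10)
    precision = tp / (tp + fp)                    # (11)
    f1 = 2 * precision * tpr / (precision + tpr)  # (12), recall equals TPR
    return {"TPR": tpr, "FPR": fpr, "Accuracy": accuracy,
            "Precision": precision, "F1": f1}

# Example with made-up counts:
print(classification_metrics(tp=95, fp=5, fn=3, tn=97))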


TABLE 2. The number of sample failures while extracting different features.
TABLE 3. Accuracy comparison of different N values.
TABLE 4. Accuracy comparison of different feature combinations.

C. EXPERIMENTAL RESULTS
1) N-GRAM SETTING
For Dalvik opcodes based on the N-Gram, the value of N affects two aspects: the classification accuracy and the number of features. We use the DS-1 dataset to set the segmentation length of the Dalvik opcode. When the length is set to N = [2, 3, 4, 5], the corresponding length of the N-Gram opcode sequence is extracted. The extracted features are input into the CatBoost classifier, and 10-fold cross-validation is selected to find the most appropriate length of N.
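The selection procedure just described can be expressed in a few lines. The sketch below is only an outline under our own assumptions: the feature matrices are random placeholders standing in for the permission-plus-N-Gram vectors built in Section III, and the iteration count is arbitrary.

import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: in FAMD these matrices would be the binary permission +
# N-Gram feature vectors of the DS-1 samples for each candidate value of N.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
features_by_n = {n: rng.integers(0, 2, size=(400, 100 * n)) for n in (2, 3, 4, 5)}

for n, X in features_by_n.items():
    clf = CatBoostClassifier(iterations=100, verbose=False)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"N={n}: mean 10-fold accuracy = {scores.mean():.4f}")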
Due to the diverse designs of Android applications, especially malware applications that deliberately evade certain features, a single kind of feature (such as permissions) cannot be extracted from some APK files. This results in many samples being ignored in classification. We compare the number of samples whose extraction failed when extracting features individually and when extracting combined features on the DS-1 dataset, as shown in Table 2.

FIGURE 2. The accuracy of the FCBF algorithm using different parameters.

It can be seen from Table 2 that when extracting the permission features combined with N-Gram opcodes, the features can be extracted from most of the samples, which is better than extracting a single kind of feature. When extracting permission features, benign samples are more difficult to extract. Since we extracted and discussed official Android permissions in this work, this may be related to some samples that employ more custom permissions. The feature extraction of N-Gram opcodes is the opposite: there are more cases of failed extraction for malware, such as the sample (SHA-1: 8d2795c2e790c54b401fd52eb56279f6af0a07fb), which is small in size and performs malicious behaviors through calling permissions.

We compare the accuracy and the number of constructed features with different N-Gram lengths of Dalvik opcodes in Table 3. It can be seen that as the value of N gradually increases, the classification accuracy gets better, but the growth trend becomes smaller. However, as N increases, the number of features also increases obviously, leading to increased computational consumption.

We combine the extracted permission features with the Dalvik opcode features and apply different values of N; the results are shown in Table 4. The combination of the two kinds of features achieves better accuracy than any single feature kind. The best accuracy is 96.21% when the features are ''Permission with 5-Gram'', and the second-best result is 95.84% with ''Permission with 4-Gram''.

According to the results in Table 3 and Table 4, as well as considering the accuracy and the feature dimensionality, we set the value of N to 4 and employ ''Permission with 4-Gram'' as the features in the following experiments.

2) FCBF ALGORITHM PARAMETER SETTING
Since the original feature set has high dimensionality, we use the FCBF algorithm to perform feature selection and construct an appropriate feature subset. We set the range of the threshold δ to [0.005, 0.03], with an interval of 0.005, and the feature numbers are set in the range of 100 to 500. The result is shown in Fig. 2.

From Fig. 2, as the number of features increases, the detection results get better. The best accuracy is achieved when the threshold δ is set to 0.005 and the number of features is set to 500, and these are the chosen FCBF parameters in this work.

3) COMPARE WITH OTHER CLASSIFIERS
The DS-1 dataset is used as the experimental data, with 70% of the data used as the training set and the rest as the test set.


TABLE 5. Experimental results of different classifiers.
FIGURE 3. ROC of different classifiers.

For all base classifiers, we use the grid search method to find each classifier's best parameters. The specific experimental results are shown in Table 5. The value of the AUC is only counted in the comparison when the number of features is 500, as shown in Fig. 3. It can be seen from Table 5 and Fig. 3 that, as the number of features increases, the various experimental indicators of the CatBoost classifier are better than those of the other classifiers in most cases. When the number of features reaches 500, the CatBoost classifier we used achieves 95.29% accuracy.

4) ANALYSIS OF THE KEY FEATURES
We count the importance of the top 10 features ranked by FCBF and their distribution in malware and benign applications. The results are shown in Table 6. The top 10 features include both permissions and opcodes, which proves that these two kinds of features indeed have classification ability.

In addition, there are significant differences in the feature distributions of malicious and benign applications. For example, the permission ''RECEIVE_SMS'' appears in 38.51% of malware but only 4.24% of benign applications, and the permission ''READ_PHONE_STATE'' appears in 92.82% of malware and 54.13% of benign applications. In other words, compared with benign applications, malware makes more attempts to obtain the user's SMS information and device identification.

5) COMPARE WITH OTHER EXPERIMENTS
We compare FAMD with other state-of-the-art works on the Drebin dataset. Since the evaluation metrics in each paper are different, we have collected as comprehensive information as possible, as shown in Table 7.

It can be seen from Table 7 that FAMD can outperform most other works in terms of accuracy. Moreover, FAMD only needs an average of 0.28 s to analyze an application. Hence, FAMD achieves the purpose of establishing a lightweight and efficient detection framework.

6) MALWARE FAMILY CLASSIFICATION
In addition to malware detection, malware family classification is also a concern of this framework. We take the top 20 malware families in the Drebin dataset and conduct family classification with the presented methods. The precision, F1-score, and recall indexes are shown in Fig. 4, and the confusion matrix is shown in Fig. 5 to provide a graphical overview. Most of the samples are correctly classified into their respective families.
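For readers who want to reproduce this kind of experiment, the following sketch shows how such a multi-class run can be wired up with the CatBoost and scikit-learn libraries. The data here are synthetic placeholders, and none of the settings should be read as the authors' exact configuration.

import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the FCBF-selected binary feature matrix
# (in the paper: one row per app, 500 permission / 4-Gram features)
# and pretend family labels 0..4.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(600, 50))
y = rng.integers(0, 5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = CatBoostClassifier(loss_function="MultiClass", iterations=200, verbose=False)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test).ravel().astype(int)
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # the kind of matrix visualized in Fig. 5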


TABLE 6. Distribution of the top-10 features in malware and benign applications.
TABLE 7. Comparison with related work.
FIGURE 4. Classification results of the top 20 malware families in the Drebin dataset.
FIGURE 5. Confusion matrix of the top 20 Drebin malware families.

The overall accuracy of the malware family classification is 97.38%, which shows the effectiveness and feasibility of FAMD in family processing after malware detection. However, the ExploitLinuxLotoor family has a lower precision. For the Geinimi family, we only successfully extracted the features of 17 samples; in other words, permissions and opcode features could not be extracted from most samples of this family. This shows that our framework performs unsatisfactorily in detecting malware that carries out certain activities, which is what we need to improve in our future work.

V. CONCLUSION
The number of applications that can be classified as malware continues to increase, and new types of malware and camouflage techniques are constantly being updated; effectively detecting malware in a relatively short time is therefore of considerable significance to third-party application markets and users. How to improve the detection accuracy and reduce the detection time are still problems to be solved.

We present a fast Android malware detection framework, FAMD, which combines permission features and Dalvik opcode features from different operation levels to construct feature vectors. To reduce the feature dimensionality and the time complexity of the method, the FCBF algorithm is employed for feature selection. As a classifier proposed in recent years, CatBoost is employed in this work to conduct malware detection and family classification.

In the experiments, we segment the opcodes with 4-Gram and vectorize the features combined with permissions. With CatBoost as the classifier, the result achieves an accuracy of 97.40% in malware detection and 97.38% in family classification. Compared with other state-of-the-art works, FAMD performs better comprehensively in accuracy and time consumption. It can also be seen in the experiments that there is a clear difference in the distribution of certain key features in malicious applications and benign applications.

194738 VOLUME 8, 2020


H. Bai et al.: FAMD: Fast Multifeature Android Malware Detection Framework, Design, and Implementation

Since CatBoost is a supervised learning framework, this work is inadequate in detecting newly emerging malicious applications, which we aim to improve in further work.

REFERENCES
[1] (2020). Smartphone Market Share. [Online]. Available: https://www.idc.com/promo/smartphone-market-share/os
[2] (2020). Mobile Malware Evolution 2019. [Online]. Available: https://securelist.com/mobile-malware-evolution-2019/96280
[3] W. Enck, M. Ongtang, and P. McDaniel, ''On lightweight mobile phone application certification,'' in Proc. 16th ACM Conf. Comput. Commun. Secur. (CCS), 2009, pp. 235–245.
[4] A. Saracino, D. Sgandurra, G. Dini, and F. Martinelli, ''MADAM: Effective and efficient behavior-based Android malware detection and prevention,'' IEEE Trans. Dependable Secure Comput., vol. 15, no. 1, pp. 83–97, Jan. 2018.
[5] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, ''A multimodal deep learning method for Android malware detection using various features,'' IEEE Trans. Inf. Forensics Security, vol. 14, no. 3, pp. 773–788, Mar. 2019.
[6] H. Zhang, S. Luo, Y. Zhang, and L. Pan, ''An efficient Android malware detection system based on method-level behavioral semantic analysis,'' IEEE Access, vol. 7, pp. 69246–69256, 2019.
[7] L. Onwuzurike, E. Mariconti, P. Andriotis, E. De Cristofaro, G. Ross, and G. Stringhini, ''MaMaDroid: Detecting Android malware by building Markov chains of behavioral models (extended version),'' ACM Trans. Privacy Secur., vol. 22, no. 2, pp. 1–34, 2019.
[8] W. Enck, P. Gilbert, S. Han, V. Tendulkar, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth, ''TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones,'' ACM Trans. Comput. Syst., vol. 32, no. 2, pp. 1–29, Jun. 2014.
[9] J. Chen, C. Wang, Z. Zhao, K. Chen, R. Du, and G.-J. Ahn, ''Uncovering the face of Android ransomware: Characterization and real-time detection,'' IEEE Trans. Inf. Forensics Security, vol. 13, no. 5, pp. 1286–1300, May 2018.
[10] H. Cai, N. Meng, B. Ryder, and D. Yao, ''DroidCat: Effective Android malware detection and categorization via app-level profiling,'' IEEE Trans. Inf. Forensics Security, vol. 14, no. 6, pp. 1455–1470, Jun. 2019.
[11] S. Y. Yerima, M. K. Alzaylaee, and S. Sezer, ''Machine learning-based dynamic analysis of Android apps with improved code coverage,'' EURASIP J. Inf. Secur., vol. 2019, no. 1, p. 4, Dec. 2019.
[12] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue, ''Droid-sec: Deep learning in Android malware detection,'' in Proc. ACM Conf. SIGCOMM, 2014, pp. 371–372.
[13] K. Tam, S. J. Khan, A. Fattori, and L. Cavallaro, ''CopperDroid: Automatic reconstruction of Android malware behaviors,'' in Proc. Netw. Distrib. Syst. Secur. Symp., 2015, pp. 1–15.
[14] (2020). Android Permission. [Online]. Available: https://developer.android.com/reference/android/Manifest.permission
[15] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, J. Nieves, P. G. Bringas, and G. Á. Marañón, ''Mama: Manifest analysis for malware detection in Android,'' Cybern. Syst., vol. 44, nos. 6–7, pp. 469–488, Oct. 2013.
[16] W. Wang, X. Wang, D. Feng, J. Liu, Z. Han, and X. Zhang, ''Exploring permission-induced risk in Android applications for malicious application detection,'' IEEE Trans. Inf. Forensics Security, vol. 9, no. 11, pp. 1869–1882, Nov. 2014.
[17] K. A. Talha, D. I. Alper, and C. Aydin, ''APK auditor: Permission-based Android malware detection system,'' Digit. Invest., vol. 13, pp. 1–14, Jun. 2015.
[18] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-an, and H. Ye, ''Significant permission identification for machine-learning-based Android malware detection,'' IEEE Trans. Ind. Informat., vol. 14, no. 7, pp. 3216–3225, Jul. 2018.
[19] Q. Jerome, K. Allix, R. State, and T. Engel, ''Using opcode-sequences to detect malicious Android applications,'' in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2014, pp. 914–919.
[20] N. McLaughlin, J. M. D. Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, Y. Safaei, E. Trickel, Z. Zhao, A. Doupé, and G. J. Ahn, ''Deep Android malware detection,'' in Proc. 7th ACM Conf. Data Appl. Secur. Privacy, 2017, pp. 301–308.
[21] J. Zhang, Z. Qin, K. Zhang, H. Yin, and J. Zou, ''Dalvik opcode graph based Android malware variants detection using global topology features,'' IEEE Access, vol. 6, pp. 51964–51974, 2018.
[22] A. Pektaş and T. Acarman, ''Learning to detect Android malware via opcode sequences,'' Neurocomputing, vol. 396, pp. 599–608, Jul. 2020.
[23] A. Egitmen, I. Bulut, R. C. Aygun, A. B. Gunduz, O. Seyrekbasan, and A. G. Yavuz, ''Combat mobile evasive malware via skip-gram-based malware detection,'' Secur. Commun. Netw., vol. 2020, pp. 1–10, Apr. 2020.
[24] Y. Aafer, W. Du, and H. Yin, ''DroidAPIMiner: Mining API-level features for robust malware detection in Android,'' in Proc. Int. Conf. Secur. Privacy Commun. Syst. Cham, Switzerland: Springer, Sep. 2013, pp. 86–103.
[25] L. Cen, C. S. Gates, L. Si, and N. Li, ''A probabilistic discriminative model for Android malware detection with decompiled source code,'' IEEE Trans. Dependable Secure Comput., vol. 12, no. 4, pp. 400–412, Jul. 2015.
[26] S. Hou, Y. Ye, Y. Song, and M. Abdulhayoglu, ''HinDroid: An intelligent Android malware detection system based on structured heterogeneous information network,'' in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2017, pp. 1507–1515.
[27] Z. Ma, H. Ge, Y. Liu, M. Zhao, and J. Ma, ''A combination method for Android malware detection based on control flow graphs and machine learning algorithms,'' IEEE Access, vol. 7, pp. 21235–21245, 2019.
[28] J. Zhang, Z. Qin, H. Yin, L. Ou, and K. Zhang, ''A feature-hybrid malware variants detection using CNN based opcode embedding and BPNN based API embedding,'' Comput. Secur., vol. 84, pp. 376–392, Jul. 2019.
[29] D. Arp, M. Spreitzenbarth, C. Siemens, M. Hübner, H. Gascon, and K. Rieck, ''Drebin: Effective and explainable detection of Android malware in your pocket,'' in Proc. Netw. Distrib. Syst. Secur. Symp., vol. 14, 2014, pp. 23–26.
[30] K. Xu, Y. Li, and R. H. Deng, ''ICCDetector: ICC-based malware detection on Android,'' IEEE Trans. Inf. Forensics Security, vol. 11, no. 6, pp. 1252–1264, Jun. 2016.
[31] M. Alazab, M. Alazab, A. Shalaginov, A. Mesleh, and A. Awajan, ''Intelligent mobile malware detection using permission requests and API calls,'' Future Gener. Comput. Syst., vol. 107, pp. 509–521, Jun. 2020.
[32] Permissions Overview. Accessed: Oct. 27, 2020. [Online]. Available: https://developer.android.google.cn/guide/topics/permissions/overview
[33] W. B. Cavnar and J. M. Trenkle, ''N-gram-based text categorization,'' in Proc. 3rd Annu. Symp. Document Anal. Inf. Retr., vol. 161175, 1994, pp. 1–14.
[34] Dalvik Bytecode. Accessed: Oct. 27, 2020. [Online]. Available: https://source.android.com/devices/tech/dalvik/dalvik-bytecode
[35] Androguard. Accessed: Oct. 27, 2020. [Online]. Available: https://github.com/androguard/androguard
[36] L. Yu and H. Liu, ''Feature selection for high-dimensional data: A fast correlation-based filter solution,'' in Proc. 20th Int. Conf. Mach. Learn. (ICML), 2003, pp. 856–863.
[37] B. Senliol, G. Gulgezen, L. Yu, and Z. Cataltepe, ''Fast correlation based filter (FCBF) with a different search strategy,'' in Proc. 23rd Int. Symp. Comput. Inf. Sci., Oct. 2008, pp. 1–4.
[38] D. Gharavian, M. Sheikhan, A. Nazerieh, and S. Garoucy, ''Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network,'' Neural Comput. Appl., vol. 21, no. 8, pp. 2115–2126, Nov. 2012.
[39] A. W. Moore and D. Zuev, ''Internet traffic classification using Bayesian analysis techniques,'' in Proc. ACM SIGMETRICS Int. Conf. Meas. Modeling Comput. Syst. (SIGMETRICS), 2005, pp. 50–60.
[40] A. V. Dorogush, V. Ershov, and A. Gulin, ''CatBoost: Gradient boosting with categorical features support,'' 2018, arXiv:1810.11363. [Online]. Available: http://arxiv.org/abs/1810.11363
[41] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, ''CatBoost: Unbiased boosting with categorical features,'' in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 6638–6648.
[42] T. Chen and C. Guestrin, ''XGBoost: A scalable tree boosting system,'' in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 785–794.
[43] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, ''LightGBM: A highly efficient gradient boosting decision tree,'' in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3146–3154.
[44] VirusTotal. Accessed: Oct. 27, 2020. [Online]. Available: https://www.virustotal.com/gui/home/upload
[45] G. Canfora, A. De Lorenzo, E. Medvet, F. Mercaldo, and C. A. Visaggio, ''Effectiveness of opcode ngrams for detection of multi family Android malware,'' in Proc. 10th Int. Conf. Availability, Rel. Secur., Aug. 2015, pp. 333–340.
[46] K. Riad and L. Ke, ''RoughDroid: Operative scheme for functional Android malware detection,'' Secur. Commun. Netw., vol. 2018, pp. 1–10, Sep. 2018.


HONGPENG BAI received the B.S. degree from Liaoning Shihua University, in 2016. He is currently pursuing the master's degree in computer technology with the Changchun University of Science and Technology, China. His research interests include machine learning and malware detection.

NANNAN XIE received the B.S. degree in software engineering and the Ph.D. degree in computer system architecture from Jilin University, in 2010 and 2015, respectively. She was a Postdoctoral Researcher with Beijing Jiaotong University, China, from 2015 to 2017. She is currently a Lecturer and a master's Supervisor with the Changchun University of Science and Technology. She has published about 20 scientific papers in various journals and international conferences. Her main research interests include network intrusion detection and mobile security.

XIAOQIANG DI received the B.S. degree in computer science and technology and the M.S. and Ph.D. degrees in communication and information systems from the Changchun University of Science and Technology, in 2002, 2007, and 2014, respectively. He was a Visiting Scholar with the Norwegian University of Science and Technology, Norway, from August 2012 to August 2013. He is currently a Professor and a Ph.D. Supervisor with the Changchun University of Science and Technology. His major research interests include network information security and integrated networks.

QING YE received the B.S. degree in computer science and technology from Northeast Normal University, in 1992, and the M.S. degree from the Changchun University of Science and Technology, in 1995. She is currently a Professor with the Changchun University of Science and Technology. Her major research interests include database and data mining, software engineering and information systems, and machine learning.
