A LSTM Based Framework For Handling Multiclass Imbalance in DGA Botnet Detection

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
https://doi.org/10.1016/j.neucom.2017.11.018

Article history: Received 26 May 2017; Revised 22 October 2017; Accepted 6 November 2017; Available online 16 November 2017. Communicated by Dr. Xin Luo.

Keywords: Multiclass imbalance; Long Short-Term Memory Networks; LSTM; Cost-sensitive learning

Abstract

In recent years, botnets have become a major threat on the Internet. Most sophisticated bots use Domain Generation Algorithms (DGA) to pseudo-randomly generate a large number of domains and select a subset in order to communicate with the Command and Control (C&C) server. The basic aim is to avoid blacklisting and sinkholing, and to evade security systems. The Long Short-Term Memory network (LSTM) provides a means to combat this botnet type. It operates on raw domains and is amenable to immediate application. LSTM is, however, prone to the multiclass imbalance problem, which becomes even more significant in DGA malware detection; this is due to the fact that many DGA classes have very little support in the training dataset. This paper presents a novel LSTM.MI algorithm that combines binary and multiclass classification models, where the original LSTM is adapted to be cost-sensitive. The cost items are introduced into the backpropagation learning procedure to take into account the identification importance among classes. Experiments are carried out on a real-world collected dataset. They demonstrate that LSTM.MI provides an improvement of at least 7% in terms of macro-averaging recall and precision as compared to the original LSTM and other state-of-the-art cost-sensitive methods. It is also able to preserve the high accuracy on the non-DGA generated class (0.9849 F1-score), while helping recognize 5 additional bot families.

© 2017 Elsevier B.V. All rights reserved.
1. Introduction

Botnet is a collection of malware-infected machines or bots. It has become the main means for cyber-criminals to send spam mails, steal personal data and launch distributed denial of service attacks [1]. Most bots today rely on a domain generation algorithm (DGA) to generate a list of candidate domain names in the attempt to connect with the so-called command and control (C&C) server. This algorithm is also known as domain fluxing, where the domain list is changed over time to avoid the limitations that allow researchers to shut down botnets [2].

Uncovering DGA is critical in the security community. Existing solutions are largely based on reverse engineering and blacklisting the C&C domains to detect bots and block their traffic. Reverse engineering requires a malware sample that may not always be possible to obtain in practice [1,2]. Blacklisting, on the other hand, becomes increasingly difficult as the rate of dynamically generated domains increases [3]. Kührer et al. [4] empirically assessed 15 public malware blacklists and 4 blacklists operated by anti-virus vendors. They found that up to 26.5% of the domains were still missed for the majority of malware families; not a single blacklist is sufficient to protect against malwares that use DGA.

In addition to these techniques, another family of methods emerges when the use of DGA classification is considered. The basic motivation is to identify malicious domains and, more importantly, find the related malware structure. In general, DGA categorization can be treated as a multiclass task. It can be either retrospective or real-time detection [3]. Retrospective detection divides domains into clusters to obtain statistical properties such as history of suspicious activities. Real-time detection is a much harder problem; it is based on the sole domain name and linguistic features to build the model [2]. Using linguistic properties has a potential drawback because they can be easily bypassed by the malware author, while deriving a new set of features is rather challenging.

Woodbridge et al. [3] overcame this problem by introducing the feature-less Long Short-Term Memory network (LSTM) [5,6]. This algorithm could be deployed in almost any setting. It was also demonstrated to provide a 90% detection rate at a 10−4 false positive (FP) rate. While effective, LSTM is naturally sensitive to the multiclass imbalance problem. Due to the time and cost required to collect samples, the non-DGA generated class vastly outnumbers other DGA generated classes; it is common that the imbalance ratio is
on the order of 1:1000. LSTM is accuracy-oriented; it tends to bias towards the prevalent class, leaving a number of DGA families undetectable.

This work introduces a novel LSTM.MI algorithm that is robust to imbalanced class distribution. The basic idea is to rely on both binary and multiclass classification models, where LSTM is adapted to include cost items into its backpropagation learning mechanism. The cost items take into account the identification importance between classes [7]. It is noticeable that dealing with class imbalance has been an active research area over the last decade. Although numerous solutions were developed, the majority of efforts so far have focused on the two-class scenario [8]. Zhou and Liu [9] pointed out that handling multiclass classification poses new challenges and is more difficult than handling the binary one.

The LSTM.MI algorithm is investigated through experiments on a real-world collected dataset, where the multiclass imbalance prevails. We observe that this algorithm is able to provide an improvement of at least 7% in terms of macro-averaging recall, precision and F1-score with respect to the original LSTM and other state-of-the-art solutions.

The structure of this paper is as follows. Section 2 provides an overview on DGAs, DGA classification methods and LSTM. The LSTM.MI algorithm is discussed in detail in Section 3. Section 4 presents our empirical results. Section 5 is dedicated to conclusion and future works.

2. Background and related works

2.1. Domain Generation Algorithms

A Domain Generation Algorithm (DGA) is based on seeds that involve current data and static integer values in creating a pseudo-random string. It also appends a top level domain (TLD) such as .ru, .com, etc. to establish a proper domain name. Hardcoded IP addresses can be blacklisted and blocked; modern botnets use DGA to build a resilient command and control (C&C) infrastructure [2]. Using dynamic domains makes it difficult for law enforcement agencies to eliminate botnets [10].

DGA malwares may vary in complexity. Conficker.C generates 50,000 domain names and attempts to contact 500 [11]. W32.Virut uses public-key cryptography to prevent their commands from being mimicked. Suppobox concatenates various words using a (typical) English dictionary, while Ramnit models the real domain distribution [3]. Since the domain names are short-lived, blacklisting and sinkholing would be ineffective. Detecting DGA botnets becomes a challenging task in computer security.
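To make the seed-based mechanism concrete, the following is a minimal illustrative sketch of a DGA that derives pseudo-random domains from the current date, a hard-coded seed and a counter, then appends a TLD. It is a toy example for illustration only, not the algorithm of any specific malware family discussed in this paper:

```python
import hashlib
from datetime import date

def toy_dga(seed: int, day: date, count: int, tld: str = ".com"):
    """Illustrative seed-based DGA: hash the date and a static seed,
    map the digest to lowercase letters, and append a TLD."""
    domains = []
    for i in range(count):
        data = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.md5(data).hexdigest()
        # keep 12 pseudo-random letters derived from the hash digest
        name = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:12])
        domains.append(name + tld)
    return domains

print(toy_dga(seed=42, day=date.today(), count=3))
```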
2.2. DGA classification algorithms

DGA classification is essential in identifying the group related to malware generated domains. It provides a deep understanding to recognize bots that share a similar DGA [2]. Perdisci et al. [12] took advantage of bulk predictions on a subset of domains to derive network features such as number of resolved IPs, IP diversity and prefixes. A C4.5 decision tree was also constructed to distinguish between malicious flux services and other networks. Bilge et al. [13] proposed EXPOSURE using a J48 tree and various time-based, DNS answer-based and domain-based properties. The authors demonstrated that EXPOSURE is able to detect all kinds of non-benign domains including dropzones. McGrath and Gupta [14] measured characteristics including whois records, lexical features and IP addresses of phishing and non-phishing URLs. Ma et al. [15] utilized length of the domain name, host name and other properties to determine bots used for advertising spam. Yadav et al. [16] computed the distribution of alphabetic characters and bi-grams in all domains. Antonakakis et al. [1] exploited similar features to indicate the various malware families. The C&C detection was based on Hidden Markov Models and active DGA generated domains.

Salomon and Brustoloni [17] observed a high similarity between domain names that belong to the same DGA. Kwon et al. [18] developed a scalable approach called PsyBoG. The algorithm leveraged power spectral density (PSD) analysis to discover the major frequencies given by the periodic DNS queries of botnets. It produced a detection rate of 95% and identified 23 unknown and 26 known malware families with 0.1% false positives (FP). Gu et al. [19] recognized all the Kraken and 99.8% of Conficker bots using a chain of quantity and linguistic evidence, and traffic collected from a single network.

In all the above works, DGA classification is achieved retrospectively. Krishnan et al. [20] argued that retrospective detection is not amenable to immediate applications. It requires several hours before the domain clusters meet the minimum threshold to produce a good performance. Schiavoni et al. [2] presented a real-time DGA detection using two classes of linguistic properties, i.e., meaningful character ratio and n-gram normality score. Nonetheless, the linguistic/statistical features are language dependent and can be easily circumvented by the malware author.

2.3. Long Short-Term Memory network

The Long Short-Term Memory network (LSTM) [5,6] is a deep Recurrent Neural Network (RNN) that is better than the conventional RNN on tasks involving long time lags [21]. It has been applied in a range of applications such as language modeling [22], speech recognition [23] and DGA botnet detection [3]. The LSTM basic unit is the memory block containing one or more memory cells and three multiplicative gating units (see Fig. 1(a)). For the gates, f is a logistic sigmoid, ranging from 0 to 1. Let us assume that w_{lm} is the weight on the connection between the source unit m and target unit l, with l ∈ {out, in, ϕ, c_j^υ} denoting the output gate, input gate, forget gate and the cell, respectively. We also consider discrete time steps t. A single step involves the update of all units (i.e., forward pass) and the error computation of w_{lm} (i.e., backward pass). The cell state s_{c_j^υ} is updated based on its current state z_{c_j^υ} and other sources of input, i.e., z_ϕ, z_in and z_out. The forget gate (ϕ) is designed to learn to reset blocks once their contents are useless [6].

LSTM aims at minimizing the cost function (we assume softmax) of the network with respect to the actual output y^k of the k-th output neuron and the target t^k:

E(t) = -\sum_{p \in \mathrm{Samples}} \sum_{k} t^{k}(t) \log y^{k}(t)   (1)

This can be done by using gradient descent with a truncated version of real-time recurrent learning (RTRL) [21]. The weight changes for the output gate, input gate, forget gate and the cell are determined using the learning rate α as follows:

\Delta w_{\mathrm{out}_j m}(t) = \alpha \, f'_{\mathrm{out}_j}(z_{\mathrm{out}_j}(t)) \sum_{\upsilon} s_{c_j^{\upsilon}}(t) \sum_{k} w_{k c_j^{\upsilon}} \delta_k(t) \, y^{m}(t-1)   (2)

\Delta w_{\mathrm{in}_j m}(t) = \alpha \sum_{\upsilon} e_{s_{c_j^{\upsilon}}}(t) \frac{\partial s_{c_j^{\upsilon}}(t)}{\partial w_{\mathrm{in}_j m}}   (3)

\Delta w_{\varphi_j m}(t) = \alpha \sum_{\upsilon} e_{s_{c_j^{\upsilon}}}(t) \frac{\partial s_{c_j^{\upsilon}}(t)}{\partial w_{\varphi_j m}}   (4)

\Delta w_{c_j^{\upsilon} m}(t) = \alpha \, e_{s_{c_j^{\upsilon}}}(t) \frac{\partial s_{c_j^{\upsilon}}(t)}{\partial w_{c_j^{\upsilon} m}}   (5)
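As a reference point for the loss of Eq. (1), the following minimal NumPy sketch evaluates the categorical cross-entropy over a small batch of one-hot targets; the array names and values are illustrative only:

```python
import numpy as np

def categorical_cross_entropy(targets: np.ndarray, outputs: np.ndarray) -> float:
    """E = -sum over samples and classes of t_k * log(y_k), cf. Eq. (1)."""
    eps = 1e-12                       # avoid log(0)
    return float(-np.sum(targets * np.log(outputs + eps)))

# two samples, three classes
t = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)   # one-hot targets
y = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])    # softmax outputs
print(categorical_cross_entropy(t, y))
```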
Fig. 1. (a) LSTM memory block with only one cell; (b) the LSTM.MI algorithm.
Eq. (7) can be rewritten as

\delta_k(t) = -\frac{\partial E(t)}{\partial y^{k}(t)} f'_k(z_k(t)) = f'_k(z_k(t)) \frac{t^{k}(t)}{y^{k}(t)} C[\mathrm{class}(p), k]   (9)

e_{s_{c_j^{\upsilon}}}(t) = y_{\mathrm{out}_j}(t) \sum_{k} w_{k c_j^{\upsilon}} f'_k(z_k(t)) \frac{t^{k}(t)}{y^{k}(t)} C[\mathrm{class}(p), k]   (10)

The change in Eqs. (6) and (7) results in a change in Eqs. (2)–(5), where C[class(p), k] is taken into account. The cost item controls the magnitude of the weight updates [37]. Samples with larger training error are emphasized, making the learning intentionally biased towards the small classes. The cost-sensitive LSTM is thus able to address the multiclass imbalance without requiring class decomposition.

Given a data set, the cost matrix is usually unknown. A genetic algorithm can be applied to choose the optimal cost matrix. This is however very time-consuming and may not be possible to achieve in practice [8]. For the sake of convenience, we assume that samples in one category are equally costly. C[i, i] denotes the misclassification cost of class i, which is generated using the class distribution as

C[i, i] = \left( \frac{1}{n_i} \right)^{\gamma}   (11)

where γ ∈ [0, 1] is a trade-off parameter. γ = 1 implies that C[i, i] is inversely proportional to the class size n_i; the quantities of the small and prevalent classes are rebalanced into a 1:1:…:1 ratio [35]. On the other hand, γ = 0 implies that the cost-sensitive LSTM reduces to the original LSTM, which is cost-insensitive.
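A minimal sketch of Eq. (11), assuming only that the class sizes n_i are known. The resulting per-class costs can also be handed to Keras through the class_weight argument of model.fit as a rough approximation; note that the paper injects the costs directly into the backpropagation updates of Eqs. (9) and (10), so this is a simplification, not the authors' exact implementation:

```python
import numpy as np

def class_costs(class_counts, gamma=0.3):
    """C[i, i] = (1 / n_i) ** gamma, cf. Eq. (11).
    gamma = 0 recovers the cost-insensitive LSTM; gamma = 1 fully rebalances."""
    counts = np.asarray(class_counts, dtype=float)
    return (1.0 / counts) ** gamma

# imbalanced toy distribution: 40 non-DGA, 20 Pykspa, 5 Corebot samples
costs = class_costs([40, 20, 5], gamma=0.3)
class_weight = {i: c for i, c in enumerate(costs)}   # e.g. model.fit(..., class_weight=class_weight)
print(class_weight)
```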
3.2. The essence of cost-sensitive LSTM

The gap between the cost-insensitive and cost-sensitive LSTM can also be interpreted from the movement of the decision boundaries [35]. Consider the multiclass problem with an imbalance ratio of 1:8 as an example. The training data consists of 40 non-DGA (red circle), 20 Pykspa (blue cross) and 5 Corebot (black square) samples, which are projected into a 2-dimensional space using t-SNE [45]. The non-DGA samples are collected from the Alexa top 100,000 sites.

Fig. 2(a) illustrates the cost-insensitive LSTM, which provides an accuracy of 92.5% for the prevalent class, while the accuracies for the small classes are 80% and 0%, respectively. In Fig. 2(b), higher cost items are associated with the Pykspa and Corebot classes. From the optimization viewpoint, the decision boundaries are pushed towards the prevalent class [35]. A possible drawback is that the cost-sensitive LSTM increases the accuracy (60%) on the Corebot class, while decreasing the detection rate (87.5%) related to the non-DGA generated domains.

From Fig. 2(c), a larger γ value leads to higher recalls and lower precisions on the small classes. In general, one has to select an appropriate γ such that both these measures are balanced to achieve a better overall F1-score and geometric G-mean [7]. The boundary movement phenomenon can be explained as either replicating minority samples (oversampling) or removing redundant majority samples (undersampling). It demonstrates the relationship between algorithmic and sampling approaches to cost-sensitive learning. Similar observations can be found in [35].

3.3. Proposed LSTM.MI algorithm

The cost-sensitive LSTM has an intrinsic weakness. It tends to reduce the accuracy on the prevalent non-DGA (Alexa) class (see Fig. 2(b) and (c)). This, in turn, interrupts legitimate Internet traffic since highly secure applications potentially block DNS queries that are assumed to be malicious.

To avoid the above problem, the LSTM.MI algorithm (see Fig. 1(b)) relies on a two-class model, where all the abnormal domains are grouped into a single DGA class. It receives as input a domain name d and classifies whether d appears to be automatically generated. The intuitive motivation is to produce the optimal data construction for lowering the imbalance ratio and achieving a high detection rate on the non-DGA class. Once d is deemed to be automatically generated, the algorithm subsequently attempts to assign the correct label to the suspicious domain. The fingerprinting is built upon a multiclass model; each category is analogous to a malware family. In both stages, the cost-sensitive LSTM is applied to deal with raw domain names. Fig. 2(d) provides an overview of the effect of LSTM.MI, where γ is set equal to 0.3. In the two-class model, the DGA class contains both the Pykspa and Corebot samples. The imbalance ratio is reduced from 1:8 to 1:1.6. It becomes obvious that the LSTM.MI algorithm is able to produce accuracies of 85% and 60% for the Pykspa and Corebot classes, while retaining the high performance (92.5%) on the Alexa class.

For simplicity, we generate a dictionary containing all the possible valid characters (underscore, dash, period and alphanumeric). The characters are indexed and the domain d is mapped into a real-valued vector using these indices. The vector is also padded with 0s to have the length L. It is then projected to create a D×L matrix in the embedding layer. L denotes the maximum domain length in the training data, while D is a parameter that represents the degree of freedom in the model [3]. In this paper, we choose D = 128. It should be noted that increasing D does not necessarily produce higher overall accuracy, but introduces additional computational complexity.
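A minimal Keras sketch of the character-level pipeline just described, assuming an alphanumeric-plus-punctuation dictionary, padding to length L and an embedding dimension D = 128. The maximum length, the LSTM width and the output size are illustrative assumptions, not the authors' published configuration:

```python
import string
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# dictionary of valid domain characters (index 0 is reserved for padding)
valid_chars = string.ascii_lowercase + string.digits + "-_."
char2idx = {c: i + 1 for i, c in enumerate(valid_chars)}

def encode(domains, max_len):
    """Map each domain to a vector of character indices, padded with 0s to length L."""
    seqs = [[char2idx.get(c, 0) for c in d.lower()] for d in domains]
    return pad_sequences(seqs, maxlen=max_len)

L, D = 75, 128                        # L: maximum domain length, D: embedding size
X = encode(["example.com", "xjwqptkzre.ru"], max_len=L)

model = Sequential([
    Embedding(input_dim=len(valid_chars) + 1, output_dim=D, input_length=L),
    LSTM(128),
    Dense(2, activation="softmax"),   # 2 outputs for the binary stage; one per DGA family for the multiclass stage
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```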
The cost-sensitive LSTM layer aims to extract various features that characterize a given domain name. These features are not hand-crafted, making it hard for an adversary to circumvent. The forward pass is the same as in the original LSTM [21]. The backward pass is detailed in Fig. 3. The difference lies in δ_k(t) and the internal state error, which are derived from Eqs. (9) and (10), while the weight updates are computed using Eqs. (2)–(5). To label the given domain name, the softmax layer is applied to minimize the cross-entropy or, equivalently, maximize the log-likelihood [3]. We have that

p_i = \frac{\exp(z_{\mathrm{out}_i})}{\sum_{j} \exp(z_{\mathrm{out}_j})}   (12)

The predicted class k would be

k = \arg\max_i p_i   (13)

All the LSTM.MI code is written in Python using the Keras library [46] and scikit-learn tools [47].
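Putting the two stages together, the following is a hedged sketch of the LSTM.MI decision rule of Eqs. (12) and (13): a binary cost-sensitive model first flags a domain as DGA-generated, and only then is the multiclass model consulted for the family label. The model objects, the label ordering and the helper name are assumptions for illustration, not the authors' exact code:

```python
import numpy as np

def lstm_mi_predict(domain_vec, binary_model, multiclass_model, dga_families):
    """Two-stage LSTM.MI labelling: non-DGA vs. DGA first, then family fingerprinting."""
    x = domain_vec.reshape(1, -1)
    p_binary = binary_model.predict(x)[0]        # softmax over [non-DGA, DGA], cf. Eq. (12)
    if np.argmax(p_binary) == 0:                 # arg max decision, cf. Eq. (13)
        return "non-DGA"
    p_family = multiclass_model.predict(x)[0]    # softmax over the DGA families
    return dga_families[int(np.argmax(p_family))]
```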
4. Experiments

4.1. Dataset specification

In the experiments, we evaluate our candidate method on an open dataset containing 1 non-DGA (Alexa) and 37 DGA classes. The dataset is collected from two sources: the Alexa top 100,000 sites with/without the www. child label [3,48] and the OSINT DGA feed from Bambenek Consulting [49].

As shown in Table 1, the dataset exhibits different imbalance degrees, ranging from 1:2 to 1:3534. It also involves some notable DGA families such as Cryptolocker, Locky, Kraken and Gameover Zeus. Matsnu, Cryptowall, Suppobox and Volatile malwares are based on pronounceable domains, which are randomly generated using a (typical) English dictionary with a narrow randomization space. Five-fold cross validation (CV) is carried out since several classes are very small. This is to ensure that each fold has at least five samples from every class.
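Stratified folds keep the class proportions in each split, which is one way to satisfy the requirement of at least five samples per class in every fold; a minimal scikit-learn sketch with toy stand-in data (the real experiments use the encoded domain vectors and the 38 class labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy stand-ins: 100 encoded domains of length 75, labels over 4 classes
X = np.random.randint(0, 40, size=(100, 75))
y = np.repeat(np.arange(4), 25)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # each fold keeps (approximately) the original class proportions
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    print("fold", fold, "train size", len(train_idx), "test size", len(test_idx))
```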
Fig. 2. The decision boundaries produced by the (a) cost-insensitive LSTM, (b) cost-sensitive LSTM (γ = 0.3), (c) cost-sensitive LSTM (γ = 1) and (d) LSTM.MI (γ = 0.3). The
imbalance ratio is 1:8.
Table 1
Summary of the collected dataset (Source: the Alexa top 100,000 sites [48] and the OSINT DGA feed from Bambenek Consulting [49]).

Class          Samples   Pronounceable
Geodo          58        ✗
Fobber         60        ✗
Beebone        42        ✓
Alexa          88,347    ✓
Murofet        816       ✗
Dyre           800       ✗
Pykspa         1422      ✗
Cryptowall     94        ✓
Padcrypt       42        ✗
Corebot        28        ✗
Ramnit         9158      ✗
P              200       ✗
Volatile       50        ✓
Bedep          172       ✗
Ranbyus        1232      ✗
Matsnu         48        ✓
Qakbot         4000      ✗
PT Goz         6600      ✗
Simda          1365      ✗
Necurs         2398      ✗
Ramdo          200       ✗
Pushdo         168       ✗
Suppobox       101       ✓
Cryptolocker   600       ✗
Locky          186       ✗
Dircrypt       57        ✗
Tempedreve     25        ✗
Shifu          234       ✗
Qadars         40        ✗
Bamital        60        ✗
Symmi          64        ✗
Kraken         508       ✗
Banjori        42,166    ✗
Nymaim         600       ✗
Tinba          6385      ✗
Shiotob        1253      ✗
Hesperbot      192       ✗
W32.Virut      60        ✗
4.2. Performance measures

In general, a classifier can be gauged using the overall accuracy. Unfortunately, on multiclass imbalanced data this metric is no longer a proper measure since the prevalent classes have a much higher impact than the small classes [7,37,41]. To give more insight into the accuracy obtained within a given class, precision, recall and F1-score are adopted in this paper. Precision is the percentage of retrieved domains that are relevant, while recall is the percentage of relevant domains that are retrieved. The F1-score represents a harmonic mean between precision and recall.

As illustrated in Table 2, the quality of the overall classification can be analyzed in two ways. In macro-averaging, a metric is averaged over all classes, which are treated equally. Micro-averaging is based on the cumulative True Positive, False Positive, True Negative and False Negative counts [50]. It favors the classes that have more samples (prevalent classes). Macro-averaging seems to be a better indicator in the multiclass imbalance problem. However, micro-averaging …

\mathrm{Recall}_{\mu} = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}, \qquad \mathrm{Recall}_{M} = \frac{1}{N_{\mathrm{classes}}} \sum_i \frac{TP_i}{TP_i + FN_i}   (Table 2 excerpt: micro- vs. macro-averaged recall)
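The macro- and micro-averaged scores used throughout the experiments can be obtained directly from scikit-learn [47]; a minimal sketch with toy labels (the real evaluation uses the predictions over all 38 classes):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 1, 1, 2]          # toy labels: class 0 prevalent, class 2 rare
y_pred = [0, 0, 2, 1, 1, 0]

for avg in ("macro", "micro"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(avg, round(p, 4), round(r, 4), round(f1, 4))
```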
In the two-class task, all the DGA generated domains are grouped into a single class. The imbalance ratio (IR) reduces to 1:1.08. A change in γ does not induce a significant change in the cost items.
Fig. 4. Performance of the cost-sensitive LSTM on two-class task. (a) Macro-averaging precision, recall and F1-score with respect to the γ ∈ [0, 1], and (b) learning curves
related to the cost-insensitive LSTM and cost-sensitive LSTM (γ = 0.3).
Fig. 5. Performance of the cost-sensitive LSTM on multiclass task. (a) Macro-averaging precision, recall and F1-score, (b) micro-averaging precision, recall and F1-score, (c)
the number of DGA families that cannot be recognized with respect to the γ ∈ [0, 1], and (d) learning curves relating to the cost-insensitive LSTM and cost-sensitive LSTM
(γ = 0.3).
The best accuracy (0.6034 precision, 0.535 recall and 0.5671 F1-score) for the multiclass task is also achieved at γ = 0.3.

Convergence and generalization are important properties of the cost-sensitive LSTM. Convergence is defined as the number of epochs required to achieve a sufficiently good solution during training. Generalization measures the detection capability on new unseen data [51]. In all experiments, the learning rate α is kept the same in order to observe the effect of varying cost items (γ ∈ {0, 0.3}).

From Figs. 4(b) and 5(d), both the binary and multiclass cost-sensitive LSTM appear to have converged after 17 epochs. The training time is remarkably fast, requiring just an hour on an Intel Core i5 computer with 8 GB RAM. We observe that the cost-sensitive LSTM is slightly prone to overfitting.
Table 3
Performance comparison of the cost-sensitive learning methods on two-class task. The number in boldface denotes the highest F1-score.
Precision Recall F1-score Precision Recall F1-score Precision Recall F1-score Precision Recall F1-score
Non-DGA 0.9853 0.9773 0.9813 0.9849 0.9761 0.9805 0.9836 0.9810 0.9823 0.9816 0.9881 0.9849
DGA 0.9756 0.9842 0.9799 0.9744 0.9837 0.9790 0.9794 0.9823 0.9809 0.9870 0.9799 0.9845
Micro-averaging 0.9806 0.9806 0.9806 0.9799 0.9797 0.9798 0.9816 0.9816 0.9816 0.9842 0.9842 0.9842
Macro-averaging 0.9805 0.9808 0.9806 0.9797 0.9799 0.9798 0.9815 0.9817 0.9816 0.9843 0.9840 0.9842
Table 4
Performance comparison of the cost-sensitive learning methods on the multiclass task. The number in boldface denotes the highest F1-score.

Class         Precision Recall F1-score | Precision Recall F1-score | Precision Recall F1-score | Precision Recall F1-score
Geodo         0.1429 0.0833 0.1053 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Beebone       1.0000 1.0000 1.0000 | 1.0000 1.0000 1.0000 | 1.0000 1.0000 1.0000 | 1.0000 1.0000 1.0000
Murofet       0.4291 0.6503 0.5171 | 0.7524 0.4847 0.5896 | 0.5355 0.5092 0.5220 | 0.5330 0.7423 0.6205
Pykspa        0.7374 0.7218 0.7295 | 0.7985 0.7535 0.7754 | 0.6889 0.6549 0.6715 | 0.8023 0.7430 0.7715
Padcrypt      0.9091 0.8333 0.8696 | 0.6875 0.9167 0.7857 | 0.8000 0.6667 0.7273 | 1.0000 0.7500 0.8571
Ramnit        0.5560 0.3903 0.4586 | 0.5733 0.8859 0.6961 | 0.4350 0.3619 0.3951 | 0.6068 0.8062 0.6925
Volatile      0.9091 1.0000 0.9524 | 0.8333 1.0000 0.9091 | 0.8333 1.0000 0.9091 | 1.0000 1.0000 1.0000
Ranbyus       0.3262 0.4309 0.3713 | 0.4029 0.3374 0.3673 | 0.2396 0.3293 0.2774 | 0.3617 0.7073 0.4787
Qakbot        0.5240 0.4637 0.4920 | 0.7245 0.5062 0.5960 | 0.3795 0.4625 0.4169 | 0.7716 0.4350 0.5564
Simda         0.9536 0.9780 0.9656 | 0.9681 1.0000 0.9838 | 0.9574 0.9890 0.9730 | 0.9579 1.0000 0.9785
Ramdo         0.8864 0.9750 0.9286 | 1.0000 1.0000 1.0000 | 0.9091 1.0000 0.9524 | 0.9524 1.0000 0.9756
Suppobox      0.4706 0.4000 0.4324 | 0.3500 0.3500 0.3500 | 0.4091 0.4500 0.4286 | 0.4167 0.5000 0.4545
Locky         0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0172 0.0270 0.0211 | 0.0000 0.0000 0.0000
Tempedreve    0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Qadars        0.8333 0.6250 0.7143 | 1.0000 0.5000 0.6667 | 0.5000 0.6250 0.5556 | 1.0000 0.6250 0.7692
Symmi         0.4000 0.3077 0.3478 | 0.0000 0.0000 0.0000 | 0.0870 0.1538 0.1111 | 0.5000 0.1538 0.2353
Banjori       0.9986 0.9992 0.9989 | 0.9992 1.0000 0.9996 | 0.9981 0.9740 0.9859 | 0.9988 1.0000 0.9994
Tinba         0.8319 0.8794 0.8550 | 0.8947 0.9984 0.9437 | 0.8386 0.5450 0.6607 | 0.8951 0.9961 0.9429
Hesperbot     0.0938 0.0789 0.0857 | 0.0000 0.0000 0.0000 | 0.0270 0.0263 0.0267 | 0.3333 0.0263 0.0488
Fobber        0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Dyre          0.9810 0.9688 0.9748 | 0.9639 1.0000 0.9816 | 1.0000 0.9938 0.9969 | 0.9816 1.0000 0.9907
Cryptowall    0.2667 0.2105 0.2353 | 0.5000 0.1053 0.1739 | 0.0714 0.1579 0.0984 | 0.6250 0.2632 0.3704
Corebot       0.5000 0.5000 0.5000 | 0.6667 0.3333 0.4444 | 0.4000 0.6667 0.5000 | 0.7500 0.6000 0.6667
P             0.3600 0.2250 0.2769 | 0.4878 0.5000 0.4938 | 0.3115 0.4750 0.3762 | 0.3333 0.5500 0.4151
Bedep         0.4737 0.5294 0.5000 | 0.7083 0.5000 0.5862 | 0.4412 0.4412 0.4412 | 0.6875 0.3235 0.4400
Matsnu        0.8333 0.5000 0.6250 | 0.5714 0.4000 0.4706 | 0.2353 0.4000 0.2963 | 1.0000 0.7000 0.8235
PT Goz        0.9970 0.9977 0.9973 | 0.9962 1.0000 0.9981 | 0.9977 0.9909 0.9943 | 0.9985 1.0000 0.9992
Necurs        0.1318 0.2463 0.1718 | 0.5000 0.0960 0.1611 | 0.1246 0.1837 0.1485 | 0.5248 0.1104 0.1824
Pushdo        0.5000 0.5000 0.5000 | 0.5714 0.5882 0.5797 | 0.4694 0.6765 0.5542 | 0.6571 0.6765 0.6667
Cryptolocker  0.1295 0.1500 0.1390 | 0.0000 0.0000 0.0000 | 0.0658 0.1667 0.0943 | 0.2000 0.0167 0.0308
Dircrypt      0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Shifu         0.3167 0.4043 0.3551 | 0.4111 0.7872 0.5401 | 0.3220 0.4043 0.3585 | 0.3711 0.7660 0.5000
Bamital       0.6923 0.7500 0.7200 | 1.0000 0.6667 0.8000 | 0.9231 1.0000 0.9600 | 1.0000 0.7500 0.8571
Kraken        0.0431 0.0490 0.0459 | 0.0000 0.0000 0.0000 | 0.0718 0.1275 0.0919 | 0.0800 0.0196 0.0315
Nymaim        0.1417 0.1500 0.1457 | 0.2727 0.1000 0.1463 | 0.1445 0.2083 0.1706 | 0.2989 0.2167 0.2512
Shiotob       0.9469 0.9243 0.9355 | 0.9672 0.9402 0.9535 | 0.7993 0.9044 0.8486 | 0.9741 0.9004 0.9358
W32.Virut     0.2727 0.2500 0.2609 | 0.0000 0.0000 0.0000 | 0.3333 0.3333 0.3333 | 0.7143 0.4167 0.5263
Micro-averaging 0.8302 0.8194 0.8248 | 0.8643 0.8803 0.8722 | 0.8048 0.7724 0.7883 | 0.8728 0.8775 0.8751
Macro-averaging 0.5024 0.4911 0.4967 | 0.5298 0.4797 0.5035 | 0.4423 0.4839 0.4622 | 0.6034 0.5350 0.5671
After 18 epochs, the performance on the training set continues to rise while that on the testing set starts decreasing. The cost-sensitive and cost-insensitive learning have somewhat similar overall accuracy. It should be noted that the original LSTM is accuracy-oriented, with the aim to maximize the detection rate over all classes.
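One way to guard against the overfitting observed after epoch 18 is early stopping on a held-out split; a hedged Keras sketch, where the callback and the parameter values are our own assumptions rather than anything stated in the paper:

```python
from keras.callbacks import EarlyStopping

# stop training when the validation loss has not improved for a few epochs
early_stop = EarlyStopping(monitor="val_loss", patience=3)
# model.fit(X_train, y_train, validation_split=0.1, epochs=30,
#           class_weight=class_weight, callbacks=[early_stop])
```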
As already discussed, it is possible to design a "wrapper" to make LSTM cost-sensitive. The wrapper methods can be oversampling, Threshold-moving and RUSBoost. Undersampling is already integrated into RUSBoost and hence is omitted in this work. Threshold-moving is based on a type (c) cost matrix, where at least one C[i, j] = 1 and the unit cost is the minimum cost. We refer the readers to [9,30] for more details. In the two-class task (see Table 3), there is no notable difference between the algorithms since the imbalance problem (1:1.08) is not significant. The cost-sensitive LSTM achieves the highest accuracy. This is followed by RUSBoost. Such a meta-technique combines 50 weak learners (LSTMs) to improve the generalization capabilities. It however leads to an increased computational complexity.
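For reference, a simplified sketch of the threshold-moving idea: the softmax outputs are rescaled by per-class costs at prediction time and the arg max is taken over the adjusted scores. This is a generic illustrative variant, not necessarily the exact formulation of [9,30]:

```python
import numpy as np

def threshold_moving(probs: np.ndarray, costs: np.ndarray) -> int:
    """Rescale class posteriors by their misclassification costs, then predict."""
    adjusted = probs * costs
    return int(np.argmax(adjusted / adjusted.sum()))

probs = np.array([0.70, 0.25, 0.05])    # softmax output biased towards the prevalent class
costs = np.array([1.0, 2.0, 16.0])      # higher costs on the small classes
print(threshold_moving(probs, costs))   # the rare class now wins the arg max
```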
In the multiclass task (see Table 4), Threshold-moving cannot recognize 10 DGA families. Its performance is mostly identical to the accuracy achieved by the cost-insensitive LSTM (i.e., γ = 0). RUSBoost proves to be the worst performer. The rationale for this is that RUSBoost randomly removes samples to balance the highly skewed class distribution. It may risk losing important information that characterizes the prevalent classes. Overall, γ = 0.3 leads to the highest overall macro-averaging recall, precision and F1-score. Hence, it is reasonable to directly introduce the cost items into the backpropagation learning mechanism.

4.4. The LSTM.MI algorithm

In LSTM.MI, the binary and multiclass cost-sensitive LSTM work together in combination. Its basic motivation is to retain the high accuracy on the prevalent non-DGA class, while increasing the macro-averaging F1-score on the other DGA classes. This section is dedicated to investigating LSTM.MI and highlighting its advantages over the cost-sensitive neural network (CS-NN) [44], cost-sensitive support vector machines (CS-SVM) [34], cost-sensitive C4.5 (CS-C4.5) [30] and the Weighted Extreme Learning Machine (WELM) [35].
Table 5
Precision, recall and F1-score for CS-NN, CS-SVM, CS-C4.5 and WELM from five-fold cross validation.

Class         CS-NN                     | CS-SVM                    | CS-C4.5                   | WELM
              Precision Recall F1-score | Precision Recall F1-score | Precision Recall F1-score | Precision Recall F1-score
Geodo         0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0303 0.0833 0.0444
Beebone       0.0000 0.0000 0.0000 | 0.6667 1.0000 0.8000 | 0.8000 1.0000 0.8889 | 0.8000 0.5000 0.6154
Murofet       0.8000 0.0491 0.0925 | 0.5385 0.1718 0.2605 | 0.5814 0.6135 0.5970 | 0.3372 0.3558 0.3463
Pykspa        0.0000 0.0000 0.0000 | 0.3231 0.0739 0.1203 | 0.5833 0.4930 0.5344 | 0.3696 0.3592 0.3643
Padcrypt      0.0000 0.0000 0.0000 | 0.2500 0.1667 0.2000 | 0.3333 0.4167 0.3704 | 0.1667 0.0833 0.1111
Ramnit        0.4645 0.7631 0.5774 | 0.4931 0.7238 0.5866 | 0.4642 0.4771 0.4705 | 0.4510 0.4694 0.4600
Volatile      0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.8182 0.9000 0.8571 | 0.0000 0.0000 0.0000
Ranbyus       0.0000 0.0000 0.0000 | 0.3501 0.6789 0.4620 | 0.3801 0.4512 0.4126 | 0.4377 0.4715 0.4540
Qakbot        0.7164 0.3063 0.4291 | 0.6520 0.2600 0.3718 | 0.4141 0.3675 0.3894 | 0.3796 0.3450 0.3615
Simda         0.3049 0.0916 0.1408 | 0.5287 0.3370 0.4116 | 0.3119 0.3370 0.3239 | 0.3020 0.2234 0.2568
Ramdo         0.0000 0.0000 0.0000 | 0.3077 0.2000 0.2424 | 0.5385 0.5250 0.5316 | 0.3750 0.3000 0.3333
Suppobox      0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Locky         0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0256 0.0270 0.0263
Tempedreve    0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Qadars        0.0000 0.0000 0.0000 | 0.1667 0.1250 0.1429 | 0.3125 0.6250 0.4167 | 0.2500 0.2500 0.2500
Symmi         0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Banjori       0.9306 0.9107 0.9205 | 0.9641 0.9808 0.9724 | 0.9995 0.9995 0.9995 | 0.9976 0.9999 0.9988
Tinba         0.7525 0.9311 0.8323 | 0.7900 0.9749 0.8728 | 0.8063 0.8575 0.8311 | 0.7978 0.8434 0.8199
Hesperbot     0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0250 0.0263 0.0256
Fobber        0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Alexa         0.9170 0.9675 0.9416 | 0.9515 0.9730 0.9621 | 0.9589 0.9650 0.9619 | 0.9564 0.9606 0.9585
Dyre          0.7571 0.9938 0.8595 | 0.9933 0.9250 0.9579 | 1.0000 1.0000 1.0000 | 0.9815 0.9938 0.9876
Cryptowall    0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Corebot       0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
P             1.0000 0.2500 0.4000 | 0.3333 0.1000 0.1538 | 0.5517 0.4000 0.4638 | 0.3514 0.3250 0.3377
Bedep         0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0938 0.0882 0.0909
Matsnu        0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
PT Goz        0.9613 0.9970 0.9788 | 0.9831 0.9682 0.9756 | 0.9894 0.9917 0.9905 | 0.9751 0.9803 0.9777
Necurs        0.0833 0.0021 0.0041 | 0.2909 0.0333 0.0598 | 0.1381 0.1396 0.1389 | 0.1317 0.1125 0.1213
Pushdo        0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0789 0.0882 0.0833 | 0.0303 0.0294 0.0299
Cryptolocker  0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0755 0.0667 0.0708 | 0.0672 0.0667 0.0669
Dircrypt      0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Shifu         0.1765 0.0638 0.0937 | 0.1776 0.4043 0.2468 | 0.1739 0.1702 0.1720 | 0.2090 0.2979 0.2456
Bamital       0.0000 0.0000 0.0000 | 1.0000 0.6667 0.8000 | 1.0000 1.0000 1.0000 | 0.9091 0.8333 0.8696
Kraken        0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0500 0.0294 0.0370 | 0.0526 0.0588 0.0556
Nymaim        0.0000 0.0000 0.0000 | 1.0000 0.0083 0.0165 | 0.0748 0.0667 0.0705 | 0.0667 0.0417 0.0513
Shiotob       0.3889 0.6693 0.4919 | 0.5292 0.5418 0.5354 | 0.7930 0.7171 0.7531 | 0.6777 0.5697 0.6190
W32.Virut     0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Micro-averaging 0.8309 0.8624 0.8463 | 0.8740 0.8884 0.8811 | 0.8792 0.8834 0.8811 | 0.8707 0.8739 0.8723
Macro-averaging 0.2172 0.1841 0.1993 | 0.3234 0.2714 0.2951 | 0.3481 0.3605 0.3517 | 0.2959 0.2814 0.2884
We also compare the algorithm with other state-of-the-art DGA detection methods, namely the C5.0 decision tree (C5.0) [52], the Hidden Markov Model (HMM) [1] and the original LSTM (LSTM) [3].

HMM is similar to the original LSTM in that it can directly operate on the raw domains. We train one distinct HMM per class. The number of hidden states is set equal to the average length of the domain names in the training dataset. CS-NN, CS-SVM, CS-C4.5, WELM and C5.0 represent the retrospective group of methods, which construct the classification model using hand-crafted features, i.e., entropy of the character distribution, n-gram normality score (n ∈ [1, 2, 3, 4, 5]), length of the domain name, and meaningful character ratio [1,12,13]. CS-C4.5 relies on the instance-weighting method to modify the decision tree splitting criterion, which makes C4.5 cost-sensitive [33].

In Tables 5 and 6, HMM produces worse accuracy than expected. According to Woodbridge et al. [3], a possible reason is that the collected dataset involves 37 different DGA families; we note that in [1], HMM was only evaluated using Conficker, Murofet, Bobax, and Sinowal. The dominance of LSTM.MI is established with a large margin (7% macro-averaging F1-score improvement) compared to the original LSTM. LSTM.MI proves to be the most robust against the multiclass imbalance problem. It is also capable of retaining high precision (0.9816) and recall (0.9881) on the Alexa (non-DGA) class. CS-NN, CS-SVM, CS-C4.5, WELM and C5.0 prove to be inferior to both the original LSTM and LSTM.MI.

In Tables 7 and 8, the Wilcoxon signed-ranks test has been carried out to compare each pair of algorithms based on their F1-scores on the various non-DGA and DGA data classes [53]. The detailed ranks are shown in Table 7. The sum of ranks below the diagonal is denoted as R+, while that above the diagonal is denoted as R− [54]. For a confidence level of α = 0.05 and N = 38 data classes, two methods are considered to be significantly different if the smaller of R+ and R− is equal to or less than 235. Table 8 illustrates the summary of the Wilcoxon test. It is obvious that LSTM.MI achieves the highest performance and the difference between LSTM.MI and the other algorithms is significant.
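The pairwise comparison can be reproduced with SciPy's Wilcoxon signed-rank test applied to the per-class F1-scores of two algorithms; the sketch below uses made-up scores for illustration (the real test uses the 38 class-wise F1-scores from Tables 5 and 6):

```python
from scipy.stats import wilcoxon

# per-class F1-scores of two algorithms over the same classes (toy values)
f1_lstm    = [0.10, 0.74, 0.67, 0.97, 0.59, 0.89, 0.42, 0.96]
f1_lstm_mi = [0.12, 0.78, 0.70, 0.98, 0.63, 0.90, 0.48, 0.99]

stat, p_value = wilcoxon(f1_lstm, f1_lstm_mi)
print(stat, p_value)   # a small p-value indicates a significant difference between the two algorithms
```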
Table 9 shows the evaluation time of the various methods. In most cases, the training is carried out off-line; hence, it is not of interest. For practical uses, evaluation may be critical [43]. As illustrated in Table 9, there is almost no computation cost in LSTM (9 ms). LSTM.MI has to render the decision using the outcomes of both the binary and multiclass cost-sensitive models, thus requiring more evaluation time (18 ms) to process a domain. The evaluation time of WELM is not given because it is executed using Matlab. CS-NN, CS-SVM, CS-C4.5 and C5.0 are computationally expensive. This is due to the fact that these methods require …
Table 6
Precision, recall and F1-score for HMM, C5.0 decision tree, LSTM and LSTM.MI from five-fold cross validation.

Class         HMM                       | C5.0                      | LSTM                      | LSTM.MI
              Precision Recall F1-score | Precision Recall F1-score | Precision Recall F1-score | Precision Recall F1-score
Geodo         0.0127 0.4167 0.0246 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Beebone       0.0308 0.7500 0.0591 | 0.6250 1.0000 0.7692 | 1.0000 0.9375 0.9667 | 0.9046 1.0000 0.9499
Murofet       0.8235 0.2577 0.3925 | 0.3810 0.4706 0.4211 | 0.5894 0.6013 0.5952 | 0.6977 0.6563 0.6755
Pykspa        0.3090 0.1937 0.2381 | 0.0000 0.0000 0.0000 | 0.8301 0.6708 0.7420 | 0.8558 0.7087 0.7752
Padcrypt      0.2069 1.0000 0.3429 | 0.0000 0.0000 0.0000 | 0.9445 0.5834 0.7143 | 0.9286 0.9667 0.9443
Ramnit        0.1081 0.0551 0.0730 | 0.0000 0.0000 0.0000 | 0.5627 0.8163 0.6661 | 0.6070 0.8260 0.6998
Volatile      0.0136 0.6000 0.0267 | 0.0000 0.0000 0.0000 | 0.8889 0.6500 0.7434 | 1.0000 0.5540 0.7130
Ranbyus       0.0424 0.2236 0.0713 | 0.0000 0.0000 0.0000 | 0.4140 0.4289 0.4180 | 0.4324 0.6646 0.5239
Qakbot        0.1240 0.0587 0.0797 | 0.9773 0.9835 0.9804 | 0.7109 0.5069 0.5918 | 0.7669 0.5290 0.6260
Simda         0.0137 0.1465 0.0250 | 0.7685 0.9640 0.8552 | 0.9079 0.8993 0.9035 | 0.8907 0.9010 0.8958
Ramdo         0.0388 0.7250 0.0737 | 0.0000 0.0000 0.0000 | 0.9518 0.9875 0.9693 | 0.9759 0.9537 0.9646
Suppobox      0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.3542 0.0934 0.1471
Locky         0.0000 0.0000 0.0000 | 0.3492 0.2767 0.3088 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Tempedreve    0.0015 0.8000 0.0031 | 0.9507 0.9766 0.9635 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Qadars        0.0309 0.7500 0.0594 | 1.0000 1.0000 1.0000 | 0.0000 0.0000 0.0000 | 0.9000 0.4318 0.5744
Symmi         0.0065 0.1538 0.0125 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.3750 0.0857 0.1389
Banjori       0.9143 0.1051 0.1885 | 0.6667 0.2857 0.4000 | 0.9992 0.9999 0.9995 | 0.9984 1.0000 0.9991
Tinba         0.0000 0.0000 0.0000 | 0.6000 0.4167 0.4918 | 0.8846 0.9456 0.9141 | 0.8921 0.9814 0.9345
Hesperbot     0.0037 0.0526 0.0069 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Fobber        0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Alexa         1.0000 0.0002 0.0003 | 0.9899 0.9868 0.9883 | 0.9746 0.9924 0.9834 | 0.9816 0.9881 0.9849
Dyre          0.9697 1.0000 0.9846 | 0.1646 0.0567 0.0844 | 0.9611 0.9969 0.9786 | 0.9746 1.0000 0.9872
Cryptowall    0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Corebot       0.0017 0.4000 0.0035 | 0.3116 0.2191 0.2573 | 0.0000 0.0000 0.0000 | 0.8000 0.5000 0.5857
P             0.2727 0.2250 0.2466 | 0.0645 0.0140 0.0230 | 0.5718 0.3625 0.4434 | 0.5055 0.4405 0.4708
Bedep         0.0060 0.1471 0.0115 | 0.0000 0.0000 0.0000 | 0.9000 0.2353 0.3723 | 0.7322 0.3936 0.5117
Matsnu        0.0000 0.0000 0.0000 | 0.0800 0.0435 0.0563 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
PT Goz        0.9811 0.6682 0.7950 | 0.9091 1.0000 0.9524 | 0.9978 0.9989 0.9983 | 0.9975 0.9986 0.9980
Necurs        0.0244 0.0729 0.0366 | 0.0000 0.0000 0.0000 | 0.4001 0.0875 0.1424 | 0.4565 0.0982 0.1603
Pushdo        0.0036 0.2353 0.0071 | 0.1071 0.0268 0.0429 | 0.6762 0.1912 0.2969 | 0.5852 0.4917 0.5341
Cryptolocker  0.0163 0.6917 0.0318 | 0.6406 0.5538 0.5940 | 0.0000 0.0000 0.0000 | 0.1111 0.0079 0.0147
Dircrypt      0.0017 0.0909 0.0034 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Shifu         0.0250 1.0000 0.0489 | 0.2222 0.2000 0.2105 | 0.3671 0.2872 0.3115 | 0.3294 0.6518 0.4330
Bamital       0.6316 1.0000 0.7742 | 0.4839 0.5797 0.5275 | 0.9445 0.5417 0.6751 | 1.0000 0.5750 0.7143
Kraken        0.0041 0.0196 0.0068 | 0.4545 0.4545 0.4545 | 0.0625 0.0049 0.0091 | 0.1500 0.0126 0.0233
Nymaim        0.0085 0.2250 0.0165 | 0.3062 0.3900 0.3431 | 0.3118 0.0625 0.1031 | 0.1787 0.1080 0.1342
Shiotob       0.2404 0.2749 0.2565 | 0.4767 0.3761 0.4205 | 0.9132 0.8785 0.8955 | 0.9285 0.8774 0.9022
W32.Virut     0.0035 1.0000 0.0070 | 0.4403 0.2439 0.3139 | 0.0000 0.0000 0.0000 | 0.0000 0.0000 0.0000
Micro-averaging 0.8085 0.0782 0.1426 | 0.8652 0.8854 0.8751 | 0.9175 0.9295 0.9235 | 0.9261 0.9336 0.9298
Macro-averaging 0.1808 0.3510 0.2386 | 0.3150 0.3031 0.3089 | 0.4675 0.3860 0.4229 | 0.5344 0.4604 0.4946
Table 7
Ranks computed by the Wilcoxon test.
Table 8
Summary of the Wilcoxon test. ● implies that the algorithm in the row improves the algorithm
of the column, while ◦ implies that the algorithm in the column improves the algorithm of the
row. Lower diagonal level of significance α = 0.95; Upper diagonal level of significance α = 0.9.
CS-NN – ◦ ◦ ◦ ◦ ◦ ◦
CS-SVM ● – ◦ ◦ ◦ ◦
WELM ● ● – ● ● ◦ ◦
CS-C4.5 ● ◦ – ● ◦ ◦
HMM ◦ ◦ – ◦ ◦ ◦
C5.0 ● – ◦
LSTM ● ● ● ● ● – ◦
LSTM.MI ● ● ● ● ● ● ● –
Table 9
The evaluation time (in ms) of CS-NN, CS-SVM, CS-C4.5, WELM, HMM, C5.0, LSTM and LSTM.MI.
Fig. 6. Confusion matrix related to the malware families that cannot be detected by the LSTM.MI algorithm. The values are normalized to be in the range [0, 1]. 1 is
illustrated as black, while 0 is illustrated as white.
5. Conclusions
Acknowledgment

This work is supported by the Ministry of Science and Technology of Vietnam (MOST) under grant no. KC.01.01/16-20.

References

[1] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh, W. Lee, D. Dagon, From throw-away traffic to bots: detecting the rise of DGA-based malware, in: Proceedings of the Twenty-first USENIX Security Symposium (USENIX Security 12), 2012.
[2] S. Schiavoni, F. Maggi, L. Cavallaro, S. Zanero, Phoenix: DGA-based botnet tracking and intelligence, in: Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), Lecture Notes in Computer Science, 8550, 2014, pp. 192–211.
[3] J. Woodbridge, H.S. Anderson, A. Ahuja, D. Grant, Predicting domain generation algorithms with long short-term memory networks, CoRR abs/1611.00791 (2016). arXiv:1611.00791.
[4] M. Kührer, C. Rossow, T. Holz, Paint it black: evaluating the effectiveness of malware blacklists, in: Proceedings of Research in Attacks, Intrusions and Defenses (RAID), Gothenburg, Sweden, 2014, pp. 1–21.
[5] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[6] F.A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: continual prediction with LSTM, Neural Comput. 12 (10) (2000) 2451–2471.
[7] Y. Sun, M.S. Kamel, A.K. Wong, Y. Wang, Cost-sensitive boosting for classification of imbalanced data, Pattern Recogn. 40 (12) (2007) 3358–3378.
[8] S. Wang, X. Yao, Multiclass imbalance problems: analysis and potential solutions, IEEE Trans. Syst. Man Cybern. Part B Cybern. 42 (4) (2012) 1119–1130.
[9] Z.H. Zhou, X.Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Trans. Knowl. Data Eng. 18 (1) (2006) 63–77.
[10] J. Hagen, S. Luo, Why domain generation algorithms (DGA)? Trend Micro (posted on: August 18, 2016).
[11] S. Yadav, A.K.K. Reddy, A.L. Reddy, S. Ranjan, Detecting algorithmically generated malicious domain names, in: Proceedings of the Tenth ACM SIGCOMM Conference on Internet Measurement, 2010, pp. 48–61.
[12] R. Perdisci, I. Corona, G. Giacinto, Early detection of malicious flux networks via large-scale passive DNS analysis, IEEE Trans. Dependable Secure Comput. 9 (5) (2012) 714–726.
[13] L. Bilge, E. Kirda, C. Kruegel, M. Balduzzi, EXPOSURE: finding malicious domains using passive DNS analysis, in: Proceedings of the Network and Distributed System Security Symposium (NDSS), 2011.
[14] D.K. McGrath, M. Gupta, Behind phishing: an examination of phisher modi operandi, in: Proceedings of the First USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2008.
[15] J. Ma, L.K. Saul, S. Savage, G. Voelker, Beyond blacklists: learning to detect malicious Web sites from suspicious URLs, in: Proceedings of Knowledge Discovery and Data Mining (ACM KDD), 2009.
[16] S. Yadav, A.K.K. Reddy, A.N. Reddy, S. Ranjan, Detecting algorithmically generated domain-flux attacks with DNS traffic analysis, IEEE/ACM Trans. Netw. 20 (5) (2012) 1663–1677.
[17] R.V. Salomon, J.C. Brustoloni, Identifying botnets using anomaly detection techniques applied to DNS traffic, in: Proceedings of the Fifth IEEE Consumer Communications and Networking Conference (CCNC), 2008, pp. 476–481.
[18] J. Kwon, J. Lee, H. Lee, A. Perrig, PsyBoG: a scalable botnet detection method for large-scale DNS traffic, Comput. Netw. 97 (2016) 48–73.
[19] G. Gu, R. Perdisci, J. Zhang, W. Lee, BotMiner: clustering analysis of network traffic for protocol- and structure-independent botnet detection, USENIX Secur. Symp. 5 (2) (2008) 139–154.
[20] S. Krishnan, T. Taylor, F. Monrose, J. McHugh, Crossing the threshold: detecting network malfeasance via sequential hypothesis testing, in: Proceedings of the Forty-third Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2013, pp. 1–12.
[21] F.A. Gers, N.N. Schraudolph, J. Schmidhuber, Learning precise timing with LSTM recurrent networks, J. Mach. Learn. Res. 3 (2002) 115–143.
[22] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, S. Khudanpur, Recurrent neural network based language model, Interspeech 2 (3) (2010).
[23] A. Graves, A.R. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[24] C.X. Ling, S.S. Victor, Cost-sensitive learning, in: Encyclopedia of Machine Learning, Springer, US, 2011, pp. 231–235.
[25] L. Jiang, C. Qiu, C. Li, A novel minority cloning technique for cost-sensitive learning, Int. J. Pattern Recogn. Artif. Intell. 29 (4) (2015) 1551004.
[26] C. Qiu, L. Jiang, G. Kong, A differential evolution-based method for class-imbalanced cost-sensitive learning, in: Proceedings of the 2015 International Joint Conference on Neural Networks, 2015, pp. 1–8.
[27] C. Elkan, The foundations of cost-sensitive learning, in: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Seattle, Washington, 2011, pp. 973–978.
[28] Y. Liu, H. Lu, K. Yan, H. Xia, C. An, Applying cost-sensitive extreme learning machine and dissimilarity integration to gene expression data classification, Comput. Intell. Neurosci. (2016).
[29] K. Yan, Z. Ji, H. Lu, J. Huang, W. Shen, Y. Xue, Fast and accurate classification of time series data using extended ELM: application in fault diagnosis of air handling units, IEEE Trans. Syst. Man Cybern. Syst., 2017.
[30] K.M. Ting, An instance-weighting method to induce cost-sensitive trees, IEEE Trans. Knowl. Data Eng. 14 (3) (2002) 659–665.
[31] S. Zhang, X. Zhu, J. Zhang, C. Zhang, Cost-time sensitive decision tree with missing values, in: Proceedings of Knowledge Science, Engineering and Management, 4798, 2007, pp. 447–459.
[32] H. Lu, L. Yang, K. Yan, Y. Xue, Z. Gao, A cost-sensitive rotation forest algorithm for gene expression data classification, Neurocomputing 228 (2017) 270–276.
[33] L. Jiang, C. Li, S. Wang, Cost-sensitive Bayesian network classifiers, Pattern Recogn. Lett. 45 (2014) 211–216.
[34] G. Wu, E.Y. Chang, Adaptive feature-space conformal transformation for imbalanced data learning, in: Proceedings of the Twentieth International Conference on Machine Learning, 2003, pp. 816–823.
[35] W. Zong, G.-B. Huang, Y. Chen, Weighted extreme learning machine for imbalance learning, Neurocomputing 101 (2013) 229–242.
[36] Z.H. Zhou, L. Xu-Ying, On multi-class cost-sensitive learning, Comput. Intell. 26 (3) (2010) 232–257.
[37] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[38] X.Y. Liu, J. Wu, Z.H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Trans. Syst. Man Cybern. Part B Cybern. 39 (2) (2009) 539–550.
[39] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the International Conference on Machine Learning (ICML), 96, 1996, pp. 148–156.
[40] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42 (4) (2012) 463–484.
[41] C. Seiffert, T.M. Khoshgoftaar, J. Van Hulse, A. Napolitano, RUSBoost: a hybrid approach to alleviating class imbalance, IEEE Trans. Syst. Man Cybern. Part A Syst. Humans 40 (1) (2010) 185–197.
[42] N.V. Chawla, A. Lazarevic, L.O. Hall, K.W. Bowyer, SMOTEBoost: improving prediction of the minority class in boosting, in: Knowledge Discovery in Databases: PKDD 2003, Springer, Berlin Heidelberg, 2003, pp. 107–119.
[43] Q.D. Tran, P. Liatsis, RABOC: an approach to handle class imbalance in multimodal biometric authentication, Neurocomputing 188 (2016) 167–177.
[44] M. Kukar, I. Kononenko, Cost-sensitive learning with neural networks, in: Proceedings of the Thirteenth European Conference on Artificial Intelligence (ECAI), 1998, pp. 445–449.
[45] L.V.D. Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (2008) 2579–2605.
[46] F. Chollet, Keras, 2016. https://github.com/fchollet/keras.
[47] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[48] Does Alexa have a list of its top-ranked websites? 2016. Available online at: https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-.
[49] Bambenek Consulting – Master feeds, 2016. Available online at: http://osint.bambenekconsulting.com/feeds/.
[50] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Inf. Process. Manag. 45 (4) (2009) 427–437.
[51] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Netw. 18 (5) (2005) 602–610.
[52] R. Quinlan, Data mining tools See5 and C5.0, 2008. Available online at: https://www.rulequest.com/.
[53] J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Multiple Valued Logic Soft Comput. 17 (2–3) (2011) 255–287.
[54] L. Jiang, S. Wang, S.C. Li, L. Zhang, Structure extended multinomial naive Bayes, Inf. Sci. 329 (2016) 346–356.
[55] S. Garcia, The Stratosphere IPS project, 2015. Available online at: https://stratosphereips.org/category/dataset.html.

Duc Tran received the M.Sc. degree in electrical engineering from Budapest University of Technology and Economics, Hungary, in 2008, and the Ph.D. degree in information engineering from City University London, United Kingdom, in 2015. He is currently a research fellow with Bach Khoa Cybersecurity Centre, Hanoi University of Science and Technology, Vietnam. His research interests include machine learning, image processing, biometric authentication, and various pattern recognition-related algorithms.
Hieu Mac received the M.Sc. degree in data communication and computer networks from the Hanoi University of Science and Technology, Vietnam, in 2016, where he is currently pursuing the Ph.D. degree with Bach Khoa Cybersecurity Centre. His research interests include feature learning and extraction, machine learning, and applications in computer security.

Van Tong received the B.Eng. degree in computer science from Hanoi University of Science and Technology, Vietnam, in 2017. His current research interests include deep learning, network security and IoT.

Hai Anh Tran is a Lecturer in the School of Information and Communication Technology, Hanoi University of Science and Technology, Vietnam. He completed his Ph.D. at University of Paris-Est Creteil (UPEC), France, in 2012. His research interests lie in the area of Distributed Systems, QoE, Data Mining and IoT, ranging from theory to design to implementation. He has actively collaborated with researchers in several other disciplines of computer science, particularly security and cryptography. Tran has served on conference and workshop program committees for the SOICT conference since 2015.

Linh Giang Nguyen is an associate professor at the Department of Data Communication and Computer Networks, Hanoi University of Science and Technology, Vietnam. He received his M.Eng. degree in 1991 and Ph.D. degree in 1995 in computer engineering from the Georgian University of Technology, Georgia, former USSR. His research interests include machine learning, data analysis, optimization, wireless sensor networks, IoT, network security, and signal processing.