Application Traffic Classification Using Neural Networks
CAN UNCLASSIFIED
Disclaimer: This publication was prepared by Defence Research and Development Canada, an agency of the Department of
National Defence. The information contained in this publication has been derived and determined through best practice and
adherence to the highest standards of responsible conduct of scientific research. This information is intended for the use of the
Department of National Defence, the Canadian Armed Forces (“Canada”) and Public Safety partners and, as permitted, may be
shared with academia, industry, Canada’s allies, and the public (“Third Parties”). Any use by, or any reliance on or decisions
made based on this publication by Third Parties, are done at their own risk and responsibility. Canada does not assume any
liability for any damages or losses which may arise from any use of, or reliance on, the publication.
Endorsement statement: This publication has been peer-reviewed and published by the Editorial Office of Defence Research and
Development Canada, an agency of the Department of National Defence of Canada. Inquiries can be sent to:
[email protected].
© Her Majesty the Queen in Right of Canada, Department of National Defence, 2022
© Sa Majesté la Reine du chef du Canada, ministère de la Défense nationale, 2022
Abstract
This Scientific Report investigates the efficacy of machine learning techniques on a
passive eavesdropper’s ability to classify encrypted traffic in wireless networks, where the
eavesdropper can observe only the timing, sequence, and duration of packets. Datasets
were generated using a custom-built traffic generation tool to generate encrypted traffic
sets consisting of four application types: file transfer protocol (FTP), hypertext transfer
protocol (HTTP), voice over Internet protocol (VoIP), and extensible messaging and
presence protocol (XMPP). The eavesdropper gathered encrypted packets and used the
recurrent neural network (RNN) machine learning technique to classify the applications of
packets in the dataset on a packet-by-packet basis. In evaluating the efficacy of RNNs for
traffic classification, the effect of a number of parameters on accuracy and training time
was examined, including: the window size of data examined by the RNN; the effect of
partial knowledge; different configurations of RNNs such as many-to-many or many-to-one;
and how best to combine the outputs of an RNN to make a classification prediction.
This work is important to the defence and security community, as it studies how much a
passive eavesdropper can learn about network activities by using deep learning techniques
without breaking network encryption.
DRDC-RDDC-2022-R052 i
Résumé
This Scientific Report examines the efficacy of machine learning techniques on a passive
eavesdropper's ability to classify encrypted traffic in wireless networks by observing only
the timing, sequence, and duration of packets. To that end, a custom-built traffic generation
tool was used to create encrypted datasets comprising four application types: file transfer
protocol (FTP), hypertext transfer protocol (HTTP), voice over Internet protocol (VoIP),
and extensible messaging and presence protocol (XMPP). The eavesdropper captured
encrypted packets and used a recurrent neural network (RNN), a machine learning
technique, to classify the packets in the dataset one by one according to application type.
To evaluate the efficacy of RNNs for traffic classification, the effect of various parameters
on accuracy and training time was studied, including the window size of the data analyzed
by the RNN, the effect of partial knowledge, different RNN configurations (many-to-many
or many-to-one), and how best to combine the outputs of an RNN to make a classification
prediction.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Résumé . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
List of Abbreviations/Acronyms/Initialisms/Symbols . . . . . . . . . . . . . . . . . . 38
List of Figures
Figure 1: Rolled RNN diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Figure 7: F1 score learning curve for various early stopping patiences using GRU
as the RNN unit, 64-32 as the RNN hidden layer model, 10 as the
window size, 64 as the batch size, and 1 as the stride number. . . . . . . 16
Figure 9: Performance of various window sizes and strides using GRU as the RNN
unit, 64-32 as the RNN hidden layer model, and 64 as the batch size. . . 22
Figure 12: Performance of various window sizes with the stride set to 1, the batch
size set to 64, the RNN unit set to GRU, the RNN hidden layer model
set to 64-32. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Figure 13: Training times of all tested window size and stride configurations with
the batch size set to 64, the RNN unit set to GRU, the RNN hidden
layer model set to 64-32. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Figure 16: Results from training model with single features plotted alongside
default dual features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Figure 18: Plot of F1 scores for aggregate and default test methods. . . . . . . . . . 33
List of Tables
Table 1: Train set traffic composition. . . . . . . . . . . . . . . . . . . . . . . . . 9
Table 4: Training time and F1 score under various layer and node combinations
using GRU as the RNN unit, 10 as the window size, 64 as the batch
size, and 1 as the stride number. . . . . . . . . . . . . . . . . . . . . . . 13
Table 5: Training time and F1 score of different RNN unit types under 10 and
200 window sizes using 64-32 as the RNN hidden layer model, 64 as the
batch size, and 1 as the stride number. . . . . . . . . . . . . . . . . . . . 14
Table 6: Training time and F1 score of various batch sizes under 10 and 200
window sizes using 64-32 as the RNN hidden layer model, GRU as the
RNN unit, and 1 as the stride number. . . . . . . . . . . . . . . . . . . . 15
1 Introduction
Traffic analysis encompasses a broad variety of techniques to infer information about
encrypted communication activity by analyzing patterns and characteristics of transmitted
signals, including observable elements such as transmission rate, frequency, timing, and
apparent source/destination. Unlike techniques focused on breaking the encryption of
transmission, traffic analysis generally does not seek to expose the contents of the message
but instead concentrates on deducing such things as the topology of a communication
network, the roles of various users or nodes, and the type of communication taking place.
The ultimate limits of performance of traffic analysis techniques are of interest to network
defenders, who must understand the level to which adversaries can gather useful data
through passive eavesdropping. The traffic analysis threat is especially germane to wireless
networks, where eavesdropping can be performed by a remote receiver without the need
for network taps or invasive physical connections. Understanding these threats is relevant
to securing the Canadian Armed Forces' (CAF) use of radios for tactical operations, use
of wireless communications in fixed infrastructure settings, and the potential future use
of wireless Internet of Things (IoT) sensors in infrastructure and real-property settings.
This Scientific Report focuses on traffic classification—a subset of traffic analysis—where
the goal is to determine the type of application layer traffic being transmitted over a
network. Specifically of interest is the efficacy of machine learning techniques to classify
traffic between two nodes in a network on a packet-by-packet basis. Note that the traffic
data used in this Report are collected from the network layer, and only network-layer metadata
such as packet length and packet arrival time are used for performance evaluation.
Traffic classification is itself an old problem, and is performed for a variety of reasons
under different circumstances. Most commonly, traffic classification is performed by
telecommunications and network operators to classify flows internal to their network to
better manage quality of service and prioritize resources depending on application type and
service requirements. In the case of a service provider examining data packets in its own
network, some data may be available in the clear with no lower-layer encryption, including
elements such as packet headers, source/destination Internet Protocol (IP) addresses, and
ports. While service providers examine traffic traversing their own networks, an adversary
tries to perform traffic classification in a network they do not control. In these cases, traffic is
likely encrypted, although quite often certain headers remain visible (e.g., depending upon
the level the encryption is applied, media access control (MAC) or even IP headers may
be visible). Many studies of traffic classification assume that traffic flows can be uniquely
identified by examining metadata, which greatly simplifies the problem. In most military
wireless applications, however, it is reasonable to expect a network to employ link-layer
encryption thus obfuscating all header information. The assumption made in this Report
is that an adversary has access to no header information at all, and can observe only the
timing, order, and duration of packets. Of interest is determining how much an adversary can
infer about applications in use in a network when only this minimum amount of information
is available. Put another way: even if the defender takes precautions to ensure all data and
signalling is encrypted, how much can a passive eavesdropper learn about network activities
without actually compromising the network?
To study this problem, an in-depth analysis was undertaken using recurrent neural networks
(RNN) to classify traffic on a packet-by-packet basis according to application type.
RNNs are a type of neural network in which previous outputs of the neural network can be
used as inputs to subsequent states; effectively this structure means that RNNs can retain
memory of learning generated from recent inputs and incorporate this into the classification
model. The “internal memory” of RNNs makes them a popular machine learning algorithm
for sequence data (including things like identifying speech, audio, and time series). They are
thus a promising candidate for application traffic classification, where timing and ordering
of packets from different applications may each have distinct signatures dictated by their
respective protocols. In evaluating RNNs for traffic classification, the effect of a number
of parameters on accuracy and training time is examined, including: the window size of
data examined by the RNN (i.e., the amount of history examined); the effect of partial
knowledge (i.e., only having packet length, or only having packet timing but not both);
different configurations of RNNs such as many-to-many or many-to-one; and how best to
combine the outputs of an RNN to make a classification prediction.
The remainder of this Report is organized as follows. In Section 2 a brief literature review
of background research and related work in the field of traffic analysis and classification
is presented. Section 3 provides an overview of RNNs, explaining their structure and
parameters of interest. The methodology for the experimental work of this paper is described
in Section 4; this includes a discussion of how network traffic was generated for datasets,
the training and testing process, and the experiments used to evaluate traffic classification
accuracy. Results and discussion are presented in Section 5, with a brief conclusion in
Section 6 summarizing the work.
2 Related Works
This section details various research efforts in the domain of traffic analysis and classification
specifically pertaining to machine learning methods. This Report aims to supplement
previous work by applying similar techniques at the frame level and with minimal input
features, which to the best of our knowledge has yet to be explored by the research
community.
Coupled with artificial intelligence (AI) technologies, current traffic analysis is used to
support both offence and defence in the cyber security domain. For instance, traffic
analysis has been widely used in automated intrusion detection systems, reconstruction of
adversarial tactical networks, target selection, and smart jamming [3–7]. With adversaries
also employing advanced network security protection, it becomes more challenging to extract
packet contents by directly inspecting intercepted traffic. The importance of traffic analysis
and classification grows and quickly evolves as adversarial capabilities continually advance
and prior analysis methods become obsolete.
Research has been undertaken to apply supervised learning to traffic classification and
attack detection. Levent Koc et al. proposed a network intrusion detection system based on
a hidden naive Bayes classifier and applied it to the KDD’99 dataset for detecting probe,
denial-of-service (DoS), user to root (U2R), and remote to local (R2L) attacks [9]. Sarker
et al. presented an intrusion detection tree for malicious behaviour and anomaly detection
[10]. Raiker et al. applied support vector machine (SVM), nearest centroid, and Gaussian
naive Bayes learning techniques to the traffic from a software defined network (SDN) [11].
Vishwakarma et al. presented a k-nearest neighbours (k-NN) for intrusion detection and
similarly to Levent Koc et al. tested it on the KDD’99 dataset for probe, DoS, U2R, and
R2L attack detection [12].
Unsupervised learning discovers patterns and structures and formulates models from
unlabelled datasets. This allows for the detection of unknown traffic types which is
useful in anomaly detection. Clustering-based and association rule-based techniques are
often applied to the classification and attack detection field. Clustering-based algorithms
include distribution-based (e.g., Gaussian mixture model), density-based (e.g., DBSCAN),
centroid-based (e.g., K-means), and hierarchical/connectivity-based [13–15]. Frequent
pattern growth and apriori algorithms fall under the association rule-based category [16].
Several methods were explored in the unsupervised domain as well. Zhang et al. presented an
intrusion detection system for IoT traffic based on a deep belief network [19]. A generative
adversarial deep neural network was implemented by Erpek et al. to launch and mitigate
wireless jamming attacks [20]. Aleroud and Karabatis proposed a similar generative network
for phishing detection [21]. Farahnakian and Keikkon and Javaid et al. both leveraged
auto-encoder derivatives to perform intrusion detection [22, 23].
In each timestep, t, an RNN unit, A, is passed a feature vector, x<t> and the activation
from the previous timestep, a<t−1> . Both are multiplied by respective weight matrices,
offset by a bias vector and passed through an activation function. The result is the current
timestep’s activation, a<t> , which is fed back into the RNN for the next timestep along
with the following feature vector x<t+1> . In addition to being passed to the next timestep,
the activation is typically passed through a final non-recurrent layer, resulting in an output
y <t> . To illustrate the temporal nature of an RNN, Figure 2 shows an RNN in its unrolled
form.
Note that although this representation appears to consist of several layers, each layer is
in fact the same RNN unit, A, with the same parameters but at a different position
in time. The most common RNN training algorithm is Backpropagation Through Time
(BPTT). Derivatives of the error with respect to the network’s weights are calculated and
accumulated, starting from the last timestep and working, “backwards through time” until
the first timestep is reached. At this point the weights are adjusted using the accumulated
derivative to minimize the error. As with training a feed-forward MLP, this process is
repeated for every training input and often multiple times over for the entire training set
until the error stabilizes at a minimum.
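The recurrence just described can be sketched in a few lines. This is an illustrative simple (Elman) RNN cell with a tanh activation, not the report's implementation; the dimensions and random weights below are placeholders.

```python
import numpy as np

def rnn_step(x_t, a_prev, Wx, Wa, b):
    """One timestep: weight the current input and previous activation, add bias,
    and pass through the activation function to get the new activation a<t>."""
    return np.tanh(Wx @ x_t + Wa @ a_prev + b)

def rnn_unroll(xs, Wx, Wa, b):
    """Apply the SAME cell (same parameters) at every position in the sequence,
    carrying the activation forward, as in the unrolled diagram."""
    a = np.zeros(Wa.shape[0])
    activations = []
    for x_t in xs:
        a = rnn_step(x_t, a, Wx, Wa, b)
        activations.append(a)
    return activations

# Toy dimensions: 2 input features per packet, 4 hidden nodes, 10 timesteps
rng = np.random.default_rng(0)
Wx, Wa, b = rng.normal(size=(4, 2)), rng.normal(size=(4, 4)), np.zeros(4)
acts = rnn_unroll(rng.normal(size=(10, 2)), Wx, Wa, b)
```

Each element of `acts` is the activation a&lt;t&gt; that would also feed a final non-recurrent output layer to produce y&lt;t&gt;.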
The RNN configuration shown in Figure 2 is many-to-many where for each timestep, t, there
is an input feature vector, x<t> , and a corresponding output vector, y <t> . In the domain
of packet classification, each x<t> corresponds to a single packet and its features could,
for example, consist of length, interarrival time, important flags, and/or other pertinent
header or payload information. The output, y <t> , is a vector of probabilities describing the
likelihood that x<t> belongs to each of the specified traffic classes.
Other RNN configurations are also well-suited for packet-by-packet traffic classification. A
many-to-one model takes the same input features as the many-to-many model but only
predicts a single label for the entire series. The label could correspond to a prediction for a
packet at a specific position, the majority traffic type in the entire input series, or any other
determining factor chosen by the implementer. Such a model is portrayed in Figure 3.
In this example it would be natural to have the neural network output a predicted label for
the last packet. Information from all previous packets in the series is propagated through
the network to help make the prediction. Even if the goal is to classify the first packet in
the sequence, the prediction will not be made until the entire sequence has been traversed.
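The two configurations differ only in which activations are mapped to output probabilities. A sketch, assuming a plain softmax output layer and an arbitrary class count (both illustrative, not the report's exact architecture):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def outputs(activations, Wy, by, many_to_many=True):
    """Map RNN activations to class-probability vectors.

    many_to_many: one prediction per timestep (one label per packet).
    many_to_one:  a single prediction for the window, from the last activation,
                  which has seen information propagated from all earlier packets.
    """
    if many_to_many:
        return [softmax(Wy @ a + by) for a in activations]
    return softmax(Wy @ activations[-1] + by)

# Toy example: 5 timesteps of 4-dim activations, 5 traffic classes
rng = np.random.default_rng(1)
acts = [rng.normal(size=4) for _ in range(5)]
Wy, by = rng.normal(size=(5, 4)), np.zeros(5)
per_packet = outputs(acts, Wy, by, many_to_many=True)
window_label = outputs(acts, Wy, by, many_to_many=False)
```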
4 Methodology
The process of developing an RNN classifier involves obtaining a dataset, performing
preprocessing steps, then training, testing, and tuning the RNN. To develop an RNN
classifier to perform traffic analysis, data was collected using the Defence Research and
Development Canada Traffic Generator (DTG) [26]; the data collected for both testing
and training the RNN is discussed in Section 4.1. Section 4.2 discusses the preprocessing
operations that are performed on the raw dataset before it can be fed into the RNN. An
RNN architecture that suits the format of the data is examined in Section 4.3 as well as
the process of selecting hyperparameters to optimize performance.
4.1 Dataset
Gathering a representative dataset is imperative to successfully training a neural network.
The dataset is typically grouped into training, validation, and test sets. Each could be
collected separately or be a subset of a single dataset. Regardless, the stochastic nature
of the data should be consistent across all three sets so that each is representative of the
problem space.
For this research, the purpose-built DTG software tool was chosen to generate labelled
traffic, due to its ability to reliably produce datasets composed of user-controlled random
distributions of application-layer traffic. The four supported traffic types in DTG are FTP,
HTTP, VoIP, and XMPP. Each are generated according to standardized and accepted
stochastic models. See [26] for more details regarding the models and architecture of the
traffic generator.
To generate the network data for traffic classification analysis, two hosts were directly
networked together. The DTG server software was installed on one host, and the DTG
client software was installed on the other. A third host was used to monitor the link between
the DTG client and server using the Wireshark traffic sniffing application and to save the
data as a pcap (packet capture) file. Running the DTG application on the client and server
allowed for the dynamic creation of traffic files, with statistics as described below.
DTG was used to generate three separate traces for the training, validation, and test sets.
Three separate sets were generated (as opposed to sampling from a single set) in order
to ensure that all sets contained the various protocols’ start-up and shut-down packet
exchanges. Shuffling a single set before dividing it into subsets would also have ensured an
equal distribution of start and end sequences. However, in most traffic classification use
cases, data would likely be classified in real time and therefore, the original order should
be preserved in the test set to replicate this scenario. Tables 2 and 3 break down the traffic
compositions of the training, validation, and test sets respectively. The traffic type “Other”
refers to packets captured on the link (between client and server) that were not generated by
one of the main DTG traffic types (FTP, HTTP, VoIP, XMPP). Other traffic was limited, in
general, to network protocols such as simple service discovery protocol (SSDP), multicast
domain name system (MDNS), and address resolution protocol (ARP). A network time
protocol (NTP) service was run in the background to synchronize the hosts and its packets
also fall under the “Other” class.
Table 1: Train set traffic composition.
Traffic Type FTP HTTP VoIP XMPP Other Total
Num Packets 335,636 58,517 981,181 1,811 3,612 1,380,757
% of Set 24.31 4.24 71.06 0.13 0.26 100
Note that the relative percentages of the sets devoted to each traffic type are consistent
across all three sets. The DTG generates all traffic in parallel so packets corresponding
to each protocol are spread evenly within the datasets. Another notable attribute of the
datasets is the class imbalance. There are significantly more VoIP and FTP packets than
any other traffic type and significantly fewer XMPP and Other packets. This is a result
of using the default parameters of the DTG that are intended to mimic real traffic. VoIP
conversations, for example, naturally result in long streams of voice data packets. Changing
the distributions to balance the classes would undermine the integrity of the representative
nature of the datasets and was therefore avoided. It is well known that class imbalances add
to the difficulty of training a classifier. Machine learners often prioritize majority classes
since performing well on them may have the most substantial impact on minimizing the
loss function.1 This can lead to minority classes being overlooked or altogether ignored. If
necessary, a number of techniques can be used to increase the importance of the minority
classes. Such methods include increasing the weight of minority classes in the loss function
and choosing a performance metric well-suited for class imbalances. In the case of the DTG
classification problem, as can be seen in Section 5, the learners had adequate performance
without having to additionally compensate for the class imbalance.
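As a sketch of the weighting technique mentioned above (not something the report needed to apply), inverse-frequency class weights can be derived from the Table 1 counts; the specific weighting scheme here is an assumption:

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * count), so rarer classes
    contribute more to the loss function."""
    total = sum(counts.values())
    k = len(counts)
    return {cls: total / (k * n) for cls, n in counts.items()}

# Train set composition from Table 1
weights = inverse_frequency_weights({
    "FTP": 335636, "HTTP": 58517, "VoIP": 981181, "XMPP": 1811, "Other": 3612,
})
```

With these counts, XMPP (0.13% of the set) receives the largest weight and VoIP (71.06%) the smallest.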
4.2 Preprocessing
Before the datasets can be passed to an RNN they must be converted from raw traces into
groups of input features and labels. Each packet is labelled according to its destination or
1. The loss function is the function that computes the distance between the current output of the algorithm and the expected output. The loss function used in this Report is the cross-entropy computed on the softmax() output.
source port numbers, which are unique to the DTG’s supported traffic types. Any packets
with port numbers that fall outside of the DTG range are labelled as “Other.” Note that
the port number is only used to establish the ground truth and is concealed from the RNN
beyond this step. From Wireshark traces, each packet’s length in bytes and the difference
of arrival time from the previous packet, in seconds, are saved as input features. Thus,
the features for each packet consist of an input tuple of interarrival time2 and length as
well as a class label establishing the ground truth. Since interarrival times are often in the
microsecond range, normalization was performed on inputs as a means of reducing training
time [27].
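The per-packet feature extraction described above can be sketched as follows. The input format is assumed, and min-max scaling stands in for whichever normalization scheme [27] was actually used:

```python
def to_features(packets):
    """packets: list of (arrival_time_s, length_bytes) from the trace.
    Returns (interarrival_time, length) tuples; the first packet gets dt = 0."""
    feats = []
    prev_t = packets[0][0]
    for t, length in packets:
        feats.append((t - prev_t, length))
        prev_t = t
    return feats

def minmax(column):
    """Scale one feature column to [0, 1] to speed up training."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

feats = to_features([(0.0, 100), (0.001, 60), (0.003, 1500)])
```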
Feeding an entire pcap file into an RNN is not feasible due to the capture's large size.
Instead, the datasets are split into windows that are passed to the RNN sequentially.
Two parameters determine how a dataset is sliced into windows. The “window size” is
the number of packets to include in each window and the “stride” indicates by how many
packets the window should be shifted before another set of packets is taken. Figure 5 shows
a subset of a dataset (pcap), where each rectangle represents a single packet, and depicts
how the two parameters influence window selection. Note that the data is processed using
the windows and strides for training and testing of the RNN.
The result is a set of windows, each of length “window size” which will be individually fed
into the RNN. The effect of various window sizes and strides is the topic of Section 5.1.
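The window/stride slicing can be sketched directly; `window_size` and `stride` behave exactly as described above:

```python
def make_windows(samples, window_size, stride):
    """Slide a fixed-size window over the packet sequence, advancing the start
    position by `stride` packets each time."""
    return [samples[i:i + window_size]
            for i in range(0, len(samples) - window_size + 1, stride)]

# 10 packets, window size 4, stride 2 -> windows start at packets 0, 2, 4, 6
windows = make_windows(list(range(10)), window_size=4, stride=2)
```

A stride smaller than the window size, as here, means consecutive windows overlap, so each packet appears in several training windows.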
One of the more influential RNN hyperparameters is the number of layers and the number
of nodes in each layer. Figure 6 depicts the layers and nodes in a bidirectional RNN model.
Increasing the depth (adding layers) allows for the network to learn more complex problems
but at the cost of needing more computational resources, increased training time, and
the possibility of overfitting. The parameter space of layer and node combinations was
explored to find suitable values that performed well on DTG data without adding substantial
complexity for minimal performance gain. Numerous combinations of layers and nodes were
trained and tested with the DTG-generated datasets to converge on optimal values and are
shared in Table 4, where the window size is 10, the batch size is 64, the stride number is 1,
and the RNN unit is a bidirectional GRU. The well-known F1 metric (described
in further detail in Section 5) is used to evaluate the performance of the learner, with the
average F1 score taken as the averaged score for classification across all four application
types. The Nodes column contains hyphen-separated quantities of nodes for each layer
where the leftmost layer accepts input features and the rightmost layer is the final RNN
layer before the softmax layer.
The column “Trainable parameters” refers to the quantity of weights and biases throughout
the entire network, where the last entry corresponds to the output layer. The number of
trainable parameters in an RNN layer, p, is determined by Equation (1), where r is a constant
that is unique to the type of RNN unit,3 n is the number of nodes in that layer, and m
is the size of the layer’s input vector. A layer with a large number of nodes significantly
increases the complexity of the network as can be seen in the configuration of a single layer
of 128 nodes.
p = r(n² + nm + n) (1)
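Equation (1) can be computed directly. The first function below is the equation as written; the second is a hypothetical cross-check suggesting that the Table 4 counts correspond to a bidirectional layer (which doubles the total) whose GRU carries two bias vectors (2n rather than n), as some framework implementations do — that variant is an assumption, not something the report states.

```python
def rnn_layer_params(r, n, m):
    """Equation (1): p = r * (n^2 + n*m + n), for one unidirectional layer."""
    return r * (n * n + n * m + n)

def bigru_params(n, m):
    """Assumed variant: bidirectional (x2) GRU (r = 3) with a doubled bias term."""
    return 2 * 3 * (n * n + n * m + 2 * n)

# First hidden layer of the 64-32 model: 64 nodes, 2 input features (length, time)
layer1 = bigru_params(64, 2)       # matches the 26112 entry in Table 4
# Second hidden layer: 32 nodes, fed by both directions of the 64-node layer
layer2 = bigru_params(32, 2 * 64)  # matches the 31104 entry in Table 4
```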
Note that the number of epochs for each training sequence is not fixed. The network
is trained until the validation loss ceases to improve for some pre-determined number
of epochs (the early stopping patience).
3. r is the number of feed-forward neural networks in an RNN unit. For example, r is 1 for a simple RNN, 4 for an LSTM, and 3 for a GRU.
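The early-stopping rule can be sketched as follows; the patience value is whichever pre-determined number of non-improving epochs is chosen (Figure 7 explores several values):

```python
def stop_epoch(val_losses, patience):
    """Return the epoch index at which training halts: the validation loss has
    not improved on its best value for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1  # ran out of epochs without triggering
```

In practice the model weights from the best epoch, not the stopping epoch, would be kept.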
Table 4: Training time and F1 score under various layer and node
combinations using GRU as the RNN unit, 10 as the window size, 64 as
the batch size, and 1 as the stride number.
Hidden Layers    Nodes    Trainable Parameters    Training Time (min)    Average F1 Score
1 8 576-85 88 0.850
1 32 6912-325 35 0.886
1 128 101376-1285 42 0.909
2 8-8 576-1248-85 89 0.895
2 16-8 1920-2016-85 172 0.910
2 8-16 576-3264-165 65 0.897
2 32-16 6912-7872-165 140 0.928
2 32-8 6912-3552-85 126 0.924
2 64-32 26112-31104-325 115 0.938
2 64-64 26112-74496-645 78 0.932
2 128-64 101376-123648-645 124 0.916
3 8-8-8 576-1248-1248-85 132 0.903
3 8-16-8 576-3264-2016-85 203 0.914
3 32-16-8 6912-7872-2016-85 114 0.929
3 32-64-32 6912-49920-31104-325 78 0.927
4 8-8-8-8 576-1248-1248-1248-85 199 0.924
4 8-16-16-8 576-3264-4800-2016-85 231 0.926
4 16-32-32-16 1920-12672-18816-7872-165 212 0.924
4 32-64-64-32 6912-49920-74496-31104-325 219 0.924
5 8-8-8-8-8 576-1248-1248-1248-1248-85 269 0.918
“Average F1” score is the most commonly used performance metric throughout this Report.
Section 5 describes in detail why this metric was chosen, what the value represents, and
how to calculate it with Equation (4). In short, the F1 score for a classifier is a measure of
its ability to correctly identify members of a particular class. Its value falls between [0, 1]
where 1.0 is a perfect score. The “Average F1” score is an average of individual class scores,
weighting each class equally. The highest F1 score, 0.938, is achieved by the 2-layer,
64-32 configuration. Increasing the depth of the network past
three layers appears to no longer improve its capacity to learn, while also increasing the
training time. Despite having the deepest configuration, the 5 layer network has relatively
few parameters because of the low impact of a small number of nodes per layer. This
configuration takes longer to train than shallower networks because the BPTT algorithm
prevents a large portion of calculations from being performed in parallel. The shallowest
configurations with one layer do not learn as well as the multilayer configurations. The best
results are achieved by the two or three layer models with one or more layers having at
least 32 nodes and none of the layers exceeding 64 nodes. The 2-layer, 64-32 configuration
was chosen because it had the highest score and a midrange training time.
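The averaged F1 metric used throughout can be sketched in its standard macro-averaged form from per-class precision and recall (the report's exact definition is Equation (4) in Section 5):

```python
def f1(tp, fp, fn):
    """F1 for one class: harmonic mean of precision and recall,
    computed from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class_counts):
    """Average the per-class F1 scores, weighting every class equally
    regardless of how many packets each class contributes."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)
```

Equal weighting is what makes the metric robust to the class imbalance noted in Section 4.1: a classifier that ignores XMPP is penalized as heavily as one that ignores VoIP.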
The simple RNN unit as well as GRU and LSTM units were tested to determine which
is most favourable for the packet classification problem. Two scenarios were examined: a
small window size where long-range dependencies are not critical and a larger window size
where they are. Table 5 combines the results of the scenarios, where the batch size is 64, the
stride number is 1, and the RNN hidden layer model is 64-32, i.e., the first hidden layer
uses 64 nodes and the second hidden layer uses 32 nodes. The simple RNN unit was not
tested on the larger window to avoid excessive training times after it had already performed
poorly on the 10 packet window and since it is not designed for long-range dependencies.
Table 5: Training time and F1 score of different RNN unit types under
10 and 200 window sizes using 64-32 as the RNN hidden layer model,
64 as the batch size, and 1 as the stride number.

Window Size   RNN Unit     Training Time (min)   Average F1 Score
10            Simple RNN   519                   0.891
10            LSTM         96                    0.925
10            GRU          73                    0.923
200           LSTM         321                   0.970
200           GRU          339                   0.984
The two units designed for long-term dependencies, the GRU and LSTM, trained faster and
performed better than the simple RNN on the smaller window. Between the two, training
times and F1 scores were similar, with the GRU performing slightly better on the 200-sized
window. Because of this performance edge, and because LSTMs have an extra gate to
train (impacting memory requirements), the GRU was chosen for the classifier.
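For reference, a single GRU timestep can be sketched in NumPy. This is a generic textbook formulation (gate convention as in Keras), not the Report's implementation, and the weights below are random stand-ins:

```python
import numpy as np

def gru_step(x, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU timestep: an update gate z, a reset gate r, and a
    candidate state -- one fewer gate group than an LSTM, which is
    why the GRU has fewer parameters to train."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ Wz + h_prev @ Uz + bz)            # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur + br)            # reset gate
    h_cand = np.tanh(x @ Wh + (r * h_prev) @ Uh + bh)
    return z * h_prev + (1.0 - z) * h_cand            # blend old and new state

rng = np.random.default_rng(0)
n_in, n_hid = 2, 4            # two features per packet; a tiny hidden size
weights = [rng.standard_normal(shape) * 0.1
           for shape in [(n_in, n_hid), (n_hid, n_hid), (n_hid,)] * 3]
h = np.zeros(n_hid)
for x in rng.standard_normal((10, n_in)):             # a 10-packet window
    h = gru_step(x, h, *weights)
```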
The hyperparameter “Batch Size” determines how many training samples to process before
performing BPTT and updating the model’s weights. The largest possible batch size is
equal to the number of samples in the entire training set. Conversely, the smallest batch
size is just a single sample. Generally, a larger batch size will train the model faster at the
cost of the network’s ability to generalize, and results in reduced performance [29]. Table 6
compares results using various batch sizes on both small and large window sizes of 10 and
200 respectively, where the stride is 1, the RNN unit is GRU, and the RNN hidden layer
model is 64-32.
Table 6: Training time and F1 score of various batch sizes under 10
and 200 window sizes using 64-32 as the RNN hidden layer model and
1 as the stride number.

Window Size   Batch Size   Training Time (min)   Average F1 Score
10            32           163                   0.916
10            64           103                   0.916
10            128          69                    0.923
10            512          24                    0.911
10            1024         16                    0.900
200           32           177                   0.980
200           64           107                   0.975
200           128          44                    0.963
200           512          39                    0.958
200           1024         49                    0.949
“Average F1” scores do not deviate considerably across batch sizes on a small window, with
smaller to medium sized batches performing the best. On a larger window size the batch
size has a more significant impact. The spread of “Average F1” scores coincides with the
theory that larger batch sizes reduce a model’s ability to generalize. The impact of batch
size on a small window was less pronounced, so a batch size of 64 was chosen based on the
large window results. A batch size of 32 performed marginally better but took longer to
train and was discarded.
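The trade-off follows partly from how many weight updates occur per epoch: larger batches mean fewer updates (and fewer BPTT passes). A sketch, with a hypothetical training-set size used for illustration only:

```python
import math

train_samples = 50_000  # hypothetical training-set size, for illustration

for batch_size in (32, 64, 128, 512, 1024):
    # One weight update per batch; the last batch may be partial.
    updates = math.ceil(train_samples / batch_size)
    print(f"batch {batch_size:>5}: {updates:>5} weight updates per epoch")
```

The 32x increase in batch size (32 to 1024) cuts per-epoch weight updates by roughly the same factor, which mirrors the training-time column of Table 6.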
In the context of training an RNN, an epoch refers to the act of iterating through the entire
training set once. Within an epoch, unless the batch size is set to the length of the training
set, the network’s weights will be updated several times. A reasonable fixed number of
epochs can sometimes be selected by examining a learning curve or through trial and error.
For the DTG classifier, the number of epochs necessary to converge on a minimum
loss varies depending on the window size. Using the alternative, early stopping (described
below), allowed for networks trained on different window sizes to be fairly compared.
Early stopping is the process of training an RNN for an arbitrarily large number of epochs but
stopping early when a condition is met. The condition is usually that the validation loss4 or
a selected metric has not improved based on a minimum delta value5 for a specific number
4 The validation loss is the loss calculated on the validation dataset with the same RNN model and loss
function as used on the training dataset. The validation loss is calculated after each epoch, whereas the
training loss is calculated during each epoch.
5 The minimum delta value sets the threshold for a change to be considered an improvement, i.e., the
change in the monitored value must exceed the minimum delta value. In this Report, the minimum delta
value was set to 0.001, meaning only a change in loss greater than 0.001 is considered an improvement.
of epochs. The number of epochs is referred to as the patience. Early stopping is evaluated
on the validation set because the training loss6 can continue to improve even though the
validation loss has stagnated or worsened, wasting time and resources. The weights that
produced the best loss or metric before early stopping are reloaded at the end of training.
The validation loss was used as the early stopping monitor until it was determined that
the minimum loss does not correlate with the maximum average F1 score. A network was
trained evaluating both the validation and test sets every epoch to determine if a high
validation F1 score is replicated with the test set. Figure 7 is a plot of the results, where
the window size is 10, the batch size is 64, the stride number is 1, the RNN unit is a
bidirectional GRU, and the RNN hidden layer model is 64-32.
The test set’s F1 scores were consistently better than the validation set’s. This likely indicates
that the test set is more similar to the training set than the validation set is. Nevertheless, the
performance on the test set closely mimics the validation set, confirming that the validation F1
score should be used as the early stopping monitor. Several key points are also plotted on
the figure. They are the test scores that the network would have achieved if early stopping
patience (ESP) was set to the values listed in the legend. The minimum delta was set to
0.001, the smallest precision value used in this Report for F1 scores. Both ESPs of 5 and
6 The training loss is the loss calculated on the training dataset with the predefined RNN model shown
in Table 4 and the loss function, i.e., softmax() in this Report.
10 stopped after not improving past the validation F1 score achieved on epoch 13 and
their respective points overlap on Figure 7. An ESP of 5 was chosen because it would have
stopped training shortly after the steepest portion of the curve. Beyond that point, the
validation F1 score improves slightly but would require using an ESP of 25 which would
lengthen training time by roughly a factor of 4. Stopping earlier increases the uncertainty
in results because the process has less time to reach a stable state. Because the TensorFlow
random seed was not fixed, each training run produces different results; therefore, for the
results in Section 5, training was run three times per set of hyperparameters on the same
dataset and the results were averaged.
Figure 8 is a diagram of the final model using the hyperparameters from Table 7. There are
two fully connected bidirectional GRU layers. The subscripts 64 and 32 refer to how many
nodes are in the units for layers 1 and 2 respectively. The window size parameter discussed
in Section 4.2 determines how long of a sequence the network accepts and is represented by
the subscript ws of the last x input and y output. The operation depicted by joining the lines
of the activations of the first layer’s forward and backwards propagations is concatenation.
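The concatenation step can be illustrated with array shapes. This sketch uses random stand-in activations; the shapes follow the 64-node first layer and a window size of 10:

```python
import numpy as np

ws, n_fwd = 10, 64                        # window size; nodes in layer 1
h_forward = np.random.rand(ws, n_fwd)     # stand-in per-step activations of
h_backward = np.random.rand(ws, n_fwd)    # the forward and backward passes

# Concatenate per timestep: each of the ws positions now carries context
# from both directions, feeding the second layer a 128-wide input.
h_concat = np.concatenate([h_forward, h_backward], axis=-1)
```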
5 Results
Supervised learning predictive analysis often makes use of confusion matrices. Confusion
matrices for classification problems are of shape (n, n) where n is the number of specified classes.
The rows and columns correspond to the predicted and actual classes respectively. Table 8
shows a zero-initialized confusion matrix for the DTG classification problem.
Table 8: Zeroed confusion matrix.

                        Actual
Predicted   FTP   HTTP   VoIP   XMPP   Other
FTP         0     0      0      0      0
HTTP        0     0      0      0      0
VoIP        0     0      0      0      0
XMPP        0     0      0      0      0
Other       0     0      0      0      0
A cell within the matrix is incremented with every packet prediction made on the test set.
For example, if the first packet of the test set is predicted by the model as an FTP packet but
is actually HTTP, and the next two packets are correctly predicted as HTTP, the confusion
matrix will be updated to look like Table 9.
Table 9: Incremented confusion matrix.

                        Actual
Predicted   FTP   HTTP   VoIP   XMPP   Other
FTP         0     1      0      0      0
HTTP        0     2      0      0      0
VoIP        0     0      0      0      0
XMPP        0     0      0      0      0
Other       0     0      0      0      0
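The incrementing logic just described can be sketched as follows; the class order and the three example predictions are taken from the text:

```python
import numpy as np

classes = ["FTP", "HTTP", "VoIP", "XMPP", "Other"]
idx = {c: i for i, c in enumerate(classes)}
cm = np.zeros((5, 5), dtype=int)          # rows: predicted, cols: actual

# The three example predictions from the text: one FTP prediction that
# is actually HTTP, then two correct HTTP predictions.
for predicted, actual in [("FTP", "HTTP"), ("HTTP", "HTTP"), ("HTTP", "HTTP")]:
    cm[idx[predicted], idx[actual]] += 1
```

After these three updates the matrix matches Table 9: a 1 in the (FTP, HTTP) cell and a 2 on the HTTP diagonal.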
The matrix will continue to fill until the final packet has been predicted. A perfect model will
only have non-zero values along the diagonal. Conversely, a model with poor performance
will result in a distribution of values across all cells. Table 10 is an example of a completed
confusion matrix with artificial values. The real test set has tens of thousands of samples
and is not correlated with this matrix.
Aside from providing a visualization of error, confusion matrices can be used to calculate
per class true positives (TP), false positives (FP), false negatives (FN), and true negatives
(TN). Table 11 highlights these values for the FTP class. Table 12 depicts the same table
but with XMPP as the class of focus in blue.
Numerous predictive analysis metrics can be derived from TP, FP, FN, and TN values.
Precision is a measure of how many packets are correctly classified to a given class compared
to the total predicted data (TP+FP) for that class and is calculated with Equation (2). The
metric recall, represented by Equation (3), determines how many packets were correctly
[Table 10: an example completed confusion matrix with artificial values.]
[Table 11: the confusion matrix of Table 10 with the true positive, false positive,
false negative, and true negative cells highlighted for the FTP class.]
identified compared to the ground truth (TP+FN). Depending on the scenario, one or
the other may be of greater importance. For instance, in a cancer screening test, recall is
likely of greater significance because false positives are less detrimental to someone’s health
and treatment plan than false negatives. In the packet classification domain, the choice of
whether recall or precision is more important depends on the application. The F1 score,
Equation (4), is the harmonic mean of precision and recall. It weighs precision and recall as
equally important without knowing the primary objective of the classification task, and is
a frequently used metric in classification research. Table 13 lists the values of each metric
for the examples in Tables 11 and 12.
precision = TP / (TP + FP)                                      (2)

recall = TP / (TP + FN)                                         (3)

F1 score = 2 * (precision * recall) / (precision + recall)      (4)
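Equations (2) to (4) can be applied directly to one class's counts; the counts below are illustrative only, not taken from the Report's tables:

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (2)-(4) applied to a single class's counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts for one class.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Here precision is 0.9 and recall 0.75; the harmonic mean pulls the F1 score toward the weaker of the two.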
[Table 12: the confusion matrix of Table 10 with the true positive, false positive,
false negative, and true negative cells highlighted for the XMPP class.]
[Table 13: precision, recall, and F1 for each class in the examples of Tables 11
and 12; Average F1 = 0.783, Accuracy = 0.897.]
Confusion matrices are excellent tools for viewing the detailed performance of a single
trained model, but are less useful for comparing several models. A single global
metric that coalesces the information from the confusion matrix, such as the F1 score, is
better suited to this purpose. Table 13 highlights the difficulty in using the total accuracy
metric for cases where a dataset contains imbalanced classes. The total accuracy is computed
as the number of true positives and true negatives divided by the total sample space
(i.e., (TP + TN) / (TP + TN + FP + FN)). The accuracy computed over all classes is high
due to the influence of the majority class VoIP. FTP and XMPP are in the minority and
their individual performances are not evident when using total accuracy. Averaging class F1
scores as an overall metric for the model accounts for deficiencies in minority class learning.
For these reasons, individual class and averaged F1 scores are used for the remainder of this
section as the baseline performance metric.
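The imbalance effect can be demonstrated on a hypothetical two-class confusion matrix in which a large majority class dominates a poorly learned minority class:

```python
import numpy as np

def class_f1(cm, k):
    """Per-class F1 from a confusion matrix (rows predicted, cols actual)."""
    tp = cm[k, k]
    fp = cm[k].sum() - tp        # predicted k, actually another class
    fn = cm[:, k].sum() - tp     # actually k, predicted another class
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Hypothetical matrix: 100 majority-class samples, 10 minority-class samples.
cm = np.array([[95, 4],
               [5, 6]])
accuracy = np.trace(cm) / cm.sum()
avg_f1 = np.mean([class_f1(cm, k) for k in range(2)])
```

Accuracy comes out near 0.92 while the average F1 is only about 0.76, because the minority class's weak F1 (about 0.57) is hidden by the majority class in the accuracy figure but weighted equally in the average.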
Figure 9 plots the performance of each tested window size with various strides, where the
batch size is 64, the RNN unit is GRU, and the RNN hidden layer model is 64-32. Note
that a 1-packet window essentially turns the RNN into an MLP. No temporal data is used
and the model learns solely from the lone input feature vector.
Figure 9: Performance of various window sizes and strides using GRU as the RNN unit,
64-32 as the RNN hidden layer model, and 64 as the batch size.
Increasing the window size and in turn the temporal information available to the classifier
enhances performance. For each window size, we tested values of “stride” up to the maximum
allowable value, where the maximum could not exceed the window size. In general, we
observe that enlarging the stride reduces the average F1 score. There are two factors that
could contribute to this observation. As the stride is increased, the number of windows
available to train with proportionately decreases. Changing the stride from 1 to 5 for
instance will reduce the number of training samples to 20% of what was possible with
the stride set to 1. Decreasing the size of the training set hampers the model’s robustness.
The other possible contributing factor to the performance dip is that (especially for a large
stride equal to the window size) some traffic flows may only be passed to the network as
edge cases. Figures 10 and 11 highlight how a large and small stride affects edge cases
respectively, where the batch size is 64, the RNN unit is GRU, and the RNN hidden layer
model is 64-32.
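The effect of stride on the number of training windows can be sketched as follows; the trace length here is hypothetical:

```python
def windows(packets, window_size, stride):
    """All full windows of `window_size` consecutive packets,
    taken every `stride` packets."""
    return [packets[i:i + window_size]
            for i in range(0, len(packets) - window_size + 1, stride)]

trace = list(range(1000))            # a hypothetical 1000-packet capture
n1 = len(windows(trace, 10, 1))      # stride 1: maximum overlap
n5 = len(windows(trace, 10, 5))      # stride 5: far fewer windows
```

With a stride of 5 the window count drops to roughly 20% of the stride-1 count, matching the proportional reduction described above.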
A pair of XMPP packets are split between the two windows formed in Figure 10. Most
XMPP flows are two packets long consisting of a short message segment followed by an
[Figure 10: a large stride splitting a two-packet XMPP flow across window 1 and
window 2, leaving neither window with the flow intact.]
acknowledgement packet. Without having a window with the flow intact, the RNN is unable
to observe this information, reducing its ability to correctly learn the behaviour of XMPP.
In Figure 11, however, two of three windows that contain part of the XMPP flow have
it in its entirety. There is more context for the RNN to pull from in this scenario. The
overlapping of windows caused by using a smaller stride enhances the network’s ability to
learn. Larger window sizes compound this effect by increasing the size of flows that can fit
within a window. Figure 12 compares the performance of window sizes with the stride set
to 1 which, for both of the reasons mentioned above, is the best stride for all window sizes.
The largest window sizes, as intuition would expect, resulted in the highest F1 scores. In
addition to the increased temporal information of a larger window, with a stride of one, each
packet (excluding the first and last window size packets) is passed to the network “window
size” times. So for the 200-packet window, the network will see the same packet 200 times,
which adds to training robustness. For all window sizes the network struggled most with
identifying HTTP. FTP and HTTP share similar traffic behaviour since both clients request
files from a server. The FTP files are larger and get split into more packets at the frame
level. HTTP objects still need to be split at the frame level but make up smaller flows.
Therefore both traffic types make large sequences of 1514-byte (the Maximum Transmission
Unit [MTU]) data packets and respective acknowledgements, eventually terminating with a
single packet of less than or equal to 1514 bytes. The primary distinction between the two
is that HTTP objects are typically smaller and grouped into several objects at a time. This
information will often elude the neural network because the flows can be significantly longer
than the largest window size tested. Since FTP files are larger, there are about 5 times as
many of them in the test set. It appears that as the dominant traffic type between the
two, the RNN leans toward classifying uncertain packets as FTP, possibly resulting in
the reduced HTTP performance. It is encouraging that despite their near identical traffic
patterns, the network can still distinguish between the two with some degree of fidelity.
[Figure 11: a stride of 1 producing overlapping windows, two of which contain the
two-packet XMPP flow in its entirety.]
Of note is the training time for each window and stride configuration. While larger windows
are superior performance-wise, they come with the drawback of being slower to train. In
a tactical environment, if the network needs to be retrained with data recently collected,
training time is of particular importance. Figure 13 plots the “Average F1” score of each
tested configuration against its training time. Each point is annotated with its window size
and stride in that order.
The points closest to the upper left corner of the plot are the most efficient configurations.
The window size of 200 with strides of 25 and 50 achieves reasonably good performance
quickly. Strides of one increase performance but at the cost of training time. The best
configuration is dependent on the test environment; the trade-off between classification
performance and training speed is unavoidable.
Figure 12: Performance of various window sizes with the stride set to 1, the batch size
set to 64, the RNN unit set to GRU, and the RNN hidden layer model set to 64-32.
The output prediction of the softmax layer can correspond to any of the timesteps,
depending upon how the model is trained. In the preprocessing phase, the hyperparameter
prediction position—unique to the many-to-one model—selects which packet in the window
will determine the label for the whole window. Setting it to 0, for example, trains the
many-to-one model to label the whole window with the class of the window’s first packet,
since this model assumes the input window consists of only one packet type. Figure 15
plots the performance of the many-to-one setup
with various prediction positions along with the standard many-to-many model, where the
window size is 10, the batch size is 64, the stride number is 1, the RNN unit is GRU, and
the RNN hidden layer model is 64-32.
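The prediction-position mechanic can be sketched as follows; the window contents here are hypothetical labels chosen to show how position 0 can mislabel a mixed window:

```python
def window_label(packet_labels, prediction_position):
    """Label an entire window with the class of the packet at
    `prediction_position` (the many-to-one assumption that a
    window holds a single traffic type)."""
    return packet_labels[prediction_position]

# A hypothetical 10-packet window that straddles an XMPP flow's tail.
win = ["XMPP", "XMPP", "FTP", "FTP", "FTP", "FTP", "FTP", "FTP", "FTP", "FTP"]
edge_label = window_label(win, 0)      # labels the window XMPP
inner_label = window_label(win, 5)     # labels the window FTP
```

When the window is not homogeneous, the chosen position decides the entire window's training label, which is why positions near the window border are more error-prone.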
The average F1 scores are slightly better when predicting packets away from the edge of the
window. However, the distinction amongst prediction positions is not pronounced, except
for the case of prediction position 0 for XMPP. Consider a scenario where the last XMPP
packet of a flow is the first packet of a window. The only information the network sees at
the first packet position in the window is a packet that looks like an acknowledgement of
a TCP segment, possibly leading to the network misclassifying the packet as being part
of an FTP or HTTP flow that follows. Scenarios such as this are common at window
Figure 13: Training times of all tested window size and stride configurations with the
batch size set to 64, the RNN unit set to GRU, and the RNN hidden layer model set to 64-32.
borders suggesting that prediction positions at least a few packets in from the window
border are preferable. The performance of the many-to-one configurations was comparable
to that of the many-to-many configuration, with prediction position 2 surpassing the default model.
However, by employing an aggregation strategy on the test set as described in Section 5.5,
the many-to-many model remains superior.
Figure 16: Results from training model with single features plotted
alongside default dual features.
The interarrival times between packets seem to have less influence on the outcome than
the packet length. With interarrival times omitted, the network learns nearly as well as if
they were included. When length is excluded, however, performance significantly degrades,
especially for the minority classes XMPP and Other. One possible explanation for the
difficulty of learning from interarrival times alone is their wide range:
values span microseconds to seconds, while packet lengths fall between
tens and thousands of bytes. Furthermore, values for both ranges do not lie uniformly
between the range’s maximum and minimum. Long sequences of packets sent back-to-back
are followed by long gaps. The distinction with packet lengths is that the few values that
lie in the middle of the range are more informative. A sequence of packets consisting of
two traffic types is not discernible using interarrival time, since both are being sent as
fast as possible. With packet lengths, however, non-MTU-length packets can indicate a
transition from one traffic type to another. The model itself may also not be configured
well for learning from interarrival times. Performance has been optimized for length and
hyperparameters could be tuned using just interarrival times to determine if there is a more
suitable configuration.
A final note is that while interarrival time seems to provide only minor additional benefit
to the learner, this is not to say that timing in general is not important. As evidenced by
the improvement when using longer windows, valuable learning comes from packet lengths
and their relative sequence with respect to each other. Looking at a single packet length in
isolation produces poor outcomes, but looking at a sequence of packet lengths over a long
window improves outcomes significantly.
Average performance along the window positions loosely shows a peak for positions near
the centre with the exception of the final position. Unlike the many-to-one configuration,
with the many-to-many model predictions are made without information gathered from
traversing the entire window in both directions. Instead, activations from the final forward
and backward layers at the current position are concatenated and then passed to the
softmax. In the many-to-one model, since only one prediction was made for the entire
window, the final concatenation occurred after both the forwards and backwards passes of
the RNN had been fully traversed. For some traffic types this gives an advantage to the inner
window positions. XMPP for example, as mentioned in Section 5.1, may be challenging to
predict when the flow is not padded with other traffic types within the boundary of the
window. Both XMPP and Other classes are best predicted by the middle-most window
positions with the exception of the final position. The last position’s performance may
break this trend because, if a short XMPP or Other flow—which are also usually limited
to a couple of packets—straddles a window border, the last position of the window is more
likely to contain the data packets of the flow while the first position of the next window
will contain the acknowledgements which are less informative.
FTP and HTTP performance improves at later positions in the window, indicating that the
full forward traversal of the final layer’s activations is important in distinguishing these
types. A flow of MTU-sized packets terminating with a non-MTU
data packet may only be informative in the forward direction. In reverse, the important
non-MTU data packet cannot be associated with a traffic type without knowing the length
of the reverse flow. VoIP traffic is easily identified regardless of window position because it
is the majority traffic type and has a distinctly uniform pattern.
Up until this point, all classification decisions have been treated as unique, with each one
updating the confusion matrix. While useful for calculating performance metrics, in practice
a single prediction would be required for each packet. It would be ambiguous to have, for
example, two FTP and two HTTP predictions for a single packet. One solution is to set the
stride of the test set equal to the window size regardless of the value used for the training
and validation sets. This method is fastest but reduces performance since the packet is only
seen in one position (which may not be optimal for its traffic type). If optimal
performance is desired, the repeated classifications can be aggregated into a single decision.
One method of combining these is to take the majority vote of the individual decisions.
But this does not take into account the confidence of each prediction. Summing the class
probabilities directly from the softmax and then taking the maximum of the sum remedies
this. Table 14 is an example scenario of a single packet prediction for demonstration purposes
that exemplifies the benefit of using the aggregate method.
Table 14: Probability aggregation benefit analysis.

Traffic   Probability 1   Probability 2   Probability 3   Aggregate
FTP       0.35            0.80            0.37            1.52
HTTP      0.40            0.10            0.43            0.93
VoIP      0.05            0.10            0.05            0.20
XMPP      0.10            0.03            0.06            0.19
Other     0.10            0.07            0.03            0.20
The packet’s label is FTP, and in this scenario it has been predicted three times, each
time in a different window position. The orange cells are the maximum values in each
probability column. The default model behaviour is to treat each probability individually.
The packet would be falsely classified as HTTP twice and correctly as FTP once. Taking the
majority prediction from individual probabilities would also wrongly classify the packet as
HTTP. Using the aggregate method, however, the low confidence in the HTTP predictions
is accounted for and the overall dominant probability of FTP becomes apparent. Although
this exact scenario is artificial, FTP and HTTP are often confused by the RNN due to their
shared client-server behaviour and similar instances occur in the test set. Figure 18 plots
the test performance of using the default method, i.e. treating repeated packets as unique,
and with the aggregate method on a 10 packet window that was shifted with a stride of 1.
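Using the Table 14 probabilities, the two combining strategies can be compared directly:

```python
classes = ["FTP", "HTTP", "VoIP", "XMPP", "Other"]
# Softmax outputs for the same packet seen in three window positions
# (the Table 14 example; the packet's true label is FTP).
probs = [
    [0.35, 0.40, 0.05, 0.10, 0.10],
    [0.80, 0.10, 0.10, 0.03, 0.07],
    [0.37, 0.43, 0.05, 0.06, 0.03],
]

# Majority vote over the individual argmax decisions.
votes = [classes[max(range(5), key=p.__getitem__)] for p in probs]
majority = max(set(votes), key=votes.count)

# Aggregate: sum class probabilities, then take the maximum of the sums.
sums = [sum(p[k] for p in probs) for k in range(5)]
aggregate = classes[max(range(5), key=sums.__getitem__)]
```

The majority vote yields HTTP (two low-confidence wins), while the aggregate yields FTP (sum 1.52), illustrating why summing probabilities is preferred.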
The aggregate method improves the performance across all traffic types. The improvement
is minimal with VoIP because, as the majority class, its performance is already near perfect
without the assistance of aggregation. When time and computational resources are available,
the aggregate method should be used to enhance the model’s performance.
Figure 18: Plot of F1 scores for aggregate and default test methods.
6 Conclusion
This Report demonstrated the feasibility of using recurrent neural networks (RNNs) to
perform packet traffic classification by examining only the packet length and timing of
observed packets. RNNs were selected as a candidate for traffic classification due to their
promising performance in other learning tasks in which the sequential order of data must
be retained to influence the classification outcome. A dataset consisting of FTP, HTTP,
VoIP, and XMPP traffic was collected and used to train an RNN, optimize the model’s
hyperparameters, and then investigate its performance.
While specific numerical results are likely unique to the particular use case investigated,
certain results are more generalizable, specifically:
• The performance of the classifier increases when the RNN examines larger window
sizes. This comes at a cost of longer training times, and the performance increase
shows a pattern of diminishing returns as window size increases beyond a certain
point;
• Classifier performance increases for smaller stride lengths, with the best performance
achieved with a stride length of 1. Training time increases for lower stride lengths;
however, if the training time is acceptable for a stride of 1, then this is the
recommended choice;
• Interarrival time measurements have a lesser impact on classifier performance than
packet length;
• Classifier performance improvements can be obtained by combining the results
of packet classification decisions obtained by an RNN using the many-to-many
configuration. The combined-decision many-to-many configuration is recommended
over the many-to-one configuration; and
• The RNN classification performance on a given packet is influenced by the packet’s
position in the examined window; packets near the centre of the window tend to be
classified more accurately.
While this Report demonstrated the feasibility of RNNs for a specific use case, more research
is required to gain further insight into the ultimate utility of this strategy. Research can
proceed in several ways, but testing the technique on other datasets should be prioritized.
The datasets generated by DTG in this Report were based on literature best practices,
but data from live networks of interest should also be explored. In addition, while FTP,
HTTP, VoIP, and XMPP are standard Internet traffic types, they may be of less interest
in other domains. For instance, if traffic analysis/classification were desired to be used
for tactical network applications, or smart buildings, or industrial control systems, all of
these would have different application data that would need to be modelled and evaluated.
This research has shown a promising way forward and Defence Research and Development
Canada (DRDC) expects to continue activity in this area.
References
[1] NSA (1987), The Origination and Evolution of Radio Traffic Analysis: The World
War I Era, Declassified Cryptologic Quarterly Articles.
[2] NSA (1989), The Origination and Evolution of Radio Traffic Analysis: World War II,
Declassified Cryptologic Quarterly Articles.
[3] Qadeer, M. A., Iqbal, A., Zahid, M., and Siddiqui, M. R. (2010), Network Traffic
Analysis and Intrusion Detection Using Packet Sniffer, In 2010 Second International
Conference on Communication Software and Networks.
[4] Butun, I., Morgera, S. D., and Sankar, R. (2014), A Survey of Intrusion Detection
Systems in Wireless Sensor Networks, IEEE Communications Surveys and Tutorials.
[6] Lu, Z., Wang, C., and Wei, M. (2015), On Detection and Concealment of Critical
Roles in Tactical Wireless Networks, In MILCOM 2015 - 2015 IEEE Military
Communications Conference.
[7] Liu, Y., Bild, D. R., Dick, R. P., Mao, Z. M., and Wallach, D. S. (2015), The Mason
Test: A Defense Against Sybil Attacks in Wireless Networks Without Trusted
Authorities, IEEE Transactions on Mobile Computing.
[8] Nguyen, T. T. and Armitage, G. (2008), A Survey of Techniques for Internet Traffic
Classification using Machine Learning, IEEE Communications Surveys and Tutorials.
[9] Koc, L., Mazzuchi, T. A., and Sarkani, S. (2012), A Network Intrusion Detection
System Based on a hidden Naïve Bayes Multiclass Classifier, Expert Systems with
Applications.
[10] Sarker, I. H., Abushark, Y. B., Alsolami, F., and Khan, A. I. (2020), IntruDTree: A
Machine Learning Based Cyber Security Intrusion Detection Model, Symmetry.
[11] Raikar, M. M., S M, M., Mulla, M. M., Shetti, N. S., and Karanandi, M. (2020),
Data Traffic Classification in Software Defined Networks (SDN) using
Supervised-Learning, Procedia Computer Science.
[12] Vishwakarma, S., Sharma, V., and Tiwari, A. (2017), An Intrusion Detection System
using KNN-ACO Algorithm, International Journal of Computer Applications.
[13] Guo, Z., Shi, D., Quevedo, D. E., and Shi, L. (2019), Secure State Estimation
Against Integrity Attacks: A Gaussian Mixture Model Approach, IEEE Transactions
on Signal Processing.
[27] Sola, J. and Sevilla, J. (1997), Importance of Input Data Normalization for the
Application of Neural Networks to Complex Industrial Problems, IEEE Transactions
on Nuclear Science.
[29] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2017),
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,
In ICLR.
List of Abbreviations/Acronyms/Initialisms/Symbols
AI artificial intelligence
ARP address resolution protocol
BPTT backpropagation through time
CAF Canadian Armed Forces
CNN convolutional neural networks
DoS denial-of-service
DRDC Defence Research and Development Canada
DTG DRDC Traffic Generator
ESP early stopping patience
FN false negatives
FP false positives
FTP file transfer protocol
GRU gated recurrent units
HTTP hypertext transfer protocol
IoT Internet of Things
IP Internet Protocol
k-NN k-nearest neighbours
LSTM long short-term memory
MAC media access control
MDNS multicast domain name system
MLP multilayer perceptrons
MTU maximum transmission unit
NN neural networks
NTP network time protocol
RNN recurrent neural network
R2L remote to local
SDN software defined network
SSDP simple service discovery protocol
SVM support vector machine
TN true negatives
TP true positives
U2R user to root
XMPP extensible messaging and presence protocol
1. ORIGINATOR (The name and address of the organization preparing the document. A DRDC Centre sponsoring a contractor’s report, or a tasking agency, is entered in Section 8.)

2a. SECURITY MARKING (Overall security marking of the document, including supplemental markings if applicable.)
NON-CONTROLLED GOODS
DMC A

3. TITLE (The document title and subtitle as indicated on the title page.)

4. AUTHORS (Last name, followed by initials – ranks, titles, etc. not to be used. Use semi-colon as delimiter.)

5. DATE OF PUBLICATION (Month and year of publication of document.)
March 2022

6a. NO. OF PAGES (Total pages, including Annexes, excluding DCD, covering and verso pages.)
45

6b. NO. OF REFS (Total cited in document.)
29

Scientific Report

8. SPONSORING CENTRE (The name and address of the department project or laboratory sponsoring the research and development.)

9a. PROJECT OR GRANT NO. (If appropriate, the applicable research and development project or grant number under which the document was written. Please specify whether project or grant.)

9b. CONTRACT NO. (If appropriate, the applicable contract number under which the document was written.)

10a. DRDC PUBLICATION NUMBER
DRDC-RDDC-2022-R052

10b. OTHER DOCUMENT NO(s). (Any other numbers which may be assigned to this document either by the originator or by the sponsor.)

11a. FUTURE DISTRIBUTION WITHIN CANADA (Approval for further dissemination of the document. Security classification must also be considered.)
Public release

11b. FUTURE DISTRIBUTION OUTSIDE CANADA (Approval for further dissemination of the document. Security classification must also be considered.)
Public release
13. ABSTRACT/RÉSUMÉ (When available in the document, the French version of the abstract must be included here.)
This Scientific Report investigates the efficacy of machine learning techniques on a passive
eavesdropper’s ability to classify encrypted traffic in wireless networks, where the eavesdropper
can observe only the timing, sequence, and duration of packets. A custom-built traffic
generation tool was used to generate encrypted traffic datasets consisting of
four application types: file transfer protocol (FTP), hypertext transfer protocol (HTTP), voice
over Internet protocol (VoIP), and extensible messaging and presence protocol (XMPP). The
eavesdropper gathered encrypted packets and used the recurrent neural network (RNN)
machine learning technique to classify the applications of packets in the dataset on a
packet-by-packet basis. In evaluating the efficacy of RNNs for traffic classification, the effect
of a number of parameters on accuracy and training time was examined, including: the window
size of data examined by the RNN; the effect of partial knowledge; different configurations of
RNNs such as many-to-many or many-to-one; and how best to combine the outputs of an RNN
to make a classification prediction.
Le présent rapport scientifique examine l’efficacité des techniques d’apprentissage machine sur
la capacité d’un intercepteur passif à classer le trafic chiffré sur des réseaux sans fil en observant
uniquement le moment, la séquence et la durée de transmission des paquets. Pour ce faire, on
a utilisé un outil générateur de trafic personnalisé afin de créer des ensembles de données
chiffrés transmis au moyen de quatre types de protocoles, soit le protocole de transfert de
fichiers (FTP), le protocole de transfert hypertexte (HTTP), la voix sur protocole Internet (VoIP)
et le protocole XMPP (Extensible Messaging and Presence Protocol). L’intercepteur a capté
des paquets chiffrés et s’est servi d’un réseau de neurones récurrents (RNR) – une technique
d’apprentissage machine – pour classer un à un les paquets de l’ensemble de données selon
leur type d’application. Dans le but d’évaluer l’efficacité des RNR quant au classement du
trafic, on a étudié l’incidence de divers paramètres sur l’exactitude et le temps d’apprentissage,
notamment l’étendue des données analysées par les RNR, l’effet de connaissances partielles,
les différentes configurations de RNR (plusieurs-à-plusieurs ou plusieurs-à-un) et la meilleure
façon de combiner les extrants d’un RNR pour prédire un classement.