Application Traffic Classification Using Neural Networks
CAN UNCLASSIFIED
Disclaimer: This publication was prepared by Defence Research and Development Canada, an agency of the Department of
National Defence. The information contained in this publication has been derived and determined through best practice and
adherence to the highest standards of responsible conduct of scientific research. This information is intended for the use of the
Department of National Defence, the Canadian Armed Forces (“Canada”) and Public Safety partners and, as permitted, may be
shared with academia, industry, Canada’s allies, and the public (“Third Parties”). Any use by, or any reliance on or decisions
made based on this publication by Third Parties, are done at their own risk and responsibility. Canada does not assume any
liability for any damages or losses which may arise from any use of, or reliance on, the publication.
Endorsement statement: This publication has been peer-reviewed and published by the Editorial Office of Defence Research and
Development Canada, an agency of the Department of National Defence of Canada. Inquiries can be sent to:
[email protected].
© Her Majesty the Queen in Right of Canada, Department of National Defence, 2022
© Sa Majesté la Reine du chef du Canada, ministère de la Défense nationale, 2022
Abstract
This Scientific Report investigates the efficacy of machine learning techniques on a
passive eavesdropper’s ability to classify encrypted traffic in wireless networks, where the
eavesdropper can observe only the timing, sequence, and duration of packets. Datasets
were generated using a custom-built traffic generation tool to generate encrypted traffic
sets consisting of four application types: file transfer protocol (FTP), hypertext transfer
protocol (HTTP), voice over Internet protocol (VoIP), and extensible messaging and
presence protocol (XMPP). The eavesdropper gathered encrypted packets and used the
recurrent neural network (RNN) machine learning technique to classify the applications of
packets in the dataset on a packet-by-packet basis. In evaluating the efficacy of RNNs for
traffic classification, the effect of a number of parameters on accuracy and training time
was examined, including: the window size of data examined by the RNN; the effect of
partial knowledge; different configurations of RNNs such as many-to-many or many-to-one;
and how best to combine the outputs of an RNN to make a classification prediction.
This work is important to the defence and security community, as it studies how much a
passive eavesdropper can learn about network activities by using deep learning techniques
without breaking network encryption.
DRDC-RDDC-2022-R052 i
Résumé
This Scientific Report examines the efficacy of machine learning techniques on a passive
eavesdropper's ability to classify encrypted traffic in wireless networks by observing only
the timing, sequence, and duration of packets. To that end, a custom-built traffic generation
tool was used to create encrypted datasets comprising four application types: file transfer
protocol (FTP), hypertext transfer protocol (HTTP), voice over Internet protocol (VoIP),
and extensible messaging and presence protocol (XMPP). The eavesdropper captured
encrypted packets and used a recurrent neural network (RNN), a machine learning
technique, to classify the packets in the dataset one by one according to application type.
To evaluate the efficacy of RNNs for traffic classification, the effect of various parameters
on accuracy and training time was studied, including the window size of the data analyzed
by the RNN, the effect of partial knowledge, different RNN configurations (many-to-many
or many-to-one), and how best to combine the outputs of an RNN to make a classification
prediction.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Résumé . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
List of Abbreviations/Acronyms/Initialisms/Symbols . . . . . . . . . . . . . . . . . . 38
List of Figures
Figure 1: Rolled RNN diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Figure 7: F1 score learning curve for various early stopping patiences using GRU
as the RNN unit, 64-32 as the RNN hidden layer model, 10 as the
window size, 64 as the batch size, and 1 as the stride number. . . . . . . 16
Figure 9: Performance of various window sizes and strides using GRU as the RNN
unit, 64-32 as the RNN hidden layer model, and 64 as the batch size. . . 22
Figure 12: Performance of various window sizes with the stride set to 1, the batch
size set to 64, the RNN unit set to GRU, the RNN hidden layer model
set to 64-32. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Figure 13: Training times of all tested window size and stride configurations with
the batch size set to 64, the RNN unit set to GRU, the RNN hidden
layer model set to 64-32. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Figure 16: Results from training model with single features plotted alongside
default dual features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Figure 18: Plot of F1 scores for aggregate and default test methods. . . . . . . . . . 33
List of Tables
Table 1: Train set traffic composition. . . . . . . . . . . . . . . . . . . . . . . . . 9
Table 4: Training time and F1 score under various layer and node combinations
using GRU as the RNN unit, 10 as the window size, 64 as the batch
size, and 1 as the stride number. . . . . . . . . . . . . . . . . . . . . . . 13
Table 5: Training time and F1 score of different RNN unit types under 10 and
200 window sizes using 64-32 as the RNN hidden layer model, 64 as the
batch size, and 1 as the stride number. . . . . . . . . . . . . . . . . . . . 14
Table 6: Training time and F1 score of various batch sizes under 10 and 200
window sizes using 64-32 as the RNN hidden layer model, GRU as the
RNN unit, and 1 as the stride number. . . . . . . . . . . . . . . . . . . . 15
1 Introduction
Traffic analysis encompasses a broad variety of techniques to infer information about
encrypted communication activity by analyzing patterns and characteristics of transmitted
signals, including observable elements such as transmission rate, frequency, timing, and
apparent source/destination. Unlike techniques focused on breaking the encryption of
transmission, traffic analysis generally does not seek to expose the contents of the message
but instead concentrates on deducing such things as the topology of a communication
network, the roles of various users or nodes, and the type of communication taking place.
The ultimate limits of performance of traffic analysis techniques are of interest to network
defenders, who must understand the level to which adversaries can gather useful data
through passive eavesdropping. The traffic analysis threat is especially germane to wireless
networks, where eavesdropping can be performed by a remote receiver without the need
for network taps or invasive physical connections. Understanding these threats is relevant
to securing the Canadian Armed Forces' (CAF) use of radios for tactical operations, use
of wireless communications in fixed infrastructure settings, and the potential future use
of wireless Internet of Things (IoT) sensors in infrastructure and real-property settings.
This Scientific Report focuses on traffic classification—a subset of traffic analysis—where
the goal is to determine the type of application layer traffic being transmitted over a
network. Specifically of interest is the efficacy of machine learning techniques to classify
traffic between two nodes in a network on a packet-by-packet basis. Note that the traffic
data used in this Report are collected from the network layer, and only network-layer metadata
such as packet length and packet arrival time are used for performance evaluation.
Traffic classification is itself an old problem, and is performed for a variety of reasons
under different circumstances. Most commonly, traffic classification is performed by
telecommunications and network operators to classify flows internal to their network to
better manage quality of service and prioritize resources depending on application type and
service requirements. In the case of a service provider examining data packets in its own
network, some data may be available in the clear with no lower-layer encryption, including
elements such as packet headers, source/destination Internet Protocol (IP) addresses, and
ports. While service providers examine traffic traversing their own networks, an adversary
tries to perform traffic classification in a network they do not control. In these cases, traffic is
likely encrypted, although quite often certain headers remain visible (e.g., depending upon
the level the encryption is applied, media access control (MAC) or even IP headers may
be visible). Many studies of traffic classification assume that traffic flows can be uniquely
identified by examining metadata, which greatly simplifies the problem. In most military
wireless applications, however, it is reasonable to expect a network to employ link-layer
encryption thus obfuscating all header information. The assumption made in this Report
is that an adversary has access to no header information at all, and can observe only the
timing, order, and duration of packets. Of interest is determining how much an adversary can
infer about applications in use in a network when only this minimum amount of information
is available. Put another way: even if the defender takes precautions to ensure all data and
signalling is encrypted, how much can a passive eavesdropper learn about network activities
without actually compromising the network?
To study this problem, an in-depth analysis was undertaken using recurrent neural networks
(RNN) to classify traffic on a packet-by-packet basis according to application type.
RNNs are a type of neural network in which previous outputs of the neural network can be
used as inputs to subsequent states; effectively this structure means that RNNs can retain
memory of learning generated from recent inputs and incorporate this into the classification
model. The “internal memory” of RNNs makes them a popular machine learning algorithm
for sequence data (including things like identifying speech, audio, and time series). They are
thus a promising candidate for application traffic classification, where timing and ordering
of packets from different applications may each have distinct signatures dictated by their
respective protocols. In evaluating RNNs for traffic classification, the effect of a number
of parameters on accuracy and training time is examined, including: the window size of
data examined by the RNN (i.e., the amount of history examined); the effect of partial
knowledge (i.e., only having packet length, or only having packet timing but not both);
different configurations of RNNs such as many-to-many or many-to-one; and how best to
combine the outputs of an RNN to make a classification prediction.
The remainder of this Report is organized as follows. In Section 2 a brief literature review
of background research and related work in the field of traffic analysis and classification
is presented. Section 3 provides an overview of RNNs, explaining their structure and
parameters of interest. The methodology for the experimental work of this paper is described
in Section 4; this includes a discussion of how network traffic was generated for datasets,
the training and testing process, and the experiments used to evaluate traffic classification
accuracy. Results and discussion are presented in Section 5, with a brief conclusion in
Section 6 summarizing the work.
2 Related Works
This section details various research efforts in the domain of traffic analysis and classification
specifically pertaining to machine learning methods. This Report aims to supplement
previous work by applying similar techniques at the frame level and with minimal input
features, which to the best of our knowledge has yet to be explored by the research
community.
Coupled with artificial intelligence (AI) technologies, current traffic analysis is used to
support both offence and defence in the cyber security domain. For instance, traffic
analysis has been widely used in automated intrusion detection systems, reconstruction of
adversarial tactical networks, target selection, and smart jamming [3–7]. With adversaries
also employing advanced network security protection, it becomes more challenging to extract
packet contents by directly inspecting intercepted traffic. The importance of traffic analysis
and classification grows and quickly evolves as adversarial capabilities continually advance
and prior analysis methods become obsolete.
Research has been undertaken to apply supervised learning to traffic classification and
attack detection. Levent Koc et al. proposed a network intrusion detection system based on
a hidden naive Bayes classifier and applied it to the KDD’99 dataset for detecting probe,
denial-of-service (DoS), user to root (U2R), and remote to local (R2L) attacks [9]. Sarker
et al. presented an intrusion detection tree for malicious behaviour and anomaly detection
[10]. Raiker et al. applied support vector machine (SVM), nearest centroid, and Gaussian
naive Bayes learning techniques to the traffic from a software defined network (SDN) [11].
Vishwakarma et al. presented a k-nearest neighbours (k-NN) for intrusion detection and
similarly to Levent Koc et al. tested it on the KDD’99 dataset for probe, DoS, U2R, and
R2L attack detection [12].
Unsupervised learning discovers patterns and structures and formulates models from
unlabelled datasets. This allows for the detection of unknown traffic types which is
useful in anomaly detection. Clustering-based and association rule-based techniques are
often applied to the classification and attack detection field. Clustering-based algorithms
include distribution-based (e.g., Gaussian mixture model), density-based (e.g., DBSCAN),
centroid-based (e.g., K-means), and hierarchical/connectivity-based [13–15]. Frequent
pattern growth and apriori algorithms fall under the association rule-based category [16].
Several methods were explored in the unsupervised domain as well. Zhang et al. presented an
intrusion detection system for IoT traffic based on a deep belief network [19]. A generative
adversarial deep neural network was implemented by Erpek et al. to launch and mitigate
wireless jamming attacks [20]. Aleroud and Karabatis proposed a similar generative network
for phishing detection [21]. Farahnakian and Keikkon and Javaid et al. both leveraged
auto-encoder derivatives to perform intrusion detection [22, 23].
In each timestep, t, an RNN unit, A, is passed a feature vector, x<t> and the activation
from the previous timestep, a<t−1> . Both are multiplied by respective weight matrices,
offset by a bias vector and passed through an activation function. The result is the current
timestep’s activation, a<t> , which is fed back into the RNN for the next timestep along
with the following feature vector x<t+1> . In addition to being passed to the next timestep,
the activation is typically passed through a final non-recurrent layer, resulting in an output
y <t> . To illustrate the temporal nature of an RNN, Figure 2 shows an RNN in its unrolled
form.
Note that although this representation appears to consist of several layers, each layer is
in fact the same RNN unit, A, with the same parameters but at a different position
in time. The most common RNN training algorithm is Backpropagation Through Time
(BPTT). Derivatives of the error with respect to the network’s weights are calculated and
accumulated, starting from the last timestep and working, “backwards through time” until
the first timestep is reached. At this point the weights are adjusted using the accumulated
derivative to minimize the error. As with training a feed-forward MLP, this process is
repeated for every training input and often multiple times over for the entire training set
until the error stabilizes at a minimum.
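The recurrence just described can be sketched in a few lines. This is an illustrative simple (Elman) RNN cell with a tanh activation, not the report's implementation; the dimensions and random weights below are placeholders.

```python
import numpy as np

def rnn_step(x_t, a_prev, Wx, Wa, b):
    """One timestep: weight the current input and previous activation, add bias,
    and pass through the activation function to get the new activation a<t>."""
    return np.tanh(Wx @ x_t + Wa @ a_prev + b)

def rnn_unroll(xs, Wx, Wa, b):
    """Apply the SAME cell (same parameters) at every position in the sequence,
    carrying the activation forward, as in the unrolled diagram."""
    a = np.zeros(Wa.shape[0])
    activations = []
    for x_t in xs:
        a = rnn_step(x_t, a, Wx, Wa, b)
        activations.append(a)
    return activations

# Toy dimensions: 2 input features per packet, 4 hidden nodes, 10 timesteps
rng = np.random.default_rng(0)
Wx, Wa, b = rng.normal(size=(4, 2)), rng.normal(size=(4, 4)), np.zeros(4)
acts = rnn_unroll(rng.normal(size=(10, 2)), Wx, Wa, b)
```

Each element of `acts` is the activation a&lt;t&gt; that would also feed a final non-recurrent output layer to produce y&lt;t&gt;.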
The RNN configuration shown in Figure 2 is many-to-many where for each timestep, t, there
is an input feature vector, x<t> , and a corresponding output vector, y <t> . In the domain
of packet classification, each x<t> corresponds to a single packet and its features could,
for example, consist of length, interarrival time, important flags, and/or other pertinent
header or payload information. The output, y <t> , is a vector of probabilities describing the
likelihood that x<t> belongs to each of the specified traffic classes.
Other RNN configurations are also well-suited for packet-by-packet traffic classification. A
many-to-one model takes the same input features as the many-to-many model but only
predicts a single label for the entire series. The label could correspond to a prediction for a
packet at a specific position, the majority traffic type in the entire input series, or any other
determining factor chosen by the implementer. Such a model is portrayed in Figure 3.
In this example it would be natural to have the neural network output a predicted label for
the last packet. Information from all previous packets in the series is propagated through
the network to help make the prediction. Even if the goal is to classify the first packet in
the sequence, the prediction will not be made until the entire sequence has been traversed.
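The two configurations differ only in which activations are mapped to output probabilities. A sketch, assuming a plain softmax output layer and an arbitrary class count (both illustrative, not the report's exact architecture):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def outputs(activations, Wy, by, many_to_many=True):
    """Map RNN activations to class-probability vectors.

    many_to_many: one prediction per timestep (one label per packet).
    many_to_one:  a single prediction for the window, from the last activation,
                  which has seen information propagated from all earlier packets.
    """
    if many_to_many:
        return [softmax(Wy @ a + by) for a in activations]
    return softmax(Wy @ activations[-1] + by)

# Toy example: 5 timesteps of 4-dim activations, 5 traffic classes
rng = np.random.default_rng(1)
acts = [rng.normal(size=4) for _ in range(5)]
Wy, by = rng.normal(size=(5, 4)), np.zeros(5)
per_packet = outputs(acts, Wy, by, many_to_many=True)
window_label = outputs(acts, Wy, by, many_to_many=False)
```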
4 Methodology
The process of developing an RNN classifier involves obtaining a dataset, performing
preprocessing steps, then training, testing, and tuning the RNN. To develop an RNN
classifier to perform traffic analysis, data was collected using the Defence Research and
Development Canada Traffic Generator (DTG) [26]; the data collected for both testing
and training the RNN is discussed in Section 4.1. Section 4.2 discusses the preprocessing
operations that are performed on the raw dataset before it can be fed into the RNN. An
RNN architecture that suits the format of the data is examined in Section 4.3 as well as
the process of selecting hyperparameters to optimize performance.
4.1 Dataset
Gathering a representative dataset is imperative to successfully training a neural network.
The dataset is typically grouped into training, validation, and test sets. Each could be
collected separately or be a subset of a single dataset. Regardless, the stochastic nature
of the data should be consistent across all three sets so that each is representative of the
problem space.
For this research, the purpose-built DTG software tool was chosen to generate labelled
traffic, due to its ability to reliably produce datasets composed of user-controlled random
distributions of application-layer traffic. The four supported traffic types in DTG are FTP,
HTTP, VoIP, and XMPP. Each are generated according to standardized and accepted
stochastic models. See [26] for more details regarding the models and architecture of the
traffic generator.
To generate the network data for traffic classification analysis, two hosts were directly
networked together. The DTG server software was installed on one host, and the DTG
client software was installed on the other. A third host was used to monitor the link between
the DTG client and server using the Wireshark traffic sniffing application and to save the
data as a pcap (packet capture) file. Running the DTG application on the client and server
allowed for the dynamic creation of traffic files, with statistics as described below.
DTG was used to generate three separate traces for the training, validation, and test sets.
Three separate sets were generated (as opposed to sampling from a single set) in order
to ensure that all sets contained the various protocols’ start-up and shut-down packet
exchanges. Shuffling a single set before dividing it into subsets would also have ensured an
equal distribution of start and end sequences. However, in most traffic classification use
cases, data would likely be classified in real time and therefore, the original order should
be preserved in the test set to replicate this scenario. Tables 2 and 3 break down the traffic
compositions of the training, validation, and test sets respectively. The traffic type “Other”
refers to packets captured on the link (between client and server) that were not generated by
one of the main DTG traffic types (FTP, HTTP, VoIP, XMPP). Other traffic was limited, in
general, to network protocols such as simple service discovery protocol (SSDP), multicast
domain name system (MDNS), and address resolution protocol (ARP). A network time
protocol (NTP) service was run in the background to synchronize the hosts and its packets
also fall under the “Other” class.
Table 1: Train set traffic composition.
Traffic Type FTP HTTP VoIP XMPP Other Total
Num Packets 335,636 58,517 981,181 1,811 3,612 1,380,757
% of Set 24.31 4.24 71.06 0.13 0.26 100
Note that the relative percentages of the sets devoted to each traffic type are consistent
across all three sets. The DTG generates all traffic in parallel so packets corresponding
to each protocol are spread evenly within the datasets. Another notable attribute of the
datasets is the class imbalance. There are significantly more VoIP and FTP packets than
any other traffic type and significantly fewer XMPP and Other packets. This is a result
of using the default parameters of the DTG that are intended to mimic real traffic. VoIP
conversations, for example, naturally result in long streams of voice data packets. Changing
the distributions to balance the classes would undermine the integrity of the representative
nature of the datasets and was therefore avoided. It is well known that class imbalances add
to the difficulty of training a classifier. Machine learners often prioritize majority classes
since performing well on them may have the most substantial impact on minimizing the
loss function.1 This can lead to minority classes being overlooked or altogether ignored. If
necessary, a number of techniques can be used to increase the importance of the minority
classes. Such methods include increasing the weight of minority classes in the loss function
and choosing a performance metric well-suited for class imbalances. In the case of the DTG
classification problem, as can be seen in Section 5, the learners had adequate performance
without having to additionally compensate for the class imbalance.
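As a sketch of the weighting technique mentioned above (not something the report needed to apply), inverse-frequency class weights can be derived from the Table 1 counts; the specific weighting scheme here is an assumption:

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * count), so rarer classes
    contribute more to the loss function."""
    total = sum(counts.values())
    k = len(counts)
    return {cls: total / (k * n) for cls, n in counts.items()}

# Train set composition from Table 1
weights = inverse_frequency_weights({
    "FTP": 335636, "HTTP": 58517, "VoIP": 981181, "XMPP": 1811, "Other": 3612,
})
```

With these counts, XMPP (0.13% of the set) receives the largest weight and VoIP (71.06%) the smallest.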
4.2 Preprocessing
Before the datasets can be passed to an RNN they must be converted from raw traces into
groups of input features and labels. Each packet is labelled according to its destination or
1. The loss function is the function that computes the distance between the current output of the algorithm and the expected output. The loss function used in this Report is the cross-entropy computed on the softmax() output.
source port numbers, which are unique to the DTG’s supported traffic types. Any packets
with port numbers that fall outside of the DTG range are labelled as “Other.” Note that
the port number is only used to establish the ground truth and is concealed from the RNN
beyond this step. From Wireshark traces, each packet’s length in bytes and the difference
of arrival time from the previous packet, in seconds, are saved as input features. Thus,
the features for each packet consist of an input tuple of interarrival time2 and length as
well as a class label establishing the ground truth. Since interarrival times are often in the
microsecond range, normalization was performed on inputs as a means of reducing training
time [27].
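The per-packet feature extraction described above can be sketched as follows. The input format is assumed, and min-max scaling stands in for whichever normalization scheme [27] was actually used:

```python
def to_features(packets):
    """packets: list of (arrival_time_s, length_bytes) from the trace.
    Returns (interarrival_time, length) tuples; the first packet gets dt = 0."""
    feats = []
    prev_t = packets[0][0]
    for t, length in packets:
        feats.append((t - prev_t, length))
        prev_t = t
    return feats

def minmax(column):
    """Scale one feature column to [0, 1] to speed up training."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

feats = to_features([(0.0, 100), (0.001, 60), (0.003, 1500)])
```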
Feeding an entire pcap file into an RNN is not feasible due to the capture's large size.
Instead, the datasets are split into windows that are passed to the RNN sequentially.
Two parameters determine how a dataset is sliced into windows. The “window size” is
the number of packets to include in each window and the “stride” indicates by how many
packets the window should be shifted before another set of packets is taken. Figure 5 shows
a subset of a dataset (pcap), where each rectangle represents a single packet, and depicts
how the two parameters influence window selection. Note that the data is processed using
the windows and strides for training and testing of the RNN.
The result is a set of windows, each of length “window size” which will be individually fed
into the RNN. The effect of various window sizes and strides is the topic of Section 5.1.
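The window/stride slicing can be sketched directly; `window_size` and `stride` behave exactly as described above:

```python
def make_windows(samples, window_size, stride):
    """Slide a fixed-size window over the packet sequence, advancing the start
    position by `stride` packets each time."""
    return [samples[i:i + window_size]
            for i in range(0, len(samples) - window_size + 1, stride)]

# 10 packets, window size 4, stride 2 -> windows start at packets 0, 2, 4, 6
windows = make_windows(list(range(10)), window_size=4, stride=2)
```

A stride smaller than the window size, as here, means consecutive windows overlap, so each packet appears in several training windows.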
One of the more influential RNN hyperparameters is the number of layers and the number
of nodes in each layer. Figure 6 depicts the layers and nodes in a bidirectional RNN model.
Increasing the depth (adding layers) allows for the network to learn more complex problems
but at the cost of needing more computational resources, increased training time, and
the possibility of overfitting. The parameter space of layer and node combinations was
explored to find suitable values that performed well on DTG data without adding substantial
complexity for minimal performance gain. Numerous combinations of layers and nodes were
trained and tested with the DTG-generated datasets to converge on optimal values and are
shared in Table 4, where the window size is 10, the batch size is 64, the stride number is 1,
and the RNN unit is a bidirectional GRU. The well-known F1 metric (described
in further detail in Section 5) is used to evaluate the performance of the learner, with the
average F1 score taken as the averaged score for classification across all four application
types. The Nodes column contains hyphen-separated quantities of nodes for each layer
where the leftmost layer accepts input features and the rightmost layer is the final RNN
layer before the softmax layer.
The column “Trainable parameters” refers to the quantity of weights and biases throughout
the entire network, where the last entry corresponds to the output layer. The number of
trainable parameters in an RNN layer, p, is determined by Equation (1), where r is a constant
that is unique to the type of RNN unit,3 n is the number of nodes in that layer, and m
is the size of the layer’s input vector. A layer with a large number of nodes significantly
increases the complexity of the network as can be seen in the configuration of a single layer
of 128 nodes.
p = r(n² + nm + n) (1)
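Equation (1) can be computed directly. The first function below is the equation as written; the second is a hypothetical cross-check suggesting that the Table 4 counts correspond to a bidirectional layer (which doubles the total) whose GRU carries two bias vectors (2n rather than n), as some framework implementations do — that variant is an assumption, not something the report states.

```python
def rnn_layer_params(r, n, m):
    """Equation (1): p = r * (n^2 + n*m + n), for one unidirectional layer."""
    return r * (n * n + n * m + n)

def bigru_params(n, m):
    """Assumed variant: bidirectional (x2) GRU (r = 3) with a doubled bias term."""
    return 2 * 3 * (n * n + n * m + 2 * n)

# First hidden layer of the 64-32 model: 64 nodes, 2 input features (length, time)
layer1 = bigru_params(64, 2)       # matches the 26112 entry in Table 4
# Second hidden layer: 32 nodes, fed by both directions of the 64-node layer
layer2 = bigru_params(32, 2 * 64)  # matches the 31104 entry in Table 4
```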
Note that the number of epochs for each training sequence is not fixed. The network
is trained until the validation loss ceases to improve for some pre-determined number
of epochs (the early stopping patience).
3. r is the number of feed-forward neural networks in an RNN unit. For example, r is 1 for a simple RNN, 4 for an LSTM, and 3 for a GRU.
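The early-stopping rule can be sketched as follows; the patience value is whichever pre-determined number of non-improving epochs is chosen (Figure 7 explores several values):

```python
def stop_epoch(val_losses, patience):
    """Return the epoch index at which training halts: the validation loss has
    not improved on its best value for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1  # ran out of epochs without triggering
```

In practice the model weights from the best epoch, not the stopping epoch, would be kept.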
Table 4: Training time and F1 score under various layer and node
combinations using GRU as the RNN unit, 10 as the window size, 64 as
the batch size, and 1 as the stride number.
Hidden Layers    Nodes    Trainable Parameters    Training Time (min)    Average F1 Score
1 8 576-85 88 0.850
1 32 6912-325 35 0.886
1 128 101376-1285 42 0.909
2 8-8 576-1248-85 89 0.895
2 16-8 1920-2016-85 172 0.910
2 8-16 576-3264-165 65 0.897
2 32-16 6912-7872-165 140 0.928
2 32-8 6912-3552-85 126 0.924
2 64-32 26112-31104-325 115 0.938
2 64-64 26112-74496-645 78 0.932
2 128-64 101376-123648-645 124 0.916
3 8-8-8 576-1248-1248-85 132 0.903
3 8-16-8 576-3264-2016-85 203 0.914
3 32-16-8 6912-7872-2016-85 114 0.929
3 32-64-32 6912-49920-31104-325 78 0.927
4 8-8-8-8 576-1248-1248-1248-85 199 0.924
4 8-16-16-8 576-3264-4800-2016-85 231 0.926
4 16-32-32-16 1920-12672-18816-7872-165 212 0.924
4 32-64-64-32 6912-49920-74496-31104-325 219 0.924
5 8-8-8-8-8 576-1248-1248-1248-1248-85 269 0.918
“Average F1” score is the most commonly used performance metric throughout this Report.
Section 5 describes in detail why this metric was chosen, what the value represents, and
how to calculate it with Equation (4). In short, the F1 score for a classifier is a measure of
its ability to correctly identify members of a particular class. Its value falls between [0, 1]
where 1.0 is a perfect score. The “Average F1” score is an average of individual class scores,
weighting each class equally. The highest F1 score, 0.938, is achieved by the 2-layer,
64-32 configuration. Increasing the depth of the network past
three layers appears to no longer improve its capacity to learn, while also increasing the
training time. Despite having the deepest configuration, the 5 layer network has relatively
few parameters because of the low impact of a small number of nodes per layer. This
configuration takes longer to train than shallower networks because the BPTT algorithm
prevents a large portion of calculations from being performed in parallel. The shallowest
configurations with one layer do not learn as well as the multilayer configurations. The best
results are achieved by the two or three layer models with one or more layers having at
least 32 nodes and none of the layers exceeding 64 nodes. The 2-layer, 64-32 configuration
was chosen because it had the highest score and a midrange training time.
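The averaged F1 metric used throughout can be sketched in its standard macro-averaged form from per-class precision and recall (the report's exact definition is Equation (4) in Section 5):

```python
def f1(tp, fp, fn):
    """F1 for one class: harmonic mean of precision and recall,
    computed from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class_counts):
    """Average the per-class F1 scores, weighting every class equally
    regardless of how many packets each class contributes."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)
```

Equal weighting is what makes the metric robust to the class imbalance noted in Section 4.1: a classifier that ignores XMPP is penalized as heavily as one that ignores VoIP.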
The simple RNN unit as well as GRU and LSTM units were tested to determine which
is most favourable for the packet classification problem. Two scenarios were examined: a
small window size where long-range dependencies are not critical and a larger window size
where they are. Table 5 combines the results of the scenarios, where the batch size is 64, the
stride number is 1, and the RNN hidden layer model is 64-32, i.e., the first hidden layer
uses 64 nodes and the second hidden layer uses 32 nodes. The simple RNN unit was not
tested on the larger window to avoid excessive training times after it had already performed
poorly on the 10 packet window and since it is not designed for long-range dependencies.
Table 5: Training time and F1 score of different RNN unit types under
10 and 200 window sizes using 64-32 as the RNN hidden layer model,
64 as the batch size, and 1 as the stride number.

Window Size   RNN Unit     Training Time (min)   Average F1 Score
10            Simple RNN   519                   0.891
10            LSTM         96                    0.925
10            GRU          73                    0.923
200           LSTM         321                   0.970
200           GRU          339                   0.984
The two units designed for long-term dependencies, the GRU and LSTM, trained faster and
performed better than the simple RNN on the smaller window. Between the two, training
times and F1 scores were similar, with the GRU performing slightly better on the 200-sized
window. Because of this performance edge, and because LSTMs have an extra gate to
train (impacting memory requirements), the GRU was chosen for the classifier.
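For reference, a single GRU timestep can be sketched in NumPy. This is a generic textbook formulation (gate convention as in Keras), not the Report's implementation, and the weights below are random stand-ins:

```python
import numpy as np

def gru_step(x, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU timestep: an update gate z, a reset gate r, and a
    candidate state -- one fewer gate group than an LSTM, which is
    why the GRU has fewer parameters to train."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ Wz + h_prev @ Uz + bz)            # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur + br)            # reset gate
    h_cand = np.tanh(x @ Wh + (r * h_prev) @ Uh + bh)
    return z * h_prev + (1.0 - z) * h_cand            # blend old and new state

rng = np.random.default_rng(0)
n_in, n_hid = 2, 4            # two features per packet; a tiny hidden size
weights = [rng.standard_normal(shape) * 0.1
           for shape in [(n_in, n_hid), (n_hid, n_hid), (n_hid,)] * 3]
h = np.zeros(n_hid)
for x in rng.standard_normal((10, n_in)):             # a 10-packet window
    h = gru_step(x, h, *weights)
```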
The hyperparameter “Batch Size” determines how many training samples to process before
performing BPTT and updating the model’s weights. The largest possible batch size is
equal to the number of samples in the entire training set. Conversely, the smallest batch
size is just a single sample. Generally, a larger batch size will train the model faster at the
cost of the network’s ability to generalize, and results in reduced performance [29]. Table 6
compares results using various batch sizes on both small and large window sizes of 10 and
200 respectively, where the stride is 1, the RNN unit is GRU, and the RNN hidden layer
model is 64-32.
Table 6: Training time and F1 score of various batch sizes under 10
and 200 window sizes using 64-32 as the RNN hidden layer model and
1 as the stride number.

Window Size   Batch Size   Training Time (min)   Average F1 Score
10            32           163                   0.916
10            64           103                   0.916
10            128          69                    0.923
10            512          24                    0.911
10            1024         16                    0.900
200           32           177                   0.980
200           64           107                   0.975
200           128          44                    0.963
200           512          39                    0.958
200           1024         49                    0.949
“Average F1” scores do not deviate considerably across batch sizes on a small window, with
smaller to medium sized batches performing the best. On a larger window size the batch
size has a more significant impact. The spread of “Average F1” scores coincides with the
theory that larger batch sizes reduce a model’s ability to generalize. The impact of batch
size on a small window was less pronounced, so a batch size of 64 was chosen based on the
large window results. A batch size of 32 performed marginally better but took longer to
train and was discarded.
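The trade-off follows partly from how many weight updates occur per epoch: larger batches mean fewer updates (and fewer BPTT passes). A sketch, with a hypothetical training-set size used for illustration only:

```python
import math

train_samples = 50_000  # hypothetical training-set size, for illustration

for batch_size in (32, 64, 128, 512, 1024):
    # One weight update per batch; the last batch may be partial.
    updates = math.ceil(train_samples / batch_size)
    print(f"batch {batch_size:>5}: {updates:>5} weight updates per epoch")
```

The 32x increase in batch size (32 to 1024) cuts per-epoch weight updates by roughly the same factor, which mirrors the training-time column of Table 6.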
In the context of training an RNN, an epoch refers to the act of iterating through the entire
training set once. Within an epoch, unless the batch size is set to the length of the training
set, the network’s weights will be updated several times. A reasonable fixed number of
epochs can sometimes be selected by examining a learning curve or through trial and error.
For the DTG classifier, the number of epochs necessary to converge on a minimum
loss varies depending on the window size. Using the alternative, early stopping (described
below), allowed for networks trained on different window sizes to be fairly compared.
Early stopping is the process of training an RNN for an arbitrarily large number of epochs but
stopping early when a condition is met. The condition is usually that the validation loss4 or
a selected metric has not improved based on a minimum delta value5 for a specific number
4 The validation loss is the loss calculated on the validation dataset with the same RNN model and loss
function as used on the training dataset. The validation loss is calculated after each epoch, whereas the
training loss is calculated during each epoch.
5 The minimum delta value sets the threshold for a change to be considered an improvement, i.e., the
change in the monitored value must exceed the minimum delta value. In this Report, the minimum delta
value was set to 0.001, meaning only a change in loss greater than 0.001 is considered an improvement.
of epochs. The number of epochs is referred to as the patience. Early stopping is evaluated
on the validation set because the training loss6 can continue to improve even though the
validation loss has stagnated or worsened, wasting time and resources. The weights that
produced the best loss or metric before early stopping are reloaded at the end of training.
The validation loss was used as the early stopping monitor until it was determined that
the minimum loss does not correlate with the maximum average F1 score. A network was
trained evaluating both the validation and test sets every epoch to determine if a high
validation F1 score is replicated with the test set. Figure 7 is a plot of the results, where
the window size is 10, the batch size is 64, the stride number is 1, the RNN unit is a
bidirectional GRU, and the RNN hidden layer model is 64-32.
The test set’s F1 scores were consistently better than the validation set’s. This likely indicates
that the test set is more similar to the training set than the validation set is. Nevertheless, the
performance on the test set closely mimics the validation set, confirming that the validation F1
score should be used as the early stopping monitor. Several key points are also plotted on
the figure. They are the test scores that the network would have achieved if early stopping
patience (ESP) was set to the values listed in the legend. The minimum delta was set to
0.001, the smallest precision value used in this Report for F1 scores. Both ESPs of 5 and
6 The training loss is the loss calculated on the training dataset with the predefined RNN model shown
in Table 4 and the loss function, i.e., softmax() in this Report.
10 stopped after not improving past the validation F1 score achieved on epoch 13 and
their respective points overlap on Figure 7. An ESP of 5 was chosen because it would have
stopped training shortly after the steepest portion of the curve. Beyond that point, the
validation F1 score improves slightly but would require using an ESP of 25 which would
lengthen training time by roughly a factor of 4. Stopping earlier increases the uncertainty
in results because the process has less time to reach a stable state. Because the TensorFlow
random seed was not fixed, each training run produces different results; therefore, for the
results in Section 5, training was run three times per set of hyperparameters on the same
dataset and the results were averaged.
Figure 8 is a diagram of the final model using the hyperparameters from Table 7. There are
two fully connected bidirectional GRU layers. The subscripts 64 and 32 refer to how many
nodes are in the units for layers 1 and 2 respectively. The window size parameter discussed
in Section 4.2 determines how long of a sequence the network accepts and is represented by
the subscript ws of the last x input and y output. The operation depicted by joining the lines
of the activations of the first layer’s forward and backwards propagations is concatenation.
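The concatenation step can be illustrated with array shapes. This sketch uses random stand-in activations; the shapes follow the 64-node first layer and a window size of 10:

```python
import numpy as np

ws, n_fwd = 10, 64                        # window size; nodes in layer 1
h_forward = np.random.rand(ws, n_fwd)     # stand-in per-step activations of
h_backward = np.random.rand(ws, n_fwd)    # the forward and backward passes

# Concatenate per timestep: each of the ws positions now carries context
# from both directions, feeding the second layer a 128-wide input.
h_concat = np.concatenate([h_forward, h_backward], axis=-1)
```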
5 Results
Supervised learning predictive analysis often makes use of confusion matrices. Confusion
matrices for classification problems are of shape (n, n) where n is the number of specified classes.
The rows and columns correspond to the predicted and actual classes respectively. Table 8
shows a zero-initialized confusion matrix for the DTG classification problem.
Table 8: Zeroed confusion matrix.

                        Actual
Predicted   FTP   HTTP   VoIP   XMPP   Other
FTP         0     0      0      0      0
HTTP        0     0      0      0      0
VoIP        0     0      0      0      0
XMPP        0     0      0      0      0
Other       0     0      0      0      0
A cell within the matrix is incremented with every packet prediction made on the test set.
For example, if the first packet of the test set is predicted by the model as an FTP packet but
is actually HTTP, and the next two packets are correctly predicted as HTTP, the confusion
matrix will be updated to look like Table 9.
Table 9: Incremented confusion matrix.

                        Actual
Predicted   FTP   HTTP   VoIP   XMPP   Other
FTP         0     1      0      0      0
HTTP        0     2      0      0      0
VoIP        0     0      0      0      0
XMPP        0     0      0      0      0
Other       0     0      0      0      0
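The incrementing logic just described can be sketched as follows; the class order and the three example predictions are taken from the text:

```python
import numpy as np

classes = ["FTP", "HTTP", "VoIP", "XMPP", "Other"]
idx = {c: i for i, c in enumerate(classes)}
cm = np.zeros((5, 5), dtype=int)          # rows: predicted, cols: actual

# The three example predictions from the text: one FTP prediction that
# is actually HTTP, then two correct HTTP predictions.
for predicted, actual in [("FTP", "HTTP"), ("HTTP", "HTTP"), ("HTTP", "HTTP")]:
    cm[idx[predicted], idx[actual]] += 1
```

After these three updates the matrix matches Table 9: a 1 in the (FTP, HTTP) cell and a 2 on the HTTP diagonal.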
The matrix will continue to fill until the final packet has been predicted. A perfect model will
only have non-zero values along the diagonal. Conversely, a model with poor performance
will result in a distribution of values across all cells. Table 10 is an example of a completed
confusion matrix with artificial values. The real test set has tens of thousands of samples
and is not correlated with this matrix.
Aside from providing a visualization of error, confusion matrices can be used to calculate
per class true positives (TP), false positives (FP), false negatives (FN), and true negatives
(TN). Table 11 highlights these values for the FTP class. Table 12 depicts the same table
but with XMPP as the class of focus in blue.
Numerous predictive analysis metrics can be derived from TP, FP, FN, and TN values.
Precision is a measure of how many packets are correctly classified to a given class compared
to the total predicted data (TP+FP) for that class and is calculated with Equation (2). The
metric recall, represented by Equation (3), determines how many packets were correctly
[Table 10: an example completed confusion matrix with artificial values.]
[Table 11: the confusion matrix of Table 10 with the true positive, false positive,
false negative, and true negative cells highlighted for the FTP class.]
identified compared to the ground truth (TP+FN). Depending on the scenario, one or
the other may be of greater importance. For instance, in a cancer screening test, recall is
likely of greater significance because false positives are less detrimental to someone’s health
and treatment plan than false negatives. In the packet classification domain, the choice of
whether recall or precision is more important depends on the application. The F1 score,
Equation (4), is the harmonic mean of precision and recall. It weighs precision and recall as
equally important without knowing the primary objective of the classification task, and is
a frequently used metric in classification research. Table 13 lists the values of each metric
for the examples in Tables 11 and 12.
precision = TP / (TP + FP)                                      (2)

recall = TP / (TP + FN)                                         (3)

F1 score = 2 * (precision * recall) / (precision + recall)      (4)
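Equations (2) to (4) can be applied directly to one class's counts; the counts below are illustrative only, not taken from the Report's tables:

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (2)-(4) applied to a single class's counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts for one class.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Here precision is 0.9 and recall 0.75; the harmonic mean pulls the F1 score toward the weaker of the two.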
[Table 12: the confusion matrix of Table 10 with the true positive, false positive,
false negative, and true negative cells highlighted for the XMPP class.]
[Table 13: precision, recall, and F1 for each class in the examples of Tables 11
and 12; Average F1 = 0.783, Accuracy = 0.897.]
Confusion matrices are excellent tools for viewing the detailed performance of a single
trained model, but are less useful for comparing several models. A single global
metric that coalesces the information from the confusion matrix, such as the F1 score, is
better suited to this purpose. Table 13 highlights the difficulty in using the total accuracy
metric for cases where a dataset contains imbalanced classes. The total accuracy is computed
as the number of true positives and true negatives divided by the total sample space
(i.e., (TP + TN) / (TP + TN + FP + FN)). The accuracy computed over all classes is high
due to the influence of the majority class VoIP. FTP and XMPP are in the minority and
their individual performances are not evident when using total accuracy. Averaging class F1
scores as an overall metric for the model accounts for deficiencies in minority class learning.
For these reasons, individual class and averaged F1 scores are used for the remainder of this
section as the baseline performance metric.
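The imbalance effect can be demonstrated on a hypothetical two-class confusion matrix in which a large majority class dominates a poorly learned minority class:

```python
import numpy as np

def class_f1(cm, k):
    """Per-class F1 from a confusion matrix (rows predicted, cols actual)."""
    tp = cm[k, k]
    fp = cm[k].sum() - tp        # predicted k, actually another class
    fn = cm[:, k].sum() - tp     # actually k, predicted another class
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Hypothetical matrix: 100 majority-class samples, 10 minority-class samples.
cm = np.array([[95, 4],
               [5, 6]])
accuracy = np.trace(cm) / cm.sum()
avg_f1 = np.mean([class_f1(cm, k) for k in range(2)])
```

Accuracy comes out near 0.92 while the average F1 is only about 0.76, because the minority class's weak F1 (about 0.57) is hidden by the majority class in the accuracy figure but weighted equally in the average.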
Figure 9 plots the performance of each tested window size with various strides, where the
batch size is 64, the RNN unit is GRU, and the RNN hidden layer model is 64-32. Note
that a 1-packet window essentially turns the RNN into an MLP. No temporal data is used
and the model learns solely from the lone input feature vector.
Figure 9: Performance of various window sizes and strides using GRU as the RNN unit,
64-32 as the RNN hidden layer model, and 64 as the batch size.
Increasing the window size and in turn the temporal information available to the classifier
enhances performance. For each window size, we tested values of “stride” up to the maximum
allowable value, where the maximum could not exceed the window size. In general, we
observe that enlarging the stride reduces the average F1 score. There are two factors that
could contribute to this observation. As the stride is increased, the number of windows
available to train with proportionately decreases. Changing the stride from 1 to 5 for
instance will reduce the number of training samples to 20% of what was possible with
the stride set to 1. Decreasing the size of the training set hampers the model’s robustness.
The other possible contributing factor to the performance dip is that (especially for a large
stride equal to the window size) some traffic flows may only be passed to the network as
edge cases. Figures 10 and 11 highlight how a large and small stride affects edge cases
respectively, where the batch size is 64, the RNN unit is GRU, and the RNN hidden layer
model is 64-32.
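The effect of stride on the number of training windows can be sketched as follows; the trace length here is hypothetical:

```python
def windows(packets, window_size, stride):
    """All full windows of `window_size` consecutive packets,
    taken every `stride` packets."""
    return [packets[i:i + window_size]
            for i in range(0, len(packets) - window_size + 1, stride)]

trace = list(range(1000))            # a hypothetical 1000-packet capture
n1 = len(windows(trace, 10, 1))      # stride 1: maximum overlap
n5 = len(windows(trace, 10, 5))      # stride 5: far fewer windows
```

With a stride of 5 the window count drops to roughly 20% of the stride-1 count, matching the proportional reduction described above.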
A pair of XMPP packets are split between the two windows formed in Figure 10. Most
XMPP flows are two packets long consisting of a short message segment followed by an
[Figure 10: a large stride splitting a two-packet XMPP flow across window 1 and
window 2, leaving neither window with the flow intact.]
acknowledgement packet. Without having a window with the flow intact, the RNN is unable
to observe this information, reducing its ability to correctly learn the behaviour of XMPP.
In Figure 11, however, two of three windows that contain part of the XMPP flow have
it in its entirety. There is more context for the RNN to pull from in this scenario. The
overlapping of windows caused by using a smaller stride enhances the network’s ability to
learn. Larger window sizes compound this effect by increasing the size of flows that can fit
within a window. Figure 12 compares the performance of window sizes with the stride set
to 1 which, for both of the reasons mentioned above, is the best stride for all window sizes.
The largest window sizes, as intuition would expect, resulted in the highest F1 scores. In
addition to the increased temporal information of a larger window, with a stride of one, each
packet (excluding the first and last window size packets) is passed to the network “window
size” times. So for the 200-packet window, the network will see the same packet 200 times,
which adds to training robustness. For all window sizes the network struggled most with
identifying HTTP. FTP and HTTP share similar traffic behaviour since both clients request
files from a server. The FTP files are larger and get split into more packets at the frame
level. HTTP objects still need to be split at the frame level but make up smaller flows.
Therefore both traffic types make large sequences of 1514-byte (the Maximum Transmission
Unit [MTU]) data packets and respective acknowledgements, eventually terminating with a
single packet of less than or equal to 1514 bytes. The primary distinction between the two
is that HTTP objects are typically smaller and grouped into several objects at a time. This
information will often elude the neural network because the flows can be significantly longer
than the largest window size tested. Since FTP files are larger, there are about 5 times as
many of them in the test set. It appears that as the dominant traffic type between the
two, the RNN leans toward classifying uncertain packets as FTP, possibly resulting in
the reduced HTTP performance. It is encouraging that despite their near identical traffic
patterns, the network can still distinguish between the two with some degree of fidelity.
[Figure 11: a stride of 1 producing overlapping windows, two of which contain the
two-packet XMPP flow in its entirety.]
Of note is the training time for each window and stride configuration. While larger windows
are superior performance-wise, they come with the drawback of being slower to train. In
a tactical environment, if the network needs to be retrained with data recently collected,
training time is of particular importance. Figure 13 plots the “Average F1” score of each
tested configuration against its training time. Each point is annotated with its window size
and stride in that order.
The points closest to the upper left corner of the plot are the most efficient configurations.
The window size of 200 with strides of 25 and 50 achieves reasonably good performance
quickly. Strides of one increase performance but at the cost of training time. The best
configuration is dependent on the test environment; the trade-off between classification
performance and training speed is unavoidable.
Figure 12: Performance of various window sizes with the stride set to 1, the batch size
set to 64, the RNN unit set to GRU, and the RNN hidden layer model set to 64-32.
The output prediction of the softmax layer can correspond to any of the timesteps,
depending upon how the model is trained. In the preprocessing phase, the hyperparameter
prediction position—unique to the many-to-one model—selects which packet in the window
will determine the label for the whole window. Setting it to 0, for example, trains the
many-to-one model to label the whole window with the class of the window’s first packet,
since this model assumes the input window consists of only one packet type. Figure 15
plots the performance of the many-to-one setup
with various prediction positions along with the standard many-to-many model, where the
window size is 10, the batch size is 64, the stride number is 1, the RNN unit is GRU, and
the RNN hidden layer model is 64-32.
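The prediction-position mechanic can be sketched as follows; the window contents here are hypothetical labels chosen to show how position 0 can mislabel a mixed window:

```python
def window_label(packet_labels, prediction_position):
    """Label an entire window with the class of the packet at
    `prediction_position` (the many-to-one assumption that a
    window holds a single traffic type)."""
    return packet_labels[prediction_position]

# A hypothetical 10-packet window that straddles an XMPP flow's tail.
win = ["XMPP", "XMPP", "FTP", "FTP", "FTP", "FTP", "FTP", "FTP", "FTP", "FTP"]
edge_label = window_label(win, 0)      # labels the window XMPP
inner_label = window_label(win, 5)     # labels the window FTP
```

When the window is not homogeneous, the chosen position decides the entire window's training label, which is why positions near the window border are more error-prone.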
The average F1 scores are slightly better when predicting packets away from the edge of the
window. However, the distinction amongst prediction positions is not pronounced, except
for the case of prediction position 0 for XMPP. Consider a scenario where the last XMPP
packet of a flow is the first packet of a window. The only information the network sees at
the first packet position in the window is a packet that looks like an acknowledgement of
a TCP segment, possibly leading to the network misclassifying the packet as being part
of an FTP or HTTP flow that follows. Scenarios such as this are common at window
Figure 13: Training times of all tested window size and stride configurations with the
batch size set to 64, the RNN unit set to GRU, and the RNN hidden layer model set to 64-32.
borders suggesting that prediction positions at least a few packets in from the window
border are preferable. The performance of the many-to-one configurations was comparable
to that of the many-to-many configuration, with prediction position 2 surpassing the default model.
However, by employing an aggregation strategy on the test set as described in Section 5.5,
the many-to-many model remains superior.
Figure 16: Results from training model with single features plotted
alongside default dual features.
The interarrival times between packets seem to have less influence on the outcome than
the packet length. With interarrival times omitted, the network learns nearly as well as if
they were included. When length is excluded, however, performance significantly degrades,
especially for the minority classes XMPP and Other. One possible explanation for the
difficulty of learning from interarrival times alone is their wide range:
values span microseconds to seconds, while packet lengths fall between
tens and thousands of bytes. Furthermore, values for both ranges do not lie uniformly
between the range’s maximum and minimum. Long sequences of packets sent back-to-back
are followed by long gaps. The distinction with packet lengths is that the few values that
lie in the middle of the range are more informative. A sequence of packets consisting of
two traffic types is not discernible using interarrival time, since both are being sent as
fast as possible. With packet lengths, however, non-MTU-length packets can indicate a
transition from one traffic type to another. The model itself may also not be configured
well for learning from interarrival times. Performance has been optimized for length and
hyperparameters could be tuned using just interarrival times to determine if there is a more
suitable configuration.
A final note is that while interarrival time seems to provide only minor additional benefit
to the learner, this is not to say that timing in general is not important. As evidenced by
the improvement when using longer windows, valuable learning comes from packet lengths
and their relative sequence with respect to each other. Looking at a single packet length in
isolation produces poor outcomes, but looking at a sequence of packet lengths over a long
window improves outcomes significantly.
Average performance along the window positions loosely shows a peak for positions near
the centre with the exception of the final position. Unlike the many-to-one configuration,
with the many-to-many model predictions are made without information gathered from
traversing the entire window in both directions. Instead, activations from the final forward
and backward layers at the current position are concatenated and then passed to the
softmax. In the many-to-one model, since only one prediction was made for the entire
window, the final concatenation occurred after both the forwards and backwards passes of
the RNN had been fully traversed. For some traffic types this gives an advantage to the inner
window positions. XMPP for example, as mentioned in Section 5.1, may be challenging to
predict when the flow is not padded with other traffic types within the boundary of the
window. Both XMPP and Other classes are best predicted by the middle-most window
positions with the exception of the final position. The last position’s performance may
break this trend because, if a short XMPP or Other flow—which are also usually limited
to a couple of packets—straddles a window border, the last position of the window is more
likely to contain the data packets of the flow while the first position of the next window
will contain the acknowledgements which are less informative.
FTP and HTTP performance improves at later positions in the window, indicating that the
full forward traversal of the final layer’s activations is important in distinguishing these
types. A flow of MTU-sized packets terminating with a non-MTU
data packet may only be informative in the forward direction. In reverse, the important
non-MTU data packet cannot be associated with a traffic type without knowing the length
of the reverse flow. VoIP traffic is easily identified regardless of window position because it
is the majority traffic type and has a distinctly uniform pattern.
Up until this point, all classification decisions have been treated as unique, with each one
updating the confusion matrix. While useful for calculating performance metrics, in practice
a single prediction would be required for each packet. It would be ambiguous to have, for
example, two FTP and two HTTP predictions for a single packet. One solution is to set the
stride of the test set equal to the window size regardless of the value used for the training
and validation sets. This method is fastest but reduces performance since the packet is only
seen in one position (which may not be optimal for its traffic type). If optimal
performance is desired, the repeated classifications can be aggregated into a single decision.
One method of combining these is to take the majority vote of the individual decisions.
But this does not take into account the confidence of each prediction. Summing the class
probabilities directly from the softmax and then taking the maximum of the sum remedies
this. Table 14 is an example scenario of a single packet prediction for demonstration purposes
that exemplifies the benefit of using the aggregate method.
Table 14: Probability aggregation benefit analysis.

Traffic   Probability 1   Probability 2   Probability 3   Aggregate
FTP       0.35            0.80            0.37            1.52
HTTP      0.40            0.10            0.43            0.93
VoIP      0.05            0.10            0.05            0.20
XMPP      0.10            0.03            0.06            0.19
Other     0.10            0.07            0.03            0.20
The packet’s label is FTP, and in this scenario it has been predicted three times, each
time in a different window position. The orange cells are the maximum values in each
probability column. The default model behaviour is to treat each probability individually.
The packet would be falsely classified as HTTP twice and correctly as FTP once. Taking the
majority prediction from individual probabilities would also wrongly classify the packet as
HTTP. Using the aggregate method, however, the low confidence in the HTTP predictions
is accounted for and the overall dominant probability of FTP becomes apparent. Although
this exact scenario is artificial, FTP and HTTP are often confused by the RNN due to their
shared client-server behaviour and similar instances occur in the test set. Figure 18 plots
the test performance of using the default method, i.e. treating repeated packets as unique,
and with the aggregate method on a 10 packet window that was shifted with a stride of 1.
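Using the Table 14 probabilities, the two combining strategies can be compared directly:

```python
classes = ["FTP", "HTTP", "VoIP", "XMPP", "Other"]
# Softmax outputs for the same packet seen in three window positions
# (the Table 14 example; the packet's true label is FTP).
probs = [
    [0.35, 0.40, 0.05, 0.10, 0.10],
    [0.80, 0.10, 0.10, 0.03, 0.07],
    [0.37, 0.43, 0.05, 0.06, 0.03],
]

# Majority vote over the individual argmax decisions.
votes = [classes[max(range(5), key=p.__getitem__)] for p in probs]
majority = max(set(votes), key=votes.count)

# Aggregate: sum class probabilities, then take the maximum of the sums.
sums = [sum(p[k] for p in probs) for k in range(5)]
aggregate = classes[max(range(5), key=sums.__getitem__)]
```

The majority vote yields HTTP (two low-confidence wins), while the aggregate yields FTP (sum 1.52), illustrating why summing probabilities is preferred.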
The aggregate method improves the performance across all traffic types. The improvement
is minimal with VoIP because, as the majority class, its performance is already near perfect
without the assistance of aggregation. When time and computational resources are available,
the aggregate method should be used to enhance the model’s performance.
Figure 18: Plot of F1 scores for aggregate and default test methods.
6 Conclusion
This Report demonstrated the feasibility of using recurrent neural networks (RNNs) to
perform packet traffic classification by examining only the packet length and timing of
observed packets. RNNs were selected as a candidate for traffic classification due to their
promising performance in other learning tasks in which the sequential order of data must
be retained to influence the classification outcome. A dataset consisting of FTP, HTTP,
VoIP, and XMPP traffic was collected and used to train an RNN, optimize the model’s
hyperparameters, and then investigate its performance.
While specific numerical results are likely unique to the particular use case investigated,
certain results are more generalizable, specifically:
• The performance of the classifier increases when the RNN examines larger window
sizes. This comes at a cost of longer training times, and the performance increase
shows a pattern of diminishing returns as window size increases beyond a certain
point;
• Classifier performance increases for smaller stride lengths, with the best performance
achieved with a stride length of 1. Training time increases for lower stride lengths;
however, if the training time is acceptable for a stride of 1, then this is the
recommended choice;
• Interarrival time measurements have a lesser impact on classifier performance than
packet length;
• Classifier performance improvements can be obtained by combining the results
of packet classification decisions obtained by an RNN using the many-to-many
configuration. The combined-decision many-to-many configuration is recommended
over the many-to-one configuration; and
• The RNN classification performance on a given packet is influenced by the packet’s
position in the examined window; packets near the centre of the window tend to be
classified more accurately.
While this Report demonstrated the feasibility of RNNs for a specific use case, more research
is required to gain further insight into the ultimate utility of this strategy. Research can
proceed in several ways, but testing the technique on other datasets should be prioritized.
The datasets generated by DTG in this Report were based on literature best practices,
but data from live networks of interest should also be explored. In addition, while FTP,
HTTP, VoIP, and XMPP are standard Internet traffic types, they may be of less interest
in other domains. For instance, if traffic analysis/classification were desired to be used
for tactical network applications, or smart buildings, or industrial control systems, all of
these would have different application data that would need to be modelled and evaluated.
This research has shown a promising way forward and Defence Research and Development
Canada (DRDC) expects to continue activity in this area.
References
[1] NSA (1987), The Origination and Evolution of Radio Traffic Analysis: The World
War I Era, Declassified Cryptologic Quarterly Articles.
[2] NSA (1989), The Origination and Evolution of Radio Traffic Analysis: World War II,
Declassified Cryptologic Quarterly Articles.
[3] Qadeer, M. A., Iqbal, A., Zahid, M., and Siddiqui, M. R. (2010), Network Traffic
Analysis and Intrusion Detection Using Packet Sniffer, In 2010 Second International
Conference on Communication Software and Networks.
[4] Butun, I., Morgera, S. D., and Sankar, R. (2014), A Survey of Intrusion Detection
Systems in Wireless Sensor Networks, IEEE Communications Surveys and Tutorials.
[6] Lu, Z., Wang, C., and Wei, M. (2015), On Detection and Concealment of Critical
Roles in Tactical Wireless Networks, In MILCOM 2015 - 2015 IEEE Military
Communications Conference.
[7] Liu, Y., Bild, D. R., Dick, R. P., Mao, Z. M., and Wallach, D. S. (2015), The Mason
Test: A Defense Against Sybil Attacks in Wireless Networks Without Trusted
Authorities, IEEE Transactions on Mobile Computing.
[8] Nguyen, T. T. and Armitage, G. (2008), A Survey of Techniques for Internet Traffic
Classification using Machine Learning, IEEE Communications Surveys and Tutorials.
[9] Koc, L., Mazzuchi, T. A., and Sarkani, S. (2012), A Network Intrusion Detection
System Based on a hidden Naïve Bayes Multiclass Classifier, Expert Systems with
Applications.
[10] Sarker, I. H., Abushark, Y. B., Alsolami, F., and Khan, A. I. (2020), IntruDTree: A
Machine Learning Based Cyber Security Intrusion Detection Model, Symmetry.
[11] Raikar, M. M., S M, M., Mulla, M. M., Shetti, N. S., and Karanandi, M. (2020),
Data Traffic Classification in Software Defined Networks (SDN) using
Supervised-Learning, Procedia Computer Science.
[12] Vishwakarma, S., Sharma, V., and Tiwari, A. (2017), An Intrusion Detection System
using KNN-ACO Algorithm, International Journal of Computer Applications.
[13] Guo, Z., Shi, D., Quevedo, D. E., and Shi, L. (2019), Secure State Estimation
Against Integrity Attacks: A Gaussian Mixture Model Approach, IEEE Transactions
on Signal Processing.
[27] Sola, J. and Sevilla, J. (1997), Importance of Input Data Normalization for the
Application of Neural Networks to Complex Industrial Problems, IEEE Transactions
on Nuclear Science.
[29] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2017),
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,
In ICLR.
List of Abbreviations/Acronyms/Initialisms/Symbols
AI artificial intelligence
ARP address resolution protocol
BPTT backpropagation through time
CAF Canadian Armed Forces
CNN convolutional neural networks
DoS denial-of-service
DRDC Defence Research and Development Canada
DTG DRDC Traffic Generator
ESP early stopping patience
FN false negatives
FP false positives
FTP file transfer protocol
GRU gated recurrent units
HTTP hypertext transfer protocol
IoT Internet of Things
IP Internet Protocol
k-NN k-nearest neighbours
LSTM long short-term memory
MAC media access control
MDNS multicast domain name system
MLP multilayer perceptrons
MTU maximum transmission unit
NN neural networks
NTP network time protocol
RNN recurrent neural network
R2L remote to local
SDN software defined network
SSDP simple service discovery protocol
SVM support vector machine
TN true negatives
TP true positives
U2R user to root
XMPP extensible messaging and presence protocol
1. ORIGINATOR (The name and address of the organization preparing the document. A DRDC Centre sponsoring a contractor’s report, or a tasking agency, is entered in Section 8.)

2a. SECURITY MARKING (Overall security marking of the document, including supplemental markings if applicable.)
NON-CONTROLLED GOODS
DMC A

3. TITLE (The document title and subtitle as indicated on the title page.)

4. AUTHORS (Last name, followed by initials – ranks, titles, etc. not to be used. Use semi-colon as delimiter.)

5. DATE OF PUBLICATION (Month and year of publication of document.)
March 2022

6a. NO. OF PAGES (Total pages, including Annexes, excluding DCD, covering and verso pages.)
45

6b. NO. OF REFS (Total cited in document.)
29

Scientific Report

8. SPONSORING CENTRE (The name and address of the department project or laboratory sponsoring the research and development.)

9a. PROJECT OR GRANT NO. (If appropriate, the applicable research and development project or grant number under which the document was written. Please specify whether project or grant.)

9b. CONTRACT NO. (If appropriate, the applicable contract number under which the document was written.)

10a. DRDC PUBLICATION NUMBER
DRDC-RDDC-2022-R052

10b. OTHER DOCUMENT NO(s). (Any other numbers which may be assigned to this document either by the originator or by the sponsor.)

11a. FUTURE DISTRIBUTION WITHIN CANADA (Approval for further dissemination of the document. Security classification must also be considered.)
Public release

11b. FUTURE DISTRIBUTION OUTSIDE CANADA (Approval for further dissemination of the document. Security classification must also be considered.)
Public release
13. ABSTRACT/RÉSUMÉ (When available in the document, the French version of the abstract must be included here.)
This Scientific Report investigates the efficacy of machine learning techniques on a passive
eavesdropper’s ability to classify encrypted traffic in wireless networks, where the eavesdropper
can observe only the timing, sequence, and duration of packets. A custom-built traffic
generation tool was used to generate encrypted traffic datasets consisting of
four application types: file transfer protocol (FTP), hypertext transfer protocol (HTTP), voice
over Internet protocol (VoIP), and extensible messaging and presence protocol (XMPP). The
eavesdropper gathered encrypted packets and used the recurrent neural network (RNN)
machine learning technique to classify the applications of packets in the dataset on a
packet-by-packet basis. In evaluating the efficacy of RNNs for traffic classification, the effect
of a number of parameters on accuracy and training time was examined, including: the window
size of data examined by the RNN; the effect of partial knowledge; different configurations of
RNNs such as many-to-many or many-to-one; and how best to combine the outputs of an RNN
to make a classification prediction.
Le présent rapport scientifique examine l’efficacité des techniques d’apprentissage machine sur
la capacité d’un intercepteur passif à classer le trafic chiffré sur des réseaux sans fil en observant
uniquement le moment, la séquence et la durée de transmission des paquets. Pour ce faire, on
a utilisé un outil générateur de trafic personnalisé afin de créer des ensembles de données
chiffrés transmis au moyen de quatre types de protocoles, soit le protocole de transfert de
fichiers (FTP), le protocole de transfert hypertexte (HTTP), la voix sur protocole Internet (VoIP)
et le protocole XMPP (Extensible Messaging and Presence Protocol). L’intercepteur a capté
des paquets chiffrés et s’est servi d’un réseau de neurones récurrents (RNR) – une technique
d’apprentissage machine – pour classer un à un les paquets de l’ensemble de données selon
leur type d’application. Dans le but d’évaluer l’efficacité des RNR quant au classement du
trafic, on a étudié l’incidence de divers paramètres sur l’exactitude et le temps d’apprentissage,
notamment l’étendue des données analysées par les RNR, l’effet de connaissances partielles,
les différentes configurations de RNR (plusieurs-à-plusieurs ou plusieurs-à-un) et la meilleure
façon de combiner les extrants d’un RNR pour prédire un classement.