
This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3482728


Data fingerprinting and visualization for AI-Enhanced Cyber-Defence Systems
Christiaan Klopper1, and J.H.P. Eloff2
Department of Computer Science, University of Pretoria, South Africa, 0002
[email protected], [email protected] (2corresponding author)

ABSTRACT Artificial intelligence (AI)-assisted cyber-attacks have evolved to become increasingly successful in every aspect of the cyber-defence life cycle. For example, in the reconnaissance phase, AI-enhanced tools such as MalGAN can be deployed. The attacks launched by these types of tools automatically exploit vulnerabilities in cyber-defence systems. However, existing countermeasures cannot detect the attacks launched by most AI-enhanced tools. The solution presented in this paper is the first step towards using data fingerprinting and visualization to protect against AI-enhanced attacks. The AIECDS methodology for the development of AI-Enhanced Cyber-defense Systems is presented and discussed. This methodology includes tasks for data fingerprinting and visualization. The use of fingerprinted data and data visualization in cyber-defense systems has the potential to significantly reduce the complexity of the decision boundary and simplify the machine learning models required to improve detection efficiency, even for malicious threats with minuscule sample datasets. This work is validated by showing how the resulting fingerprints enable the visual discrimination of benign and malicious events as part of a use case for the discovery of cyber threats using fingerprinted network sessions.

KEYWORDS cyber-defense, cyber security, data fingerprint, data visualization, intelligent system

I. INTRODUCTION
One of the biggest challenges of the 21st century is defending cyber assets from cyber-attacks [1, 2, 3 and 4]. Worldwide events such as COVID-19 and the Russian invasion of Ukraine gave threat actors the opportunity to significantly increase cyber-attacks [1, 3 and 4]. The main defensive challenges are advanced and complex attacks, unprotected data, poor cybersecurity practices, and defenses based on vulnerability management, exposing the cyber-defense perimeter [2,3]. Simultaneously, threat actors are continually finding innovative means to deploy Artificial Intelligence (AI)-enhanced cyber-attacks [3] and increase their capabilities to target critical vulnerabilities more quickly [1]. This has led to more frequent cyber-attacks with greater complexity and smaller time windows for performing critical vulnerability patching. Consequently, in 2021, zero-day attacks nearly doubled [2].

A study performed by the University of New South Wales (UNSW) and Australian Cybersecurity Center (ACSC) unintentionally demonstrated the ineffectiveness of state-of-the-art cyber-defense systems in 2015 [5]. The majority of these countermeasures are based on machine learning models, particularly anomaly detection techniques. Threat actors target instabilities in machine learning (ML) and AI by poisoning input data, using adversarial attacks to confuse models during inference [1,6], and using generative adversarial networks (GAN) to enhance cyber-attacks [7]. Cybercriminals have advanced their ability to execute highly sophisticated AI-enhanced attacks, which are increasingly difficult to discover, detect, and protect [7,8].

Among the many reasons for the poor performance of UNSW projects, the data problem is noteworthy. This could be attributed to the fact that the data used were created in a laboratory, not real-world data, and not real-time data. These data properties are of prime importance when building cyber-defence systems. AI-enhanced ML cyber-defense solutions should be trained using real-world attack data [9]. However, there are multiple shortcomings in obtaining real-world data for ML-based cybersecurity solutions: (i) the lack of availability of large real-world attack datasets, (ii) the sensitive nature of the data, and (iii)


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4

security, confidentiality, and privacy concerns [10,11]. Research conducted by [12, 6, 18] has determined that most ML cybersecurity research has not been tested or trained in real-time environments. This is critical for determining the detection efficiency in practical scenarios in which cyber-defence systems are intended to be deployed. In [13], the author(s) concluded that high-quality real-world and real-time data are required to counter cyber threats.

Training machine learning models on visualized data has proven to be more successful than training on raw data [14]. This is because researchers have identified that visualizations can represent complex, large, multimodal datasets as simple datasets [14], which simplifies the learning task for AI models. This opens up an opportunity for developers of cyber-defense systems to develop AI-enhanced tools that can be trained on visualized data. Furthermore, visualized representations of data create an opportunity to extract more meaningful real-world data from threat-related environments such as computer networks.

This study represents the first step in addressing certain aspects of the data problem by proposing a methodology for the development of AI-enhanced cyber-defense tools that includes tasks for data fingerprinting and data visualization. The remainder of this paper is organized as follows. First, related work on threats to the cyber-defense lifecycle and the efficiency of state-of-the-art cyber-defense machine learning models is discussed. This is followed by the proposal of a methodology for the development of AI-enhanced cyber-defence solutions. This methodology includes data fingerprinting tasks that are discussed in more detail. The application of this methodology is demonstrated using a use case that focuses on the discovery of cyber threats through fingerprinted network sessions.

II. RELATED WORK
Current cyber-defense research outputs that are important for the research at hand include the following.
A The cyber-defense lifecycle.
B State-of-the-art ML based cyber-defense tools.

A. CYBER-DEFENSE LIFE CYCLE
The cyber-defense lifecycle stipulates the phases that a cyber-attacker must complete to infiltrate the organization. The cyber-defense lifecycle phases [7] are as follows: Reconnaissance, which collects information and intelligence for the planned cyber-attack; Weaponization, which focuses on the effectiveness of the cyber-attack; Delivery, which bypasses existing safeguards; Exploitation, which infiltrates; Installation, which opens the network for malicious attacks; Command & Control, which provides remote control of the network; and Actions, which execute the intended malicious activity. All phases in the cyber-defense life cycle should be considered when developing methodologies for the development of AI-enhanced cyber-defense systems. For the research conducted for this paper, the reconnaissance phase is of particular interest for the encoding and visualization of data for the development of AI-enhanced cyber-defense countermeasures.

Researchers have identified cyber threats that demonstrate the use of AI-enhanced attack tools within different phases of the cyber-defense lifecycle [15, 7]. Consider, for example, the reconnaissance phase in which an AI-enhanced tool such as MalGAN can be deployed. MalGAN generates concealed adversarial malware that can successfully bypass “black-box” malware detectors [16]. Another tool, DeepLocker [17], conceals its malware payload to activate it only when triggered. This is achieved by training adversarial samples that mutate the payload to obfuscate normal appearance. DeepLocker represents advances in AI-enhanced tools that can be deployed in the Command & Control phases of the cyber-defence lifecycle.

B. STATE-OF-THE-ART CYBER-DEFENSE TOOLS
The ML-based cybersecurity countermeasures detailed in the literature show that there is still a significant gap in achieving a cyber-defense system to overcome current cybersecurity threats. Researchers have concluded that most datasets used in ML-based detection systems research are outdated and do not typically reflect real-world traffic or the latest cyber-attacks accurately [18, 19, 6, and 12]. The findings indicate that legacy datasets used in Intrusion Detection Software (IDS) ML research represent 88% of the dataset distribution [6]. In addition, this study indicated that in ML research on Malware Detection Software (MDS), customized datasets represent 33% of the dataset distribution, and 20% of the datasets were created before 2012. Findings in [18] determined that legacy datasets had the highest majority, representing 56% of the dataset distribution in ML research. According to [18], although experimental results on legacy datasets are excellent, they decrease significantly when tested on more recent datasets, including real-world datasets.

The lack of real-world datasets is compounded by the inability to extract meaningful information from real-world systems such as computer network environments. Other studies [13, 20, 6, and 12] have indicated that most experiments for prototyping network-based cyber-defense systems used simplified calculated features based on data telemetry and averaging statistics. According to [6], simplified calculated features result in increased inference sensitivity and time delay for classifying cyber-attacks.

ML-based IDS countermeasures have evolved from techniques heavily dependent on feature engineering to Deep Learning, which is less dependent on feature engineering. This results in more complex models with incremental performance improvement. However, the detection of threats with minuscule malicious samples has not yet improved. Although several attempts have used

dataset rebalancing, no advancements have been made in the techniques that perform better in detecting threats with minute malicious samples. This, combined with real-time timing differences, is most likely the reason why IDS systems perform worse in real-time environments than in laboratories. In addition, most IDS research has been conducted on outdated datasets with no current threats. Similar to ML-based IDS countermeasures, [21] conducted a comprehensive review of ML-based MDS approaches, considering signature, behavior, heuristic, model-checking, DL, cloud, mobile, and Internet-of-Things-based detection. According to this study, although there have been advancements in every approach, no ML-based MDS detection approach has successfully detected all malware types.

Malialis [22] was one of the few researchers to develop a Reinforcement Learning (RL) model for network intrusion detection and response in real time. The input source was real-time network packets [22]. The author(s) proposed the use of a distributed RL defence system to throttle DDoS attacks. This was modelled using a mesh network with a distributed group of routers configured by the author(s) as RL agents. The agents were then trained to limit the amount of DDoS traffic that could be passed through the network, based on the flow of packets through each router.

In a survey by [23] on adversarial ML in cyber warfare, the author(s) concluded that there are serious concerns regarding vulnerabilities in ML-based cyber-defense systems. According to the author(s), faulty assumptions during ML model training are the main cause of vulnerabilities. The author(s) further noted that AI, which confuses models during inference, is a direct result of assuming that data in datasets are linearly separable and solvable using linear functions. This was indirectly verified in practice by [3], who reported in 2022 that there would be an increase in sophisticated cyber threats, enabling threat actors to repeat cyber-attacks at a greater scale and speed.

Based on the above discussion, the following requirements for methodologies that provide guidelines for the development of AI-enhanced cyber-defence systems are considered important.
• Improve detection rates and reduce detection time.
• Employ dynamic self-learning and RL approaches.
• Detect adversarial and unknown cyber-attacks.
• Detect threats and attacks with minute sample data sets.
• Train AI-enhanced countermeasures in real-world environments using real-time data.
• Extract and encode meaningful data from real-world systems, which is also referred to as fingerprinting.
• Visualize the data to overcome complexity in multimodal threat related data.

III. THE AIECDS-METHODOLOGY FOR THE DEVELOPMENT OF AI-ENHANCED CYBER-DEFENSE SYSTEMS
Figure 1 depicts the AIECDS (AI-Enhanced Cyber-defense System) methodology, developed by the authors of the research at hand and adopted from previous research [24]. The AIECDS methodology provides guidelines for the development of AI-enhanced cyber-defence systems. This study presents a high-level overview of the AIECDS methodology and discusses the fingerprinting and visualization of the data in more detail. Furthermore, the application of the AIECDS methodology is illustrated through a use case study for the discovery of cyber threats in fingerprinted network sessions.

[Figure 1 flowchart. Recoverable content:]
Input: Captured real-time packets / Real-time environment
2. Extract features and buffer data (until buffer is full)
   1. Read packet
   2. Extract features from the packet
   3. Append buffer with packet features
3. Data preparation
   1. Network session: Identify unique network sessions
   2. Threat labelling: Determine threat label for each unique session (during training)
   3. Additional meta data: Additional meta data can be identified to further enrich features
4. Fingerprint sessions: Encode and visualise (until buffer is fingerprinted)
   1. Process extracted network session data to improve computation
   2. Header: Encode header with Hilbert curve using relevant features
   3. Protocol Discourse: Encode using tornado graph using relevant features
   4. Transmitted data: Encode transmitted with Hilbert curve using relevant features
   5. Update dataframe: Store fingerprint in dataframe
5. Threat detection
   Fingerprint management system
   1. Buffer: Insert new and updated fingerprints into buffer
   2. Presentation: Present new & updated fingerprints to DRL model
   Threat detection DRL model
   1. Predict: Predict optimal action to maximise rewards
   2. Obtain state: Updated action, reward, state and next state (during training)
   3. Update Weights: Update RL model weights on experience replay (during training)

FIGURE 1: AIECDS-methodology adapted for the use case that fingerprints network sessions.
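As an illustration of step 2 in Figure 1 (read packet, extract features, append to buffer until the buffer is full), the following is a minimal sketch and not the authors' implementation. It assumes packets arrive as plain dictionaries; the feature names, `extract_features`, and `buffered_batches` are hypothetical names introduced here for illustration only.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical feature names, loosely following the features listed for the
# network-session use case (IP source/destination, length, flags, ports, protocol).
FEATURES = ("ip_src", "ip_dst", "ip_len", "tcp_flags",
            "src_port", "dst_port", "protocol")


def extract_features(packet: Dict) -> Dict:
    """Keep only the selected features; features absent from the packet become None."""
    return {name: packet.get(name) for name in FEATURES}


def buffered_batches(packets: Iterable[Dict], buffer_size: int) -> Iterator[List[Dict]]:
    """Yield the buffer each time it is full, then start a fresh one."""
    buffer: List[Dict] = []
    for packet in packets:
        buffer.append(extract_features(packet))
        if len(buffer) == buffer_size:
            yield buffer
            buffer = []
    if buffer:
        # Flush whatever is left when the packet stream ends.
        yield buffer
```

Each yielded batch would then be handed to the data preparation phase. Flushing a partially filled buffer at the end of the stream is a design choice made for this sketch.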


A. PHASES OF THE AIECDS-METHODOLOGY
As shown in Figure 1, the AIECDS methodology consists of the following phases:
1) DATASET (REAL-TIME ENVIRONMENT)
2) EXTRACT FEATURES AND BUFFER DATA
3) DATA PREPARATION
4) FINGERPRINT SESSIONS
5) THREAT DETECTION
A high-level overview of each phase is provided below, with detailed attention paid to the fingerprinting phase.

1) DATASET (REAL-TIME ENVIRONMENT)
According to the criteria for AIECDS, the guideline is to train AI-enhanced countermeasures in real-world environments using real-time data.
Use case: Network packets are captured via Packet Capture (PCAP) technology, and data are extracted in near real time, limited to information that is available to a firewall. A continuous stream of packets is processed by a separate system with minute processing delays, and is trained to detect threats prior to completing data transfer. This enables the proposed solution to detect threats within a live network while maintaining a low computational complexity, thereby reducing delays in threat detection. The last mentioned is one of the criteria for the AIECDS methodology. This is achieved using a PCAP dataset, which contains data extracted from a real-time network environment, and starts to detect threats as packets are received. An example of a PCAP dataset is the UNSW-15 dataset [5], which has been used in many cyber security machine-learning research projects. These datasets contain attacks, including DoS, worms, backdoors, fuzzers, and zero-day attacks, among other threats. For example, the UNSW-15 [5] dataset contains 100 GB of network packets (in PCAP format) with 82 million network packets in the training dataset alone. In addition, this dataset contains threat labels for threat categories rather than actual attacks, which is more meaningful for training high-performance threat-detection algorithms.

2) EXTRACT FEATURES AND BUFFER DATA
According to AIECDS criteria, packets must be processed as they are received. This is achieved by extracting the key features for each packet received, and storing the results for each packet in a buffer. The buffer is passed to the data preparation phase periodically or once it is full, after which it is cleared.
Use case: Network session datasets, such as the UNSW-15 PCAP dataset, contain a wide assortment of packets, including IP and Address Resolution Protocol (ARP) network protocols and a wide array of transport protocols. In use cases where the focus is on network sessions, the following are examples of meaningful features: IP source, IP destination, IP length, TCP flags, source port, destination port, protocol, ARP p-source, ARP p-destination, and transmitted data. The extraction process is completed by extracting data in batches from the PCAP files during training, or by buffering real-time packets as they are received.

3) DATA PREPARATION
Data preparation involves the preparation of the extracted features for fingerprinting. It uses data frames from the extract features and buffer data phase. The fingerprint data frame represents extracted real-world data or events. Threat labelling is performed to link the available metadata from the features to the extracted dataset (during training). Additional information may enrich the understanding of these features.
Use case: The data preparation phase uses the data frame output extracted from the PCAP dataset, which is subsequently prepared for fingerprinting. The fingerprint data frame represents the extracted network sessions. Threat labelling is performed to link the available metadata from the features to the extracted PCAP dataset. Additional information includes applications and services that enrich the selected features. These features are then added to the corresponding fingerprints.

4) FINGERPRINT SESSIONS
The following criteria (see Section II) are specifically addressed in the “fingerprint session phase”:
• Extract and encode (fingerprinting) meaningful data from real world systems.
• Visualize the data to overcome complexity in multimodal threat related data.
The following are examples of space-filling [25] encoding and visualization techniques [26]: natural ordering, line-by-line, column-by-column, Hilbert, and Morton. For the AIECDS methodology, Hilbert curves [27] and tornado graphs [28] were chosen to encode and visualize the real-world data. Briefly, a Hilbert curve is a continuous fractal space-filling curve [27], whereas a tornado graph is a special type of bar chart [28]. The decision to use Hilbert curves is based on the fact that they maintain the relative positions of data elements within the overall data structure. For example, the positions of network packets within network sessions are significant. A detailed explanation of the use of Hilbert curves can be found in [24].

The detailed design of fingerprint representations depends on the specifics of the use case. As previously mentioned, the use case employed to demonstrate the construction of a fingerprint for the purpose of this research was the discovery of cyber threats using fingerprinted network sessions. The fingerprint design for the use case has three distinct sections with different encoding approaches: the header, protocol discourse, and transmitted data.

Header
The header of the fingerprint must be unique to each event or session. The reason that the header section is encoded in a


specific manner is to enlarge its prominence within the final fingerprint, because the significance of behaviors for certain unique events or sessions may otherwise be missed.
Use case: The source and destination IP addresses, ports, and protocols are sufficient for representing a unique network session. This is illustrated in Figure 2.

FIGURE 2: Fingerprint header design
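To make the notion of a unique network session concrete, the sketch below builds a session key from the source and destination IP addresses, the ports, and the protocol number. It is an illustrative assumption, not the authors' code: `SessionKey` and `session_key` are hypothetical names, and sorting the two endpoints so that both directions of an exchange map to the same session is a design choice made here.

```python
from typing import NamedTuple, Tuple


class SessionKey(NamedTuple):
    """Hypothetical identifier for one network session."""
    endpoint_a: Tuple[str, int]  # (IP address, port)
    endpoint_b: Tuple[str, int]  # (IP address, port)
    protocol: int                # IP protocol number, e.g. 6 for TCP, 17 for UDP


def session_key(src_ip: str, src_port: int,
                dst_ip: str, dst_port: int, protocol: int) -> SessionKey:
    # Assumption: requests and replies belong to the same session, so the
    # endpoints are ordered to make the key independent of direction.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return SessionKey(a, b, protocol)
```

With this key, a packet from 59.166.0.7:53421 to 149.171.126.4:80 and the reply in the opposite direction fall into the same session, while the same endpoints over a different protocol do not.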

Both TCP and UDP port numbers range from zero to 65535, which can be encoded using four colors (from light gray to black) and two eight × eight Hilbert curves. This was achieved by counting 255 in the first Hilbert curve for every one in the second Hilbert curve. Protocols range from zero to 255 and are encoded using a similar approach to IP sections. The last eight × eight Hilbert curve is reserved for future use, and is required to complete the 128 columns required for the 128 × 128 Hilbert curve used for the transmitted data section. One possible future application for the reserved eight × eight space could be to record the frame sizes for IP, TCP, and UDP, or metrics for other transport or application protocols.

Protocol discourse
The protocol discourse section of a fingerprint must represent the communication sequence among multiple hosts. This is achieved by using certain attributes and features originating from a sequence of communications. The use case is illustrated in Figure 3.

FIGURE 3: Protocol discourse design

Use case: In Figure 3, the protocol exchange is visualized for a TCP session between 59.166.0.7 on port 53421 and 149.171.126.4 on port 80. The exchange is initiated with a request to synchronize (1), which is acknowledged (2), after which the initial setup is acknowledged, pushed, and acknowledged (3, 4, and 5). Large packets (6, 7, 9, 11, and 13) were then sent and acknowledged (8, 10, and 12). The session is finalized at the end with a finish and its acknowledgement (14, 15) before closing the exchange (16). The fingerprint has sufficient capacity to capture 128 interactions between two hosts, which can contain multiple flows within the same unique session.


Transmitted data
The fingerprint data section must encode the relevant data within a unique session or event until the Hilbert curve is completed.
Use case: The data section of the fingerprint must encode the packet data for all packets within a unique session, or until the 128 × 128 Hilbert curve is completed. The complete Hilbert curve is shown in Figure 4.

FIGURE 4: Transmitted data design
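The placement of session bytes along a Hilbert curve can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' encoder: it uses the widely known iterative distance-to-coordinate conversion for Hilbert curves, treats each byte value (0 to 255) as a grayscale intensity, and pads unused cells with zero. The names `hilbert_d2xy` and `encode_bytes` are hypothetical.

```python
from typing import List, Tuple


def hilbert_d2xy(side: int, d: int) -> Tuple[int, int]:
    """Map distance d along a Hilbert curve to (x, y) on a side x side grid.

    side must be a power of two and 0 <= d < side * side.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate the quadrant so that consecutive cells stay adjacent.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def encode_bytes(payload: bytes, side: int = 128) -> List[List[int]]:
    """Place payload bytes along the curve; each cell holds a value 0..255."""
    grid = [[0] * side for _ in range(side)]  # assumption: zero padding
    for d, value in enumerate(payload[: side * side]):
        x, y = hilbert_d2xy(side, d)
        grid[y][x] = value
    return grid
```

Because consecutive positions on the curve are always neighbouring cells, bytes that are close together in the session remain close together in the image. Bytes beyond the capacity of the grid (16384 for a 128 × 128 curve) are dropped in this sketch.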

Data are transmitted in bytes, which are composed of eight bits. As a result, each byte can be converted into a decimal range from zero to 255, which can be encoded into grayscale colors. Therefore, each element of the 128 × 128 Hilbert curve can depict a byte using 256 grayscale colors. A 128 × 128 Hilbert curve was selected to develop dense transmitted data visualization to limit future changes in the fingerprint shape.

5) THREAT DETECTION
The criteria for AIECDS methodology include the use of dynamic self-learning and RL. Therefore, the threat detection phase of the AIECDS methodology was designed as shown in Figure 1. The threat detection phase consists of two main tasks, one for managing the fingerprint system and another for training the threat detection DRL (Detection Reinforced Learning) model. The purpose of the fingerprint management system is to buffer and maintain all fingerprints. This is achieved by recording a state for each fingerprint, which should include the available fingerprint space, when it was last presented to the threat detection DRL model, and when the fingerprint was last updated. The fingerprint-management system shown in Figure 1 is illustrated in Figure 5. The fingerprint management system involves inserting newly created fingerprints into the buffer and scheduling them to be presented to the threat detection DRL model, as well as routinely scheduling existing fingerprints to be presented to the threat detection DRL model once updated.
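The bookkeeping described for the fingerprint management system can be sketched as a small buffer that records, for each fingerprint, the available space, the time it was last updated, and the time it was last presented to the model. This is a hedged illustration only: `FingerprintState`, `FingerprintBuffer`, and their methods are hypothetical names, and treating a fingerprint as due whenever it was updated after its last presentation is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FingerprintState:
    free_cells: int                        # available fingerprint space
    last_updated: float                    # time of the most recent update
    last_presented: float = float("-inf")  # never presented to the model yet


class FingerprintBuffer:
    """Hypothetical buffer that schedules fingerprints for presentation."""

    def __init__(self) -> None:
        self._states: Dict[str, FingerprintState] = {}

    def upsert(self, session_id: str, free_cells: int, now: float) -> None:
        """Insert a new fingerprint or record an update to an existing one."""
        state = self._states.get(session_id)
        if state is None:
            self._states[session_id] = FingerprintState(free_cells, now)
        else:
            state.free_cells = free_cells
            state.last_updated = now

    def due(self) -> List[str]:
        """Fingerprints that are new or updated since their last presentation."""
        return [sid for sid, s in self._states.items()
                if s.last_updated > s.last_presented]

    def mark_presented(self, session_id: str, now: float) -> None:
        self._states[session_id].last_presented = now

    def remove(self, session_id: str) -> None:
        """Drop a fingerprint from the buffer, for example after classification."""
        self._states.pop(session_id, None)
```

A caller would invoke `upsert` whenever packets extend a fingerprint, present every identifier returned by `due()` to the model, and then call `mark_presented`.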


FIGURE 5: Illustration of the fingerprint management system
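The threat detection phase favours early detection: the model should earn more for a correct classification made while little of the fingerprint is filled, and lose the most for an incorrect one made equally early. The function below is one possible shaping, given purely as an assumption; the linear form, the magnitudes, and the name `detection_reward` do not come from the paper.

```python
def detection_reward(correct: bool, fraction_seen: float) -> float:
    """Hypothetical reward for one classification of a fingerprint.

    fraction_seen is the share of the fingerprint already filled (0.0 to 1.0).
    Early decisions carry more weight in both directions.
    """
    if not 0.0 <= fraction_seen <= 1.0:
        raise ValueError("fraction_seen must be between 0 and 1")
    earliness = 1.0 - fraction_seen
    magnitude = 1.0 + earliness  # assumed range: 1.0 (late) to 2.0 (early)
    return magnitude if correct else -magnitude
```

Under this sketch a correct decision at 10% of the fingerprint scores higher than a correct decision at 90%, and an incorrect decision at 10% is penalized more heavily than an incorrect decision at 90%.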

Finally, once a fingerprint has been classified by the threat detection DRL model, the result is stored in the fingerprint, appropriate actions are taken to mitigate any risk, and the fingerprint is removed from the buffer.

The purpose of the threat-detection DRL model is to correctly detect cyber threats and threat types with as little information available as possible. This refers to one of the criteria in the AIECDS methodology, which states that it should be possible to detect threats and attacks within minute sample datasets. This can be achieved by presenting fingerprints to the threat detection DRL model in incremental steps as the fingerprints are updated over time. Higher rewards should be allocated to the threat-detection DRL model for early detection, and the largest negative rewards should be allocated to early detections of an incorrect threat or threat type. This is illustrated in Figure 5. The AIECDS criterion “Detect against adversarial and unknown cyber-attacks” is achieved by learning the patterns of adversarial attacks, however indistinct they may be.

Use case: Over time, a sufficient number of network session fingerprints visually profile the boundaries between the benign and malicious network sessions. These fingerprints can then be used by DRL to learn the features that make each malicious attack type unique and to detect unknown or new cyber-attacks. The RL algorithm operates and detects threats in real time when network packets are received. The proposed solution will have a visualized view of the transmitted data irrespective of the nature of the packets sent, thereby being more resilient against temporal fluctuations. In addition, the fingerprint does not compare text, video, or music to existing formats but rather looks at what is different at the byte level. At the byte level, malicious intent is likely to be more visible because, in malware, bytes are included in the network session for malicious code. Malicious code is revealed by fingerprinting the entire network session at the lowest possible information level, which is at the byte level.

IV. EXPERIMENTAL RESULTS
A prototype environment for the use case “discovery of cyber threats in fingerprinted network sessions” was set up and applied to the UNSW-15 dataset. In total, 10240 network sessions were fingerprinted, containing both benign and malicious fingerprints.

A. MALICIOUS FINGERPRINT ANALYSIS RESULTS
Malicious fingerprints were clustered to obtain the key fingerprints representing each malware threat category in the UNSW-15 dataset. Additionally, each of the closest benign fingerprints was selected by minimizing the element-wise distance between the body of the fingerprint (protocol discourse and transmitted data) and malicious fingerprints. As shown in Figure 6, eight of the nine malicious cyber-attack categories simulated within the UNSW-15 dataset were identified. The differences between malicious fingerprints and their closest benign fingerprints are discussed separately for the transmitted data and protocol discourse.

1) MALICIOUS FINGERPRINT CLUSTERS
To structure the selection of malicious fingerprint samples for analysis, malware threat categories with more than four fingerprinted network sessions are clustered using k-means clustering. The optimal elbow was identified for each fingerprint with the smallest Euclidean distance from each cluster center. All fingerprints were selected for malware threat categories with four or fewer network-session fingerprints. The total number of malicious network sessions
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
that were fingerprinted and the key cluster fingerprint totals are shown in Figure 6.
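
The cluster-based sample selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each fingerprint is assumed to be flattened into a numeric vector, the number of clusters `k` is taken as given (the elbow analysis is not reproduced), and a small hand-written k-means stands in for whichever library was used. The names `kmeans` and `select_representatives` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda c: euclidean(point, centers[c]))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = [nearest_center(p, centers) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


def select_representatives(points, k, max_unclustered=4):
    """Indices of the fingerprints chosen for analysis in one threat category.

    Categories with four or fewer fingerprints keep every fingerprint;
    larger categories keep, per cluster, the fingerprint closest to the
    cluster center.
    """
    if len(points) <= max_unclustered:
        return list(range(len(points)))
    k = min(k, len(points))
    centers, labels = kmeans(points, k)
    chosen = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            chosen.append(min(members, key=lambda i: euclidean(points[i], centers[c])))
    return sorted(chosen)
```

In this sketch a category with six toy "fingerprints" forming two well-separated groups yields one representative per group, while a category with three fingerprints is returned whole.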

FIGURE 6: Fingerprint clustering per threat category

2) FROBENIUS DISTANCE

The Frobenius distance measure [29] was determined for each pair of fingerprints analyzed because it measures the movement between elements, thereby capturing the difference between fingerprints with similar pattern motifs. The importance of this aspect of the Frobenius distance is illustrated in Figure 7, where both images resemble the same pattern.

FIGURE 7: Frobenius distance illustration

From the example illustrated in Figure 7, the squared difference was calculated for each pair of corresponding elements in the two 64-element matrices. The square root of the sum of these squared differences is the Frobenius distance. This is 6.2 times the average element value in the example or, more simply, six additional elements in matrix 1 compared to matrix 2. This aspect is required for the comparison between fingerprints, because the DRL threat detection model considers fingerprints based on their element differences.

To determine the significance of the Frobenius distance for each fingerprint comparison, the gauge shown in Figure 8, which is based on the Frobenius distance relative to the average element value for the fingerprints, was used. The significance gauge is relevant to both the transmitted data and the protocol discourse sections.
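
The distance calculation described above can be written out as a short sketch. It assumes fingerprints are given as equally sized lists of rows, and it assumes the "average element value" is the mean over both matrices, which the text does not state explicitly. The function names are hypothetical, and the significance thresholds of Figure 8 are not reproduced.

```python
import math


def frobenius_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def relative_distance(a, b):
    """Frobenius distance expressed in multiples of the average element value.

    Assumption: the average is taken over the elements of both matrices.
    """
    values = [v for matrix in (a, b) for row in matrix for v in row]
    mean = sum(values) / len(values)
    if mean == 0:
        raise ValueError("average element value is zero; ratio is undefined")
    return frobenius_distance(a, b) / mean
```

For two 2 × 2 matrices that differ by 2 in a single element, `frobenius_distance` returns 2.0; dividing by the average element value gives the kind of ratio that the significance gauge is based on.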

FIGURE 8: Frobenius distance significance

3) RESULTS OF TRANSMITTED DATA ANALYSIS

The transmitted data section of the fingerprint consists of a grid of 128 × 128 elements that can range from zero to 255, giving 4194304 (128 × 128 × 256) element-value combinations and a practically unlimited number of possible fingerprints. Although the space of possible fingerprints is practically unlimited, it is clear from the analysis that only a limited number of patterns occur. The focus here is to present the unique patterns discovered within the transmitted data, results of the similarity analysis between the malicious and closest benign fingerprints, and broader findings uncovered during the analysis. An evaluation guide (see Figure 13) was used to identify unique transmitted data patterns from the analysis results. Visual inspection of the fingerprints revealed 13 unique repeated patterns. The 13 different pattern types are shown in Figure 9.
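
The size of the transmitted data space can be checked with a few lines of arithmetic. This sketch only illustrates where the 4194304 figure quoted above comes from and how large the full space of grids is; the function names are hypothetical.

```python
import math


def element_value_combinations(rows, cols, levels):
    """Number of (element position, value) pairs in the grid."""
    return rows * cols * levels


def fingerprint_count_digits(rows, cols, levels):
    """Number of decimal digits in levels ** (rows * cols), the count of distinct grids.

    Computed with logarithms to avoid printing a very large integer.
    """
    return math.floor(rows * cols * math.log10(levels)) + 1


print(element_value_combinations(128, 128, 256))  # 128 * 128 * 256
print(fingerprint_count_digits(128, 128, 256))    # digits in 256 ** 16384
```

The number of distinct grids is finite but far too large to enumerate, which is why the observation that only a handful of patterns recur is useful.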

FIGURE 9: 13 unique patterns within transmitted data


The results of the transmitted data pattern analysis are presented in Table 1. The malicious and closest benign fingerprints were located in the same row to easily compare their pattern similarities. The columns include the row number (#), threat category, protocol, port, and pattern guide results for malicious fingerprints, and the row number (#), protocol, port, and pattern guide results for the closest benign fingerprints. Additionally, the Frobenius distance measure, mean, standard deviation, and significance were included.


TABLE I: Transmitted data analysis results

The investigation of each threat type in Table 1 revealed interesting observations. For example, consider the shellcode and reconnaissance threat types. With reference to the shellcode threat type, both fingerprints 5.1 and 5.2 have the same transmitted data pattern. However, even though the transmitted data patterns were similar, the distance between the two was 1128 points. By contrast, fingerprints 6.1 and 6.2 do not have overlapping patterns, yet their distance is smaller (907 points) owing to the small size of the transmitted data shape. The significance of these differences ranges from moderate to significant. The malicious reconnaissance and the closest benign fingerprints match completely in the transmitted data patterns. In addition, their fingerprints (30 to 34) had the smallest distances, with an average of 139 points and a standard deviation of 28 points. The significance of these differences was low throughout.

Overall, the most dominant pattern for both malicious and benign fingerprints was Pattern 9 (used 20 times for malicious fingerprints and 13 times for benign fingerprints). The second most dominant pattern was Pattern 7 for malicious fingerprints and Pattern 10 for benign fingerprints. Patterns 2 and 3 are frequently used by benign fingerprints but only once by a malicious fingerprint, whereas Pattern 4 is frequently used by malicious fingerprints but only once by a benign fingerprint. Finally, Pattern 11 is used by only one malicious fingerprint. The overall pattern analysis is shown in Figure 10.


FIGURE 10: Overall pattern analysis

It is clear from the transmitted data similarity analysis that the proposed solution provides a framework for identifying meaningful differences between malware and benign network sessions, and between malware threat categories. Not a single malware transmitted data section was identical to its closest benign transmitted data section, that is, none had zero distance. All malware threat types, including malware categories that were undetectable in the UNSW-15 simulation (backdoor, shellcode, and worm), showed differences in patterns that could make these malware threats detectable using less complex algorithms. Even reconnaissance malware with the smallest differences, which in the UNSW-15 simulation had the smallest detection ratio of 0.2%, had a consistent difference that could aid in the discovery of these threats. In addition, seven malicious fingerprints (7, 8, 9, 10, 11, 21, and 23) shared their closest benign fingerprints (two unique fingerprints) with other fingerprints, further indicating the advancement of the proposed solution and its promising effectiveness in widening the decision boundary between malware and benign classification. Therefore, visual fingerprints can be developed for the transmitted data to differentiate between malicious and benign fingerprints.

4) RESULTS OF PROTOCOL DISCOURSE ANALYSIS

The protocol discourse section of the fingerprint consists of 128 values that range from -1500 to 1500, giving 384000 (128 × 3000) value combinations and a practically unlimited number of possible sequences. From this analysis, it is clear that there are set packet ranges and phases that together form patterns. The focus of this subsection is to present the unique patterns, analysis of the protocol discourse for a few selected malicious and closest benign fingerprints, and the broader findings uncovered during the analysis.

Two different pattern guides were used in the Protocol Discourse Results section. The first is a packet pattern guide that focuses on packet sizes and phases of engagement, whereas the second is a setup-phase packet length and sequence analysis for specific ports with repeating setup-phase patterns.

The following evaluation guide was used to identify unique patterns within the protocol discourse comprising the packet sizes of the phases. The guide for the different types of patterns is shown in Figure 11.


Guide for packet size:
Large packets: 750 < length <= 1500 (6 in the example)
Medium packets: 250 < length <= 750 (2 in the example)
Small packets: length <= 250 (24 in the example)

Guide for packet phases:
Setup phase: each network session starts with a setup phase where the session is initiated.
Transfer phase: the transfer phase starts when large packets are transmitted.
Teardown phase: the teardown phase starts after there are no longer any large packets or acknowledgements for large packets.
FIGURE 11: Protocol discourse packet guide

Three different phases were identified that corresponded to the typical flow of information and sequence of events in a network session: setup, transfer, and teardown. In the example shown in Figure 11, there are ten small packets in the setup phase. Six large and four small packets were transmitted in the transfer phase, and the teardown phase contained ten small packets and two medium packets.

To illustrate and reveal patterns within the repeating setup-phase sequences, the following guide (Figure 12) can be used to interpret the annotations, indicating how different sequences were combined.
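
The packet-size bands and phases of Figure 11 can be expressed as a small classifier. This is a simplified sketch, not the authors' code: it assumes the sign of a protocol discourse value does not affect the size class (so the absolute value is used), and it places the teardown boundary directly after the last large packet, ignoring any trailing acknowledgements. The function names and the sample sequence are hypothetical.

```python
from collections import Counter


def size_class(length):
    """Map a packet length to the small / medium / large bands of Figure 11."""
    size = abs(length)  # assumption: the sign is not part of the size
    if size <= 250:
        return "small"
    if size <= 750:
        return "medium"
    return "large"


def split_phases(lengths):
    """Split a session into setup, transfer, and teardown by large packets."""
    classes = [size_class(n) for n in lengths]
    large = [i for i, c in enumerate(classes) if c == "large"]
    if not large:
        # Without large packets the session never leaves the setup phase.
        return {"setup": classes, "transfer": [], "teardown": []}
    first, last = large[0], large[-1]
    return {
        "setup": classes[:first],
        "transfer": classes[first:last + 1],
        "teardown": classes[last + 1:],
    }


def phase_counts(lengths):
    """Count packets per size class within each phase."""
    return {phase: dict(Counter(c)) for phase, c in split_phases(lengths).items()}


# Hypothetical session shaped like the Figure 11 example.
session = [100] * 10 + [1400, 60, 1400, 60, 1400, 60, 1400, 60, 1400, 1400] + [80] * 10 + [500, 500]
print(phase_counts(session))
```

The sample session is constructed to match the counts quoted for Figure 11: ten small packets in setup, six large and four small in transfer, and ten small plus two medium in teardown.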

[Figure 12 panels: "The whole sequence", "Partial lengths", and "Partial sequence". Each panel shows large, medium, and small packets. Annotations: whole length; partial length (only packets with this additional length); partial sequence (packets only included in these sequences).]

FIGURE 12: Protocol discourse annotation guide

In this example, three different protocol discourse setup phase sequences are overlaid onto one illustration on the left side of Figure 12. Using the annotation guide for the whole sequence, partial lengths, and partial sequences, three different sequences were identified, as depicted on the right-hand side of Figure 12.

5) PROTOCOL DISCOURSE PATTERN ANALYSIS

Table 2 presents the results of protocol discourse pattern analysis. The malicious and closest benign fingerprints are located in the same row to easily compare their pattern similarities. The columns include row numbers (#), threat categories, protocols, ports, and pattern guide results for malicious fingerprints and row number (#), protocol, port, and pattern guide results for the closest benign fingerprints. In addition, the sum of differences, Frobenius distance measure, mean, standard deviation, and significance are included. The sum of the differences was included to highlight the overall differences, based on the protocol discourse guide.


TABLE II: Protocol discourse analysis results

The following are the conclusions from the results of the analysis in Table 2.

• In the section on reconnaissance, it is clear that there is no difference between reconnaissance malware fingerprints and their closest benign fingerprints because all packets have the same sequence and packet size, except for fingerprint 33. In fingerprint 33, the malicious fingerprint has the same number of transmitted packets, but the sequence is different, leading to a distance of 226 points. This roughly aligns with the detection efficiency of 0.2% for the UNSW-15 dataset. In addition, all reconnaissance network sessions used port 111, which is used for remote procedure calls.
• In the backdoor and shellcode sections, only small packets were exchanged and remained in the setup phase. There were differences in the number of packets exchanged in fingerprints 1, 3, 4, and 6, and all distances were nonzero. The average distances were 232 and 125 points with standard deviations of 56 and 167 points, respectively. The significance of the differences ranged from minor to moderate, except for fingerprint 5, which had minuscule significance.

From the results in Table 2, the protocol discourse analysis identified 14 fingerprints with no packet difference. These had an average distance of 288 points, compared with an average distance of 2473 points for the fingerprints with one or more differences (20 of 34). Only four fingerprints had zero distances and zero packet differences, which were reconnaissance fingerprints (30, 31, 32, and 34), because the packet sequence and flags matched exactly.

In summary, except for four reconnaissance fingerprints (30, 31, 32, and 34), protocol discourse data and visual fingerprints can aid in differentiating between malicious and benign fingerprints.

V. CONCLUSION AND FUTURE WORK

The AIECDS methodology discussed in this paper includes guidelines for the development of AI-enhanced cyber-defense systems. The focus was on extracting meaningful data and producing visualized fingerprints. This was achieved through the design of a fingerprint that enabled the discovery of hidden patterns. Visually comparing malicious fingerprints with the closest benign fingerprints demonstrates a significant improvement in detecting malicious threats. Furthermore, the use of fingerprinted data and data visualization in cyber-defense systems can significantly reduce the complexity of the decision boundary and simplify the machine-learning models required to improve detection efficiency, even for malicious threats with minuscule sample datasets.

Therefore, the contribution of this study is the improvement in the development of AI-enhanced cyber-defense systems. Furthermore, the application of AIECDS methodology is illustrated through a use case study for the
discovery of cyber threats using fingerprinted data and visualized network sessions.

REFERENCES
[1]. European Union Agency for Cybersecurity (ENISA), 2022. ENISA threat landscape report: July 2021 to July 2022.
[2]. Gidi, C., & Skybox Security, 2022. Vulnerability and Threat Trends Report 2022.
[3]. Police, A. F., & Australian Criminal Intelligence Commission (ACSC), 2022. ACSC Annual cyber threat report: July 2021 to June 2022.
[4]. Sobers, R., 2022. 89 Must-know data breach statistics 2022. Varonis, May 2022. https://www.varonis.com/blog/cybersecurity-statistics, Accessed 29 June 2022.
[5]. Moustafa, N. and Slay, J., 2015, November. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 military communications and information systems conference (MilCIS) (pp. 1-6). IEEE.
[6]. Shaukat, K., Luo, S., Varadharajan, V., Hameed, I.A. and Xu, M., 2020. A survey on machine learning techniques for cyber security in the last decade. IEEE access, 8, pp.222310-222354.
[7]. Kaloudi, N. and Li, J., 2021. The AI-based cyber threat landscape: A survey. ACM Computing Surveys (CSUR), 53(1), pp.1-34.
[8]. Nguyen, T.T. and Reddi, V.J., 2021. Deep Reinforcement Learning for Cyber Security. IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 3779–3795, Aug. 2021.
[9]. Kilincer, I.F., Ertam, F. and Sengur, A., 2021. Machine learning methods for cyber security intrusion detection: Datasets and comparative study. Computer Networks, 188, p.107840.
[10]. Bowles, J.K.F., Silvina, A., Bin, E., Vinov, M., 2020. On Defining Rules for Cancer Data Fabrication. Lecture Notes in Computer Science, vol 12173, pp.168-176.
[11]. Lu, Y., Shen, M., Wang, H., Wang, X., van Rechem, C. and Wei, W., 2023. Machine learning for synthetic data generation: a review. arXiv preprint arXiv:2302.04062.
[12]. Shaukat, K., Luo, S., Varadharajan, V., Hameed, I.A., Chen, S., Liu, D. and Li, J., 2020. Performance comparison and current challenges of using machine learning techniques in cybersecurity. Energies, 13(10), p.2509.
[13]. Alshaibi, A., Al-Ani, M., Al-Azzawi, A., Konev, A. and Shelupanov, A., 2022. The comparison of cybersecurity datasets. Data, 7(2), p.22.
[14]. Wu, A., Wang, Y., Shu, X., Moritz, D., Cui, W., Zhang, H., Zhang, D. and Qu, H., 2022. Ai4vis: Survey on artificial intelligence approaches for data visualization. IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 12, pp. 5049–5070, Dec. 2022.
[15]. Threat hunter team, 2022. Daxin Backdoor: In-Depth Analysis, Part One. Symantec posted on 8 Mar 2022, https://symantec-enterprise-blogs.security.com/blogs/threat-intelligence/daxin-malware-espionage-analysis, Accessed 20 December 2022.
[16]. Hu, W. and Tan, Y., 2022, November. Generating adversarial malware examples for black-box attacks based on GAN. In International Conference on Data Mining and Big Data (pp. 409-423). Singapore: Springer Nature Singapore.
[17]. Kirat, D., Jang, J. and Stoecklin, M.P., 2018. DeepLocker: Concealing Targeted Attacks with AI Locksmithing. Black Hat USA, https://i.blackhat.com/us-18/Thu-August-9/us-18-Kirat-DeepLocker-Concealing-Targeted-Attacks-with-AI-Locksmithing.pdf, Accessed on 27 July 2024.
[18]. Ahmad, Z., Shahid Khan, A., Wai Shiang, C., Abdullah, J. and Ahmad, F., 2021. Network intrusion detection system: A systematic study of machine learning and deep learning approaches. Transactions on Emerging Telecommunications Technologies, 32(1), p.e4150.
[19]. Arshad, K., Ali, R.F., Muneer, A., Aziz, I.A., Naseer, S., Khan, N.S. and Taib, S.M., 2022. Deep Reinforcement Learning for Anomaly Detection: A Systematic Review. IEEE Access.
[20]. Hsu, Y.F. and Matsuoka, M., 2020, November. A deep reinforcement learning approach for anomaly network intrusion detection system. In 2020 IEEE 9th international conference on cloud networking (CloudNet) (pp. 1-6). IEEE.
[21]. Aslan, Ö.A. and Samet, R., 2020. A comprehensive review on malware detection approaches. IEEE access, 8, pp.6249-6271.
[22]. Malialis, K., 2014. Distributed reinforcement learning for network intrusion response (Doctoral dissertation, University of York).
[23]. Duddu, V., 2018. A survey of adversarial machine learning in cyber warfare. Defense Science Journal, 68(4), p.356.
[24]. Klopper, C. and Eloff, J., 2023, February. Fingerprinting Network Sessions for the Discovery of Cyber Threats. In International Conference on Cyber Warfare and Security (18(1), pp. 171-180).
[25]. Heaney, C., Li, Y., Matar, O., Pain, 2024. Applying Convolutional Neural Networks to data on unstructured meshes with space-filling curves. Neural Networks, 175: 106198 (2024)
[26]. Keim, D.A., 1996. Pixel-oriented database visualizations. ACM Sigmod Record, 25(4), pp.35-39.
[27]. Yi, L., Xiaochong, T., Dali, W., Chunping, Q., He, L. and Youwei, Z., 2023. W-Hilbert: A W-shaped Hilbert curve and coding method for multiscale geospatial data index. International Journal of Applied Earth Observation and Geoinformation, 118, p.103298
[28]. Eschenbach, T.G., 2006. Constructing tornado diagrams with spreadsheets. The Engineering Economist, 51(2), pp.195-204.
[29]. Weisstein, E. 2024. Frobenius norm. Wolfram research. https://mathworld.wolfram.com/FrobeniusNorm.html, Accessed 25 July 2024.


PROFESSOR DR. JAN ELOFF is appointed as a full Professor in Computer Science at the University of Pretoria, South Africa. From 2016 to 2021 he was appointed as Deputy Dean Research and Postgraduate studies and in 2022 as Acting Dean in the Faculty of Engineering, Built Environment and IT at the University of Pretoria. From 2008 to 2015 he was appointed as Research Director for SAP Research in Africa. He holds a B2 rating from the National Research Foundation in South Africa indicating that he receives considerable international recognition for his research in safeguarding platforms against societal and organisational cyber-threats. He is also a leading international scholar in conducting research in the convergence of Cyber-security and AI. He has published widely in leading international journals. In 2018 he published a scholarly book on Software Failure Investigations. He is an associate editor of Computers & Security, the world’s leading journal for the advancement of Computer Security. He is the co-inventor of a number of patents registered in the USA. Jan is a member of the governing and advisory board of the International Knowledge Centre for Engineering Sciences and Technology (UNESCO(IKCEST)) in China. During his research career he represented South Africa as an expert on Technical Committee 11 (Information Security) IFIP and was a recipient of the IFIP Silver Core and Outstanding Services Award. He also served as the South African representative on ISO (the International Standards Organisation) and as a former president of the South African Institute of Computer Scientists and Information Technologists (SAICSIT). In 2017, he received a SAICSIT award recognising him as an individual who has played a pioneering role in promoting computer science and information technology as academic disciplines in South Africa. In 2020 he received the Chancellor’s Medal for Research from the University of Pretoria and in 2021 he was listed as a finalist for the NSTF Lifetime award for exemplary life-long research in Cybersecurity.

Christiaan Klopper is a master’s student in IT focusing on big data science at the University of Pretoria, SA. He received his BEng in electronic engineering from the University of Pretoria in 2010. His main research areas are data science, big data analytics and developing a self-learning cyber defence system that can discover undetectable threats.
