A Coverage-Guided Fuzzing Method for Automatic Software Vulnerability Detection Using Reinforcement Learning-Enabled Multi-Level Input Mutation
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3421989
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2023.0322000
ABSTRACT Fuzzing is a popular and effective software testing technique that automatically generates
or modifies inputs to test the stability and vulnerabilities of a software system, which has been widely
applied and improved by security researchers and experts. The goal of fuzzing is to uncover potential
weaknesses in software by providing unexpected and invalid inputs to the target program to monitor its
behavior and identify errors or unintended outcomes. Recently, researchers have also integrated promising
machine learning algorithms, such as reinforcement learning, to enhance the fuzzing process. Reinforcement
learning (RL) has been proven to be able to improve the effectiveness of fuzzing by selecting and prioritizing
transformation actions with higher coverage, which reduces the required effort to uncover vulnerabilities.
However, RL-based fuzzing models also encounter certain limitations, including an imbalance between
exploitation and exploration. In this study, we propose a coverage-guided RL-based fuzzing model that
enhances grey-box fuzzing, in which we leverage deep Q-learning to predict and select input variations
to maximize code coverage and use code coverage as a reward signal. This model is complemented by
simple input selection and scheduling algorithms that promote a more balanced approach to exploiting and
exploring software. Furthermore, we introduce a multi-level input mutation model combined with RL to
create a sequence of actions for comprehensive input variation. The proposed model is compared with other fuzzing tools on various real-world programs, and the results show that our solution achieves notable improvements in code coverage, discovered paths, and execution speed.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
Van-Hau Pham et al.: A Coverage-guided Fuzzing for Software Vulnerability Detection using RL-enabled Multi-Level Input Mutation
thorough examination at the cost of greater complexity, and grey-box providing an effective middle ground that combines ease of implementation with the potential for deep vulnerability detection.

More specifically, grey-box fuzzing, particularly when augmented with code coverage metrics, represents a prominent methodology within the fuzzing landscape [6]. This approach necessitates access to the executable files of the target application, which are then subjected to fuzzing procedures within a specialized environment. Its widespread adoption can be attributed to its applicability to closed-source software, eliminating the need for source code access. By harnessing insights from code coverage data garnered during the execution of the program, grey-box fuzzing enhances its efficacy [7]. Fuzzing frameworks employing this technique initiate the execution of the program using various test inputs while deploying sophisticated mechanisms to capture detailed code coverage information. This crucial data serves as the basis for evaluating and refining test input criteria, with the overarching goal of augmenting the fuzzing process’s effectiveness. Through this iterative optimization based on dynamic execution feedback, grey-box fuzzing with code coverage insight emerges as a powerful tool in identifying software vulnerabilities [8].

The fusion of machine learning (ML) with fuzzing methodologies signifies a groundbreaking advancement in the detection of software vulnerabilities, offering a sophisticated, intelligent approach to security assessments. This integration empowers fuzzing tools with the ability to learn from previous iterations, refining and targeting their search for flaws more effectively. Machine learning algorithms analyze patterns from past fuzzing activities to enhance the generation of test inputs, focusing on areas more likely to reveal critical vulnerabilities. This not only increases the efficiency of the fuzzing process by prioritizing high-risk code paths but also enables the adaptation of fuzzing strategies in real time, optimizing the exploration of complex software environments. The result is a more nuanced, context-aware approach to vulnerability detection that significantly reduces the time and computational resources required, marking a substantial leap forward in the field of software security.

Reinforcement learning (RL), a subset of ML characterized by an agent learning to make decisions through trial and error to achieve a specific goal, has emerged as a potent tool for enhancing fuzzing techniques in the generation of new test cases. By applying RL principles, fuzzing frameworks can dynamically adjust their strategies based on the outcomes of previously executed test cases, effectively learning which types of inputs are more likely to induce anomalies or reveal vulnerabilities in the software under test. This integration leverages the outcomes of previously executed test cases to inform the creation of new inputs, focusing on those more likely to uncover vulnerabilities by navigating unexplored or less-tested paths in the application’s codebase. The application of RL not only enhances the efficiency of the fuzzing process by learning from past interactions but also expertly balances the concepts of exploitation and exploration, two key elements in effective fuzzing. Exploitation delves deep into the program to reach critical code segments, while exploration aims for broad branch coverage to ensure no potential fault is overlooked. Achieving a harmonious balance between these two approaches is essential for optimizing the fuzzing model’s effectiveness, thereby leading to quicker and more comprehensive identification of software vulnerabilities [9]. This evolution in fuzzing marks a significant milestone in automated software testing, offering a smarter, more adaptive, and outcome-focused method for detecting and addressing security flaws.

Current approaches merging RL with fuzzing, such as [10], [7], are intensely focused on developing the algorithms, states, rewards, and parameters specific to RL while overlooking critical factors like the balance between exploitation and exploration. This emphasis leads to RL-based fuzzing models that concentrate excessively on a single code branch, thereby missing potential errors in other branches. These models often give priority to selecting mutation actions without incorporating effective mechanisms for the selection and scheduling of inputs, setting them apart from conventional fuzzing tools. Moreover, there is a significant gap in research offering a comparative analysis of RL models in fuzzing against the efficiency, strengths, and weaknesses of modern fuzzing tools.

To a certain point, fuzzing techniques can be classified into exploitation and exploration techniques [11]. Therein, exploitation and exploration stand as pivotal concepts in the fuzzing process, with exploitation denoting the capacity of test cases to penetrate deeply and access code segments buried within the program [12], [13]. Conversely, exploration pertains to test cases achieving extensive branch coverage. A fuzzing model with an excessive focus on exploration may not generate test cases that effectively pinpoint faulty code segments within the program. Similarly, a model overly concentrated on exploitation might only aim to reach the deepest branch within a program, potentially overlooking faults in other branches. Therefore, finding an optimal balance between exploitation and exploration is essential for the development of an efficient fuzzing model. This balance ensures that the model can effectively uncover faults across different branches of the code, maximizing the potential for identifying vulnerabilities within the software [9], [11], [14].

Hence, in our research, we introduce an innovative fuzzing model guided by coverage metrics, leveraging RL to enhance input selection and scheduling. This approach aims to meticulously balance the exploration of new paths and the exploitation of deeper, potentially vulnerable segments within a program. Furthermore, our model incorporates a novel multi-level input mutation mechanism, designed to synergize with RL. This mechanism facilitates the generation of mutations via a series of deliberate actions, enabling a more granular and targeted exploration of the software’s attack surface. To validate the effectiveness and efficiency of our proposed model, we undertake a comprehensive comparative
2 VOLUME 11, 2023
the model’s performance. Researchers focus on minimizing redundant transformation efforts, quickly selecting accurate transformations to increase code coverage, thereby enhancing the fuzzing process’s efficiency.
• Post-fuzzing analysis: In certain programs, the number of false positives found during fuzzing can be high, or duplicate vulnerabilities might be discovered. This can be cumbersome for manual analysts. Therefore, researchers also address this by implementing algorithms to detect duplicates, evaluating the return of detected vulnerabilities, or employing machine learning algorithms to score the exploitability of vulnerabilities.

B. TRADITIONAL IMPROVEMENT TECHNIQUES IN FUZZING
These are traditional improvement methods for fuzzing, each with its advantages and drawbacks. Ji Tiantian et al. introduced AFLPro in [15], enhancing input selection and scheduling by combining static analysis with a basic block synthesis model. The goal is to prioritize inputs that reach code segments less explored previously, aiming to increase the coverage of deeper program areas. Tai Yue et al. proposed EcoFuzz [9], modeling input scheduling as a Multi-Armed Bandit problem and presenting a variant of the Adversarial Multi-Armed Bandit model to improve it. The common idea of both techniques is to enhance input selection and scheduling, prioritizing inputs predicted to have a higher likelihood of containing errors and leading to better code coverage. However, the issue with using static analysis techniques always lies in the lack of complete runtime data, which often produces low-accuracy or false-positive results. Additionally, it might not work effectively on applications utilizing code obfuscation or packing mechanisms. Peng Chen et al. created Matryoshka [16], using taint analysis to solve conditional statements and penetrate deeper into the program. However, this technique demands significant resources and slows down the fuzzing process, while taint analysis also faces challenges with under-tainting and over-tainting. H. Zhang et al. proposed a lightweight and convenient mechanism to surpass input checks by combining static analysis with mutating key bytes in InsFuzz [17]. They identify bytes influencing conditional statement results and then mutate them. However, since it also employs static analysis, it is still subject to limitations like the techniques mentioned above and requires modifying the executable files, causing instability in applications with integrity checks.

C. THE COMBINATION OF ARTIFICIAL INTELLIGENCE INTO FUZZING
Recently, with the explosive development of artificial intelligence, machine learning techniques have also been applied by researchers to enhance the fuzzing process. Surveys [12], [18] indicate that the application of machine learning techniques to fuzzing is diverse and creative, yielding promising results. The steps typically addressed by artificial intelligence include (1) input selection, (2) input scheduling, (3) input generation, and (4) mutation action selection. RapidFuzz [19] and CGFuzzer [20] employ Generative Adversarial Networks (GANs) to learn the structure of complex inputs, aiming to generate higher-similarity patterns for fuzzing protocols or specific file formats. This approach helps save time by avoiding mutating invalid samples and increasing the likelihood of passing structure checks. NeuFuzz by Wang, Y. [21] models the bug-finding process akin to natural language processing, utilizing deep-learning Long Short-Term Memory (LSTM) networks to learn the structure of error-containing paths, predicting which paths are more likely to have errors and prioritizing them for input scheduling.

RL approaches were first applied to fuzzing in 2018 by Böttinger et al. [22]. They transformed the fuzzing problem into an RL problem, where the selection of the next mutation action is analogous to choosing the next move in a chess game. Although an optimal strategy may exist, searching for optimal actions is performed using the deep Q-learning algorithm. However, their proposed model was specifically designed for PDF files, lacking objective results when compared to modern fuzzing tools and not addressing the balance between exploitation and exploration. A. Kuznetsov et al. also employed deep Q-learning to select mutation actions for application testing [23]. They demonstrated that combining RL can reduce the time needed to create expected test cases by up to 30%, yet their evaluation method does not suit real-world applications. S. Reddy et al. improved mutation action selection using the Monte Carlo Control algorithm, creating more valid samples for applications with complex input structures [24]. The results enhanced the rate of passing structure checks for samples. However, their model skewed towards exploitation rather than exploration, focusing on generating diverse inputs with similar features instead of exploring new behaviors. Li, X. et al. introduced Reinforcement Compiler Fuzzing [25], also utilizing deep Q-learning for mutation action selection at the compilation level. Nevertheless, their implementation requires source code to work effectively. Drozd et al. combined Deep Double Q-learning to select mutation actions and accelerate libFuzzer [26]. However, they acknowledged that it is not sufficient and that further enhancements are needed in terms of input selection and filtering.

Zheng Zhang and colleagues proposed rlfuzz [10], a method to balance exploitation and exploration in a deep Q-learning fuzzing model by randomly selecting trial inputs for subsequent transformations when the model does not experience an increase in code coverage. However, this selection method is not yet optimal, as inputs with low code coverage in the queue still have an equal chance of being transformed as those with higher potential. In a different approach, Wang Jinghan et al. [7] utilized RL for input scheduling rather than selecting mutation actions like other studies. They proposed a multi-level code coverage model to enhance fuzzing detail and introduced a scheduling mechanism to support this multi-level code coverage model using RL. The results showed a balance between exploitation and exploration in the generated test cases, but there was no significant improvement in selecting more effective mutation actions.

Current combined RL and fuzzing solutions tend to focus heavily on designing RL algorithms, states, rewards, and parameters without considering factors like the balance between exploitation and exploration. This leads to RL fuzzing models delving deep into one code branch, missing opportunities to find vulnerabilities in other branches. Moreover, the focus often lies solely on mutation action selection, without mechanisms for effective input selection and scheduling that are crucial for real-world fuzzing tools. Furthermore, no RL fuzzing study provides a comparative perspective on performance, strengths, and weaknesses compared to modern fuzzing tools. Recognizing the exploitation-exploration imbalance of combined RL fuzzing models, this issue could be mitigated by combining effective input selection and scheduling algorithms, a topic that has been extensively studied in traditional fuzzing. Our work investigates and proposes an RL-based, coverage-guided fuzzing model to address these weaknesses by integrating it with an effective input selection and scheduling algorithm. Additionally, we propose a multi-level input transformation algorithm that can be applied to RL-based fuzzing models, coupled with a waste reduction mechanism to improve model efficiency.

III. METHODOLOGY
A. THE ARCHITECTURE OF CTFUZZ
Inherited from the fuzzing model using RL named rlfuzz [10], our CTFuzz is designed with four main components as in Fig. 2. The main improvements of our CTFuzz compared to its predecessor include replacing the random selection approach with a balanced input selection and scheduling algorithm, which ensures prioritization of inputs with higher code coverage. Additionally, we have implemented a multi-level input transformation model that can be combined with RL to enhance the long-term performance of the fuzzing model, especially when multiple consecutive actions are needed to transform inputs effectively.

Each component in CTFuzz is responsible for different tasks to enhance the effectiveness of the fuzzing process.
1) Seed selection and schedule algorithm: From a queue of inputs, the algorithm ranks them based on the current coverage and prioritizes inputs with higher coverage, while allocating the number of trials to the model.
2) RL model: This model receives inputs and selects the appropriate action to mutate them, aiming to predict the best coverage improvement.
3) Multi-level input mutation: This component receives the input from (1) and the action from (2) to perform a mutation on that input, generating a list of test cases, which are then fed into (4) to obtain results. Multi-level input mutation and the early stopping mechanism are applied at this stage.
4) Coverage observer: This component takes the responsibility of executing inputted test cases on the target program to obtain results.

B. INPUT MUTATION WITH RL
When applying RL to chess, the state of the RL model can be considered as the positions of the pieces on the chessboard, the actions as selecting the next move, and the reward as the advantage gained after making the move (chess computer programs have algorithms to evaluate advantages based on the chessboard position, which we will not elaborate on here). Researchers can then apply RL techniques to train the model to find the optimal moves by playing numerous chess games and accumulating experience (rewards). One notable example is AlphaZero, developed by DeepMind, which used a combination of RL and deep learning to play chess automatically. It improved its chess abilities by playing thousands of games against itself, accumulating experience, and has become one of the strongest chess-playing tools in the world.

Similarly, we can apply the same concept to fuzzing, where the test inputs take the role of the positions of the chess pieces and become the input for the RL model. Selecting transformation actions is similar to choosing the next move on the chessboard. Meanwhile, code coverage is analogous to the advantage on the chessboard and becomes the reward for the RL model. Consequently, we can train the RL model to select the best transformation actions for each input based on the accumulated experience. The reward mechanism does not necessarily depend solely on code coverage but can incorporate other factors based on what we want to improve in the model, such as execution time, length of test cases, or their combinations.

Based on a similar concept, we model the fuzzing process as a Markov decision problem, where the amount of newly discovered code coverage is considered as the reward, aiming for the model to seek new code portions within the program. The detailed definitions of states, actions, and rewards in our proposed RL model are further elaborated in the subsequent sections.

1) State
For the state space, we represent all test inputs as byte arrays with a maximum length of 65,536 (0x10000) bytes, as in Fig. 3. Bytes in each input are converted into values in the range of [0, 255]. Such a state representation limits the maximum length of the test input to 65,536 bytes, since a too-large state would slow down the processing of the RL model. With the initial byte length, based on the chosen mutation action, the input values are modified accordingly by adding bytes, truncating bytes, or permuting bytes (new states). Hence, our aim is for the fuzzing model to select actions based solely on the mutated input and choose the best mutation action based on experience.

2) Action
In RL, actions are taken to interact with an environment and achieve specific goals. In the context of fuzzing, the actions are the mutations applied to the test inputs that are fed into the target programs. Our RL model includes 9 mutation actions on the bytes of the test input, inspired by libFuzzer [27], due to its independence and ease of implementation compared to other fuzzing engines. The detailed definition of these actions is provided in Table 1.

TABLE 1. Actions for mutation

Action                        Description
Mutate_EraseBytes             Erase bytes
Mutate_InsertByte             Insert a byte
Mutate_InsertRepeatedBytes    Insert a sequence of bytes
Mutate_ChangeByte             Change a byte
Mutate_ChangeBit              Change a bit of a byte
Mutate_ShuffleBytes           Shuffle the order of bytes within a range
Mutate_ChangeASCIIInteger     Change an ASCII integer
Mutate_ChangeBinaryInteger    Change a binary integer
Mutate_CopyPart               Perform byte copy and insertion

3) Reward
One of the most crucial steps in designing an RL model is devising the reward mechanism, as it significantly impacts the performance and outcomes of the RL model. Thus, the reward mechanism needs to be well designed and aligned with the desired goals of the model. In the context of our model during fuzzing operations, our aspiration is for the model to maximize code coverage, aiming to explore code regions that have not been touched before. Consequently, new or increased code coverage becomes the criterion within our reward mechanism. Particularly, considering the target program as a graph with various code blocks as nodes and their relationships as edges, we aim to discover as many edges as possible. The number of newly explored edges plays the role of an indication of increased coverage and should result in a higher reward. Hence, given that total_new_coverage is the number of newly discovered edges and energy is the number of attempts used, the reward in our RL model is defined as in (1).

reward = total_new_coverage / energy    (1)

This reward mechanism is designed to help optimize the discovery of new code segments for the model, as only those segments truly make an impact, not the high code coverage on previously discovered parts. It also ensures that the goal of our model is to prioritize the ability to find inputs that contribute to new code coverage, rather than achieving the highest coverage on each input. In addition, using the number of attempts in the reward is for compatibility with the input scheduling mechanism described in Section III-C, where the number of trials is distributed differently from action to action, and the averaging approach ensures fairness for inputs with few trials.
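To make the reward in (1) concrete, the computation can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes coverage is reported as a set of edge identifiers, and the names seen_edges, run_edges, and coverage_reward are ours.

```python
def coverage_reward(seen_edges: set, run_edges: set, energy: int) -> float:
    """Reward from Eq. (1): newly discovered edges divided by the
    number of mutation attempts (energy) spent on the seed."""
    total_new_coverage = len(run_edges - seen_edges)  # edges never seen before
    seen_edges |= run_edges                           # remember them for later rounds
    return total_new_coverage / energy

seen = {("a", "b"), ("b", "c")}                 # edges discovered so far
new_run = {("b", "c"), ("c", "d"), ("d", "e")}  # edges hit by this batch of trials
r = coverage_reward(seen, new_run, energy=4)    # 2 new edges over 4 attempts
```

Note how re-hitting the already-known edge ("b", "c") contributes nothing: only genuinely new edges raise the reward, which matches the stated goal of rewarding new coverage rather than high coverage on known code.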
Algorithm 1 Seed selection and schedule algorithm ineffective waste-prevention mechanism, a value that is too
Require: Seed queue Q small can cause the model to ignore significant opportunities
Ensure: List of seed and mutation energy in order for new code coverage in subsequent trials. Thus, a balanced
1: function SEED_SELECTION_AND_SCHEDULE(Q) value for M is essential, striking a trade-off for the model’s
2: Q′ = Q performance. Moreover, the efficacy of this choice also de-
3: sort(Q′ ) // Order seeds in Q′ by coverage pends on the values of the two constants total_energy and
4: n = len(Q′ ) min_energy used in the trial allocation algorithm discussed
5: for i ← 0 to n − 1 do in Section III-C. Particularly, though adjusting M based on
6: Q′ [i].energy = max(total_energy ∗ (n − the allocated number of trials might yield better results, in
i)/SAP(n), min_energy) our model, we opt for simplicity by using a fixed value for
7: end for M throughout the process. The mechanism can be outlined in
8: return Q′ pseudocode form, as shown in Algorithm 3.
E. COVERAGE OBSERVER
1) Multi-level input mutation
In coverage-guided fuzzing models, a crucial step is the ex-
Many modern fuzzing tools like AFLplusplus [28] have im-
traction of code coverage information during program execu-
plemented the idea of multi-level mutation on input, where the
tion, significantly impacting the model’s effectiveness. In our
fuzzing model needs to execute more than one consecutive
work, this process needs to be rapid, precise, and stable, espe-
action to discover new “noteworthy” samples. Considering
cially when dealing with numerous test cases and independent
this as an essential mechanism to enhance the effectiveness
of access to the source code of the target program.
of the fuzzing process, we designed a simplified multi-level
mutation algorithm that is compatible with the RL-based To meet these requirements, we take the idea of the client-
fuzzing model. server model using a fork server introduced in AFLplus-
Algorithm 2 provides a pseudocode representation of the plus [28]. While the server is responsible for initializing the
proposed multi-level mutation algorithm, with the key ideas target program and employing Frida to inject code as well
as follows. as recording the initial state, the client interacts with it via
• The initial transformation depth is set to 1.
• Each input is associated with a counter table that tracks
Algorithm 2 Multi-level input mutation algorithm
the number of times each transformation action is executed.
• Each action is selected no more than C times.
• If an action has been selected more than C times, a different action that has not reached its threshold is randomly chosen for transformation.
• Once all actions for an input have been selected C times, the input is removed from the queue Qn and placed into the queue Qn+1.
• If the queue Qn is emptied, the model proceeds to a new depth of n + 1 and switches to queue Qn+1.

Require: Input seed, mutation depth depth
Ensure: List of test cases
1: function MULTI_LEVEL_MUTATE(seed, depth)
2:   input = seed.input
3:   energy = seed.energy
4:   for step ← 1 to depth − 1 do
5:     action = get_action(input) // Get action from RL model
6:     input = mutate(input, action)
7:   end for
8:   action = get_action(input) // Get action from RL model
9:   if action_is_maximum_try(seed, action, C) then
10:     action = pick_another_action(seed, C) // Randomly pick another valid action
11:   end if
12:   testcase_list = array[energy]
13:   for i ← 1 to seed.energy do
14:     testcase = mutate(input, action)
15:     testcase_list.append(testcase)
16:   end for
17:   update_mutate_count(seed, action)
18:   if all_mutate_reach_maximum_try(seed, C) then
19:     remove_from_queue(Qdepth, seed)
20:     add_to_queue(Qdepth+1, seed)
21:   end if
22:   return testcase_list

2) Early-stopping mechanism
Another effective algorithm that contributes to the efficiency of the fuzzing process in other tools is the "early stopping" mechanism, also called early abort. Take AFLplusplus as an example: it allows the fuzzing system to stop testing early and switch to other inputs when numerous trials on a specific input fail to achieve appropriate results, ignoring the remaining number of available trials. Inheriting this idea, we simplified and implemented this mechanism for our model by introducing a constant M representing the maximum number of consecutive unsuccessful trials. When transforming an input fails to increase code coverage or find new paths M times in a row, the model moves on to a different input. This approach can save a significant number of futile trial attempts. However, it is worth noting that setting an appropriate value for M is also crucial. While too large a value leads to an
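To make the bookkeeping concrete, the per-action budget C and the depth-queue promotion used by MULTI_LEVEL_MUTATE might be sketched in Python as follows. This is an illustrative sketch only: the Seed class, pick_action, and step_queue are hypothetical stand-ins rather than the actual CTFuzz implementation, C is shrunk from the paper's value of 10, and the action space is shrunk from 9 to 4.

```python
import random

C = 3                 # max selections per action per seed (the paper uses C = 10)
ACTIONS = range(4)    # small stand-in for the 9-action mutation space

class Seed:
    def __init__(self, data):
        self.input = data
        self.counts = {a: 0 for a in ACTIONS}  # per-action selection counter

def pick_action(seed, preferred):
    """Honor the RL model's choice unless its C-trial budget is spent."""
    if seed.counts[preferred] < C:
        return preferred
    remaining = [a for a in ACTIONS if seed.counts[a] < C]
    return random.choice(remaining) if remaining else None

def step_queue(queues, depth, rl_choice):
    """Charge one action selection to the head seed of Q_depth;
    promote the seed to Q_{depth+1} once every action budget is spent."""
    seed = queues[depth][0]
    action = pick_action(seed, rl_choice)
    if action is None:                    # all actions selected C times
        queues[depth].pop(0)
        queues.setdefault(depth + 1, []).append(seed)
        return None
    seed.counts[action] += 1
    return action

queues = {0: [Seed(b"seed")]}
# exhausting all C * |ACTIONS| selections forces promotion on the next step
for _ in range(C * len(ACTIONS) + 1):
    step_queue(queues, 0, rl_choice=0)
```

After the loop, Q0 is empty and the seed sits in Q1, mirroring the queue-switch rule in the bullets above.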
8 VOLUME 11, 2023
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3421989
Van-Hau Pham et al.: A Coverage-guided Fuzzing for Software Vulnerability Detection using RL-enabled Multi-Level Input Mutation
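As a hedged illustration of the early-stopping rule with constant M (formalized in Algorithm 3 below), the per-seed loop might look like this in simplified Python; mutate, run_target, and has_new_cov are hypothetical stubs standing in for the real components, and M is shrunk from the paper's 5,000.

```python
M = 3  # max consecutive unsuccessful trials (the paper uses M = 5000)

def fuzz_seed(seed_input, energy, mutate, run_target, has_new_cov):
    """Spend up to `energy` trials on one seed, aborting after M barren trials."""
    last_find = 0
    for i in range(1, energy + 1):
        testcase = mutate(seed_input)
        reward = run_target(testcase)
        if has_new_cov(reward):
            last_find = i              # remember the last productive trial
        if i - last_find > M:          # M consecutive failures: early abort
            return i                   # number of trials actually spent
    return energy

# stub target in which nothing ever yields new coverage:
# the loop aborts after M + 1 trials instead of spending all 100
spent = fuzz_seed(
    b"seed",
    energy=100,
    mutate=lambda data: data,
    run_target=lambda data: 0,
    has_new_cov=lambda reward: reward > 0,
)
```

With an always-productive target the full energy is spent, so the saving comes only from barren seeds, matching the waste-prevention intent described earlier.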
Algorithm 3 Early stopping mechanism
1: Q′ = SEED_SELECTION_AND_SCHEDULE(Q)
2: for seed in Q′ do
3:   last_find = 0
4:   for i ← 1 to seed.energy do
5:     input = mutate(seed.input, action)
6:     reward = run_target(input)
7:     if has_new_cov_or_new_unique_path(reward) then
8:       last_find = i
9:     end if
10:    if i − last_find > M then
11:      break // go to next seed in queue
12:    end if
13:  end for
14: end for

shared memory to transfer samples and receive code coverage information. To implement our coverage observation mechanism, we consider the CTFuzz model as an AFLplusplus client. We then designed an application called ex-frsv, which is responsible not only for initiating a forkserver-like server but also for playing the role of a proxy, enabling the connection between client and server.
The interaction of the proposed CTFuzz and those components to obtain code coverage is depicted in Fig. 5. In more detail, ex-frsv receives test cases from the CTFuzz model and sends them to the forkserver, which then returns the code coverage result. Techniques such as shared memory and semaphores are also employed to facilitate communication between the model and ex-frsv, enhancing both speed and stability. Moreover, being a coverage-guided fuzzing tool, the coverage observer in our model returns the obtained coverage via a bitmap, in which each bit corresponds to a basic code block and is set when that block is hit.

F. WORKFLOW OF CTFUZZ
The workflow of our proposed model, which involves the cooperation of the components mentioned above, is illustrated in Fig. 6. As a coverage-oriented grey-box fuzzing model using RL, equipped with other supporting algorithms to enhance effectiveness and performance, CTFuzz works through the following main steps.
1) Initially, the model pushes the initial seed to the queue.
2) The model proceeds to mutate the seed in a loop. At the start of a new iteration, all seeds in the queue are sorted in descending order of code coverage to prioritize seeds with higher code coverage.
3) Next, the scheduling algorithm is invoked to allocate the number of attempts for each seed. This is designed to distribute mutation energy reasonably, ensuring that a seed with higher potential is given more priority while still trying to mutate seeds with lower potential. This balances the exploitation and exploration of our model.
4) The model runs a loop, taking each seed and its allocated number of attempts as input for the RL model.
5) The RL model predicts and chooses an action that maximizes the new code coverage for the selected seed.
6) The model mutates the seed using the action chosen by the RL model and the number of attempts given by the scheduling algorithm, generating a series of test cases.
7) The target application is executed with the generated test cases, while the code coverage information is collected. Inputs that result in new code coverage are added to the queue to be used in the next iteration. During this phase, the multi-level mutation and early-stopping mechanisms are also applied to optimize the fuzzing process.
8) After finishing fuzzing on a seed, the reward points are calculated and returned, and the next seed in the queue becomes the next state for the RL model.
9) Continue to step (4) until the iteration is completed, then return to step (2) to start a new iteration.
With the design of the proposed CTFuzz, we expect to enable a better fuzzing process. First, the RL algorithm can enable the fuzzing model to select better actions based on the criterion of increasing discovered code coverage. Second, the input selection and scheduling algorithm may assist in distributing test attempts according to the nature of inputs, prioritizing inputs with higher code coverage while still ensuring that lower-coverage inputs are tested with a minimum number of attempts, balancing the exploitation and exploration of the model. The multi-level mutation mechanism makes the model flexible and more practically effective for long-term and challenging fuzzing scenarios. Meanwhile, the waste-prevention or early-stopping mechanism helps avoid wasting time on
inefficient inputs with excessive test attempts.

IV. IMPLEMENTATION AND EXPERIMENTS
A. RESEARCH QUESTIONS
Based on the improvements expected in the model presented above, this experimental section focuses on answering the following questions.
• Question 1: How effective is the model compared to the previous RL-based fuzzing model?
• Question 2: What is the efficiency of the model compared to a modern fuzzing tool?
• Question 3: What is the contribution of the RL model used within the entire framework?
• Question 4: If the speed disparity between programming languages is improved, what will be the effectiveness of the model?

TABLE 2. Hyperparameters for training the DQN-based fuzzing model
Hyperparameter | Value
Size of input layer | Size of state space: 65536 (0x10000)
Size of output layer | Size of action space: 9
Optimizer | Adam
Discount factor γ | 0.9
ϵ for EpsGreedyPolicy | 0.7
Learning rate | 0.001

TABLE 3. Settings of the mutation strategy for fuzzing
Parameter | Value
total_energy | 1,000,000
min_energy | 50
C - Maximum selection times of an action | 10
M - Maximum consecutive unsuccessful trials | 5,000
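To illustrate how the Table 2 hyperparameters interact, here is a framework-free sketch of ε-greedy action selection and the Q-learning update. The tiny linear Q-function is a hedged stand-in for the actual keras-rl2 DQN agent (whose input is the 65536-dimensional state and whose output covers the 9 actions); only the constants γ = 0.9, ε = 0.7, and the 0.001 learning rate are taken from the table, everything else is a toy assumption.

```python
import random

GAMMA = 0.9       # discount factor from Table 2
EPSILON = 0.7     # epsilon of the EpsGreedyPolicy
LR = 0.001        # learning rate
N_ACTIONS = 9     # size of the mutation action space
N_FEATURES = 4    # toy state size (the real state space is 65536-dimensional)

# stand-in Q-function: one weight vector per action
weights = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]

def q_value(state, action):
    return sum(w * s for w, s in zip(weights[action], state))

def select_action(state):
    """Epsilon-greedy: explore with probability EPSILON, else exploit argmax Q."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: q_value(state, a))

def td_update(state, action, reward, next_state):
    """One gradient step toward the target r + GAMMA * max_a' Q(s', a')."""
    target = reward + GAMMA * max(q_value(next_state, a) for a in range(N_ACTIONS))
    error = target - q_value(state, action)
    for i, s in enumerate(state):
        weights[action][i] += LR * error * s
```

In CTFuzz the reward would be derived from newly observed code coverage, and the DQN network would replace this linear model.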
B. ENVIRONMENTAL SETTINGS
1) Implementation setup
We deployed the proposed model and performed the experiments on a VPS machine equipped with an Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz with 4 cores and 64 GB of RAM, running Ubuntu 20.04.1 LTS 64-bit. On this system, we installed AFLplusplus version 4.05c, Binutils 2.34, and Poppler 0.86.1, along with requisite libraries such as tensorflow 2.3.3, gym 0.10.3, posix-ipc 1.1.1, keras-rl2 1.0.5, xxhash 3.2.0, etc.
Moreover, the RL model is implemented with a DQN-based agent and an environment deployed using the gym library [29]. The hyperparameters of the RL model are detailed in Table 2; they are inherited from related work as well as chosen based on the best results across multiple experiments. Meanwhile, the settings for the mutation strategy are described in Table 3.
2) Target programs for evaluation
In the scope of this paper, our investigation involved statistical analysis and fuzz testing of two toolsets, specifically Binutils-2.34 and Poppler-0.86.1, as outlined in Table 4. These toolsets are commonly used for testing in practical fuzzing research due to their open-source and easy-to-use nature, allowing for testing methods that require source code access. The target applications used in our experiment are categorized into 2 main types: PDF and ELF. Besides, they are not so excessive or complicated as to require a dedicated fuzzer with specific customization for the fuzzing process. However, the chosen target programs still possess a high potential for vulnerabilities because of their complex functionalities and diverse formats.
In addition, the initial criteria for the model targeted grey-box fuzzing of toolkits and software; therefore, network services were not included in the objectives. Network services
receive inputs differently and require a different approach when developing fuzzing tools. Contemporary fuzzing tools also have specific features for fuzzing network services. In terms of image processing tools, fuzzing via parsing software is quite similar to PDF file parsing. However, our tools do not yet support more complex operations, like interactive image editing, in terms of fuzzing based on events during UI application interactions.

TABLE 4. Target fuzzing programs
ID | Target Program | Parameters | Version | Type
1 | readelf | -a @@ | Binutils-2.34 | ELF
2 | strings | -a @@ | Binutils-2.34 | ELF
3 | size | -A -x -t @@ | Binutils-2.34 | ELF
4 | objdump | -a -f -x -d @@ | Binutils-2.34 | ELF
5 | nm | -C @@ | Binutils-2.34 | ELF
6 | pdfinfo | -box @@ | Poppler-0.86.1 | PDF
7 | pdfimages | -list -j @@ | Poppler-0.86.1 | PDF
8 | pdfdetach | -list @@ | Poppler-0.86.1 | PDF
9 | pdftotext | -htmlmeta @@ | Poppler-0.86.1 | PDF
10 | pdftohtml | -stdout @@ | Poppler-0.86.1 | PDF
11 | pdftoppm | -mono @@ | Poppler-0.86.1 | PDF

3) Evaluation metrics
Our approach is evaluated via the four following metrics.
• Code coverage is calculated as the total number of edges found by each model with the edge-based coverage computation method. Each traversed edge is only counted once, ignoring later duplications during the fuzzing process. As mentioned above, a bitmap is returned to indicate the observed coverage, building up a global bitmap that records the edges found after each trial. Higher code coverage represents better performance.
• Unique paths are measured for each test case, based on the set of edges it traverses (without taking the frequency of each traversed edge into account). This is an essential metric to evaluate the path-discovery capability of fuzzing models, reflecting the diversity of executed paths. Once again, more unique paths found indicate better effectiveness of our model.
• Execution rate is a parameter evaluating the speed of the models, identified by time-related information when trying inputs on the target programs. For example, it can be the total number of trials each model performed in a specific amount of time or the time it takes to finish a particular number of trials. In fact, given the inherently random nature of fuzzing, more trials conducted can increase the chances of discovering vulnerabilities or new paths as well as performance. In general, this metric can be calculated as the total number of executed trials per second. Hence, the higher speed of models is reflected in a higher number of trials performed or less time consumed.
• Enhancement rate (ER) is used to determine the increase or decrease in performance of CTFuzz compared to other models, calculated by (4). According to this formula, a positive ER indicates that CTFuzz is more effective than its counterpart and vice versa, and it can be used to compare all the above metrics.

ER = (Value of CTFuzz − Value of model X) / (Value of model X) × 100    (4)

Where:
-- Value of CTFuzz: the value, which can be one of the above 3 metrics, of the CTFuzz model to be compared.
-- Value of model X: the value, which can be one of the above 3 metrics, of the other models.

4) Experimental scenarios
To address these four questions, we conducted experiments to evaluate the effectiveness of our model. In our experiments, our proposed CTFuzz is put side by side with 3 other fuzzing tools, each of which is leveraged to answer one of the above questions. While the first question can be resolved by comparing CTFuzz with rlfuzz [10], another tool, AFLplusplus, is used for the second one. Moreover, to address the third question, we compared CTFuzz with a random generation mechanism designed to separate the RL model from the overall model and evaluated the performance difference between selecting transformation actions based on RL and selecting them randomly. For the final question, which is related to the performance of a programming language, we observe some metrics after 200,000 initial trials.
To sum up, 2 different scenarios are conducted as follows.
• Scenario 1: We compared the effectiveness of our model against various fuzzing models, including rlfuzz [10], AFLplusplus [28], and a random mechanism, within the same execution time of 6 hours via all metrics and their ERs.
• Scenario 2: This scenario aims to answer the last question, where all models are compared using code coverage and unique paths after completing 200,000 trials.
Note that, in each scenario, each model is tested 5 times to obtain averaged results.

C. EXPERIMENTAL RESULTS AND ANALYSIS
1) Scenario 1: Performance after 6 hours
Table 5 summarizes the number of edges found by each model after the 6-hour experiment. In general, CTFuzz achieves lower code coverage compared to AFLplusplus when fuzzing most of the tested applications, while it beats rlfuzz regardless of target programs. In more detail, the most significant performance decrease compared to AFLplusplus is observed in the readelf application, at 73.3%, while the best increase in code coverage is observed in the objdump application, with an ER of 39.5%. Especially, in comparison with rlfuzz, our CTFuzz can even achieve a climb of 315.7%, being 4 times more effective than its counterpart in terms of seeking new edges. On average across the 11 applications, CTFuzz achieves lower coverage of
4.0% compared to AFLplusplus after 6 hours. In comparison with the random transformation model, CTFuzz experiences a slight increase in code coverage of 6.1% overall. However, the code coverage of the two models is nearly equivalent across all tested applications, except for the nm application, where the difference reaches up to 54% in favor of CTFuzz. Overall, compared with rlfuzz, CTFuzz exhibits higher code coverage with an average increase of 82.1%.
When it comes to finding unique paths, Table 6 contains the number of discovered paths for each model after 6 hours. Once again, CTFuzz trails behind AFLplusplus almost entirely, with the largest gap in the readelf application at -95.8%. However, there is an exception in the pdftoppm application, where CTFuzz increases by 58.1%. Despite the relatively high ratio, the actual difference is not substantial (9.8 compared to 6.2). On average, CTFuzz lags behind AFLplusplus by 37.3%. In comparison to the random transformation model, CTFuzz experiences a slight increase. The most significant boost is observed in the nm application with 306.3%, while the lowest decrease is -16.7% in pdfinfo. On average, CTFuzz surpasses the random model by 55.8% in terms of discovered path count across the 11 applications. Concerning the rlfuzz model, CTFuzz once again outperforms it on all test applications. The highest superiority is in pdfinfo with 1,053.8%, and the lowest is 15.8% in pdfimages. The average across the 11 applications is 349.1%.
Next, let us delve into the comparison of execution speed, as illustrated in Table 7, based on
TABLE 7. Execution speed (number of performed trials per second) after 6 hours
the number of trial attempts per second. CTFuzz experiences a significant speed disadvantage, with an average number of trial attempts per second 86.2% lower than that of AFLplusplus. As for the random transformation model, the difference is marginal at -1.7%. However, CTFuzz beats the rlfuzz model, running 163.6% faster.

2) Scenario 2: Performance after 200,000 trials
Given a fixed 200,000 initial trials, Table 8 summarizes the performance of the models in the fuzzing process in terms of discovered edges. In this context, CTFuzz outperforms AFLplusplus, exhibiting an improvement of over 6.5% on average in the number of edges discovered. The largest disparity is observed in the objdump application, where CTFuzz achieves a lead of more than 47.3%. Against the rlfuzz model, CTFuzz continues to win when fuzzing most applications, showing an average improvement of 82.5%. The highest figure is seen in the pdftotext application, with an increase of 314%, and no lower performance is observed in any application. With the random transformation mechanism, the overall difference remains relatively small, and the coverage achieved in various applications is nearly comparable. Notably, in the case of the nm application, CTFuzz exhibits a significant increase of 70.8% in edge coverage compared to the random mechanism. On average, across the 11 applications, CTFuzz achieves more than 9.4% higher edge coverage than the random mechanism after 200,000 executions.
In terms of the number of discovered execution paths within the first 200,000 executions, as shown in Table 9, CTFuzz continues to demonstrate a slight advantage over AFLplusplus, exhibiting an increase of 8.2%. The disparity in improvement varies across applications: the largest boost was seen in objdump at 97.6%, and the most marginal in pdftotext at -34.6%. Compared to the random transformation mechanism, CTFuzz surpasses the average by 33.4%, achieving the highest increase in the readelf application at 169.8% and the lowest in pdfdetach at -15.4%. For the rlfuzz model, the average increase across the 11 applications is 350.4%. Notably, in the pdftotext application, there is a substantial difference of 1,560% due to rlfuzz discovering only one path compared to 16.6 paths found by CTFuzz.
Moreover, we compared the execution speeds of the four models within the first 200,000 trials in terms of the number of executed trials per second, as illustrated in Table 10. The increase ratios of CTFuzz in comparison to AFLplusplus, the random transformation mechanism, and rlfuzz are -86.3%, -3.5%, and 154.6%, respectively.

V. DISCUSSION
Given the nature of the designed experiments, it is essential to provide a few general observations about the experimental results, as follows.
• The 6-hour timeframe is relatively short for a comprehensive comparison of the effectiveness of the fuzzing models in a real-world scenario, where practical fuzzing projects can span over weeks, utilizing multiple
TABLE 8. Total code coverage (number of found edges) after 200,000 trials
TABLE 10. Execution speed (number of performed trials per second) after 200,000 trials
machines with significantly higher speeds. However, this duration is sufficient to highlight differences between the models; meanwhile, running each application three times helps mitigate the impact of luck.
• Many applications achieved similar values across all four models and did not show significant improvement after either 200,000 trials or 6 hours. This could be due to less effective initial inputs or the relatively short timeframe.
• Some comparison values exhibit minor differences in effectiveness in terms of improvements. However, since the base values being compared against were low, the resulting improvement ratios appear substantial, impacting the final average numbers.
When it comes to addressing the four pre-defined research questions, the above experimental results lead us to the following answers.
Question 1: How effective is the model compared to the previous RL-based fuzzing model?
This question can be answered using the results of the comparison between CTFuzz and rlfuzz, another RL-based model. Clearly, in both the 6-hour and 200,000-trial experiments, the CTFuzz model consistently outperforms the rlfuzz model in code coverage, unique path count, and execution speed. CTFuzz's code coverage improves by approximately 80% compared to rlfuzz, and it surpasses rlfuzz in the number of discovered paths by 411.8% in the initial 200,000 trials and by 658.6% within the 6-hour timeframe. In terms of execution speed, CTFuzz is approximately 160% faster in both contexts. These results demonstrate the higher effectiveness of CTFuzz over rlfuzz in both the exploitation and exploration aspects of the model.
Question 2: How does the model compare to a state-of-the-art fuzzing tool?
Taking AFLplusplus as an example of a modern fuzzing tool, the effectiveness of CTFuzz falls slightly behind in all evaluation metrics. However, when examining the effectiveness on a per-run basis, CTFuzz achieves slightly better results than AFLplusplus. This indicates that if the speed of the CTFuzz model can be improved without compromising its overall performance, it has the potential to be highly useful.
With the rising popularity of ChatGPT and its applications, ChatGPT-assisted fuzzers can also be a promising solution. Despite its huge database and the diverse knowledge obtained from various sources, there are still limitations when considering ChatGPT in our approach. For instance, the provided APIs of ChatGPT incur costs and may make the proposed fuzzing tool dependent on network speed. Besides, the fuzzing process may impose various restraints, such as data and source-code privacy and security, so relying on a third-party approach like ChatGPT can be risky.
Question 3: What is the trade-off regarding speed that yields performance improvements?
In both the 6-hour and 200,000-trial contexts, the speed difference between CTFuzz and the random mechanism remains marginal, at approximately 0.7%. Despite this small delay, CTFuzz slightly enhances code coverage and path discovery, highlighting the value of the RL mechanism. However, the speed trade-off does not appear to be significant.
Question 4: How would the model's effectiveness change if the speed gap between languages were reduced?
Comparing the execution speeds of the four models reveals a considerable discrepancy between AFLplusplus (implemented in C) and the other models (implemented in Python). However, the speed difference between CTFuzz and the random mechanism is minor, around 0.7%. This indicates that the primary contributor to the speed difference is the Python language rather than the RL component. Considering the slight advantage of CTFuzz over AFLplusplus in per-run effectiveness, optimizing the model's execution speed could yield promising results.
In conclusion, the achieved results partially reflect our expectations for the fuzzing model. CTFuzz boasts a higher execution speed, greater code coverage, and an increased number of found paths compared to the rlfuzz model in both the 6-hour and 200,000-trial experiments. Although CTFuzz operates slightly slower than the random mechanism, it simultaneously improves code coverage and path count. This suggests the RL mechanism has some positive effects, though not overwhelmingly substantial. While the discussion has
14 VOLUME 11, 2023
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3421989
Van-Hau Pham et al.: A Coverage-guided Fuzzing for Software Vulnerability Detection using RL-enabled Multi-Level Input Mutation
provided valuable insights, further detailed comparisons and GPT not only achieves greater coverage but also identifies
evaluations between different RL models and their hyper- numerous previously undetected bugs, leading to a collabo-
parameters are necessary for optimal and efficient fuzzing. ration with the Syzkaller team to incorporate these automati-
Additionally, some applications reached stagnation in code cally inferred specifications. This development underscores
coverage and path discovery, indicating either the initial input the potential of LLMs to transform the scope and efficacy
ineffectiveness or the need for longer experimentation peri- of kernel fuzzing practices, providing a more automated and
ods. systemic approach to securing operating systems against a
Regarding time constraints, we have not conducted com- broad spectrum of threats.
prehensive comparisons and performance evaluations across In terms of protocol fuzzing, pretrained LLMs have shown
different RL models, nor have we thoroughly assessed the significant potential in advancing fuzzing protocols, partic-
hyperparameters. Consequently, the choices may not have ularly in overcoming the limitations of traditional mutation-
resulted in optimal and efficient fuzzing. Moreover, several based fuzzing techniques. By leveraging the extensive knowl-
applications exhibited coverage and path discovery stagna- edge embedded within LLMs in ChatAFL approach by Ruijie
tion, which suggests that either the initial inputs employed Meng et al. [31], which have been trained on vast amounts
were ineffective for those applications or the experimentation of human-readable protocol specifications, it becomes pos-
duration was insufficient for the model to explore new paths. sible to extract machine-readable information that aids in
generating valid and diverse test inputs. This is especially
VI. FUTURE DIRECTIONS useful given that protocol implementations often lack formal,
Much more research effort is needed for RL-based fuzzing machine-readable specifications, relying instead on extensive
models to be effectively applicable in practice. One of the natural language documentation. The LLM-guided approach
significant challenges is speed, which can be possibly im- enhances state and code coverage by constructing grammar
proved by transitioning the model to other faster program- for various message types within a protocol and predicting
ming languages, like Rust. Moreover, our proposed fuzzer can subsequent messages in a sequence.
be redesigned in a multithreaded processing manner to speed The potential of LLMs like GPT in enhancing fuzzing tech-
it up significantly. Besides, a comprehensive comparison with niques extends beyond kernel security to help fuzzers encom-
other RL models with more complicated architecture or using pass userland applications in binary software and protocols
various RL algorithms is also a promising direction. Besides, [31]. LLMs demonstrate a profound capability to understand
instead of focusing on input selection and scheduling, the and generate complex code patterns, which can be leveraged
process of input creation can also be enhanced by deploying to automate the generation of fuzzing inputs for userland
state-of-art data generation techniques, such as Generative binary applications. This is particularly valuable in scenarios
adversarial networks (GANs) in RapidFuzz [19], CGFuzzer where conventional fuzzing struggles due to the complexity
[20]. Such GANs-based approaches can be used to learn data of the input structures required by these applications. By
structures to effectively create input, which is useful for deal- automating input generation, LLMs can uncover vulnerabil-
ing with complicated input types or strict input verification in ities that might be missed by more traditional methods, thus
some target programs. broadening the scope of security testing in userland environ-
In the evolving landscape of operating system security, in ments. Furthermore, the adaptability of LLMs to understand
addition to software fuzzing in userland mode, kernel fuzzing has emerged as a critical technique for uncovering vulnerabilities that could potentially impact billions of devices globally. One of the foundational tools in this domain is Syzkaller, which utilizes a domain-specific language, syzlang, to meticulously define system call (syscall) sequences and their interdependencies. Despite the progress in automating kernel fuzzing, the generation of Syzkaller specifications has largely remained a manual endeavor, with a significant number of syscalls still not effectively covered. Recognizing this gap, the recent study by Chenyuan Yang et al., which introduced the KernelGPT approach [30], marks a significant advancement. It harnesses the capabilities of Large Language Models (LLMs) to infer Syzkaller specifications, leveraging the extensive kernel code, documentation, and use-case data encoded during the pre-training of these models. The KernelGPT model utilizes an iterative method to derive and refine syscall specifications, integrating feedback mechanisms to enhance the accuracy and comprehensiveness of the generated sequences. Preliminary findings highlighted in the study indicate that KernelGPT can help Syzkaller achieve higher coverage and uncover previously unknown kernel bugs.

In userland fuzzing, the context that LLMs absorb from documentation and prior code enables them to tailor fuzzing approaches to the specific nuances of userland binaries. This enhances the efficacy and coverage of fuzz tests, pushing the boundaries of what can be achieved with current technology in identifying and mitigating potential software vulnerabilities.

In summary, while ChatGPT and other LLMs can enhance certain aspects of the fuzzing process, especially in test case generation and automation [32], their effectiveness in fuzzing binary software is limited by their lack of specialized knowledge in low-level computing and by potential scalability issues. Hence, to make them more suitable for fuzzing, LLMs need to be customized, fine-tuned, and pretrained on specific large-scale datasets. In our future work, they can best be used as supplementary tools alongside more specialized fuzzing tools and frameworks.
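To make concrete what a Syzkaller-style specification encodes, the following minimal Python sketch models syscall templates whose resource dependencies constrain the generated sequences. All names and structures here are hypothetical illustrations, not Syzkaller's actual syzlang format; the point is only that a spec lets a generator avoid using a resource (e.g., a file descriptor) before some earlier call has produced it.

```python
import random

# Hypothetical, simplified model of what a syzlang specification encodes:
# each syscall template declares the resource kind it consumes and the one
# it produces, so generated sequences respect interdependencies
# (e.g., read() needs an fd produced by an earlier open()).
SPECS = {
    "open":  {"consumes": None, "produces": "fd"},
    "read":  {"consumes": "fd", "produces": None},
    "write": {"consumes": "fd", "produces": None},
    "close": {"consumes": "fd", "produces": None},
}

def generate_sequence(length, rng=random.Random(0)):
    """Generate a syscall sequence that never uses a resource before it exists."""
    available = set()  # resource kinds produced so far
    seq = []
    for _ in range(length):
        # only calls whose consumed resource is already available are legal
        legal = [name for name, spec in SPECS.items()
                 if spec["consumes"] is None or spec["consumes"] in available]
        name = rng.choice(legal)
        seq.append(name)
        if SPECS[name]["produces"]:
            available.add(SPECS[name]["produces"])
    return seq

seq = generate_sequence(6)
# every fd-consuming call appears only after at least one "open"
assert all(call == "open" or "open" in seq[:i] for i, call in enumerate(seq))
```

Manually writing such templates for thousands of syscalls is exactly the effort that KernelGPT aims to automate by inferring the consumed/produced resource relations from kernel code and documentation.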
Specifically, in the test case generation task for fuzzing, ChatGPT or other LLMs can significantly enhance the process by leveraging their language-processing capabilities to create diverse and contextually appropriate inputs. They can generate various forms of user-like data, develop complex user scenarios, and create semantic variants of input data, all of which are critical for exploring different execution paths in software. Additionally, LLMs can aid in automating test-script writing and can integrate and interpret outputs from other tools to suggest relevant test cases, thereby enriching the depth and coverage of the fuzzing process in uncovering potential vulnerabilities. We also intend to extend our work with LLM models in the future to boost fuzzing performance across many open-source software projects, kernel applications, and protocols.
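The LLM-assisted mutation workflow described above can be sketched as a coverage-guided loop in which the model proposes semantic variants of a structured seed. In this illustrative Python sketch, `llm_semantic_variants` is a hypothetical stand-in for a real model query (no particular LLM API is assumed), and `target` is a toy program that reports covered "edges" so new-coverage inputs can be kept.

```python
import json
import random

def llm_semantic_variants(seed, n):
    """Hypothetical stand-in for an LLM call: produces semantically
    plausible variants of a structured (JSON) seed instead of random
    byte flips, e.g. extreme values, type changes, empty/huge strings."""
    obj = json.loads(seed)
    rng = random.Random(0)
    variants = []
    for _ in range(n):
        mutated = dict(obj)
        key = rng.choice(list(mutated))
        mutated[key] = rng.choice([2**31 - 1, -1, "", "x" * 64, None])
        variants.append(json.dumps(mutated))
    return variants

def fuzz_once(target, corpus, coverage):
    """One coverage-guided step: keep only variants that reach new behavior."""
    seed = random.choice(corpus)
    for variant in llm_semantic_variants(seed, n=8):
        edges = target(variant)        # target reports covered edge ids
        if not edges <= coverage:      # new coverage -> keep the input
            coverage |= edges
            corpus.append(variant)
    return corpus, coverage

def target(data):
    """Toy target: covers an extra 'edge' when the price field is negative."""
    obj = json.loads(data)
    covered = {"parse"}
    if isinstance(obj.get("price"), int) and obj["price"] < 0:
        covered.add("negative-price-branch")
    return covered

corpus, cov = fuzz_once(target, ['{"user": "a", "price": 10}'], set())
```

The design choice mirrored here is that the LLM supplies the mutation operator while a conventional coverage signal still decides which inputs survive, keeping the LLM as a supplementary component rather than the whole fuzzer.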
VII. CONCLUSION
Combining RL and current fuzzing techniques holds the potential to accelerate fuzzing effectively. However, significant limitations still hinder its practicality, including the imbalance between exploitation and exploration in the model, slow speed, and the absence of comparative studies on practical effectiveness against real-world fuzzing tools. In this article, we introduce an RL-based fuzzing model guided by code coverage that balances the exploitation and exploration aspects through efficient input selection and scheduling algorithms. Moreover, to improve efficiency, multi-level input mutation algorithms and early-termination mechanisms are also implemented. The effectiveness of our proposed CTFuzz model has been demonstrated through experimental results, where it is compared with other modern fuzzing tools and RL-based fuzzing models in its capability to discover new paths, increase code coverage, and save time. Despite these promising outcomes, the contribution of CTFuzz is still bounded by limitations such as insufficient speed, the limited length of test cases, a focus on binary software only, and modest improvement over the random mechanism, all of which need to be addressed in future efforts.

ACKNOWLEDGMENT
This research was supported by The VNUHCM-University of Information Technology's Scientific Research Support Fund.

REFERENCES
[1] Valentin JM Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J Schwartz, and Maverick Woo. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering, 47(11):2312–2331, 2019.
[2] Fayozbek Rustamov, Juhwan Kim, Jihyeon Yu, and Joobeom Yun. Exploratory review of hybrid fuzzing for automated vulnerability detection. IEEE Access, 9:131166–131190, 2021.
[3] Xiaogang Zhu, Sheng Wen, Seyit Camtepe, and Yang Xiang. Fuzzing: A survey for roadmap. ACM Computing Surveys, 54:1–36, 2022.
[4] Sanoop Mallissery and Yu-Sung Wu. Demystify the fuzzing methods: A comprehensive survey. ACM Computing Surveys, 56(3):1–38, 2023.
[5] Xiaoqi Zhao, Haipeng Qu, Jianliang Xu, Xiaohui Li, Wenjie Lv, and Gai-Ge Wang. A systematic review of fuzzing. Soft Computing, 2023.
[6] Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. Coverage-based greybox fuzzing as Markov chain. IEEE Transactions on Software Engineering, 45(5):489–506, 2019.
[7] Jinghan Wang, Chengyu Song, and Heng Yin. Reinforcement learning-based hierarchical seed scheduling for greybox fuzzing. In Network and Distributed Systems Security (NDSS) Symposium 2021, 2021.
[8] Hongliang Liang, Xiaoxiao Pei, Xiaodong Jia, Wuwei Shen, and Jian Zhang. Fuzzing: State of the art. IEEE Transactions on Reliability, 67(3):1199–1218, 2018.
[9] Tai Yue, Pengfei Wang, Yong Tang, Enze Wang, Bo Yu, Kai Lu, and Xu Zhou. EcoFuzz: Adaptive energy-saving greybox fuzzing as a variant of the adversarial multi-armed bandit. In Proceedings of the 29th USENIX Security Symposium, pages 2307–2324, 2020.
[10] Zheng Zhang, Baojiang Cui, and Chen Chen. Reinforcement learning-based fuzzing technology. In Innovative Mobile and Internet Services in Ubiquitous Computing: Proceedings of the 14th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS-2020), pages 244–253. Springer, 2021.
[11] Jiaxi Ye, Ruilin Li, and Bin Zhang. RDFuzz: Accelerating directed fuzzing with intertwined schedule and optimized mutation. Mathematical Problems in Engineering, 2020:1–12, 2020.
[12] Yan Wang, Peng Jia, Luping Liu, Cheng Huang, and Zhonglin Liu. A systematic review of fuzzing based on machine learning techniques. PLoS ONE, 15(8):e0237749, 2020.
[13] Jianye Hao, Tianpei Yang, Hongyao Tang, Chenjia Bai, Jinyi Liu, Zhaopeng Meng, Peng Liu, and Zhen Wang. Exploration in deep reinforcement learning: From single-agent to multiagent domain. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21, 2023.
[14] Jinghan Wang, Chengyu Song, and Heng Yin. Reinforcement learning-based hierarchical seed scheduling for greybox fuzzing. In The Network and Distributed System Security (NDSS) Symposium 2022, 2022.
[15] Tiantian Ji, Zhongru Wang, Zhihong Tian, Binxing Fang, Qiang Ruan, Haichen Wang, and Wei Shi. AFLPro: Direction sensitive fuzzing. Journal of Information Security and Applications, 54:102497, 2020.
[16] Peng Chen, Jianzhong Liu, and Hao Chen. Matryoshka: Fuzzing deeply nested branches. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 499–513, 2019.
[17] Hanfang Zhang, Anmin Zhou, Peng Jia, Luping Liu, Jinxin Ma, and Liang Liu. InsFuzz: Fuzzing binaries with location sensitivity. IEEE Access, 7:22434–22444, 2019.
[18] Craig Beaman, Michael Redbourne, J Darren Mummery, and Saqib Hakak. Fuzzing vulnerability discovery techniques: Survey, challenges and future directions. Computers & Security, page 102813, 2022.
[19] Aoshuang Ye, Lina Wang, Lei Zhao, Jianpeng Ke, Wenqi Wang, and Qinliang Liu. RapidFuzz: Accelerating fuzzing via generative adversarial networks. Neurocomputing, 460:195–204, 2021.
[20] Zhenhua Yu, Haolu Wang, Dan Wang, Zhiwu Li, and Houbing Song. CGFuzzer: A fuzzing approach based on coverage-guided generative adversarial networks for industrial IoT protocols. IEEE Internet of Things Journal, 9(21):21607–21619, 2022.
[21] Yunchao Wang, Zehui Wu, Qiang Wei, and Qingxian Wang. NeuFuzz: Efficient fuzzing with deep neural network. IEEE Access, 7, 2019.
[22] Konstantin Böttinger, Patrice Godefroid, and Rishabh Singh. Deep reinforcement fuzzing. In 2018 IEEE Security and Privacy Workshops (SPW), pages 116–122. IEEE, 2018.
[23] Alexandr Kuznetsov, Yehor Yeromin, Oleksiy Shapoval, Kyrylo Chernov, Mariia Popova, and Kostyantyn Serdukov. Automated software vulnerability testing using deep learning methods. In IEEE 2nd UKRCON, 2019.
[24] Sameer Reddy, Caroline Lemieux, Rohan Padhye, and Koushik Sen. Quickly generating diverse valid test inputs with reinforcement learning. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), pages 1410–1421, 2020.
[25] Xiao Liu, Rupesh Prajapati, Xiaoting Li, and Dinghao Wu. Reinforcement compiler fuzzing. In ICML 2019 Workshop, 2019.
[26] William Drozd and Michael D Wagner. FuzzerGym: A competitive framework for fuzzing and learning. arXiv preprint arXiv:1807.07490, 2018.
[27] libFuzzer – a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html.
[28] American Fuzzy Lop plus plus (AFL++). https://github.com/AFLplusplus/AFLplusplus.
[29] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[30] Chenyuan Yang, Zijie Zhao, and Lingming Zhang. KernelGPT: Enhanced kernel fuzzing via large language models. arXiv preprint arXiv:2401.00563, 2023.
[31] Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. Large language model guided protocol fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS), 2024.
[32] Yutian Tang, Zhijie Liu, Zhichao Zhou, and Xiapu Luo. ChatGPT vs SBST: A comparative assessment of unit test suite generation. IEEE Transactions on Software Engineering, pages 1–19, 2024.
VAN-HAU PHAM obtained his bachelor's degree in Computer Science from the University of Natural Sciences of Ho Chi Minh City in 1998. He pursued his master's degree in Computer Science at the Institut de la Francophonie pour l'Informatique (IFI) in Vietnam from 2002 to 2004, then completed an internship and worked as a full-time research engineer in France for two years. He pursued his Ph.D. thesis on network security under the direction of Professor Marc Dacier from 2005 to 2009. He is now a lecturer at the University of Information Technology, Vietnam National University Ho Chi Minh City (UIT-VNU-HCM), Ho Chi Minh City, Vietnam. His main research interests include network security, system security, mobile security, and cloud computing.

DO THI THU HIEN received the B.Eng. degree in Information Security from the University of Information Technology, Vietnam National University Ho Chi Minh City (UIT-VNU-HCM) in 2017 and an M.Sc. degree in Information Technology in 2020. Since 2017, she has worked as a member of a research group at the Information Security Laboratory (InSecLab) at UIT. Her research interests are malware analysis and detection, information security and privacy, Software-Defined Networking, and its related security-focused problems.

NGUYEN PHUC CHUONG graduated with honors in Information Security in 2023 from the University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam (UIT-VNU-HCM). Since 2022, he has been actively engaged as a student member of the Information Security Laboratory (InSecLab), UIT-VNU-HCM, focusing on projects related to information security and AI-driven security. His primary research focuses on software security, automatic exploitation, and AI cybersecurity endeavors, particularly fuzzing techniques for binary exploitation with AI.

PHAM THANH THAI graduated with honors in Information Security in 2023 from the University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam (UIT-VNU-HCM). Since 2022, he has been actively engaged as a student member of the Information Security Laboratory (InSecLab), UIT-VNU-HCM, focusing on projects related to information security and AI-driven security. His primary research focuses on cybersecurity and AI cybersecurity endeavors, particularly fuzzing techniques for binary exploitation with AI.

PHAN THE DUY received the B.Eng. and M.Sc. degrees in Software Engineering and Information Technology from the University of Information Technology (UIT), Vietnam National University Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam, in 2013 and 2016, respectively. He is currently pursuing a Ph.D. degree in Information Technology, specializing in Cybersecurity, at UIT, Ho Chi Minh City, Vietnam. Since 2016, he has also worked as a researcher at the Information Security Laboratory (InSecLab), UIT-VNU-HCM, after five years in industry, where he helped build several security-enhanced, large-scale teleconference systems. His main research interests concentrate on cybersecurity and privacy problems, including software security, Software-Defined Networking (SDN), malware and cyber-threat detection, digital forensics, machine learning and adversarial machine learning in cybersecurity domains, private machine learning, and blockchain.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4