Received 4 November 2023, accepted 19 December 2023, date of publication 26 December 2023, date of current version 31 January 2024.

Digital Object Identifier 10.1109/ACCESS.2023.3347652

Machine Learning-Based Fuzz Testing Techniques: A Survey

AO ZHANG¹, YIYING ZHANG¹, YAO XU¹, CONG WANG¹, AND SIWEI LI²
¹College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China
²State Grid Information and Telecommunication Company Ltd., Beijing 102200, China

Corresponding author: Yiying Zhang ([email protected])

ABSTRACT Fuzz testing is a vulnerability discovery technique that tests the robustness of target programs by
providing them with unconventional data. With the rapid increase in software quantity, scale and complexity,
traditional fuzzing has revealed issues such as incomplete logic coverage, low automation level and
insufficient test cases. Machine learning, with its exceptional capabilities in data analysis and classification
prediction, presents a promising approach for improving fuzzing. This paper investigates the latest research
results in fuzzing and provides a systematic review of machine learning-based fuzzing techniques. Firstly,
by outlining the workflow of fuzzing, it summarizes the optimization of different stages of fuzzing using
machine learning. Specifically, it focuses on the application of machine learning in the preprocessing phase,
test case generation phase, input selection phase and result analysis phase. Secondly, it focuses on
the optimization methods of machine learning in the process of mutation, generation and filtering of test cases
and compares and analyzes their technical principles. Furthermore, it analyzes the performance gains brought
by applying machine learning techniques to fuzzing, mainly including coverage, vulnerability detection
capability, efficiency and effectiveness of test cases. Lastly, it concludes by summarizing the challenges
and difficulties in combining machine learning with fuzzing and presents prospects for future trends in this
field.

INDEX TERMS Vulnerability discovery, fuzzing, machine learning.

The associate editor coordinating the review of this manuscript and approving it for publication
was Xinyu Du.

I. INTRODUCTION
In recent years, there has been a proliferation of network attacks and a rapid increase in the
number of vulnerabilities, leading to potential risks such as information leakage or loss.
Vulnerability discovery techniques aim to identify and patch vulnerabilities before they are
exploited by attackers [1], effectively reducing security threats and maintaining the secure
operation of networks. Fuzz testing, as an effective method for vulnerability discovery, attempts
to trigger program anomalies by automatically or semi-automatically generating test cases,
monitoring target program execution and providing feedback to adjust the generation of test cases.
It offers the advantages of easy deployment and broad applicability. The concept of fuzz testing
was initially proposed by Miller in 1990 [2], who designed a tool called Fuzz to test the
robustness of target programs using unconventional data. As the value of fuzzing has been
explored, black-box [3], white-box [4] and gray-box fuzzers [5] have appeared one after another.
Countless scholars have carried out continuous improvement and enhancement, and the coverage rate
and anomaly triggering ability have been improved to different degrees. However, traditional
fuzzing still faces several challenges, such as an insufficient number of existing test cases,
weak ability of generated test cases to trigger vulnerabilities, the lack of differentiation
between test case weights during input selection, and a relatively high degree of blindness
during the testing process.

With the remarkable performance of machine learning techniques in statistical learning, natural
language processing and pattern recognition, researchers have applied these techniques to the
field of cybersecurity, including the detection of malicious code [6] and intrusion detection [7].
Machine learning can automatically learn grammar rules that conform to syntax specifications from
a large number of samples, effectively addressing classification problems in fuzzing, such as

2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License.
For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 12, 2024

FIGURE 1. Basic flow of fuzzing.
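
As a rough illustration of the flow in Figure 1, the following Python sketch walks through the six stages (preprocessing, test case generation, input selection, test execution, evaluation and result analysis) on a toy target. It is not taken from the survey: the `target`, `mutate` and `fuzz` functions, the hand-written branch labels used as a coverage signal, and the crash condition are all assumptions made for illustration. A real fuzzer would typically obtain coverage through instrumentation rather than from the target's return value.

```python
import random


def target(data: bytes) -> set:
    """Hypothetical toy target: returns the branch labels it reached and
    raises RuntimeError (a simulated crash) on inputs starting with b'FUZ'."""
    covered = {"entry"}
    if data[:1] == b"F":
        covered.add("F")
        if data[1:2] == b"U":
            covered.add("FU")
            if data[2:3] == b"Z":
                raise RuntimeError("simulated crash")
    return covered


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte of a non-empty seed."""
    buf = bytearray(seed)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(seeds, iterations, rng_seed=0):
    rng = random.Random(rng_seed)
    # Preprocessing: collect information about the target (here only the seed corpus).
    corpus = list(seeds)
    seen_branches = set()
    tried = set(corpus)
    crashes = []
    for _ in range(iterations):
        # Test case generation: choose a seed and mutate it.
        case = mutate(rng.choice(corpus), rng)
        # Input selection: skip test cases that were already executed.
        if case in tried:
            continue
        tried.add(case)
        # Test execution: run the target and watch for abnormal behavior.
        try:
            covered = target(case)
        except RuntimeError as exc:
            # Result analysis: record the crashing input and its cause.
            crashes.append((case, str(exc)))
            continue
        # Evaluation: keep inputs that reach new branches as future seeds.
        if not covered <= seen_branches:
            seen_branches |= covered
            corpus.append(case)
    return corpus, crashes


if __name__ == "__main__":
    corpus, crashes = fuzz([b"AAAA"], iterations=20000)
    print(len(corpus), "seeds kept,", len(crashes), "crashing inputs recorded")
```

Keeping inputs that reach new branches is the feedback step that distinguishes coverage-guided (gray-box) fuzzing from purely random input generation. The stages that the survey discusses as candidates for machine learning are the ones marked in the comments as test case generation, input selection and result analysis.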

determining the validity of generated test cases and the usability of seeds for mutation.
Furthermore, machine learning can reduce manual effort and minimize the time overhead of fuzzing.
Therefore, combining machine learning with fuzzing provides new ideas and methods for alleviating
the bottlenecks of traditional fuzzing techniques. How to balance the advantages of both to better
enhance vulnerability detection is still an area that requires further research. Against the
background of machine learning, this paper analyzes and reviews a large body of literature on the
combination of machine learning and fuzzing. Following the basic process of fuzzing, it introduces
various improved methods of fuzzing implementation based on different machine learning models,
comparing and analyzing their enhancements and improvements. Furthermore, it introduces the
performance gain of different machine learning methods for fuzzing and demonstrates the
effectiveness of machine learning for fuzzing improvement. Moreover, it identifies existing issues
in applying machine learning techniques to fuzzing and provides insights into future development
trends.

The primary contributions of our work can be summarized as follows:

(1) This paper refers to and examines a large amount of relevant literature and highlights the
latest research results in the past five years, which helps readers better grasp the future
direction of the fuzzing field. In addition, this paper analyzes and organizes research on fuzzing
in different areas, such as fuzzing in the Internet of Things, web applications, compilers and
deep learning models, which encompasses common areas where fuzzing can be used.

(2) This paper focuses on the workflow of fuzzing and introduces the application of machine
learning methods in four different stages: preprocessing, test case generation, input selection
and result analysis. It compares and contrasts various improvement techniques, explaining their
underlying technical principles and the resulting optimization enhancements. Finally, it provides
a comprehensive summary of the performance gains achieved through the utilization of machine
learning algorithms. This helps readers better understand the overall workflow of fuzzing and
carry out in-depth research.

(3) By comparing different improvement methods, the problems and challenges in this field are
analyzed and summarized, and possible research hotspots for fuzzing in the context of machine
learning are put forward.

Section II provides a brief overview of the basic process of fuzzing. Section III introduces the
application of machine learning techniques at different stages of fuzzing, comparing and analyzing
the strengths and technical principles of different fuzzing tools. Furthermore, in Section IV, the
performance gains to fuzzing from different machine learning approaches are analyzed. Next, the
challenges faced by existing fuzzing techniques are analyzed, and in Section V, the existing
solutions are presented together with insight into the future trends in the field. Finally,
Section VI summarizes and concludes the work presented in this paper.

II. OVERVIEW OF FUZZING
A. BASIC FLOW OF FUZZING
Fuzzing involves constructing a large number of illegal test inputs, fuzz testing the target
program, monitoring its execution, observing and recording any abnormal behavior, analyzing the
cause of abnormality or crashes, and finally detecting vulnerabilities. The basic flow of fuzzing
can be divided into six parts: preprocessing, test case generation, input selection, test
execution, evaluation and result analysis [8], as shown in Figure 1.

The preprocessing stage primarily involves collecting relevant information about the target
program and specifying the strategy for fuzzing to assist the fuzzing tool in detecting or
observing the target program. This stage typically relies on program analysis techniques such as
instrumentation [9], [10], symbolic execution [11], [12] and taint analysis [13], [14]. Existing
research efforts have focused on integrating one or more of these techniques into hybrid fuzzing
to improve overall performance. For example, Risk-AFL [10] proposes a risk-guided seed selection
method based on AFL. During program operation, the risk fitness of the seeds is calculated based
on the risky functions and function calls on the program execution path by means of
instrumentation, and the seed selection strategy of AFL is improved accordingly. Intriguer [11]
optimizes symbolic execution by utilizing field-level knowledge to more effectively simulate
symbolically relevant instructions. TaintPoint [14] applies to the seed mutation stage of general
fuzzing and obtains more accurate taint analysis results to guide mutation.

The test case generation phase mainly aims to obtain a large number of test inputs: based on the
relevant information obtained in the preprocessing phase, an appropriate generation method or
mutation strategy is selected to construct a large number of test cases that are suitable for the
target program. The test case generation phase consists of seed selection, mutation strategy
scheduling and test case generation. Seed selection is a process of evaluating the likelihood
that a seed could trigger a program anomaly and prioritizing