Using LLMs To Generate Fuzz Generators
LLMs seem surprisingly good at many things. So much so that not a week goes by
without someone coming up with yet another use-case for this technology, often to
quickly solve tasks that traditionally took a non-trivial amount of human work to complete.
This success is at first glance very pleasing: writing a good fuzzer for a non-trivial input
format is time-consuming. Being able to automate this process is therefore super
appealing. Brendan’s success here might even seem surprising. It certainly was to me
when I first saw his tweet. Having thought about it, with the benefit of hindsight, I’ve
started to understand why Brendan might have expected it to work in the first place.
I hold the opinion that we shouldn’t expect LLMs to be very good at static analysis. After
all, they are stochastic machines trained to produce the output that is statistically most
likely to come next given the LLM’s training data. In other words, to generate output
statistically “close to” the expected output. This is why they are known to be
imprecise, to hallucinate, and so on. But effective static analysis requires precision
above all else: if the static analyser tells you there is a bug in the code, it has to be very
sure the bug is real. Otherwise, developer time drowns in a sea of false bug reports.
An effective fuzzer generates semi-valid inputs that are “valid enough” in that they
are not directly rejected by the parser, but do create unexpected behaviors deeper
in the program and are “invalid enough” to expose corner cases that have not
been properly dealt with.
In other words, a fuzzer should generate inputs that are “close to” what is expected by
the program. Therefore, we might expect LLMs to be a better fit for generating a “close
enough” fuzzer for a given program.
I therefore decided to test Claude on an input format it could not have seen before.
Fortunately for me, I had one lying around. Eight years ago, in 2016, when I first started
teaching at Melbourne, I set my students an assignment in which I created a fictitious
input format and asked them to write fuzzers for it.
To make the assignment challenging, the input format was protected by a CRC32
checksum. Specifically, each input is a packet, whose structure looks like the following
and includes a (useless) sequence number, a two-byte length field, and a data payload
(whose maximum size is 4096 bytes):
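offset  size          field
0       4 bytes       CRC32 checksum
4       4 bytes       sequence number (useless)
8       2 bytes       data payload length
10      up to 4096 B  data payload

(The field offsets here match the constants in Claude's fuzzer code, quoted below.)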
The data payload consisted of a sequence of instructions (each one byte in size) for a
simple arithmetic stack machine that operated over a stack of signed integers (a bit like
the venerable UNIX dc utility). The machine instructions included ones to push small
integers (in the range [0,9]) onto the stack; to pop the stack; to ADD, SUBTRACT, MULTIPLY
or DIVIDE the two items on the top of the stack, popping them from the stack and
pushing the result; as well as operations to READ and WRITE to arbitrary positions in the
stack (identified using the top-of-stack operand as an offset downwards from the top of
the stack).
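For example (assuming, as the fuzzer code below suggests, that the ASCII digits '0'
through '9' push the corresponding small integer and 'a' denotes ADD), the three-byte
payload 37a would push 3, then push 7, then pop both and push their sum, 10.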
I wrote a parser for the input format and an interpreter for the stack machine language.
The C code that implements this functionality is under 400 SLOC. I seeded a range of
vulnerabilities in this code, namely:
A Heartbleed style out-of-bounds memcpy in the parsing code (blindly trusting the
input data’s length field to use as the length to pass to memcpy, even though the
source and destination buffers’ sizes are both 4096 bytes while the maximum 2-
byte length value is 65535);
Unchecked division-by-zero when the second argument to the DIVIDE instruction
was zero;
Unchecked signed overflow when stack operands grow too large due to repeated
application of arithmetic operations;
Out-of-bounds read due to unchecked offset in the implementation of the READ
instruction;
Out-of-bounds write similarly in the WRITE instruction.
I asked Claude Sonnet (the default, free Claude model) to generate a Python fuzzer for
the C code. The prompt I crafted gave Claude some hints about the code structure,
telling it that there was an outer parser implemented by the parse_packet function,
followed by an interpreter implemented by the execute function. My prompt (linked
above) included the relevant bits of the code along with these explanations and the
request to the model to generate the Python fuzzer.
Claude generated a pretty useful fuzzer, although I had to ask it to fix a type error in the
code it generated (which it did on the first try after I gave it the error message).
import random
import struct
import sys
import binascii

# Constants
MAX_DATA_LEN = 0xfff
PACKET_LEN = MAX_DATA_LEN + 10
CRC_OFFSET = 0
SEQ_NUM_OFFSET = 4
LEN_OFFSET = 8
DATA_OFFSET = 10

# Instructions
INSTRUCTION_ADD = ord('a')
INSTRUCTION_SUBTRACT = ord('s')
INSTRUCTION_READ = ord('r')
INSTRUCTION_MULT = ord('m')
INSTRUCTION_DIVIDE = ord('d')
INSTRUCTION_POP = ord('p')
INSTRUCTION_WRITE = ord('w')
INSTRUCTIONS = [INSTRUCTION_ADD, INSTRUCTION_SUBTRACT, INSTRUCTION_READ,
                INSTRUCTION_MULT, INSTRUCTION_DIVIDE, INSTRUCTION_POP,
                INSTRUCTION_WRITE]

# Generate packets
for i in range(10000):
    # Generate packet data: random bytes up to the reported length,
    # then padding made entirely of valid instruction bytes
    data_len = random.randint(0, MAX_DATA_LEN)
    data = bytearray(random.randrange(256) for _ in range(data_len))
    padding = [random.choice(INSTRUCTIONS + list(range(ord('0'), ord('9') + 1)))
               for _ in range(MAX_DATA_LEN - data_len)]
    data += bytes(padding)
    seq_num = random.randint(0, 0xffffffff)  # sequence number (the program ignores it)

    # Construct packet
    packet = bytearray(PACKET_LEN)
    struct.pack_into(">I", packet, CRC_OFFSET, 0)  # Placeholder for CRC32
    struct.pack_into(">I", packet, SEQ_NUM_OFFSET, seq_num)
    struct.pack_into(">H", packet, LEN_OFFSET, data_len)
    packet[DATA_OFFSET:] = data

    # Fill in the real CRC32 (assumed here to cover everything after the
    # checksum field) and write the packet out (output scheme assumed:
    # one file per packet)
    crc = binascii.crc32(packet[SEQ_NUM_OFFSET:]) & 0xffffffff
    struct.pack_into(">I", packet, CRC_OFFSET, crc)
    with open("packet-%d.bin" % i, "wb") as f:
        f.write(packet)
I compiled the C program with Clang’s Address Sanitizer and Undefined Behaviour
Sanitizer turned on. Running the Python code to produce inputs for the C program, and
then running the compiled C program on the inputs produced, I found that Claude’s
fuzzer is able to trigger only the out-of-bounds read and write bugs.
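Concretely, a build with both sanitizers enabled looks something like this (file name
illustrative):

clang -g -fsanitize=address,undefined -o vuln vuln.c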
However, the fuzzer is unable to trigger the other bugs. I shared this code with Brendan
who also played with it and got similar results.
Notice how the packet data it generates is totally random bytes. It then generates what it
calls padding, which contains entirely valid instructions. So of course the correctly-
formatted instruction data is hidden behind a mass of noise.
Instead, the roles of the data and padding should be swapped, e.g. along these lines:
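    # One possibility: make the counted data valid instructions, and the
    # trailing padding random noise (the reverse of what Claude generated)
    data = bytearray(random.choice(INSTRUCTIONS + list(range(ord('0'), ord('9') + 1)))
                     for _ in range(data_len))
    padding = bytes(random.randrange(256) for _ in range(MAX_DATA_LEN - data_len))
    data += padding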
(With this change, the fuzzer can trigger the division-by-zero. However, the other bugs
remain out of reach, because the out-of-bounds read and write are far too easy to trigger,
making the odds of triggering the others without first triggering one of those
incredibly small. This is not the fuzzer’s fault, per se, though it is an inherent limitation
of fuzzing in this manner.)
Also, Claude’s fuzzer always correctly reports the packet length in the len field:
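    struct.pack_into(">H", packet, LEN_OFFSET, data_len)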
This means the Heartbleed-style vulnerability in the parser can never be triggered (and
also means the correctly formatted instruction data in the padding is ignored by the
interpreter, so that correctly formatted input has no ability to trigger bugs in the code).
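A simple tweak (hypothetical, not something Claude generated) would be to occasionally
over-report the length:

    # Occasionally lie about the payload length, to exercise the parser's
    # unchecked memcpy (hypothetical tweak, not part of Claude's fuzzer)
    reported_len = data_len
    if random.random() < 0.25:
        reported_len = random.randint(data_len + 1, 0xffff)
    struct.pack_into(">H", packet, LEN_OFFSET, reported_len)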
I then asked Claude to directly identify the vulnerabilities in the C code, i.e., to statically
analyse it. Note: doing so isn’t very scientific. We are comparing a method (fuzzing) in
which false alarms are impossible (and therefore all alerts are useful information) against
another (static analysis by LLM) in which it is very difficult to trust the generated bug
alerts. So we are hardly comparing apples with apples. But let’s proceed anyway.
So Claude found 3 of the 5 seeded vulnerabilities, plus one additional report: an
uninitialised stack. Whether you consider the uninitialised stack a real vulnerability is
debatable. It matters only in the presence of the out-of-bounds read vulnerability
(because without that, the uninitialised portions of the stack can never be accessed).
On balance I therefore consider the uninitialised stack a false positive.
So, two take-aways:
1. LLMs seem interestingly good at analysing code and writing a fuzzer that generates
“close enough” inputs to exercise that code and find bugs, though they have some
limitations.
2. LLMs can do static analysis, but suffer from false positives and so on.
This suggests, as my colleague Thuan Pham notes, that perhaps we should consider
combining the two approaches (sketched below):
1. Ask the LLM to identify vulnerabilities in the code (i.e., to statically analyse it).
2. For each vulnerability the LLM identifies, ask it to generate a directed fuzzer that
generates inputs to try to trigger (just) that vulnerability.
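As a very rough sketch, the pipeline might look like the following (everything here is
hypothetical: ask_llm stands in for whichever LLM API one uses, and the prompts are
illustrative only):

# Hypothetical two-stage pipeline: LLM static analysis, then LLM-generated
# directed fuzzers, one per reported vulnerability.
def ask_llm(prompt: str) -> str:
    """Stand-in for a call to your LLM of choice."""
    raise NotImplementedError

with open("target.c") as f:
    source = f.read()

# Stage 1: ask the LLM to enumerate suspected vulnerabilities.
findings = ask_llm(
    "List the vulnerabilities in the following C code, one per line:\n" + source)

# Stage 2: for each suspected vulnerability, ask for a directed fuzzer
# whose generated inputs aim to trigger (just) that vulnerability.
for finding in findings.splitlines():
    fuzzer_code = ask_llm(
        "Write a Python program that generates inputs for the following "
        "C code, designed to trigger this vulnerability: "
        + finding + "\n" + source)
    # Each generated fuzzer would then be run against the
    # sanitiser-instrumented target to try to confirm the finding.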
A quick experiment with Claude suggests this approach could be promising (with some
prompting, Claude was able to generate a program to generate an input to trigger the
Heartbleed-style vulnerability mentioned above). But of course further work is needed to
validate this approach and work out what challenges need to be overcome to make it
practical (if indeed it can be made practical).
That’s certainly more work than can be squeezed into the odd free moment on a
heatwave weekend.
Conclusion
What are we to make of this, yet another unscientific experiment with an LLM?
In terms of fuzzing, LLMs have been used to generate fuzz drivers and there is much
interest in that topic at present (which is distinct from using them to generate stand-
alone fuzzers, the topic of this blog post). Indeed, DARPA sees so much potential in
using LLMs for vulnerability discovery, exploitation and patching that last year it decided
to revisit its 2016 Cyber Grand Challenge, launching the AIxCC competition, to
understand the impact of LLM-related technology on these tasks.
With all this in mind, I’m curious to see whether any of the AIxCC competitors attempt to
automate vulnerability discovery by fuzz generator generation (i.e., the topic of this
post), [Updated: Sun 10 Mar 2024 14:15:51 AEDT to add] whether directed or not. We
should certainly expect many to try fuzz-driver generation and static-analysis via LLM.
Like so many other experiments with LLMs, this one served to simultaneously surprise
and disappoint, [Updated: Sun 10 Mar 2024 14:15:51 AEDT to add] though I remain
optimistic about the value of exploring this approach further.
In the meantime, many thanks of course to Brendan, whose work inspired this post.