Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories
Alperen Yildiz¹, Sin G. Teo², Yiling Lou³, Yebo Feng⁴, Chong Wang⁴*, Dinil M. Divakaran²
¹ Nanyang Technological University, Singapore
² Agency for Science, Technology and Research (A*STAR), Singapore
³ Fudan University, China
⁴ Nanyang Technological University, Singapore
You are a security researcher tasked with identifying vulnerabilities in a codebase. You have been given a function to analyze.
The function may or may not be vulnerable.
If you think it is vulnerable reply with @@VULNERABLE@@, otherwise reply with @@NOT VULNERABLE@@
If you think the function is vulnerable, please provide the CWE number that you think is most relevant to the vulnerability in
the form of @@CWE: <CWE_NUMBER>@@
For example:
@@VULNERABLE@@
@@CWE: CWE-1234@@
[FS Examples]
Example Detections:
- Vulnerable Example 1
- Benign Example 1
…
- Vulnerable Example 10
- Benign Example 10
[CoT Instruction]
Solve this problem step by step. Carefully break down the reasoning process to arrive at the correct solution. Explain your
reasoning at each step before providing the final answer.
[ReAct Agent diagram: Action (Tools on the Code Repository): get_callers, get_callees, get_definition; Observation, e.g., “The callers of X are Y”]
Answer the following questions as best you can. You have access to the following tools:
- get_callers: Get callers for a function, function names are returned
- get_callees: Get callees for a function, function names are returned
- get_definition: Get definition code of a function based on the function name.
Begin!
Question: {input}
Thought:{agent_scratchpad}
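The three tools named in the prompt can be pictured as lookups over a call graph. The following is a minimal sketch under stated assumptions, not the implementation used in the paper: the `CodeRepository` class, its in-memory dictionaries, the toy function bodies, and the `observe_callers` helper are all hypothetical. Only the tool names, their described return values, and the observation wording come from the prompt and diagram above.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRepository:
    """Hypothetical in-memory index of a repository (illustration only)."""

    # function name -> source text of its definition
    definitions: dict[str, str] = field(default_factory=dict)
    # function name -> names of the functions it calls
    calls: dict[str, list[str]] = field(default_factory=dict)

    def get_callers(self, name: str) -> list[str]:
        """Return the names of functions that call `name`."""
        return sorted(f for f, callees in self.calls.items() if name in callees)

    def get_callees(self, name: str) -> list[str]:
        """Return the names of functions that `name` calls."""
        return sorted(self.calls.get(name, []))

    def get_definition(self, name: str) -> str:
        """Return the definition code of `name`, or a placeholder message."""
        return self.definitions.get(name, f"<no definition found for {name}>")


# Toy repository: Y calls X, and X calls Z.
repo = CodeRepository(
    definitions={
        "X": "void X(void) { Z(); }",
        "Y": "void Y(void) { X(); }",
        "Z": "void Z(void) {}",
    },
    calls={"Y": ["X"], "X": ["Z"]},
)


def observe_callers(name: str) -> str:
    """Format a tool result as the observation text fed back to the agent."""
    callers = repo.get_callers(name)
    return f"The callers of {name} are {', '.join(callers) or 'none'}"


print(observe_callers("X"))
```

In a ReAct-style loop, each such observation would be appended to the scratchpad before the model produces its next thought.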
Vulnerable Version

# Code
char* trimTrailingWhitespace(char *strMessage, int length) {
    char *retMessage;
    char *message = malloc(sizeof(char)*(length+1));
    // copy input string to a temporary string
    char message[length+1];
    int index;
    for (index = 0; index < length; index++) {
        message[index] = strMessage[index];
    }
    message[index] = '\0';
    // trim trailing whitespace
    int len = index-1;
    while (isspace(message[len])) {
        message[len] = '\0';
        len--;
    }
    // return string without trailing whitespace
    retMessage = message;
    return retMessage;
}

# Explanation
In the code, a utility function is used to trim trailing whitespace from a
character string. The function copies the input string to a local character
string and uses a while statement to remove the trailing whitespace by
moving backward through the string and overwriting whitespace with a
NULL character. However, this function can cause a buffer underwrite
if the input character string contains all whitespace. On some systems
the while statement will move backwards past the beginning of a
character string and will call the `isspace()` function on an address
outside of the bounds of the local buffer.

Benign Version

# Code
char* trimTrailingWhitespace(char *strMessage, int length) {
    char *retMessage;
    char *message = malloc(sizeof(char)*(length+1));
    // copy input string to a temporary string
    char message[length+1];
    int index;
    for (index = 0; index < length; index++) {
        message[index] = strMessage[index];
    }
    message[index] = '\0';
    // trim trailing whitespace
    int len = index-1;
    while (len >= 0 && isspace(message[len])) {
        message[len] = '\0';
        len--;
    }
    // return string without trailing whitespace
    retMessage = message;
    return retMessage;
}

# Explanation
In the code, a utility function is used to trim trailing whitespace from a
character string. The function copies the input string to a local character
string and uses a while statement to remove the trailing whitespace by
moving backward through the string and overwriting whitespace with a
NULL character. This function avoids a buffer underwrite by
incorporating the boundary check `len >= 0` in the while loop
condition. This ensures that the loop does not move past the beginning
of the character string or call the `isspace()` function on an address
outside the bounds of the buffer.
Figure 5: An FS example of “CWE-787: Out-of-bounds Write”, including both the vulnerable and the benign
version.
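To make the difference between the two versions in Figure 5 concrete, the sketch below simulates which buffer indices the trimming loop passes to `isspace()`. It is an illustrative Python model, not a translation of the C code: the function name `trim_indices` and the `bounds_check` flag are hypothetical, and an index of -1 stands for an access before the start of the buffer, where real C behaviour is undefined and the simulation simply stops.

```python
def trim_indices(message: str, bounds_check: bool) -> list[int]:
    """Return the indices the trimming loop would inspect, in order.

    The loop starts at the last character and walks backwards while it
    sees whitespace. With bounds_check=True the loop also tests
    `pos >= 0` first, as in the benign version.
    """
    visited = []
    pos = len(message) - 1
    while True:
        if bounds_check and pos < 0:
            break  # benign version: stop before leaving the buffer
        visited.append(pos)
        if pos < 0:
            break  # out-of-bounds access reached; stop the simulation here
        if not message[pos].isspace():
            break  # found a non-whitespace character, trimming is done
        pos -= 1
    return visited


print(trim_indices("ab  ", bounds_check=False))  # ordinary input
print(trim_indices("   ", bounds_check=False))   # all whitespace, no check
print(trim_indices("   ", bounds_check=True))    # all whitespace, with check
```

For ordinary input both variants inspect the same indices. Only the all-whitespace input makes the unchecked loop reach index -1, which matches the explanation given for the vulnerable version.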
B Tool Invocation Distribution
Figure 6 illustrates the distribution of tool invocations for the ReAct Agent with GPT-4o and vanilla prompting.
The data show that, in most cases, the ReAct Agent invokes the tools one to three times to retrieve the
necessary callers or callees.
Figure 6: Distribution of tool invocations for ReAct Agent with GPT-4o and vanilla prompting.
C Llama-3.1 Results
Table 3 presents the results of Llama3.1-8B on JitVul. ReAct Agents using Llama3.1-8B show significantly
lower performance, with the execution process often failing due to formatting and parsing issues.
As a result, the agents frequently default to the ben (benign) label.
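The effect of formatting failures can be illustrated with a small response parser for the `@@VULNERABLE@@` / `@@NOT VULNERABLE@@` / `@@CWE: <CWE_NUMBER>@@` markers requested in the prompt. This is a sketch under assumptions rather than the evaluation code used in the paper: the function name `parse_response`, the `vul` label string, and the choice to fall back to `ben` when no marker is found are illustrative guesses at how such a default could arise.

```python
import re

# Markers requested by the prompt. "NOT VULNERABLE" is listed first so the
# alternation reads naturally; the surrounding "@@" keeps the two apart.
LABEL_RE = re.compile(r"@@(NOT VULNERABLE|VULNERABLE)@@")
CWE_RE = re.compile(r"@@CWE:\s*(CWE-\d+)@@")


def parse_response(text: str, default_label: str = "ben"):
    """Return (label, cwe, parsed_ok) extracted from a model response.

    If no label marker is present, the response counts as a parse failure
    and the (assumed) default label is returned.
    """
    label_match = LABEL_RE.search(text)
    if label_match is None:
        return default_label, None, False
    label = "vul" if label_match.group(1) == "VULNERABLE" else "ben"
    cwe_match = CWE_RE.search(text) if label == "vul" else None
    return label, (cwe_match.group(1) if cwe_match else None), True


print(parse_response("@@VULNERABLE@@\n@@CWE: CWE-1234@@"))
print(parse_response("The function looks risky but I am not sure."))
```

Under this assumed fallback, a model that correctly reasons about a vulnerability but omits the markers is still scored as predicting the benign label.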
Method            F1      pAcc
Plain LLM
  - vanilla       58.05    0.84
  - w/ CoT        49.79   10.92
  - w/ FS         54.48    1.68
  - w/ CoT+FS     29.55   14.29
Dep-Aug LLM
  - vanilla       40.48   15.17
  - w/ CoT        21.18    8.39
  - w/ FS         27.37   10.88
  - w/ CoT+FS     16.46    7.42
ReAct Agent
  - vanilla        9.09    4.20
  - w/ CoT        14.67    3.36
  - w/ FS          3.28    0.84
  - w/ CoT+FS      3.28    1.68
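The gap between F1 and pAcc in some rows can be easier to read with a worked example. The sketch below assumes that pAcc denotes pairwise accuracy, meaning the fraction of vulnerable/benign pairs for which both versions are labeled correctly, and that F1 is computed on the vulnerable class; both definitions are assumptions here, since this section does not restate them. The function name and the toy predictions are hypothetical, and the sketch returns fractions whereas the table appears to report percentages.

```python
def f1_and_pacc(pairs):
    """Compute (F1, pairwise accuracy) from per-pair predictions.

    Each element of `pairs` is (pred_on_vulnerable, pred_on_benign), where
    True means the method predicted "vulnerable" for that version.
    """
    tp = sum(1 for pred_vul, _ in pairs if pred_vul)   # vulnerable flagged
    fn = len(pairs) - tp                               # vulnerable missed
    fp = sum(1 for _, pred_ben in pairs if pred_ben)   # benign flagged
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    # A pair counts only if the vulnerable version is flagged AND the
    # benign version is not.
    correct_pairs = sum(1 for pv, pb in pairs if pv and not pb)
    pacc = correct_pairs / len(pairs) if pairs else 0.0
    return f1, pacc


# Toy predictions for four pairs: the method flags most inputs as vulnerable.
toy_pairs = [(True, False), (True, True), (False, False), (True, True)]
f1, pacc = f1_and_pacc(toy_pairs)
print(f"F1 = {f1:.3f}, pAcc = {pacc:.2f}")
```

Under these assumed definitions, a method that flags most inputs as vulnerable can keep a moderate F1 while distinguishing very few pairs, which is one way a row with a high F1 and a low pAcc could arise.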
D Case Study
We provide several examples to illustrate the inputs and outputs of the detection methods for a better
understanding of the analysis.
D.1 CVE-2019-15164
Figure 7 illustrates the case study derived from CVE-2019-15164 (details at
https://nvd.nist.gov/vuln/detail/CVE-2019-15164), with the left side showing the vulnerable code
and the detection methods’ responses, and the right side depicting the benign version and its
corresponding responses. The vulnerable version of the function daemon_msg_open_req is susceptible
to a “CWE-918: Server-Side Request Forgery (SSRF)” vulnerability because source, which is read
from the network socket, is not validated before the device is opened. The benign version
addresses this vulnerability by adding an if-condition that checks whether source is a URL and
jumps to the error path if so, as highlighted in the figure.
Label Predictions. When using Plain LLM with GPT-4o and vanilla prompting, the analyses of
both the vulnerable and benign versions focus on buffer operations and misclassify the benign version as
vulnerable. In contrast, when using the ReAct Agent, the predictions for both versions are correct.
The agent is able to retrieve and analyze additional context, such as the caller function
daemon_serviceloop and surrounding function bodies. This contextual information enables the
agent to better comprehend how the daemon_msg_open_req function is used within the broader codebase
and recognize the risk introduced by the unvalidated URL input. Key points in the analysis process
are highlighted to show the improved detection capability provided by the ReAct Agent.
CWE Predictions. However, upon examining the specific vulnerability categories predicted by Plain
LLM and ReAct Agent, some fine-grained issues emerge. Plain LLM incorrectly predicts “CWE-120:
Buffer Copy without Checking Size of Input” for both the vulnerable and benign versions, which is
entirely inaccurate. On the other hand, ReAct Agent predicts “CWE-20: Improper Input Validation”
for the vulnerable version. While this is not the correct classification, it is somewhat related to the
ground-truth vulnerability of Server-Side Request Forgery (SSRF). The SSRF vulnerability arises from
the improper validation of the source parameter before opening the device, which the ReAct Agent’s
prediction partially captures, indicating a closer alignment with the actual issue.
Analysis Patterns. When delving into the detailed analysis processes, we observe that the ReAct
Agent does not maintain consistent analysis patterns across both versions. For the vulnerable version,
the agent focuses on buffer operations and input validation, while for the benign version, it conducts a
more comprehensive check. However, in this case, the analysis patterns should be more similar, suggesting
that the LLM behind the ReAct Agent lacks sufficient robustness to capture the actual vulnerability
characteristics. This indicates a deficiency in its capability for accurate vulnerability reasoning.
D.2 CVE-2019-3877
Figure 8 illustrates the case study derived from CVE-2019-3877 (details at
https://nvd.nist.gov/vuln/detail/CVE-2019-3877), with the left side showing the vulnerable code
and the detection methods’ responses, and the right side depicting the benign version and its
corresponding responses. The vulnerable version of the function am_check_url is susceptible to a “CWE-601:
URL Redirection to Untrusted Site” vulnerability due to insufficient validation of url. The benign
version addresses this vulnerability by adding an if-condition that checks whether a backslash exists in
url.
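The kind of check described for the benign version can be sketched in a few lines. This is a Python illustration only, not the C code of am_check_url, which is not reproduced in this section: the function name `contains_backslash` and the example URLs are hypothetical. The motivation assumed here is that browsers commonly treat a backslash in an http(s) URL like a forward slash, so a URL that looks harmless to a naive validator can still redirect to another host.

```python
def contains_backslash(url: str) -> bool:
    """Return True when the URL contains a backslash and should be rejected."""
    return "\\" in url


# Hypothetical examples: the first is an ordinary URL, the second uses
# backslashes where a browser would normally expect forward slashes.
for candidate in ["https://sp.example/logout", "https:\\\\attacker.example"]:
    verdict = "reject" if contains_backslash(candidate) else "accept"
    print(f"{candidate!r}: {verdict}")
```

A detector that reasons about this function therefore needs to notice the absence of such a check in the vulnerable version, rather than the presence of unrelated validation.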
Label Predictions. Plain LLM with GPT-4o and vanilla prompting produces incorrect predictions,
often reversing the labels.
Analysis Patterns. A closer look at the analysis process reveals that Plain LLM fails to identify the
insufficient validation of url in the vulnerable version, where an assert is highlighted. In contrast,
it over-analyzes the validation requirements for the benign version. Together, these lead to incorrect
predictions for both versions.
Vulnerable Version
static int daemon_msg_open_req(uint8 ver, struct daemon_slpars *pars, uint32 plen, char *source, size_t sourcelen) {
    …
    nread = sock_recv(pars->sockctrl, source, plen,…);
    …
    source[nread] = '\0';
    plen -= nread;
    …
    // Open the selected device
    …
}
CWE-918: Server-Side Request Forgery (SSRF)
The function daemon_msg_open_req is vulnerable to a buffer overflow due to
improper handling of the source buffer. The issue arises because the function reads
data into source using sock_recv without ensuring that the buffer is large enough to
accommodate the incoming data. Although the function includes a check at the
beginning to ensure that plen is not greater than sourcelen - 1, this does not fully
prevent an overflow when sock_recv writes to source.
…

Benign Version
static int daemon_msg_open_req(uint8 ver, …, char *source, size_t sourcelen) {
    …
    nread = sock_recv(pars->sockctrl, source, plen,…);
    …
    source[nread] = '\0';
    plen -= nread;
    …
    if (is_url(source)){
        pcap_snprintf(errmsgbuf, PCAP_ERRBUF_SIZE, "Source …");
        goto error;
    }
    // Open the selected device
    …
}
The function daemon_msg_open_req has a buffer overflow vulnerability due to
improper handling of the source buffer. The issue arises because the function reads
data into source using sock_recv without ensuring that the buffer is large enough to
store the incoming data. While there is a check to ensure that plen is not greater than
sourcelen - 1, this does not fully prevent an overflow when sock_recv writes to source.
If plen exceeds sourcelen, the function could still write beyond the buffer’s limits,
causing a buffer overflow.
…
Given these points, the function is vulnerable due to insufficient input validation,
which aligns with CWE-20 (Improper Input Validation).