
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories

Alperen Yildiz1, Sin G. Teo2, Yiling Lou3, Yebo Feng4, Chong Wang4*, Dinil M. Divakaran2
1 Nanyang Technological University, Singapore
2 Agency for Science, Technology and Research (A*STAR), Singapore
3 Fudan University, China
4 Nanyang Technological University, Singapore
* Chong Wang is the corresponding author.

arXiv:2503.03586v1 [cs.CR] 5 Mar 2025

Abstract

Large Language Models (LLMs) have shown promise in software vulnerability detection, particularly on function-level benchmarks like Devign and BigVul. However, real-world detection requires interprocedural analysis, as vulnerabilities often emerge through multi-hop function calls rather than isolated functions. While repository-level benchmarks like ReposVul and VulEval introduce interprocedural context, they remain computationally expensive, lack pairwise evaluation of vulnerability fixes, and explore limited context retrieval, limiting their practicality.

We introduce JITVul, a JIT vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. Built from 879 CVEs spanning 91 vulnerability types, JITVul enables comprehensive evaluation of detection capabilities. Our results show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code. While prompting strategies like Chain-of-Thought help LLMs, ReAct Agents require further refinement. Both methods show inconsistencies, either misidentifying vulnerabilities or over-analyzing security guards, indicating significant room for improvement.

Table 1: Comparison of JITVul with existing benchmarks for repository-level vulnerability detection.

Benchmark        # CVEs   # CWEs   Pairwise   Agents Eval
ReposVul         6,134    236      ✗          ✗
VulEval          4,196    5        ✗          ✗
JITVul (ours)    879      91       ✓          ✓

1 Introduction

Given the success of large language models (LLMs) across various application domains, researchers have begun exploring their effectiveness in software vulnerability detection. On well-known vulnerability detection benchmarks such as Devign (Zhou et al., 2019) and BigVul (Fan et al., 2020), LLMs—particularly those fine-tuned on code—have shown promising results, suggesting their potential for real-world applications.

However, a significant gap exists between these widely used benchmarks and the requirements for real-world vulnerability detection in code repositories (Wang et al., 2024; Wen et al., 2024b). These benchmarks primarily focus on function-level vulnerability detection, where a single function is input to a detector for label prediction without considering the broader repository context. In contrast, real-world vulnerabilities—such as null pointer dereference (NPD)—often arise within multi-hop function call chains, and not in isolated functions. Detecting such vulnerabilities requires tracing interprocedural call relationships and understanding the relevant code elements like branch conditions (Risse and Böhme, 2024a).

To address these limitations, recent studies have shifted towards repository-level detection scenarios, enabling more realistic benchmarking of LLMs for vulnerability detection. ReposVul (Wang et al., 2024) and VulEval (Wen et al., 2024b) are two benchmarks that enhance the detection process by extracting callers and callees for a target function from the code repository, providing interprocedural context. These callers (functions that call the target function) and callees (functions called by the target function) are selectively fed into LLMs to assess the vulnerability of the target function. Findings from these benchmarks show that while LLMs benefit from the additional interprocedural context, they still exhibit low effectiveness in real-world scenarios, particularly when fine-tuning is not applied.
Although existing works have provided valuable insights into the effectiveness of LLMs for repository-level vulnerability detection, several key limitations remain in achieving more comprehensive benchmarking. First, in existing repository-level vulnerability detection approaches, all functions within a code repository are treated as target functions for vulnerability detection. This approach becomes computationally expensive and impractical, particularly for large repositories like the Linux kernel. Second, the benchmarks do not effectively assess the capability of LLMs in distinguishing between vulnerable functions and those where the vulnerability has been patched. As highlighted by a recent study (Risse and Böhme, 2024b), this is a critical limitation of machine learning-based vulnerability detection methods. Finally, the integration of interprocedural context (i.e., callers and callees) has mostly been limited to retrieval-based strategies, leaving many potential approaches underexplored. LLM-based agentic methods, such as ReAct (Yao et al., 2022), offer the potential for on-demand, iterative acquisition of interprocedural context, enabling more adaptive analysis.

To bridge the gap, we target the task of just-in-time (JIT) vulnerability detection (Lomio et al., 2022), a more practical approach for identifying vulnerabilities in code repositories. Unlike prior methods that analyze all functions in a repository, JIT vulnerability detection is triggered only for functions modified in a commit, with interprocedural context provided. Inspired by prior research (Risse and Böhme, 2024b), we construct a pairwise benchmark called JITVul for JIT vulnerability detection, where each target function is linked to both a vulnerability-introducing commit and a vulnerability-fixing commit. To achieve this, we first select 879 Common Vulnerabilities and Exposures (CVE) entries, each representing a unique vulnerability, from PrimeVul, a high-quality function-level detection dataset (Ding et al., 2024). We then extract target functions from the vulnerability-fixing commits explicitly referenced in the CVE entries, obtaining both their vulnerable and patched versions. Finally, we analyze the commit history of each vulnerable function to identify the corresponding vulnerability-introducing commit. The resulting JITVul comprises 1,758 paired commits spanning 91 Common Weakness Enumerations (CWEs).

We implement LLMs and ReAct Agents with various prompting strategies and foundation models to assess their effectiveness in JIT vulnerability detection. Our evaluation on JITVul uncovers several key findings. A higher F1 score doesn't always reflect a method's ability to capture vulnerability characteristics, highlighting the need for pairwise evaluation. ReAct Agents, using their thought-action-observation framework and interprocedural context, better differentiate between vulnerable and benign versions. While strategies like CoT and few-shot examples boost LLMs' performance, ReAct Agents need more tailored designs. Both LLMs and ReAct Agents often exhibit inconsistent analysis patterns between the vulnerable and benign versions, indicating a lack of robustness in vulnerability analysis. These findings highlight key areas for future research in vulnerability detection: (i) developing more comprehensive evaluation guidelines for benchmarking LLMs and LLM-based agents in JIT vulnerability detection, (ii) exploring advanced prompting strategies for improving LLM-based agents in vulnerability detection, and (iii) designing robust reasoning models tailored to vulnerability analysis that capture the true essence of vulnerabilities, rather than relying on speculation.

This paper makes the following contributions:

• We introduce JITVul, a benchmark for just-in-time (JIT) vulnerability detection in code repositories, consisting of 1,758 pairwise commits spanning 91 vulnerability types.

• We implement ReAct agents with various prompting strategies and foundation models and evaluate their effectiveness in leveraging interprocedural context for JIT detection.

• Our experimental results provide valuable insights into the application of LLMs and LLM-based agents for real-world vulnerability detection. We explore the necessity of pairwise evaluation, the advantages and disadvantages of LLMs and ReAct agents, and the impact of prompting designs and foundation models.

• We release all code and data at this repository.

2 Related Work

2.1 Vulnerability Detection Benchmarks

Several benchmarks have been proposed for function-level vulnerability detection.
BigVul (Fan et al., 2020) collects C/C++ vulnerabilities from the CVE database, filtering out entries without publicly available Git repositories, and labels functions as vulnerable or non-vulnerable based on commit fixes. MegaVul (Ni et al., 2024) improves upon existing benchmarks by using code parsing tools for accurate function extraction and de-duplicating functions referenced by multiple CVEs. DiverseVul (Chen et al., 2023) ensures data quality by filtering vulnerability-introducing commits with specific keywords and deduplicating function bodies with hash functions. PrimeVul (Ding et al., 2024) addresses data quality challenges by proposing filtering rules to handle noisy labels and duplicated functions.

For repository-level vulnerability detection, ReposVul (Wang et al., 2024) addresses issues with tangled and outdated patches, using trace-based filtering to ensure data quality and integrating repository-level features to provide richer context for detection. VulEval (Wen et al., 2024b) provides a framework that collects high-quality data from sources like the Mend.io Vulnerability Database (Mend.io, 2025) and the National Vulnerability Database (NIST, 2025), including contextual information like caller-callee relationships.

2.2 LLM-based Vulnerability Detection

Recent studies have explored the use of Large Language Models (LLMs) for vulnerability detection, highlighting their ability to enhance both the identification and explanation of software vulnerabilities. LLM4SA (Wen et al., 2024a) integrates language models with SAST tools, leveraging LLMs to inspect static analysis warnings and significantly reduce false positives. LLM4Vuln (Sun et al., 2024) enhances LLM execution by incorporating more context through a retrieval-augmented generation (RAG) pipeline and static analysis. LSAST (Keltek et al., 2024) further explores context augmentation, employing multiple RAG pipelines to compare the effectiveness of retrieval augmentation using static analysis outputs, vulnerability reports, and code abstraction. Similarly, Vul-RAG (Du et al., 2024) constructs a vector database of vulnerability reports alongside a language model engine. Zhou et al. (2024b) introduce a voting mechanism that combines SAST tools and LLMs for vulnerability detection.

2.3 LLMs and LLM-based Agents

Various methods have been explored to enhance LLM performance, with prompt augmentation being a key approach that enriches prompts to improve the model's reasoning process. Chain-of-thought (CoT) prompting (Wei et al., 2022) is one of the most widely used techniques, where instructions like "Let's think step by step" guide the model to break problems into sub-problems. Few-shot prompting (Brown et al., 2020) is another common method, providing example traces to enable in-context learning without modifying model weights. Both CoT and few-shot prompting are frequently employed in LLM-based vulnerability detection (Zhou et al., 2024b; Wen et al., 2024a).

Agentic architectures are among the most promising state-of-the-art technologies but remain underexplored in vulnerability detection (Zhou et al., 2024a). Yao et al. (2022) introduce Reasoning and Acting (ReAct) agents, which generate reasoning and action traces in an interleaved manner to interact with their environment and analyze the resulting observations. This iterative process continues until the agent determines a final answer. Reflexion agents (Shinn et al., 2023) are similar to ReAct agents; however, they focus on self-reflection and dynamic memory updates alongside reinforcement learning. Self-refine agents (Madaan et al.) are another type of agent in which the same language model is instructed to provide feedback on its own output. There have also been several multi-agent systems, such as AlphaCodium (Ridnik et al.), where the agents are represented as nodes in a graph.

3 JITVul: Just-in-Time Vulnerability Detection for Code Repositories

In this section, we discuss the requirements of benchmarking LLMs and LLM-based agents for repository-level vulnerability detection and formulate the task of just-in-time (JIT) vulnerability detection. We also present a benchmark for JIT detection, derived from real-world vulnerabilities.

3.1 Problem Statement

Benchmarking vulnerability detection in real-world code repositories requires considering three key practicality requirements:
• Interprocedural Context. Many vulnerabilities originate from interprocedural interactions, even though their manifestation and required fixes often occur within individual functions (Wang et al., 2024). For instance, a null pointer dereference (NPD) vulnerability may arise when a pointer initialized as null in one function is improperly dereferenced in another function along the execution path. Identifying such vulnerabilities necessitates analyzing interprocedural dependencies, as examining functions in isolation is insufficient.

• Scalability. A straightforward application of learning-based methods to vulnerability detection for code repositories entails scanning each function individually and predicting a binary label. However, in the context of LLMs and LLM-based agents, this approach becomes computationally infeasible for large-scale repositories due to the high processing costs and resource constraints associated with analyzing extensive codebases. A more practical strategy is to focus on a limited set of candidate functions. Just-in-time vulnerability detection (Lomio et al., 2022) exemplifies this by prioritizing functions that have been newly introduced or modified in commits.

• Pairwise Comparison. Traditional evaluation methods for machine learning-based vulnerability detection present models with labeled vulnerable code alongside other functions in the repository. However, recent findings (Risse and Böhme, 2024b) suggest that models struggle to distinguish between vulnerable code and its patched, benign version, indicating an over-reliance on superficial patterns rather than meaningful vulnerability indicators. This highlights the necessity of pairwise benchmarking to ensure reliable evaluation of vulnerability detection methods.

Although recent works address some of these requirements, to the best of our knowledge, no study fully satisfies all three. For example, PrimeVul (Ding et al., 2024) is a high-quality dataset that provides pairwise evaluation for LLMs, but it focuses on function-level detection and does not account for interprocedural context. On the other hand, VulEval (Wen et al., 2024b) is a repository-level detection benchmark, but it lacks pairwise evaluation and considers only callers and callees modified in the same commit as the relevant interprocedural context when extracting dependencies—an assumption that is not always valid.

3.2 Task Definition

We define the task of just-in-time vulnerability detection as follows. Given a code repository R and a target function f modified in a commit, the task is formulated as:

    JITDetect: (R, f) → {vul, ben},

where f is classified as either vulnerable (vul) or benign (ben) based on the (interprocedural) context within R. Building on this, we propose a pairwise benchmark to evaluate LLMs and LLM-based agents.
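To make the task and the pairwise protocol concrete, the following sketch expresses JITDetect as a Python callable and shows how a pair is judged; the Detector signature and the field names are illustrative assumptions rather than the benchmark's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Literal

Label = Literal["vul", "ben"]

# A detector maps (repository version, target function) to a label,
# mirroring JITDetect: (R, f) -> {vul, ben}.
Detector = Callable[[str, str], Label]  # (repo_path, function_source) -> label

@dataclass
class PairedSample:
    """One pair: the same target function before and after the fix."""
    repo_intro: str  # repository checked out at the vulnerability-introducing commit
    repo_fix: str    # repository checked out at the vulnerability-fixing commit
    f_vul: str       # vulnerable version of the target function
    f_ben: str       # benign (patched) version of the target function

def pair_is_correct(detect: Detector, sample: PairedSample) -> bool:
    """A pair counts as correct only if both sides are labeled correctly."""
    return (detect(sample.repo_intro, sample.f_vul) == "vul"
            and detect(sample.repo_fix, sample.f_ben) == "ben")
```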
Figure 1: Construction process of JITVul.

3.3 Benchmark Construction

We construct a benchmark called JITVul for practical Just-In-Time (JIT) vulnerability detection for code repositories, building on the function-level detection dataset PrimeVul (Ding et al., 2024). As presented in Figure 1, the construction process involves three key steps: Vulnerability Entry Selection, Target Function Extraction, and Pairwise Commit Identification.

Step 1: Vulnerability Entry Selection. We begin by selecting CVE entries that meet three key criteria: (i) the selected CVEs should cover a broad range of CWEs, representing different categories of vulnerabilities, (ii) each CVE should have an associated GitHub repository with a complete commit history, and (iii) each CVE should correspond to a commit that fixes the vulnerability. To satisfy these criteria, we leverage the PrimeVul dataset as our foundation. PrimeVul ensures high data quality by including only CVEs that were fixed in a single commit modifying a single function. From PrimeVul, we randomly select 879 CVE entries, ensuring coverage across 91 CWEs, with a maximum of ten CVEs per CWE.
account for interprocedural context. On the other Step 2: Target Function Extraction. For each
hand, VulEval (Wen et al., 2024b) is a repository- selected CVE, we retrieve the target function and
level detection benchmark, but it lacks pairwise the commit that fixes the vulnerability directly
evaluation and considers only callers and callees from the PrimeVul dataset. The versions before
modified in the same commit as the relevant in- and after the fix correspond to the vulnerable and
terprocedural context when extracting dependen- benign versions of the target function, denoted as
cies—an assumption that is not always valid. fvul and fben , respectively.
Step 3: Pairwise Commits Identification. Un-
3.2 Task Definition like function-level vulnerability detection studies
We define the task of just-in-time vulnerability de- like PrimeVul, our JIT detection requires identi-
tection as follows. Given a code repository R and fying the commits that trigger vulnerability detec-
Step 3: Pairwise Commits Identification. Unlike function-level vulnerability detection studies like PrimeVul, our JIT detection requires identifying the commits that trigger vulnerability detection for both f_vul and f_ben, and obtaining the corresponding repository versions to provide necessary context like callers and callees. To achieve this, we extract the two commits responsible for introducing and fixing the vulnerability, referred to as the vul-intro and vul-fix commits, respectively.

The vul-fix commit can be directly obtained from the data retrieved in the previous step, and the repository version corresponding to it is denoted as R_fix. Identifying the precise vul-intro commit is more challenging, as pinpointing the exact code change that introduces the vulnerability requires considering the complex interactions between functions. Following the methodology from prior JIT detection work (Lomio et al., 2022), we trace the change history of the target function to approximate the vul-intro commit. Specifically, we traverse the commit history backward, examining each commit until we identify the commit where the target function was last modified to become the f_vul version. This commit is then designated as the vul-intro commit, and the corresponding repository is denoted as R_intro.

In the vul-intro and vul-fix commits, the target function is modified into the f_vul and f_ben versions, respectively, thereby triggering the JIT detection process.
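The backward traversal can be approximated with plain git history, as in the sketch below: starting from the parent of the vul-fix commit, it walks the commits that touched the target file and returns the oldest one in which the function still matches f_vul. The extract_function helper is hypothetical, and this is only an illustration of the idea, not the exact procedure used to build JITVul.

```python
import subprocess

def _git(repo: str, *args: str) -> str:
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def approx_vul_intro_commit(repo, rel_path, func_name, f_vul_body,
                            fix_commit, extract_function):
    """Walk history backward from the fix commit's parent and return the last
    commit in which the target function was modified into its f_vul version."""
    commits = _git(repo, "log", "--format=%H", f"{fix_commit}^", "--", rel_path).split()
    vul_intro = None
    for commit in commits:                      # newest first
        source = _git(repo, "show", f"{commit}:{rel_path}")  # rel_path is repo-relative
        if extract_function(source, func_name) == f_vul_body:
            vul_intro = commit                  # function still equals the vulnerable version
        else:
            break                               # older than the change that introduced f_vul
    return vul_intro
```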
Resulting Benchmark. After completing the above steps, JITVul includes 1,758 pairwise data samples, with 879 labeled as vulnerable and 879 as benign. Each sample consists of a specific code repository version, a target function, and a ground-truth label (i.e., vul or ben). These samples are derived from 879 CVE entries, spanning 91 CWEs. The average number of code files in the repositories is 2,955.94, and the average number of lines of code in the target functions is 696.40.
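Conceptually, each sample can be pictured as a small record like the following; the field names are hypothetical and chosen purely for illustration.

```python
# One of the 1,758 samples; its paired counterpart shares the CVE and CWE
# but points at the other commit and the other version of the function.
sample = {
    "cve": "CVE-XXXX-XXXXX",                        # source CVE entry
    "cwe": "CWE-XXX",                               # vulnerability type
    "repo": "https://github.com/<org>/<project>",  # code repository
    "commit": "<vul-intro or vul-fix commit>",      # repository version to check out
    "target_function": "<function name>",           # function modified in that commit
    "label": "vul",                                 # ground truth: "vul" or "ben"
}
```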
4 Experimental Setup

4.1 Studied Methods

We investigate three categories of detection methods: Plain LLM, Dependency-Augmented (Dep-Aug) LLM, and ReAct Agent. We choose the ReAct agent over alternatives because its thought-action-observation workflow aligns well with the need for on-demand interprocedural analysis in JIT vulnerability detection.

• Plain LLM employs a single LLM with a prompt for vulnerability detection, resembling a function-level detection approach. The LLM is given only the target function to determine whether it is vulnerable. The detailed prompts can be found in Appendix Section A.1.

• Dep-Aug LLM extends the Plain LLM by incorporating the Top-5 similar callers and callees of the target function into the prompt through a lexical retrieval approach, such as Jaccard Similarity (a minimal sketch of this retrieval step is shown after this list). This method, proposed and evaluated in VulEval (Wen et al., 2024b), is reproduced in our work based on the original paper. We use it as a baseline that integrates interprocedural context into LLMs in a deterministic manner.

• ReAct Agent performs an iterative thought-action-observation process as illustrated in Figure 3 in the Appendix, and is equipped with three tools for on-demand interprocedural context acquisition: (i) get_callers returns the function names and line numbers of callers for the input function; (ii) get_callees returns the function names and line numbers of callees for the input function; (iii) get_definition retrieves the complete function definition based on the input function name and line number. Using these three tools, our ReAct agent for JIT detection follows the workflow outlined below. In each iteration, the agent first reasons based on the current context and the observations from the previous iteration. It then decides whether to call the tools for interprocedural context or to stop the iteration and make a prediction. After observing the tool outputs, the agent proceeds to the next iteration.
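As an illustration of the retrieval step referenced in the Dep-Aug LLM item above, the sketch below ranks caller and callee bodies by token-level Jaccard similarity to the target function and keeps the Top-5. The crude identifier tokenizer is a simplifying assumption, not VulEval's exact implementation.

```python
import re

def tokens(code: str) -> set:
    """Crude lexical tokenization: identifiers and numbers only."""
    return set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+", code))

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def top5_dependencies(target_fn: str, candidates: list) -> list:
    """Rank caller/callee bodies by lexical similarity to the target function."""
    t = tokens(target_fn)
    ranked = sorted(candidates, key=lambda c: jaccard(t, tokens(c)), reverse=True)
    return ranked[:5]  # these snippets are appended to the Plain LLM prompt as context
```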
For each category, besides the vanilla version with a basic prompt, we also design three variants based on different prompting strategies: chain-of-thought (CoT), few-shot examples (FS), and a combination of both CoT and FS.

• CoT: We use a basic CoT approach by adding the instruction "Solve this problem step by step..." to prompt the LLM and ReAct agent to break down the reasoning tasks. We avoid more complex CoT instructions, as summarizing reasoning patterns for vulnerabilities in advance is challenging, particularly given the wide variety of vulnerability types (CWEs).

• FS: We provide several pairwise examples, each consisting of a vulnerable code snippet with a detailed explanation of the vulnerability and its patched benign version with an explanation of the applied safeguard. These few-shot
examples are expected to offer context for the LLM and ReAct agent to distinguish between vulnerable and benign code, helping them focus on the actual vulnerability features.

For each variant, we further employ two different foundation models: GPT-4o-mini and GPT-4o.

4.2 Metrics

For each vul-intro and vul-fix commit pair, we apply detection methods to the corresponding repository (R_intro or R_fix) and target function (f_vul or f_ben), then compare predictions with ground truth. In addition to the commonly used F1 score, we also assess effectiveness using pairwise accuracy (pAcc), inspired by PrimeVul (Ding et al., 2024). This metric reflects the proportion of pairs where both functions are correctly labeled, i.e., JITDetect(R_intro, f_vul) = vul and JITDetect(R_fix, f_ben) = ben.

    F1 = 2 × TP / (2 × TP + FP + FN),

    pAcc = (# of Correctly Labeled Pairs) / (# of Total Pairs),

where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives.
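Both metrics can be computed directly from the paired predictions. A minimal sketch, treating vul as the positive class as in the formulas above:

```python
def evaluate(pairs):
    """pairs: list of (pred_on_vulnerable_side, pred_on_benign_side), each "vul" or "ben"."""
    tp = fp = fn = correct_pairs = 0
    for pred_vul_side, pred_ben_side in pairs:
        if pred_vul_side == "vul":        # vulnerable side: ground truth is "vul"
            tp += 1
        else:
            fn += 1
        if pred_ben_side == "vul":        # benign side: predicting "vul" is a false positive
            fp += 1
        if pred_vul_side == "vul" and pred_ben_side == "ben":
            correct_pairs += 1            # both functions in the pair labeled correctly
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    pacc = correct_pairs / len(pairs) if pairs else 0.0
    return f1, pacc
```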
its complete function body from its definition in
4.3 Implementation the code repository. These are implemented as
Few-shot Example Creation. To support few- Python functions using LangChain’s tool decora-
shot variants, we manually create ten example tor for integration into the ReAct workflow.
pairs for both the LLM and the agent. Each pair LLMs and Agents. We use GPT-4o-mini and
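A minimal sketch of how such tools can be exposed to the agent, assuming LangChain's @tool decorator and the cflow and ctags command-line tools. The source-file list is a placeholder, get_definition is keyed by name only, and it returns the ctags cross-reference entry rather than the full function body, which would be extracted from the reported location in the actual pipeline.

```python
import subprocess
from langchain_core.tools import tool

# Placeholder: the C sources of the repository version under analysis.
REPO_SOURCES = ["/path/to/repo/file1.c", "/path/to/repo/file2.c"]

def _cflow(*args: str) -> str:
    """Run GNU cflow over the repository sources and return its textual call graph."""
    result = subprocess.run(["cflow", *args, *REPO_SOURCES],
                            capture_output=True, text=True)
    return result.stdout or result.stderr

@tool
def get_callers(function_name: str) -> str:
    """Get callers for a function; function names and locations are returned."""
    return _cflow("--reverse", f"--main={function_name}")

@tool
def get_callees(function_name: str) -> str:
    """Get callees for a function; function names and locations are returned."""
    return _cflow(f"--main={function_name}")

@tool
def get_definition(function_name: str) -> str:
    """Locate a function definition via Universal Ctags' cross-reference output."""
    result = subprocess.run(["ctags", "-x", *REPO_SOURCES],
                            capture_output=True, text=True)
    matches = [line for line in result.stdout.splitlines()
               if line.startswith(function_name + " ")]
    return "\n".join(matches) or f"No definition found for {function_name}"
```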
LLMs and Agents. We use GPT-4o-mini and GPT-4o with the temperature set to 0. LangChain-0.3.14 is used for pipeline construction.
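Continuing the tool sketch above, the agent itself can be assembled roughly as follows with LangChain's ReAct helpers; the prompt string is modeled on Appendix Figure 4, and the target_function value is a placeholder.

```python
# Continuing the tool sketch from Section 4.3; langchain 0.3.x is assumed.
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

TEMPLATE = """Answer the following questions as best you can. You have access to the following tools:
{tools}

Use the following format:
Question: the input question you must answer.
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}"""

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [get_callers, get_callees, get_definition]   # the @tool functions sketched above
agent = create_react_agent(llm, tools, PromptTemplate.from_template(TEMPLATE))
executor = AgentExecutor(agent=agent, tools=tools, handle_parsing_errors=True)

target_function = "..."  # placeholder: source of the function modified in the commit
result = executor.invoke({"input": "Is the following function vulnerable?\n" + target_function})
```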
5 Results and Analyses

Table 2 presents the results of the studied detection methods on JITVul.

Table 2: Results of studied methods on JITVul. The best and second-best results are highlighted.

                      GPT-4o-mini        GPT-4o
Method                F1      pAcc       F1      pAcc
Plain LLM
- vanilla             56.00    3.36      65.96    1.02
- w/ CoT              65.10    3.36      62.22   15.02
- w/ FS               48.74    7.56      62.77    4.44
- w/ CoT+FS           64.65   11.76      64.44   17.63
Dep-Aug LLM
- vanilla             52.68    2.05      63.30    1.03
- w/ CoT              66.05    4.86      62.60   18.66
- w/ FS               48.23    7.27      62.03    2.39
- w/ CoT+FS           65.01    4.68      61.12   18.79
ReAct Agent
- vanilla             56.63   12.61      57.77   17.63
- w/ CoT              56.93   16.81      58.07   19.13
- w/ FS               56.81   20.17      56.42   18.91
- w/ CoT+FS           51.06   14.29      52.61   18.89

5.1 Detection Method Comparison

We compare the three categories of detection methods based on both F1 and pAcc scores.

Results. The results indicate that ReAct Agents achieve higher pAcc scores than other LLM-based methods across all prompting strategies, with improvements ranging from 0.1% to 16.61%, with the largest gain occurring with GPT-4o and vanilla prompting. However, Plain LLMs and Dep-Aug LLMs generally achieve higher F1 scores than ReAct under most settings, with improvements of 4.15%-13.95%, except when using GPT-4o-mini with the CoT and CoT+FS prompting strategies.
Additionally, Dep-Aug LLMs do not show consistent improvements over Plain LLMs and even exhibit performance degradation with certain prompting strategies.

Findings. Two findings emerge from the results.

A higher F1 score does not necessarily indicate a detection method's superior ability to capture vulnerability characteristics. Vulnerability detection methods exhibit an inconsistent relationship between pAcc (pairwise accuracy) and F1 (isolated metric). LLM-based methods predict significantly more vul labels—exceeding 90% in certain settings—compared to ReAct Agents, leading to higher recall. However, precision remains similar across methods, around 50%, as observed on our label-balanced benchmark. These factors explain the higher F1 scores of LLM-based methods on JITVul. This also highlights F1's sensitivity to the data distribution, emphasizing the need for pairwise evaluation to better capture core vulnerability characteristics (Risse and Böhme, 2024b).

The thought-action-observation framework of ReAct Agents, combined with their effective use of interprocedural context, enhances their ability to capture vulnerability characteristics. ReAct Agents conduct in-depth, fine-grained analysis by iteratively and adaptively retrieving additional context, such as callers and callees, rather than relying on superficial analysis or speculation. This allows them to differentiate between code versions before and after a vulnerability fix, leading to consistent improvements in pAcc. In contrast, while Dep-Aug LLMs incorporate interprocedural context, they rely on mechanical retrieval based on similarity metrics (e.g., Jaccard Similarity), feeding the retrieved Top-5 callers and callees all at once, which may introduce noise. This could explain why Dep-Aug LLMs sometimes show a lower pAcc than Plain LLMs. In comparison, ReAct agents demonstrate average improvements in pAcc of 9.46% over Dep-Aug LLMs with GPT-4o-mini and 8.42% with GPT-4o, across various prompting strategies. To demonstrate the adaptive use of interprocedural context by the ReAct Agent, we present the distribution of tool invocations for ReAct with GPT-4o and vanilla prompting in Appendix Figure 6. It shows that the ReAct Agent dynamically invokes the tools one to three times to retrieve the necessary callers or callees in most cases, in contrast to Dep-Aug LLM, which feeds a fixed number of callers and callees.

5.2 Prompting Strategy Comparison

We also compare the three prompting strategies and perform a detailed analysis.

Results. When applying different prompting strategies, such as CoT and FS, detection methods show varying degrees of improvement in pAcc scores, ranging from 1.26% to 17.76%. However, these prompting strategies do not consistently lead to F1 improvements in the pairwise evaluation of JIT detection.

Findings. We find two key findings.

Popular prompting strategies like CoT and FS examples can enhance LLMs' performance in pairwise JIT evaluation. While these strategies sometimes reduce F1 scores, the pairwise metric pAcc shows that LLMs can significantly benefit from CoT instructions (even something as simple as "Solve this problem step by step...") and pairwise FS examples. This improvement is often overlooked when focusing solely on F1 and should be considered when designing methods.

ReAct Agents require further design improvements when using prompting strategies. The current prompts are straightforward and align better with the inference process of LLMs, meaning the improvements from these strategies for ReAct Agents are relatively smaller than for LLM-based methods. For instance, the FS examples consist of singleton code snippets that do not require interprocedural analysis, which limits the benefit for ReAct Agents that rely on interprocedural context. This suggests a research opportunity in developing agent-oriented prompting strategies specifically for vulnerability detection.

5.3 Foundation Model Comparison

To evaluate the effectiveness of different foundation models across various detection methods, we conduct a comparative analysis. Additionally, we include the open-source Llama3.1-8B for further comparison, with the results provided in Table 3 in the Appendix.

Results. GPT-4o outperforms GPT-4o-mini on average, while Llama-3.1-8B frequently fails to complete the analysis process and defaults to the ben label when integrated into ReAct Agents.

Findings. Two key findings are identified.
Different foundation models are sensitive to different prompting strategies. The results show that GPT-4o-mini and GPT-4o exhibit distinct improvement patterns with CoT and FS examples. This highlights the need to customize prompting strategies based on the selected foundation models. Moreover, larger models do not always outperform smaller models, emphasizing the importance of carefully designing methods that consider the characteristics of specific foundation models.

The execution of ReAct Agents depends on the instruction-following capability of foundation models. Inspection of the outputs reveals that the failures of Llama-3.1-8B in ReAct Agents are due to the model's frequent inability to follow output format requirements. This prevents the agents from linking outputs and inputs across components, thus failing to perform the thought-action-observation iterative framework. This is a known issue with some foundation models, which struggle to follow instructions effectively (Verma et al., 2024), reducing their effectiveness when used in agentic architectures.

5.4 Pairwise Comparison

The failures in pairwise evaluation can be categorized into three types: pairwise vulnerable, where both versions are labeled as vulnerable; pairwise benign, where both versions are labeled as benign; and pairwise reversed, where both versions are mislabeled. We conduct a more detailed pairwise comparison based on these types.
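These outcome types follow mechanically from a pair's two predictions; a small sketch:

```python
def pair_outcome(pred_on_vul: str, pred_on_ben: str) -> str:
    """Classify a prediction pair; the ground truth is ("vul", "ben")."""
    if pred_on_vul == "vul" and pred_on_ben == "ben":
        return "correct"
    if pred_on_vul == "vul" and pred_on_ben == "vul":
        return "pairwise vulnerable"   # both versions labeled vulnerable
    if pred_on_vul == "ben" and pred_on_ben == "ben":
        return "pairwise benign"       # both versions labeled benign
    return "pairwise reversed"         # labels swapped: both versions mislabeled
```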
Results. The most prevalent pairwise inaccuracy is pairwise vulnerable, ranging from approximately 40% to 95% for LLMs and from around 35% to 50% for ReAct Agents (except those with Llama-3.1-8B). Typically, the occurrence of pairwise reversed increases as pAcc improves.

Findings. We identify the following insights.

ReAct Agents are more effective at distinguishing between pairwise target functions. ReAct Agents demonstrate a stronger ability to differentiate between a vulnerable function and its patched benign version. This is because they leverage the thought-action-observation framework to iteratively and adaptively retrieve additional context, such as callers and callees, which allows them to capture differences between the two versions. A related case study can be found in Appendix Section D.1.

LLMs and ReAct Agents sometimes fail to identify the causes in the vulnerable version, while tending to over-analyze the benign version after the vulnerability fix. LLMs and ReAct Agents often struggle to pinpoint the root causes (e.g., insufficient input sanitization) of vulnerabilities in the vulnerable version. After the vulnerability is fixed, however, the detection methods tend to over-analyze the patched guards (e.g., newly added sanitization statements) in the benign version, speculating about results and often misidentifying non-issues. This discrepancy arises because LLMs rely on broad, general patterns without the accurate reasoning capability needed for complex contexts. This highlights the need for more fine-grained analysis capabilities, such as reliable constraint solving, which should be integrated with LLMs or enhanced through program analysis techniques. A detailed case study can be found in Appendix Section D.2.

LLMs and ReAct Agents often exhibit inconsistent analysis patterns when analyzing pairwise target functions. When analyzing pairwise target functions—vulnerable and benign versions—LLMs display significant inconsistencies in their evaluation patterns. For some pairs, they may provide thorough analysis for the vulnerable version but neglect important details in the benign version, or vice versa. In other cases, they may exhibit contrasting reasoning or focus on irrelevant aspects, leading to discrepancies in how they interpret the two versions. A clear example of this inconsistency is the significant difference in the average number of tool invocations by the ReAct Agent for the vulnerable and benign versions (e.g., 5.85 vs. 1.79 when using GPT-4o-mini with vanilla prompting). These inconsistencies highlight the challenges LLMs face in handling the complexities of code analysis, where the differences between a vulnerable and benign version can be subtle and require more nuanced evaluation. A detailed case study can be found in Appendix Section D.1.
6 Conclusion

In this work, we introduced JITVul, a benchmark for just-in-time (JIT) vulnerability detection that enables a comprehensive, pairwise evaluation of LLMs and LLM-based agents. Our results show that ReAct Agents, leveraging thought-action-observation and interprocedural context, demonstrate better reasoning but require further refinement, particularly in utilizing advanced prompting strategies. Additionally, LLMs and ReAct agents often misinterpret flaws by either overlooking critical issues or over-analyzing benign fixes. These findings highlight the need for improving agentic architectures, prompting techniques, dynamic interprocedural analysis, and robust reasoning models tailored to vulnerability analysis to enhance automated vulnerability detection.

7 Limitations

As the early work benchmarking LLM-based agents for JIT vulnerability detection in code repositories, we acknowledge several limitations. First, the construction of JITVul may not perfectly trace the vul-intro commit due to the complexity of code evolution and function interactions. To mitigate this, we followed existing methodologies (Lomio et al., 2022) and manually inspected selected instances to ensure reliability. Second, while we emphasize pairwise evaluation for JIT detection, the label-balanced dataset may introduce bias in F1 score comparisons. In the future, we plan to incorporate more benign commits to improve the evaluation on the F1 metric. Third, some important statistics, such as the ratio of interprocedural vulnerabilities, are missing. Although we attempted manual annotation, it is labor-intensive and difficult to scale. We plan to leverage commercial annotation platforms to achieve the annotation and provide more fine-grained evaluations in future work.

References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David Wagner. 2023. DiverseVul: A new vulnerable source code dataset for deep learning based vulnerability detection. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID '23), pages 654–668, New York, NY, USA. Association for Computing Machinery.

Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2024. Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624.

Xueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, and Yiling Lou. 2024. Vul-RAG: Enhancing LLM-based vulnerability detection via knowledge-level RAG. arXiv preprint arXiv:2406.11147.

Jiahao Fan, Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. A C/C++ code vulnerability dataset with code changes and CVE summaries. In Proceedings of the 17th International Conference on Mining Software Repositories, pages 508–512.

GNU. 2025. GNU cflow: Analyzing a collection of C source files, charting control flow within the program.

Mete Keltek, Rong Hu, Mohammadreza Fani Sani, and Ziyue Li. 2024. LSAST: Enhancing cybersecurity through LLM-supported static application security testing. arXiv preprint arXiv:2409.15735.

Francesco Lomio, Emanuele Iannone, Andrea De Lucia, Fabio Palomba, and Valentina Lenarduzzi. 2022. Just-in-time software vulnerability detection: Are we there yet? Journal of Systems and Software, 188:111283.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedback.

Mend.io. 2025. Mend.io vulnerability database: The largest open source vulnerability database.

Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, and Shaohua Wang. 2024. MegaVul: A C/C++ vulnerability dataset with comprehensive code representations. In 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), pages 738–742.

NIST. 2025. National Vulnerability Database: Dashboard.

Tal Ridnik, Dedy Kredo, and Itamar Friedman. Code generation with AlphaCodium: From prompt engineering to flow engineering.

Niklas Risse and Marcel Böhme. 2024a. Top score on the wrong exam: On benchmarking in machine learning for vulnerability detection. arXiv preprint arXiv:2408.12986.

Niklas Risse and Marcel Böhme. 2024b. Uncovering the limits of machine learning for automatic vulnerability detection. In 33rd USENIX Security Symposium (USENIX Security 24), pages 4247–4264.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366.

Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Wei Ma, Lyuye Zhang, Yang Liu, and Yingjiu Li. 2024. LLM4Vuln: A unified evaluation framework for decoupling and enhancing LLMs' vulnerability reasoning. arXiv preprint arXiv:2401.16185.

Universal Ctags. 2025. Universal Ctags: A maintained ctags implementation.

Mudit Verma, Siddhant Bhambri, and Subbarao Kambhampati. 2024. On the brittle foundations of ReAct prompting for agentic large language models. Preprint, arXiv:2405.13966.

Xinchen Wang, Ruida Hu, Cuiyun Gao, Xin-Cheng Wen, Yujia Chen, and Qing Liao. 2024. ReposVul: A repository-level high-quality vulnerability dataset. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pages 472–483.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Cheng Wen, Yuandao Cai, Bin Zhang, Jie Su, Zhiwu Xu, Dugang Liu, Shengchao Qin, Zhong Ming, and Tian Cong. 2024a. Automatically inspecting thousands of static bug warnings with large language model: How far are we? ACM Transactions on Knowledge Discovery from Data, 18(7):1–34.

Xin-Cheng Wen, Xinchen Wang, Yujia Chen, Ruida Hu, David Lo, and Cuiyun Gao. 2024b. VulEval: Towards repository-level evaluation of software vulnerability detection. arXiv preprint arXiv:2404.15596.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

Xin Zhou, Sicong Cao, Xiaobing Sun, and David Lo. 2024a. Large language model for vulnerability detection and repair: Literature review and the road ahead. ACM Transactions on Software Engineering and Methodology.

Xin Zhou, Duc-Manh Tran, Thanh Le-Cong, Ting Zhang, Ivana Clairine Irsan, Joshua Sumarlin, Bach Le, and David Lo. 2024b. Comparison of static application security testing tools and large language models for repo-level vulnerability detection. arXiv preprint arXiv:2407.16235.

Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in Neural Information Processing Systems, 32.
A Studied Methods
A.1 Prompt Templates in Plain LLM
Figure 2 illustrates the various prompting strategies used with Plain LLM. The Vanilla Prompt serves as
the base prompt included in all variants, while FS Examples and CoT Instruction are selectively applied
according to the specific strategies. In the template, "{target_function}" acts as a placeholder for the
target function to be detected.

[Vanilla Prompt]
You are a security researcher tasked with identifying vulnerabilities in a codebase. You have been given a function to analyze.
The function may or may not be vulnerable.

If you think it is vulnerable reply with @@VULNERABLE@@, otherwise reply with @@NOT VULNERABLE@@

If you think the function is vulnerable, please provide the CWE number that you think is most relevant to the vulnerability in
the form of @@CWE: <CWE_NUMBER>@@
For example:
@@VULNERABLE@@
@@CWE: CWE-1234@@

Here is the function:

```c
{target_function}
```

[FS Examples]
Example Detections:
- Vulnerable Example 1
- Benign Example 1
...
- Vulnerable Example 10
- Benign Example 10

[CoT Instruction]
Solve this problem step by step. Carefully break down the reasoning process to arrive at the correct solution. Explain your
reasoning at each step before providing the final answer.

Figure 2: Prompt templates used with Plain LLM.


A.2 ReAct Agent
Figure 3 illustrates the overall workflow of the ReAct Agent for JIT vulnerability detection. Figure 4 dis-
plays the prompt used with the ReAct Agent, implemented with the default LangChain framework. The
"{agent_scratchpad}" is a one-time execution memory that holds tool descriptions, along with previous
observations and reasoning traces. The “{input}” variable is used for the user prompt and can be further
enhanced using prompt augmentation techniques.

Figure 3: Workflow of ReAct Agent for JIT vulnerability detection. (Starting from the target function, the agent iterates over Thought (e.g., "I need to look into callers"), Action (e.g., "Invoke get_callers(X)"), and Observation (e.g., "The callers of X are Y") steps, using the get_callers, get_callees, and get_definition tools over the code repository until it outputs a prediction.)

Answer the following questions as best you can. You have access to the following tools:
- get_callers: Get callers for a function, function names are returned
- get_callees: Get callees for a function, function names are returned
- get_definition: Get definition code of a function based on the function name.

Use the following format:


- Question: the input question you must answer.
- Thought: you should always think about what to do
- Action: the action to take, should be one of get_callers, get_callees, and get_definition
- Action Input: the input to the action
- Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
- Thought: I now know the final answer
- Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}

Figure 4: Prompt template used with ReAct Agent.


A.3 Few-shot Example
Figure 5 presents a few-shot example based on the webpage for “CWE-787: Out-of-bounds Write”. It
highlights the key differences between the vulnerable and benign versions.

(a) Vulnerable Version

# Code
```c
char* trimTrailingWhitespace(char *strMessage, int length) {
    char *retMessage;
    char *message = malloc(sizeof(char)*(length+1));
    // copy input string to a temporary string
    char message[length+1];
    int index;
    for (index = 0; index < length; index++) {
        message[index] = strMessage[index];
    }
    message[index] = '\0';
    // trim trailing whitespace
    int len = index-1;
    while (isspace(message[len])) {
        message[len] = '\0';
        len--;
    }
    // return string without trailing whitespace
    retMessage = message;
    return retMessage;
}
```

# Explanation
In the code, a utility function is used to trim trailing whitespace from a character string. The function copies the input string to a local character string and uses a while statement to remove the trailing whitespace by moving backward through the string and overwriting whitespace with a NULL character. However, this function can cause a buffer underwrite if the input character string contains all whitespace. On some systems the while statement will move backwards past the beginning of a character string and will call the `isspace()` function on an address outside of the bounds of the local buffer.

(b) Benign Version

# Code
```c
char* trimTrailingWhitespace(char *strMessage, int length) {
    char *retMessage;
    char *message = malloc(sizeof(char)*(length+1));
    // copy input string to a temporary string
    char message[length+1];
    int index;
    for (index = 0; index < length; index++) {
        message[index] = strMessage[index];
    }
    message[index] = '\0';
    // trim trailing whitespace
    int len = index-1;
    while (len >= 0 && isspace(message[len])) {
        message[len] = '\0';
        len--;
    }
    // return string without trailing whitespace
    retMessage = message;
    return retMessage;
}
```

# Explanation
In the code, a utility function is used to trim trailing whitespace from a character string. The function copies the input string to a local character string and uses a while statement to remove the trailing whitespace by moving backward through the string and overwriting whitespace with a NULL character. This function avoids a buffer underwrite by incorporating the boundary check `len >= 0` in the while loop condition. This ensures that the loop does not move past the beginning of the character string or call the `isspace()` function on an address outside the bounds of the buffer.

Figure 5: A FS example of "CWE-787: Out-of-bounds Write", including both the vulnerable version and the benign version.
B Tool Invocation Distribution
Figure 6 illustrates the distribution of tool invocations for ReAct Agent with GPT-4o and vanilla prompt-
ing. The data shows that, in most cases, ReAct Agent invokes the tools one to three times to retrieve the
necessary callers or callees.


Figure 6: Distribution of tool invocations for ReAct Agent with GPT-4o and vanilla prompting.

C Llama-3.1 Results
Table 3 presents the results of Llama3.1-8B on JITVul. ReAct Agents using Llama3.1-8B show signifi-
cantly lower performance, with the execution process often failing due to formatting and parsing issues.
As a result, the agents frequently default to the ben label.

Table 3: Results of studied detection methods on JITVul with Llama3.1-8B

Method F1 pAcc
Plain LLM
- vanilla 58.05 0.84
- w/ CoT 49.79 10.92
- w/ FS 54.48 1.68
- w/ CoT+FS 29.55 14.29
Dep-Aug LLM
- vanilla 40.48 15.17
- w/ CoT 21.18 8.39
- w/ FS 27.37 10.88
- w/ CoT+FS 16.46 7.42
ReAct Agent
- vanilla 9.09 4.20
- w/ CoT 14.67 3.36
- w/ FS 3.28 0.84
- w/ CoT+FS 3.28 1.68
D Case Study
We provide several examples to illustrate the inputs and outputs of the detection methods for a better
understanding of the analysis.

D.1 CVE-2019-15164
Figure 7 illustrates the case study derived from CVE-2019-15164 (details at
https://nvd.nist.gov/vuln/detail/CVE-2019-15164), with the left side showing the vulnerable code
and the detection methods’ responses, and the right side depicting the benign version and its corre-
sponding responses. The vulnerable version of the function daemon_msg_open_req is susceptible
to a “CWE-918: Server-Side Request Forgery (SSRF)” vulnerability due to the lack of validation
for source before opening the device, which is read from the network socket. The benign version
addresses this vulnerability by adding an if-condition to validate whether source is a valid URL, as
highlighted in the figure.
Label Predictions. When using Plain LLM with GPT-4o and vanilla prompting, the analyses of
both the vulnerable and benign versions focus on buffer operations and misclassify the benign as
vulnerable. In contrast, when using the ReAct Agent, the predictions for both versions are correct.
The agent is able to retrieve and analyze additional context, such as understanding its caller function
daemon_serviceloop and surrounding function bodies. This contextual information enables the
agent to better comprehend how the daemon_msg_open_req function is used within the broader code-
base and recognize the risk introduced by the unvalidated URL input. Key points in the analysis process
are highlighted to show the improved detection capability provided by the ReAct Agent.
CWE Predictions. However, upon examining the specific vulnerability categories predicted by Plain
LLM and ReAct Agent, some fine-grained issues emerge. Plain LLM incorrectly predicts “CWE-120:
Buffer Copy without Checking Size of Input” for both the vulnerable and benign versions, which is
entirely inaccurate. On the other hand, ReAct Agent predicts “CWE-20: Improper Input Validation”
for the vulnerable version. While this is not the correct classification, it is somewhat related to the
ground-truth vulnerability of Server-Side Request Forgery (SSRF). The SSRF vulnerability arises from
the improper validation of the source parameter before opening the device, which the ReAct Agent's
prediction partially captures, indicating a closer alignment to the actual issue.
Analysis Patterns. When delving into the detailed analysis processes, we observe that the ReAct
Agent does not maintain consistent analysis patterns across both versions. For the vulnerable version,
the agent focuses on buffer operation and input validation, while for the benign version, it conducts a
more comprehensive check. However, in this case, the analysis patterns should be more similar, suggest-
ing that the LLM behind the ReAct Agent lacks sufficient robustness to capture the actual vulnerability
characteristics. This indicates a deficiency in its reasoning capabilities for accurate vulnerability reason-
ing.

D.2 CVE-2019-3877
Figure 8 illustrates the case study derived from CVE-2019-3877 (details at
https://nvd.nist.gov/vuln/detail/CVE-2019-3877), with the left side showing the vulnerable code
and the detection methods’ responses, and the right side depicting the benign version and its correspond-
ing responses. The vulnerable version of the function am_check_url is susceptible to a "CWE-601:
URL Redirection to Untrusted Site” vulnerability due to the insufficient validation for url. The benign
version addresses this vulnerability by adding an if-condition to validate whether a backslash exists in
url.
Label Predictions. Plain LLM with GPT-4o and vanilla prompting produces incorrect predictions,
often reversing the labels.
Analysis Patterns. A closer look at the analysis process reveals that Plain LLM fails to identify the
insufficient validation for the url in the vulnerable version, where an assert is highlighted. In contrast,
it over-analyzes the validation requirements for the benign version. These lead to incorrect predictions
for both versions.
[Figure 7 content. The original figure shows two side-by-side panels; the extracted text is interleaved and is summarized here. Left panel: the vulnerable version of daemon_msg_open_req (ground truth: CWE-918, Server-Side Request Forgery), together with the Plain LLM response, which predicts @@VULNERABLE@@ with CWE-120 (Buffer Copy without Checking Size of Input) based on the sock_recv write into source, and the ReAct Agent response, which queries get_callers (daemon_serviceloop), get_definition, and get_callees (pcap_snprintf, sock_recv, pcap_open_live, etc.) before concluding @@VULNERABLE@@ with CWE-20 (Improper Input Validation). Right panel: the benign version, which rejects URL-style source strings via is_url(source) before opening the device, together with the Plain LLM response, which again predicts @@VULNERABLE@@ with CWE-120, and the ReAct Agent response, which notes the is_url check and error handling and concludes @@NOT VULNERABLE@@.]

Figure 7: Case study for CVE-2019-15164.
[Figure 8 content. The original figure shows two side-by-side panels; the extracted text is interleaved and is summarized here. Left panel: the vulnerable version of am_check_url, which rejects only control characters in url (ground truth: CWE-601, URL Redirection to Untrusted Site), together with the Plain LLM response, which judges the control-character check sufficient and concludes @@NOT VULNERABLE@@. Right panel: the benign version, which additionally rejects the backslash character used to bypass redirect URL validation, together with the Plain LLM response, which concludes @@VULNERABLE@@ with CWE-20 (Improper Input Validation), citing missing checks for injection attacks and a lack of comprehensive URL validation.]

Figure 8: Case study for CVE-2019-3877.