
Large Language Models are Few-shot Testers:

Exploring LLM-based General Bug Reproduction


Sungmin Kang∗, Juyeon Yoon∗, Shin Yoo
School of Computing, KAIST, Daejeon, Republic of Korea
[email protected], [email protected], [email protected]
∗ These authors contributed equally.

arXiv:2209.11515v3 [cs.SE] 25 Jul 2023

Abstract—Many automated test generation techniques have been developed to aid developers with writing tests. To facilitate full automation, most existing techniques aim to either increase coverage, or generate exploratory inputs. However, existing test generation techniques largely fall short of achieving more semantic objectives, such as generating tests to reproduce a given bug report. Reproducing bugs is nonetheless important, as our empirical study shows that the number of tests added in open source repositories due to issues was about 28% of the corresponding project test suite size. Meanwhile, due to the difficulties of transforming the expected program semantics in bug reports into test oracles, existing failure reproduction techniques tend to deal exclusively with program crashes, a small subset of all bug reports. To automate test generation from general bug reports, we propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks. Since LLMs themselves cannot execute the target buggy code, we focus on post-processing steps that help us discern when LLMs are effective, and rank the produced tests according to their validity. Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases (251 out of 750), while suggesting a bug reproducing test in first place for 149 bugs. To mitigate data contamination (i.e., the possibility of the LLM simply remembering the test code either partially or in whole), we also evaluate LIBRO against 31 bug reports submitted after the collection of the LLM training data terminated: LIBRO produces bug reproducing tests for 32% of the studied bug reports. Overall, our results show LIBRO has the potential to significantly enhance developer efficiency by automatically generating tests from bug reports.

Index Terms—test generation, natural language processing, software engineering

I. INTRODUCTION

Software testing is the practice of confirming that software meets specification criteria by executing tests on the software under test (SUT). Due to the importance and safety-critical nature of many software projects, software testing is one of the most important practices in the software development process. Despite this, it is widely acknowledged that software testing is tedious due to the significant human effort required [1]. To fill this gap, automated test generation techniques have been studied for almost half a century [2], resulting in a number of tools [3], [4] that use implicit oracles (regressions or crash detection) to guide the automated process. They are useful when new features are being added, as they can generate novel tests with high coverage for a focal class.

However, not all tests are added immediately along with their focal class. In fact, we find that a significant number of tests originate from bug reports, i.e., are created in order to prevent future regressions for the bug reported. This suggests that the generation of bug reproducing tests from bug reports is an under-appreciated yet impactful way of automatically writing tests for developers. Our claim is based on the analysis of a sample of 300 open source projects using JUnit: the number of tests added as a result of bug reports was on median 28% of the size of the overall test suite. Thus, the bug report-to-test problem is regularly dealt with by developers, and a problem in which an automated technique could provide significant help. Previous work in bug reproduction mostly deals with crashes [5], [6]; as many bug reports deal with semantic issues, their scope is limited in practice.

The general report-to-test problem is of significant importance to the software engineering community, as solving this problem would allow developers to use a greater number of automated debugging techniques, equipped with test cases that reproduce the reported bug. Koyuncu et al. [7] note that in the widely used Defects4J [8] bug benchmark, bug-revealing tests did not exist prior to the bug report being filed in 96% of the cases. Consequently, it may be difficult to utilize the state-of-the-art automated debugging techniques, which are often evaluated on Defects4J, when a bug is first reported, because they rely on bug reproducing tests [9], [10]. Conversely, alongside a technique that automatically generates bug-revealing tests, a wide range of automated debugging techniques would become usable.

As an initial attempt to solve this problem, we propose prompting Large Language Models (LLMs) to generate tests. Our use of LLMs is based on their impressive performance on a wide range of natural language processing tasks [11] and programming tasks [12]. In this work, we explore whether their capabilities can be extended to generating test cases from bug reports. More importantly, we argue that the performance of LLMs when applied to this problem has to be studied along with the issue of when we can rely on the tests that LLMs produce. Such questions are crucial for actual developer use: Sarkar et al. [13] provide relevant examples, showing that developers struggle to understand when LLMs will do their bidding when used for code generation. To fill this gap of knowledge, we propose LIBRO (LLM Induced Bug ReprOduction), a framework that prompts the OpenAI LLM, Codex [14], to generate tests, processes the results, and suggests solutions only when we can be reasonably confident that bug reproduction has succeeded.

We perform extensive empirical experiments on both the Defects4J benchmark and a new report-test dataset that we have constructed, aiming to identify the features that can indicate successful bug reproduction by LIBRO. We find that, for the Defects4J benchmark, LIBRO can generate at least one bug reproducing test for 251 bugs, or 33.5% of all studied bugs, from their bug reports. LIBRO also successfully deduced which of its bug reproducing attempts were successful with 71.4% accuracy, and produced an actual bug reproducing test as its first suggestion for 149 bugs. For further validation, we evaluate LIBRO on a recent bug report dataset that we built, finding that we could reproduce 32.2% of bugs in this distinct dataset as well, and verifying that our test suggestion heuristics remain effective on this different dataset.

In summary, our contributions are as follows:
• We perform an analysis of open source repositories to verify the importance of generating bug reproducing test cases from bug reports;
• We propose a framework to harness an LLM to reproduce bugs, and suggest generated tests to the developer only when the results are reliable;
• We perform extensive empirical analyses on two datasets, suggesting that the patterns we find, and thus the performance of LIBRO, are robust.

The remainder of the paper is organized as follows. We motivate our research in Section II. Based on this, we describe our approach in Section III. Evaluation settings and research questions are in Section IV and Section V, respectively. Results are presented in Section VI, while threats to validity are discussed in Section VIII. Section IX gives an overview of the relevant literature, and Section X concludes.

II. MOTIVATION

As described in the previous section, the importance of the report-to-test problem rests on two observations. The first is that bug-revealing tests are rarely available when a bug report is filed, unlike what automated debugging techniques often assume [9], [10]. Koyuncu et al. [7] report that Spectrum-Based Fault Localization (SBFL) techniques cannot locate the bug at the time of being reported in 95% of the cases they analyzed, and thus propose a completely static automated debugging technique. However, as Le et al. [15] demonstrate, using dynamic information often leads to more precise localization results. As such, a report-to-test technique could enhance the practicality and/or performance of a large portion of the automated debugging literature.

The other observation is that the report-to-test problem is a perhaps underappreciated yet nonetheless important and recurring part of testing. Existing surveys of developers reveal that developers consider generating tests from bug reports to be a way to improve automated testing. Daka and Fraser [16] survey 225 software developers and point out ways in which automated test generation could help developers, three of which (deciding what to test, realism, deciding what to check) can be resolved using bug reports, as bug reproduction is a relatively well-defined activity. Kochhar et al. [17] explicitly ask hundreds of developers whether they agree with the statement “during maintenance, when a bug is fixed, it is good to add a test case that covers it”, and find a strong average agreement of 4.4 on a Likert scale of 5.

To further verify that developers regularly deal with the report-to-test problem, we analyze the number of test additions that can be attributed to a bug report, by mining hundreds of open-source Java repositories. We start with the Java-med dataset from Alon et al. [18], which consists of 1000 top-starred Java projects from GitHub. From the list of commits in each repository, we check (i) whether the commit adds a test, and (ii) whether the commit is linked to an issue. To determine whether a commit adds a test, we check that its diff adds the @Test decorator along with a test body. In addition, we link a commit to a bug report (or an issue in GitHub) if (i) the commit message mentions "(fixes/resolves/closes) #NUM", or (ii) the commit message mentions a pull request, which in turn mentions an issue. We compare the number of tests added by such report-related commits to the size of the current (August 2022) test suite to estimate the prevalence of such tests. As different repositories have different issue-handling practices, we filter out repositories that have no issue-related commits that add tests, as this indicates a different bug handling practice (e.g. google/guava). Accordingly, we analyze 300 repositories, as shown in Table I.

TABLE I: Analyzed repository characteristics

Repository Characteristic                              # Repositories
Could be cloned                                        970
Had a JUnit test (@Test is found in repository)        550
Had issue-referencing commit that added test           300
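The commit-level heuristics described above can be illustrated with a minimal Python sketch. This is not the study's actual mining script: the function names and the exact regular expression are ours, and the real analysis additionally resolves pull-request references and compares the resulting counts against the current test suite size.

import re

ISSUE_REF = re.compile(r"\b(?:fixes|resolves|closes)\s+#\d+", re.IGNORECASE)

def adds_test(diff_text):
    # Heuristic from the study: the diff must add an @Test annotation (with a test body).
    return any(line.startswith("+") and "@Test" in line for line in diff_text.splitlines())

def linked_to_issue(commit_message, linked_pr_messages=()):
    # A commit is linked to an issue if it, or a pull request it mentions, references one.
    return any(ISSUE_REF.search(text) for text in (commit_message, *linked_pr_messages))

def count_report_related_test_commits(commits):
    # commits: iterable of (message, diff, linked_pr_messages) tuples mined from one repository.
    return sum(1 for msg, diff, prs in commits if adds_test(diff) and linked_to_issue(msg, prs))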
We find that the median ratio of tests added by issue-referencing commits in those 300 repositories, relative to the current test suite size, is 28.4%, suggesting that a significant number of tests are added due to bug reports. We note that this does not mean 28.4% of tests in a test suite originate from bug reports, as we do not track what happens to tests after they are added. Nonetheless, it indicates the report-to-test activity plays a significant role in the evolution of test suites. Based on this result, we conclude that the report-to-test generation problem is regularly handled by open source developers. By extension, an automated report-to-test technique that suggests and/or automatically commits confirmed tests would help developers in their natural workflow.

Despite the importance of the problem, its general form remains a difficult problem to solve. Existing work attempts to solve special cases of the problem by focusing on different aspects: some classify the sentences of a report into categories like observed or expected behavior [19], while others only reproduce crashes (crash reproduction) [6], [20]. We observe that solving this problem requires good understanding of both natural and programming language, not to mention capabilities to perform deduction. For example, the bug report in Table II does not explicitly specify any code, but a fluent user in English and Java would be capable of deducing that when both arguments are NaN, the ‘equals’ methods in ‘MathUtils’ should return false.

One promising solution is to harness the capabilities of pretrained Large Language Models (LLMs). LLMs are generally Transformer-based neural networks [13] trained with the language modeling objective, i.e. predicting the next token based on preceding context. One of their main novelties is that they can perform tasks without training: simply by ‘asking’ the LLM to perform a task via a textual prompt, the LLM is often capable of actually performing the task [11]. Thus, one point of curiosity is how many bugs LLMs can reproduce given a report. On the other hand, of practical importance is to be able to know when we should believe and use the LLM results, as noted in the introduction. To this end, we focus on finding heuristics indicative of high precision, and minimize the hassle that a developer would have to deal with when using LIBRO.

III. APPROACH

[Fig. 1: Overview of LIBRO. A bug report flows through (A) Prompt Engineering, (B) LLM Querying, (C) Post-processing, and (D) Selection & Ranking, yielding ranked test clusters.]

An overview diagram of our approach is presented in Figure 1. Given a bug report, LIBRO first constructs a prompt to query an LLM (Figure 1:(A)). Using this prompt, an initial set of test candidates are generated by querying the LLM multiple times (Figure 1:(B)). Then, LIBRO processes the tests to make them executable in the target program (Figure 1:(C)). LIBRO subsequently identifies and curates tests that are likely to be bug reproducing, and if so, ranks them to minimize developer inspection effort (Figure 1:(D)). The rest of this section explains each stage in more detail using the running example provided in Table II.

A. Prompt Engineering

LLMs are, at the core, large autocomplete neural networks: prior work has found that different ways of ‘asking’ the LLM to solve a problem will lead to significantly varying levels of performance [21]. Finding the best query to accomplish the given task is known as prompt engineering [22].

To make an LLM generate a test method from a given bug report, we construct a Markdown document, which is to be used in the prompt, from the bug report: consider the example in Listing 1, which is a Markdown document constructed from the bug report shown in Table II. LIBRO adds a few distinctive parts to the Markdown document: the command “Provide a self-contained example that reproduces this issue”, the start of a block of code in Markdown (i.e., ```), and finally the partial code snippet public void test, whose role is to induce the LLM to write a test method.

TABLE II: Example bug report (Defects4J Math-63).

Issue No.     MATH-370¹
Title         NaN in “equals” methods
Description   In “MathUtils”, some “equals” methods will return true if both argument are NaN. Unless I’m mistaken, this contradicts the IEEE standard. If nobody objects, I’m going to make the changes.

¹ https://issues.apache.org/jira/browse/MATH-370

Listing 1: Example prompt without examples.

# NaN in "equals" methods
## Description
In "MathUtils", some "equals" methods will return true if both argument are NaN.
Unless I'm mistaken, this contradicts the IEEE standard.
If nobody objects, I'm going to make the changes.

## Reproduction
>Provide a self-contained example that reproduces this issue.
```
public void test
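The prompt layout of Listing 1 is mechanical enough to sketch in a few lines of Python. This is an illustration only, assuming the layout above; how LIBRO concatenates few-shot examples internally is an assumption on our part.

def report_to_markdown(title, description):
    # Render a bug report in the Markdown layout of Listing 1.
    return (f"# {title}\n"
            f"## Description\n{description}\n\n"
            "## Reproduction\n"
            ">Provide a self-contained example that reproduces this issue.\n")

def build_prompt(title, description, examples=()):
    # `examples` holds (title, description, test_code) triples used as few-shot
    # demonstrations; their exact formatting here is a hypothetical choice.
    parts = []
    for ex_title, ex_description, ex_test in examples:
        parts.append(report_to_markdown(ex_title, ex_description)
                     + "```\n" + ex_test + "\n```\n")
    parts.append(report_to_markdown(title, description) + "```\npublic void test")
    return "\n".join(parts)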
We evaluate a range of variations of this basic prompt. Brown et al. [11] report that LLMs benefit from question-answer examples provided in the prompt. In our case, this means providing examples of bug reports (questions) and the corresponding bug reproducing tests (answers). With this in mind, we experiment with a varying number of examples, to see whether adding more examples, and whether having examples from within the same project or from other projects, significantly influences performance.

As there is no real restriction to the prompt format, we also experiment with providing stack traces for crash bugs (to simulate situations where a stack trace was provided), or providing constructors of the class where the fault is located (to simulate situations where the location of the bug is reported).

Our specific template format makes it highly unlikely that prompts we generate exist verbatim within the LLM training data. Further, most reports in practice are only connected to the bug-revealing test via a chain of references. As such, our format partly mitigates data leakage concerns, among other steps taken to limit this threat described later in the manuscript.

B. Querying an LLM

Using the generated prompt, LIBRO queries the LLM to predict the tokens that would follow the prompt. Due to the nature of the prompt, it is likely to generate a test method, especially as our prompt ends with the sequence public void test. We ensure that the result only spans the test method by accepting tokens until the first occurrence of the string ```, which indicates the end of the code block in Markdown.

It is known that LLMs yield inferior results when performing completely greedy decoding (i.e., decoding strictly based on the most likely next token) [11]: they perform better when they are doing weighted random sampling, a behavior modulated by the temperature parameter. Following prior work, we set our temperature to 0.7 [11], which allows the LLM to generate multiple distinct tests based on the exact same prompt. We take the approach of generating multiple candidate reproducing tests, then using their characteristics to identify how likely it is that the bug is actually reproduced.
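The sampling loop can be sketched as follows. The `complete` callable is a placeholder for whichever completion API is available (LIBRO used Codex, code-davinci-002, via the OpenAI API); the stop condition mirrors the truncation at the closing Markdown fence described above.

def sample_candidate_tests(complete, prompt, n=50, temperature=0.7, max_tokens=256):
    # Draw n completions and truncate each at the closing Markdown fence.
    candidates = []
    for _ in range(n):
        generated = complete(prompt, temperature=temperature, max_tokens=max_tokens)
        # The prompt already ends with "public void test", so prepend it to the continuation.
        body = "public void test" + generated.split("```", 1)[0]
        candidates.append(body.rstrip())
    return candidates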
An example output from the LLM given the prompt in Listing 1 is shown in Listing 2: at this point, the outputs from the LLM typically cannot be compiled on their own, and need other constructs such as import statements. We next present how LIBRO integrates a generated test into the existing test suite to make it executable.

Listing 2: Example LLM result from the bug report described in Table II.

public void testEquals() {
    assertFalse(MathUtils.equals(Double.NaN, Double.NaN));
    assertFalse(MathUtils.equals(Float.NaN, Float.NaN));
}

C. Test Postprocessing

We first describe how LIBRO injects a test method into an existing suite, then how LIBRO resolves the remaining unmet dependencies.

Algorithm 1: Test Postprocessing
Input: A test method tm; test suite T of SUT; source code files S of SUT
Output: Updated test suite T′
1  c_best ← findBestMatchingClass(tm, T);
2  deps ← getDependencies(tm);
3  needed_deps ← getUnresolved(deps, c_best);
4  new_imports ← set();
5  for dep in needed_deps do
6      target ← findClassDef(dep, S);
7      if target is null then
8          new_imports.add(findMostCommonImport(dep, S, T));
9      else
10         new_imports.add(target);
11 T′ ← injectTest(tm, c_best, T);
12 T′ ← injectDependencies(new_imports, c_best, T′);

1) Injecting a test into a suitable test class: If a developer finds a test method in a bug report, they will likely insert it into a test class which will provide the required context for the test method (such as the required dependencies). For example, for the bug in our running example, the developers added a reproducing test to the MathUtilsTest class, where most of the required dependencies are already imported, including the focal class, MathUtils. Thus, it is natural to also inject LLM-generated tests into existing test classes, as this matches developer workflow, while resolving a significant number of initially unmet dependencies.

Listing 3: Target test class to which the test in Listing 2 is injected.

1 public final class MathUtilsTest extends TestCase {
2     ...
3     public void testArrayEquals() {
4         assertFalse(MathUtils.equals(new double[] { 1d }, null));
5         assertTrue(MathUtils.equals(new double[] {
6             Double.NaN, Double.POSITIVE_INFINITY,
7             ...

To find the best test class to inject our test methods into, we find the test class that is lexically most similar to the generated test (Algorithm 1, line 1). The intuition is that, if a test method belongs to a test class, the test method likely uses similar methods and classes, and is thus lexically related, to other tests from that test class. Formally, we assign a matching score for each test class based on Equation (1):

    sim_ci = |T_t ∩ T_ci| / |T_t|    (1)

where T_t and T_ci are the sets of tokens in the generated test method and the ith test class, respectively. As an example, Listing 3 shows the key statements of the MathUtilsTest class. Here, the test class contains similar method invocations and constants to those used by the LLM-generated test in Listing 2, particularly in lines 4 and 6.
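Equation (1) amounts to a token-overlap score; a minimal sketch is shown below. The tokenisation is a crude stand-in of our own, not LIBRO's actual lexer (the paper parses Java with javalang).

import re

def lexical_tokens(source):
    # Crude token set: identifiers/keywords plus individual punctuation characters.
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\S", source))

def matching_score(generated_test, test_class_source):
    # Equation (1): |T_t ∩ T_ci| / |T_t|.
    t_t = lexical_tokens(generated_test)
    if not t_t:
        return 0.0
    return len(t_t & lexical_tokens(test_class_source)) / len(t_t)

def find_best_matching_class(generated_test, test_classes):
    # test_classes: dict mapping a test class path to its source; pick the injection target.
    return max(test_classes, key=lambda path: matching_score(generated_test, test_classes[path]))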
As a sanity check, we inject ground-truth developer-added bug reproducing tests from the Math and Lang projects of the Defects4J benchmark, and check if they execute normally based on Algorithm 1. We find execution proceeds as usual for 89% of the time, suggesting that the algorithm reasonably finds environments in which tests can be executed.

2) Resolving remaining dependencies: Although many dependency issues are resolved by placing the test in the right class, the test may introduce new constructs that need to be imported. To handle these cases, LIBRO heuristically infers packages to import.

Lines 2 to 10 in Algorithm 1 describe the dependency resolving process of LIBRO. First, LIBRO parses the generated test method and identifies variable types and referenced class names/constructors/exceptions. LIBRO then filters “already imported” class names by lexically matching names to existing import statements in the test class (Line 3).

As a result of this process, we find types that are not resolved within the test class. LIBRO first attempts to find public classes with the identified name of the type; if there is exactly one such file, the classpath to the identified class is derived (Line 7), and an import statement is added (Line 11). However, either no or multiple matching classes may exist. In both cases, LIBRO looks for import statements ending with the target class name within the project (e.g., when searching for MathUtils, LIBRO looks for import .*MathUtils;). LIBRO selects the most common import statement across all project source code files. Additionally, we add a few rules that allow assertion statements to be properly imported, even when there are no appropriate imports within the project itself.
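The majority-vote fallback for unresolved types can be sketched as follows; this is an illustration under our own simplifications (it operates on raw source text and omits the extra rules for assertion imports).

import re
from collections import Counter

def infer_import(unresolved_type, project_sources):
    # Fallback when no unique class definition is found: search all project files for
    # import statements ending with the type name and keep the most common one.
    pattern = re.compile(r"^\s*import\s+((?:\w+\.)+%s)\s*;" % re.escape(unresolved_type),
                         re.MULTILINE)
    votes = Counter(match for src in project_sources for match in pattern.findall(src))
    if not votes:
        return None  # a real implementation falls back to additional rules here
    return "import %s;" % votes.most_common(1)[0][0]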
Our postprocessing pipeline does not guarantee compilation in all cases, but the heuristics used by LIBRO are capable of resolving most of the unhandled dependencies of a raw test method. After going through the postprocessing steps, LIBRO executes the tests to identify candidate bug reproducing tests.

D. Selection and Ranking

A test is a Bug Reproducing Test (BRT) if and only if the test fails due to the bug specified in the report. A necessary condition for a test generated by LIBRO to be a BRT is that the test compiles and fails in the buggy program: we call such tests FIB (Fail In the Buggy program) tests. However, not all FIB tests are BRTs, making it difficult to tell whether bug reproduction has succeeded or not. This is one factor that separates us from crash reproduction work [20], as crash reproduction techniques can confirm whether the bug has been reproduced by comparing the stack traces at the time of crash.

On the other hand, it is imprudent to present all generated FIB tests to developers, as asking developers to iterate over multiple solutions is generally undesirable [23], [24]. As such, LIBRO attempts to decide when to suggest a test and, if so, which test to suggest, using several patterns we observe to be correlated to successful bug reproductions.

Algorithm 2: Test Selection and Ranking
Input: Pairs of modified test suites and injected test methods ST′; target program with bug P_b; bug report BR; agreement threshold Thr
Output: Ordered list of ranked tests ranking
1  FIB ← set();
2  for (T′, tm_i) ∈ ST′ do
3      r ← executeTest(T′, P_b);
4      if hasNoCompileError(r) && isFailed(tm_i, r) then
5          FIB.add((tm_i, r));
6  clusters ← clusterByFailureOutputs(FIB);
7  output_clus_size ← clusters.map(size);
8  max_output_clus_size ← max(output_clus_size);
9  if max_output_clus_size ≤ Thr then
10     return list();
11 FIB_uniq ← removeSyntacticEquivalents(FIB);
12 br_output_match ← clusters.map(matchOutputWithReport(BR));
13 br_test_match ← FIB_uniq.map(matchTestWithReport(BR));
14 tok_cnts ← FIB_uniq.map(countTokens);
15 ranking ← list();
16 clusters ← clusters.sortBy(br_output_match, output_clus_size, tok_cnts);
17 for clus ∈ clusters do
18     clus ← clus.sortBy(br_test_match, tok_cnts);
19 for i = 0; i < max(output_clus_size); i ← i + 1 do
20     for clus ∈ clusters do
21         if i < clus.length() then
22             ranking.push(clus[i]);
23 return ranking;

Algorithm 2 outlines how LIBRO decides whether to present results and, if so, how to rank the generated tests. In Lines 1-10, LIBRO first decides whether to show the developer any results at all (selection). We group the FIB tests that exhibit the same failure output (the same error type and error message) and look at the number of the tests in the same group (the max_output_clus_size in Line 8). This is based on the intuition that, if multiple tests show similar failure behavior, then it is likely that the LLM is ‘confident’ in its predictions as its independent predictions ‘agree’ with each other, and there is a good chance that bug reproduction has succeeded. LIBRO can be configured to only show results when there is significant agreement in the output (setting the agreement threshold Thr high) or show more exploratory results (setting Thr low).

Once it decides to show its results, LIBRO relies on three heuristics to rank generated tests, in the order of increasing discriminative strength. First, tests are likely to be bug reproducing if the fail message and/or the test code shows the behavior (exceptions or output values) observed and mentioned in the bug report. While this heuristic is precise, its decisions are not very discriminative, as tests can only be divided into groups of ‘contained’ versus ‘not contained’. Next, we look at the ‘agreement’ between generated tests by looking at output cluster size (output_clus_size), which represents the ‘consensus’ of the LLM generations. Finally, LIBRO prioritizes based on test length (as shorter tests are easier to understand), which is the finest-grained signal. We first leave only syntactically unique tests (Line 11), then sort output clusters and tests within those clusters using the heuristics above (Lines 16 and 18).

As tests with the same failure output are similar to each other, we expect that, if one test from a cluster is not a BRT, the rest from the same cluster are likely not BRTs as well. Hence, LIBRO shows tests from a diverse array of clusters. For each ith iteration in Lines 19-22, the ith ranked test from each cluster is selected and added to the list.
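The selection gate and the round-robin interleaving of Algorithm 2 can be sketched compactly in Python. For brevity, the report-matching heuristic is omitted here, so this is a simplified sketch rather than a faithful reimplementation.

from collections import defaultdict

def select_and_rank(fib_tests, thr=1):
    # fib_tests: (test_code, failure_output) pairs for generated tests that compile and
    # fail on the buggy version. Return nothing when agreement is too low; otherwise
    # rank clusters by size and interleave them, preferring shorter tests in a cluster.
    clusters = defaultdict(list)
    for code, output in fib_tests:
        clusters[output].append(code)
    if not clusters or max(len(tests) for tests in clusters.values()) <= thr:
        return []
    ordered = sorted(clusters.values(), key=len, reverse=True)    # larger consensus first
    ordered = [sorted(set(tests), key=len) for tests in ordered]  # dedupe, shorter first
    ranking = []
    for i in range(max(len(tests) for tests in ordered)):         # round-robin (Lines 19-22)
        ranking.extend(tests[i] for tests in ordered if i < len(tests))
    return ranking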
IV. EVALUATION

This section provides evaluation details for our experiments.

A. Dataset

As a comprehensive evaluation benchmark, we use Defects4J version 2.0, which is a manually curated dataset of real-world bugs gathered from 17 Java projects. Each Defects4J bug is paired to a corresponding bug report², which makes the dataset ideal for evaluating the performance of LIBRO. Among 814 bugs that have a paired bug report, we find that 58 bugs have incorrect pairings, while six bugs have different directory structures between the buggy and fixed versions: this leaves 750 bugs for us to evaluate LIBRO on. 60 bugs in the Defects4J benchmark are included in the JCrashPack [25] dataset used in the crash reproduction literature; we use this subset when comparing against crash reproduction techniques.

² Except for the Chart project, for which only 8 bugs have reports.

As Codex, the LLM we use, was trained with data collected until July 2021, the Defects4J dataset is not free from data leakage concerns, even if the prompt format we use is unlikely to have appeared verbatim in the data. To mitigate such concerns, from 17 GitHub repositories³ that use JUnit, we gather 581 Pull Requests (PRs) created after the Codex training data cutoff point, ensuring that this dataset could not have been used to train Codex. We further check if a PR adds any test to the project (435 left after discarding non-test-introducing ones), and filter out the PRs that are not merged to the main branch or associated with multiple issues (84 left).

For these 84 PRs, we verify that the bug can be reproduced by checking that a developer-added test added by the PR fails on the pre-merge commit without compilation errors, and passes on the post-merge commit. We add the pair to our final list only when all of them have been reproduced. After the final check, we end up with 31 reproducible bugs and their bug reports. This dataset is henceforth referred to as the GHRB (GitHub Recent Bugs) dataset. We use this dataset to verify that trends observed in Defects4J are not due to data leakage.

³ These repositories have been manually chosen from either Defects4J projects that are on GitHub and open to new issues, or Java projects that have been modified since 10th July 2022 with at least 100 or more stars, as of 1st of August 2022. A list of 17 repositories is available in our artifact.
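The pre/post-merge reproduction check behind GHRB admits a straightforward sketch. The build invocation is project-specific and therefore abstracted behind a `run_test` callable (which might wrap, e.g., Maven or Gradle); the reported return values are our own convention.

import subprocess

def reproduced_by_developer_test(repo_dir, pre_merge_sha, post_merge_sha, run_test):
    # The developer-added test must fail (without compile errors) before the fix
    # and pass after it; run_test() returns 'pass', 'fail', or 'compile_error'.
    def result_at(sha):
        subprocess.run(["git", "-C", repo_dir, "checkout", "--quiet", sha], check=True)
        return run_test()
    return result_at(pre_merge_sha) == "fail" and result_at(post_merge_sha) == "pass"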
B. Metrics

A test is treated as a Bug Reproducing Test (BRT) in our evaluation if it fails on the version that contains the bug specified in the report, and passes on the version that fixes the bug. We say that a bug is reproduced if LIBRO generates at least one BRT for that bug. The number of bugs that are reproduced is counted for each evaluation technique.

We use the PRE_FIX_REVISION and POST_FIX_REVISION versions in the Defects4J benchmark as the buggy/fixed versions, respectively. The two versions reflect the actual state of the project when the bug was discovered/fixed. For the GHRB dataset, as we gathered the data based on code changes from merged pull requests, we use pre-merge and post-merge versions as the buggy/fixed versions.

EvoCrash [20], the crash reproduction technique we compare with, originally checks whether the crash stack is reproduced in the buggy version. For fair comparison, we evaluate EvoCrash under our reproduction criterion: EvoCrash-generated tests are executed on the buggy and fixed versions, and when execution results change, we treat the test as a BRT.

To evaluate the rankings produced by LIBRO, we focus on two aspects: the capability of LIBRO to rank the actual BRTs higher, and the degree of effort required from developers to inspect the ranked tests. For the former, we use the acc@n metric, which counts the number of bugs whose BRTs are found within the top n places in the ranking. Additionally, we report the precision of LIBRO by dividing acc@n with the number of all selected bugs, representing how often a developer would accept a suggestion by LIBRO. To estimate developer effort, we use the wef metric: the number of non-reproducing tests ranked higher than any bug reproducing test. If there are no BRTs, we report wef as the total number of the target FIB tests in the ranking. We also use wef@n, which shows the wasted effort when using the top n candidates.
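For concreteness, the two ranking metrics can be computed as follows; the data structures are ours (a ranked list of booleans per bug, True marking a BRT), not LIBRO's internal representation.

def wef_at_n(ranked_is_brt, n):
    # Wasted effort: non-reproducing tests ranked above the first BRT within the top n.
    top = ranked_is_brt[:n]
    return top.index(True) if True in top else len(top)

def acc_at_n(rankings, n):
    # Number of bugs for which a BRT appears within the top n suggestions.
    # `rankings` maps each selected bug to its ranked list of booleans.
    return sum(1 for ranked in rankings.values() if True in ranked[:n])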
C. Environment

All experiments are performed on a machine running Ubuntu 18.04.6 LTS, with 32GB of RAM and an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz. We access OpenAI Codex via its closed beta API, using the code-davinci-002 model. For Codex, we set the temperature to 0.7, and the maximum number of tokens to 256. We script our experiments using Python 3.9, and parse Java files with the javalang library [26]. Our replication package is online⁴.

⁴ https://anonymous.4open.science/r/llm-testgen-artifact-2753

V. RESEARCH QUESTIONS

We aim to answer the following research questions.

A. RQ1: Efficacy

With RQ1, we seek to quantitatively evaluate the performance of LIBRO using the Defects4J benchmark.
• RQ1-1: How many bug reproducing tests can LIBRO generate? We evaluate how many bugs in total are reproduced by LIBRO using various prompt settings.
• RQ1-2: How does LIBRO compare to other techniques? In the absence of generic report-to-test techniques, we compare against EvoCrash, a crash reproduction technique. We also compare against a ‘Copy&Paste’ baseline that directly uses code snippets (identified with the HTML <pre> tag or via infoZilla [27]) within the bug report as tests. For code that could be parsed as a Java compilation unit, we add the code as a test class and add JUnit imports if necessary to run it as a test. Otherwise, we wrap the code snippet in a test method and evaluate it under the same conditions as LIBRO.

B. RQ2: Efficiency

With RQ2, we examine the efficiency of LIBRO in terms of the amount of resources it uses, to provide an estimate of the costs of deploying LIBRO in a real-world context.
• RQ2-1: How many Codex queries are required? We estimate how many queries are needed to achieve a certain bug-reproduction rate on the Defects4J dataset based on a pool of generated tests.
• RQ2-2: How much time does LIBRO need? Our technique consists of querying an LLM, making it executable, and ranking: we measure the time taken at each stage.
• RQ2-3: How many tests should the developer inspect? We evaluate how many bugs could be reproduced within 1, 3, and 5 suggestions, along with the amount of ‘wasted effort’ required from the developer.

C. RQ3: Practicality

Finally, with RQ3, we aim to investigate how well LIBRO generalizes by applying it to the GHRB dataset.
• RQ3-1: How often can LIBRO reproduce bugs in the wild? To mitigate data leakage issues, we evaluate LIBRO on the GHRB dataset, checking how many bugs can be reproduced on it.
• RQ3-2: How reliable are the selection and ranking techniques of LIBRO? We investigate whether the factors that were used during selecting bugs and ranking tests for the Defects4J dataset are still valid for the GHRB dataset, and thus can be used for other projects in general.
• RQ3-3: What does reproduction success and failure look like? To provide qualitative context to our results, we describe examples of bug reproduction success and failure from the GHRB dataset.

VI. EXPERIMENTAL RESULTS

A. RQ1. How effective is LIBRO?

1) RQ1-1: Table III shows which prompt/information settings work best, where n = N means we queried the LLM N times for reproducing tests. When using examples from the source project, we use the oldest tests available within that project; otherwise, we use two handpicked report-test pairs (Time-24, Lang-1) throughout all projects. We find that providing constructors (à la AthenaTest [28]) does not help significantly, but adding stack traces does help reproduce crash bugs, indicating that LIBRO can benefit from using the stack information to replicate issues more accurately. Interestingly, adding within-project examples shows poorer performance: inspection of these cases has revealed that, in such cases, LIBRO simply copied the provided example even when it should not have, leading to lower performance. We also find that the number of examples makes a significant difference (two-example n=10 values are sampled from n=50 results from the default setting), confirming the existing finding that adding examples helps improve performance. In turn, the number of examples seems to matter less than the number of times the LLM is queried, as we further explore in RQ2-1. As the two-example n=50 setting shows the best performance, we use it as the default setting throughout the rest of the paper.

TABLE III: Reproduction performance for different prompts

Setting                                         reproduced   FIB
No Example (n=10)                               124          440
One Example (n=10)                              166          417
One Example from Source Project (n=10)          152          455
One Example with Constructor Info (n=10)        167          430
Two Examples (n=10, 5th percentile)             161          386
Two Examples (n=10, median)                     173          409
Two Examples (n=10, 95th percentile)            184          429
Two Examples (n=50)                             251          570
One Example, Crash Bugs (n=10)                  69           153
One Example with Stack, Crash Bugs (n=10)       84           155

Under the two-example n=50 setting, we find that overall 251 bugs, or 33.5% of 750 studied Defects4J bugs, are reproduced by LIBRO. Table IV presents a breakdown of the performance per project. While there is at least one bug reproduced for every project, the proportion of bugs reproduced can vary significantly. For example, LIBRO reproduces a small number of bugs in the Closure project, which is known to have a unique test structure [29]. On the other hand, the performance is stronger for the Lang or Jsoup projects, whose tests are generally self-contained and simple. Additionally, we find that the average length of the generated test body is about 6.5 lines (excluding comments and whitespace), indicating LIBRO is capable of writing meaningfully long tests.

TABLE IV: Bug reproduction per project in Defects4J: x/y means x reproduced out of y bugs

Project       rep/total   Project            rep/total   Project    rep/total
Chart         5/7         Csv                6/16        JxPath     3/19
Cli           14/29       Gson               7/11        Lang       46/63
Closure       2/172       JacksonCore        8/24        Math       43/104
Codec         10/18       JacksonDatabind    30/107      Mockito    1/13
Collections   1/4         JacksonXml         2/6         Time       13/19
Compress      4/46        Jsoup              56/92       Total      251/750

Answer to RQ1-1: A large (251) number of bugs can be replicated automatically, with bugs replicated over a diverse group of projects. Further, the number of examples in the prompt and the number of generation attempts have a strong effect on performance.

2) RQ1-2: We further compare LIBRO against the state-of-the-art crash reproduction technique, EvoCrash, and the ‘Copy&Paste baseline’ that uses code snippets from the bug reports. We present the comparison results in Figure 2. We find LIBRO replicates a large and distinct group of bugs compared to other baselines. LIBRO reproduced 91 more unique bugs (19 being crash bugs) than EvoCrash, which demonstrates that LIBRO can reproduce non-crash bugs prior work could not handle (Fig. 2(b)). On the other hand, the Copy&Paste baseline shows that, while the BRT is sometimes included in the bug report, the report-to-test task is not at all trivial. Interestingly, eight bugs reproduced by the Copy&Paste baseline were not reproduced by LIBRO; we find that this is due to long tests that exceed the generation length of LIBRO, or due to dependency on complex helper functions.

[Fig. 2: Baseline comparison on bug reproduction capability — Venn diagrams of bugs reproduced by EvoCrash, the Copy&Paste baseline, and LIBRO on (a) JCP&D4J bugs (all crashes), (b) JCP&D4J projects (with noncrashes), and (c) all D4J projects.]

Answer to RQ1-2: LIBRO is capable of replicating a large and distinct group of bugs relative to prior work.

B. RQ2. How efficient is LIBRO?

1) RQ2-1: Here, we investigate how many tests must be generated to attain a certain bug reproduction performance. To do so, for each Defects4J bug, we randomly sample x tests from the 50 generated under the default setting, leaving a reduced number of tests per bug. We then check the number of bugs reproduced y when using only those sampled tests. We repeat this process 1,000 times to approximate the distribution. The results are presented in Figure 3. Note that the x-axis is in log scale. Interestingly, we find a logarithmic relation holds between the number of test generation attempts and the median bug reproduction performance. This suggests that it becomes increasingly difficult, yet stays possible, to replicate more bugs by simply generating more tests. As the graph shows no signs of plateauing, experimenting with an even greater sample of tests may result in better bug reproduction results.

[Fig. 3: Generation attempts to performance. Left depicts bugs reproduced as attempts increase, right for FIB; both panels show 1000-run simulations over LIBRO runs (log scale), with median, 50%, 80% and 95% intervals.]

Answer to RQ2-1: The number of bugs reproduced increases logarithmically to the number of tests generated, with no sign of performance plateauing.
with no sign of performance plateauing. 0.87(= 219251 ). From the opposite perspective, the selection
process filters out 188 bugs that were not reproduced, while
dropping only a few successfully reproduced bugs. Note that
TABLE V: The time required for the pipeline of L IBRO if we set the threshold to 10, a more aggressive value, we
Prompt API Processing Running Ranking Total
can achieve a higher precision of 0.84 for a recall of 0.42.
Single Run <1 µs 5.85s 1.23s 4.00s - 11.1s
In any case, as Figure 4 presents, our selection technique
50-test Run <1 µs 292s 34.8s 117s 0.02s 444s is significantly better than random, indicating it can save
developer resources.
2) RQ2-2: We report the time it takes to perform each Among the selected bugs, we assess how effective the test
step of our pipeline in Table V. We find API querying takes rankings of L IBRO are over a random baseline. The random
the greatest amount of time, requiring about 5.85 seconds. approach randomly ranks the syntactic clusters (groups of
Postprocessing and test executions take 1.23 and 4 seconds per syntactically equivalent FIB tests) of the generated tests. We
test (when the test executes), respectively. Overall, L IBRO took run the random baseline 100 times and average the results.
an average of 444 seconds to generate 50 tests and process Table VI presents the ranking evaluation results. On the
them, which is well within the 10-minute search budget often Defects4J benchmark, the ranking technique of L IBRO im-
used by search-based techniques [20]. proves upon the random baseline across all of the acc@n
metrics, presenting 30, 14, and 7 more BRTs than the random
Answer to RQ2-2: Our time measurement suggests that
baseline on n = 1, 3, and 5 respectively. Regarding acc@1, the
L IBRO does not take a significantly longer time than other
first column shows that 43% of the top ranked tests produced
methods to use.
by L IBRO successfully reproduce the original bug report on
3) RQ2-3: With this research question, we measure how the first try. When n increases to 5, BRTs can be found in
effectively L IBRO prioritizes bug reproducing tests via its 57% of the selected bugs, or 80% of all reproduced bugs.
selection and ranking procedure. As L IBRO only shows results The conservative threshold choice here, emphasizes recall over
above a certain agreement threshold, T hr from Section III-D, precision. However, if the threshold is raised, the maximum
we first present the trade-off between the number of total bugs precision can rise to 0.8 (for T hr = 10, n = 5).
reproduced and precision (i.e., the proportion of successfully The wef @nagg values are additionally reported by both
reproduced bugs among all selected by L IBRO) in Figure 4. As summing and averaging the wef @n of all (350) selected bugs.
we increase the threshold, more suggestions (including BRTs) The summed wef @n value indicates the total number of non-
are discarded, but the precision gets higher, suggesting one can BRTs that would be manually examined within the top n
smoothly increase precision by tuning the selection threshold. ranked tests. Smaller wef @n values indicate that a technique
We specifically set the agreement threshold to 1, a conser- delivers more bug reproducing tests. Overall, the ranking of
vative value, in order to preserve as many reproduced bugs L IBRO saves up to 14.5% of wasted effort when compared
as possible. Among the 570 bugs with a FIB, 350 bugs to the random baseline, even after bugs are selected. Based
are selected. Of those 350, 219 are reproduced (leading to on these results, we conclude that L IBRO can reduce wasted
219
a precision of 0.63(= 350 ) whereas recall (i.e., proportion inspection effort and thus be useful to assist developers.
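The precision/recall trade-off of the selection threshold can be reproduced with a simple sweep; the input encoding (one (max_output_clus_size, reproduced) pair per bug with a FIB test) is an assumption of this sketch.

def selection_tradeoff(bugs, thresholds=range(0, 41)):
    # bugs: (max_output_clus_size, reproduced) pairs. A bug is selected when its
    # largest agreement cluster exceeds the threshold, mirroring Algorithm 2.
    total_reproduced = sum(1 for _, reproduced in bugs if reproduced)
    rows = []
    for thr in thresholds:
        selected = [reproduced for size, reproduced in bugs if size > thr]
        hits = sum(selected)
        precision = hits / len(selected) if selected else 0.0
        recall = hits / total_reproduced if total_reproduced else 0.0
        rows.append((thr, len(selected), precision, recall))
    return rows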
Among the selected bugs, we assess how effective the test rankings of LIBRO are over a random baseline. The random approach randomly ranks the syntactic clusters (groups of syntactically equivalent FIB tests) of the generated tests. We run the random baseline 100 times and average the results.

TABLE VI: Ranking Performance Comparison between LIBRO and Random Baseline

      Defects4J                                          GHRB
      acc@n (precision)         wef@n_agg                acc@n (precision)       wef@n_agg
n     LIBRO        random       LIBRO       random       LIBRO      random       LIBRO      random
1     149 (0.43)   116 (0.33)   201 (0.57)  234 (0.67)   6 (0.29)   4.8 (0.23)   15 (0.71)  16.2 (0.77)
3     184 (0.53)   172 (0.49)   539 (1.54)  599 (1.71)   7 (0.33)   6.6 (0.31)   42 (2.0)   44.6 (2.12)
5     199 (0.57)   192 (0.55)   797 (2.28)  874 (2.5)    8 (0.38)   7.3 (0.35)   60 (2.86)  64.3 (3.06)

Table VI presents the ranking evaluation results. On the Defects4J benchmark, the ranking technique of LIBRO improves upon the random baseline across all of the acc@n metrics, presenting 30, 14, and 7 more BRTs than the random baseline on n = 1, 3, and 5 respectively. Regarding acc@1, the first column shows that 43% of the top ranked tests produced by LIBRO successfully reproduce the original bug report on the first try. When n increases to 5, BRTs can be found in 57% of the selected bugs, or 80% of all reproduced bugs. The conservative threshold choice here emphasizes recall over precision. However, if the threshold is raised, the maximum precision can rise to 0.8 (for Thr = 10, n = 5).

The wef@n_agg values are additionally reported by both summing and averaging the wef@n of all (350) selected bugs. The summed wef@n value indicates the total number of non-BRTs that would be manually examined within the top n ranked tests. Smaller wef@n values indicate that a technique delivers more bug reproducing tests. Overall, the ranking of LIBRO saves up to 14.5% of wasted effort when compared to the random baseline, even after bugs are selected. Based on these results, we conclude that LIBRO can reduce wasted inspection effort and thus be useful to assist developers.

Answer to RQ2-3: LIBRO can reduce both the number of bugs and tests that must be inspected: 33% of the bugs are safely discarded while preserving 87% of the successful bug reproduction. Among selected bug sets, 80% of all bug reproductions can be found within 5 inspections.

C. RQ3. How well would LIBRO work in practice?

1) RQ3-1: We explore the performance of LIBRO when operating on the GHRB dataset of recent bug reports. We find that of the 31 bug reports we study, LIBRO can automatically generate bug reproducing tests for 10 bugs based on 50 trials, for a success rate of 32.2%. This success rate is similar to the results from Defects4J presented in RQ1-1, suggesting LIBRO generalizes to new bug reports. A breakdown of results by project is provided in Table VII. Bugs are successfully reproduced in AssertJ, Jsoup, Gson, and sslcontext, while they were not reproduced in the other two. We could not reproduce bugs from the Checkstyle project, despite it having a large number of bugs; upon inspection, we find that this is because the project’s tests rely heavily on external files, which LIBRO has no access to, as shown in Section VI-C3. LIBRO also does not generate BRTs for the Jackson project, but the small number of bugs in the Jackson project makes it difficult to draw conclusions from it.

TABLE VII: Bug Reproduction in GHRB: x/y means x reproduced out of y bugs

Project      rep/total   Project    rep/total   Project      rep/total
AssertJ      3/5         Jackson    0/2         Gson         4/7
checkstyle   0/13        Jsoup      2/2         sslcontext   1/2

Answer to RQ3-1: LIBRO is capable of generating bug reproducing tests even for recent data, suggesting it is not simply remembering what it trained with.

2) RQ3-2: LIBRO uses several predictive factors correlated with successful bug reproduction for selecting bugs and ranking tests. In this research question, we check whether the identified patterns based on the Defects4J dataset continue to hold in the recent GHRB dataset.

Recall that we use the maximum output cluster size as a measure of agreement among the FIBs, and thus as a selection criterion to identify whether a bug has been reproduced. To observe whether the criterion is a reliable indicator to predict the success of bug reproduction, we observe the trend of max_output_clus_size between the two datasets, with and without BRTs. In Figure 5, we see that the bugs with no BRT typically have small max_output_clus_size, mostly under ten; this pattern is consistent in both datasets.

[Fig. 5: Distribution of the max_output_clus_size values for reproduced and not-reproduced bugs (Defects4J and GHRB).]

The ranking results of GHRB are also presented in Table VI. They are consistent with the results from Defects4J, indicating the features used for our ranking strategy continue to be good indicators of successful bug reproduction.

Answer to RQ3-2: We find that the factors used for the ranking and selection of LIBRO consistently predict bug reproduction in real-world data.

3) RQ3-3: We present case studies of attempts by LIBRO to reproduce bugs that either succeeded or failed.

TABLE VIII: Bug Report Successfully Reproduced: URLs are omitted for brevity (AssertJ-Core Issue #2666)

Title         assertContainsIgnoringCase fails to compare i and I in tr_TR locale
Description   See org.assertj.core.internal.Strings#assertContainsIgnoringCase [url]
              I would suggest adding [url] verification to just ban toLowerCase(), toUpperCase() and other unsafe methods: #2664

We first present a successfully reproduced bug report, issue #2685 for the AssertJ-Core project, in Table VIII. The bug is that, in the tr_TR locale, the assertContainsIgnoringCase method is mishandling the letter I. For this bug report, LIBRO generates the bug reproducing test in Listing 4. Despite the bug report not containing any executable code, LIBRO successfully generated a bug reproducing test that compares I and i, using the containsIgnoringCase method referenced within the report. A BRT for this bug is ranked at the second place, suggesting a developer would have been quickly provided with a reproducing test.

Listing 4: Generated FIB test for AssertJ-Core-2666.

public void testIssue952() {
    Locale locale = new Locale("tr", "TR");
    Locale.setDefault(locale);
    assertThat("I").as("Checking in tr_TR locale").containsIgnoringCase("i");
}

Listing 5: Generated FIB test for Checkstyle-11365.

1 public void testFinalClass() throws Exception {
2     final DefaultConfiguration checkConfig =
3         createModuleConfig(FinalClassCheck.class);
4     final String[] expected = CommonUtil.EMPTY_STRING_ARRAY;
5     verify(checkConfig, getPath("InputFinalClassAnonymousClass.java"), expected);
6 }

We now turn to a bug report that is not successfully reproduced. Table IX contains issue #11365 in the Checkstyle project.
TABLE IX: Bug Report Reproduction Failure: Lightly edited a complete class or method but in the form of source code
for clarity (Checkstyle Issue #11365) statements or expressions); finally 41.5% did not contain code
Title FinalClassCheck: False positive with anonymous classes
snippets inside. Considering only the 251 bug reports that
L IBRO successfully reproduced, the portion of containing the
... I have executed the cli and showed it below, as cli describes the
problem better than 1,000 words full snippets got slightly higher (25.1%), whereas the portion
→src cat Test.java of bug reports with partial snippets was 37.9%, and 37.1% did
[...] not have code snippets. When L IBRO generated tests from bug
public class Test {
class a { // expected no violation reports containing any code snippets, we find that on average
private a(){} } } 81% of the tokens in the body of the L IBRO-generated test
[...] methods overlapped with the tokens in the code snippets.
→java [...] -c config.xml Test.java
Starting audit... We note that using full code snippets provided in reports
[ERROR] Test.java:3:5: Class a should be declared as final. does not always reproduce the bug successfully; recall that the
Copy&Paste baseline in Figure 2 succeeded only on 36 bugs.
Although whether a bug report contains full code snippets
class should be declared final, and mistakenly raises an error. or not may affect the success or failure of L IBRO, L IBRO
A FIB test generated by L IBRO is presented in Listing 5, which generated correct bug reproducing tests even from incomplete
fails as the Java file it references in Line 5 is nonexistent. This code, or without any code snippets. Thus, we argue that L IBRO
highlights a weakness of L IBRO, i.e., its inability to create is capable of both extracting relevant code elements in bug
working environments outside of source code for the generated report and synthesizing code aligned with given a natural
tests. However, if we put the content of Test.java from the language description.
report into the referenced file, the test successfully reproduces
the bug, indicating that the test itself is functional, and that VIII. T HREATS TO VALIDITY
even when a test is initially incorrect, it may reduce the amount Internal Validity concerns whether our experiments demon-
of developer effort that goes into writing reproducing tests. strate causality. In our case, two sources of randomness threat
VII. DISCUSSION

A. Manual Analysis of LIBRO Failures

Despite successfully reproducing 33.5% of the Defects4J bugs, in many cases LIBRO could not reproduce the bugs from the bug reports. To investigate which factors may have caused LIBRO to struggle, we manually analyzed 40 bug reports and corresponding LIBRO outputs. The most common cause of failure, happening in 13 cases, was a need for helper definitions: while the developer-written tests made use of custom testing helper functions which at times spanned hundreds of lines, LIBRO-generated tests were generally agnostic to such functions, and as a result could not adequately use them. This points to a need to incorporate project-specific information for language models to further improve performance. Other failure reasons included low report quality in 11 cases (i.e., a human would have difficulty reproducing the issue as well), the LLM misidentifying the expected behavior in 8 cases, dependency on external resources in 6 cases (as was the case in Listing 5), and finally insufficient LLM synthesis length in 3 cases. This taxonomy of failures suggests future directions to improve LIBRO, which we hope to explore.
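To make the most common failure mode concrete, the sketch below contrasts a developer-written test that leans on a project-specific helper with the kind of self-contained test an LLM tends to produce. The TestHelper class and both tests are invented for illustration and do not come from any studied project.

import static org.junit.Assert.assertEquals;

import java.util.Locale;
import org.junit.Test;

public class HelperUsageExample {

    // Stand-in for a project-specific helper; real helpers in the studied projects
    // sometimes span hundreds of lines of fixture construction.
    static class TestHelper {
        static String normalize(String raw) {
            return raw.trim().toLowerCase(Locale.ROOT);
        }
    }

    // Developer-written style: the expectation is routed through the helper, so the
    // test stays consistent with project conventions.
    @Test
    public void developerStyleTest() {
        assertEquals("abc", TestHelper.normalize("  ABC "));
    }

    // LLM-generated style: unaware of TestHelper, the test re-implements the setup
    // inline and can silently diverge from what the helper actually does.
    @Test
    public void generatedStyleTest() {
        String normalized = "  ABC ".trim().toLowerCase(Locale.ROOT);
        assertEquals("abc", normalized);
    }
}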
B. Code Overlap with Bug Report

As Just et al. point out [30], bug reports can already contain partially or fully executable test code, but developers rarely adopt the provided tests as is. To investigate whether LIBRO relies on efficient extraction of report content or effective synthesis of test code, we analyzed the 750 bug reports from Defects4J used in our experiment. We find that 19.3% of them had full code snippets (i.e., code parsable to a class or method), while 39.2% had partial code snippets (i.e., not parsable as-is). However, copying the reported code verbatim does not always reproduce the bug successfully; recall that the Copy&Paste baseline in Figure 2 succeeded only on 36 bugs. Although whether a bug report contains full code snippets or not may affect the success or failure of LIBRO, LIBRO generated correct bug reproducing tests even from incomplete code, or without any code snippets. Thus, we argue that LIBRO is capable of both extracting relevant code elements from bug reports and synthesizing code aligned with a given natural language description.
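For illustration of the distinction above, the invented excerpts below show what a full snippet (parsable as a method) and a partial snippet (code-like tokens only) might look like in a report; neither is taken from an actual Defects4J issue.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class ReportSnippetExample {

    // Full snippet: the report excerpt parses as a complete method, so it can be
    // adopted into a test class almost verbatim.
    @Test
    public void testReportedTrimBehavior() {
        assertEquals("abc", "  abc  ".trim());
    }

    // Partial snippet: a report may instead contain only a fragment such as
    //   value.trim().length()   // "comes back shorter than expected"
    // which contains code-like tokens but is not parsable into a class or method.
}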
VIII. THREATS TO VALIDITY

Internal Validity concerns whether our experiments demonstrate causality. In our case, two sources of randomness threaten internal validity: the flakiness of tests and the randomness of LLM querying. While we do observe a small number of flaky generated tests, their number is significantly smaller (<2%) than the overall number of tests generated, and as such we do not believe their existence significantly affects our conclusions. Meanwhile, we engage with the randomness of the LLM by performing an analysis in RQ2-1.

External Validity concerns whether the results presented would generalize. In this case, it is difficult to tell whether the results we presented here would generalize to other Java projects, or projects in other languages. While the uniqueness of our prompts and our use of GHRB cases provide some evidence that LIBRO is not simply relying on the memorization of the underlying LLM, it is true that LIBRO benefits from the fact that the underlying LLM, Codex, has likely seen the studied Defects4J projects during training. However, our aim is not to assess whether a specific instance of Codex has general intelligence about testing: our aim is to investigate the extent to which LLM architectures augmented with post-processing steps can be applied to the task of bug reproduction. For LIBRO to be used for an arbitrary project with a similar level of efficacy as in our study, we expect the LLM of LIBRO to have seen projects in a similar domain, or the target project itself. This can be achieved by fine-tuning the LLM, as studied in other domains [31], [32] (note that Codex is GPT-3 fine-tuned with source code data). As due diligence, we checked how many tests generated from the Defects4J benchmark verbatim matched developer-committed bug reproducing tests. There were only three such cases, and all had the test code written verbatim in the report as well, suggesting that they got verbatim answers from the report rather than from memorization. We also report a few general conditions under which LIBRO does not perform well: it does not generalize to tests that rely on external files or testing infrastructure whose syntactic structure is significantly different from typical JUnit tests (such as the Closure project in Defects4J).
IX. RELATED WORK

A. Test Generation

Automated test generation has been explored for almost 50 years [2]. The advent of the object-oriented programming paradigm caused a shift in test input generation techniques, moving from primitive value exploration to deriving method sequences to maximize coverage [3], [4]. However, a critical issue with these techniques is the oracle problem [33]: as it is difficult to determine what the correct behavior for a test should be, automated test generation techniques either rely on implicit oracles [3], or accept the current behavior as correct, effectively generating regression tests [4], [34]. Swami [35] generates edge-case tests by analyzing structured specifications using rule-based heuristics; its rigidity causes it to fail when the structure deviates from its assumptions, whereas LIBRO makes no assumptions on bug report structure.
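To make the oracle problem concrete, the invented JUnit sketch below contrasts the two common workarounds with the report-derived expectation that bug reproduction requires; the String#split calls are placeholders chosen only so the example is self-contained.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class OracleExamples {

    // Implicit oracle: the only expectation is that the call does not crash, so
    // wrong-value bugs go unnoticed.
    @Test
    public void implicitOracle() {
        String[] parts = "report-to-test".split("-"); // passes unless an exception is thrown
    }

    // Regression-style oracle: the currently observed output is recorded as the
    // expected value, so existing buggy behavior is baked into the test.
    @Test
    public void regressionOracle() {
        assertEquals(3, "report-to-test".split("-").length);
    }

    // Report-derived oracle, as needed for bug reproduction: the expectation comes
    // from the reporter's description of intended behavior, which may disagree with
    // what the program currently does.
    @Test
    public void reportDerivedOracle() {
        assertEquals(4, "a-b-c-".split("-", -1).length);
    }
}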
Similar to our work, some techniques focus on reproducing problems reported by users: a commonly used implicit oracle is that the program should not crash [33]. Most existing work [5], [6], [36]-[38] aims to reproduce crashes given a stack trace, which is assumed to be provided by a user. Yakusu [39] and ReCDroid [40], on the other hand, require user reports written in specific formats to generate a crash-reproducing test for mobile applications. All the crash-reproducing techniques differ significantly from our work, as they rely on the crash-based implicit oracle and make extensive use of SUT code (i.e., they are white-box techniques). BEE [19] automatically parses bug reports to classify sentences that describe observed or expected behavior, but stops short of actually generating tests. To the best of our knowledge, we are the first to propose a technique to reproduce general bug reports in Java projects.
B. Code Synthesis

Code synthesis also has a long history of research. Traditionally, code synthesis has been approached via SMT solvers in the context of Syntax-Guided Synthesis (SyGuS) [41]. As machine learning techniques improved, they showed good performance on the code synthesis task; Codex demonstrated that an LLM could solve programming tasks based on natural language descriptions [14]. Following Codex, some found that synthesizing tests along with code was useful: AlphaCode used automatically generated tests to boost its code synthesis performance [42], while CodeT jointly generated tests and code from a natural language description [12]. The focus of these techniques is not on test generation; on the other hand, LIBRO processes LLM output to maximize the probability of execution, and focuses on selecting and ranking tests to reduce the developer's cognitive load.
developer’s cognitive load. https://doi.org/10.1145/3338906.3338935
[8] R. Just, D. Jalali, and M. D. Ernst, "Defects4J: A database of existing faults to enable controlled testing studies for Java programs," in Proceedings of the 2014 International Symposium on Software Testing and Analysis, ser. ISSTA 2014. ACM, 2014, pp. 437-440.
[9] Y. Xiong, J. Wang, R. Yan, J. Zhang, S. Han, G. Huang, and L. Zhang, "Precise condition synthesis for program repair," in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), 2017, pp. 416-426.
[10] X. Li, W. Li, Y. Zhang, and L. Zhang, "DeepFL: Integrating multiple fault diagnosis dimensions for deep fault localization," in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2019. Association for Computing Machinery, 2019, pp. 169-180. [Online]. Available: https://doi.org/10.1145/3293882.3330574
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877-1901, 2020.
[12] B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and W. Chen, "CodeT: Code generation with generated tests," arXiv preprint arXiv:2207.10397, 2022.
[13] A. Sarkar, A. Gordon, C. Negreanu, C. Poelitz, S. Srinivasa Ragavan, and B. Zorn, "What is it like to program with artificial intelligence?" arXiv preprint arXiv:2208.06213, 2022.
[14] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[15] T.-D. B. Le, R. J. Oentaryo, and D. Lo, "Information retrieval and spectrum based bug localization: Better together," in Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2015. ACM, 2015, pp. 579-590.
[16] E. Daka and G. Fraser, "A survey on unit testing practices and problems," in 2014 IEEE 25th International Symposium on Software Reliability Engineering, 2014, pp. 201-211.
[17] P. S. Kochhar, X. Xia, and D. Lo, "Practitioners' views on good software testing practices," in Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, ser. ICSE-SEIP '19. IEEE Press, 2019, pp. 61-70. [Online]. Available: https://doi.org/10.1109/ICSE-SEIP.2019.00015
[18] U. Alon, S. Brody, O. Levy, and E. Yahav, "code2seq: Generating sequences from structured representations of code," in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. [Online]. Available: https://openreview.net/forum?id=H1gKYo09tX
[19] Y. Song and O. Chaparro, "BEE: A tool for structuring and analyzing bug reports," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2020. Association for Computing Machinery, 2020, pp. 1551-1555. [Online]. Available: https://doi.org/10.1145/3368089.3417928
[20] M. Soltani, P. Derakhshanfar, X. Devroey, and A. van Deursen, "A benchmark-based evaluation of search-based crash reproduction," Empirical Software Engineering, vol. 25, no. 1, pp. 96-138, Jan 2020. [Online]. Available: https://doi.org/10.1007/s10664-019-09762-1
[21] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," 2022. [Online]. Available: https://arxiv.org/abs/2205.11916
[22] L. Reynolds and K. McDonell, "Prompt programming for large language models: Beyond the few-shot paradigm," in Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1-7.
[23] P. S. Kochhar, X. Xia, D. Lo, and S. Li, "Practitioners' expectations on automated fault localization," in Proceedings of the 25th International Symposium on Software Testing and Analysis, ser. ISSTA 2016. Association for Computing Machinery, 2016, pp. 165-176. [Online]. Available: https://doi.org/10.1145/2931037.2931051
[24] Y. Noller, R. Shariffdeen, X. Gao, and A. Roychoudhury, "Trust enhancement issues in program repair," in Proceedings of the 44th International Conference on Software Engineering, ser. ICSE '22. Association for Computing Machinery, 2022, pp. 2228-2240. [Online]. Available: https://doi.org/10.1145/3510003.3510040
[25] M. Soltani, P. Derakhshanfar, X. Devroey, and A. van Deursen, "A benchmark-based evaluation of search-based crash reproduction," Empirical Software Engineering, vol. 25, no. 1, pp. 96-138, 2020.
[26] C. Thunes, "javalang: Pure Python Java parser and tools," https://github.com/c2nes/javalang, 2022.
[27] R. Premraj, T. Zimmermann, S. Kim, and N. Bettenburg, "Extracting structural information from bug reports," in Proceedings of the 2008 International Workshop on Mining Software Repositories (MSR '08). ACM Press, 2008, pp. 27-30.
[28] M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, "Unit test case generation with transformers and focal context," 2020.
[29] M. Martinez, T. Durieux, R. Sommerard, J. Xuan, and M. Monperrus, "Automatic repair of real bugs in Java: a large-scale experiment on the Defects4J dataset," Empirical Software Engineering, vol. 22, pp. 1936-1964, 2016.
[30] R. Just, C. Parnin, I. Drosos, and M. D. Ernst, "Comparing developer-provided to user-provided tests for fault localization and automated program repair," in Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2018, pp. 287-297.
[31] R. Tinn, H. Cheng, Y. Gu, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, "Fine-tuning large neural language models for biomedical natural language processing," CoRR, vol. abs/2112.07869, 2021.
[32] L. Wang, H. Hu, L. Sha, C. Xu, K. Wong, and D. Jiang, "Finetuning large-scale pre-trained language models for conversational recommendation with knowledge graph," CoRR, vol. abs/2110.07477, 2021.
[33] E. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, "The oracle problem in software testing: A survey," IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507-525, May 2015.
[34] M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, "Unit test case generation with transformers," CoRR, vol. abs/2009.05617, 2020. [Online]. Available: https://arxiv.org/abs/2009.05617
[35] M. Motwani and Y. Brun, "Automatically generating precise oracles from structured natural language specifications," in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019, pp. 188-199.
[36] N. Chen and S. Kim, "STAR: Stack trace based automatic crash reproduction via symbolic execution," IEEE Transactions on Software Engineering, vol. 41, no. 2, pp. 198-220, 2014.
[37] J. Xuan, X. Xie, and M. Monperrus, "Crash reproduction via test case mutation: Let existing test cases help," in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 910-913.
[38] P. Derakhshanfar, X. Devroey, A. Panichella, A. Zaidman, and A. van Deursen, "Botsing, a search-based crash reproduction framework for Java," in 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020, pp. 1278-1282.
[39] M. Fazzini, M. Prammer, M. d'Amorim, and A. Orso, "Automatically translating bug reports into test cases for mobile apps," in Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2018. ACM, 2018, pp. 141-152. [Online]. Available: http://doi.acm.org/10.1145/3213846.3213869
[40] Y. Zhao, T. Yu, T. Su, Y. Liu, W. Zheng, J. Zhang, and W. G. J. Halfond, "ReCDroid: Automatically reproducing Android application crashes from bug reports," in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019, pp. 128-139.
[41] R. Alur, R. Bodík, E. Dallal, D. Fisman, P. Garg, G. Juniwal, H. Kress-Gazit, P. Madhusudan, M. M. K. Martin, M. Raghothaman, S. Saha, S. A. Seshia, R. Singh, A. Solar-Lezama, E. Torlak, and A. Udupa, "Syntax-guided synthesis," in Dependable Software Systems Engineering, ser. NATO Science for Peace and Security Series, D: Information and Communication Security, M. Irlbeck, D. A. Peled, and A. Pretschner, Eds. IOS Press, 2015, vol. 40, pp. 1-25. [Online]. Available: https://doi.org/10.3233/978-1-61499-495-4-1
[42] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago et al., "Competition-level code generation with AlphaCode," arXiv preprint arXiv:2203.07814, 2022.