arXiv:2302.06527v4 [cs.SE] 11 Dec 2023
Abstract—Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including the automated generation of unit tests, but existing approaches require additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation without requiring additional training or manual effort. Concretely, we consider an approach where the LLM is provided with prompts that include the signature and implementation of a function under test, along with usage examples extracted from documentation. Furthermore, if a generated test fails, our approach attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and error message. We implement our approach in TESTPILOT, an adaptive LLM-based test generation tool for JavaScript that automatically generates unit tests for the methods in a given project's API.
We evaluate TESTPILOT using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the-art feedback-directed JavaScript test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage. Furthermore, experiments with excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites. We also find that 92.8% of TESTPILOT's generated tests have ≤ 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TESTPILOT with two additional LLMs, OpenAI's older code-cushman-002 LLM and StarCoder, an LLM for which the training process is publicly documented. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
for software development tasks [34]–[43].

Given the properties of LLMs, it is reasonable to expect that they may be able to generate natural-looking tests. Not only are they likely to produce code that resembles what a human developer would write (including, for example, sensible variable names), but LLMs are also likely to produce tests containing assertions, simply because most tests in their training set do. Thus, by leveraging LLMs, one might hope to simultaneously address the two shortcomings of traditional test-generation techniques. On the other hand, one would perhaps not expect LLMs to produce tests that cover complex edge cases or exercise unusual function inputs, as these will be rare in the training data, making LLMs more suitable for generating regression tests than for bug finding.

There has been some exploratory work on using LLMs for test generation. For example, Bareiß et al. [25] evaluate the performance of Codex for test generation. They follow a few-shot learning paradigm where their prompt includes the function to be tested along with an example of another function and an accompanying test to give the model an idea of what a test should look like. In a limited evaluation on 18 Java methods, they find that this approach compares favorably to feedback-directed test generation [8]. Similarly, Tufano et al.'s ATHENATEST [26] generates tests using a BART transformer model [44] fine-tuned on a training set of functions and their corresponding tests. They evaluate on five Java projects, achieving comparable coverage to EvoSuite [17]. While these are promising early results, these approaches, as well as others [29], [45], [46], rely on a training corpus of functions and their corresponding tests, which is expensive to curate and maintain.

In this paper, we explore the feasibility of automatically generating unit tests using off-the-shelf LLMs, with no additional training and as little pre-processing as possible. Following Reynolds and McDonell [47], we posit that providing the model with input-output examples or performing additional training is not necessary and that careful prompt crafting is sufficient. Specifically, apart from test scaffolding code, our prompts contain (1) the signature of the function under test; (2) its documentation comment, if any; (3) usage examples for the function mined from documentation, if available; (4) its source code. Finally, we consider an adaptive component to our technique: each generated test is executed, and if it fails, the LLM is prompted again with a special prompt including (5) the failing test and the error message it produced, which often allows the model to fix the test and make it pass.

To conduct experiments, we have implemented these techniques in a system called TESTPILOT, an LLM-based test generator for JavaScript. We chose JavaScript as an example of a popular language for which test generation using traditional methods is challenging due to the absence of static type information and its permissive runtime semantics [11]. We evaluate our approach on 25 npm packages from various domains hosted on both GitHub and GitLab, with varying levels of popularity and amounts of available documentation. These packages have a total of 1,684 API functions that we attempt to generate tests for. We investigate the coverage achieved by the generated tests and their quality in terms of success rate, reasons for failure, and whether or not they contain assertions that actually exercise functionality from the target package (non-trivial assertions). We also empirically evaluate the effect of the various components of our prompt-crafting strategy as well as whether TESTPILOT is generating previously memorized tests from the LLM's training data.

Using OpenAI's current most capable and cost-effective model gpt3.5-turbo,1 TESTPILOT's generated tests achieve a median statement coverage of 70.2%, and branch coverage of 52.8%. We find that a median 61.4% of the generated tests contain non-trivial assertions, and that these non-trivial tests alone achieve a median 61.6% coverage, indicating that the generated tests contain meaningful oracles that exercise functionality from the target package. Upon deeper examination, we find that the most common reason for the generated tests to fail is exceeding the two-second timeout we enforce, usually because of a failure to communicate test completion to the testing framework. We find that, on average, the adaptive approach is able to fix 15.6% of failing tests. Our empirical evaluation also shows that all five components included in the prompts are essential for generating meaningful test suites with high coverage. Excluding any of these components results in either a higher proportion of failing tests or in reduced coverage. On the other hand, while excluding usage examples from prompts reduces effectiveness of the approach, it does not render it obsolete, suggesting that the LLM is able to learn from the presence of similar test code in its training set.

Finally, from experiments conducted with the gpt3.5-turbo LLM, we note that high coverage is still achieved on packages whose source code is hosted on GitLab (and thus has not been part of the LLM's training data). Moreover, we find that 60.0% of the tests generated using the gpt3.5-turbo LLM have ≤ 40% similarity to existing tests and 92.8% have ≤ 50% similarity, with none of the tests being exact copies. This suggests that the generated tests are not copied verbatim from the LLM's training set.

In principle, the test generation approach under consideration can be used with any LLM. However, the effectiveness of the approach is likely to depend on the LLM's size and training set. To explore this factor, we further conducted experiments with two additional LLMs: the previous proprietary code-cushman-002 [48] model developed by OpenAI and StarCoder [49], an LLM for which the training process is publicly documented. We observed qualitatively similar results using code-cushman-002 (median coverage of 68.2% for statements, 51.2% for branches), and somewhat worse results using StarCoder (54.0% and 37.5%).

In summary, this paper makes the following contributions:
• A simple test generation technique where unit tests are generated by iteratively querying an LLM with a prompt containing signatures of API functions under test and, optionally, the bodies, documentation, and usage examples associated with such functions. The technique also features an adaptive component that includes in a prompt error messages observed when executing previously generated tests.

1. https://platform.openai.com/docs/models/gpt-3-5
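To make the prompt structure concrete, the following sketch shows one way the five pieces of information listed above could be assembled into a prompt string for the LLM. The shape of the fn object, the comment formatting, and the exact ordering are our assumptions for illustration, not TESTPILOT's actual implementation.

// Illustrative sketch (assumed data shapes and formatting, not TESTPILOT's code):
// assemble a prompt from usage snippets, a failing test plus its error message,
// doc comment, function body, and signature, followed by Mocha scaffolding
// that the model is expected to complete.
function buildPrompt(pkgName, fn, failingTest) {
  const asComments = (text) => text.split('\n').map((line) => `// ${line}`);
  const lines = [
    `let mocha = require('mocha');`,
    `let assert = require('assert');`,
    `let ${pkgName.replace(/-/g, '_')} = require('${pkgName}');`,
  ];
  for (const snippet of fn.usageSnippets || []) lines.push(...asComments(snippet)); // (3)
  if (failingTest) {                                                                // (5)
    lines.push(failingTest.source);
    lines.push(`// the test above fails with the following error: ${failingTest.error}`);
    lines.push('// fixed test');
  }
  if (fn.docComment) lines.push(...asComments(fn.docComment));                      // (2)
  if (fn.body) lines.push(...asComments(fn.body));                                  // (4)
  lines.push(`// ${pkgName}.${fn.signature}`);                                      // (1)
  lines.push(`describe('test ${pkgName.replace(/-/g, '_')}', function() {`);
  return lines.join('\n');
}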
Fig. 2: Overview of the adaptive test generation technique we use in TESTPILOT.
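Read as pseudocode, the pipeline in Figure 2 amounts to a work-list loop over prompts. The sketch below is a schematic rendering of that loop under assumed helper functions (exploreAPI, basePrompt, queryLLM, validate, retryPrompt); it conveys the control flow, not TESTPILOT's actual code.

// Schematic of the adaptive loop in Figure 2 (assumed helpers, not TESTPILOT's code).
async function generateTests(pkgName) {
  const passing = [];
  // one base prompt per API function; refiners add further prompts later
  const worklist = exploreAPI(pkgName).map((fn) => ({ text: basePrompt(fn), isRetry: false }));
  while (worklist.length > 0) {
    const prompt = worklist.pop();
    for (const completion of await queryLLM(prompt.text)) {
      const test = prompt.text + completion;   // candidate test
      const result = await validate(test);     // syntax fix-up + Mocha run
      if (result.status === 'passed') {
        passing.push(test);
      } else if (!prompt.isRetry) {
        // re-prompt once with the failing test and its error message
        worklist.push({ text: retryPrompt(test, result.error), isRetry: true });
      }
    }
  }
  return passing;
}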
(a)
 1 let mocha = require('mocha');
 2 let assert = require('assert');
 3 let countries_and_timezones = require('countries-and-timezones');
 4 // countries-and-timezones.getCountry(id)
 5 describe('test countries_and_timezones', function() {
 6   it('test countries-and-timezones.getCountry', function(done) {
 7     let country = countries_and_timezones.getCountry('US');
 8     assert.equal(country.name, 'United States'); // fails
 9     assert.equal(country.timezones.length, 2);
10     assert.equal(country.timezones[0], 'America/New_York');
11     assert.equal(country.timezones[1], 'America/Chicago');
12     done();
13   })
14 })

(b)
 1 let mocha = require('mocha');
 2 let assert = require('assert');
 3 let countries_and_timezones = require('countries-and-timezones');
 4 // usage #1
 5 // const ct = require('countries-and-timezones');
 6 //
 7 // const country = ct.getCountry('DE');
 8 // console.log(country);
 9 //
10 // /*
11 // Prints:
12 //
13 // {
14 //   id: 'DE',
15 //   name: 'Germany',
16 //   timezones: [ 'Europe/Berlin', 'Europe/Zurich' ]
17 // }
18 //
19 // */
20 // countries-and-timezones.getCountry(id)
21 describe('test countries_and_timezones', function() {
22   it('test countries-and-timezones.getCountry', function(done) {
23     let country = countries_and_timezones.getCountry('DE');
24     assert.equal(country.id, 'DE');
25     assert.equal(country.name, 'Germany');
26     assert.equal(country.timezones[0], 'Europe/Berlin');
27     assert.equal(country.timezones[1], 'Europe/Zurich');
28     done();
29   })
30 })
Fig. 3: Examples of prompts (highlighted) and the completions provided by the LLM, comprising complete tests. Prompt (a)
contains no snippets and the test generated from it fails. Prompt (b) contains one snippet and the generated test passes.
Prompt Generator: This component constructs the initial prompt to send to the LLM for generating a test for a given function f. As mentioned above, we initially have (at most) four pieces of information about f at our disposal: its signature, its definition, its doc comment, and its usage snippets extracted from documentation. While it might seem natural to construct a prompt containing all of this information, in practice it can sometimes happen that more complex prompts lead to worse completions as the LLM gets confused by the additional information. Therefore, we follow a different strategy: we start with a very simple initial prompt that includes no metadata except the function signature, and then let the prompt refiner extend it step by step with additional information.

Test Validator: Next, we send the generated prompts to the LLM and wait for completions. We only consume as many tokens as are needed to form a syntactically valid test. Since there is no guarantee that the completions suggested by the model are syntactically valid, the test validator tries to fix simple syntactic errors such as missing brackets, and then parses the resulting code to check whether it is syntactically valid. If not, the test is immediately marked as failed. Otherwise it is run using the Mocha test runner to determine whether it passes or fails (either due to an assertion error or some other runtime error).
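A rough illustration of this validation step is sketched below: the candidate is checked for syntactic validity and a simple bracket-closing repair is attempted before the test is handed to Mocha. The use of new Function as a syntax check and the repair heuristic are assumptions about how such a validator could look, not TESTPILOT's implementation.

// Minimal sketch of a test validator (assumed design, not TESTPILOT's code).
function isSyntacticallyValid(code) {
  try {
    new Function(code); // compiles without executing; throws SyntaxError if invalid
    return true;
  } catch (e) {
    return false;
  }
}

// Naive repair: append the closing brackets that appear to be missing.
// (Ignores brackets inside strings and comments; a real validator would use a parser.)
function closeUnbalanced(code) {
  const closers = { '(': ')', '{': '}', '[': ']' };
  const stack = [];
  for (const ch of code) {
    if (closers[ch]) stack.push(closers[ch]);
    else if (ch === ')' || ch === '}' || ch === ']') stack.pop();
  }
  return code + stack.reverse().join('');
}

function validateSyntax(candidateTest) {
  if (isSyntacticallyValid(candidateTest)) return candidateTest;
  const repaired = closeUnbalanced(candidateTest);
  return isSyntacticallyValid(repaired) ? repaired : null; // null => mark as failed
}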
Each returned completion can be concatenated with the prompt to yield a candidate test. However, to allow us to eliminate duplicate tests generated from different prompts, we post-process the candidate tests as follows: we strip out the comment containing the function metadata in the prompt and replace the descriptions in the describe and it calls with the generic strings 'test suite' and 'test case', respectively.

Prompt Refiner: The Prompt Refiner applies a number of strategies to generate additional prompts to use for querying the model. Given a prompt p and a test t generated from it, we employ four prompt refiners as follows:
1) FnBodyIncluder: If p did not contain the definition of f, a prompt is created that includes it.
2) DocCommentIncluder: If f has a doc comment but p did not include it, a prompt with the doc comment is created.
3) SnippetIncluder: If usage snippets for f are available but p did not include them, a prompt with snippets is created.
4) RetryWithError: If t failed with error message e, a prompt is constructed that consists of: the text of the failing test t, followed by a comment // the test above fails with the following error: e, followed by a comment // fixed test. This strategy is only applied once per prompt, so it is not attempted if p itself was already generated by this strategy.
The refined prompt is then used to construct a test in the same way as the original prompt. All strategies are applied independently and in all possible combinations, but note that the first three will only apply at most once and the fourth will never apply twice in a row, thus ensuring termination.

2.2 Algorithm Details
We now provide additional detail on the two key steps of our approach: API exploration and test generation.

API Exploration: Algorithm 1 shows pseudocode that illustrates how the set of functions that constitute the API for a package is identified. The algorithm takes a package under test, pkgName, and produces a list of pairs ⟨a, sig⟩ representing its API. Here, a is an access path that uniquely represents an API method, and sig is the signature of a function. Our notion of an access path takes a somewhat simplified form compared to the original concept proposed by Mezzetti et al. [51], and consists of a package name followed by a sequence of property names.

Algorithm 1 Pseudo-code for API exploration.
1: function exploreAPI(pkgName)
2:   modObj ← object created by importing pkgName
3:   seen ← ∅
4:   return explore(pkgName, modObj, seen)
5: function explore(accessPath, obj, seen)
6:   apis ← ∅
7:   if obj ∉ seen then
8:     seen ← seen ∪ { obj }
9:     if obj is a function with signature sig then
10:      apis ← apis ∪ { ⟨accessPath, sig⟩ }
11:    else if obj is an object then
12:      props ← { prop | obj has a property prop }
13:      for prop in props do
14:        apis ← apis ∪
15:          explore(extend(accessPath, prop), obj[prop], seen)
16:    else if obj is an array then
17:      for each index i in the array do
18:        apis ← apis ∪
19:          explore(extend(accessPath, i), obj[i], seen)
20:  return apis
21: function extend(accessPath, component)
22:   if component is numeric then
23:     return accessPath[component]
24:   else
25:     return accessPath.component

We rely on a dynamic approach to explore the API of a package pkgName, by creating a small program that imports the package (line 2) and relying on JavaScript's introspective capabilities to determine which properties are present in the package root object modObj that is created by importing pkgName and what types these properties have. Exploration of modObj's properties is handled by a recursive function explore that begins at the access path representing the package root and that traverses this object recursively, calling another auxiliary function extend to extend the access path as the traversal descends into the object's structure. During exploration, if an object is encountered at access path a whose type is a function with signature sig, then a pair ⟨a, sig⟩ is recorded (line 10). If the encountered object is an object, then the objects referenced by its properties are recursively explored (lines 11–15), and if it is an array, then its elements are explored recursively as well (lines 17–19).

Test Generation: Algorithm 2 shows pseudo-code for the test generation step. The algorithm begins by initializing the set prompts of generated prompts, the set tests of generated passing tests, and the set seen containing all generated tests to the empty set, and by using Algorithm 1 to obtain the set apis of (access path, signature) pairs that constitute the package's API (lines 2–5). Then, on lines 6–7, for each such pair, a base prompt is constructed and added to prompts, containing only the access path and signature, using the template illustrated in Figure 1. Next, lines 9–27 create additional prompts by adding the function body, example usage snippets, and documentation comments extracted from the code to previously generated prompts. Here, the refine function extends a previously generated prompt by adding the function body, example snippets, or doc comment. The order in which each type of information, if included, appears in prompts is fixed as follows: example snippets, error message from the previously generated test, doc comments, function body, signature.

The while loop on lines 29–44 describes an iterative process for generating tests that continues as long as prompts remain that have not been processed. In each iteration, a prompt is selected and removed from prompts, and the LLM is queried for completions (line 31). For each completion that was received, a test is constructed by concatenating the prompt and the completion (line 33), and minor syntactic problems are fixed, such as adding missing '}' characters at the end of the test (line 34). Moreover, we remove comments from the test to enable deduplication of tests that only differ in their comments (line 35).
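The comment stripping and description canonicalization used for deduplication could look roughly as sketched below; the regular expressions are simplifications (for example, they do not handle string literals containing // or /*), and the helper is our own illustration rather than TESTPILOT's post-processing code.

// Sketch: normalize a candidate test before deduplication (assumed implementation).
function dedupKey(testSource) {
  return testSource
    .replace(/\/\*[\s\S]*?\*\//g, '')                                // drop block comments
    .replace(/\/\/[^\n]*/g, '')                                      // drop line comments
    .replace(/describe\(\s*(['"`]).*?\1/g, "describe('test suite'")  // generic suite description
    .replace(/it\(\s*(['"`]).*?\1/g, "it('test case'")               // generic test description
    .replace(/\s+/g, ' ')                                            // collapse whitespace
    .trim();
}

// Tests that differ only in comments or in describe/it descriptions map to the same key.
const seen = new Set();
function isDuplicate(testSource) {
  const key = dedupKey(testSource);
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}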
If the resulting test is syntactically valid and the same test was not encountered previously, it is executed (line 38). Otherwise, we do not re-execute it but still link the prompt to the previously seen test. If the test executed successfully, we add it to tests (line 40). If it failed (due to an assertion failure, nontermination, or because of an uncaught exception), and if the test was not derived from a prompt that was constructed from a previous failing test (line 42), then we create a new prompt containing the failed test and the error message and add it to prompts.

When the iterative process concludes, the set tests is returned (line 45).

2.3 Examples
To make the discussion more concrete, we will now show two examples of how TESTPILOT generates tests.

As the first example, we consider the npm package countries-and-timezones.2 API exploration reveals that this package exports a function getCountry with a single parameter id, and the project's README.md file provides a usage example.

Figure 3(a) shows a test for this function generated from the initial highlighted prompt that only includes the function signature, but no other metadata. This test fails when execution reaches the assertion on line 8 because the expression country.name evaluates to "United States of America", which differs from the value "United States" expected by the assertion.

Next, we refine this prompt to include the usage snippet as shown in the highlighted part of Figure 3(b). This enables the LLM to generate a test incorporating the information provided in this snippet, which passes when executed.

We show another example in Figure 4 from quill-delta,3 a package for representing and manipulating changes to documents. As before, Figure 4(a) shows the initial prompt for quill-delta's concat method, which concatenates two change sets, and a test that was generated from this prompt. It is noteworthy that the LLM was able to generate a syntactically correct test for quill-delta, where arguments such as
[{ insert: 'Hello' },
 { insert: ' ', attributes: { bold: true } },
 { insert: 'World!' }]
are passed to the constructor even in the absence of any usage examples. Most likely, this is because quill-delta is a popular package with more than 1.2 million weekly downloads, which means that the LLM is likely to have seen examples of its use in its training set.
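For readers unfamiliar with quill-delta, the short sketch below (our own illustration, assuming the package's documented Delta API with its constructor, concat, and ops) reproduces the merging behavior discussed next: concatenating two change sets merges adjacent inserts that carry identical attributes.

// Illustration (not from the paper): quill-delta merges adjacent inserts with equal attributes.
const Delta = require('quill-delta');

const ops = [
  { insert: 'Hello' },
  { insert: ' ', attributes: { bold: true } },
  { insert: 'World!' },
];
const delta1 = new Delta(ops);
const delta2 = new Delta(ops);
const delta3 = delta1.concat(delta2);

// The trailing 'World!' of delta1 and the leading 'Hello' of delta2 carry no
// attributes, so concat merges them into a single 'World!Hello' insert,
// leaving 5 ops rather than 6.
console.log(delta3.ops.length); // 5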
Fig. 4: Example illustrating how a prompt is refined in response to the failure of a previously generated test. Prompt (a)
contains no information except the method signature, and the test generated from it fails. Prompt (b) adds information
about the test failure, and the generated test passes.
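Because only the caption of Figure 4 survives in this extraction, the sketch below reconstructs what a RetryWithError prompt looks like. For illustration it is applied to the failing test of Figure 3(a); the exact error text is an assumption.

// Reconstruction of a RetryWithError prompt (illustrative; the error text is assumed):
let mocha = require('mocha');
let assert = require('assert');
let countries_and_timezones = require('countries-and-timezones');
describe('test countries_and_timezones', function() {
  it('test countries-and-timezones.getCountry', function(done) {
    let country = countries_and_timezones.getCountry('US');
    assert.equal(country.name, 'United States'); // fails
    assert.equal(country.timezones.length, 2);
    done();
  })
})
// the test above fails with the following error: AssertionError: 'United States of America' == 'United States'
// fixed test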
Nevertheless, the test in Figure 4(a) fails because when reaching the assertion on line 16 delta3.ops.length has the value 5, whereas the assertion expects the value 6. The reason for the assertion's failure is the fact that the concat method merges adjacent elements if they have the same attributes. Therefore, when execution reaches line 16, the array delta3.ops will hold the following value:
[
  { insert: 'Hello' },
  { insert: ' ', attributes: { bold: true } },
  { insert: 'World!Hello' },
  { insert: ' ', attributes: { bold: true } },
  { insert: 'World!' }
]
and therefore delta3.ops.length will have the value 5.

In response to this failure, the Prompt Refiner will create the prompt shown in Figure 4(b) from which a passing test is generated. In this test, the expected value in the assertion has been updated to 5, as per the assertion error message.

Note that all these tests look quite natural and similar to tests that a human developer might write, and they exercise typical usage scenarios (rather than edge cases) of the functions under test.

3 RESEARCH QUESTIONS & EVALUATION SETUP

3.1 Research Questions
Our evaluation aims to answer the following research questions.

RQ1 How much statement coverage and branch coverage do tests generated by TESTPILOT achieve? Ideally, the generated tests would achieve high coverage to ensure that most of the API's functionality is exercised. Given that our goal is to generate complete unit test suites (as opposed to bug finding), we measure statement coverage for passing tests only. We report coverage on both the package level and function level.
RQ2 How does TESTPILOT's coverage compare to Nessie [11]? We compare TESTPILOT's coverage to the state-of-the-art JavaScript test generator, Nessie, which uses a feedback-directed approach.
RQ3 How many of TESTPILOT's generated tests contain non-trivial assertions? A test with no assertions or with trivial assertions such as assert.equal(true, true) may still achieve high coverage. However, such tests do not provide useful oracles. We examine the generated tests and measure the prevalence of non-trivial assertions.
RQ4 What are the characteristics of TESTPILOT's failing tests? We investigate the reasons behind any failing generated test.
RQ5 How does each of the different types of information included in prompts contribute to the effectiveness of TESTPILOT's generated tests? To investigate if all the information included in prompts through the refiners is necessary to generate effective tests, we disable each refiner and report how it affects the results.
TABLE 1: Overview of npm packages used for evaluation, ordered by descending popularity in terms of downloads/wk.
The top 10 packages correspond to the Nessie benchmark, the next 10 are additional GitHub-hosted packages we include,
while the last 5 are GitLab-hosted packages.
Package | Domain | LOC | Existing Tests | Weekly Downloads | API functions: Total # | # (%) w/ examples | # (%) w/ comment | Total Examples
glob file system 314 22 103M 21 2 (9.5%) 0 (0.0%) 4
fs-extra file system 822 417 79M 172 23 (13.4%) 0 (0.0%) 27
graceful-fs file system 208 11 48M 137 1 (0.7%) 0 (0.0%) 1
jsonfile file system 46 43 48M 4 4 (100.0%) 0 (0.0%) 14
bluebird promises 3.1K 238 26M 115 59 (51.3%) 0 (0.0%) 248
q promises 736 214 14M 98 29 (29.6%) 15 (15.3%) 64
rsvp promises 565 171 8.6M 29 11 (37.9%) 16 (55.2%) 15
memfs file system 2.2K 265 13M 376 21 (5.6%) 7 (1.9%) 26
node-dir file system 244 55 6M 6 6 (100.0%) 5 (83.3%) 8
zip-a-folder file system 25 5 95K 3 2 (66.7%) 0 (0.0%) 2
js-sdsl data structures 1.5K 88 9.7M 133 3 (2.3%) 0 (0.0%) 1
quill-delta document changes 395 180 1.6M 36 17 (47.2%) 0 (0.0%) 17
complex.js numbers/arithmetic 393 21 497K 52 7 (13.5%) 52 (100.0%) 5
pull-stream streams 308 31 78K 24 7 (29.2%) 0 (0.0%) 7
countries-and-timezones date & timezones 78 31 115K 7 7 (100.0%) 0 (0.0%) 7
simple-statistics statistics 917 307 103K 89 3 (3.4%) 88 (98.9%) 3
plural text processing 53 14 18K 4 3 (75.0%) 0 (0.0%) 3
dirty key-value store 89 24 9.7K 27 5 (18.5%) 0 (0.0%) 2
geo-point geographical coordinates 76 10 1.1K 19 10 (52.6%) 0 (0.0%) 11
uneval serialization 31 3 417 1 1 (100.0%) 0 (0.0%) 1
image-downloader image handling 32 12 23K 1 1 (100.0%) 0 (0.0%) 3
crawler-url-parser URL parser 100 185 5 3 3 (100.0%) 0 (0.0%) 4
gitlab-js API wrapper 205 14 184 37 4 (10.8%) 2 (5.4%) 7
core access control 136 16 1 20 6 (30.0%) 0 (0.0%) 2
omnitool utility library 1.6K 420 1 270 15 (5.6%) 80 (29.6%) 9
RQ6 Are TESTPILOT's generated tests copied from existing tests? Since gpt3.5-turbo is trained on GitHub code, it is likely that the LLM has already seen the tests of our evaluation packages before and may simply be producing copies of tests it "memorized". We investigate the similarity between the generated tests and any existing tests in our evaluation packages.
RQ7 How much does the coverage of TESTPILOT's generated tests rely on the underlying LLM? To understand the generalizability of an LLM-based test generation approach and the effect of the underlying LLM TESTPILOT relies on, we compare the coverage we obtain using gpt3.5-turbo with two other LLMs: (1) OpenAI's code-cushman-002 model [48], one of gpt3.5-turbo's predecessors, which is part of the Codex suite of LLMs [52] and which served as the main model behind the first release of GitHub Copilot [38], and (2) StarCoder [49], a publicly available LLM for which the training process is fully documented.

3.2 Evaluation Setup
To answer the above research questions, we use a benchmark of 25 npm packages. Table 1 shows the size and number of downloads (popularity) of each of these packages. The first 10 packages shown in the table are the same GitHub-hosted packages used for evaluating Nessie [11], a recent feedback-directed test-generation technique for JavaScript. However, we notice that these 10 packages primarily focus on popular I/O-related libraries with a callback-heavy style, so we add 10 new packages from different domains (e.g., document processing and data structures), programming styles (primarily object-oriented), as well as less popular packages. Since gpt3.5-turbo (as well as the other LLMs we experiment with in RQ7) was trained on GitHub repositories, we have to assume that all our subject packages (and in particular their tests) were part of the model's training set. For this reason, we also include an additional 5 packages whose source code is hosted on GitLab.4

Table 1 shows that the 25 packages vary in terms of popularity (downloads/week) and size (LOC), as well as in terms of the number of API functions they offer and the extent of the available documentation. The "API functions" columns show the number of available API functions; the number and proportion of API functions that have at least one example code snippet in the documentation ("w/ examples"); and the number and proportion of API functions that have a documentation comment ("w/ comment"). We also show the total number of example snippets available in the documentation of each package.

To answer RQ1–RQ6, we run TESTPILOT using the gpt3.5-turbo LLM (version gpt-3.5-turbo-0301), sampling five completions of up to 100 tokens at temperature zero,5 with all other options at their default values. In RQ7, we use the same settings for code-cushman-002 and StarCoder, except that the sampling temperature for the latter is 0.01 since it does not support a temperature of zero.

Note that LLM-based test generation does not have a test-generation budget per se since it is not an infinite process. Instead, we ask the LLM for at most five completions for every prompt (but the model may return fewer). We deduplicate the returned tests to avoid inflating the number of generated tests. For example, if two prompts return the same test (modulo comments), we only record this test once but keep track of which prompt(s) resulted in its generation.

4. We checked similarly-named repos to ensure that they are not mirrored on GitHub.
5. Intuitively speaking, the sampling temperature controls the randomness of the generated completions, with lower temperatures meaning less non-determinism. Language models encode their input and output using a vocabulary of tokens, with commonly occurring sequences of characters (such as require, but also contiguous runs of space characters) represented by a single token.
TABLE 2: Statement and branch coverage for TESTPILOT's passing tests, generated using gpt3.5-turbo. We also show passing tests that uniquely cover a statement. The last two columns show Nessie's statement and branch coverage for each package. Note that Nessie generates 1000 tests per package and the reported coverage is for all generated tests.
While we set the sampling temperature as low as possible, there is still some nondeterminism in the received responses. Accordingly, we run all experiments 10 times. All the per-package data points reported in Section 4 are median values over these 10 runs, except for integer-typed data such as the number of tests, where we use the ceiling of the median value. For RQ6, without loss of generality, we present the similarity numbers based on the first run only.

We use Istanbul/nyc [53] to measure statement and branch coverage and use Mocha's default time limit of 2s per test.

4 EVALUATION RESULTS

4.1 RQ1: TESTPILOT's Coverage
Table 2 shows the number of tests TESTPILOT generates for each package, the number (and proportion) of passing tests, and the corresponding coverage achieved by the passing tests. The first two columns of Table 2 also show the coverage obtained by simply loading the package (loading coverage). This is the coverage we get "for free" without having any test suite, which we provide as a point of reference for interpreting our results. Overall, 9.9%–80.0% of the tests generated by TESTPILOT are passing tests, with a median of 48.0% across all packages. We now discuss the different coverage measurements of these passing tests.

Statement Coverage: The statement coverage per package achieved by the passing tests ranges between 33.9% and 93.1%, with a median of 70.2%. We note that across all packages, the achieved statement coverage is much higher than the loading coverage, with a difference of 19.1%–88.2% and a median difference of 53.7%.6

The lowest statement coverage TESTPILOT achieves is on js-sdsl, at 33.9%. Upon further investigation of this package, we find that it maintains the documentation examples that appear on its website as markdown files in a separate repository.7 Including the extracted example snippets from this external repository increases the achieved coverage to 43.6%, which suggests the importance of including usage examples in the prompts. We examine the effect of the information included in prompts in detail in RQ5 (Section 4.5).

It is worth noting that TESTPILOT's coverage for the GitLab projects listed in the bottom 5 rows of Table 2 ranges from 51.4% to 78.3%. This demonstrates that TESTPILOT is effective at generating high-coverage unit tests for packages it has not seen in its training set.

Branch Coverage: We also show the branch coverage achieved by the passing tests in Table 2. We find that the branch coverage per package is between 16.5% and 71.3%, with a median of 52.8%. Similar to statement coverage, the achieved branch coverage is also much higher than the loading coverage, with a difference of 15.9%–71.3% and a median difference of 50.0%.

Since achieving branch coverage is generally harder than achieving statement coverage, it is expected that the branch coverage for the generated tests is lower than the statement coverage. However, we note an interesting case in gitlab-js where this difference seems more pronounced (51.7% vs. 16.5%). Upon further investigation of its source code and documentation, we find that gitlab-js offers various configuration options and parameters to specify the GitLab repository to connect to and use/query (e.g., its URL, authentication token, and search parameters to use for a query). The processing of these options is reflected in the main branching logic in the code. While TESTPILOT does attempt to generate reasonable tests that call different endpoints with different options, it sometimes struggles to find the correct function call to use, resulting in type errors. In general, a large proportion of the tests TESTPILOT generates for this package fail, and thus do not contribute to our coverage numbers. It is also worth noting that properly testing such a package would require mocking, but we did not observe any of the generated tests attempting to do so.

6. For some of the projects we share with Nessie, our loading coverage differs from the one reported in their paper. We contacted the authors, who confirmed that with recent versions of Istanbul/nyc they obtained the same numbers as we did, except for a very small difference on memfs (29.1% vs 29.3%), which may be due to platform differences.
7. https://github.com/js-sdsl/js-sdsl.github.io/tree/main/start
Fig. 5: Distribution of statement coverage per function for TESTPILOT's generated tests using gpt3.5-turbo.

Function-Level Coverage: Figure 5 shows the distribution of statement coverage per function: each box represents a package, and each data point in a box represents the statement coverage for a function in that package. The median statement coverage per function for each package is shown in red. Overall, the median statement coverage per function for a given project ranges from 0.0%–100.0%, with a median of 77.1%. To ensure that TESTPILOT is not generating high coverage tests only for smaller functions, we run a Pearson's correlation test between the statement coverage per function and the corresponding function size (in statements). We find no statistically significant correlation between coverage and size, indicating that TESTPILOT is not only doing well for smaller functions.8

As expected, Figure 5 shows that for most packages, TESTPILOT does well for some functions while achieving low coverage for others. Let us take jsonfile as an example. In Table 2, we saw that its statement coverage at the package level is 38.3%. From Figure 5, we see that statement coverage per function ranges from 0% to 100%, with a median of almost 50%. Diving into the data, we find that there are two functions that TESTPILOT cannot cover, because their corresponding generated tests fail either due to references to non-existent files TESTPILOT includes in the tests or because they time out. However, the functions that TESTPILOT is able to cover have statement coverage ranging from 58%-100%. We can observe similar behavior with other file system dependent packages, such as graceful-fs or fs-extra. At the other end of the spectrum, we see zip-a-folder, where TESTPILOT achieves both high statement coverage at the package level (84%) as well as high statement coverage at the function level in Figure 5, where the minimum per function coverage is 75%.

Uniquely Contributing Tests: To further understand the diversity of the generated tests, Table 2 also shows how many of the tests TESTPILOT generates are uniquely contributing, meaning that they cover at least one statement that no other tests cover. A median of 10.5% of the passing tests are of this kind, with some packages as high as 100.0%. These results are promising because they show that TESTPILOT can generate tests that cover edge cases, but there is clearly some redundancy among the generated tests. Of course, we cannot simply exclude all 89.5% remaining tests without losing coverage, since some statements may be covered by multiple tests non-uniquely. Exploring test suite minimization techniques [54] to reduce the size of the generated test suite is an interesting avenue for future work.

4.2 RQ2: TESTPILOT vs. Nessie
We compare TESTPILOT's coverage to the state-of-the-art JavaScript test generator Nessie [11], which uses a traditional feedback-directed approach.9 For each package, Nessie generates 1000 tests, for which we measure statement and branch coverage in the same way as for TESTPILOT. We then repeat these measurements 10 times and take the median coverage across the 10 runs to follow a similar setup to TESTPILOT's evaluation. We use a Wilcoxon paired rank-sum test to determine if there are statistically significant differences between the coverage achieved by both tools.

The last two columns of Table 2 show statement and branch coverage for Nessie. We note that Nessie could not run on uneval, because the module's only export is a function, which Nessie does not support. For the remaining 24 packages, Nessie achieved 4.7%–96.0% statement coverage, with a median of 51.3%. In contrast, as shown in Table 2, TESTPILOT's median statement coverage is much higher at 70.2%. The difference in branch coverage is even higher, with 52.8% for TESTPILOT vs 25.6% for Nessie. Both these differences are statistically significant (p-values 0.002 and 0.027 respectively) with a large effect size, measured by Cliff's delta [55], of 0.493 for statement coverage and a medium one (0.431) for branch coverage.10 Note that Nessie always generates 1000 tests per package, while TESTPILOT usually generates far fewer tests, except on memfs and omnitool. It is also worth emphasizing that Nessie (and other test-generation techniques such as LambdaTester [56]) report coverage of all generated tests, regardless of whether they pass or fail, while our reported coverage numbers are for passing tests only.

We now dive into the results at the package level. For each package, Table 2 highlights the higher coverage from the two techniques in bold. TESTPILOT outperforms Nessie on 17 of the 24 packages (glob, fs-extra, bluebird, q, rsvp, memfs, js-sdsl, quill-delta, complex.js, pull-stream, simple-statistics, plural, dirty, geo-point, image-downloader, core, omnitool), increasing coverage by 3.6%–74.5%, with a median 30.0% increase.

8. Exact correlation coefficients and p-values are provided in our artifact.
9. Note that Nessie's implementation has been refactored and improved after the publication of the original paper, which is why some of the values in this table differ slightly from the published numbers. Nessie's first author has kindly helped us run the improved version (specifically, https://github.com/emarteca/nessie/tree/86e48f1d038d98dcd2663d6d36a58a70c4666b1b) on all 25 packages. We include the Nessie results in our artifact.
10. All effect sizes for all statistical tests are available in our artifact.
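For reference, Cliff's delta for two samples x and y is (#{(i,j): x_i > y_j} - #{(i,j): x_i < y_j}) / (|x|*|y|); a direct JavaScript rendering is sketched below (our own helper for illustration, not part of TESTPILOT or its artifact).

// Cliff's delta effect size for two samples (illustrative helper).
function cliffsDelta(xs, ys) {
  let greater = 0;
  let less = 0;
  for (const x of xs) {
    for (const y of ys) {
      if (x > y) greater++;
      else if (x < y) less++;
    }
  }
  return (greater - less) / (xs.length * ys.length);
}

// Example with hypothetical per-package coverage values for two tools:
console.log(cliffsDelta([0.7, 0.8, 0.6], [0.5, 0.55, 0.4])); // 1 (every pair favors the first sample)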
Fig. 6: Example of a test generated by Nessie. Highlighted lines are for debugging purposes only and do not contribute to
the testing of the package under test.
For 7 of the remaining packages (graceful-fs, jsonfile, node-dir, zip-a-folder, countries-and-timezones, crawler-url-parser, gitlab-js), TESTPILOT achieves lower coverage than Nessie. For these packages, it reduces coverage by 0.5%–53.2%, with a median 3.6% decrease. We also note that Nessie fails to achieve any branch coverage on 3 projects (dirty, geo-point, core), while the statement coverage for these projects is non-zero. Upon further examination, and after consulting the Nessie authors, we found that Nessie cannot generate tests that instantiate classes, meaning that statement coverage is barely above loading coverage for packages with a class-based API, while the branch coverage is zero.

Aside from the difference in coverage achieved by Nessie and TESTPILOT, tests generated by Nessie tend to look quite different from the ones generated by TESTPILOT, which stems from Nessie's random approach to test generation. To illustrate this, Figure 6 shows an example of a test generated by Nessie that exercises the getCountry function of countries-and-timezones. As can be seen in the figure, the test uses long variable names such as ret_val_manuelmhtr_countries_and_timezones_1 that hamper readability. Moreover, the test invokes getCountry on lines 13–17 with an object literal that binds random values to some randomly named properties, which does not reflect intended use of the API. Moreover, tests generated by Nessie do not contain any assertions. By contrast, tests generated by TESTPILOT for the same package (see Figure 3) typically use variable names that are similar to those chosen by programmers, invoke APIs with sensible values, and often contain assertions.

4.3 RQ3: Non-trivial Assertions
We define a non-trivial assertion as an assertion that depends on at least one function from the package under test. To identify non-trivial assertions, we first use CodeQL [57] to compute a backwards program slice from each assertion in the generated tests. We consider assertions whose backwards slice contains an import of the package under test as non-trivial assertions. We then report generated tests that contain at least one non-trivial assertion.

Table 3 shows the number of tests with non-trivial assertions (non-trivial tests for short) and their proportion w.r.t. all generated tests from Table 2. The table also shows the number and proportion of these tests that pass, along with the statement coverage they achieve.

We observe that there is only one package, image-downloader, where TESTPILOT generates only trivial tests. While the generated tests for image-downloader did include calls to its API, they were all missing assert statements. Across the remaining packages, between 9.1% and 94.6% of TESTPILOT's generated tests per package are non-trivial, and a median of 61.4% of the generated tests for a given package are non-trivial. When compared to all generated tests, we can also see that only a slightly lower proportion of non-trivial tests pass (median 48.0% for overall passing tests from Table 2 vs. 43.7% for non-trivial passing tests from Table 3). Both these results show that TESTPILOT typically generates tests with assertions that exercise functionality from the target package.

The coverage achieved by the non-trivial tests also supports this finding. Specifically, when comparing the statement coverage for all the generated tests in Table 2 to that for non-trivial tests in Table 3, we find that the difference ranges from 0.0%–84.0%, with a median difference of only 7.5%. This means that the achieved coverage for most packages mainly comes from exercising API functionality that is tested by the generated oracles. We note however that there are 4 packages (jsonfile, node-dir, zip-a-folder, image-downloader) where non-trivial tests achieve 0% statement coverage, causing the larger differences. Apart from image-downloader discussed above, the three remaining packages do not have any passing non-trivial tests. Since we calculate coverage for passing tests only, this results in the 0% statement coverage for the non-trivial tests.

4.4 RQ4: Characteristics of Failing Tests
Figure 7 shows the number of failing tests for each package, along with the breakdown of the reasons behind the failure.
TABLE 3: Number (%) of non-trivial TESTPILOT tests generated using gpt3.5-turbo and the resulting statement coverage from the passing non-trivial tests.
Package | Non-trivial tests # (%) | Passing non-trivial tests # (%) | Stmt Cov
glob 37 (54.4%) 3 (8.1%) 50.1%
fs-extra 142 (30.1%) 70 (49.5%) 28.0%
graceful-fs 64 (18.4%) 27 (42.5%) 41.5%
jsonfile 4 (32.0%) 0 (0.0%) 0.0%
bluebird 227 (61.4%) 137 (60.4%) 61.6%
q 235 (72.6%) 136 (58.0%) 66.4%
rsvp 68 (62.4%) 48 (70.6%) 67.6%
memfs 758 (73.1%) 356 (47.0%) 77.4%
node-dir 7 (16.5%) 0 (0.0%) 0.0%
zip-a-folder 1 (9.1%) 0 (0.0%) 0.0%
js-sdsl 349 (85.3%) 44 (12.6%) 33.9%
quill-delta 92 (60.5%) 27 (28.8%) 59.7%
complex.js 190 (90.9%) 104 (54.6%) 62.7%
pull-stream 60 (72.3%) 29 (47.5%) 64.7%
countries-and-timezones 22 (78.6%) 7 (31.8%) 73.5%
simple-statistics 189 (53.6%) 115 (60.6%) 46.9%
plural 12 (92.3%) 8 (66.7%) 73.8%
dirty 29 (41.7%) 13 (44.8%) 66.0%
geo-point 60 (78.9%) 34 (56.7%) 64.6%
uneval 4 (57.1%) 2 (50.0%) 68.8%
image-downloader 0 (0.0%) – 0.0%
crawler-url-parser 6 (42.9%) 1 (16.7%) 49.5%
gitlab-js 104 (73.8%) 12 (11.5%) 49.3%
core 64 (74.7%) 12 (18.9%) 75.5%
omnitool 977 (94.6%) 319 (32.6%) 73.8%
Median 61.4% 43.7% 61.6%

Fig. 8: Effect of disabling prompt refiners in TESTPILOT, using gpt3.5-turbo. Full considers all refiners while Base includes only the function signature and Mocha scaffolding.

Fig. 7: Types of errors in the failed tests generated by TESTPILOT, using gpt3.5-turbo.

Assertion errors occur when the expected value in an assertion does not match the actual value from executing the code. File-system errors include errors such as files or directories not being found, which we identify by checking for file-system related error codes [58] in the error stack trace. Correctness errors include all type errors, syntax errors, reference errors, incorrect invocations of done, and infinite recursion/call stack errors. Timeout errors occur when tests exceed the maximum running time we allow them (2s/test). Finally, we group all other application-specific errors we observe under Other.

We find that the most common failure reason is timeouts, with a median 22.7% of failing tests, followed by correctness errors (particularly type errors) with a median of 20.0% of failing tests. The majority of timeouts are due to missing calls to done, causing Mocha to keep waiting for the call. We note that on average, the RetryWithError refiner was able to fix 15.4% of such timeout errors, with the model often simply adding a call to done.11

We find that a median 19.2% of failures are assertion errors, indicating that in some cases gpt3.5-turbo is not able to figure out the correct expected value for the test oracle. This is especially true when the package under test is not widely used and none of the information we provide the model can help it in figuring out the correct values. For example, in one of the tests for geo-point, TESTPILOT was able to use coordinates in the provided example snippet to correctly construct two geographical coordinates as input for the calculateDistance function, which computes the distance between the two coordinates. However, TESTPILOT incorrectly generated 131.4158102876726 as the expected value for the distance between these two points, while the correct expected value is 130584.05017990958; this caused the test to fail with an assertion error. We note that in this specific case, when TESTPILOT re-prompted the model with the failing test and error message, it was then able to produce a passing test with the corrected oracle. On average across the packages, we find that the RetryWithError refiner was able to fix 11.1% of assertion errors.

Finally, we note that file-system errors are domain specific. The generated tests for packages in the file system domain, such as fs-extra or memfs, have a high proportion of failing tests due to such errors. This is not surprising given that these tests may rely on files that are non-existent or that are required to contain specific content. Packages in the other domains do not face this problem.

Overall, we find that re-prompting the model with the error message of failing tests (regardless of the failure reason) allows TESTPILOT to produce a consequent passing test in 15.6% of the cases.

4.5 RQ5: Effect of Prompt Refiners
Our results so far include tests generated with all four prompt refiners discussed in Section 2. In this RQ, we investigate the effect of each of these refiners on the quality of the generated tests. Specifically, we conduct an ablation study where we disable one refiner at a time.

11. While the insertion of missing calls to done may seem straightforward and therefore be amenable to automated repair, it can be surprisingly tricky to find the correct locations where to insert such calls, and handling this correctly would require applying static analysis to the generated test. We therefore opted for an automated approach that relies solely on the LLM but will consider the use of static analysis to repair generated tests as future work.
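As a minimal illustration of the missing-done timeouts discussed above (our own example, not one of the generated tests), a Mocha test that declares the done callback but never invokes it times out, and the fix is simply to call done once the assertions have run:

// Illustrative example of the missing-done failure mode (not a generated test).
const assert = require('assert');
const ct = require('countries-and-timezones');

// Times out: Mocha waits (by default 2s) for done() to be called.
it('test countries-and-timezones.getCountry (times out)', function (done) {
  assert.equal(ct.getCountry('DE').name, 'Germany');
  // done() is never called
});

// Passes: the same test with the missing call to done added.
it('test countries-and-timezones.getCountry (fixed)', function (done) {
  assert.equal(ct.getCountry('DE').name, 'Germany');
  done();
});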
13
Fig. 9: Example of a refinement negatively influencing test generation. Prompt (a) contains no information except the
method signature, and the generated test passes. Prompt (b) adds the body of the method, but the generated test fails.
Disabling a refiner means that we no longer generate prompts that include the information it provides. For example, disabling DocCommentIncluder means that none of the prompts we generate would contain documentation comments. We then compare the percentage of passing tests, the achieved coverage, as well as the coverage by non-trivial tests (non-trivial coverage).

Figure 8 shows our results. The x-axis shows the metrics we compare across the different configurations shown in the legend. The y-axis shows the values for each metric (all percentages). Each data point in a boxplot represents the results of the specific metric for a given package, using the corresponding refiner configuration. The black line in the middle of each box represents the median value for each metric across all packages. The full configuration is the configuration we presented so far (i.e., all refiners enabled). The other configurations show the results of excluding only one of the refiners. For example, the red box plot shows the results when disabling the SnippetIncluder (i.e., Without Example Snippets). The base prompt configuration contains only the function signature and test scaffolding (i.e., disabling all refiners). Note, however, that only 8 of the packages in our evaluation contain documentation comments. It does not make sense to compare the effect of disabling the DocCommentIncluder on packages that do not contain doc comments in the first place. Therefore, while the distributions shown in all boxplots represent 25 packages, the Without Doc Comments configuration contains data for only 8 packages.

Overall, we can see that the full configuration outperforms all other configurations, across all three metrics, implying that all the prompt information we include contributes to generating more effective tests. We find that there was not a single package where disabling a refiner led to better results on any metric. With the exception of 4 packages where disabling one of the refiners did not affect the results (SnippetIncluder on crawler-url-parser and dirty; and RetryWithError on gitlab-js and zip-a-folder), disabling a refiner always resulted in lower values in at least one metric.

The contributions of the refiners are especially notable for the percentage of passing tests, where disabling any of the refiners (e.g., FnBodyIncluder or SnippetIncluder) results in a large drop in the percentage of passing tests. This suggests that the refiners are effective in guiding the model towards generating more passing tests, even if this does not necessarily result in additional coverage. We find that across all packages, a full configuration always leads to a higher percentage of passing tests for a given API, while maintaining high coverage.

To understand if the differences between the distributions we observe in Figure 8 are statistically significant, we compare the results of each pair of configurations for all three metrics using Wilcoxon matched-pairs signed-rank tests. Note that when comparing against DocCommentIncluder, we compare distributions for only the 8 packages that contain doc comments.

We find a statistically significant difference between the full configuration and each configuration that disables any refiner, as well as between the base configuration and each of the other configurations. Compared to the full configuration, the largest effect size we observed for disabling a refiner was on passing tests when either FnBodyIncluder or DocCommentIncluder were disabled (Cliff's delta 0.582 and 0.531, respectively).

Apart from differences with the full and base configuration, we find no statistically significant differences between the pairs of other configurations except for the following cases: We find that for both passing tests and coverage, there is a statistically significant difference between the configuration that disables FnBodyIncluder and that which disables RetryWithError (medium and negligible effect sizes, respectively). For passing tests, we also find a statistically significant difference between disabling FnBodyIncluder and disabling each of SnippetIncluder and DocCommentIncluder (small and medium effect sizes, respectively). However, we note that a sample size of 8 is too small to draw any valid conclusions for DocCommentIncluder. It is particularly interesting to see that there was no statistically significant difference between disabling SnippetIncluder and disabling any of the other refiners. This suggests that the absence of example snippets does not necessarily affect the metrics any more than the absence of any of the information provided by the other refiners.
Fig. 10: Cumulative percent of TESTPILOT generated test cases, using gpt3.5-turbo, with maximum similarity less than the similarity value shown on the x-axis.

Fig. 11: Example of a TESTPILOT-generated test case for bluebird (a), and an existing test case (b) with similarity 0.62.
Finally, we note that while the overall results across a given package show that the refiners always improve, or at least maintain, coverage and the percentage of passing tests, this does not mean that a refiner always improves the results for an individual API function. We have observed situations where adding information such as the function implementation to a prompt that does not include it confuses the model, resulting in the generation of a failing test. Figure 9 shows an example for the complex.js package: given the base prompt on the left, gpt3.5-turbo is able to produce a (very simple) passing test for the valueOf method of the constant ZERO exported by the package; adding the function body yields the prompt on the right, which seems to confuse the model, resulting in the generation of a failing test. Across all packages, 5,367 prompts were generated by applying one of the refiners, and in only 394 cases (7.3%) was the refined prompt less effective than the original prompt, in the sense that a passing test was generated from the original prompt but not from the refined prompt.

Here, t∗ is a given generated test, and dist is the edit distance between a generated test and an existing test. We follow the same method for calculating maximum similarity for each generated test, using the npm levenshtein package [61] to calculate dist.

Figure 10 shows the cumulative percentage of generated test cases for each project where the maximum similarity is less than the value on the x-axis. We also show this cumulative percentage for all generated test cases across all projects. We find that 6.2% of TestPilot's generated test cases have ≤ 0.3 maximum similarity to an existing test, 60.0% have ≤ 0.4, 92.8% have ≤ 0.5, and 99.6% have ≤ 0.6, while 100.0% of the generated test cases have ≤ 0.7. This means that TestPilot never generates exact copies of existing tests. In contrast, while 90% of Lemieux et al. [27]'s generated Python tests have ≤ 0.4 similarity, 2% of their test cases are exact copies. That said, given the differences between testing frameworks in Python and JavaScript (e.g., Mocha requires more boilerplate code than pytest), similarity numbers cannot be directly compared between the two languages.
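The following sketch illustrates how the maximum similarity of a generated test against a set of existing tests can be computed. The paper relies on the npm levenshtein package [61]; the stand-alone dynamic-programming edit distance and the normalization by the longer string below are assumptions made only to keep the example self-contained.

```js
// Sketch of the maximum-similarity computation (Se = 1 - normalized edit distance).
// A plain dynamic-programming Levenshtein distance stands in for the npm package.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]); // dp[i][0] = i
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Maximum similarity of a generated test t* against a suite of existing tests.
function maxSimilarity(generatedTest, existingTests) {
  if (existingTests.length === 0) return 0;
  return Math.max(
    ...existingTests.map((t) => {
      const dist = levenshtein(generatedTest, t);
      // One common normalization: divide by the length of the longer string.
      return 1 - dist / Math.max(generatedTest.length, t.length);
    })
  );
}
```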
TABLE 4: A comparison of the coverage of TestPilot's generated tests using three LLMs. For each project, we show the number of generated tests, the number (%) of passing tests, and the statement and branch coverage achieved by these passing tests.
Project; then, for each of gpt3.5-turbo, code-cushman-002, and StarCoder: Tests, Passing, Stmt Coverage, Branch Coverage
glob 68 18 (26.5%) 71.3% 66.3% 76 31 (40.1%) 61.7% 51.0% 45 10 (22.2%) 64.8% 58.4%
fs-extra 471 277 (58.8%) 58.8% 38.9% 394 254 (64.3%) 41.0% 23.3% 443 163 (36.7%) 43.0% 25.5%
graceful-fs 345 177 (51.4%) 49.3% 33.3% 301 135 (44.9%) 47.5% 30.3% 309 100 (32.4%) 44.7% 22.7%
jsonfile 13 6 (48.0%) 38.3% 29.4% 15 8 (53.3%) 46.8% 44.1% 13 7 (53.8%) 59.6% 47.0%
bluebird 370 204 (55.2%) 68.0% 50.0% 400 211 (52.6%) 68.2% 51.3% 395 130 (32.8%) 55.6% 36.0%
q 323 186 (57.6%) 70.4% 53.7% 356 190 (53.4%) 66.9% 51.2% 348 96 (27.4%) 63.0% 48.1%
rsvp 109 70 (64.2%) 70.1% 55.3% 115 77 (66.5%) 73.3% 60.5% 141 45 (31.9%) 66.8% 53.2%
memfs 1037 471 (45.4%) 81.1% 58.9% 1058 505 (47.7%) 78.9% 54.9% 922 268 (29.0%) 71.9% 49.8%
node-dir 40 19 (48.1%) 64.3% 50.8% 22 16 (74.4%) 52.2% 41.1% 51 17 (33.3%) 54.0% 42.7%
zip-a-folder 11 6 (54.5%) 84.0% 50.0% 10 7 (70.0%) 88.0% 62.5% 11 4 (36.4%) 56.0% 37.5%
js-sdsl 409 46 (11.3%) 33.9% 24.3% 274 63 (23.0%) 36.5% 27.3% 235 21 (8.9%) 26.9% 17.9%
quill-delta 152 33 (21.7%) 73.0% 64.3% 187 50 (26.5%) 74.0% 66.6% 135 7 (5.2%) 31.0% 21.1%
complex.js 209 121 (58.0%) 70.2% 46.5% 221 125 (56.3%) 62.7% 46.2% 178 56 (31.5%) 53.5% 34.9%
pull-stream 83 34 (41.0%) 69.1% 52.8% 76 43 (55.9%) 70.8% 54.7% 69 10 (14.5%) 51.6% 32.7%
countries-and-timezones 28 13 (46.4%) 93.1% 69.1% 41 18 (44.4%) 93.1% 74.4% 33 11 (33.8%) 88.2% 64.9%
simple-statistics 353 250 (70.9%) 87.8% 71.3% 350 213 (60.7%) 80.1% 63.9% 352 164 (46.6%) 69.9% 54.5%
plural 13 8 (61.5%) 73.8% 59.1% 17 8 (47.1%) 75.4% 59.1% 13 5 (38.5%) 73.8% 59.1%
dirty 70 32 (45.3%) 74.5% 65.4% 89 42 (47.5%) 81.1% 69.2% 57 23 (40.4%) 72.6% 61.5%
geo-point 76 50 (65.8%) 87.8% 70.6% 87 35 (40.2%) 61.0% 70.6% 62 16 (25.8%) 46.3% 70.6%
uneval 7 2 (28.6%) 68.8% 58.3% 5 0 (0.0%) 0.0% 0.0% 6 0 (0.0%) 0.0% 0.0%
image-downloader 5 4 (80.0%) 63.6% 50.0% 5 2 (40.0%) 75.8% 50.0% 5 2 (40.0%) 63.6% 50.0%
crawler-url-parser 14 2 (14.3%) 51.4% 35.0% 17 2 (11.8%) 49.5% 31.3% 14 1 (7.1%) 48.6% 32.5%
gitlab-js 141 14 (9.9%) 51.7% 16.5% 116 35 (29.7%) 61.8% 31.8% 117 1 (0.9%) 28.4% 0.6%
core 85 13 (15.3%) 78.3% 50.0% 102 21 (20.7%) 72.7% 47.7% 61 5 (8.2%) 16.1% 0.0%
omnitool 1033 330 (31.9%) 74.2% 55.2% 1029 321 (31.1%) 70.1% 54.2% 812 194 (23.9%) 40.0% 18.1%
Median 48.0% 70.2% 52.8% 47.1% 68.2% 51.2% 31.5% 54.0% 37.5%
4.7 RQ7: Effect of Different LLMs

Table 4 shows the number of generated tests, the percentage of generated tests that pass, as well as the statement and branch coverage of TestPilot's generated tests when using three different LLMs. While the individual coverage per package varies, we can see that the coverage of tests generated by the code-cushman-002 model is comparable to that of the tests generated by gpt3.5-turbo, with the latter having a slightly higher median statement and branch coverage across the packages. A Wilcoxon matched-pairs signed-rank test shows no statistically significant differences between gpt3.5-turbo and code-cushman-002 for either type of coverage. On the other hand, we do find a statistically significant difference between StarCoder and each of the OpenAI models (p-value < 0.05) for both types of coverage. As shown in Table 4, StarCoder achieves lower median statement (54.0%) and branch coverage (37.5%) than both other models. Cliff's delta [55] shows a large and medium effect size for statement and branch coverage, respectively, between gpt3.5-turbo and StarCoder, and a medium and small effect size for statement and branch coverage, respectively, between code-cushman-002 and StarCoder.
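For readers unfamiliar with this effect-size measure, the sketch below shows one way Cliff's delta [55] can be computed for two samples of per-package coverage values. It is a generic illustration rather than the authors' analysis script, and the magnitude thresholds follow commonly used conventions rather than values prescribed by the paper.

```js
// Sketch: Cliff's delta effect size between two samples of per-package coverage values.
function cliffsDelta(xs, ys) {
  let greater = 0;
  let less = 0;
  for (const x of xs) {
    for (const y of ys) {
      if (x > y) greater++;
      else if (x < y) less++;
    }
  }
  return (greater - less) / (xs.length * ys.length); // ranges from -1 to 1
}

// Commonly used magnitude thresholds (|d| < 0.147 negligible, < 0.33 small,
// < 0.474 medium, otherwise large).
function magnitude(d) {
  const a = Math.abs(d);
  if (a < 0.147) return "negligible";
  if (a < 0.33) return "small";
  if (a < 0.474) return "medium";
  return "large";
}
```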
However, we note that StarCoder's median statement coverage and branch coverage are both higher than Nessie's (statement: 54.0% vs. 51.3% and branch: 37.5% vs. 25.6%). While this higher coverage was not statistically significant, the results show that even LLMs trained with potentially smaller datasets and/or a different training process than OpenAI's models are on par with (or even sometimes higher than) state-of-the-art traditional test-generation techniques, such as Nessie [11]. Furthermore, in RQ2, we showed that using gpt3.5-turbo with TestPilot resulted in higher-coverage test suites, with statistically significant differences to Nessie. Overall, these results emphasize the promise of LLM-based test generation techniques in generating high-coverage test suites.

Finally, we note that the median time for TestPilot to generate tests for a given function using gpt3.5-turbo is 15s, and the median time to generate a complete test suite for a given package is 6m 55s.12 The bulk of this time is spent querying the model, so the choice of LLM makes a big difference. For example, the median time for TestPilot to generate tests for a given function using StarCoder and code-cushman-002 is 24s and 11s, respectively, and 10m 48s and 4m 53s, respectively, for a complete test suite. All these performance numbers suggest that it is feasible to use TestPilot either in an online setting (e.g., in an IDE) to generate tests for individual functions, or in an offline setting (e.g., during code review) to generate complete test suites for an API.

12. These timings were measured on a standard GitHub Actions Linux VM with a 2-core CPU, 7GB of RAM, and 14GB of SSD disk space.

5 THREATS TO VALIDITY

Internal Validity: The extraction of example snippets from documentation relies on textually matching a function's name. Given two functions with the same name but different access paths, we cannot disambiguate which function is being used in the example snippet. Accordingly, we match this snippet to both functions. While this may lead to inaccuracies, there is unfortunately no precise alternative for this matching. Any heuristics may cause us to miss snippets altogether, which may be worse, since example snippets help with increasing the percentage of passing tests as shown in Figure 8. The overall high coverage and percentage of passing tests suggest that our matching technique is not a limiting factor in practice.

Construct Validity: We use the concept of non-trivial assertions as a proxy for oracle quality in the generated tests. When determining non-trivial assertions, we search for any usage of the package under test in the backwards slice of the assertion. Such usage may be different from the intended function under test. However, given the dynamic nature of JavaScript, precisely determining the usage of a given function, as extracted by the API explorer, and its occurrence in the backwards slice is difficult. While our approach does not allow us to precisely determine non-trivial coverage for a given function, this does not affect the non-trivial coverage we report for each package's complete API.
Note that when calculating non-trivial coverage, we measure the full coverage of tests that contain at least one non-trivial assertion. There may be other calls in those non-trivial tests that contribute to coverage but do not contribute to the assertion. Measuring assertion/checked coverage as defined by Schuler and Zeller [62] is a possible alternative, but this is practically difficult to implement precisely for JavaScript.

Our definition of non-trivial assertions is simple, setting a low bar for non-triviality. Any assertion classified as trivial by our criterion is, indeed, not meaningful, but the converse is not necessarily true. Accordingly, our measure of non-trivial coverage is a lower bound on the true non-trivial coverage.
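As an illustration of this criterion, consider the following hypothetical Mocha tests (the package object and its add function are made up for this example): the first assertion is trivial because nothing in its backward slice uses the package under test, while the second is non-trivial because the asserted value flows from a call into the package.

```js
const assert = require("assert");

// Hypothetical stand-in for the package under test; in TestPilot's setting this
// would be the actual npm package loaded via require().
const pkg = { add: (a, b) => a + b };

describe("trivial vs. non-trivial assertions (illustrative)", function () {
  it("trivial: the assertion does not depend on the package", function () {
    const x = 1 + 1;
    assert.equal(x, 2); // classified as trivial
  });

  it("non-trivial: the asserted value flows from a package call", function () {
    const result = pkg.add(1, 1); // package usage in the assertion's backward slice
    assert.equal(result, 2);      // classified as non-trivial
  });
});
```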
While the examples we show in the paper suggest that TestPilot's generated tests use variable names that are similar to those chosen by programmers, we do not formally assess the readability of these tests. In the future, it would be interesting to conduct user studies to assess the readability of tests generated by different techniques.

External Validity: Despite our evaluation scale significantly exceeding evaluations of previous test generation approaches [11], [25], our results are still based on 25 npm packages and may not generalize to other JavaScript code bases. In particular, TestPilot's performance may not generalize to proprietary code that was never seen in the LLM's training set. We try to mitigate this effect in several ways: (1) we evaluate on less popular packages that are likely to have had less impact on the model's training, (2) we evaluate on 5 GitLab repositories that have not been included in the models' training, and (3) we measure the similarity of the generated tests to the existing tests. Our results show that TestPilot performs well for both popular and unpopular packages and that 92.8% of the test cases have ≤ 50% similarity with existing tests, with no exact copies. Overall, this reassures us that TestPilot is not producing "memorized" code.

Finally, we note that while our technique is conceptually language-agnostic, our current implementation of TestPilot targets JavaScript, and thus we cannot generalize our results to other languages.

6 RELATED WORK

TestPilot provides an alternative to (and potentially complements) traditional techniques for automated test generation, including feedback-directed random test generation [7]–[11], search-based and evolutionary techniques [16], [17], [63], [64], and dynamic symbolic execution [12]–[15]. This section reviews neural techniques for test generation, and previous test generation techniques for JavaScript.

6.1 Neural Techniques

Neural techniques are rapidly being adopted for solving various Software Engineering problems, with promising results in several domains including code completion [34]–[38], program repair [39]–[41], and bug-finding [42], [43]. Pradel and Chandra [65] survey the current state of the art in this emerging research area. We are aware of several recent research efforts in which LLMs are used for test generation [23]–[29]. There are two main differences between our work and these efforts: (i) the goal and types of tests generated and (ii) the need for some form of fine-tuning or additional data. We discuss the details below.

Differing goals: TiCoder [24] and CodeT [23] use Codex to generate implementations and test cases from problem descriptions expressed in natural language. TiCoder relies on a test-driven user-intent formalization (TDUIF) loop in which the user and model interact to generate both an implementation matching the user's intent and a set of test cases to validate its correctness. CodeT, on the other hand, generates both a set of candidate implementations and some test cases based on the same prompt, runs the generated tests on the candidate implementations, and chooses the best solution based on the test results. Unlike TestPilot, neither of these efforts solves the problem of automatically generating unit tests for existing code.

Given the characteristics of LLMs in generating natural-looking code, there have been several efforts exploring the use of LLMs to help [27] or complement [28] traditional test generation techniques. Most recently, Lemieux et al. [27] explore using tests generated by Codex as a way to unblock the search process of test generation using search-based techniques [64], which often fails when the initial randomly generated test has meaningless input that cannot be mutated effectively. Their results show that, on most of their 27 target Python projects, their proposed technique, CodaMOSA, outperforms the baseline search-based technique, Pynguin's implementation of MOSA [64], as well as using only Codex. However, their Codex prompt includes only the function implementation and an instruction to generate tests. Since their main goal is to explore whether a test generated by Codex can improve the search process, they do not systematically explore the effect of different prompt components. In fact, they conjecture that further prompt engineering might improve results, motivating the need for our work, which systematically explores different prompt components. Additionally, their generated tests are in the MOSA format [64], which the authors acknowledge could lose readability, and do not contain assertions. Most of our tests contain assertions, and we further study the quality of the assertions we generate as well as the reasons for test failures.

Similarly, given that it is often difficult for traditional test generation techniques to generate (useful) assertions [21], [22], ATLAS [28] uses LLMs to generate an assert statement for a given (assertion-less) Java test. They position their technique as a complement to traditional techniques [8], [17]. With the same goal, Mastropaolo et al. [29], [45] and Tufano et al. [46] perform follow-up work using transfer learning, while Yu et al. [66] use information retrieval techniques to further improve the assert statements generated by ATLAS. TOGA [67] uses similar techniques but additionally incorporates an exceptional oracle classifier to decide if a given method requires an assertion to test exceptional behavior. It then bases the generation of the assertion on a pre-defined oracle taxonomy created by manually analyzing existing Java tests and using a neural-based ranking mechanism to rank candidates with oracles higher. In contrast with these efforts, our goal is to generate a complete test method without giving the model any content of the test method (aside from boilerplate code required by Mocha), which means that the model needs to generate both any test setup code (e.g., initializing objects and populating them) as well as the assertion.
While TOGA can be integrated with EvoSuite [16] to create an end-to-end test-generation tool, recent work [68] points out several shortcomings of the evaluation methods, casting doubt on the validity of the reported results.

Differing Input/Training: Bareiß et al. [25] evaluate the performance of Codex on three code-generation tasks, including test generation. Like us, they rely on embedding contextual information into the prompt to guide the LLM, though the specific data they embed is different: while TestPilot only includes the signature, definition, documentation, and usage examples in the prompt, Bareiß et al. pursue a few-shot learning approach where, in addition to the definition of a function under test, they include an example of a different function from the same code base and its associated test to give the model a hint as to what it is expected to do, as well as a list of related helper function signatures that could be useful for test generation. For a limited list of 18 Java methods, they show that this approach yields slightly better coverage than Randoop [8], [9], a popular technique for feedback-directed random test generation. This is a promising result, but finding suitable example tests to use in few-shot learning can be difficult, especially since their evaluation shows that good coverage crucially depends on the examples being closely related to the function under test.

Tufano et al. [26] present AthenaTest, an approach for automated test generation based on a BART transformer model [44]. For a given test case, they rely on heuristics to identify the "focal" class and method under test. These mapped test cases are then used to fine-tune the model for the task of producing unit tests by representing this task as a translation task that maps a focal method (along with the focal class, constructors, and other public methods and fields in that class) to a test case. In experiments on 5 projects from Defects4J [69], AthenaTest generated 158K test cases, achieving similar test coverage as EvoSuite [16], a popular search-based test generation tool, and covering 43% of all focal methods. A significant difference between their work and ours is that their approach requires training the model on a large set of test cases, whereas TestPilot uses an off-the-shelf LLM. In fact, in addition to the goal differences with ATLAS [28] and Mastropaolo et al.'s [29], [45] work above, both of these efforts also require a data set of test methods (with assertions) and their corresponding focal methods, whether to use in the main training [28] or in fine-tuning during transfer learning [29], [45], [46].

Unfortunately, the above differences in goals or in the required data for model training make it meaningless or impossible to do a direct experimental comparison with TestPilot. Additionally, none of these efforts support JavaScript or provide JavaScript data sets that can be used for comparison. In fact, one of our main motivations for exploring prompt engineering for an off-the-shelf LLM is to avoid the need to collect test examples for few-shot learning [25] or test method/focal method pairs required for training [28] or additional fine-tuning [29], [45], [46].

Other techniques: Stallenberg et al. [70] present a test generation technique for JavaScript based on unsupervised type inference consisting of three phases. First, a static analysis is performed to deduce relationships between program elements such as variables and expressions. Then, a probabilistic type inference is applied to these relationships to construct a model. Finally, they show how search-based techniques can take advantage of the information contained in such models by proposing two strategies for consulting these models in the main loop of DynaMOSA [64].

Recently, El Haji [71] presented an empirical study that explores the effectiveness of GitHub Copilot at generating tests. In this study, tests are selected from existing test suites associated with 7 open-source Python projects. After removing the body of each test function, Copilot is asked to complete the implementation so that the resulting test can be executed and compared against the original test. Two variations of this approach are explored, viz., "with context", where the other tests in the suite are preserved, and "without context", where other tests are removed. El Haji also explores the impact of (manually) adding comments that include descriptions of intended behavior and usage examples. The results from the study show that 45.28% of generated tests are passing in the "with context" scenario (the rest are failing, syntactically invalid, or empty) vs. only 7.55% passing generated tests in the "without context" scenario, and that the addition of usage examples and comments is generally helpful. There are several significant differences between our approach and El Haji's work: we explore a fully automated technique without any manual steps, we report on a significantly more extensive empirical evaluation, we present an adaptive technique in which prompts are refined in response to the execution behavior of previously executed tests, we target a different programming language (JavaScript instead of Python), and TestPilot interacts directly with an LLM rather than relying on Copilot, an LLM-based programming assistant.

6.2 Test Generation Techniques for JavaScript

TestPilot's mechanism for refining prompts based on execution feedback was inspired by the mechanism employed by feedback-directed random test generation techniques [7]–[11], where new tests are generated by extending previously generated passing tests. As reported in Section 4.2, TestPilot achieves significantly higher statement coverage and branch coverage than Nessie [11], which represents the state-of-the-art in feedback-directed random test generation for JavaScript.

Several previous projects have considered test generation for JavaScript (see [72] for a survey). Saxena et al. [73] present Kudzu, a tool that aims to find injection vulnerabilities in client-side JavaScript applications by exploring an application's input space. They differentiate an application's input space into an event space, which concerns the order in which event handlers execute (e.g., as a result of buttons being clicked), and a value space, which concerns the choice of values passed to functions or entered into text fields. Kudzu uses dynamic symbolic execution to explore the value space systematically, but it relies on a random exploration strategy to explore the event space. Artemis [74] is a framework for automated test generation that iteratively generates tests for client-side JavaScript applications consisting of sequences of events, using a heuristics-based strategy that considers the locations read and written by each event handler to focus on the generation of tests involving event handlers that interact with each other.
Li et al. [75] extend Artemis with dynamic symbolic execution to improve its ability to explore the value space, and Tanida et al. [76] further improve on this work by augmenting generated test inputs with user-supplied invariants. Fard et al. [77] present ConFix, a tool that uses a combination of dynamic analysis and symbolic execution to automatically generate instances of the Document Object Model (DOM) that can serve as test fixtures in unit tests for client-side JavaScript code. Marchetto and Tonella [78] present a search-based test generation technique that constructs tests consisting of sequences of events and that relies on the automatic extraction of a finite state machine representing the application's state. None of these tools generate tests that contain assertions.

Several test generation tools for JavaScript are capable of generating tests containing assertions. JSART [79] is a tool that generates regression tests that contain assertions reflecting likely invariants that are generated using a variation of the Daikon dynamic invariant generator [80]. Since Daikon generates assertions that are likely to hold, an additional step is needed in which invalid assertions are removed from the generated tests. Mirshokraie et al. [81], [82] present an approach in which tests are generated for client-side JavaScript applications consisting of sequences of events. Then, in an additional step, function-level unit tests are derived by instrumenting program execution to monitor the state of parameters, global variables, and the DOM upon entry and exit to functions, to obtain values with which functions are to be invoked. Assertions are added automatically to the generated tests by: (i) mutating the DOM and the code of the application under test, (ii) executing generated tests to determine how application state is impacted by mutations, and (iii) adding assertions to the tests that reflect the behavior prior to the mutation. Testilizer [83] is a test generation tool that aims to enhance an existing human-written test suite. To this end, Testilizer instruments code to observe how existing tests access the DOM, and executes them to obtain a State-Flow Graph in which the nodes reflect dynamic DOM states and edges reflect the event-driven transitions between these states. Alternative paths are explored by exploring previously unexplored events in each state. Testilizer adds assertions to the generated tests by either copying them verbatim from existing tests, adapting the structure of an existing assertion to a newly explored state, or inferring a similar assertion using machine learning techniques.

These techniques share the limitation that they require the entire application under test to be executable, limiting their applicability. Moreover, several of the techniques discussed above require re-execution of tests (to infer assertions using mutation testing [81], [82], or to filter out assertions that are invalid [79]), which adds to their cost. By contrast, TestPilot only requires the functions of the API under test to be available and executable, and it executes each test that it generates only once.
7 CONCLUSIONS AND FUTURE WORK

We have presented TestPilot, an approach for adaptive unit-test generation using a large language model. Unlike previous work in this area, TestPilot requires neither fine-tuning nor a parallel corpus of functions and tests. Instead, we embed contextual information about the function under test into the prompt, specifically its signature, its attached documentation comment (if any), any usage examples from the project documentation, and the source code of the function. Furthermore, if a generated test fails, we adaptively create a new prompt embedding this test and the failure message to guide the model towards fixing the problematic test. We have implemented our approach for JavaScript on top of off-the-shelf LLMs, and shown that it achieves state-of-the-art statement coverage on 25 npm packages. Further evaluation shows that the majority of the generated tests contain non-trivial assertions, and that all parts of the information included in the prompt contribute to the quality of the generated tests. Experiments with three LLMs (gpt3.5-turbo, code-cushman-002, and StarCoder) demonstrate that LLM-based test generation already outperforms state-of-the-art previous test generation methods such as Nessie on key metrics. We conjecture that the use of more advanced LLMs will further improve results, though we are reluctant to speculate by how much.
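A minimal sketch of this adaptive re-prompting loop is shown below; the generateTest and runTest callbacks stand in for the LLM query and the Mocha test execution, and the way the failing test and error message are embedded is illustrative rather than TestPilot's exact prompt format.

```js
// Illustrative sketch of adaptive re-prompting on test failure (not the tool's actual code).
// `generateTest(prompt)` queries the LLM; `runTest(test)` executes the test and
// resolves to { passed, errorMessage }.
async function generateWithRetry(basePrompt, generateTest, runTest, maxAttempts = 2) {
  let prompt = basePrompt;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const test = await generateTest(prompt);
    const outcome = await runTest(test);
    if (outcome.passed) return test;
    // Refine the prompt by embedding the failing test and its error message,
    // asking the model to fix the problem.
    prompt =
      basePrompt +
      "\n// The previous test failed with: " + outcome.errorMessage + "\n" +
      test +
      "\n// Please fix the failing test.\n";
  }
  return null; // no passing test was generated within the retry budget
}
```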
In future work, we plan to further investigate the quality of the tests generated by TestPilot. While in this paper we have focused on passing tests and excluded failing tests from consideration entirely, we have seen examples of failing tests that are "almost right" and might be interesting to a developer as a starting point for further refinement. However, doing this depends on having a good strategy for telling apart useful failing tests from useless ones. Our experiments have demonstrated that timeout errors, assertion errors, and correctness errors are key factors that cause tests to fail. In future work, we plan to apply static and dynamic program analysis to failing tests in order to determine why timeout errors and assertion errors occur and how failing tests could be modified to make them pass.

Further research is needed to determine what factors prevent the generation of non-trivial assertions. Anecdotally, we have observed that the availability of usage examples is generally helpful. We envision that the number of useful assertions could be improved by obtaining usage examples in other ways, e.g., by interacting with a user, or by extracting usage examples from clients of the application under test.

Another fruitful area of experimentation could be varying the sampling temperature of the LLM. In this work, we always sample at temperature zero, which has the advantage of providing stable results, but also means that the model is less likely to offer lower-probability completions that might result in more interesting tests.

Another area of future work is the development of hybrid techniques that combine existing feedback-directed test generation techniques with an LLM-based technique such as TestPilot. For example, one could use an LLM-based technique to generate an initial set of tests and use the tests that it generates as a starting point for extension by a feedback-directed technique such as Nessie, thus enabling it to uncover edge cases that would be difficult to uncover using only random values.

In principle, our approach can be adapted to any programming language. Practically speaking, this would involve adapting prompts to use the syntax of the language under consideration, and to use a testing framework for that language.
In addition, the mining of documentation and usage examples would need to be adapted to match the documentation format used for that language. The LLMs that we used did not undergo language-specific training and could be used to generate tests for other languages, though the effectiveness of the approach will depend on the amount of code written in that language that was included in the LLM's training set. One specific question that would be interesting to explore is how well an approach like TestPilot would perform on a statically-typed language.

ACKNOWLEDGMENT

The research reported on in this paper was conducted while S. Nadi and F. Tip were sabbatical visitors and A. Eghbali was an intern at GitHub. The authors are grateful to the GitHub Next team for many insightful and helpful discussions about TestPilot. F. Tip was supported in part by National Science Foundation grants CCF-1907727 and CCF-2307742. S. Nadi's research is supported by the Canada Research Chairs Program and the Natural Sciences and Engineering Research Council of Canada (NSERC), RGPIN-2017-04289.

REFERENCES

[1] K. Beck, Extreme Programming Explained: Embrace Change. Addison-Wesley, 2000.
[2] J. Shore and S. Warden, The Art of Agile Development, 2nd ed. O'Reilly, 2021.
[3] S. Siddiqui, Learning Test-Driven Development. O'Reilly, 2021.
[4] E. Daka and G. Fraser, "A survey on unit testing practices and problems," in 25th IEEE International Symposium on Software Reliability Engineering, 2014.
[5] B. P. Miller, L. Fredriksen, and B. So, "An empirical study of the reliability of UNIX utilities," Commun. ACM, vol. 33, no. 12, 1990.
[6] M. Zalewski, "American fuzzy lop," https://lcamtuf.coredump.cx/afl/, 2022.
[7] C. Csallner and Y. Smaragdakis, "JCrasher: an automatic robustness tester for Java," Software Practice and Experience, vol. 34, no. 11, 2004.
[8] C. Pacheco and M. D. Ernst, "Randoop: Feedback-directed random testing for Java," in OOPSLA '07 Companion. ACM, 2007.
[9] C. Pacheco, S. K. Lahiri, and T. Ball, "Finding errors in .NET with feedback-directed random testing," in ISSTA '08. ACM, 2008.
[10] M. Selakovic, M. Pradel, R. Karim, and F. Tip, "Test generation for higher-order functions in dynamic languages," Proc. ACM Program. Lang., vol. 2, no. OOPSLA, 2018.
[11] E. Arteca, S. Harner, M. Pradel, and F. Tip, "Nessie: Automatically testing JavaScript APIs with asynchronous callbacks," in ICSE 2022. ACM, 2022.
[12] P. Godefroid, N. Klarlund, and K. Sen, "DART: directed automated random testing," in PLDI 2005. ACM, 2005.
[13] K. Sen, D. Marinov, and G. Agha, "CUTE: a concolic unit testing engine for C," in ESEC/FSE 2005. ACM, 2005.
[14] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler, "EXE: automatically generating inputs of death," in CCS 2006. ACM, 2006.
[15] N. Tillmann, J. de Halleux, and T. Xie, "Transferring an automated test generation tool to practice: from Pex to Fakes and Code Digger," in ASE '14. ACM, 2014.
[16] G. Fraser and A. Arcuri, "Evolutionary generation of whole test suites," in QSIC 2011. IEEE Computer Society, 2011.
[17] G. Fraser and A. Arcuri, "EvoSuite: automatic test suite generation for object-oriented software," in ESEC/FSE '11. ACM, 2011.
[18] M. M. Almasi, H. Hemmati, G. Fraser, A. Arcuri, and J. Benefelds, "An industrial evaluation of unit test generation: Finding real faults in a financial application," in ICSE-SEIP 2017. IEEE, 2017.
[19] G. Grano, S. Scalabrino, H. C. Gall, and R. Oliveto, "An empirical investigation on the readability of manual and generated test cases," in ICPC 2018. IEEE, 2018.
[20] E. Daka, J. Campos, G. Fraser, J. Dorn, and W. Weimer, "Modeling readability to improve unit tests," in ESEC/FSE 2015. ACM, 2015.
[21] A. Panichella, S. Panichella, G. Fraser, A. A. Sawant, and V. J. Hellendoorn, "Revisiting test smells in automatically generated tests: Limitations, pitfalls, and opportunities," in ICSME 2020. IEEE, 2020.
[22] F. Palomba, D. D. Nucci, A. Panichella, R. Oliveto, and A. D. Lucia, "On the diffusion of test smells in automatically generated test code: an empirical study," in SBST@ICSE 2016. ACM, 2016.
[23] B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and W. Chen, "CodeT: Code Generation with Generated Tests," 2022. https://arxiv.org/abs/2207.10397
[24] S. K. Lahiri, A. Naik, G. Sakkas, P. Choudhury, C. von Veh, M. Musuvathi, J. P. Inala, C. Wang, and J. Gao, "Interactive Code Generation via Test-Driven User-Intent Formalization," 2022. https://arxiv.org/abs/2208.05950
[25] P. Bareiß, B. Souza, M. d'Amorim, and M. Pradel, "Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code," CoRR, vol. abs/2206.01335, 2022.
[26] M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, "Unit test case generation with transformers and focal context," arXiv, May 2021. https://www.microsoft.com/en-us/research/publication/unit-test-case-generation-with-transformers-and-focal-context/
[27] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, "CodaMOSA: Escaping coverage plateaus in test generation with pre-trained large language models," in ICSE 2023.
[28] C. Watson, M. Tufano, K. Moran, G. Bavota, and D. Poshyvanyk, "On Learning Meaningful Assert Statements for Unit Test Cases," in ICSE 2020. ACM, 2020.
[29] A. Mastropaolo, S. Scalabrino, N. Cooper, D. Nader Palacio, D. Poshyvanyk, R. Oliveto, and G. Bavota, "Studying the usage of text-to-text transfer transformer to support code-related tasks," in ICSE 2021.
[30] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[31] T. B. Brown et al., "Language models are few-shot learners," 2020. https://arxiv.org/abs/2005.14165
[32] M. Chen et al., "Evaluating Large Language Models Trained on Code," CoRR, vol. abs/2107.03374, 2021.
[33] Y. Li et al., "Competition-level code generation with AlphaCode," 2022. https://arxiv.org/abs/2203.07814
[34] J. Li, Y. Wang, M. R. Lyu, and I. King, "Code completion with neural attention and pointer networks," in IJCAI 2018.
[35] A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan, "IntelliCode Compose: code generation using transformer," in ESEC/FSE 2020. ACM, 2020.
[36] R. Karampatsis, H. Babii, R. Robbes, C. Sutton, and A. Janes, "Big code != big vocabulary: open-vocabulary models for source code," in ICSE 2020. ACM, 2020.
[37] S. Kim, J. Zhao, Y. Tian, and S. Chandra, "Code prediction by feeding trees to transformers," in ICSE 2021. IEEE, 2021.
[38] GitHub, "GitHub Copilot," https://copilot.github.com/, 2022.
[39] R. Gupta, S. Pal, A. Kanade, and S. K. Shevade, "DeepFix: Fixing common C language errors by deep learning," in AAAI 2017. AAAI Press, 2017.
[40] V. J. Hellendoorn, C. Sutton, R. Singh, P. Maniatis, and D. Bieber, "Global relational models of source code," in ICLR 2020. OpenReview.net, 2020.
[41] M. Allamanis, H. Jackson-Flux, and M. Brockschmidt, "Self-supervised bug detection and repair," in NeurIPS 2021.
[42] M. Pradel and K. Sen, "DeepBugs: a learning approach to name-based bug detection," Proc. ACM Program. Lang., vol. 2, no. OOPSLA, 2018.
[43] M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," in ICLR 2018. OpenReview.net, 2018.
[44] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," arXiv preprint arXiv:1910.13461, 2019.
[45] A. Mastropaolo, N. Cooper, D. N. Palacio, S. Scalabrino, D. Poshyvanyk, R. Oliveto, and G. Bavota, "Using transfer learning for code-related tasks," IEEE Transactions on Software Engineering, 2022.
[46] M. Tufano, D. Drain, A. Svyatkovskiy, and N. Sundaresan, "Generating accurate assert statements for unit test cases using pretrained transformers," in AST '22. ACM, 2022.
[47] L. Reynolds and K. McDonell, "Prompt programming for large language models: Beyond the few-shot paradigm," 2021. https://arxiv.org/abs/2102.07350
[48] OpenAI, "OpenAI LLMs: Deprecations," https://platform.openai.com/docs/deprecations, 2023.
[49] HuggingFace, "StarCoder: A state-of-the-art LLM for code," https://huggingface.co/blog/starcoder, 2023.
[50] "Mocha," https://mochajs.org/, 2022.
[51] G. Mezzetti, A. Møller, and M. T. Torp, "Type Regression Testing to Detect Breaking Changes in Node.js Libraries," in ECOOP 2018. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
[52] OpenAI, "OpenAI Codex," https://openai.com/blog/openai-codex, 2023.
[53] "Istanbul coverage tool," https://istanbul.js.org/, 2022.
[54] S. Yoo and M. Harman, "Regression testing minimization, selection and prioritization: a survey," Software Testing, Verification and Reliability, vol. 22, no. 2, 2012.
[55] N. Cliff, "Dominance statistics: Ordinal analyses to answer ordinal questions," Psychological Bulletin, vol. 114, no. 3, 1993.
[56] M. Selakovic, M. Pradel, R. Karim, and F. Tip, "Test generation for higher-order functions in dynamic languages," Proceedings of the ACM on Programming Languages, vol. 2, no. OOPSLA, 2018.
[57] GitHub, "CodeQL," https://codeql.github.com/, 2022.
[58] OpenJS Foundation, "Node.js error codes," https://nodejs.org/dist/latest-v18.x/docs/api/errors.html#nodejs-error-codes, 2022.
[59] S. Schleimer, D. S. Wilkerson, and A. Aiken, "Winnowing: Local algorithms for document fingerprinting," in SIGMOD '03. ACM, 2003.
[60] G. Myers, "A fast bit-vector algorithm for approximate string matching based on dynamic programming," J. ACM, vol. 46, no. 3, 1999.
[61] "npm levenshtein distance package," https://www.npmjs.com/package/levenshtein.
[62] D. Schuler and A. Zeller, "Assessing oracle quality with checked coverage," in Fourth IEEE International Conference on Software Testing, Verification and Validation, 2011.
[63] P. McMinn, "Search-based software testing: Past, present and future," in ICST 2011 Workshop Proceedings. IEEE Computer Society, 2011.
[64] A. Panichella, F. M. Kifetew, and P. Tonella, "Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets," IEEE Transactions on Software Engineering, vol. 44, no. 2, 2018.
[65] M. Pradel and S. Chandra, "Neural software analysis," Commun. ACM, vol. 65, no. 1, 2022.
[66] H. Yu, Y. Lou, K. Sun, D. Ran, T. Xie, D. Hao, Y. Li, G. Li, and Q. Wang, "Automated Assertion Generation via Information Retrieval and Its Integration with Deep Learning," in ICSE '22. ACM, 2022.
[67] E. Dinella, G. Ryan, T. Mytkowicz, and S. K. Lahiri, "TOGA: A neural method for test oracle generation," in ICSE '22. ACM, 2022.
[68] Z. Liu, K. Liu, X. Xia, and X. Yang, "Towards more realistic evaluation for neural test oracle generation," in ISSTA '23, 2023.
[69] R. Just, D. Jalali, and M. D. Ernst, "Defects4J: a database of existing faults to enable controlled testing studies for Java programs," in ISSTA '14. ACM, 2014.
[70] D. M. Stallenberg, M. Olsthoorn, and A. Panichella, "Guess what: Test case generation for JavaScript with unsupervised probabilistic type inference," in SSBSE 2022. Springer, 2022.
[71] K. El Haji, "Empirical study on test generation using GitHub Copilot," Master's thesis, Delft University of Technology, 2023.
[72] E. Andreasen, L. Gong, A. Møller, M. Pradel, M. Selakovic, K. Sen, and C. Staicu, "A survey of dynamic analysis and test generation for JavaScript," ACM Comput. Surv., vol. 50, no. 5, 2017.
[73] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, and D. Song, "A symbolic execution framework for JavaScript," in 31st IEEE Symposium on Security and Privacy (S&P 2010). IEEE Computer Society, 2010.
[74] S. Artzi, J. Dolby, S. H. Jensen, A. Møller, and F. Tip, "A framework for automated testing of JavaScript web applications," in ICSE 2011. ACM, 2011.
[75] G. Li, E. Andreasen, and I. Ghosh, "SymJS: automatic symbolic testing of JavaScript web applications," in FSE 2014. ACM, 2014.
[76] H. Tanida, T. Uehara, G. Li, and I. Ghosh, "Automatic Unit Test Generation and Execution for JavaScript Program through Symbolic Execution," in Proceedings of the Ninth International Conference on Software Engineering Advances, 2014.
[77] A. M. Fard, A. Mesbah, and E. Wohlstadter, "Generating fixtures for JavaScript unit testing," in ASE 2015. IEEE Computer Society, 2015.
[78] A. Marchetto and P. Tonella, "Using search-based algorithms for Ajax event sequence generation during testing," Empir. Softw. Eng., vol. 16, no. 1, 2011.
[79] S. Mirshokraie and A. Mesbah, "JSART: JavaScript assertion-based regression testing," in ICWE 2012. Springer, 2012.
[80] "The Daikon invariant detector," http://plse.cs.washington.edu/daikon/, 2023.
[81] S. Mirshokraie, A. Mesbah, and K. Pattabiraman, "PYTHIA: generating test cases with oracles for JavaScript applications," in ASE 2013. IEEE, 2013.
[82] S. Mirshokraie, A. Mesbah, and K. Pattabiraman, "JSEFT: automated JavaScript unit test generation," in ICST 2015. IEEE Computer Society, 2015.
[83] A. M. Fard, M. MirzaAghaei, and A. Mesbah, "Leveraging existing tests in automated test generation for web applications," in ASE '14. ACM, 2014.