Multi-language Unit Test Generation using LLMs
ABSTRACT
Implementing automated unit tests is an important but time-consuming activity in software development. Developers dedicate substantial time to writing tests for validating an application and preventing regressions. To support developers in this task, software engineering research over the past few decades has developed many techniques for automating unit test generation. However, despite this effort, usable tools exist for very few programming languages—mainly Java, C, and C# and, more recently, for Python. Moreover, studies have found that automatically generated tests suffer poor readability and often do not resemble developer-written tests. In this work, we present a rigorous investigation of how large language models (LLMs) can help bridge the gap. We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We illustrate how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. We conducted a thorough empirical study to assess the quality of the generated tests in terms of coverage, mutation score, and test naturalness—evaluating them on standard as well as enterprise Java applications and a large Python benchmark. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved, while also producing considerably more natural test cases that developers find easy to read and understand. We also present the results of a user study, conducted with 161 professional developers, that highlights the naturalness characteristics of the tests generated by our approach.

CCS CONCEPTS
• General and reference → Empirical studies; • Computing methodologies → Neural networks.

KEYWORDS
unit test, java, python, llm

ACM Reference Format:
Rangeet Pan, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. 2024. Multi-language Unit Test Generation using LLMs. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

*Author was an intern at IBM Research at the time of this work.

1 INTRODUCTION
Unit testing is a key activity in software development that serves as the first line of defense against the introduction of software bugs. Manually writing high-coverage and effective unit tests can be tedious and time consuming. To address this, many automated test-generation techniques have been developed, aimed at reducing the cost of manual test-suite development: over the last few decades, this research has produced a variety of approaches based on symbolic analysis (e.g., [8, 27, 50, 54, 57, 66, 70]), search-based techniques (e.g., [24, 28, 34, 36, 38, 58]), random and adaptive-random techniques (e.g., [3, 10, 11, 35, 37, 44]), and learning-based approaches [20].

These techniques have achieved considerable success in generating high-coverage test suites with good fault-detection ability, but they still have several key limitations—with respect to test readability, test scenarios covered, and test assertions created. Previous studies (e.g., [25]) have shown that developers find automatically generated tests lacking in these characteristics, suffering poor readability and comprehensibility, covering uninteresting sequences, and containing trivial or ineffective assertions. Automatically generated tests are also known to contain anti-patterns or test smells [45] and are generally not perceived as being "natural," in the sense that they do not resemble the kinds of tests that developers write.

Recently, researchers have started to explore the potential of leveraging large language models (LLMs) to overcome these limitations of conventional test-generation techniques [68]. Given the inherent ability of LLMs to synthesize natural-looking code (having been trained on massive amounts of human-written code), this is a promising research direction toward the development of techniques that generate natural test cases. Early work in this area has focused on empirical studies of different LLMs for unit test generation using simple prompts or experimenting with various prompting strategies by including different types of context information suitable for test generation (e.g., [29, 53, 55, 56, 62]), and investigation of feedback mechanisms for generating compilable code or achieving specific testing goals, such as increasing coverage or killing surviving mutants (e.g., [18, 49, 52, 53]).
(a) Sample test cases for Apache Commons CLI

EvoSuite:
1. @Test
2. public void test0() throws Throwable {
3.   BasicParser basicParser0 = new BasicParser();
4.   String[] stringArray0 = new String[4];
5.   String[] stringArray1 = basicParser0.flatten((Options)null, stringArray0, true);
6.   assertEquals(4, stringArray1.length); }

ASTER:
1. @Test
2. public void testFlattenWithShortOption_c5_o0() {
3.   Options options = new Options();
4.   options.addOption("option1", "o", true, "description1");
5.   String[] arguments = new String[] {"-o", "value1", "--option2", "value2"};
6.   BasicParser parser = new BasicParser();
7.   String[] flattenedArguments = parser.flatten(options, arguments, false);
8.   assertArrayEquals(new String[] {"-o", "value1", "-option2", "value2"}, flattenedArguments); }

(b) Sample test cases for Apache Commons JXPath

EvoSuite (excerpt):
4.   Locale locale0 = Locale.ROOT;
5.   DOMNodePointer dOMNodePointer0 = new DOMNodePointer(iIOMetadataNode0, locale0, "rRdLo^)|p4U!k");
6.   boolean boolean0 = dOMNodePointer0.isLeaf();
7.   assertTrue(boolean0); }

ASTER (excerpt):
4. public void testisLeaf() throws Exception {
5.   node = mock(Node.class);
6.   domnodepointer = new DOMNodePointer(node, new Locale("en"), "id");
7.   when(node.hasChildNodes()).thenReturn(false);
8.   assertTrue(domnodepointer.isLeaf()); }

(c) Sample test cases generated for Python Tornado Web

CodaMosa (excerpt):
3.   object_dict_0 = module_0.ObjectDict()
4.   var_0 = module_0.timedelta_to_seconds(object_dict_0)
5. except BaseException:
6.   pass

ASTER (excerpt):
3.   self.assertEqual(timedelta_to_seconds(time_delta), 3600.0)
Figure 1: Illustration of naturalness (in terms of test names, variable names, and assertions) and mocking in test cases generated
by the LLM-assisted technique of aster (right) compared with tests generated by EvoSuite [24] and CodaMosa [33] (left).
Although these explorations have demonstrated the potential of LLMs in generating natural-looking test cases, they have also highlighted drawbacks of LLM-based test generation, such as the LLM's tendency to generate tests that often do not compile, achieve low coverage, contain testing anti-patterns, or are redundant within a generated test suite. Moreover, much of this early work explores LLMs off-the-shelf as tools for test generation without considering complete solutions that generate tests with some guarantees (e.g., tests that compile), as conventional tools, such as EvoSuite [22], do. Finally, the empirical studies conducted so far are limited in the LLMs considered, the investigation of test naturalness, and the complexity of the code targeted for test generation.

In this work, we present a technique for performing LLM-assisted test generation guided by program analysis. The technique consists of preprocessing and postprocessing phases that wrap LLM interactions. The preprocessing phase performs static program analysis to compute relevant information to be included in LLM prompts for a given method under test. This ensures that the LLM prompt has sufficient context (similar to the information that a developer would use while writing test cases for a method) and increases the chances that it generates compilable and meaningful test cases. The postprocessing phase checks the generated tests for compilation and runtime errors and constructs new prompts aimed at fixing the errors. At the end of the test-repair process, the technique produces a set of passing test cases and a set of failing tests for the focal method (i.e., the method under test). To increase code coverage, the technique includes a coverage-augmentation phase, in which prompts are crafted for instructing the LLM to generate test cases aimed at exercising uncovered lines of code.

We implemented the technique in a tool, called aster, for two programming languages (PLs), Java and Python, thus demonstrating the feasibility of building multi-lingual unit test generators with LLMs guided by lightweight program analysis. aster also incorporates mocking capability for Java unit test generation, which makes it applicable to applications that perform database operations, implement services, or use complex libraries. We present a generic approach for generating tests with mocks that is extensible to different library APIs. The approach performs static analysis to identify types in the focal class (i.e., the class under test) that need mocking and constructs a structured prompt, which includes a test skeleton with mocking setup, fixtures, and partial test code already created, leaving the test-completion task for the LLM. With such prompts, LLMs are able, in many cases, to generate usable tests for scenarios involving the Java Servlet API [32], etc.

We performed a comprehensive empirical evaluation of aster, using six general-purpose and code LLMs, to study the generated tests in terms of code coverage achieved, mutation score, and naturalness of test code; we also compared aster against state-of-the-art Java and Python unit test generators. Our results show that aster performs very competitively (+2.0%, -0.5%, and +5.1% line, branch, and method coverage) for Java SE applications, while performing significantly better (+26.4%, +10.6%, and +18.5% line, branch, and method coverage) for Java EE applications. For Python, aster has a high coverage gain compared to CodaMosa (+9.8%, +26.5%, and +22.5% line, branch, and method coverage). In terms of naturalness, both our automated evaluation and the developer survey identified that aster-generated test cases are more natural compared to EvoSuite- and CodaMosa-generated tests and, surprisingly, better than even developer-written test cases. The results of our experiments and the naturalness evaluator are available in our artifact [4].

The main contributions of this work include:

• Generic pipeline supporting multiple PLs. We present a generic LLM-assisted test-generation pipeline fueled by static analysis that can be extended for multiple PLs. We evaluate aster on both Java and Python applications.
• Complex enterprise applications. We evaluated aster on complex enterprise applications that often require special test-generation capabilities such as mocking.
• Multiple models. Our evaluation includes a wide variety of models, ranging from closed-source (GPT) to open-source (Llama, Granite), as well as small and large models (8B to >1T).
• Evaluation of test naturalness. We performed an in-depth quantitative and qualitative analysis of test naturalness by developing an automated approach and conducting a survey of 161 professional developers.

Figure 2: Overview of aster. (1), (2), and (3) represent test-generation, test-repair, and coverage-augmentation prompts.
2 MOTIVATION
The primary motivation for LLM-assisted test generation is to overcome the limitation of conventional test generation with respect to the lack of naturalness in the tests it creates. By leveraging the LLM's inherent ability to create natural-looking code, we can generate more readable, comprehensible, and meaningful test cases. Another motivation for LLM-assisted test generation is to build multi-lingual unit test generators—leveraging the LLM's understanding of the syntax and semantics of the multiple PLs on which the models are typically trained. Building such test generators using conventional approaches (e.g., symbolic or evolutionary techniques) can be challenging, and no such test generators exist. In contrast, with lightweight static analysis guiding LLM interactions, a multi-lingual unit test generator can be easily implemented. Finally, an LLM-assisted approach can also address test generation for scenarios requiring mocking.

To illustrate these benefits, Fig. 1 presents sample aster-generated test cases and tests generated by two conventional techniques: EvoSuite [24] for Java and CodaMosa [33] for Python. The aster-generated test cases have more meaningful test names, variable names, and assertions than the EvoSuite- or CodaMosa-generated tests. For instance, consider the test cases for Apache Commons CLI [14]. The variable storing the return value from flatten() is called flattenedArguments in the aster test—clearly capturing the meaning of the stored data—whereas the corresponding variable in the EvoSuite test is called stringArray1 (line 5), which captures simply the data-structure type instead of any meaning of the stored data. Similarly, the Python test case generated by aster, shown in Fig. 1(c), has a meaningful test name and variable names. Moreover, the assertion in the test case is generated taking into account the expected transformation of the input by the focal method, converting the input hour value to seconds. Fig. 1(b) shows a unit test case with mocking of library APIs generated for Apache Commons JXPath [17]. The test mocks the behavior of the org.w3c.dom.Node library class. As we describe later, by creating a test skeleton containing mocking-related fixtures via static analysis, our approach can generate usable test cases for complex applications requiring mocking of API calls.

3 OUR APPROACH
Figure 2 illustrates our LLM-based test-generation technique, guided by program analysis. The process comprises two phases, both utilizing stage-specific prompting: (1) Preprocessing, in which static analysis extracts context and the prompt template (Fig. 3) is used to construct initial prompts for LLM-driven unit test generation in Java and Python, and (2) Postprocessing, in which the generated tests undergo code fixing for correctness and code coverage, and the prompt template is modified with relevant content. This iterative approach leverages LLM capabilities while ensuring test-suite robustness through systematic analysis and refinement.

3.1 Preprocessing
The preprocessing phase performs static analysis of the application to gather all the context that an LLM might require to generate unit tests for Java and Python. The key objective of this stage is to gather all the necessary information pertaining to the focal method and its broader context within the application. This information is used to populate fields 1 and 2 of the prompt template shown in Figure 3. In this section, we discuss the preprocessing steps for Java and how those steps help construct a prompt for the LLM.

Testing scope. The first step in the preprocessing phase identifies the testing scope given a Java class under test (i.e., the focal class) 𝑓𝑐. The testing scope lists the set of focal methods to be targeted for test generation. This set consists of (1) public, protected, and package-visibility methods declared in 𝑓𝑐 and (2) any inherited method implementations from an abstract superclass of 𝑓𝑐. The set excludes inherited methods from a non-abstract superclass, as those methods are targeted for test generation in the context of their declaring class as the focal class. If 𝑓𝑐 is an abstract class, the testing scope consists of the static methods, if any, declared in 𝑓𝑐. The focal class and method are used to populate part d of the test-generation prompt.

Relevant constructors. aster next identifies the relevant constructors for a focal method so that their signatures can be specified to the LLM, enabling it to create the objects required for invoking the focal method. The relevant constructors for a focal method 𝑚 include the constructors of the focal class (if 𝑚 is a virtual method), along with the constructors of each formal parameter type of 𝑚, considering application types only (i.e., ignoring library types). The analysis is done transitively for the formal parameter types of each identified constructor, thus ensuring that the LLM prompt includes comprehensive context information about how to create instances of application types that may be needed for testing 𝑚. The discovered constructors are used to populate part a of the test-generation prompt in Fig. 3.
Figure 3: Templates for composing prompts for test generation, test repair, and coverage augmentation.
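Because Figure 3 is not reproduced here, the following is an illustrative sketch of how the prompt sections described below (constructors a, accessors b, focal class and method d, private-method call chains e, and the mocking skeleton f–h) might be assembled into a test-generation prompt; the field names and wording are assumptions, not the paper's exact template.

```python
TEST_GENERATION_PROMPT = """\
You are generating a JUnit test class for the focal method below.

# (d) Focal class and focal method
{focal_class_signature}
{focal_method_body}

# (a) Relevant constructors (application types only)
{relevant_constructors}

# (b) Accessor methods (setters for object setup, getters for assertions)
{accessor_signatures}

# (e) Call chains reaching private methods from externally visible methods
{private_method_call_chains}

# (f)-(h) Test skeleton with mocking setup; complete the test bodies
{mocking_test_skeleton}

Generate compilable tests with meaningful names, inputs, and assertions.
"""

def build_prompt(context: dict) -> str:
    # context is assumed to hold the strings computed during preprocessing
    return TEST_GENERATION_PROMPT.format(**context)
```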
Relevant auxiliary methods. Accessor methods, which consist of getters and setters, provide a mechanism for reading and modifying internal object state while preserving encapsulation. Our approach identifies setter methods in the focal class and in each formal parameter type of the focal method (if the type is an application class). Signatures of these setters are added to the LLM prompt to help with setting up suitable object states for invoking the focal method in the generated test case. For reading object state, our approach computes getters of the focal class and of the return type of the focal method (limited to application return types), and includes their signatures in the LLM prompt. This can help the LLM generate suitable assertion statements for verifying relevant object state after the focal method call—the assertions can check the state of the receiver object of, or the object returned by, the focal method call. This information is used to fill section b of the test-generation prompt in Fig. 3.

Private methods. Private methods are inaccessible outside the class; therefore, to test them, we need to invoke them through a non-private method of the class. To facilitate this, we compute the class call graph of the focal class, identify call chains from non-private methods to private methods, and provide these call chains to the LLM to enable it to generate test cases that invoke private methods through externally visible methods. This information is used to populate part e of the test-generation prompt in Fig. 3.

Facilitating mocking. Mocking enables the creation of simulated objects that mimic the behavior of actual components. These are particularly useful for simulating external services, databases, or third-party API calls. By facilitating mocking, LLM-generated tests can attempt to verify code behavior and ensure proper component interaction without having to reflect the full implementation details of all dependencies in the application. Our approach uses a deterministic methodology (shown in Alg. 1) to identify relevant candidates for mocking given a focal method (M𝑓𝑜𝑐𝑎𝑙), its corresponding focal class (T𝑓𝑜𝑐𝑎𝑙), and user-specified mockable third-party APIs (T𝑎𝑝𝑖). The computed information is used to construct a test skeleton that is included in the test-generation prompt (shown in sections f, g, and h of Fig. 3). The skeleton includes mocking-related definitions and setup code, leaving the task of filling in the test inputs, sequence, and assertions to the LLM.

Algorithm 1: Identifying mockable fields, types, and scope
Input: focal method 𝑓𝑚, focal class 𝑓𝑐, mockable APIs T𝑎𝑝𝑖𝑠
Output: candidate mockable types T𝑚 and mockable fields F𝑚
1  Function IdentifyMockedFieldsAndParameterTypes:
       ⊲ Identify mockable fields
2      F𝑚, T𝑚 ← ∅, ∅
3      foreach 𝑓 ∈ 𝑓𝑐.fields do
4          if 𝑓.type ∈ T𝑎𝑝𝑖𝑠 then
5              F𝑚 ← F𝑚 ∪ {𝑓}
       ⊲ Identify all mockable types
6      𝑇 ← {𝑓𝑐} ∪ formalParamTypes(𝑓𝑚)
7      while 𝑇 ≠ ∅ do
8          𝑡 ← 𝑇.pop()
9          T𝑚 ← T𝑚 ∪ {𝑡 : 𝑡 ∈ T𝑎𝑝𝑖𝑠}
10         foreach 𝑐 ∈ 𝑡.constructors do
11             𝑇 ← 𝑇 ∪ formalParamTypes(𝑐)
12     return T𝑚, F𝑚

Input: focal method 𝑓𝑚, focal class 𝑓𝑐, and mockable types T𝑎𝑝𝑖𝑠
Output: mockable constructor calls M𝑐, static calls M𝑠, API calls M𝑎
13 Function IdentifyMockingScope:
14     𝑆, M𝑐, M𝑠, M𝑎 ← ∅, ∅, ∅, ∅
       ⊲ Add focal method and constructors to scope
15     𝑆 ← 𝑆 ∪ {𝑓𝑚} ∪ 𝑓𝑐.constructors
       ⊲ Add methods reachable from 𝑓𝑚 within 𝑓𝑐
16     𝑆 ← 𝑆 ∪ {𝑚 ∈ 𝑓𝑐.methods : isReachable(𝑓𝑚, 𝑚)}
       ⊲ Add service entry class methods to scope
17     if isServiceEntryClass(𝑓𝑐) then
18         𝑆 ← 𝑆 ∪ {𝑚 ∈ 𝑓𝑐.methods : isOverridden(𝑚)}
       ⊲ Iterate over all call sites
19     foreach 𝑚 ∈ 𝑆 do
20         foreach 𝑐𝑠 ∈ callSites(𝑚) do
21             if isConstructor(𝑐𝑠) ∧ 𝑐𝑠.type ∈ T𝑎𝑝𝑖𝑠 then
22                 M𝑐 ← M𝑐 ∪ {𝑐𝑠}
23             if isStatic(𝑐𝑠) ∧ 𝑐𝑠.type ∈ T𝑎𝑝𝑖𝑠 then
24                 M𝑠 ← M𝑠 ∪ {𝑐𝑠}
25             if receiverType(𝑐𝑠) ∈ T𝑎𝑝𝑖𝑠 then
26                 M𝑎 ← M𝑎 ∪ {𝑐𝑠}
27     return M𝑐, M𝑠, M𝑎

– Identifying fields to be mocked. First, we discover all the candidate fields (F𝑚) and types (T𝑚) in the application that need to be mocked. To determine the candidate fields, we iterate through the fields defined in the focal class, identifying each field whose type occurs in T𝑎𝑝𝑖 (lines 3–5). To identify all types that need to be mocked, we begin by examining the formal parameter types of the focal method 𝑓𝑚 and the focal class 𝑓𝑐 (line 6). We then perform a transitive search (lines 7–11), iterating through each type to determine whether it belongs to T𝑎𝑝𝑖. For each identified type, we also consider the parameter types of its constructors. This process ensures that all relevant types are included. The identified fields and types to be mocked are used to populate section f of the test-generation prompt (Fig. 3).
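As a concrete illustration of the first function of Algorithm 1, here is a minimal Python sketch that mirrors the field scan (lines 3–5) and the transitive type search (lines 6–11). It is not aster's implementation; the TypeInfo model of the static-analysis results is an assumption.

```python
from dataclasses import dataclass

@dataclass
class TypeInfo:
    name: str
    field_types: list       # declared field type names
    ctor_param_types: list  # formal parameter types across constructors

def identify_mocked_fields_and_types(focal_class: TypeInfo,
                                     focal_method_param_types: list,
                                     mockable_apis: set,
                                     type_index: dict):
    """Return (mockable_types, mockable_field_types) per Algorithm 1, lines 1-12."""
    # Lines 3-5: fields of the focal class whose type is a mockable API type.
    mockable_fields = [t for t in focal_class.field_types if t in mockable_apis]

    # Lines 6-11: transitive walk over the focal class, the focal method's
    # parameter types, and constructor parameter types of visited types.
    mockable_types, seen = set(), set()
    worklist = [focal_class.name, *focal_method_param_types]
    while worklist:
        t = worklist.pop()
        if t in seen:
            continue
        seen.add(t)
        if t in mockable_apis:
            mockable_types.add(t)
        info = type_index.get(t)
        if info is not None:
            worklist.extend(info.ctor_param_types)
    return mockable_types, mockable_fields
```

The identified names would then be turned into the mocking skeleton placed in the prompt (e.g., Mockito-style mock fields and setup for Java).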
– Identifying methods to be stubbed. Next, we identify the scope for creating stubs for mocking. These stubs emulate method calls with "when" and "then" clauses. The "when" clauses define the conditions under which a mock should return a specified value, and the "then" clauses specify the expected behavior once the conditions are met. Our approach for determining this scope is shown in Function IdentifyMockingScope of Alg. 1: we start by adding the focal method (𝑓𝑚) and its class constructors to the scope set. We then include methods reachable from 𝑓𝑚 within the focal class. If 𝑓𝑐 is a service entry class, we also add its overridden methods. We iterate over all methods in the scope set, examining their call sites to classify them as mockable constructor calls, static calls, or API calls based on their types. Information about these classes of calls is used to populate parts g and h of the test-generation prompt (Fig. 3).

Preprocessing for Python. Unlike Java, Python's more flexible structure, where a single file or module can contain a mix of functions, classes, and standalone code, lends itself better to module-scoped test generation. Therefore, for Python test generation, aster targets a module as a whole, including all classes, methods, and functions declared in the module. This module-level approach, together with distinct language features of Python, allows us to omit some preprocessing steps that are necessary for Java but irrelevant for Python. Because our approach provides the entire module to the LLM, instructing the LLM about calls to private methods is unnecessary. Member visibility in Python is specified via naming conventions (names beginning with a single underscore for protected members and double underscores for private members) and there is no strict encapsulation, as these members can be accessed via name mangling. Python's properties feature (using @property decorators) also lets private members be accessed directly. Thus, identification of accessors is also unnecessary for Python test generation. In terms of relevant constructors, the constructor definitions in the focal module are already available to the LLM. Additionally, we add constructors for all imported modules to the LLM prompt. Also, we found that, for Python, adding a few examples of tests helps LLMs produce more predictable output, thus making the postprocessing steps easier. For Java, this was not necessary, but investigation of RAG-based approaches for incorporating in-context learning for test generation is an interesting future research direction.

3.2 Postprocessing
3.2.1 Output sanitization. Generated tests undergo sanitization to remove extraneous content (such as natural language text) and ensure syntactic correctness of the generated test cases.

3.2.2 Error remediation. The sanitized code is compiled to identify any compilation errors. If errors occur, the error message as well as the compiler feedback is analyzed to identify the cause of the error. Section 3 of the LLM template is updated with the erroneous line as well as the previous context to reprompt the LLM in a targeted manner to address the specified problems. Then, the compilable code is executed and runtime issues (assertion failures or runtime errors) are fed to the LLM for fixing. We use a similar prompt to provide the context related to the focal method and the error details.

3.2.3 Increasing coverage of generated tests. Error-free test cases are executed to measure code coverage, which is further analyzed to identify uncovered code segments and then used to update the LLM prompt template (Fig. 3) to guide the LLM to generate test cases that exercise the uncovered code segments. The process is repeated iteratively until the desired code coverage is achieved.

Postprocessing for Python. For Python, we use Pylint [51] for identifying compilation and parsing errors. Specifically, aster focuses on the error category of Pylint and uses this information to guide LLMs in the task of fixing compilation and parsing issues. Then, tests are executed using Pytest and the output is used to provide feedback to the LLMs in case of failures. Finally, Coverage.py [6] is leveraged to perform the coverage-augmentation step.
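To make the Python postprocessing cycle concrete, here is a minimal sketch of the kind of lint-execute-repair loop described above. It is a simplified illustration rather than aster's implementation; the llm_fix callable and the loop bound are assumptions.

```python
import subprocess

def run(cmd):
    """Run a command and return (exit_code, combined output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def repair_loop(test_file, llm_fix, max_rounds=3):
    """Iteratively lint and execute a generated test file, feeding errors back to the LLM."""
    for _ in range(max_rounds):
        # Static check: Pylint's error category (E...) flags parsing/compilation-level issues.
        lint_code, lint_out = run(["pylint", "--disable=all", "--enable=E", test_file])
        if lint_code != 0:
            llm_fix(test_file, f"Fix these Pylint errors:\n{lint_out}")
            continue
        # Dynamic check: run the tests and feed failures back for repair.
        test_code, test_out = run(["pytest", "-x", test_file])
        if test_code != 0:
            llm_fix(test_file, f"Fix these test failures:\n{test_out}")
            continue
        return True  # tests lint cleanly and pass
    return False
```

Coverage augmentation would then rerun the passing tests under Coverage.py and prompt the LLM for tests that target the still-uncovered lines.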
[Figure 4 (charts not reproduced): per-application bar charts of line, branch, and method coverage for aster with GPT-4-turbo, Llama3-70b, CodeLlama-34b, Granite-34b, Llama3-8b, and Granite-8b, and for EvoSuite 1.2.0 and EvoSuite 1.0.6; panels include (a) Commons CLI, (b) Commons Codec, (c) Commons Compress, (d) Commons JXPath, and the Java EE applications.]
Figure 4: Line, branch, and method coverage achieved on Java SE and Java EE applications by aster (configured with different
LLMs) and EvoSuite (GPT-4 run excluded for App X for confidentiality reasons).
4 EXPERIMENT SETUP
4.1 Research Questions
Our evaluation focuses on the following research questions:
RQ1: How effective is aster in code coverage achieved with different models?
RQ2: How natural are aster-generated tests?
RQ3: How effective is aster in mutation scores achieved?
RQ4: What is the effect of different types of context information and postprocessing steps on coverage achieved?
RQ5: How do developers perceive aster-generated tests in terms of their comprehensibility and usability?

4.2 Baseline test-generation tools
We evaluated aster against two state-of-the-art unit test generators. For Java, we used EvoSuite [22], specifically Release 1.2.0. For Python, we selected CodaMosa and used the latest version from its repository [12]. CodaMosa is built on Pynguin [36], a search-based test-generation tool, and improves upon it by leveraging LLM-generated test cases to expand the search space on reaching coverage plateaus. The available version of CodaMosa works with the deprecated code-davinci-002 model; we, therefore, updated the tool to work with GPT-3.5-turbo-instruct, the recommended replacement model for code-davinci-002 [41].

4.3 Models
We selected six models for evaluation, including LLMs of different sizes (ranging from 8B to over 1T parameters), open-source and closed-source LLMs, and LLMs from different model families (GPT [42], Llama-3 [39], and Granite [31]), considering general-purpose and code models. Table 1 provides details of the selected models. Our emphasis was on selecting open-source models along with one frontier model.

Table 1: Models used in the evaluation.
Model Name | Provider | Update Date | Model Size | License | Data Type
GPT-4-turbo | OpenAI | May-24 | 1.76T† | Closed-source | Generic
Llama3-70b | Meta | Apr-24 | 70B | Llama-3 License* | Generic
CodeLlama-34b | Meta | Aug-23 | 34B | Llama-2 License* | Code
Granite-34b | IBM | May-24 | 34B | Apache 2.0 License* | Code
Llama3-8b | Meta | Apr-24 | 8B | Llama-3 License* | Generic
Granite-8b | IBM | May-24 | 8B | Apache 2.0 License* | Code
*Open-source with different licenses. †Model size not confirmed.

4.4 Datasets
Table 2 lists the datasets used in the evaluation. The Java dataset is split into Java Standard Edition (SE) and Enterprise Edition (EE) applications. For Java SE, we selected four Apache Commons projects: Commons CLI [14], Commons Codec [15], Commons Compress [16], and Commons JXPath [17]. For Java EE, we used three open-source applications (CargoTracker [9], DayTrader [19], PetClinic [47]), covering different Java frameworks, and a proprietary enterprise application (called "App X" for confidentiality).
For Python, we started with 486 modules from the CodaMosa artifact [13]. We ran CodaMosa on this dataset, but encountered crashes on 203 of the modules. We excluded those modules, resulting in 286 modules from 20 projects.

Table 2: Java and Python datasets used in the evaluation.
Dataset | Classes/Modules | Methods | NCLOC
Java SE: Commons CLI | 31 | 305 | 2498
Java SE: Commons Codec | 97 | 776 | 9681
Java SE: Commons Compress | 500 | 3650 | 43545
Java SE: Commons JXPath | 180 | 1502 | 20142
Java EE: CargoTracker | 107 | 482 | 5445
Java EE: DayTrader | 148 | 1067 | 11409
Java EE: PetClinic | 23 | 84 | 805
Java EE: App X | 140 | 2111 | 21655
Python Dataset | 283 | 2216 | 38633

4.5 Evaluation metrics
4.5.1 Code coverage and mutation scores. To measure code coverage, we used JaCoCo [21] for Java and Coverage.py [6] for Python. Because Coverage.py does not report method coverage, we implemented custom code for inferring method coverage from line coverage. For both Java and Python, we report data on line, branch, and method coverage. To compute mutation scores, we used PIT [48] for Java and mutmut [30] for Python. Both of these tools take an application and a test suite as inputs, generate mutants for the application by introducing small changes using mutation operators, and compute mutation scores for the test suite. We used the default set of mutation operators available in the tools.

4.5.2 Naturalness. For test cases, we consider naturalness to encompass different characteristics, such as (1) readability in terms of meaningfulness of test and variable names, (2) quality of test assertions, (3) meaningfulness of test sequences, (4) quality of input values, and (5) occurrences of test smells or anti-patterns. Our evaluation focuses on assessing characteristics 1 and 2 quantitatively; additionally, we conducted a developer survey, which provides developer perspectives on characteristics 1–4. For studying occurrences of test smells, we attempted to use a test-smell detection tool for Java, TSDetect [46, 61], but ran into numerous issues with the tool, such as identification of spurious test smells (false positives) and incorrect counts of test smells. We therefore chose not to use the tool, and perform our quantitative evaluation with a custom naturalness checker for characteristics (1) and (2), implemented using Tree-sitter [60] and WALA [67]. We note that our metrics for assessing test assertion quality include some of the test smells detected by TSDetect.
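Before turning to the assertion-quality metrics, here is a minimal sketch of the method-coverage inference mentioned in Section 4.5.1; it is an illustration under assumed inputs, not the authors' code.

```python
def method_coverage(method_line_ranges, covered_lines):
    """Infer method coverage from line coverage.

    method_line_ranges: dict mapping method name -> (start_line, end_line)
    covered_lines: set of executed line numbers reported by Coverage.py
    A method counts as covered if at least one line in its body was executed.
    """
    covered = {
        name for name, (start, end) in method_line_ranges.items()
        if any(line in covered_lines for line in range(start, end + 1))
    }
    return len(covered) / len(method_line_ranges) if method_line_ranges else 0.0

# Example: two of three methods have at least one executed line.
ranges = {"parse": (10, 25), "flatten": (27, 40), "reset": (42, 50)}
print(method_coverage(ranges, covered_lines={12, 13, 30}))  # ~0.67
```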
Measuring assertion quality. For assessing test assertion quality, our implementation uses the following metrics.
• Assertion ratio. This measures the percentage of lines of code with assertions in a test case.
• Tests with no assertions. The percentage of test cases in a test file that have no assertions.
• Tests with duplicate assertions. For a test file, the percentage of tests with assertions that contain duplicate assertions.
• Tests with null assertions. For a test file, the percentage of tests with assertions that contain a null assertion.
• Tests with exception assertions. For a test file, the percentage of tests with assertions that contain an exception assertion.

Measuring meaningfulness of test method names. A unit test name should clearly capture the functionality being tested, by including the focal method name and, if necessary, the functionality being tested. For example, a test that checks whether command-line options are correctly populated when long option values are provided can be named test_getOptions_longOptions. However, tools such as EvoSuite and CodaMosa often generate non-descriptive test names (e.g., test01) that do not reflect the functionality being tested. We developed the following approach for measuring the meaningfulness of test names. First, for a given test file, we identify the focal class, similar to the Method2Test approach [62]. For Python, we skip this step because focal methods can exist inside or outside of classes, and resolving imports is more complex. Instead, we directly identify focal methods, assuming any function call could be a focal method. For Java, we examine all call sites in the test case, matching them with testable methods in the focal classes. This step results in a list of potential focal methods. We then check if one of these methods is mentioned in the test name. If it is, we assign a 50% score for the match. Next, we tokenize the remaining part of the test name after removing "test" keywords. We use camel-case and underscore splitting and generate all possible word combinations by merging them sequentially. For example, test_addOption_longArgs_throwsException is first matched with the focal method addOption, then the name is broken into long, args, and longargs. For exceptions, we match them separately with exceptions thrown in the test body. We calculate the closeness score using the Levenshtein distance by matching these tokens with all code identifiers.

Measuring meaningfulness of variable names in tests. Consider the illustrative example in Fig. 5, showing two test cases that have the same set of steps. Our premise is that the name of a variable of a data-structure type (e.g., String, List) can be meaningful based on its context, whereas the meaningfulness of a variable name for a non-data-structure type depends on both type and context. For instance, a variable of type String named str is not meaningful, as it does not capture any information about the stored data. In contrast, the variable BasicParser parser is meaningful, as the type name itself conveys the meaning of the stored data. Additionally, a variable name's meaningfulness can depend on the context in which it is used. For example, in the case of String[] arguments, the variable name depends on the formal parameter names of the method flatten(Options options, String[] arguments, boolean stopAtNonOption) whose returned value is stored in the variable. Based on this premise, we categorize variables into two groups. Then, depending on the group, we determine whether to match with the data type name, assignment context, and formal parameter names, or simply the assignment context and formal parameter names. Finally, we use the Levenshtein distance to compute the closeness score.

Figure 5: Example of two tests with meaningfulness of method name and variable name.
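The following is a minimal sketch of the test-name scoring just described; it is an approximation for illustration (the token weights and helper names are assumptions): it awards half the score for mentioning a focal method and a Levenshtein-based closeness for the remaining name tokens.

```python
import itertools
import re

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def split_name(name: str):
    """Split a test name on underscores and camel case, dropping 'test' tokens."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
    return [p.lower() for p in parts if p and p.lower() != "test"]

def name_meaningfulness(test_name, focal_methods, code_identifiers):
    score = 0.0
    # Award 50% if the name mentions one of the candidate focal methods.
    focal_hits = [m for m in focal_methods if m.lower() in test_name.lower()]
    if focal_hits:
        score += 0.5
    # Score the remaining tokens against code identifiers via Levenshtein closeness.
    tokens = [t for t in split_name(test_name)
              if not any(t in m.lower() for m in focal_hits)]
    if not tokens:
        return score
    phrases = {"".join(tokens[i:j])
               for i, j in itertools.combinations(range(len(tokens) + 1), 2)}
    ids = [c.lower() for c in code_identifiers]
    best = max((1 - levenshtein(p, c) / max(len(p), len(c))
                for p in phrases for c in ids), default=0.0)
    return score + 0.5 * best

# Example: full credit for naming the focal method and a known identifier.
print(name_meaningfulness("test_addOption_longArgs", ["addOption"], ["longArgs"]))
```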
4.6 Experiment environment
The experiments were conducted on cloud VMs, each equipped with a 48-core Intel(R) Xeon(R) Platinum 8260 processor; the RAM ranged from 128 to 384 GB. We used the OpenAI API [40] to access the GPT models and an internal (proprietary) cloud service to access the other models. We used v1.2.0 of EvoSuite and our updated version of CodaMosa. We performed three runs of test generation with each model for both the Java dataset and the Python dataset. We use temperature = 0.2, which has been used for code-generation tasks [64] and for generating more predictable outputs, with max token length set to 1024 for reducing cost. We use the same settings for all the models.

5 EVALUATION RESULTS
5.1 RQ1: Code Coverage
We collected code coverage in two steps, configuring aster to run with each of the six models. In the first step, we ran aster with its preprocessing and prompting components, followed by the postprocessing, i.e., fixing parsing/compilation/runtime errors and test failures. This step resulted in the initial generated test suite. Next, we performed coverage augmentation, asking the LLM to generate tests targeting lines of code not covered initially. We restricted augmentation to partially covered methods, omitting uncovered methods—the rationale is that methods that remain uncovered after the first step are unlikely to be covered during augmentation.

1. Java SE applications: The top part of Figure 4 presents coverage results for the Java SE applications in our dataset. The charts show, for each application, line, branch, and method coverage achieved by aster configured with different LLMs and by EvoSuite. aster is, in general, very competitive with EvoSuite—within a few percentage points, and even outperforming it, on different coverage metrics—with the best-performing model. For instance, for Commons CLI, aster with Llama3-70B achieves only slightly lower line coverage (-4%) and higher branch coverage (+3%) than EvoSuite. Similarly, on Commons Codec, aster with GPT-4 is slightly lower on line coverage (-7%) and branch coverage (-7%). On Commons Compress, both tools achieve much lower coverage, due to application complexity (Commons Compress provides an API for different file compression and archive formats), but there is not much difference between them. Finally, Commons JXPath highlights the case where aster significantly outperforms EvoSuite, achieving 4x–5x more line and branch coverage with different models. On examining the test cases for Commons JXPath, we found aster's support for mocking to be one of the factors contributing to its superior performance; Figure 1(b) illustrates an example test case. To understand the effect of our mock-generation approach, we conducted an ablation with Commons JXPath. We found that, on average, there is a (relative) loss of 13.7%, 17.4%, and 10.5% in line, branch, and method coverage.

Overall, these results are positive and indicate that LLM-based test generation can match conventional test-generation tools on one of their key strengths (code coverage), while also producing considerably more natural test cases (§5.2) that are preferred by developers (§5.5).

Finding 1: LLM-based test generation guided by static analysis is very competitive with EvoSuite in coverage achieved for Java SE projects, being slightly lower in some cases (-7%) and considerably higher in other cases (4x–5x).

Among the LLMs, we observe that all models perform roughly similarly with respect to line and method coverage, but some differences become apparent on examining branch coverage. Another noteworthy result is that despite GPT-4's significantly larger size compared to the other models, its performance is not far superior to the smaller models; in some cases, the smaller models in fact perform better than GPT-4 (Llama-70b: +1.2%, CodeLlama-34b: +0.1%, Granite-34b: -1.2%, Llama-8b: -1.35%, Granite-8b: -4.7% w.r.t. line coverage).

2. Java EE applications: The bottom part of Figure 4 presents coverage results for Java EE applications. The result for CargoTracker is similar to that for Java SE applications, with aster being competitive with EvoSuite and, in some instances, performing better than it. For PetClinic, EvoSuite could not be run because PetClinic requires Java 17, which is not supported by EvoSuite. This highlights a benefit of LLM-based test generation: it can inherently support more language versions than conventional tools.

Finding 2: For Java EE projects, aster significantly outperforms EvoSuite for most applications (10.6%-26.4% coverage) and is capable of generating test cases for applications where existing approaches fail to do so.

Our findings suggest that smaller models (8b, 34b) can outperform or match the performance of larger models such as Llama3-70b or GPT-4. This is promising because the major drawback of LLM-based test generation is the associated cost. Because this process involves thousands of LLM calls with token-based billing for accessing models through their APIs, the cost can be significantly high. We found that running aster against all the Java applications in our dataset—generating tests, fixing compilation/runtime errors, and performing coverage augmentation—requires >20K LLM calls per model, which can cost thousands of dollars and is not a feasible option for long-term usability.

Finding 3: Smaller models (Granite-34b and Llama-3-8b) demonstrate competitive performance, with only 0.1%, 6.3%, and 2.7% loss in line, branch, and method coverage, compared to larger models (Llama-70b and GPT-4).

Figure 6: Line, branch, and method coverage achieved on Python projects by aster (with different LLMs) and CodaMosa.

3. Python applications: Figure 6 presents coverage results for Python: it shows line, branch, and method coverage achieved by aster configured with the six LLMs compared with CodaMosa over the Python dataset. aster performs considerably better than CodaMosa (CodaMosa: 44%, 53.7%, and 61.2% line, branch, and method coverage; aster + GPT-4: 78% line coverage, 77.2% branch coverage, and 86.7% method coverage). aster with Granite-34B performed even better in method coverage, reaching 89.9%. Notably, the smaller models performed well too. With Granite-8b, aster achieved method, branch, and line coverage of 83%, 71.9%, and 48.3%, respectively. These results validate the effectiveness of aster's LLM-driven approach, demonstrating a clear advantage over conventional techniques.

Finding 4: aster generates Python tests with higher coverage (+9.8%, +26.5%, and +22.5%) for all the models compared to CodaMosa.
Figure 7: Naturalness metrics for Java SE and EE applications (1: GPT-4, 2: Llama3-70b, 3: CodeLlama-34b, 4: Granite-34b, 5:
Llama-8b, 6: Granite-8b, 7: EvoSuite, 8: Developer).
... compared to those generated by EvoSuite. This opens up an intriguing avenue for enhancing LLM-generated tests to reduce smells while maintaining coverage. Notably, aster-generated test cases contain significantly fewer exception-related tests than those generated by EvoSuite. In terms of readability metrics, aster-generated test cases outperform those generated by EvoSuite. However, there are still several instances where aster-generated tests have names that lack meaningfulness. Additionally, our automated approach has certain limitations, such as not capturing natural-language descriptions in test names effectively. For example, in testPrintGainHTML_PositiveGain, the term PositiveGain indicates a positive computation result, which is not directly reflected in the code body. Improving the technique to better capture natural-language intents in test and variable names could be another direction for future work.

2. Python applications: CodaMosa consistently generated test cases without any assertions, whereas aster generates test cases with assertions in most cases. None of the developer-written, aster-generated, or CodaMosa-generated tests contain null assertions or exception assertions. Our analysis found that developers prefer to have around 4–5 lines of test code for each assertion, which is similar to aster-generated cases. For both variable and test-name naturalness, aster achieved scores very similar to developer-written test cases, while CodaMosa had much lower naturalness scores (figure in artifact).

Finding 5: aster-generated test cases have more meaningful test and variable names compared to EvoSuite and CodaMosa tests but still lack in terms of test smells, although aster produces fewer exception-related assertions.

Table 3: Mutation scores.

5.3 RQ3: Mutation Scores
1. Java applications: Table 3 shows the mutation scores achieved by CodaMosa, EvoSuite, and aster, indicating the effectiveness of each tool in detecting code mutations. For Java SE applications, EvoSuite consistently demonstrates superior performance (60%-75%), whereas aster achieved 22.3%–35.0% mutation coverage. This is expected, as EvoSuite sets mutation goals during test generation and improves the tests to enhance mutation coverage, and it indicates scope for improving LLM-based test generation with mutation-coverage goals. Conversely, aster outperforms EvoSuite consistently on the EE applications, which correlates with aster's higher code coverage achieved on these applications than EvoSuite.

2. Python applications: CodaMosa performed the best with 8.8% mutation scores, followed by aster with GPT-4, which achieved 7.6% mutation coverage. Similar to EvoSuite, CodaMosa uses mutation coverage to evolve test cases during generation, thus benefiting from this approach. Overall, both tools achieve low mutation scores.

Finding 6: EvoSuite achieves considerably higher mutation scores than aster on Java SE applications due to its use of mutation-coverage goals during test generation, whereas aster outperforms EvoSuite on the EE applications.
[Figure 9 (charts not reproduced): stacked-bar survey responses, from "strongly agree" to "strongly disagree," for Q10–Q17 (understand test case, understand assertion, understand purpose, test adds value, meaningful method name, meaningful variable name, meaningful inputs, meaningful test sequence), grouped by EvoSuite, ASTER-Java, and Java developer tests (left) and CodaMosa, ASTER-Py, and Python developer tests (right).]
Figure 9: Survey responses on test quality (Q10–Q17) for Java test cases (left) and Python test cases (right).
5.4 RQ4: Ablation Study
We used two Java applications (Commons CLI and PetClinic) and the Python dataset. We used Granite-34B, as it offers a good balance between size and competitiveness compared to other models.

Ablation data (coverage deltas, in %):
Line: -13.3 -7.2 -8.5 -7.5 -22.4 -17.0 -18.9 +11.9 +2.3 +17.6 +20.3 +4.3 +20.1
Branch: -23.1 -26.1 -28.9 -21.3 -43.3 -8.9 -7.6 +14.9 +2.4 +26.3 +10.9 +1.1 +8.5
Method: -9.9 -0.2 -3.0 -1.4 -14.8 -10.6 -8.9 +10.1 +1.6 +12.5 +13.1 -0.3 +10.3
(CS: constructor, EAC: extend abstract class, PM: private methods, AM: auxiliary methods, NAC: no additional context, FS: few shot, CF: compilation error fix, RF: runtime error fix, CA: coverage augmentation. All data are in %.)

5.5 RQ5: Developer Survey
We used two Java (Commons CLI and Commons JXPath) and three Python applications (Ansible [1], Tornado Web [59], and Flutes [23]) for the study. We designed the survey to present, for each focal method, a pair of test cases, where the pair could be (aster, EvoSuite/CodaMosa), (aster, developer), or (developer, EvoSuite/CodaMosa) tests—without indicating the source of a test case. To select focal methods, we first mapped each test case, from the three sources, to its focal method, and removed focal methods that did not have at least one source-pair test. Then, we randomly selected focal methods and a pair of tests for each method, while ensuring that the participant would see an equal number of tests from the three sources. In total, the survey had 9 focal methods and 18 test cases, with 6 tests from each test source. The survey presents the participant with the focal method's body, a brief task description, and a GitHub repository URL (using which the participant could, if needed, browse the focal class and its broader context).

Table 5: Survey questions categorized into two groups.
Background:
Q4. Level of expertise in Java (MCQ)
Q5. Level of expertise in Python (MCQ)
Q6. Prior experience with automated test generation (MCQ)
Q7. Prior experience with automated test generation (if yes, list the tools used in the past) (MCQ, Open)
Q8. How often do you write unit tests for your code? (MCQ)
Q9. On average, how long do you spend writing good unit tests for a single method? (MCQ)
Test Quality:
Q10. I understand what this test case is doing (Likert)
Q11. I understand what the assertions in this test are checking (Likert)
Q12. I can describe the purpose of this test case (Likert)

The survey received 161 responses, with participants coming from various roles, such as software developer, QA engineer, principal solution architect, and research scientist. In terms of experience, 69.7% of the participants have >10 years of experience, with 22.2% exceeding 25 years. Most participants have very strong industry experience, with 48.1% of the participants being in industry for 10+ years. Participants also have a high level of proficiency in Java and Python, with 71% and 62% reporting professional or experienced levels, respectively. A few participants reported having previously used code assistants, such as JetBrains Code with Me [7] and GitHub Copilot [26], as aids in test generation. Additionally, all participants agreed that writing test cases is a tedious task and 90.4% said that writing a good unit test takes more than five minutes.

For Q10–Q18, participants provided responses on a five-point scale, ranging from "strongly agree" to "strongly disagree". The results, shown in Figure 9, indicate a strong preference for aster-generated test cases over EvoSuite and CodaMosa for all aspects evaluated. For example, on Q10, 91.6% of the respondents agreed or strongly agreed that they understood the aster-generated Java test cases, which is considerably higher than their ratings for EvoSuite tests (67.7%) and the developer-written tests (69.1%). On this question, the difference for the Python tests is much higher, with 100% positive responses on aster-generated tests, compared with 84.0% for developer-written tests and only 45.1% for the CodaMosa tests. Overall, on all the questions, the participants rated aster-generated tests higher than EvoSuite/CodaMosa-generated tests and even developer-written test cases in most cases. Finally, on Q18 (Figure 10), participants indicated a significantly stronger preference for incorporating tests similar to aster-generated ones into their test suites with minor or no changes—70% for Java and 88% for Python—compared to EvoSuite and CodaMosa tests. On this question too, aster-generated tests score higher than developer-written tests.

Finding 8: Developers prefer aster-generated tests over EvoSuite and CodaMosa tests, with over 70% also willing to add such tests with minor or no changes to their test buckets. aster-generated tests are also favored slightly over developer-written tests in these characteristics.

6 RELATED WORK
Compared to previous works [2, 24, 27, 36, 43, 44, 54, 57, 63] on conventional testing approaches, we leverage LLMs to generate more natural unit tests while supporting multiple PLs. There have been several works that attempted to use LLMs for test generation [5, 52, 53, 62, 65, 69]. The closest works are ChatTester [72] and ChatUnitTest [71] (arXiv), which have used the GPT model to generate test cases given information related to the class under test and have shown that it performs competitively. However, this approach is specifically tied to GPT, and in many enterprise use cases, this approach may not work. Recently, Pizzorno and Berger [49] (arXiv) use an LLM as a coverage-augmenting approach starting with tests generated by CodaMosa. Compared to that work, aster (1) supports more PLs, (2) generates tests from scratch, (3) creates more natural tests (based on our automated analysis and the developer survey) compared to CodaMosa, on which the prior work depends, and (4) generates tests with higher coverage (prior work: line: +7.3%, branch: +10.3%; aster: line: +37.1%, branch: +24.9%, with GPT-4, on which the prior work has been evaluated) compared to CodaMosa.

7 DISCUSSION
Our study identified several pros and cons of using LLMs for test generation. Although aster addresses some of them, there remains significant room for further research.

Supporting more PLs. Although LLMs are trained on various PLs, using them off-the-shelf may not always produce the best results. However, when combined with lighter-weight program analysis, their performance is often very competitive with, and sometimes surpasses, existing approaches. This reduced need for complex program analysis makes it easier to support more PLs.

Multiple frameworks and versions. Another benefit of using LLMs for test generation is their ability to support multiple frameworks and PL versions. For instance, state-of-the-art tools such as EvoSuite do not support beyond Java 11—posing a significant challenge for practical usage. LLMs can play a significant role in bridging this gap.

Increasing naturalness of existing approaches. In this work, we developed an automated method to assess the naturalness of test cases and discovered that LLM-generated tests are notably more natural than those produced by existing tools or written by developers. We can leverage a similar approach to (1) enhance the naturalness of test cases generated by existing approaches and (2) guide LLMs to produce more natural code by incorporating this aspect as a pretraining objective.

Beyond unit testing. We evaluated LLMs' capability in unit test generation. However, we believe that there is significant potential in using LLMs to go beyond unit testing to other testing levels—module, integration, and system—for which there is limited support among the current test-generation tools, especially for white-box testing.

Affordable models. One of the main challenges of using LLMs is the associated cost. LLM-based approaches require substantial computing resources, making them less accessible for developers compared to existing methods.
8 THREATS TO VALIDITY
In this section, we identify the most relevant threats to validity and discuss how we mitigated them.

External threat. To address the threat related to the generalizability of aster, we extended support to (a) multiple PLs and (b) multiple models of varying sizes and modalities, drawn from different model families.

Internal threat. One potential threat to internal validity is the limited number of evaluation runs. While previous studies have performed more than ten runs, the substantial cost associated with running evaluations across several large applications and six different LLMs led us to limit the runs. Another threat is the automated naturalness evaluation; to mitigate it, we conducted a survey of professional developers and found that its findings are very similar to those of the automated assessment.

Construct threat. To ensure construct validity, we measured coverage and mutation scores using widely used tools such as JaCoCo [21], Coverage.py [6], PIT [48], and Mutmut [30].
9 SUMMARY AND FUTURE WORK
In this paper, we presented aster, a multi-lingual test-generation tool that leverages LLMs guided by lightweight static analysis to generate natural and effective unit test cases for Java and Python. Through its preprocessing component, aster ensures that LLM prompts have adequate context required for generating unit tests for a focal method. aster’s postprocessing component performs iterative test repair and coverage augmentation. Our extensive evaluation, with six LLMs on a dataset of Java SE, Java EE, and Python applications, showed that aster is competitive with state-of-the-art tools in coverage achieved on Java SE applications, and outperforms them significantly on Java EE and Python applications, while also producing considerably more natural tests than those tools. Our developer survey, with over 160 participants, highlighted the naturalness characteristics of aster-generated tests and their usability for building automated test suites. Future research directions include extending aster to other PLs and levels of testing (e.g., integration testing), creating fine-tuned models for testing to reduce the cost of LLM interactions, and exploring techniques for improving the fault-detection ability of the generated tests.
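To make the iterative-repair idea above concrete, the following is a minimal, hypothetical sketch of a compile-and-repair loop; it is not aster’s actual implementation, and the LlmClient, TestCompiler, and CompileResult types are assumptions introduced only for this illustration.

// Hypothetical interfaces standing in for an LLM client and a compiler front end.
interface LlmClient {
    String generateTest(String focalMethodPrompt);
    String repairTest(String testSource, String diagnostics);
}

interface TestCompiler {
    CompileResult compile(String testSource);
}

record CompileResult(boolean success, String diagnostics) { }

class TestRepairLoop {
    // Generate a test, then iteratively feed compiler diagnostics back to the
    // model until the test compiles or the attempt budget is exhausted.
    static String generateCompilableTest(LlmClient llm, TestCompiler compiler,
                                         String focalMethodPrompt, int maxAttempts) {
        String test = llm.generateTest(focalMethodPrompt);
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            CompileResult result = compiler.compile(test);
            if (result.success()) {
                return test;   // compilable test: hand off to coverage augmentation
            }
            test = llm.repairTest(test, result.diagnostics());
        }
        return test;           // still failing after the budget; caller may discard it
    }
}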
REFERENCES
[1] ansible 2024. Ansible. https://github.com/ansible/ansible
[2] Andrea Arcuri. 2019. RESTful API automated test case generation with EvoMaster. ACM Transactions on Software Engineering and Methodology (TOSEM) 28, 1 (2019), 1–37.
[3] Andrea Arcuri and Lionel Briand. 2011. Adaptive Random Testing: An Illusion of Effectiveness?. In Proceedings of the 2011 International Symposium on Software Testing and Analysis. 265–275. https://doi.org/10.1145/2001420.2001452
[4] asterartifact 2024. ASTER Artifact. https://anonymous.4open.science/r/aster-54FC/
[5] Patrick Bareiß, Beatriz Souza, Marcelo d’Amorim, and Michael Pradel. 2022. Code generation tools (almost) for free? A study of few-shot, pre-trained language models on code. arXiv preprint arXiv:2206.01335 (2022).
[6] Ned Batchelder. [n. d.]. Coverage.py: Code coverage measurement for Python. https://coverage.readthedocs.io/. Accessed: 2024-07-27.
[7] JetBrains. 2024. Code With Me. https://www.jetbrains.com/code-with-me
[8] Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation. 209–224.
[9] cargotracker 2024. Eclipse Cargo Tracker. https://github.com/eclipse-ee4j/cargotracker
[10] Tsong Yueh Chen, Fei-Ching Kuo, Robert G. Merkel, and T. H. Tse. 2010. Adaptive Random Testing: The ART of Test Case Diversity. J. Syst. Softw. 83, 1 (Jan. 2010), 60–66. https://doi.org/10.1016/j.jss.2009.02.022
[11] Ilinca Ciupa, Andreas Leitner, Manuel Oriol, and Bertrand Meyer. 2008. ARTOO: Adaptive Random Testing for Object-Oriented Software. In Proceedings of the 30th International Conference on Software Engineering. 71–80.
[12] codamosa 2024. CodaMosa. https://github.com/microsoft/codamosa
[13] codamosaartifact 2024. CodaMOSA Artifact. https://github.com/microsoft/codamosa/tree/main/replication
[14] commonscli 2024. Apache Commons CLI. https://github.com/apache/commons-cli
[15] commonscodec 2024. Apache Commons Codec. https://github.com/apache/commons-codec
[16] commonscompress 2024. Apache Commons Compress. https://github.com/apache/commons-compress
[17] commonsjxpath 2024. Apache Commons JXPath. https://github.com/apache/commons-jxpath
[18] Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C. Desmarais. 2023. Effective test generation using pre-trained large language models and mutation testing. arXiv preprint arXiv:2308.16557 (2023).
[19] daytrader 2024. DayTrader8 Sample. https://github.com/OpenLiberty/sample.daytrader8
[20] Vinicius H. S. Durelli, Rafael S. Durelli, Simone S. Borges, Andre T. Endo, Marcelo M. Eler, Diego R. C. Dias, and Marcelo P. Guimarães. 2019. Machine Learning Applied to Software Testing: A Systematic Mapping Study. IEEE Transactions on Reliability 68, 3 (2019), 1189–1212. https://doi.org/10.1109/TR.2019.2892517
[21] EclEmma. [n. d.]. JaCoCo: Java Code Coverage Library. Accessed: 2024-07-27.
[22] evosuite 2024. EvoSuite: Automatic Test Suite Generation for Java. https://www.evosuite.org/
[23] flutes 2024. Flutes. https://github.com/huzecong/flutes
[24] Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic Test Suite Generation for Object-Oriented Software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. 416–419.
[25] Gordon Fraser, Matt Staats, Phil McMinn, Andrea Arcuri, and Frank Padberg. 2015. Does Automated Unit Test Generation Really Help Software Testers? A Controlled Empirical Study. ACM Trans. Softw. Eng. Methodol., Article 23 (Sept. 2015). https://doi.org/10.1145/2699688
[26] GitHub. 2024. GitHub Copilot. https://github.com/features/copilot
[27] Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation. 213–223.
[28] Mark Harman and Phil McMinn. 2010. A Theoretical and Empirical Study of Search-Based Testing: Local, Global, and Hybrid Search. IEEE Transactions on Software Engineering 36, 2 (2010), 226–247. https://doi.org/10.1109/TSE.2009.71
[29] Sepehr Hashtroudi, Jiho Shin, Hadi Hemmati, and Song Wang. 2023. Automated test case generation using code models and domain adaptation. arXiv preprint arXiv:2308.08033 (2023).
[30] Anders Hovmöller. [n. d.]. Mutmut: Mutation Testing for Python. https://mutmut.readthedocs.io/. Accessed: 2024-07-27.
[31] IBM. 2024. Granite Code Models. https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330
[32] javaservletspec 2024. Jakarta Servlet. https://jakarta.ee/specifications/servlet
[33] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. 2023. CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 919–931.
[34] Yun Lin, You Sheng Ong, Jun Sun, Gordon Fraser, and Jin Song Dong. 2021. Graph-Based Seed Object Synthesis for Search-Based Unit Testing. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1068–1080. https://doi.org/10.1145/3468264.3468619
[35] Yu Lin, Xucheng Tang, Yuting Chen, and Jianjun Zhao. 2009. A Divergence-Oriented Approach to Adaptive Random Testing of Java Programs. In 2009 IEEE/ACM International Conference on Automated Software Engineering. 221–232. https://doi.org/10.1109/ASE.2009.13
[36] Stephan Lukasczyk and Gordon Fraser. 2022. Pynguin: Automated unit test generation for Python. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 168–172.
[37] Stephan Lukasczyk, Florian Kroiß, and Gordon Fraser. 2023. An empirical study of automated unit test generation for Python. Empirical Software Engineering 28, 2 (2023). https://doi.org/10.1007/s10664-022-10248-w
[38] Phil McMinn. 2004. Search-based Software Test Data Generation: A Survey: Research Articles. Softw. Test. Verif. Reliab. 14, 2 (June 2004), 105–156.
[39] Meta. 2024. Meta Llama 3. https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
[40] OpenAI. 2024. OpenAI API. https://platform.openai.com/docs/api-reference/introduction
[41] OpenAI. 2024. OpenAI Deprecations. https://platform.openai.com/docs/deprecations
[42] OpenAI. 2024. OpenAI Models. https://platform.openai.com/docs/models/gpt-base
[43] Carlos Pacheco and Michael D. Ernst. 2007. Randoop: Feedback-directed random testing for Java. In Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications Companion. 815–816.
[44] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. 2007. Feedback-directed random test generation. In 29th International Conference on Software Engineering (ICSE’07). IEEE, 75–84.
[45] Annibale Panichella, Sebastiano Panichella, Gordon Fraser, Anand Ashok Sawant, and Vincent J. Hellendoorn. 2020. Revisiting Test Smells in Automatically Generated Tests: Limitations, Pitfalls, and Opportunities. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). 523–533. https://doi.org/10.1109/ICSME46990.2020.00056
[46] Anthony Peruma, Khalid Almalki, Christian D. Newman, Mohamed Wiem Mkaouer, Ali Ouni, and Fabio Palomba. 2020. tsDetect: An open source test smells detection tool. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1650–1654. https://doi.org/10.1145/3368089.3417921
[47] petclinic 2024. Spring PetClinic Sample Application. https://github.com/spring-projects/spring-petclinic
[48] PITEST. [n. d.]. PIT Mutation Testing. Accessed: 2024-07-27.
[49] Juan Altmayer Pizzorno and Emery D. Berger. 2024. CoverUp: Coverage-Guided LLM-Based Test Generation. arXiv preprint arXiv:2403.16218 (2024).
[50] Corina S. Păsăreanu and Neha Rungta. 2010. Symbolic PathFinder: Symbolic Execution of Java Bytecode. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. 179–180. https://doi.org/10.1145/1858996.1859035
[51] pylint dev. [n. d.]. Pylint. Accessed: 2024-07-27.
[52] Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray. 2024. Code-Aware Prompting: A Study of Coverage-Guided Test Generation in Regression Setting using LLM. FSE (2024).
[53] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering (2023).
[54] Koushik Sen, Darko Marinov, and Gul Agha. 2005. CUTE: A concolic unit testing engine for C. ACM SIGSOFT Software Engineering Notes 30, 5 (2005), 263–272.
[55] Mohammed Latif Siddiq, Joanna C. S. Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Carvalho Lopes. 2024. Using Large Language Models to Generate JUnit Tests: An Empirical Study. (2024).
[56] Yutian Tang, Zhijie Liu, Zhichao Zhou, and Xiapu Luo. 2024. ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation. IEEE Transactions on Software Engineering (2024), 1–19. https://doi.org/10.1109/TSE.2024.3382365
[57] Nikolai Tillmann and Jonathan De Halleux. 2008. Pex: White box test generation for .NET. In International Conference on Tests and Proofs. Springer, 134–153.
[58] Paolo Tonella. 2004. Evolutionary Testing of Classes. In Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis. 119–128. https://doi.org/10.1145/1007512.1007528
[59] tornadoweb 2024. Tornado Web Server. https://github.com/tornadoweb/tornado
[60] treesitter 2024. Tree-sitter. https://tree-sitter.github.io/tree-sitter
[61] tsdetect 2024. TSDetect. https://github.com/TestSmells/TSDetect
[62] Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2020. Unit test case generation with transformers and focal context. arXiv preprint arXiv:2009.05617 (2020).
[63] Rachel Tzoref-Brill, Saurabh Sinha, Antonio Abu Nassar, Victoria Goldin, and Haim Kermany. 2022. TackleTest: A tool for amplifying test generation via type-based combinatorial coverage. In 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 444–455.
[64] Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. 2024. Improving LLM code generation with grammar augmentation. arXiv preprint arXiv:2403.01632 (2024).
[65] Vasudev Vikram, Caroline Lemieux, and Rohan Padhye. 2023. Can large language models write good property-based tests? arXiv preprint arXiv:2307.04346 (2023).
[66] Willem Visser, Corina S. Pasareanu, and Sarfraz Khurshid. 2004. Test Input Generation with Java PathFinder. In Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis. 97–107. https://doi.org/10.1145/1007512.1007526
[67] wala 2024. WALA. https://github.com/wala/WALA
[68] Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering (2024).
[69] Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering (2024).
[70] Tao Xie, Darko Marinov, Wolfram Schulte, and David Notkin. 2005. Symstra: A Framework for Generating Object-Oriented Unit Tests Using Symbolic Execution. In Proceedings of the 11th International Conference on Tools and Algorithms for the Construction and Analysis of Systems. 365–381. https://doi.org/10.1007/978-3-540-31980-1_24
[71] Zhuokui Xie, Yinghao Chen, Chen Zhi, Shuiguang Deng, and Jianwei Yin. 2023. ChatUniTest: A ChatGPT-based automated unit test generation tool. arXiv preprint arXiv:2305.04764 (2023).
[72] Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. 2024. Evaluating and Improving ChatGPT for Unit Test Generation. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1703–1726.