AUITestAgent: Automatic Requirements Oriented GUI Function Testing
Yongxiang Hu∗, Xuan Wang∗, Yingchuan Wang∗
School of Computer Science, Fudan University, Shanghai, China
ABSTRACT
The Graphical User Interface (GUI) is how users interact with mobile apps. To ensure it functions properly, testing engineers have to verify that it behaves as intended, based on test requirements that are typically written in natural language. While widely adopted manual testing and script-based methods are effective, they demand substantial effort due to the vast number of GUI pages and rapid iterations in modern mobile apps. This paper introduces AUITestAgent, the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification. Since test requirements typically contain interaction commands and verification oracles, AUITestAgent extracts GUI interactions from test requirements via dynamically organized agents. Then, AUITestAgent employs a multi-dimensional data extraction strategy to retrieve data relevant to the test requirements from the interaction trace and perform verification. Experiments on customized benchmarks¹ demonstrate that AUITestAgent outperforms existing tools in the quality of generated GUI interactions and achieves a verification accuracy of 94%. Moreover, field deployment at Meituan has shown AUITestAgent's practical usability, with it detecting 4 new functional bugs during 10 regression tests in two months.

KEYWORDS
Automatic Testing, Mobile Apps, Functional Bug, In-context Learning

ACM Reference Format:
Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, and Yangfan Zhou. 2024. AUITestAgent: Automatic Requirements Oriented GUI Function Testing. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

∗ This work was performed during the authors' internship at Meituan.
¹ https://github.com/bz-lab/AUITestAgent

1 INTRODUCTION
Since the GUI (Graphical User Interface) is the medium through which users interact with mobile apps, developers need to verify that it functions as expected. The most direct approach to this verification is to check the UI functionality against the requirements, specifically by executing GUI interactions and checking the GUI responses. Despite being effective, the vast number of GUI pages and the rapid pace of app iterations result in a large volume of testing tasks. Therefore, to avoid expending a large amount of human effort, automation is essential.

Since test requirements are usually expressed in natural language, we call this form of testing natural language-driven GUI testing for mobile apps, which remains an underexplored field. The most relevant work is natural language-driven GUI interaction [10, 28, 33]. These approaches typically utilize large language models (LLMs) to analyze natural language commands and GUI pages, and ultimately generate interaction commands. However, unlike app interaction tasks, which focus on the interaction result and are goal-oriented, GUI testing requires checking the correctness of GUI responses during the interaction process and is, therefore, step-oriented. Consequently, the trial-and-error strategies commonly used in these interaction methods often result in inefficient interaction traces and an insufficient success rate for GUI testing.
3 TOOL DESIGN AND TESTING WORKFLOW

3.1 Overview
Figure 2 presents an overview of AUITestAgent, which comprises two main modules: GUI interaction and function verification. Specifically, AUITestAgent first analyzes the GUI test requirements to extract interaction commands and oracles for function verification. The GUI interaction module then dynamically selects the appropriate agent organization pattern based on the complexity of the interaction commands. If the commands consist of specific steps, the interactions are performed directly by the Executor. Otherwise, if the commands are expressed concisely, a more complex pattern involving a Planner and a Monitor is used. Finally, the function verification module analyzes the interaction logs based on the test oracles. These logs contain information about the GUI pages encountered during the interactions, including screenshots, LLM outputs, and the UI actions performed. AUITestAgent extracts oracle-relevant data from the interaction logs and outputs the verification results along with an analysis of the reasons.

3.2 Natural Language Driven GUI Interaction
As discussed in Section 2, the interaction commands in different GUI test requirements vary significantly. As shown in Figure 1(a), explicit commands specify the target elements and the corresponding UI actions. In contrast, concise commands such as the one in Figure 1(b) may abstract multiple UI actions into a single interaction command, which is more challenging to perform. Such differences require AUITestAgent to dynamically organize suitable agents to collaborate, ensuring the interaction commands are executed accurately. Specifically, concise interaction commands require the agent to accurately interpret the user's intent behind the commands, assess the current state from GUI screenshots and interaction records, and determine the appropriate UI actions to take. Therefore, performing concise commands demands interaction memory and reasoning capabilities, which are difficult to achieve with a single agent due to context length limitations. In contrast, while detailed interaction commands are easier to execute, using multiple agents for them incurs unnecessary cost. It may also amplify the hallucination problem [16] of LLMs due to cumulative effects, thereby reducing the quality of the generated UI actions.

To address these challenges, the GUI interaction module of AUITestAgent employs two agent organization patterns, the Specific steps pattern and the Concise expression pattern, as shown in Figure 3. The GUI interaction module first uses LLMs to recognize the type of the input interaction commands. If the interaction commands consist of a series of specific steps, AUITestAgent converts them into a list of single-step UI actions via the specific steps pattern. Otherwise, if the test requirement contains concise interaction commands, the GUI interaction module uses the concise expression pattern to handle them.

3.2.1 Specific steps pattern. The specific steps pattern executes each operation in the operation list sequentially through the collaboration of three LLM agents: the Observer, the Selector, and the Executor.

The Observer agent identifies all interactive UI elements on the current GUI page and infers their functions. Specifically, the Observer processes the screenshot of the current GUI page using Set-of-Mark [34], assigning numeric IDs and rectangular boxes to all interactive elements. This approach highlights the interactive elements in the GUI screenshot, making them easier for LLMs to recognize. Finally, a multi-modal large language model (MLLM) is used to infer the function of each numbered UI element.

Although MLLMs possess image analysis capabilities, they do not perform well at observing GUI images, especially at recognizing UI elements and analyzing their functions, due to the complexity of GUIs [14]. Besides, MLLMs are prone to Optical Character Recognition (OCR) hallucinations when recognizing text in images [15], while text is crucial for understanding the functionality of many elements within the GUI. To tackle these challenges, we enhance the GUI analysis capability of the Observer agent through multi-source input and knowledge-base-augmented generation.

Specifically, we select GUI screenshots and the UI hierarchy file [3] as the Observer's input. For the UI hierarchy file, the Observer first parses it to identify all interactive nodes (i.e., those whose clickable, enabled, scrollable, or long-clickable attribute is set to true). Related attributes of these nodes are also collected, including coordinate information (from the 'bounds' attribute), textual information (from the 'text' and 'content-desc' attributes), and functional information (e.g., nodes with an "EditText" class are likely to be text input fields). For GUI screenshots, the Observer uses the vision-ui model from Meituan [4] to identify all elements in the GUI and performs OCR to recognize the text visible to users. The recognition results serve as a supplement to the UI hierarchy, especially in cases where obtaining the UI hierarchy file fails or when the page contains WebView [7] components.
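As a rough illustration of this multi-source input, the sketch below shows how interactive nodes and their attributes could be collected from a standard uiautomator XML dump; the attribute names are those of the Android UI hierarchy, while the function and field names are ours rather than AUITestAgent's actual implementation.

```python
import xml.etree.ElementTree as ET

# Attributes that mark a node as interactive, following the criterion in Section 3.2.1.
INTERACTIVE_ATTRS = ("clickable", "enabled", "scrollable", "long-clickable")

def parse_bounds(bounds: str):
    # "bounds" looks like "[x1,y1][x2,y2]"; return the center point of the element.
    (x1, y1), (x2, y2) = [tuple(map(int, p.split(","))) for p in bounds[1:-1].split("][")]
    return (x1 + x2) // 2, (y1 + y2) // 2

def collect_interactive_nodes(xml_dump: str) -> list[dict]:
    """Collect candidate elements from a UI hierarchy dump (e.g., `adb shell uiautomator dump`)."""
    elements = []
    for node in ET.fromstring(xml_dump).iter("node"):
        if not any(node.get(attr) == "true" for attr in INTERACTIVE_ATTRS):
            continue
        elements.append({
            "id": len(elements) + 1,                       # numeric ID later drawn on the screenshot
            "center": parse_bounds(node.get("bounds", "[0,0][0,0]")),
            "text": node.get("text", ""),
            "content_desc": node.get("content-desc", ""),
            "class": node.get("class", ""),                # e.g., android.widget.EditText -> input field
        })
    return elements
```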
In order to make it easier for LLMs to describe the functions of elements, the Observer then marks the numerical ID and rectangular box of each element on the GUI image based on their coordinates. Finally, a system prompt that contains the processed screenshot and the corresponding elements' data, including element IDs, text information, and other property information, is assembled. This prompt is then used as input to the MLLM to infer the function of UI elements.

Given that the LLM is not well acquainted with the design of mobile apps, particularly those in Chinese, we have observed that the Observer agent occasionally mispredicts the functions of UI elements, especially UI icons that lack textual information. To tackle this problem, a UI element function knowledge base is built to enhance element recognition. For instance, we manually collect and label some elements that are difficult to recognize, primarily icons that do not contain text. Each element in the knowledge base contains its screenshot, appearance description, and function. When processing the GUI, the Observer uses CLIP [26] to match elements from the knowledge base with elements on the current page and then includes the matched elements' appearance descriptions and functions in the system prompt for the MLLM. This helps it infer the functions of these error-prone elements.
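A minimal sketch of this knowledge-base matching step is given below, assuming CLIP image embeddings (here via the Hugging Face transformers CLIP model) and cosine similarity; the model choice, threshold, and entry format are illustrative assumptions, not the exact setup used in AUITestAgent.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative model choice; the knowledge-base entry format follows Section 3.2.1.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feat, dim=-1)

def match_knowledge_base(element_crop: Image.Image, knowledge_base: list[dict], threshold: float = 0.85):
    """Return the description and function of the most similar labeled icon, if any.

    Each entry is assumed to hold {"screenshot": Image, "appearance": str, "function": str}.
    """
    query = embed(element_crop)
    best, best_sim = None, threshold
    for entry in knowledge_base:
        sim = (query @ embed(entry["screenshot"]).T).item()  # cosine similarity of normalized vectors
        if sim > best_sim:
            best, best_sim = entry, sim
    return None if best is None else {"appearance": best["appearance"], "function": best["function"]}
```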
The Selector agent selects the operation target from the elements listed by the Observer based on the natural language description of a single-step command. It generates the necessary parameters for the GUI action and then calls the corresponding function to perform it. Specifically, we have defined the following UI actions for the Selector:
• click(target: int): This function is used to click an element listed by the Observer. The parameter target is the element's ID.
• longPress(target: int): This function is similar to click, but it long-presses the given element.
• type(target: int, text: str): This function is used to type the string 'text' into an input field. Note that the target parameter is the ID of the input field. The function clicks on this element before typing to ensure the input field is activated.
• scroll(target: int, dir: str, dist: str): This function is called to scroll on the element target in a direction 'dir' (up, down, left, right) and a distance 'dist' (short, medium, long).
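These four actions can be exposed to the Selector as callable functions whose outputs are later consumed by the Executor. The sketch below mirrors the signatures listed above; the UIAction container is our own illustrative representation (type is renamed type_text only because type is a Python builtin).

```python
from dataclasses import dataclass

@dataclass
class UIAction:
    """A single-step UI action chosen by the Selector; `target` is an element ID from the Observer."""
    name: str          # "click" | "longPress" | "type" | "scroll"
    target: int
    text: str = ""     # only used by type
    dir: str = ""      # "up" | "down" | "left" | "right", only used by scroll
    dist: str = ""     # "short" | "medium" | "long", only used by scroll

def click(target: int) -> UIAction:
    return UIAction("click", target)

def longPress(target: int) -> UIAction:
    return UIAction("longPress", target)

def type_text(target: int, text: str) -> UIAction:
    return UIAction("type", target, text=text)

def scroll(target: int, dir: str, dist: str) -> UIAction:
    return UIAction("scroll", target, dir=dir, dist=dist)
```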
Finally, the Executor agent performs the UI actions generated by the Selector on the app under test (AUT). For actions targeting specific elements, the Executor first consults the coordinates provided in the Observer's results, converting the ID of the target element into concrete page coordinates. The Executor then performs the actual interaction through the Android Debug Bridge (adb) tool [1].
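Assuming the Observer's results map each element ID to a center coordinate, the Executor step could be sketched with standard adb shell input commands as follows; the swipe distances and scroll-direction mapping are illustrative, and the action object is the UIAction sketched above.

```python
import subprocess

SWIPE_PIXELS = {"short": 200, "medium": 500, "long": 900}   # illustrative distances

def adb(*args: str) -> None:
    subprocess.run(["adb", "shell", *args], check=True)

def execute(action, elements) -> None:
    """Perform a Selector action; `elements` maps element IDs to Observer results with a 'center' point."""
    x, y = elements[action.target]["center"]
    if action.name == "click":
        adb("input", "tap", str(x), str(y))
    elif action.name == "longPress":
        adb("input", "swipe", str(x), str(y), str(x), str(y), "800")   # long press = zero-length swipe
    elif action.name == "type":
        adb("input", "tap", str(x), str(y))                            # focus the input field first
        adb("input", "text", action.text.replace(" ", "%s"))           # adb escapes spaces as %s
    elif action.name == "scroll":
        dx = {"left": -1, "right": 1}.get(action.dir, 0) * SWIPE_PIXELS[action.dist]
        dy = {"up": -1, "down": 1}.get(action.dir, 0) * SWIPE_PIXELS[action.dist]
        adb("input", "swipe", str(x), str(y), str(x + dx), str(y + dy), "300")
```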
current page and then includes the matched elements’ appearance Another agent, the Monitor, is involved because, unlike the
descriptions and functions in the system prompt for MLLM. This specific step pattern that requires only the sequential execution of
helps it infer the functions of these error-prone elements. each operation in the list, the concise expression pattern necessi-
The Selector agent is set to select the operation target from tates a role to determine whether the user’s commands have been
the elements listed by Observer based on the natural language completed (i.e., whether the user’s intent has been achieved).
description of a single-step command. It generates the necessary Specifically, the Monitor determines whether the interaction
parameters for the GUI action and then calls the corresponding commands have been completed based on the current GUI screen-
function to perform. Specifically, we have defined the following UI shot and the record of executed operations. If the commands are
actions for Selector: completed, the GUI interaction module outputs the interaction log
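Putting the two extra agents together, the concise expression pattern amounts to a plan-execute-monitor loop along the following lines; planner, monitor, execute_specific_step, and capture_screenshot are hypothetical wrappers around the LLM calls and the specific steps pattern described above.

```python
def run_concise_command(command: str, max_iterations: int = 10):
    """Iterate plan -> execute -> monitor until the Monitor judges the command complete."""
    log, feedback = [], ""
    for _ in range(max_iterations):
        plan = planner(command, feedback)               # list of specific single-step commands
        for step in plan:
            log.append(execute_specific_step(step))     # Observer/Selector/Executor collaboration
        done, feedback = monitor(command, capture_screenshot(), log)
        if done:                                        # the Monitor judged the user's intent achieved
            break
    return log                                          # interaction log passed to function verification
```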
3.3 Function Verification
To our knowledge, AUITestAgent is the first work to investigate GUI function verification. Since AUITestAgent decouples GUI interaction from verification, its verification is conducted based on interaction log analysis. Consequently, we formulate function verification as a semantic analysis problem over GUI pages.

Although MLLMs possess image analysis capabilities, we still need to simplify GUI function verification tasks to align with the capabilities of LLMs and achieve stable performance. Specifically, there are three challenges in GUI function verification. First, the diversity of test oracles makes them difficult to process. Second, the sheer volume of data in interaction logs presents a challenge, as LLMs struggle to process excessively lengthy inputs [13], which may even exceed their maximum input capacity. Lastly, in our experience, even the most capable MLLMs are not adept at semantic analysis of multiple images, especially complex images like GUI screenshots.

3.3.1 Simplifying Test Oracles. To tackle the challenges posed by test oracles' complexity, AUITestAgent converts them into a list of simple questions, termed verification points in this paper. According to our experience at Meituan, test requirements often encompass multiple verification points, each relating to different pages and UI components within the interaction log. As shown in Figure 4, AUITestAgent initially breaks down the test oracles extracted from the test requirements into individual verification points. Each verification point is tailored to address a single question, thereby streamlining the handling of test oracles.

3.3.2 Multi-dimensional Data Extraction. As for the challenges posed by the scale of log data and the semantic analysis of images, AUITestAgent approaches information extraction from three perspectives, transforming the log into textual information. Specifically, the data contained in interaction logs can be categorized into three dimensions, as shown in Table 1. The GUI element dimension contains element functions and properties such as location and shape. Another dimension, the GUI page, includes page types (e.g., form page and list page) and categories (e.g., payment page, advertising page). Finally, the page state transitions dimension integrates the first two aspects (e.g., a payment page can be reached from an order-placement page by clicking the Confirm button).

Table 1: Categorization of interaction log data.
Type | Description | Example
UI element | Element functions and properties | A submit button at the bottom-right corner.
GUI page | Page structures and categories | A list-structured advertisement page.
Page state transitions | Integrates the first two aspects | A payment page can be reached from an order-placement page by clicking the submit button.

AUITestAgent leverages multiple agents to extract information relevant to the current verification point from these dimensions and records it as text. Considering that the analysis of GUI element functions and page types has already been performed in Section 3.2, AUITestAgent reuses such content to improve processing speed.
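One way to picture the extracted text is as a per-step record organized along the three dimensions of Table 1; the field names below are ours, chosen only to illustrate how such records could be flattened into the textual input attached to a verification point.

```python
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    """Textual summary of one interaction step, organized along the dimensions of Table 1."""
    action: str                                            # the UI action performed at this step
    ui_elements: list[str] = field(default_factory=list)   # e.g., "a submit button at the bottom-right corner"
    gui_page: str = ""                                      # e.g., "a list-structured advertisement page"
    transition: str = ""                                    # e.g., "order-placement page -> payment page via submit"

def to_prompt(records: list[StepRecord]) -> str:
    """Flatten the extracted dimensions into text that accompanies a verification point."""
    lines = []
    for i, r in enumerate(records, 1):
        lines.append(f"Step {i}: {r.action}; page: {r.gui_page}; "
                     f"elements: {', '.join(r.ui_elements)}; transition: {r.transition}")
    return "\n".join(lines)
```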
3.3.3 Function Verification. After the requirement-related data are collected, AUITestAgent integrates the verification points with the corresponding data extracted from the logs. Finally, the judgment is made using an MLLM. Furthermore, we employ JSON mode [5] to standardize the output, ensuring a structured delivery of judgments and explanations.
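A sketch of this final judgment step is shown below, assuming an OpenAI-style chat API with JSON mode [5]; the model name, prompt wording, and output fields ("passed", "explanation") are illustrative rather than AUITestAgent's exact prompt.

```python
import json
from openai import OpenAI

client = OpenAI()

def judge(verification_point: str, extracted_data: str, screenshot_b64: str | None = None) -> dict:
    """Ask an MLLM to judge one verification point against data extracted from the interaction log."""
    content = [{"type": "text",
                "text": f"Verification point: {verification_point}\n"
                        f"Relevant interaction-log data:\n{extracted_data}\n"
                        'Answer in JSON with fields "passed" (true/false) and "explanation".'}]
    if screenshot_b64:  # optionally attach the most relevant screenshot
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",                               # illustrative; Section 4.1.3 uses GPT-4o in the experiments
        response_format={"type": "json_object"},      # JSON mode [5] enforces structured output
        messages=[{"role": "user", "content": content}],
    )
    return json.loads(resp.choices[0].message.content)
```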
4 EVALUATION
In this section, we evaluate the performance of AUITestAgent by answering the following research questions.
• RQ1: How do the quality and efficiency of AUITestAgent in generating GUI interactions compare to those of existing approaches?
• RQ2: How effective is AUITestAgent in performing GUI function verification?
• RQ3: To what extent can AUITestAgent perform natural language-driven testing in practical commercial mobile apps?
RQ1 assesses the effectiveness of AUITestAgent's GUI interaction module, examining its accuracy and efficiency in converting natural language commands into GUI actions. In RQ2, we construct a dataset containing 20 bugs, simulating real-world scenarios to evaluate the effectiveness of the GUI function verification module. In RQ3, we validate the practical applicability of AUITestAgent by examining its performance across GUI pages in Meituan.

4.1 Evaluation Setup
4.1.1 Dataset. To the best of our knowledge, AUITestAgent is the first automatic natural language-driven GUI testing tool for mobile apps. Since there is no publicly available dataset, we construct testing scenarios and corresponding requirements for RQ1-RQ2. As shown in Table 2, given the diversity of practical test requirements depicted in Figure 1, we construct interaction requirements of various complexities along with corresponding verification oracles of various types.

For RQ1, we assessed the difficulty of interactions from two perspectives: the ideal number of interaction steps (i.e., the number of steps required for manual interaction, recorded as Step_ideal) and the vagueness of the requirements (i.e., the metric Score_vag, defined as Step_ideal divided by the number of interaction commands in the requirement, formalized below). As shown in Table 2, we use Step_ideal and Score_vag to describe the complexity level of interaction tasks. Based on these two perspectives, we categorized the interaction tasks into three difficulty levels. We then created ten interaction tasks for each difficulty level from eight popular apps, i.e., Meituan, Little Red Book, Douban, Facebook, Gmail, LinkedIn, Google Play, and YouTube Music.
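Written out, the vagueness metric is simply

\[
\mathit{Score}_{vag} = \frac{\mathit{Step}_{ideal}}{\#\,\text{interaction commands in the requirement}},
\]

so a fully explicit requirement that spells out every step has Score_vag = 1, while a six-step task expressed as a single concise command has Score_vag = 6 (this reading follows directly from the definition above).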
For RQ2, as the apps we selected are not open-source and bugs are not common in the online versions of these mature commercial apps, we needed a robust method to effectively measure AUITestAgent's performance in executing function verification. Two authors of this paper analyzed 50 recent historical anomalies at Meituan, categorizing them into three types (as shown in Table 3). Based on these categories, we manually constructed 20 test oracles. Specifically, we selected 20 interaction tasks from the 30 crafted for RQ1 and manually created corresponding GUI anomaly manifestations. Consequently, as shown in Table 3, the data for RQ2 includes 20 test oracles, with each oracle associated with two interaction traces, one correct and the other conflicting with a verification point. Therefore, there are 40 verification tasks for RQ2, half of which contain anomalies.

For RQ3, we applied AUITestAgent to 18 test requirements from real business scenarios at Meituan. By presenting the bugs identified by AUITestAgent in these real scenarios, we demonstrate its practical effectiveness.

4.1.2 Baselines. To demonstrate the advantages of AUITestAgent in converting natural language into GUI interactions, we compared it with two state-of-the-art natural language-driven GUI interaction tools. AppAgent [35] is the first to propose a natural language-driven approach for mobile app interaction, featuring an innovative design that incorporates an exploratory learning phase followed by interaction. This addresses the challenge of LLMs' lack of familiarity with apps by offering two learning modes: unsupervised exploration and human demonstration. For fairness, in RQ1, we utilized the unsupervised exploration mode and set the maximum
Table 4: Quality of generated GUI interactions at each difficulty level (TC / CS / CT / SE).
Difficulty level | L1 (TC CS CT SE) | L2 (TC CS CT SE) | L3 (TC CS CT SE) | Avg. (TC CS CT SE)
AUITestAgent | 1.00 0.97 1.00 0.97 | 0.80 0.91 0.93 1.00 | 0.50 0.96 0.86 1.00 | 0.77 0.94 0.93 0.99
MobileAgent | 0.70 0.68 0.80 0.90 | 0.60 0.58 0.78 0.85 | 0.20 0.54 0.64 1.00 | 0.50 0.60 0.74 0.90
AppAgent | 0.60 0.61 0.95 0.77 | 0.30 0.43 0.62 1.00 | 0.30 0.45 0.55 0.81 | 0.40 0.49 0.70 0.84
4.1.3 Implementation and Configuration. For RQ1 and RQ2, to explore the upper limits of AUITestAgent's capabilities, we selected one of the state-of-the-art MLLMs, GPT-4o [2], as the underlying model for AUITestAgent and the three baselines. As for AUITestAgent's practical implementation within Meituan, we select another MLLM, optimizing for efficiency. Given that all of these methods are device-independent, we conducted our experiments on multiple Android devices via the adb tool. For fairness, in RQ1, we set the maximum number of trial steps for all candidates to Step_ideal + 5.

4.2 RQ1: Quality of GUI Interactions
Initially, three authors manually executed each of the 30 interaction tasks. For each task i, we recorded the manual interaction trace as Trace_i^ma and compared it with the trace Trace_i^x generated by each tool x. Among the metrics reported in Table 4, CT and SE are computed as

\[
CT_i = \frac{|\mathit{LCP}(\mathrm{Trace}_i^{x},\ \mathrm{Trace}_i^{ma})|}{|\mathrm{Trace}_i^{ma}|} \tag{5}
\]

\[
SE_i = \begin{cases} |\mathrm{Trace\_P}_i^{x}| \,/\, |\mathrm{Trace}_i^{x}| & \text{if } TC_i = 1 \\ 0 & \text{otherwise} \end{cases} \tag{6}
\]

As shown in Table 4, AUITestAgent achieved the highest scores across the four types of metrics, with an overall task completion rate of 77%, significantly outperforming the two baseline methods. Although AppAgent incorporates an exploratory process, its direct use of the unprocessed UI hierarchy as input for LLMs results in excessively long and noisy data due to the complex GUI interfaces of the popular apps selected for RQ1. Consequently, this approach
leads to the lowest performance metrics for AppAgent across all evaluated criteria. This also demonstrates the effectiveness of the multi-source GUI understanding implemented in AUITestAgent. Furthermore, as the difficulty of interaction tasks increases, all four metrics decline for AUITestAgent and the baselines. Although Mobile-Agent performs well at levels L1 and L2, it struggles with tasks that involve numerous interaction commands or concisely expressed commands. This reflects the significance of the design of AUITestAgent's GUI interaction module.

Therefore, the conclusion of RQ1 is that AUITestAgent can effectively convert GUI interaction tasks of different difficulty levels into corresponding GUI actions. Since the subsequent function verification requires that interaction tasks be completed first, AUITestAgent possesses the capability needed to carry out that verification.

4.3 RQ2: Effectiveness of Function Verification
Since some test oracles in Meituan contain multiple verification points, 8 of the 20 test oracles we designed for RQ2 include multiple verification points. Before conducting the experiments, we manually divided these test oracles into 32 verification points to facilitate subsequent analysis.

We evaluated the verification performance of AUITestAgent and the multi-turn dialogue-based baseline from three perspectives: test oracle, verification point, and reasoning in the explanation. To ensure fairness, we also tracked the number of tokens consumed by each method in the experiment. Specifically, we designed five metrics for RQ2.

Oracle Acc. represents the proportion of correctly judged tasks. A task is considered correctly judged by an oracle only if all included verification points are accurately assessed and the corresponding explanations are reasonable. Similarly, Point Acc. and Reasoning Acc. measure the accuracy of verification point judgments and the correctness of their explanations. Additionally, we report the Completion Tokens and Prompt Tokens for each task, which indicate the economic cost associated with each method.

As shown in Table 5, AUITestAgent achieved an accuracy of 0.90 in making correct judgments on test oracles across the two types of tasks. Due to the greater difficulty of anomaly detection compared to correct function verification, AUITestAgent achieved an Oracle Acc. of 0.85 in anomaly detection, slightly lower than in correct function verification. Considering that the 40 cases in RQ2 contain 64 verification points, of which 20 have anomalies, AUITestAgent successfully identified 18 of the anomalous points and reported 2 false positives. This corresponds to a recall rate of 90% and a false positive rate of 4.5%.
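These rates follow directly from the point-level counts, assuming the false positive rate is computed over the 44 anomaly-free verification points:

\[
\text{Recall} = \frac{18}{20} = 90\%, \qquad \text{FPR} = \frac{2}{64-20} = \frac{2}{44} \approx 4.5\%.
\]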
Conversely, the baseline method reached an accuracy of 1.0 on the correct function verification tasks but only achieved an Oracle Acc. of 0.35 in anomaly detection, indicating its difficulty in understanding the questions in RQ2 and its tendency to output a 'no anomaly' judgment. This result not only highlights the challenges LLMs face in handling GUI functional testing but also validates the effectiveness of the multi-dimensional data extraction in AUITestAgent.

4.4 RQ3: Performance in Practical Implementations
In RQ3, we discuss the practical implementation of AUITestAgent in a real business scenario, i.e., the Video scenario of Meituan's app, to show its practical effectiveness. Specifically, 10 test requirements with L1/L2-level interaction commands are selected to cover various checkpoints in short video playback and live show scenarios. These requirements have been integrated into the weekly regression automation process for this business line at Meituan.

In the video scenario within Meituan, in addition to supporting normal video playback functions (pause and play, dragging the progress bar, and swiping up and down to switch videos), the app also supports clicking on merchant and product cards to view merchant information and purchase products. The GUI pages in this scenario are complex, with numerous checkpoints that need testing. In the past, each iteration relied on manually writing test scripts or performing manual interactions for verification. These 10 test requirements contain 18 verification points in the video scenario, saving 1.5 Person-Days (PD) in each round of regression testing. Since AUITestAgent's launch, it has detected 4 new bugs in the video scenario of Meituan's app during 10 rounds of regression testing. Details of these bugs are shown in Table 6.

5 THREATS TO VALIDITY
One potential threat arises from the fact that AUITestAgent employs a structure that decouples interaction from verification. Specifically, in industry testing scenarios, a small number of requirements necessitate performing interactions during function verification (e.g., repeatedly clicking the next video on a video playback page until a hotel-related live stream appears, then checking the details of the business to ensure the business name matches the live stream). Currently, AUITestAgent lacks the capability to verify such requirements. Given their rarity, this limitation does not have much impact on AUITestAgent's applicability in industry settings.

Another threat stems from AUITestAgent's reliance on general-purpose LLMs. General-purpose large language models such as GPT-4 are trained on extensive datasets [25]. However, these datasets lack domain knowledge crucial for GUI function testing, such as app
function designs and GUI components. Consequently, we acknowledge the current limitations of AUITestAgent's capabilities. In the future, we plan to construct a dataset specific to GUI functions and enhance AUITestAgent's proficiency in UI testing by employing a retrieval-augmented generation (RAG) approach [18].

6 RELATED WORK
Since the quality of the GUI is closely related to the user experience, various testing methods targeting bugs in GUIs have been proposed [6, 17, 31]. Traditionally, developing GUI testing scripts has been the main method for automating GUI function tests in industry practice. Since these scripts are labor-intensive to develop and easily become obsolete, several studies have been conducted on automatically generating or repairing GUI testing scripts. For instance, AppFlow [12] utilizes machine learning techniques to automatically identify screen components, enabling testers to write modular testing libraries for essential functions of applications. CraftDroid [19] uses information retrieval to extract human knowledge from existing test components of applications to test other programs. CoSer [9] constructs UI state transition graphs by analyzing app source code and test scripts to repair obsolete scripts. Although AUITestAgent is also a method for testing GUI functions, it differs from these studies as it neither relies on nor generates hard-coded testing scripts, thereby offering better generalizability and maintainability.

Due to their extensive training data and robust logical reasoning abilities, LLMs are increasingly utilized in mobile app testing. Several methods have been proposed that use LLMs to generate GUI interactions or translate natural language commands into GUI actions. QTypist [21] focuses on generating semantic textual inputs for form pages to enhance exploration testing coverage. GPTDroid [22] extracts GUI page information and widget functionality from the UI hierarchy file, using this data to generate human-like interactions. AppAgent [35], Mobile-Agent [28], DroidBot-GPT [30], and AutoDroid [29] leverage LLMs to process natural language descriptions and GUI pages, translating natural language commands into GUI actions. Different from these tools, AUITestAgent can directly perform GUI testing on mobile apps. Moreover, our experiments demonstrate that the techniques employed by AUITestAgent, including dynamically organizing agents, lead to the highest quality of GUI interaction generation.

VisionDroid [23] focuses on non-crash bug detection in GUI pages, highlighting the absence of testing oracles and utilizing large language models (LLMs) to detect unexpected behaviors. In contrast, AUITestAgent concentrates on the industry's practical needs. By emphasizing the challenges of implementing practical testing requirements, we have developed an industry-applicable automatic natural language-driven GUI testing method.

7 CONCLUSION
In this paper, we propose AUITestAgent, an automatic approach to perform natural language-driven GUI testing for mobile apps. In order to extract GUI interactions from test requirements, AUITestAgent utilizes dynamically organized agents and constructs multi-source input for them. Following this, a multi-dimensional data extraction strategy is employed to retrieve data relevant to the test requirements from the interaction trace and perform verification. Our experiments on customized benchmarks show that AUITestAgent significantly outperforms existing methods in GUI interaction generation and can recall 90% of injected bugs with a 4.5% false positive rate. Furthermore, previously unseen bugs detected at Meituan show the practical benefits of using AUITestAgent to conduct GUI testing for complex commercial apps. These findings highlight the potential of AUITestAgent to automate the GUI testing procedure for practical mobile apps.
REFERENCES
[1] Accessed: 2024. Android Debug Bridge (adb). https://developer.android.com/tools/adb
[2] Accessed: 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/
[3] Accessed: 2024. Layouts in Views. https://developer.android.com/develop/ui/views/layout/declaring-layout
[4] Accessed: 2024. Meituan-Dianping/vision-ui: Visual UI analysis tools. https://github.com/Meituan-Dianping/vision-ui
[5] Accessed: 2024. OpenAI Platform: JSON mode. https://platform.openai.com/docs/guides/text-generation/json-mode
[6] Accessed: 2024. UI/Application Exerciser Monkey. https://developer.android.com/studio/test/other-testing-tools/monkey
[7] Accessed: 2024. WebView | Android Developers. https://developer.android.com/reference/android/webkit/WebView
[8] Accessed: 2024. XML Path Language (XPath). https://www.w3.org/TR/1999/REC-xpath-19991116/
[9] Shaoheng Cao, Minxue Pan, Yu Pei, Wenhua Yang, Tian Zhang, Linzhang Wang, and Xuandong Li. 2024. Comprehensive Semantic Repair of Obsolete GUI Test Scripts for Mobile Applications. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE 2024). ACM, 90:1–90:13. https://doi.org/10.1145/3597503.3639108
[10] Sidong Feng and Chunyang Chen. 2024. Prompting Is All You Need: Automated Android Bug Replay with Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE 2024). ACM, 67:1–67:13. https://doi.org/10.1145/3597503.3608137
[11] Tianxiao Gu, Chengnian Sun, Xiaoxing Ma, Chun Cao, Chang Xu, Yuan Yao, Qirun Zhang, Jian Lu, and Zhendong Su. 2019. Practical GUI testing of Android applications via model abstraction and refinement. In Proceedings of the 41st International Conference on Software Engineering (ICSE 2019). IEEE / ACM, 269–280.
[12] Gang Hu, Linjie Zhu, and Junfeng Yang. 2018. AppFlow: using machine learning to synthesize robust, reusable UI tests. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). ACM, 269–282.
[13] Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. 2024. BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions. In Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2024). AAAI Press, 2256–2264. https://doi.org/10.1609/AAAI.V38I3.27999
[14] Yongxiang Hu, Jiazhen Gu, Shuqing Hu, Yu Zhang, Wenjie Tian, Shiyu Guo, Chaoyi Chen, and Yangfan Zhou. 2023. Appaction: Automatic GUI Interaction for Mobile Apps via Holistic Widget Perception. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023). ACM, 1786–1797. https://doi.org/10.1145/3611643.3613885
[15] Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong. 2024. Visual Hallucinations of Multi-modal Large Language Models. CoRR abs/2402.14683 (2024). https://doi.org/10.48550/ARXIV.2402.14683
[16] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 55, 12 (2023), 248:1–248:38. https://doi.org/10.1145/3571730
[17] Wing Lam, Zhengkai Wu, Dengfeng Li, Wenyu Wang, Haibing Zheng, Hui Luo, Peng Yan, Yuetang Deng, and Tao Xie. 2017. Record and replay for Android: are we there yet in industrial cases?. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, 854–859. https://doi.org/10.1145/3106237.3117769
[18] Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
[19] Jun-Wei Lin, Reyhaneh Jabbarvand, and Sam Malek. 2019. Test Transfer Across Mobile Apps Through Semantic Mapping. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE 2019). IEEE, 42–53.
[20] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguistics 12 (2024), 157–173. https://doi.org/10.1162/TACL_A_00638
[21] Zhe Liu, Chunyang Chen, Junjie Wang, Xing Che, Yuekai Huang, Jun Hu, and Qing Wang. 2023. Fill in the Blank: Context-aware Automated Text Input Generation for Mobile GUI Testing. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE 2023). IEEE, 1355–1367.
[22] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2024. Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE 2024). ACM, 100:1–100:13. https://doi.org/10.1145/3597503.3639180
[23] Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, and Qing Wang. 2024. Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model. CoRR abs/2407.03037 (2024).
[24] Ke Mao, Mark Harman, and Yue Jia. 2016. Sapienz: multi-objective automated testing for Android applications. In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA 2016). ACM, 94–105.
[25] OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023). https://doi.org/10.48550/ARXIV.2303.08774
[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021). PMLR, 8748–8763.
[27] Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. 2017. Guided, stochastic model-based GUI testing of Android apps. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, 245–256.
[28] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. CoRR abs/2401.16158 (2024). https://doi.org/10.48550/ARXIV.2401.16158
[29] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023. Empowering LLM to use Smartphone for Intelligent Task Automation. CoRR abs/2308.15272 (2023).
[30] Hao Wen, Hongming Wang, Jiaxuan Liu, and Yuanchun Li. 2023. DroidBot-GPT: GPT-powered UI Automation for Android. CoRR abs/2304.07061 (2023).
[31] Yiheng Xiong, Mengqian Xu, Ting Su, Jingling Sun, Jue Wang, He Wen, Geguang Pu, Jifeng He, and Zhendong Su. 2023. An Empirical Study of Functional Bugs in Android Apps. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023). ACM, 1319–1331. https://doi.org/10.1145/3597926.3598138
[32] Yi Xu, Hai Zhao, and Zhuosheng Zhang. 2021. Topic-Aware Multi-turn Dialogue Modeling. In Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021). AAAI Press, 14176–14184.
[33] An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian J. McAuley, Jianfeng Gao, Zicheng Liu, and Lijuan Wang. 2023. GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. CoRR abs/2311.07562 (2023). https://doi.org/10.48550/ARXIV.2311.07562
[34] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. CoRR abs/2310.11441 (2023). https://doi.org/10.48550/ARXIV.2310.11441
[35] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal Agents as Smartphone Users. CoRR abs/2312.13771 (2023). https://doi.org/10.48550/ARXIV.2312.13771