
AUITestAgent: Automatic Requirements Oriented GUI Function Testing

Yongxiang Hu∗, Xuan Wang∗, Yingchuan Wang∗
School of Computer Science, Fudan University, Shanghai, China

Yu Zhang, Shiyu Guo
Meituan, Shanghai, China

Chaoyi Chen
Meituan, Beijing, China

Xin Wang, Yangfan Zhou
School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, China

arXiv:2407.09018v1 [cs.SE] 12 Jul 2024

ABSTRACT
The Graphical User Interface (GUI) is how users interact with mobile apps. To ensure it functions properly, testing engineers have to make sure it functions as intended, based on test requirements that are typically written in natural language. While widely adopted manual testing and script-based methods are effective, they demand substantial effort due to the vast number of GUI pages and rapid iterations in modern mobile apps. This paper introduces AUITestAgent, the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification. Since test requirements typically contain interaction commands and verification oracles, AUITestAgent extracts GUI interactions from test requirements via dynamically organized agents. Then, AUITestAgent employs a multi-dimensional data extraction strategy to retrieve data relevant to the test requirements from the interaction trace and performs verification. Experiments on customized benchmarks¹ demonstrate that AUITestAgent outperforms existing tools in the quality of generated GUI interactions and achieves a verification accuracy of 94%. Moreover, field deployment at Meituan has shown AUITestAgent's practical usability: it detected 4 new functional bugs during 10 regression tests in two months.

KEYWORDS
Automatic Testing, Mobile Apps, Functional Bug, In-context Learning

ACM Reference Format:
Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, and Yangfan Zhou. 2024. AUITestAgent: Automatic Requirements Oriented GUI Function Testing. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

∗ This work was performed during the authors' internship in Meituan.
¹ https://github.com/bz-lab/AUITestAgent

1 INTRODUCTION
Since the GUI (Graphical User Interface) is the medium through which users interact with mobile apps, developers need to verify that it functions as expected. The most direct approach to this verification is to check the UI functionality according to requirements, specifically by executing GUI interactions and checking GUI responses. Despite being effective, the vast number of GUI pages and the rapid pace of app iterations result in a large volume of testing tasks. Therefore, to avoid expending a large amount of human effort, automation is essential.

Since the test requirements are usually expressed in natural language, we call this form of testing natural language-driven GUI testing for mobile apps, which remains an underexplored field. The most relevant work is natural language-driven GUI interaction [10, 28, 33]. These approaches typically utilize large language models (LLMs) to analyze natural language commands and GUI pages, and ultimately generate interaction commands. However, unlike app interaction tasks, which focus on the interaction result and are goal-oriented, GUI testing requires checking the correctness of GUI responses during the interaction process and is, therefore, step-oriented. Consequently, the trial-and-error strategies commonly used in these interaction methods often result in inefficient interaction traces and an insufficient success rate for GUI testing.

In industry practice, performing natural language-driven GUI testing still largely relies on intensive manual effort. For example, such GUI testing in Meituan is mainly performed manually or through GUI testing scripts. Specifically, testing engineers read and understand test requirements and manually select the UI elements to interact with and check by XPath, a language for pointing to different parts of a UI hierarchy file [8]. Then, multiple rules are designed to conduct GUI interactions and verifications. However, mobile apps today typically have multiple business lines (e.g., hotel booking, taxi booking, and food delivery), each involving a tremendous number of GUI functions and rapid development iterations. Consequently, conducting GUI function testing demands substantial manual effort for test script development and maintenance. This paper aims to reduce such manual efforts in natural language-driven GUI testing.

Despite GUI testing having more steps and therefore being more complex than interactions for testing engineers, we found that the step-oriented characteristic of GUI testing tasks is advantageous. This structure allows us to utilize step descriptions to design prompts, enabling us to achieve higher UI interaction success rates and verification accuracy, thereby attaining industry-level usability. Although encouraging, there are still challenges posed by diverse testing requirements and the complex, huge number of GUI pages. The key to reducing such human effort lies in simplifying GUI testing tasks to align with LLMs' capabilities. In this regard, we designed AUITestAgent, the first automatic natural language-driven GUI testing tool for mobile apps. It takes GUI testing requirements as input, performs tests on the specified app, and outputs the test results.

As for AUITestAgent's implementation, three strategies are employed to simplify GUI testing tasks. Firstly, although GUI testing contains GUI interaction and function verification, the verification results do not actually impact the interaction. Therefore, AUITestAgent decouples interaction and verification into two separate modules, performing verification after interaction. As for diverse interaction commands, we designed two agent organization patterns. For simple instructions, AUITestAgent allows the Executor to perform directly, while complex instructions are broken down by the Planner first. As for GUI function verification, it requires processing numerous GUI screenshots, each containing tens of GUI elements. Since LLMs struggle to handle questions with many images attached accurately [13, 20], AUITestAgent extracts requirement-related information from three dimensions: GUI elements, GUI pages, and interaction traces. Then, verification is conducted based on this integrated information.

We analyze the design effectiveness of AUITestAgent with a set of experiments. We show that AUITestAgent performs the best in GUI interaction compared with AppAgent [35] and MobileAgent [28], which are both well-known natural language-driven interaction tools. Our verification experiment shows that AUITestAgent can recall 90% of GUI functional bugs and provide reasonable explanations while maintaining a false positive rate of less than 5%. We also report our field experiences in applying AUITestAgent in Meituan. To date, AUITestAgent has been deployed across xx business lines and has detected xx new functional bugs. We summarize the contributions of this paper as follows.
• We demonstrate that the step-oriented characteristic of GUI functional testing is well-suited for LLM-based automatic testing.
• We propose AUITestAgent, the first automatic natural language-driven GUI testing tool for mobile apps.
• We show that AUITestAgent is an effective tool via real-world cases on commercial apps that serve tremendous numbers of end-users. We summarize the lessons learned, which can shed light on the automation of GUI testing.

(a) Requirement with specific steps  (b) Requirement in concise expression
Figure 1: Practical Testing Requirements

2 PRACTICES OF GUI FUNCTION TESTING
The design of AUITestAgent is motivated by the practical testing status of Meituan's mobile app. Meituan is one of the largest online shopping platforms in the world, with nearly 700 million transacting users and about 10 million active merchants. Bugs on GUI pages in such apps would inevitably deteriorate the user experience. The Meituan app features a vast number of GUI pages and functionalities. Despite numerous automated testing tools proposed in academia [6, 9, 11, 12, 24, 27], they typically focus on detecting non-functional bugs (e.g., app crashes) or require considerable manual intervention. Consequently, these tools fail to substantially reduce the human effort required for GUI functional testing.

During our practice in Meituan, GUI function verification is mainly performed manually or through hard-coded scripts. Specifically, testing engineers need to collaborate with both requirements and development teams to define test requirements. As shown in Figure 1, these requirements are typically in natural language. After the GUI development is completed, testing engineers verify these test requirements manually on real devices. Additionally, regression tests are set up for core functionalities for cost considerations. Implementing GUI function regression testing involves developing hard-coded GUI testing scripts. This requires testing engineers to select UI elements that need interaction or verification from the UI hierarchy [3] and to perform GUI interaction and function verification via rule development.

For commonly used commercial apps, there are usually many business lines, each with numerous functions, and each function involves multiple GUI interfaces. As a result, the workload for GUI functional testing is substantial. Although GUI test scripts can reduce some of the manual effort, they are prone to becoming outdated due to GUI changes brought by app iterations. Therefore, extensive use of these scripts can lead to considerable maintenance costs.
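To make the maintenance cost concrete, the snippet below sketches what one such hard-coded regression step can look like. It is a hypothetical, minimal Python example, not Meituan's actual tooling: it dumps the UI hierarchy through adb and uiautomator, locates one element by a hard-coded resource id, and taps its center. The package name and element id are invented for illustration; any redesign that renames or moves the node breaks the script silently.

```python
import re
import subprocess
import xml.etree.ElementTree as ET

def dump_hierarchy() -> ET.Element:
    # Dump the current UI hierarchy with uiautomator and read it back.
    subprocess.run(["adb", "shell", "uiautomator", "dump",
                    "/sdcard/window_dump.xml"], check=True)
    xml = subprocess.run(["adb", "shell", "cat", "/sdcard/window_dump.xml"],
                         check=True, capture_output=True, text=True).stdout
    return ET.fromstring(xml)

def tap_by_resource_id(root: ET.Element, resource_id: str) -> None:
    # Hard-coded selector: breaks as soon as the id or the layout changes.
    for node in root.iter("node"):
        if node.get("resource-id") == resource_id:
            # bounds look like "[x1,y1][x2,y2]"
            x1, y1, x2, y2 = map(int, re.findall(r"\d+", node.get("bounds")))
            subprocess.run(["adb", "shell", "input", "tap",
                            str((x1 + x2) // 2), str((y1 + y2) // 2)], check=True)
            return
    raise AssertionError(f"element {resource_id} not found: script is stale")

root = dump_hierarchy()
tap_by_resource_id(root, "com.example.app:id/submit")  # hypothetical id
```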

Figure 2: Overview of AUITestAgent

3 TOOL DESIGN AND TESTING WORKFLOW

3.1 Overview
Figure 2 presents an overview of AUITestAgent, which comprises two main modules: GUI interaction and function verification. Specifically, AUITestAgent first analyzes the GUI test requirements to extract interaction commands and oracles for function verification. The GUI interaction module then dynamically selects the appropriate agent organization pattern based on the complexity of the interaction commands. If the commands consist of specific steps, the interactions will be performed directly by the Executor. Otherwise, if the commands are expressed concisely, a more complex pattern involving a Planner and a Monitor will be used. Finally, the function verification module analyzes the interaction logs based on the test oracles. These logs contain information about the GUI pages encountered during the interactions, including screenshots, LLM outputs, and the UI actions performed. AUITestAgent extracts oracle-relevant data from the interaction logs and outputs the verification results and reason analysis.

3.2 Natural Language Driven GUI Interaction
As discussed in Section 2, the interaction commands in different GUI test requirements vary significantly. As shown in Figure 1(a), target elements and corresponding UI actions are specified in explicit commands. In contrast, concise commands such as Figure 1(b) may abstract multiple UI actions into a single interaction command, which is more challenging to perform. Such differences require AUITestAgent to dynamically organize suitable agents to collaborate, ensuring the interaction commands are executed accurately. Specifically, concise interaction commands require the agent to accurately interpret the user's intent behind the commands, assess the current state from GUI screenshots and interaction records, and determine the appropriate UI actions to take. Therefore, performing concise commands demands interaction memory and reasoning capabilities, which are difficult to achieve through a single agent due to context length limitations. In contrast, while detailed interaction commands are easier to execute, using multiple agents for them incurs unnecessary costs. It may also amplify the hallucination problem [16] of LLMs due to cumulative effects, thereby reducing the quality of generated UI actions.

To address these challenges, the GUI interaction module of AUITestAgent employs two agent organization patterns, the specific steps pattern and the concise expression pattern, as shown in Figure 3. The GUI interaction module first uses LLMs to recognize the type of the input interaction commands. If the interaction commands consist of a series of specific steps, AUITestAgent converts them into a list of single-step UI actions via the specific steps pattern. Otherwise, if the test requirement contains concise interaction commands, the GUI interaction module uses the concise expression pattern to handle them.

Figure 3: Workflow of GUI Interaction Module

3.2.1 Specific steps pattern. The specific steps pattern executes each operation in the operation list sequentially through the collaboration of three LLM agents: the Observer, the Selector, and the Executor.

The Observer agent identifies all interactive UI elements on the current GUI page and infers their functions. Specifically, the Observer processes the screenshot of the current GUI page using Set-of-Mark [34], assigning numeric IDs and rectangular boxes to all interactive elements. This approach highlights the interactive elements in the GUI screenshots, making them easier for LLMs to recognize. Finally, a multi-modal large language model (MLLM) is used to infer the function of each numbered UI element.

Although MLLMs possess image analysis capabilities, they do not perform well in observing GUI images, especially in recognizing UI elements and analyzing their functions, due to the complexity of GUIs [14]. Besides, MLLMs are prone to Optical Character Recognition (OCR) hallucinations when recognizing text in images [15], while text is crucial for understanding the functionality of many elements within the GUI. To tackle these challenges, we enhance the GUI analysis capability of the Observer agent through multi-source input and knowledge-base-augmented generation.

Specifically, we select GUI screenshots and the UI hierarchy file [3] as the Observer's input. For the UI hierarchy file, the Observer first parses it to identify all the interactive nodes (i.e., those whose clickable, enabled, scrollable, or long-clickable attribute is set to true). Related attributes of these nodes are also collected, including coordinate information (from the 'bounds' attribute), textual information (from the 'text' and 'content-desc' attributes), and functional information (e.g., nodes with an "EditText" class are likely to be text input fields).
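The snippet below illustrates this parsing step. It is a simplified sketch under our own assumptions (AUITestAgent's exact implementation is not published): it walks a uiautomator-style XML dump and collects each interactive node together with the attributes named above, ready to be formatted into the Observer's prompt.

```python
import re
import xml.etree.ElementTree as ET

INTERACTIVE_ATTRS = ("clickable", "enabled", "scrollable", "long-clickable")

def collect_interactive_elements(hierarchy_xml: str) -> list[dict]:
    """Return one record per interactive node in the UI hierarchy dump."""
    elements = []
    for node in ET.fromstring(hierarchy_xml).iter("node"):
        if not any(node.get(attr) == "true" for attr in INTERACTIVE_ATTRS):
            continue
        # bounds look like "[x1,y1][x2,y2]"; keep the center for later taps.
        x1, y1, x2, y2 = map(int, re.findall(r"-?\d+",
                                             node.get("bounds", "[0,0][0,0]")))
        elements.append({
            "id": len(elements),                       # numeric ID used by Set-of-Mark
            "center": ((x1 + x2) // 2, (y1 + y2) // 2),
            "text": node.get("text", ""),
            "content_desc": node.get("content-desc", ""),
            "class": node.get("class", ""),            # e.g. android.widget.EditText
        })
    return elements
```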

For GUI screenshots, the Observer uses the vision-ui model from Meituan [4] to identify all elements in the GUI and performs OCR to recognize the text visible to users. The recognition results serve as a supplement to the UI hierarchy, especially in cases where obtaining the UI hierarchy file fails or when the page contains WebView [7] components.

To make it easier for LLMs to describe the functions of elements, the Observer then marks the numeric ID and rectangular box of each element on the GUI image based on their coordinates. Finally, a system prompt that contains the processed screenshot and the corresponding elements' data, including element IDs, text information, and other property information, is assembled. This prompt is then used as input to the MLLM to infer the function of UI elements.

Given that the LLM is not well-acquainted with the design of mobile apps, particularly those in Chinese, we have observed that the Observer agent occasionally mispredicts the functions of UI elements, especially UI icons that lack textual information. To tackle this problem, a UI element function knowledge base is formed to enhance element recognition. For instance, we manually collect and label some elements that are difficult to recognize, primarily icons that do not contain text. Each element in the knowledge base contains its screenshot, appearance description, and function. When processing the GUI, the Observer uses CLIP [26] to match elements from the knowledge base with elements on the current page and then includes the matched elements' appearance descriptions and functions in the system prompt for the MLLM. This helps it infer the functions of these error-prone elements.

The Selector agent selects the operation target from the elements listed by the Observer based on the natural language description of a single-step command. It generates the necessary parameters for the GUI action and then calls the corresponding function to perform it. Specifically, we have defined the following UI actions for the Selector (a sketch of how they can be grounded follows the Executor description below):
• click(target: int): clicks an element listed by the Observer. The parameter target is the element's ID.
• longPress(target: int): similar to click, but long-presses the element.
• type(target: int, text: str): types the string 'text' into an input field. Note that the target parameter is the ID of the input field; the function clicks on this element before typing to ensure the input field is activated.
• scroll(target: int, dir: str, dist: str): scrolls on the element target in a direction 'dir' (up, down, left, right) and distance 'dist' (short, medium, long).

Finally, the Executor agent performs the UI actions generated by the Selector on the app under test (AUT). For actions targeting specific elements, the Executor first consults the coordinates provided in the Observer's results, converting the ID of the target element into specific page coordinates. Then the Executor performs the actual interaction through the Android Debug Bridge (adb) tool [1].
3.2.2 Concise expression pattern. The concise expression pattern is also achieved through the collaboration of multiple agents. As shown in Figure 3, it adds two agents, the Planner and the Monitor, on top of the specific steps pattern to handle concise commands.

The Planner agent formulates a plan based on the concise interaction commands, where each step in the plan is a specific interaction command aimed at progressing the completion of the overall commands. Then, for each step in the plan, the specific steps pattern is used for GUI action execution.

Another agent, the Monitor, is involved because, unlike the specific steps pattern, which requires only the sequential execution of each operation in the list, the concise expression pattern necessitates a role that determines whether the user's commands have been completed (i.e., whether the user's intent has been achieved). Specifically, the Monitor determines whether the interaction commands have been completed based on the current GUI screenshot and the record of executed operations. If the commands are completed, the GUI interaction module outputs the interaction log and stops. Otherwise, the Monitor describes the current GUI state and the operations it believes are necessary to complete the remaining interactions. Its output serves as feedback for the Planner, and in the next iteration, the Planner will make a new plan.

Figure 4: Workflow of Function Verification Module

3.3 Function Verification
To our knowledge, AUITestAgent is the first work to investigate GUI function verification. Since AUITestAgent decouples GUI interaction from verification, its verification is conducted based on interaction log analysis. Consequently, we formulate function verification as a semantic analysis problem over GUI pages.

Although MLLMs possess image analysis capabilities, we still need to simplify GUI function verification tasks to align with the capabilities of LLMs in order to achieve stable performance. Specifically, there are three challenges in GUI function verification. First, the diversity of test oracles makes them difficult to process. Second, the sheer volume of data in interaction logs presents a challenge, as LLMs struggle to process excessively lengthy inputs [13], which may even exceed their maximum input capacity. Lastly, according to our experience, even the most capable MLLMs are not adept at semantic analysis of multiple images, especially complex images like GUI screenshots.

3.3.1 Simplifying Test Oracles. To tackle the challenges posed by test oracles' complexity, AUITestAgent converts them into a list of simple questions, termed verification points in this paper. According to our experience in Meituan, test requirements often encompass multiple verification points, each relating to different pages and UI components within the interaction log. As shown in Figure 4, AUITestAgent initially breaks down the test oracles extracted from the test requirements into individual verification points. Each verification point is tailored to address a single question, thereby streamlining the handling of test oracles.

3.3.2 Multi-dimensional Data Extraction. As for the challenges posed by the scale of log data and the semantic analysis of images, AUITestAgent approaches information extraction from three perspectives, transforming the log into textual information. Specifically, data contained in interaction logs can be categorized into three dimensions, as shown in Table 1. The GUI element dimension contains element functions and properties such as location and shape. Another dimension, the GUI page, includes page types (e.g., form page and list page) and categories (e.g., payment page, advertising page). And the page state transitions dimension integrates the first two aspects (e.g., a payment page can be reached from an order-placement page by clicking the Confirm button).

Table 1: Categorization of interaction log data.

Type | Description | Example
UI element | Element functions and properties | A submit button at the bottom-right corner.
GUI page | Page structures and categories | A list-structured advertisement page.
Page state transitions | Integrates the first two aspects | A payment page can be reached from an order-placement page by clicking the submit button.

AUITestAgent leverages multiple agents to extract information relevant to the current verification point from these dimensions and records it as text. Considering that the analysis of GUI element functions and page types has already been performed in Section 3.2, AUITestAgent reuses such content to enhance processing speed.

3.3.3 Function Verification. After the requirement-related data are collected, AUITestAgent integrates the verification points with the corresponding data extracted from the logs. Finally, the judgment is executed using an MLLM. Furthermore, we employ the JSON mode [5] to standardize the output, ensuring a structured delivery of judgments and explanations.
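The sketch below shows one way to wire the oracle decomposition of Section 3.3.1 and this JSON-mode judgment together using the OpenAI client. The prompt wording and the output schema (`points`, `passed`, `reason`) are our own assumptions rather than AUITestAgent's actual prompts.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def decompose_oracle(test_oracle: str) -> list[str]:
    """Split a compound test oracle into single-question verification points."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},   # JSON mode
        messages=[{"role": "user", "content":
                   "Split this GUI test oracle into independent verification "
                   'points. Reply as JSON: {"points": ["..."]}.\n' + test_oracle}],
    )
    return json.loads(resp.choices[0].message.content)["points"]

def judge_point(point: str, extracted_log_text: str) -> dict:
    """Judge one verification point against the textualized interaction log."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},   # standardizes the verdict
        messages=[{"role": "user", "content":
                   f"Verification point: {point}\n"
                   f"Relevant interaction data:\n{extracted_log_text}\n"
                   + 'Reply as JSON: {"passed": true or false, "reason": "..."}'}],
    )
    return json.loads(resp.choices[0].message.content)
```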

4 EVALUATION
In this section, we evaluate the performance of AUITestAgent by answering the following research questions.
• RQ1: How do the quality and efficiency of AUITestAgent in generating GUI interactions compare to existing approaches?
• RQ2: How effective is AUITestAgent in performing GUI function verification?
• RQ3: To what extent can AUITestAgent perform natural language-driven testing in practical commercial mobile apps?

RQ1 assesses the effectiveness of AUITestAgent's GUI interaction module, examining its accuracy and efficiency in converting natural language commands into GUI actions. In RQ2, we construct a dataset containing 20 bugs, simulating real-world scenarios to evaluate the effectiveness of the GUI function verification module. In RQ3, we validate the practical applicability of AUITestAgent by examining its performance across GUI pages in Meituan.

4.1 Evaluation Setup
4.1.1 Dataset. To the best of our knowledge, AUITestAgent is the first automatic natural language-driven GUI testing tool for mobile apps. Since there is no publicly available dataset, we construct testing scenarios and corresponding requirements for RQ1-RQ2. As shown in Table 2, given the diversity of practical test requirements depicted in Figure 1, we construct interaction requirements of various complexities along with corresponding verification oracles of various types.

For RQ1, we assessed the difficulty of interactions from two perspectives: the ideal number of interaction steps (i.e., the number of steps required for manual interaction, recorded as Step_ideal) and the vagueness of the requirements (the metric Score_vag, defined as Step_ideal divided by the number of interaction commands in the requirement). As shown in Table 2, we use Step_ideal and Score_vag to describe the complexity level of interaction tasks. Based on these two perspectives, we categorized the interaction tasks into three difficulty levels. We then created ten interaction tasks for each difficulty level from eight popular apps, i.e., Meituan, Little Red Book, Douban, Facebook, Gmail, LinkedIn, Google Play, and YouTube Music.

Table 2: Difficulty levels of interaction tasks.

Difficulty Level | Definition | Example | Step_ideal | Score_vag
L1 | Step_ideal + Score_vag <= 3 | Click the Like button on the first post. | 1 | 1
   |                             | Click the Top Charts tab, then click to view the details of "Honor of Kings". | 2 | 1
   |                             | Click the "Explore" tab, then click the "New releases". | 2 | 1
L2 | 3 < Step_ideal + Score_vag <= 7 | Click the "Movie Showings" button, click on the first movie under "Now Playing," and then click the "Want to Watch" button. | 3 | 1
   |                             | Click the "Takeout" button, click on the "Food" section, and then click to enter the first store in the store list. | 3 | 1
   |                             | View the detail page of the first notification in the notifications list. | 3 | 3
L3 | Step_ideal + Score_vag > 7  | Search for "Hello World" and then play any song from the search list. | 4 | 4
   |                             | Click the "Jobs" tab, search for QA Engineer in the search bar, click to view the details of any job from the search results, and then click Save. | 6 | 1.5
   |                             | Send an email to {address1} and {address2} with the subject "Test" and the body "Hello". | 8 | 2.66
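As a small worked illustration of Table 2 (our own helper, not part of AUITestAgent), the function below computes Score_vag from the two quantities just defined and assigns the difficulty level:

```python
def difficulty_level(step_ideal: int, num_commands: int) -> str:
    score_vag = step_ideal / num_commands  # vagueness: ideal steps per command
    total = step_ideal + score_vag
    if total <= 3:
        return "L1"
    return "L2" if total <= 7 else "L3"

# "Send an email to {address1} and {address2} ...": 8 ideal steps, 3 commands
assert difficulty_level(8, 3) == "L3"   # 8 + 2.66 > 7
# "Click the Like button on the first post": 1 step, 1 command
assert difficulty_level(1, 1) == "L1"   # 1 + 1 <= 3
```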

For RQ2, as the apps we selected are not open-source and bugs are not common in the online versions of these mature commercial apps, we needed a robust method to effectively measure AUITestAgent's performance in executing function verification. Two authors of this paper analyzed 50 recent historical anomalies at Meituan, categorizing them into three types (as shown in Table 3). Based on these categories, we manually constructed 20 test oracles. Specifically, we selected 20 interaction tasks from the 30 crafted for RQ1 and manually created corresponding GUI anomaly manifestations. Consequently, as shown in Table 3, the data for RQ2 includes 20 test oracles, with each oracle associated with two interaction traces, one correct and the other conflicting with a verification point. Therefore, there are 40 verification tasks for RQ2, half of which contain anomalies.

Table 3: The categories of test oracles.

Category | Definition | Examples
No Response | App does not react when it should have. | No change after clicking order. / Stays on message edit page after sending.
Unexpected Result | App's response violates the original functional logic. | Likes decreased by 1 after liking. / Review page opened after clicking install.
Data Inconsistency | Self-conflicts of data presented in the GUI. | Ratings for the same seller are inconsistent across pages. / Price does not equal original minus discount.

For RQ3, we applied AUITestAgent to 18 test requirements from real business scenarios at Meituan. By presenting the bugs identified by AUITestAgent in real scenarios, we demonstrate its practical effectiveness.

4.1.2 Baselines. To demonstrate the advantages of AUITestAgent in converting natural language into GUI interactions, we compared it with two state-of-the-art natural language-driven GUI interaction tools. AppAgent [35] is the first to propose a natural language-driven approach for mobile app interaction, featuring an innovative design that incorporates an exploratory learning phase followed by interaction. This addresses the challenge of LLMs' lack of familiarity with apps by offering two learning modes: unsupervised exploration and human demonstration. For fairness, in RQ1, we utilized the unsupervised exploration mode and set the maximum number of exploration steps to Step_ideal + 5.

Table 4: Quality comparison of converted GUI interactions (per difficulty level: TC / CS / CT / SE).

Tool | L1: TC CS CT SE | L2: TC CS CT SE | L3: TC CS CT SE | Avg.: TC CS CT SE
AUITestAgent | 1.00 0.97 1.00 0.97 | 0.80 0.91 0.93 1.00 | 0.50 0.96 0.86 1.00 | 0.77 0.94 0.93 0.99
MobileAgent  | 0.70 0.68 0.80 0.90 | 0.60 0.58 0.78 0.85 | 0.20 0.54 0.64 1.00 | 0.50 0.60 0.74 0.90
AppAgent     | 0.60 0.61 0.95 0.77 | 0.30 0.43 0.62 1.00 | 0.30 0.45 0.55 0.81 | 0.40 0.49 0.70 0.84

Different from AppAgent, MobileAgent [28] employs a visual perception module for operation localization based solely on screenshots, without needing underlying files. It utilizes LLM agents for holistic task planning and incorporates a self-reflection method to correct errors and halt once tasks are completed.

Since AUITestAgent is the first to focus on natural language-driven GUI function verification and there are no existing studies in this field, we constructed a verification method based on multi-turn dialogue [32] as a baseline. Specifically, we constructed a prompt of a two-turn dialogue with the MLLM, as shown in Figure 5.

Figure 5: Prompt of RQ2 Baseline

4.1.3 Implementation and Configuration. For RQ1 and RQ2, to explore the upper limits of AUITestAgent's capabilities, we selected one of the state-of-the-art MLLMs, GPT-4o [2], as the underlying model for AUITestAgent and the three baselines. For AUITestAgent's practical implementation within Meituan, we select another MLLM, optimizing for efficiency. Given that all of these methods are device-independent, we conducted our experiments on multiple Android devices via the adb tool. For fairness, in RQ1, we set the maximum number of trial steps for all candidates to Step_ideal + 5.

4.2 RQ1: Quality of GUI Interactions
Initially, three authors manually executed each of the 30 interaction tasks. For each task, we recorded the manual interaction trace as Trace^{ma}_i, which serves as the ground truth for interaction task i:

Trace^{ma}_i = \{action_1, \ldots, action_k\}   (1)

The ideal number of interaction steps, Step_ideal, is defined as the length of Trace^{ma}_i. For the action sequence generated by tool x, we denote it as Trace^{x}_i:

Trace^{x}_i = \{action_1, \ldots, action_m\}   (2)

Due to the trial-and-error interaction strategy of the baseline interaction tools, we incorporated a parser in RQ1 to filter out redundant actions (i.e., reverting and repeating the last action) and meaningless interactions (e.g., not clicking on a UI element) from Trace^{x}_i. The filtered result is denoted as Trace\_P^{x}_i.

Four metrics are used to evaluate the interaction quality of AUITestAgent and the baselines.
• Task Completion (TC): indicates whether an interaction task has been successfully executed.
• Correct Step (CS): represents the proportion of correct actions in Trace^{x}_i.
• Correct Trace (CT): represents the longest common prefix between Trace^{x}_i and the ground truth Trace^{ma}_i, indicating the degree of completion of interaction task i.
• Step Efficiency (SE): measures the interaction efficiency of the action traces generated by AUITestAgent and the baselines when the task is completed, represented by the ratio of Trace\_P^{x}_i to Trace^{x}_i.

The formal representations of these metrics are as follows:

TC_i = \begin{cases} 1 & \text{if } Trace\_P^{x}_i = Trace^{ma}_i \\ 0 & \text{otherwise} \end{cases}   (3)

CS_i = |Trace^{x}_i \cap Trace^{ma}_i| \,/\, |Trace^{x}_i|   (4)

CT_i = |LCP(Trace^{x}_i, Trace^{ma}_i)| \,/\, |Trace^{ma}_i|   (5)

SE_i = \begin{cases} |Trace\_P^{x}_i| \,/\, |Trace^{x}_i| & \text{if } TC_i = 1 \\ 0 & \text{otherwise} \end{cases}   (6)
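For reference, the metrics can be computed directly from these definitions. The sketch below is our own restatement; in particular, it reads the intersection in Equation (4) as a multiset overlap of actions, which is one reasonable interpretation.

```python
from collections import Counter

def lcp_len(a: list[str], b: list[str]) -> int:
    # Length of the longest common prefix of two action sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def metrics(trace_x, trace_p_x, trace_ma):
    tc = 1 if trace_p_x == trace_ma else 0                    # Eq. (3)
    overlap = sum((Counter(trace_x) & Counter(trace_ma)).values())
    cs = overlap / len(trace_x)                               # Eq. (4)
    ct = lcp_len(trace_x, trace_ma) / len(trace_ma)           # Eq. (5)
    se = len(trace_p_x) / len(trace_x) if tc == 1 else 0.0    # Eq. (6)
    return tc, cs, ct, se
```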

Table 5: Effectiveness comparison of function verification (per task type: Oracle Acc. / Point Acc. / Reason Acc. / Completion Tokens / Prompt Tokens).

Method | Correct Function Verification | Anomaly Detection
AUITestAgent | 0.95 / 0.97 / 0.91 / 1345 / 34987 | 0.85 / 0.91 / 0.84 / 1288 / 33715
GPT-4o       | 0.90 / 0.95 / 0.95 /  123 /  5695 | 0.30 / 0.53 / 0.55 /  129 /  5695

As shown in Table 4, AUITestAgent achieved the highest scores across all four metrics, with an overall task completion rate of 77%, significantly outperforming the two baseline methods. Although AppAgent incorporates an exploratory process, its direct use of the unprocessed UI hierarchy as input for LLMs results in excessively long and noisy data due to the complex GUI interfaces of the popular apps selected for RQ1. Consequently, this approach leads to the lowest performance metrics for AppAgent across all evaluated criteria. This also demonstrates the effectiveness of the multi-source GUI understanding implemented in AUITestAgent. Furthermore, as the difficulty of interaction tasks increases, all four metrics for AUITestAgent and the baselines decline. Although Mobile-Agent performs well at levels L1 and L2, it struggles with tasks that involve numerous interaction commands or require concise execution. This reflects the significance of the design of AUITestAgent's GUI interaction module.

Therefore, the conclusion of RQ1 is that AUITestAgent can effectively convert GUI interaction tasks of different difficulty levels into corresponding GUI actions. Since the subsequent function verification requires that interaction tasks be completed first, AUITestAgent possesses the capability to carry out subsequent function verification.

4.3 RQ2: Effectiveness of Function Verification
Since some test oracles in Meituan contain multiple verification points, 8 of the 20 test oracles we designed for RQ2 include multiple verification points. Before conducting the experiments, we manually divided these test oracles into 32 verification points to facilitate subsequent analysis.

We evaluated the verification performance of AUITestAgent and the multi-turn dialogue-based baseline from three perspectives: test oracle, verification point, and reasoning in the explanation. To ensure fairness, we also tracked the number of tokens consumed by each method in the experiment. Specifically, we designed five metrics for RQ2. Oracle Acc. represents the proportion of correctly judged tasks. A task is considered correctly judged by an oracle only if all included verification points are accurately assessed and the corresponding explanations are reasonable. Similarly, Point Acc. and Reasoning Acc. measure the accuracy of verification point judgments and the correctness of their explanations. Additionally, we present the Completion Tokens and Prompt Tokens for each task, which indicate the economic costs associated with each method.

As shown in Table 5, AUITestAgent achieved an accuracy of 0.90 in making correct judgments on test oracles across the two types of tasks. Due to the greater difficulty of anomaly detection compared to correct function verification, AUITestAgent achieved an Oracle Acc. of 0.85 in anomaly detection, slightly lower than in correct function verification. Considering that the 40 cases in RQ2 contain 64 verification points, of which 20 have anomalies, AUITestAgent successfully identified 18 anomalous points, with 2 false positives. This corresponds to a recall rate of 90% and a false positive rate of 4.5%. Conversely, the baseline method reached an accuracy of 1.0 in the correct function verification task but only achieved an Oracle Acc. of 0.35 in anomaly detection, indicating its difficulty in understanding the questions in RQ2 and its tendency to output a 'no anomaly' judgment. This data not only highlights the challenges LLMs face in handling GUI functional testing but also validates the effectiveness of the multi-dimensional data extraction in AUITestAgent.

4.4 RQ3: Performance in Practical Implementations
In RQ3, we discuss the practical implementation of AUITestAgent in a real business scenario, i.e., the Video scenario of Meituan's app, to show its practical effectiveness. Specifically, 10 test requirements with L1/L2-level interaction commands are selected to cover various checkpoints in short-video playback and live-show scenarios. These requirements have been integrated into the weekly regression automation process for this business line in Meituan.

The Video scenario within Meituan, in addition to supporting normal video playback functions (pause and play, dragging the progress bar, and swiping up and down to switch videos), also supports clicking on merchant and product cards to view merchant information and purchase products. The GUI pages in this scenario are complex, with numerous checkpoints that need testing. In the past, each iteration relied on manually writing test scripts or performing manual interactions for verification. These 10 test requirements contain 18 verification points in the video scenario, saving 1.5 Person-Days (PD) in each round of regression testing. Since AUITestAgent's launch, it has detected 4 new bugs in the video scenario of Meituan's app during 10 rounds of regression testing. Details of these bugs are shown in Table 6.

5 THREATS TO VALIDITY
One potential threat arises from the fact that AUITestAgent employs a structure that decouples interaction from verification. Specifically, in industry testing scenarios, a small number of requirements necessitate performing interactions during function verification (e.g., repeatedly clicking the next video on a video playback page until a hotel-related live stream appears, then checking the details of the business to ensure the business name matches the live stream). Currently, AUITestAgent lacks the capability to verify such requirements. Given their rarity, this limitation does not have much impact on AUITestAgent's applicability in industry settings.

Another threat stems from AUITestAgent's reliance on general-purpose LLMs. General-purpose large language models such as GPT-4 are trained on extensive datasets [25]. However, these datasets lack domain knowledge crucial for GUI function testing, such as app function designs and GUI components.

Table 6: RQ3 Bug Information. (The Bug Image column of the original table is a screenshot and is omitted here.)

Test requirement: Swipe up once, tap on the shop information card at the bottom of the screen (the card contains information such as the shop name and rating), then swipe down once. Check if the shop details half-screen modal appears after tapping the shop information card in step two. Verify that the shop details half-screen modal disappears after swiping down in step three.

Test result: "reason": "Based on the user's navigation and interaction records, the test task was not successfully fulfilled. The records do not provide clear evidence that the store details half-popup layer was displayed or dismissed. Specifically, after the user clicked on the store information card (Action 1), there is no indication on Page 2 that the store details half-popup layer was shown. Additionally, after the user swiped down (Action 2), Page 3 does not indicate that the store details half-popup layer was dismissed. Both Page 0 and Page 3 explicitly state that the store details half-popup layer is 'invisible'. Therefore, the required checks for the popup layer's visibility and dismissal were not confirmed."

Consequently, we acknowledge the current limitations of AUITestAgent's capabilities. In the future, we plan to construct a dataset specific to GUI functions and enhance AUITestAgent's proficiency in UI testing by employing a retrieval-augmented generation (RAG) approach [18].

6 RELATED WORK
Since the quality of the GUI is closely related to user experience, various testing methods targeting bugs in GUIs have been proposed [6, 17, 31]. Traditionally, developing GUI testing scripts has been the main method for automating GUI function tests in industry practice. Since these scripts are labor-intensive to develop and easily become obsolete, several studies have been conducted on automatically generating or repairing GUI testing scripts. For instance, AppFlow [12] utilizes machine learning techniques to automatically identify screen components, enabling testers to write modular testing libraries for essential functions of applications. CraftDroid [19] uses information retrieval to extract human knowledge from existing test components of applications to test other programs. CoSer [9] constructs UI state transition graphs by analyzing app source code and test scripts to repair obsolete scripts. Although AUITestAgent is also a method for testing GUI functions, it differs from these studies as it neither relies on nor generates hard-coded testing scripts, thereby offering better generalizability and maintainability.

Due to the extensive scale of their training data and their robust logical reasoning abilities, LLMs are increasingly utilized in mobile app testing. Several methods have been proposed that use LLMs to generate GUI interactions or translate natural language commands into GUI actions. QTypist [21] focuses on generating semantic textual inputs for form pages to enhance exploration testing coverage. GPTDroid [22] extracts GUI page information and widget functionality from the UI hierarchy file, using this data to generate human-like interactions. AppAgent [35], Mobile-Agent [28], DroidBot-GPT [30], and AutoDroid [29] leverage LLMs to process natural language descriptions and GUI pages, translating natural language commands into GUI actions. Different from these tools, AUITestAgent can directly perform GUI testing on mobile apps. Moreover, our experiments demonstrate that the techniques employed by AUITestAgent, including dynamically organizing agents, lead to the highest quality of GUI interaction generation.

VisionDroid [23] focuses on non-crash bug detection on GUI pages, highlighting the absence of testing oracles and utilizing large language models (LLMs) to detect unexpected behaviors. In contrast, AUITestAgent concentrates on the industry's practical needs. By emphasizing the challenges of implementing practical testing requirements, we have developed an industry-applicable automatic natural language-driven GUI testing method.

7 CONCLUSION
In this paper, we propose AUITestAgent, an automatic approach to perform natural language-driven GUI testing for mobile apps. To extract GUI interactions from test requirements, AUITestAgent utilizes dynamically organized agents and constructs multi-source input for them. Following this, a multi-dimensional data extraction strategy is employed to retrieve data relevant to the test requirements from the interaction trace to perform verifications. Our experiments on customized benchmarks show that AUITestAgent significantly outperforms existing methods in GUI interaction generation and can recall 90% of injected bugs with a 4.5% false positive rate. Furthermore, previously unseen bugs detected at Meituan show the practical benefits of using AUITestAgent to conduct GUI testing for complex commercial apps. These findings highlight the potential of AUITestAgent to automate the GUI testing procedure in practical mobile apps.

REFERENCES
[1] Android Debug Bridge (adb). https://developer.android.com/tools/adb. Accessed: 2024.
[2] Hello GPT-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2024.
[3] Layouts in Views. https://developer.android.com/develop/ui/views/layout/declaring-layout. Accessed: 2024.
[4] Meituan-Dianping/vision-ui: Visual UI analysis tools. https://github.com/Meituan-Dianping/vision-ui. Accessed: 2024.
[5] OpenAI Platform: JSON mode. https://platform.openai.com/docs/guides/text-generation/json-mode. Accessed: 2024.
[6] UI/Application Exerciser Monkey. https://developer.android.com/studio/test/other-testing-tools/monkey. Accessed: 2024.
[7] WebView | Android Developers. https://developer.android.com/reference/android/webkit/WebView. Accessed: 2024.
[8] XML Path Language (XPath). https://www.w3.org/TR/1999/REC-xpath-19991116/. Accessed: 2024.
[9] Shaoheng Cao, Minxue Pan, Yu Pei, Wenhua Yang, Tian Zhang, Linzhang Wang, and Xuandong Li. 2024. Comprehensive Semantic Repair of Obsolete GUI Test Scripts for Mobile Applications. In ICSE 2024. ACM, 90:1–90:13. https://doi.org/10.1145/3597503.3639108
[10] Sidong Feng and Chunyang Chen. 2024. Prompting Is All You Need: Automated Android Bug Replay with Large Language Models. In ICSE 2024. ACM, 67:1–67:13. https://doi.org/10.1145/3597503.3608137
[11] Tianxiao Gu, Chengnian Sun, Xiaoxing Ma, Chun Cao, Chang Xu, Yuan Yao, Qirun Zhang, Jian Lu, and Zhendong Su. 2019. Practical GUI testing of Android applications via model abstraction and refinement. In ICSE 2019. IEEE/ACM, 269–280.
[12] Gang Hu, Linjie Zhu, and Junfeng Yang. 2018. AppFlow: using machine learning to synthesize robust, reusable UI tests. In ESEC/FSE 2018. ACM, 269–282.
[13] Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. 2024. BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions. In AAAI 2024. AAAI Press, 2256–2264. https://doi.org/10.1609/AAAI.V38I3.27999
[14] Yongxiang Hu, Jiazhen Gu, Shuqing Hu, Yu Zhang, Wenjie Tian, Shiyu Guo, Chaoyi Chen, and Yangfan Zhou. 2023. Appaction: Automatic GUI Interaction for Mobile Apps via Holistic Widget Perception. In ESEC/FSE 2023. ACM, 1786–1797. https://doi.org/10.1145/3611643.3613885
[15] Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong. 2024. Visual Hallucinations of Multi-modal Large Language Models. CoRR abs/2402.14683 (2024). https://doi.org/10.48550/ARXIV.2402.14683
[16] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 55, 12 (2023), 248:1–248:38. https://doi.org/10.1145/3571730
[17] Wing Lam, Zhengkai Wu, Dengfeng Li, Wenyu Wang, Haibing Zheng, Hui Luo, Peng Yan, Yuetang Deng, and Tao Xie. 2017. Record and replay for Android: are we there yet in industrial cases? In ESEC/FSE 2017. ACM, 854–859. https://doi.org/10.1145/3106237.3117769
[18] Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS 2020.
[19] Jun-Wei Lin, Reyhaneh Jabbarvand, and Sam Malek. 2019. Test Transfer Across Mobile Apps Through Semantic Mapping. In ASE 2019. IEEE, 42–53.
[20] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguistics 12 (2024), 157–173. https://doi.org/10.1162/TACL_A_00638
[21] Zhe Liu, Chunyang Chen, Junjie Wang, Xing Che, Yuekai Huang, Jun Hu, and Qing Wang. 2023. Fill in the Blank: Context-aware Automated Text Input Generation for Mobile GUI Testing. In ICSE 2023. IEEE, 1355–1367.
[22] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2024. Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions. In ICSE 2024. ACM, 100:1–100:13. https://doi.org/10.1145/3597503.3639180
[23] Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, and Qing Wang. 2024. Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model. CoRR abs/2407.03037 (2024).
[24] Ke Mao, Mark Harman, and Yue Jia. 2016. Sapienz: multi-objective automated testing for Android applications. In ISSTA 2016. ACM, 94–105.
[25] OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023). https://doi.org/10.48550/ARXIV.2303.08774
[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML 2021. PMLR, 8748–8763.
[27] Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. 2017. Guided, stochastic model-based GUI testing of Android apps. In ESEC/FSE 2017. ACM, 245–256.
[28] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. CoRR abs/2401.16158 (2024). https://doi.org/10.48550/ARXIV.2401.16158
[29] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023. Empowering LLM to use Smartphone for Intelligent Task Automation. CoRR abs/2308.15272 (2023).
[30] Hao Wen, Hongming Wang, Jiaxuan Liu, and Yuanchun Li. 2023. DroidBot-GPT: GPT-powered UI Automation for Android. CoRR abs/2304.07061 (2023).
[31] Yiheng Xiong, Mengqian Xu, Ting Su, Jingling Sun, Jue Wang, He Wen, Geguang Pu, Jifeng He, and Zhendong Su. 2023. An Empirical Study of Functional Bugs in Android Apps. In ISSTA 2023. ACM, 1319–1331. https://doi.org/10.1145/3597926.3598138
[32] Yi Xu, Hai Zhao, and Zhuosheng Zhang. 2021. Topic-Aware Multi-turn Dialogue Modeling. In AAAI 2021. AAAI Press, 14176–14184.
[33] An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian J. McAuley, Jianfeng Gao, Zicheng Liu, and Lijuan Wang. 2023. GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. CoRR abs/2311.07562 (2023). https://doi.org/10.48550/ARXIV.2311.07562
[34] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. CoRR abs/2310.11441 (2023). https://doi.org/10.48550/ARXIV.2310.11441
[35] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal Agents as Smartphone Users. CoRR abs/2312.13771 (2023). https://doi.org/10.48550/ARXIV.2312.13771
