ChatDev: Communicative Agents for Software Development

Chen Qian⋆ Wei Liu⋆ Hongzhang Liu♠ Nuo Chen⋆ Yufan Dang⋆
Jiahao Li⋆ Cheng Yang♣ Weize Chen⋆ Yusheng Su⋆ Xin Cong⋆
Juyuan Xu⋆ Dahai Li♦ Zhiyuan Liu⋆B Maosong Sun⋆B

⋆ Tsinghua University  ♠ The University of Sydney  ♣ BUPT  ♦ Modelbest Inc.
[email protected]  [email protected]  [email protected]
B: Corresponding Author.

arXiv:2307.07924v5 [cs.SE] 5 Jun 2024

Abstract

Software development is a complex task that necessitates cooperation among multiple members with diverse skills. Numerous studies used deep learning to improve specific phases in a waterfall model, such as design, coding, and testing. However, the deep learning model in each phase requires unique designs, leading to technical inconsistencies across various phases, which results in a fragmented and ineffective development process. In this paper, we introduce ChatDev, a chat-powered software development framework in which specialized agents driven by large language models (LLMs) are guided in what to communicate (via chat chain) and how to communicate (via communicative dehallucination). These agents actively contribute to the design, coding, and testing phases through unified language-based communication, with solutions derived from their multi-turn dialogues. We found their utilization of natural language is advantageous for system design, and communicating in programming language proves helpful in debugging. This paradigm demonstrates how linguistic communication facilitates multi-agent collaboration, establishing language as a unifying bridge for autonomous task-solving among LLM agents. The code and data are available at https://github.com/OpenBMB/ChatDev.

[Figure 1: a user request ("Develop a Gomoku game") flows through ChatDev's communicating software agents, which jointly produce the software along with its codes and docs.]

Figure 1: ChatDev, a chat-powered software development framework, integrates LLM agents with various social roles, working autonomously to develop comprehensive solutions via multi-agent collaboration.

1 Introduction

Large language models (LLMs) have led to substantial transformations due to their ability to effortlessly integrate extensive knowledge expressed in language (Brown et al., 2020; Bubeck et al., 2023), combined with their strong capacity for role-playing within designated roles (Park et al., 2023; Hua et al., 2023; Chen et al., 2023b). This advancement eliminates the need for model-specific designs and delivers impressive performance in diverse downstream applications. Furthermore, autonomous agents (Richards, 2023; Zhou et al., 2023a) have gained attention for enhancing the capabilities of LLMs with advanced features such as context-aware memory (Sumers et al., 2023), multi-step planning (Liu et al., 2023), and strategic tool using (Schick et al., 2023).

Software development is a complex task that necessitates cooperation among multiple members with diverse skills (e.g., architects, programmers, and testers) (Basili, 1989; Sawyer and Guinan, 1998). This entails extensive communication among different roles to understand and analyze requirements through natural language, while also encompassing development and debugging using programming languages (Ernst, 2017; Banker et al., 1998). Numerous studies use deep learning to improve specific phases of the waterfall model in software development, such as design, coding, and testing (Pudlitz et al., 2019; Martín and Abran, 2015; Gao et al., 2019; Wang et al., 2016). Due to the technical inconsistencies among these phase-specific models, the methods employed in different phases have remained isolated until now. Every phase, from data collection and labeling to model training and inference, requires its unique designs, leading to a fragmented and less efficient development process in the field (Freeman et al., 2001; Ernst, 2017; Winkler et al., 2020).

Motivated by the expert-like potential of autonomous agents, we aim to establish language as a unifying bridge: utilizing multiple LLM-powered agents with specialized roles for cooperative software development through language-based communication across different phases, with solutions in different phases derived from their multi-turn dialogues, whether dealing with text or code. Nevertheless, due to the tendency of LLMs to hallucinate (Dhuliawala et al., 2023; Zhang et al., 2023b), the strategy of generating software through communicative agents could lead to the non-trivial challenge of coding hallucinations, which involve the generation of source code that is incomplete, unexecutable, or inaccurate, ultimately failing to fulfill the intended requirements (Agnihotri and Chug, 2020). The frequent occurrence of coding hallucinations in turn reflects the constrained autonomy of agents in task completion, inevitably demanding additional manual intervention and thereby hindering the immediate usability and reliability of the generated software (Ji et al., 2023).

In this paper, we propose ChatDev (see Figure 1), a chat-powered software-development framework integrating multiple "software agents" for active involvement in three core phases of the software lifecycle: design, coding, and testing. Technically, ChatDev uses a chat chain to further divide each phase into smaller subtasks, enabling agents' multi-turn communications to cooperatively propose and develop solutions (e.g., creative ideas or source code). The chain-structured workflow guides agents on what to communicate, fostering cooperation and smoothly linking natural- and programming-language subtasks to propel problem-solving. Additionally, to minimize coding hallucinations, ChatDev includes a communicative dehallucination mechanism, enabling agents to actively request more specific details before giving direct responses. The communication pattern instructs agents on how to communicate, enabling precise information exchange for effective solution optimization while reducing coding hallucinations. We built a comprehensive dataset containing software requirement descriptions and conducted comprehensive analyses. The results indicate that ChatDev notably improves the quality of software, leading to improved completeness, executability, and better consistency with requirements. Further investigations reveal that natural-language communications contribute to comprehensive system design, while programming-language communications drive software optimization. In summary, the proposed paradigm demonstrates how linguistic communication facilitates multi-agent collaboration, establishing language as a unifying bridge for autonomous task-solving among LLM agents.

2 Related Work

Trained on vast datasets to comprehend and manipulate billions of parameters, LLMs have become pivotal in natural language processing due to their seamless integration of extensive knowledge (Brown et al., 2020; Bubeck et al., 2023; Vaswani et al., 2017; Radford et al.; Touvron et al., 2023; Wei et al., 2022a; Shanahan et al., 2023; Chen et al., 2021; Brants et al., 2007; Ouyang et al., 2022; Yang et al., 2023a; Qin et al., 2023b; Kaplan et al., 2020). Furthermore, LLMs have demonstrated strong role-playing abilities (Li et al., 2023a; Park et al., 2023; Hua et al., 2023; Chan et al., 2023; Zhou et al., 2023b; Chen et al., 2023b,a; Cohen et al., 2023; Li et al., 2023b). Recent progress, particularly in the field of autonomous agents (Zhou et al., 2023a; Wang et al., 2023a; Park et al., 2023; Wang et al., 2023e; Richards, 2023; Osika, 2023; Wang et al., 2023d), is largely attributed to the foundational advances in LLMs. These agents utilize the robust capabilities of LLMs, displaying remarkable skills in memory (Park et al., 2023; Sumers et al., 2023), planning (Chen et al., 2023b; Liu et al., 2023) and tool use (Schick et al., 2023; Cai et al., 2023; Qin et al., 2023a; Ruan et al., 2023; Yang et al., 2023b), enabling them to reason in complex scenarios (Wei et al., 2022b; Zhao et al., 2023; Zhou et al., 2023a; Ma et al., 2023; Zhang et al., 2023a; Wang et al., 2023b; Ding et al., 2023; Weng, 2023).
[Figure 2: the chat chain. Phases (Design, Coding, Testing) divide into subtasks (Design; Coding; Code Complete; Code Review; System Testing), each driven by an instructor→assistant agent pair (CEO→CTO, CTO→Programmer, CTO→Programmer, Reviewer→Programmer, Tester→Programmer) passing {task}, {ideas}, and {code} along the chain.]

Figure 2: Upon receiving a preliminary task requirement (e.g., "develop a Gomoku game"), these software agents engage in multi-turn communication and perform instruction-following along a chain-structured workflow, collaborating to execute a series of subtasks autonomously to craft a comprehensive solution.

Software development is a multifaceted and intricate process that requires the cooperation of multiple experts from various fields (Yilmaz et al., 2012; Acuna et al., 2006; Basili, 1989; Sawyer and Guinan, 1998; Banker et al., 1998; France and Rumpe, 2007), encompassing the requirement analysis and system design in natural languages (Pudlitz et al., 2019; Martín and Abran, 2015; Nahar et al., 2022), along with system development and debugging in programming languages (Gao et al., 2019; Wang et al., 2016; Wan et al., 2022). Numerous studies employ the waterfall model, a particular software development life cycle, to segment the process into discrete phases (e.g., design, coding, testing) and apply deep learning to improve the effectiveness of certain phases (Winkler et al., 2020; Ezzini et al., 2022; Thaller et al., 2019; Zhao et al., 2021; Nijkamp et al., 2023; Wan et al., 2018; Wang et al., 2021).

3 ChatDev

We introduce ChatDev, a chat-powered software-development framework that integrates multiple "software agents" with various social roles (e.g., requirements analysts, professional programmers and test engineers) collaborating in the core phases of the software life cycle; see Figure 1. Technically, to facilitate cooperative communication, ChatDev introduces a chat chain to further break down each phase into smaller, manageable subtasks, which guides multi-turn communications between different roles to propose and validate solutions for each subtask. In addition, to alleviate unexpected hallucinations, a communicative pattern named communicative dehallucination is devised, wherein agents request more detailed information before responding directly and then continue the next round of communication based on these details.

3.1 Chat Chain

Although LLMs show a good understanding of natural and programming languages, efficiently transforming textual requirements into functional software in a single step remains a significant challenge. ChatDev thus adopts the core principles of the waterfall model, using a chat chain (C) with sequential phases (P), each comprising sequential subtasks (T). Specifically, ChatDev segments the software development process into three sequential phases: design, coding, and testing. The coding phase is further subdivided into subtasks of code writing and completion, and the testing phase is segmented into code review (static testing) and system testing (dynamic testing), as illustrated in Figure 2. In every subtask, two agents, each with their own specialized roles (e.g., a reviewer skilled at identifying endless loops and a programmer adept in GUI design), perform the functions of an instructor (I) and an assistant (A). The instructor agent initiates instructions, instructing (→) the discourse toward the completion of the subtask, while the assistant agent adheres to these instructions and responds with (⇝) appropriate solutions. They engage in a multi-turn dialogue (C), working cooperatively until they achieve consensus, extracting (τ) solutions that can range from text (e.g., defining a software function point) to code (e.g., creating the initial version of source code), ultimately leading to the completion of the subtask. The entire task-solving process along the agentic workflow can be formulated as:

    C = ⟨P^1, P^2, ..., P^{|C|}⟩
    P^i = ⟨T^1, T^2, ..., T^{|P^i|}⟩
    T^j = τ(C(I, A))                                        (1)
    C(I, A) = ⟨I → A, A ⇝ I⟩^⟲
The dual-agent communication design simplifies communications by avoiding complex multi-agent topologies, effectively streamlining the consensus-reaching process (Yin et al., 2023; Chen et al., 2023b). Subsequently, the solutions from previous tasks serve as bridges to the next phase, allowing a smooth transition between subtasks. This approach continues until all subtasks are completed. It is worth noting that the conceptually simple but empirically powerful chain-style structure guides agents on what to communicate, fostering cooperation and smoothly linking natural- and programming-language subtasks. It also offers a transparent view of the entire software development process, allowing for the examination of intermediate solutions and assisting in identifying possible problems.

Agentization. To enhance quality and reduce human intervention, ChatDev implements prompt engineering that only takes place at the start of each subtask round. As soon as the communication phase begins, the instructor and the assistant communicate with each other in an automated loop, continuing this exchange until the task concludes. However, simply exchanging responses cannot achieve effective multi-round task-oriented communication, since it inevitably faces significant challenges including role flipping, instruction repeating, and fake replies. These failures stall the progression of productive communication and hinder the achievement of meaningful solutions. ChatDev thus employs the inception prompting mechanism (Li et al., 2023a) for initiating, sustaining, and concluding agents' communication to guarantee a robust and efficient workflow. This mechanism is composed of the instructor system prompt P_I and the assistant system prompt P_A. The system prompts for both roles are mostly symmetrical, covering the overview and objectives of the current subtask, the specialized roles, the accessible external tools, the communication protocols, the termination conditions, and the constraints or requirements to avoid undesirable behaviors. Then, an instructor I and an assistant A are instantiated by hypnotizing LLMs via P_I and P_A:

    I = ρ(LLM, P_I),  A = ρ(LLM, P_A)                       (2)

where ρ is the role customization operation, implemented via system message assignment.
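As a rough illustration of the ρ operation, role customization reduces to fixing a structured system message before the dialogue starts. The template fields below mirror the inception-prompting components listed above; the wording and the `make_agent` helper are assumptions rather than ChatDev's actual prompts, and the sketch reuses the `llm_chat` placeholder from the previous sketch.

```python
# Sketch of agent instantiation via system-message assignment (Eq. 2).
# Template wording is illustrative; reuses llm_chat from the previous sketch.

INCEPTION_TEMPLATE = """You are {role}.
Subtask overview and objective: {task}
Communication protocol: reply to the {counterpart} turn by turn.
Termination: prefix your reply with <SOLUTION> once consensus is reached.
Constraints: {constraints}"""

def make_agent(role: str, counterpart: str, task: str, constraints: str):
    """rho(LLM, P): bind an LLM to a role by fixing its system prompt."""
    prompt = INCEPTION_TEMPLATE.format(role=role, counterpart=counterpart,
                                       task=task, constraints=constraints)
    return lambda memory: llm_chat(prompt, memory)

instructor = make_agent("a CTO instructing the implementation", "assistant",
                        "develop a Gomoku game", "keep instructions concrete")
assistant = make_agent("a programmer writing the code", "instructor",
                       "develop a Gomoku game", "no placeholder snippets")
```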
Memory. Note that the limited context length of common LLMs typically restricts the ability to maintain a complete communication history among all agents and phases. To tackle this issue, based on the nature of the chat chain, we accordingly segment the agents' context memories by their sequential phases, resulting in two functionally distinct types of memory: short-term memory and long-term memory. Short-term memory is utilized to sustain the continuity of the dialogue within a single phase, while long-term memory is leveraged to preserve contextual awareness across different phases.

Formally, short-term memory records an agent's current-phase utterances, aiding context-aware decision-making. At time t during phase P^i, we use I_t^i to represent the instructor's instruction and A_t^i for the assistant's response. The short-term memory M collects utterances up to time t as:

    M_t^i = ⟨(I_1^i, A_1^i), (I_2^i, A_2^i), ..., (I_t^i, A_t^i)⟩        (3)

In the next time step t+1, the instructor utilizes the current memory to generate a new instruction I_{t+1}^i, which is then conveyed to the assistant to produce a new response A_{t+1}^i. The short-term memory iteratively updates until the number of communications reaches the upper limit |M^i|:

    I_{t+1}^i = I(M_t^i),  A_{t+1}^i = A(M_t^i, I_{t+1}^i)
    M_{t+1}^i = M_t^i ∪ (I_{t+1}^i, A_{t+1}^i)                           (4)

To perceive dialogues from previous phases, the chat chain only transmits the solutions from previous phases as long-term memories M̃, integrating them at the start of the next phase and enabling the cross-phase transmission of long dialogues:

    I_1^{i+1} = M̃^i ∪ P_I^{i+1},  M̃^i = ⋃_{j=1}^{i} τ(M_{|M^j|}^j)       (5)

where P symbolizes a predetermined prompt that appears exclusively at the start of each phase.

By sharing only the solutions of each subtask rather than the entire communication history, ChatDev minimizes the risk of being overwhelmed by too much information, enhancing concentration on each task and encouraging more targeted cooperation, while simultaneously facilitating cross-phase context continuity.
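Read operationally, Equations (3)-(5) say that short-term memory is the growing utterance list inside one phase, while long-term memory is just the list of extracted solutions that seeds the next phase's opening instruction. The sketch below follows that reading, reusing the agents and `extract_solution` from the previous sketches; all names are illustrative.

```python
# Sketch of short-term vs. long-term memory (Eqs. 3-5); illustrative only.

def run_phase(phase_prompt: str, long_term: list[str],
              instructor, assistant, limit: int = 10) -> str:
    # Eq. 5: the opening instruction I_1 bundles long-term memory (previous
    # solutions) with the predetermined phase prompt P.
    short_term = [("user", "\n".join(long_term + [phase_prompt]))]
    for _ in range(limit):                     # Eq. 4: iterate up to |M^i|
        instruction = instructor(short_term)   # I_{t+1} = I(M_t)
        short_term.append(("instructor", instruction))
        response = assistant(short_term)       # A_{t+1} = A(M_t, I_{t+1})
        short_term.append(("assistant", response))
        if "<SOLUTION>" in response:
            break
    return extract_solution(short_term)        # tau(M_|M|)

long_term: list[str] = []  # M~: only solutions cross phase boundaries
for phase_prompt in ["Design the software.", "Write the code.", "Test it."]:
    solution = run_phase(phase_prompt, long_term, instructor, assistant)
    long_term.append(solution)  # the phase's dialogue itself is discarded
```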
3.2 Communicative Dehallucination

LLM hallucinations manifest when models generate outputs that are nonsensical, factually incorrect, or inaccurate (Dhuliawala et al., 2023; Zhang et al., 2023b). This issue is particularly concerning in software development, where programming languages demand precise syntax; the absence of even a single line can lead to system failure. We have observed that LLMs often produce coding hallucinations, which encompass potential issues like incomplete implementations, unexecutable code, and inconsistencies that don't meet requirements. Coding hallucinations frequently appear when the assistant struggles to precisely follow instructions, often due to the vagueness and generality of certain instructions that require multiple adjustments, making it challenging for agents to achieve full compliance. Inspired by this, we introduce communicative dehallucination, which encourages the assistant to actively seek more detailed suggestions from the instructor before delivering a formal response.

Specifically, a vanilla communication pattern between the assistant and the instructor follows a straightforward instruction-response format:

    ⟨I → A, A ⇝ I⟩^⟲                                        (6)

In contrast, our communicative dehallucination mechanism features a deliberate "role reversal", where the assistant takes on an instructor-like role, proactively seeking more specific information (e.g., the precise name of an external dependency and its related class) before delivering a conclusive response. After the instructor provides a specific modification suggestion, the assistant proceeds to perform precise optimization:

    ⟨I → A, ⟨A → I, I ⇝ A⟩^⟲, A ⇝ I⟩^⟲                       (7)

Since this mechanism tackles one concrete issue at a time, it requires multiple rounds of communication to optimize various potential problems. The communication pattern instructs agents on how to communicate, enabling finer-grained information exchange for effective solution optimization, which practically aids in reducing coding hallucinations.
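The nested pattern in Equation (7) can be pictured as an inner clarification loop before each formal answer: the assistant first queries the instructor for missing specifics, then commits to a response. In the sketch below, the `<ASK>` tag that signals a clarification request is an assumed convention, not ChatDev's exact protocol.

```python
# Sketch of communicative dehallucination (Eq. 7). The <ASK> tag is an
# assumed signal for "assistant requests details", not ChatDev's protocol.

def assistant_turn_with_dehallucination(memory, instructor, assistant,
                                        max_clarifications: int = 3) -> str:
    """Role reversal: the assistant may query the instructor (A -> I, I ~> A)
    several times before giving its formal response (A ~> I)."""
    for _ in range(max_clarifications):
        draft = assistant(memory)
        if "<ASK>" not in draft:       # nothing vague left: answer directly
            return draft
        memory.append(("assistant", draft))   # e.g. "<ASK> which GUI library?"
        detail = instructor(memory)           # instructor supplies the specifics
        memory.append(("instructor", detail))
    return assistant(memory)  # answer with whatever detail was gathered
```

Substituting this turn for the assistant's plain response in the earlier dialogue loop yields the pattern of Equation (7): one concrete issue is settled per inner loop, so several outer rounds may still be needed.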
4 Evaluation

Baselines. We chose some representative LLM-based software development methods as our baselines. GPT-Engineer (Osika, 2023) is a fundamental single-agent approach among LLM-driven software agents, with a precise understanding of task requirements and the application of one-step reasoning, which highlights its efficiency in generating detailed software solutions at the repository level. MetaGPT (Hong et al., 2023) is an advanced framework that allocates specific roles to various LLM-driven software agents and incorporates standardized operating procedures to enable multi-agent participation. In each step, agents with specific roles generate solutions by adhering to static instructions predefined by human experts.

Datasets. Note that, as of now, there isn't a publicly accessible dataset containing textual descriptions of software requirements in the context of agent-driven software development. To this end, we are actively working towards developing a comprehensive dataset for software requirement descriptions, which we refer to as SRDD (Software Requirement Description Dataset). Drawing on previous work (Li et al., 2023a), we utilize existing software descriptions as initial examples, which are then further developed through a process that combines LLM-based automatic generation with post-processing refinement guided by humans. As a result, this dataset includes important software categories from popular platforms such as Ubuntu, Google Play, Microsoft Store, and Apple Store. It comprises 1,200 software task prompts that have been carefully categorized into 5 main areas: Education, Work, Life, Game, and Creation. All these areas are further divided into 40 subcategories, and each subcategory contains 30 unique task prompts.
Metrics. Evaluating software is also a challenging task, especially when trying to assess it on a holistic level. Under the current limitation of scarce benchmark resources, traditional function-oriented code generation metrics (e.g., pass@k) cannot seamlessly transfer to a comprehensive evaluation of entire software systems. The main reason for this is that it is often impractical to develop manual or automated test cases for various types of software, especially those involving complex interfaces, frequent user interactions, or non-deterministic feedback. As an initial strategy, we apply three fundamental and objective dimensions that reflect different aspects of coding hallucinations to evaluate the agent-generated software, and then integrate them to facilitate a more holistic evaluation:

• Completeness measures the software's ability to fulfill code completion in software development, quantified as the percentage of software without any "placeholder" code snippets. A higher score indicates a higher probability of automated completion.

• Executability assesses the software's ability to run correctly within a compilation environment, quantified as the percentage of software that compiles successfully and can run directly. A higher score indicates a higher probability of successful execution.

• Consistency measures how closely the generated software code aligns with the original requirement description, quantified as the cosine distance between the semantic embeddings of the textual requirements and the generated software code.¹ A higher score indicates a greater degree of consistency with the requirements.

• Quality is a comprehensive metric that integrates various factors to assess the overall quality of software, quantified by multiplying² completeness, executability, and consistency. A higher quality score suggests a higher overall satisfaction with the software generated, implying a lower need for further manual intervention.

¹ Comments should be excluded from the code to avoid potential information leakage during evaluations.
² One can also choose to average the sub-metrics, which yields similar trends.
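Under these definitions, the metrics could be computed per generated program and then averaged over the dataset, roughly as below. The placeholder-detection regex, the compile-and-run check, and the embedding backend (the `embed` callable) are assumptions; note also that the cosine score is used as a similarity (higher means more consistent), and comments are stripped before embedding, per footnote 1.

```python
# Sketch of the four metrics; the regex, the run check, and the embedding
# backend are illustrative assumptions, not the paper's exact procedure.
import re
import subprocess
import numpy as np

PLACEHOLDER = re.compile(r"placeholder|\bTODO\b|\bpass\b\s*$", re.I | re.M)

def strip_comments(code: str) -> str:
    return "\n".join(line.split("#")[0] for line in code.splitlines())  # crude

def completeness(code: str) -> bool:
    return PLACEHOLDER.search(code) is None   # no "placeholder" snippets

def executability(entry_point: str) -> bool:
    try:  # GUI software may run until the timeout; counting that as a pass is a policy choice
        return subprocess.run(["python", entry_point], timeout=60).returncode == 0
    except subprocess.TimeoutExpired:
        return True
    except OSError:
        return False

def consistency(requirement: str, code: str, embed) -> float:
    r, c = embed(requirement), embed(strip_comments(code))  # footnote 1
    return float(np.dot(r, c) / (np.linalg.norm(r) * np.linalg.norm(c)))

def quality(requirement: str, code: str, entry_point: str, embed) -> float:
    # Quality multiplies the three sub-metrics (footnote 2 notes that
    # averaging them yields similar trends).
    return (completeness(code) * executability(entry_point)
            * consistency(requirement, code, embed))
```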
Implementation Details. We divided software development into 5 subtasks within 3 phases, assigning specific roles like CEO, CTO, programmer, reviewer, and tester. A subtask would terminate and reach a conclusion either after two unchanged code modifications or after 10 rounds of communication. During code completion, review, and testing, communicative dehallucination is activated. For ease of identifying solutions, the assistant begins responses with "<SOLUTION>" when a consensus is reached. We used ChatGPT-3.5 with a temperature of 0.2 and integrated Python-3.11.4 for feedback. All baselines in the evaluation share the same hyperparameters and settings for fairness.

4.1 Overall Performance

Method         Paradigm       Completeness   Executability   Consistency   Quality
GPT-Engineer   single-agent   0.5022†        0.3583†         0.7887†       0.1419†
MetaGPT        multi-agent    0.4834†        0.4145†         0.7601†       0.1523†
ChatDev        multi-agent    0.5600         0.8800          0.8021        0.3953

Table 1: Overall performance of the LLM-powered software development methods, encompassing both single-agent and multi-agent paradigms. Performance metrics are averaged for all tasks. The top scores are in bold, with the second-highest underlined. † indicates significant statistical differences (p≤0.05) between a baseline and ours.

As illustrated in Table 1, ChatDev outperforms all baseline methods across all metrics, showing a considerable margin of improvement. Firstly, the improvement of ChatDev and MetaGPT over GPT-Engineer demonstrates that complex tasks are difficult to solve in a single-step solution. Therefore, explicitly decomposing the difficult problem into several smaller, more manageable subtasks enhances the effectiveness of task completion. Additionally, in comparison to MetaGPT, ChatDev significantly raises the Quality from 0.1523 to 0.3953. This advancement is largely attributed to the agents employing a cooperative communication method, which involves autonomously proposing and continuously refining source code through a blend of natural and programming languages, as opposed to merely delivering responses based on human-predefined instructions. The communicative agents guide each subtask towards integrated and automated solutions, efficiently overcoming the restrictions typically linked to manually established optimization rules, and offering a more versatile and adaptable framework for problem-solving.

To further understand user preferences in practical settings, we use the setting adopted by Li et al. (2023a), where agent-generated solutions are compared in pairs by both human participants and the prevalent GPT-4 model to identify the preferred one.³ Table 2 shows ChatDev consistently outperforming other baselines, with higher average win rates in both GPT-4 and human evaluations.

³ For fairness, GPT-4's evaluation mitigated possible positional bias (Wang et al., 2023c), and human experts independently assessed the task solutions, randomized to prevent order bias.
Method         Evaluator   Baseline Wins   ChatDev Wins   Draw
GPT-Engineer   GPT-4       22.50%          77.08%         0.42%
GPT-Engineer   Human       9.18%           90.16%         0.66%
MetaGPT        GPT-4       37.50%          57.08%         5.42%
MetaGPT        Human       7.92%           88.00%         4.08%

Table 2: Pairwise evaluation results.

Method         Duration (s)   #Tokens       #Files   #Lines
GPT-Engineer   15.6000        7,182.5333    3.9475   70.2041
MetaGPT        154.0000       29,278.6510   4.4233   153.3000
ChatDev        148.2148       22,949.4450   4.3900   144.3450

Table 3: Software statistics include Duration (time consumed), #Tokens (number of tokens used), #Files (number of code files generated), and #Lines (total lines of code across all files) in the software generation process.

Furthermore, the software statistics presented in Table 3 indicate that the multi-agent paradigm, despite being slower and consuming more tokens than the single-agent method, yields a greater number of code files and a larger codebase, which may enhance the software's functionality and integrity. Analyzing the dialogues of agents suggests that the multi-agent communication method often leads agents to autonomously offer functional enhancements (e.g., GUI creation or increasing game difficulty), thereby potentially resulting in the incorporation of beneficial features that were not explicitly specified in requirements. Taking all these factors together, we posit that the fundamental characteristics of multi-agent software development take on greater significance, surpassing short-term concerns like time and economic costs in the current landscape.

4.2 Ablation Study

Variant     Completeness   Executability   Consistency   Quality
ChatDev     0.5600         0.8800          0.8021        0.3953
≤Coding     0.4100         0.7700          0.7958        0.2512
≤Complete   0.6250         0.7400          0.7978        0.3690
≤Review     0.5750         0.8100          0.7980        0.3717
≤Testing    0.5600         0.8800          0.8021        0.3953
⧹CDH        0.4700         0.8400          0.7983        0.3094
⧹Roles      0.5400         0.5800          0.7385        0.2212

Table 4: Ablation study on main components or mechanisms. ≤x denotes halting the chat chain after the completion of the x phase, and ⧹ denotes the removing operation. CDH denotes the communicative dehallucination mechanism.

This section examines key components or mechanisms within our multi-agent cooperation framework by removing particular phases in the chat chain, communicative dehallucination, or the roles assigned to all agents in their system prompts. Table 4 shows that the code complete phase enhances Completeness, with testing critical for Executability. Quality steadily rises with each step, suggesting that software development optimization is progressively attained through multi-phase communications among intelligent agents. Meanwhile, eliminating communicative dehallucination results in a decrease across all metrics, indicating its effectiveness in addressing coding hallucinations. Most interestingly, the most substantial impact on performance occurs when the roles of all agents are removed from their system prompts. Detailed dialogue analysis shows that assigning a "prefer GUI design" role to a programmer results in generated source code with relevant GUI implementations; in the absence of such role indications, it defaults to implementing unfriendly command-line-only programs. Likewise, assigning roles such as a "careful reviewer for bug detection" enhances the chances of discovering code vulnerabilities; without such roles, feedback tends to be high-level, leading to limited adjustments by the programmer. This finding underscores the importance of assigning roles in eliciting responses from LLMs, underscoring the significant influence of multi-agent cooperation on software quality.

4.3 Communication Analysis

Our agent-driven software development paradigm promotes cooperative agents through effective communication for automated solution optimization. Phases in the chat chain have varying levels of engagement in natural and programming languages.

[Figure 3: pie charts of agent utterances. Natural language accounts for 57.20% and programming language for 42.80%; natural-language utterances cover design aspects such as target user, data management, UI & UX, customization, performance, integration, real-time updates, recommendation, platform, collaboration, security & privacy, and scalability & maintenance, while programming-language utterances split across coding, code completion, code review, and system testing.]

Figure 3: The utterance distribution of agent communications throughout the entire development process.
[Figure 4: state-transition chart of reviewer suggestions across review rounds. Recurring categories include "Method Not Implemented", "Modules Not Imported", "Missing Code Segments", "Missing Comments", "Missing Exception Handling", "Methods Not Called", "Missing Initialization", "Missing Files", "Class Not Used", "Not Handle Cases", "Not Correctly Processing Data", "Calling without Correct Arguments", and "No Further Suggestions".]

Figure 4: The chart demonstrates the distribution of suggestions made by a reviewer agent during a multi-round reviewing process, where each sector in the chart represents a different category of suggestion.

We now analyze the content of their communications to understand linguistic effects.

Figure 3 depicts the communication breakdown, with natural language at 57.20%. In the natural-language phase (i.e., design), natural-language communication plays a crucial role in the thorough design of the system, with agents autonomously discussing and designing aspects like the target user, data management, and the user interface. Post-design phases show a balanced mix of coding, code completion, and testing activities, with most communication occurring during code reviews. This trend is due to agents' self-reviews and code fixes consistently propelling software development; otherwise, progress halts when successive updates don't show significant changes, leading to a natural decrease in code review communications.

We explore the properties of static debugging dynamics in code reviews resulting from communication between reviewers and programmers, as depicted in Figure 4. The data uncovers that during the review phase, reviewers may spot different issues through language interactions. The programmer's intervention can transform certain issues into different ones or into a state where no further suggestions are needed; the increasing proportion of the latter indicates successful software optimization. Particularly, the "Method Not Implemented" issue is most common in communication between reviewers and programmers during code reviews, accounting for 34.85% of discussions. This problem usually arises from unclear text requirements and the use of "placeholder" tags in Python code, necessitating additional manual adjustments. Furthermore, the "Module Not Imported" issue often arises due to code generation omitting crucial details. Apart from common problems, reviewers often focus on enhancing code robustness by identifying rare exceptions, unused classes, or potential infinite loops.

Likewise, we analyze the tester-programmer communication during the testing phase, illustrating the debugging dynamics in their multi-turn interactions with compiler feedback, as depicted in Figure 5. The likelihood of successful compilation at each step is generally higher than that of encountering errors, with most errors persisting and a lower probability of transforming into different errors. The most frequent error is "ModuleNotFound" (45.76%), followed by "NameError" and "ImportError" (each at 15.25%). This observation highlights the model's tendency to overlook basic elements like an "import" statement, underscoring its difficulty in managing intricate details during code generation. Besides, the tester also detects rarer errors like improperly initialized GUIs, incorrect method calls, missing file dependencies, and unused modules. The communicative dehallucination mechanism effectively resolves certain errors, frequently resulting in "compilation success" after code changes. There is a significantly low chance of returning to an error state from a successful compilation. Over time, the multi-turn communication process statistically shows a consistent decrease in errors, steadily moving towards successful software execution.
[Figure 5: transitions between compiler states across test rounds. States include Success and error types such as ModuleNotFoundError, NameError, ImportError, SyntaxError, AttributeError, TclError, EOFError, KeyError, SSLError, JSONDecodeError, FileNotFoundError, IndentationError, TypeError, DatabaseError, OperationalError, RuntimeError, and ValueError.]

Figure 5: The diagram illustrates the progression of iterations in a multi-round testing process, where each colored column represents a dialogue round, showcasing the evolution of the solution through successive stages of testing.

5 Conclusion

We have introduced ChatDev, an innovative multi-agent collaboration framework for software development that utilizes multiple LLM-powered agents to integrate fragmented phases of the waterfall model into a cohesive communication system. It features a chat chain for organizing communication targets and communicative dehallucination for resolving coding hallucinations. The results demonstrate its superiority and highlight the benefits of multi-turn communications in software optimization. We aim for these insights to advance LLM agents towards increased autonomy and to illuminate the profound effects of "language" and its empowering role across an even broader spectrum of applications.

6 Limitations

Our study explores the potential of cooperative autonomous agents in software development, but certain limitations and risks must be considered by researchers and practitioners. Firstly, the capabilities of autonomous agents in software production might be overestimated. While they enhance development quality, agents often implement simple logic, resulting in low information density. Without clear, detailed requirements, agents struggle to grasp task ideas. For instance, vague guidelines in developing a Snake game lead to basic representations; in information management systems, agents might retrieve static key-value placeholders instead of external databases. Therefore, it is crucial to clearly define detailed software requirements. Currently, these technologies are more suitable for prototype systems than for complex real-world applications. Secondly, unlike traditional function-level code generation, automating the evaluation of general-purpose software is highly complex. While some efforts have focused on Human Revision Cost (Hong et al., 2023), manual verification for large datasets is impractical. Our paper emphasizes completeness, executability, consistency, and overall quality, but future research should consider additional factors such as functionalities, robustness, safety, and user-friendliness. Thirdly, compared to single-agent approaches, multiple agents require more tokens and time, increasing computational demands and environmental impact. Future research should aim to enhance agent capabilities with fewer interactions. Despite these limitations, we believe that engaging a broader, technically proficient audience can unlock additional potential directions in LLM-powered multi-agent collaboration.

Acknowledgments

The work was supported by the National Key R&D Program of China (No.2022ZD0116312), the Postdoctoral Fellowship Program of CPSF under Grant Number GZB20230348, and the Tencent Rhino-Bird Focused Research Program.
References

Silvia T Acuna, Natalia Juristo, and Ana M Moreno. 2006. Emphasizing Human Capabilities in Software Development. In IEEE Software, volume 23, pages 94–101.

Mansi Agnihotri and Anuradha Chug. 2020. A Systematic Literature Survey of Software Metrics, Code Smells and Refactoring Techniques. In Journal of Information Processing Systems, volume 16, pages 915–934.

Rajiv D Banker, Gordon B Davis, and Sandra A Slaughter. 1998. Software Development Practices, Software Complexity, and Software Maintenance Performance: A Field Study. In Management Science, volume 44, pages 433–450.

Victor R Basili. 1989. Software Development: A Paradigm for The Future. In Proceedings of the Annual International Computer Software and Applications Conference, pages 471–485. IEEE.

Thorsten Brants, Ashok C Popat, Peng Xu, Franz J Och, and Jeffrey Dean. 2007. Large Language Models in Machine Translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 1877–1901.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of Artificial General Intelligence: Early Experiments with GPT-4. In arXiv preprint arXiv:2303.12712.

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2023. Large Language Models as Tool Makers. In arXiv preprint arXiv:2305.17126.

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. In arXiv preprint arXiv:2308.07201.

Dake Chen, Hanbin Wang, Yunhao Huo, Yuzhao Li, and Haoyang Zhang. 2023a. GameGPT: Multi-agent Collaborative Framework for Game Development. In arXiv preprint arXiv:2310.08067.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. In arXiv preprint arXiv:2107.03374.

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. 2023b. AgentVerse: Facilitating Multi-agent Collaboration and Exploring Emergent Behaviors in Agents. In International Conference on Learning Representations (ICLR).

Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. 2023. LM vs LM: Detecting Factual Errors via Cross Examination. In arXiv preprint arXiv:2305.13281.

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-Verification Reduces Hallucination in Large Language Models. In arXiv preprint arXiv:2309.11495.

Shiying Ding, Xinyi Chen, Yan Fang, Wenrui Liu, Yiwu Qiu, and Chunlei Chai. 2023. DesignGPT: Multi-Agent Collaboration in Design. In arXiv preprint arXiv:2311.11591.

Michael D. Ernst. 2017. Natural Language is a Programming Language: Applying Natural Language Processing to Software Development. In Advances in Programming Languages (SNAPL), volume 71, pages 4:1–4:14.

Saad Ezzini, Sallam Abualhaija, Chetan Arora, and Mehrdad Sabetzadeh. 2022. Automated Handling of Anaphoric Ambiguity in Requirements: A Multi-solution Study. In International Conference on Software Engineering (ICSE), pages 187–199.

Robert France and Bernhard Rumpe. 2007. Model-driven Development of Complex Software: A Research Roadmap. In Future of Software Engineering (FOSE), pages 37–54.

Peter Freeman, Donald J. Bagert, Hossein Saiedian, Mary Shaw, Robert Dupuis, and J. Barrie Thompson. 2001. Software Engineering Body of Knowledge (SWEBOK). In Proceedings of the International Conference on Software Engineering (ICSE), pages 693–696.

Sa Gao, Chunyang Chen, Zhenchang Xing, Yukun Ma, Wen Song, and Shang-Wei Lin. 2019. A Neural Model for Method Name Generation from Functional Description. In 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 411–421.

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2023. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In International Conference on Learning Representations (ICLR).

Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. 2023. War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars. In arXiv preprint arXiv:2311.17227.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. In ACM Computing Surveys, volume 55, pages 1–38.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. In arXiv preprint arXiv:2001.08361.

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023a. CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS).

Yuan Li, Yixuan Zhang, and Lichao Sun. 2023b. MetaAgents: Simulating Interactions of Human Behaviors for LLM-based Task-oriented Coordination via Collaborative Generative Agents. In arXiv preprint arXiv:2310.06500.

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. 2023. BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents. In arXiv preprint arXiv:2308.05960.

Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, and Dong Yu. 2023. LASER: LLM Agent with State-Space Exploration for Web Navigation. In arXiv preprint arXiv:2309.08172.

Cuauhtémoc López Martín and Alain Abran. 2015. Neural Networks for Predicting the Duration of New Software Projects. In Journal of Systems and Software, volume 101, pages 127–135.

Nadia Nahar, Shurui Zhou, Grace A. Lewis, and Christian Kästner. 2022. Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process. In IEEE/ACM International Conference on Software Engineering (ICSE), pages 413–425.

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In The International Conference on Learning Representations (ICLR).

Anton Osika. 2023. GPT-Engineer. In https://github.com/AntonOsika/gpt-engineer.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. In arXiv preprint arXiv:2203.02155.

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST).

Florian Pudlitz, Florian Brokhausen, and Andreas Vogelsang. 2019. Extraction of System States from Natural Language Requirements. In IEEE International Requirements Engineering Conference (RE), pages 211–222.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023a. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs. In arXiv preprint arXiv:2307.16789.

Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. 2023b. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. In arXiv preprint arXiv:2306.17563.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners. In OpenAI Blog, volume 1, page 9.

Toran Bruce Richards. 2023. AutoGPT. In https://github.com/Significant-Gravitas/AutoGPT.

Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, Hangyu Mao, Ziyue Li, Xingyu Zeng, and Rui Zhao. 2023. TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage. In arXiv preprint arXiv:2308.03427.

Steve Sawyer and Patricia J. Guinan. 1998. Software Development: Processes and Performance. In IBM Systems Journal, volume 37, pages 552–569.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In arXiv preprint arXiv:2302.04761.

Murray Shanahan, Kyle McDonell, and Laria Reynolds. 2023. Role Play with Large Language Models. In Nature, volume 623, pages 493–498.

Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. 2023. Cognitive Architectures for Language Agents. In arXiv preprint arXiv:2309.02427.

Hannes Thaller, Lukas Linsbauer, and Alexander Egyed. 2019. Feature Maps: A Comprehensible Software Representation for Design Pattern Detection. In IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 207–217.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and Efficient Foundation Language Models. In arXiv preprint arXiv:2302.13971.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30.

Chengcheng Wan, Shicheng Liu, Sophie Xie, Yifan Liu, Henry Hoffmann, Michael Maire, and Shan Lu. 2022. Automated Testing of Software that Uses Machine Learning APIs. In IEEE/ACM International Conference on Software Engineering (ICSE), pages 212–224.

Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving Automatic Source Code Summarization via Deep Reinforcement Learning. In Proceedings of the ACM/IEEE International Conference on Automated Software Engineering (ASE), pages 397–407.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023a. Voyager: An Open-ended Embodied Agent with Large Language Models. In arXiv preprint arXiv:2305.16291.

Lei Wang, Jingsen Zhang, Hao Yang, Zhiyuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. 2023b. When Large Language Model based Agent Meets User Behavior Analysis: A Novel User Simulation Paradigm. In arXiv preprint arXiv:2306.02552.

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023c. Large Language Models are not Fair Evaluators. In arXiv preprint arXiv:2305.17926.

Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically Learning Semantic Features for Defect Prediction. In Proceedings of the International Conference on Software Engineering (ICSE), pages 297–308.

Song Wang, Nishtha Shrestha, Abarna Kucheri Subburaman, Junjie Wang, Moshi Wei, and Nachiappan Nagappan. 2021. Automatic Unit Test Generation for Machine Learning Libraries: How Far Are We? In IEEE/ACM International Conference on Software Engineering (ICSE), pages 1548–1560.

Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. Xing, and Zhiting Hu. 2023d. PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization. In arXiv preprint arXiv:2310.16427.

Zhilin Wang, Yu Ying Chiu, and Yu Cheung Chiu. 2023e. Humanoid Agents: Platform for Simulating Human-like Generative Agents. In arXiv preprint arXiv:2310.05418.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent Abilities of Large Language Models. In arXiv preprint arXiv:2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 24824–24837.

Lilian Weng. 2023. LLM-powered Autonomous Agents. In lilianweng.github.io.

Jonas Winkler, Jannis Grönberg, and Andreas Vogelsang. 2020. Predicting How to Test Requirements: An Automated Approach. In Software Engineering, volume P-300 of LNI, pages 141–142.

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2023a. Large Language Models as Optimizers. In arXiv preprint arXiv:2309.03409.

Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. 2023b. GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction. In Advances in Neural Information Processing Systems (NeurIPS).

Murat Yilmaz, Rory V O'Connor, and Paul Clarke. 2012. A Systematic Approach to the Comparison of Roles in the Software Development Processes. In International Conference on Software Process Improvement and Capability Determination, pages 198–209. Springer.

Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, and Xipeng Qiu. 2023. Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 15135–15153.

An Zhang, Leheng Sheng, Yuxin Chen, Hao Li, Yang Deng, Xiang Wang, and Tat-Seng Chua. 2023a. On Generative Agents in Recommendation. In arXiv preprint arXiv:2310.10108.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023b. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. In arXiv preprint arXiv:2309.01219.

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2023. ExpeL: LLM Agents Are Experiential Learners. In AAAI Conference on Artificial Intelligence (AAAI).

Tianming Zhao, Chunyang Chen, Yuanning Liu, and Xiaodong Zhu. 2021. GUIGAN: Learning to Generate GUI Designs Using Generative Adversarial Networks. In IEEE/ACM International Conference on Software Engineering (ICSE), pages 748–760.

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023a. WebArena: A Realistic Web Environment for Building Autonomous Agents. In arXiv preprint arXiv:2307.13854.

Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Ningyu Zhang, Huajun Chen, Peng Cui, and Mrinmaya Sachan. 2023b. Agents: An Open-source Framework for Autonomous Language Agents. In arXiv preprint arXiv:2309.07870.
