Towards an Understanding of Large Language Models in Software Engineering Tasks
Zibin Zheng
School of Software Engineering, Sun Yat-sen University, China
E-mail: [email protected]
Kaiwen Ning
School of Software Engineering, Sun Yat-sen University, China
E-mail: [email protected]
Jiachi Chen
School of Software Engineering, Sun Yat-sen University, China
E-mail: [email protected]
Yanlin Wang
School of Software Engineering, Sun Yat-sen University, China
E-mail: [email protected]
Wenqing Chen
School of Software Engineering, Sun Yat-sen University, China
E-mail: [email protected]
Lianghong Guo
School of Software Engineering, Sun Yat-sen University, China
E-mail: [email protected]
Weicheng Wang
School of Software Engineering, Sun Yat-sen University, China
E-mail: [email protected]
This paper explores the integration of large language models (LLMs) with
software engineering, aiming to answer two questions: (1) What are the current
integrations of LLMs with software engineering? (2) Can LLMs effectively
handle software engineering tasks? To find the answers, we have collected
related literature as extensively as possible from six mainstream databases,
and selected 123 papers for analysis. We have categorized these papers in detail
and reviewed the current research status of LLMs from the perspective of seven
major software engineering tasks, hoping this will help researchers better grasp
the research trends and address the issues when applying LLMs. Meanwhile,
we have also organized and presented the papers that contain evaluations,
to reveal the performance and effectiveness of LLMs in various software
engineering tasks and to provide guidance for researchers and developers
on where optimization is needed.
1 Introduction
RQ1: What are the current works focusing on combining LLMs and
software engineering?
To answer these questions, we categorized the 123 selected papers according
to the software engineering tasks involved. Based on the specific content of the
software engineering tasks, such as code generation and vulnerability detection,
we divided them into seven categories: Code Generation, Code
Summarization, Code Translation, Vulnerability Detection, Code Evaluation,
Code Management, and Q&A Interaction. For each category, we elaborate on
their definitions and examples, which can help researchers continue to discover
and solve potential issues when applying LLMs to software engineering tasks.
RQ2: Can LLMs truly help better perform current software
engineering tasks?
While LLMs have demonstrated outstanding performance in text
generation tasks, their performance in software engineering tasks like code
generation requires further validation. To address this issue, we screened
the collected literature for works containing evaluations of LLMs. Considering
that the selected LLMs and software engineering tasks in these works may
vary, we also organized and compiled this information during the screening
process. Our findings indicate that currently, LLMs excel in tasks that demand
an understanding of syntax, such as code summarization and code repair.
However, their performance tends to be less satisfactory in tasks that require
comprehension of code semantics, such as code generation and vulnerability
detection. Nevertheless, we also observed that LLMs continue to make strides
with each version and model iteration, indicating they still possess the
potential to achieve better performance in the future.
The main contributions of this paper are:
• To the best of our knowledge, this is the first systematic review
that investigates the intersection of software engineering and large
language models. We manually selected 123 relevant studies from an
extensive collection of articles in six databases;
• We manually categorized these studies into seven types based on the
software engineering tasks involved. For each category, we provide
application examples of LLMs and detailed explanations, which can assist
researchers in better identifying and addressing potential challenges when
applying LLMs to software engineering tasks;
• We have conducted a comprehensive analysis of the performance of LLMs
in various software engineering tasks and explored the underlying reasons
for the variation in performance across these tasks. The statistical findings
presented herein can assist developers in optimizing LLMs more effectively.
The organization of this paper is as follows. In Section 2, we provide
background knowledge on LLMs; in Section 3, we describe our literature
collection and screening methodology; in Section 4, we summarize and
categorize the collected literature and answer research question 1 (RQ1);
in Section 5, we address research question 2 (RQ2); in Section 6, we discuss
related work; finally, in Section 7, we conclude the paper.
2 Background
2.1 Transformer
In general, large language models can be categorized into three main
architecture types: encoder-decoder, causal decoder, and prefix
decoder (Zhao et al., 2023b), each with its own characteristics and applications.
3 Methodology
Based on the previous literature review (Chen et al., 2021), we have selected
six search engines: ACM Digital Library, IEEE Xplore Digital Library,
dblp, Elsevier Science Direct, Google Scholar, and arXiv. These search
engines allow us to find peer-reviewed research papers published in journals,
conferences, workshops, and symposiums. Additionally, they provide access to
a considerable number of preprint papers and the latest industry developments.
We conducted searches using the following six keywords: “SE LLM,”
“Software Engineering Large Language Model,” “Software Engineering LLM,”
“SE Large Language Model,” “Code LLM,” and “Code Large Language
Model” on the aforementioned six paper databases. The obtained results are
presented in Table 1. It is worth noting that there might be a significant
number of duplicate papers and irrelevant articles resulting from different
keyword searches within the same engine or the same keyword across different
engines. Therefore, we need to manually screen and select these papers, which
is known as literature screening or literature selection.
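The duplicate-removal part of this screening can be automated before the manual pass. A minimal sketch, assuming records are identified by title (the normalization rule here is our own choice, not the paper's):

```python
import re

def normalize(title):
    # Lowercase and strip punctuation so formatting differences between
    # search engines do not hide duplicates.
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def deduplicate(titles):
    # Keep the first occurrence of each normalized title.
    seen, unique = set(), []
    for title in titles:
        key = normalize(title)
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique

hits = [
    "Large Language Models Meet NL2Code: A Survey",
    "Large language models meet NL2Code: a survey",  # same paper, second engine
    "A Survey of Large Language Models",
]
print(deduplicate(hits))
```

Irrelevant papers (e.g., those never mentioning LLMs in the abstract) still require the manual screening step described above.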
During this stage, we need to eliminate not only duplicate papers but also
irrelevant ones. For instance, some papers may primarily focus on the
field of software engineering but do not mention LLMs or “Large Language
Model” in their abstracts. Additionally, since the definition of “Large” in LLM
changes over time, some earlier literature might have been considered LLM
research at the time but no longer meets the criteria from today's
perspective (Zhao et al., 2023b; Lin et al., 2022a; Zan et al., 2023;
Meade et al., 2022; Li et al., 2022a; Wang et al., 2023a).
Therefore, we have excluded research conducted before
2022. We applied the following seven exclusion criteria to screen the literature:
Exclusion Criteria
• Studies are not written in English.
• Master or Ph.D. theses.
• Keynote papers.
The definition of “large” in LLM changes over time. For this reason, we have
filtered out work that does not meet the definition of LLM in (Zhao et al.,
2023b) and ensured that all included work was made public in or after 2022. We
used open card sorting to help find the answers to these two RQs. We read
the article carefully and looked closely for answers to the same two questions
shown in Table 2, i.e., (1) What are the current works focusing on combining
LLMs and software engineering? (2) Can LLMs truly help to better execute
current software engineering tasks? If we cannot find any answers from a paper,
then that paper will be removed from our list.
For the answers to (1), we primarily examined whether the papers
mentioned the application of LLM in software engineering tasks. We organized
this information and categorized the literature based on the types of tasks.
The number of papers included for each task is shown in Fig. 2. We can see
Table 3 The Definition of Seven Types of Software Engineering Tasks and the Role of
LLMs (excerpt).
Vulnerability Detection. Definition: to identify and fix code errors that may cause
program crashes, performance degradation, or security issues. Role of LLMs: check for
potential vulnerabilities in the code, etc.
Code Evaluation. Definition: to perform static analysis on the code to identify potential
issues and improvement points. Role of LLMs: generate test cases or test code
performance, usability, and other indicators, etc.
Code Generation
Fig. 3 An example of code generation with LLMs.
Example: Fig. 3 shows an example of code generation using GPT-3.5
(Bhaskar et al., 2023). Users input requirements into the LLM in natural
language, and the LLM automatically generates code based on their needs. In
this example, we use ChatGPT to generate a quicksort (Chen et al., 2021)
implementation in Python.
Code generation heavily relies on search and inference, systematically
searching and reasoning to find code that satisfies the given requirements
within the entire code space (Zheng et al., 2023). LLMs have demonstrated
impressive capabilities in text generation tasks, attracting significant research
efforts to evaluate and improve the performance of LLM in code generation
tasks (Li et al., 2023b).
These research efforts can be roughly categorized into two main themes.
The first theme mainly evaluates or discusses the capabilities of LLMs in code
generation tasks or specific contexts of code generation (Houde et al., 2022;
Sarsa et al., 2022). The evaluation perspectives vary, with some focusing on
the correctness of code generation in different programming languages (Bareiß
et al., 2022; Thakur et al., 2023a; Kande et al., 2023), while others propose
new benchmark frameworks or testing methods to better evaluate the code
generated by LLMs (Liu et al., 2023b; Vaithilingam et al., 2022), providing
directions for improving LLMs in this task.
However, it is important to note that no current technology, including
LLMs, can guarantee that automatically generated code is always completely
usable, as there may be obvious or subtle errors in the code. Therefore, the
second theme of these research efforts is to enhance the code generation
capabilities of LLMs. This includes automatically fixing bugs in code generated
by LLMs (Ni et al., 2023; Jain et al., 2022; Dinh et al., 2023), improving the
quality of code generated by LLMs (Zhang et al., 2023a; Wang et al., 2023c;
Barke et al., 2023a; Mouselinos et al., 2023; Lahiri et al., 2022; Dong et al.,
2023; Jiang et al., 2023), addressing security and reliability concerns (Poesia
et al., 2022; Zhu and Zhang, 2023), and enhancing the efficiency of LLM code
generation (Li et al., 2023e; Wang et al., 2023b; Murali et al., 2023), among
others.
Code Summarization
Fig. 4 An example of code summarization with LLMs: generating comments for the
quicksort algorithm.
There are also works that utilize LLMs for code summarization to construct
documentation, generate test cases, or address limitations in LLM-based code
summarization (Khan and Uddin, 2023; Zhang et al., 2023b). For example,
enhancing the robustness of LLM-based code summarization (Zhuo et al.,
2023) or improving the interaction capabilities with developers (MacNeil et al.,
2022a).
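Concretely, code summarization maps a code unit to a short natural-language description. The sketch below illustrates the input/output pair involved, using a hypothetical function; the docstring plays the role of the summary an LLM would generate:

```python
def merge_intervals(intervals):
    """Merge overlapping [start, end] intervals and return the merged list."""
    # The one-line docstring above is the kind of summary an LLM would be
    # asked to produce from the function body below.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

print(merge_intervals([[1, 3], [2, 6], [8, 10]]))  # [[1, 6], [8, 10]]
```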
Code Translation
Fig. 5 An example of code translation with LLMs: converting a quicksort implementation
in Python into Go.
Definition: Code translation, also named code conversion, refers to the process
of converting code between different programming languages without altering
its functionality or logic. This task holds significant practical value, such as in
system migration, code refactoring, and educational scenarios.
Example: As shown in Fig. 5, in this example we use an LLM (GPT-3.5)
to translate a quicksort implementation from Python into Go.
Traditional code translation methods often require substantial manual
effort, implementing specific syntax and semantic rules through hard coding.
Moreover, for new or less common programming languages, it may necessitate
the redevelopment and maintenance of translation rules. Currently, the
impressive natural language translation capabilities exhibited by LLMs have
been recognized (Zhang et al., 2023c). However, there is limited focus on the
performance of LLMs in code conversion tasks in current research.
Pearce et al. (Pearce et al., 2022b) studied the capability of LLMs in
software reverse engineering. The study explored Codex’s ability to identify
the purpose, functionality, and important variable names or values in code,
thus evaluating the decompilation ability of LLMs. Additionally, Lin et al. (Lin
et al., 2022b) proposed a Cross-language Code representation with a large-
scale pre-training (XCode) method and further introduced a Shared Encoder-
Decoder (SED) architecture.
Currently, there is relatively little research on LLMs in code translation
tasks, and applying LLMs to code translation tasks still faces many challenges.
Firstly, code correctness and precision are crucial, as even small translation
errors can render the generated code non-functional. Secondly, acquiring a
large amount of high-quality source code and target code pairs is challenging,
which may limit the model’s learning and generalization capabilities. Thirdly,
further research is needed on evaluating the performance of code translation
models, as in many cases, there can be multiple different implementations of
the same functionality in code. These issues require further exploration and
resolution in future research.
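The third challenge, that the same functionality admits many implementations, is one reason text-level comparison is a poor evaluation metric; comparing behavior on shared inputs is an alternative. A minimal sketch (both functions are written in Python here purely so the example is runnable; in practice they would be the source-language and target-language programs):

```python
import random

def reference(xs):
    # Stands in for the original (source-language) program.
    return sorted(xs)

def translated(xs):
    # Stands in for the translated (target-language) program; a different
    # implementation with the same observable behavior.
    out = list(xs)
    out.sort()
    return out

def behaviorally_equivalent(f, g, trials=100):
    # Treat two programs as equivalent if they agree on many randomly
    # generated inputs; this checks behavior, not code text.
    for _ in range(trials):
        xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
        if f(xs) != g(xs):
            return False
    return True

print(behaviorally_equivalent(reference, translated))  # True
```

Random testing of this kind can only refute equivalence, never prove it, which is why evaluation of code translation remains an open problem.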
Vulnerability Detection
Fig. 6 An example of code vulnerability detection with LLMs.
task (Olausson et al., 2023; Prenner et al., 2022; Pearce et al., 2022a; Madaan
et al., 2023), for example, Khoury et al. (Khoury et al., 2023) evaluated
the security of ChatGPT in code generation tasks. The second type focuses
on improving the correctness and performance of LLMs in vulnerability
detection (Xia and Zhang, 2023a,b), such as combining LLMs with formal
verification techniques (Charalambous et al., 2023), incorporating previously
related edits (Gupta et al., 2023), self-debugging algorithms (Chen et al.,
2023b), among others. The third type applies LLMs to vulnerability mining or
other related tasks (Ahmad et al., 2023b; Chan et al., 2023; Fan et al., 2023b;
Xia et al., 2022), including decompilation work (Xu et al., 2023), security
analysis of hardware designs (Ahmad et al., 2023c), and black-box testing (Li
et al., 2023f).
Fig. 7 An example of test case generation with LLMs: generating test cases for the
quicksort algorithm.
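The kind of test suite an LLM is asked to produce in the Fig. 7 example can be sketched as follows; the quicksort implementation and the specific cases (typical input, empty list, duplicates, already-sorted input) are illustrative, not taken from the figure:

```python
def quicksort(arr):
    # Function under test (illustrative implementation).
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    smaller = [x for x in arr[1:] if x <= pivot]
    larger = [x for x in arr[1:] if x > pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)

# Generated-style test cases: typical input, empty list, duplicates,
# and already-sorted input.
assert quicksort([3, 1, 2]) == [1, 2, 3]
assert quicksort([]) == []
assert quicksort([2, 2, 1]) == [1, 2, 2]
assert quicksort([1, 2, 3]) == [1, 2, 3]
print("all tests passed")
```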
(Table excerpt: Code Management covers collaborative development and code version
control.)
Definition: Interaction between humans and tools has always been a focus
in the field of software engineering, as good interaction can enhance task
performance (Xu et al., 2017). Before the widespread application of LLMs,
an important way for developers to obtain information and solve problems
was through Q&A websites, e.g., Stack Overflow7 (Zhang et al., 2022a). The
emergence of LLMs changed this by being able to answer users’ questions,
including professional knowledge in software engineering. As a promising new
tool to help developers solve code issues, LLMs also gave rise to much research
on how to improve the efficiency and convenience of Q&A Interaction (Gao
et al., 2022). Furthermore, since the output generated by LLMs is influenced
by the structure and content of user-provided prompts, a line of research on
prompts, known as prompt engineering, has emerged (White et al., 2023a).
Example: Fig. 10 shows an example of using different prompts to have an LLM
generate quicksort code; the generated code varies across prompts, and the
quality of the prompt greatly affects the results the LLM returns.
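To illustrate the point, two prompt formulations for the same request are shown below; these strings are our own illustrative templates, not the prompts from Fig. 10:

```python
# Two formulations of the same request; the extra constraints in the
# detailed prompt typically steer the model toward more specific code.
prompts = {
    "terse": "Write quicksort in Python.",
    "detailed": (
        "Write a Python function quicksort(arr) that returns a new sorted "
        "list, uses the first element as the pivot, handles empty lists "
        "and duplicates, and includes a one-line docstring."
    ),
}

for name, text in prompts.items():
    print(f"[{name}] {text}")
```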
It is important to note that this section focuses on investigations related
to Q&A Interaction and prompt engineering in the context of software
engineering.
7 www.stackoverflow.com/
This body of work can also be categorized into two main types. The first
type focuses on the interaction between software practitioners (developers,
beginners, etc.) and LLMs, and involves the development of prototype systems
or interaction frameworks (Ross et al., 2023; Zamfirescu-Pereira et al., 2023;
Cai et al., 2023). Among them, Zamfirescu-Pereira et al. (Zamfirescu-Pereira
et al., 2023) discusses the role of non-AI practitioners in “end-user prompt
engineering” and designs BotDesigner, a prompt-based chatbot design tool; Ross
et al. (Ross
et al., 2023) demonstrates the role and potential of developer-LLM interactions
for processes such as software development, through interviews with 42
software developers; and Cai et al. (Cai et al., 2023) describes Low-code
LLM, a framework for human-LLM interactions, to better support visual
programming.
The second type consists of research-oriented work, which can be further
divided into several directions. The first direction evaluates the interaction
between LLMs and software developers (Barke et al., 2023b), such as
whether LLMs address the same parts of natural language descriptions as
developers (Kou et al., 2023), or whether they can act as a DevBot (Ahmad
et al., 2023a).
The second direction primarily focuses on prompt engineering (White et al.,
2023b,a; Shrivastava et al., 2023b), aiming to design more efficient prompt
formats or automatically populate prompt content based on different subtasks
and objectives. The third direction addresses security and efficiency issues in
LLM interaction with developers (Sarkar et al., 2022; Sandoval et al., 2023).
In addition to the aforementioned topics, there are other works that combine
LLMs with software engineering. These works either discuss the performance of
LLMs in specific subtasks (Ozkaya, 2023; Sadik et al., 2023; Xing et al., 2023),
such as visualization (Maddigan and Susnjak, 2023), information extraction (Li
et al., 2023a,c), and modeling (Nichols et al., 2023); propose solutions
to existing problems, such as performance issues (Jain et al., 2023);
develop tools or datasets, such as code-text datasets (Manh et al., 2023; Liu
et al., 2023c); or identify issues related to LLMs (Treude and Hata, 2023;
Khlaaf et al., 2022). Additionally, some works focus on exploring the potential
and applications of LLMs in the field of education (MacNeil et al., 2022b).
5 Evaluation of LLMs on Software Engineering Tasks

In this section, we primarily discuss RQ2. First, we screened papers from our
collection that evaluated the performance of LLMs on software engineering
tasks. Next, we extracted the LLMs used and software engineering tasks
targeted in these works. Finally, some works in Section 4 also evaluated and
discussed the performance of LLMs on some specific tasks. Therefore, we will
summarize these works here and emphasize their evaluation results.
A significant portion of the work conducted has empirically analyzed
the performance of ChatGPT, one of the most popular LLMs, as a
programming assistant (Tian et al., 2023; Sridhara et al., 2023; Li et al., 2023d;
Liu et al., 2023a). These studies have found that ChatGPT’s performance
varies across different tasks. For instance, it performs well in tasks such as log
summarization, referential resolution, and code summarization, but struggles
in vulnerability detection and test case generation. Particularly in vulnerability
detection, ChatGPT finds it challenging to identify subtle code differences
when two versions have similar syntax (Li et al., 2023d). In some tasks such
as Text-to-SQL (Liu et al., 2023a), answering software testing questions (Jalil
et al., 2023), and test case generation (Tang et al., 2023b), although ChatGPT
did not achieve outstanding performance, the authors still maintain a positive
outlook. Some studies also highlight the limitations of ChatGPT’s attention
scope (Sridhara et al., 2023).
Furthermore, some works analyze ChatGPT’s performance in software
engineering tasks from different perspectives. For instance, Ma et al. (Ma et al.,
2023) investigate ChatGPT's understanding of code syntax and semantic
structure, concluding that while ChatGPT excels in understanding code
syntax (e.g., Abstract Syntax Trees), it faces difficulties in understanding code
semantics, especially dynamic semantics. Feng et al. (Feng et al., 2023) explore
ChatGPT's code generation abilities by analyzing posts on Twitter and Reddit,
examining people's sentiment towards ChatGPT's code generation capabilities.
There are also detailed evaluations of LLMs’ performance in specific tasks,
such as reverse engineering (Pearce et al., 2022b), code explanation (Zhuo
et al., 2023), code analysis (Feiyu, 2023), and vulnerability repair (Pearce
et al., 2022a). These studies generally provide more critical conclusions,
suggesting that LLMs still lag behind state-of-the-art methods in these tasks.
However, two works evaluating LLMs in automated program repair (Fan et al.,
2023b; Xia et al., 2022) present very positive findings. Additionally, several
evaluations on specific tasks yield more positive conclusions or affirm the
potential of LLMs in those tasks, such as code generation (Vaithilingam et al.,
2022; Kande et al., 2023; Thakur et al., 2023b) and error fixing (Prenner et al.,
2022). (gangz, 2023) evaluates the ability of large models to generate test cases
6 Related work
their capabilities and limitations, and providing directions for future research
and optimization.
Current works also reveal some future directions worth discussing: (1)
We can see that a large part of the work in Section 4 proposes methods to
improve the performance of LLMs on one or several software engineering tasks.
Although most of them do not provide detailed evaluations or discussions on
the performance of LLMs on these tasks, this might suggest that the current
performance of LLMs on these tasks is not good enough or not stable; (2) Most
current evaluations are based on general large models, such as ChatGPT, and
detailed evaluations of code-centric large models like Codex are still lacking;
(3) Do we need to fine-tune large models for specific software engineering tasks
to create large model products tailored for specific tasks? We will gradually
seek the answers to these questions in the future.
8 Acknowledgements
References
2307.14469
Fan L, Li L, Ma Z, Lee S, Yu H, Hemphill L (2023a) A bibliometric review of
large language models research from 2017 to 2023. 2304.02020
Fan Z, Gao X, Mirchev M, Roychoudhury A, Tan SH (2023b) Automated
repair of programs from large language models. 2205.10583
Feiyu (2023) Wechat. URL https://tern.cc/o150R4
Feldt R, Kang S, Yoon J, Yoo S (2023) Towards autonomous testing agents
via conversational large language models. 2306.05152
Feng S, Ma S, Yu J, Chen C, Zhou T, Zhen Y (2021) Auto-icon: An automated
code generation tool for icon designs assisting in UI development. In:
Hammond T, Verbert K, Parra D, Knijnenburg BP, O’Donovan J, Teale P
(eds) IUI ’21: 26th International Conference on Intelligent User Interfaces,
College Station, TX, USA, April 13-17, 2021, ACM, pp 59–69, DOI 10.1145/
3397481.3450671, URL https://doi.org/10.1145/3397481.3450671
Feng Y, Vanam S, Cherukupally M, Zheng W, Qiu M, Chen H (2023)
Investigating code generation performance of chat-gpt with crowdsourcing
social data. In: Proceedings of the 47th IEEE Computer Software and
Applications Conference, pp 1–10
Ferrario MA, Winter E (2023) Applying human values theory to software
engineering practice: Lessons and implications. IEEE Trans Software Eng
49(3):973–990, DOI 10.1109/TSE.2022.3170087, URL https://doi.org/
10.1109/TSE.2022.3170087
gangz (2023) Gitee. URL
https://gitee.com/gangz2009/tetris-by-chat-gpt/
Gao J, Guo Y, Lim G, Zhang T, Zhang Z, Li TJJ, Perrault ST (2023)
Collabcoder: A gpt-powered workflow for collaborative qualitative analysis.
2304.07366
Gao Z, Xia X, Lo D, Grundy JC (2022) Technical q&a site answer
recommendation via question boosting. CoRR abs/2210.15846, DOI 10.
48550/arXiv.2210.15846, URL https://doi.org/10.48550/arXiv.2210.
15846, 2210.15846
Gozalo-Brizuela R, Garrido-Merchan EC (2023) Chatgpt is not all you need.
a state of the art review of large generative ai models. 2301.04655
Gupta P, Khare A, Bajpai Y, Chakraborty S, Gulwani S, Kanade A,
Radhakrishna A, Soares G, Tiwari A (2023) Grace: Generation using
associated code edits. 2305.14129
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image
recognition. In: Proceedings of the IEEE conference on computer vision
and pattern recognition, pp 770–778
Hellas A, Leinonen J, Sarsa S, Koutcheme C, Kujanpää L, Sorva J (2023)
Exploring the responses of large language models to beginner programmers’
help requests. 2306.05715
Hoffmann M, Méndez D, Fagerholm F, Luckhardt A (2023) The human side
of software engineering teams: An investigation of contemporary challenges.
IEEE Trans Software Eng 49(1):211–225, DOI 10.1109/TSE.2022.3148539,
URL https://doi.org/10.1109/TSE.2022.3148539
Xia CS, Zhang L (2023b) Keep the conversation going: Fixing 162 out of 337
bugs for $0.42 each using chatgpt. 2304.00385
Xia CS, Wei Y, Zhang L (2022) Practical program repair in the era of large
pre-trained language models. 2210.14179
Xiao Z, Yuan X, Liao QV, Abdelghani R, Oudeyer PY (2023) Supporting
qualitative analysis with large language models: Combining codebook with
GPT-3 for deductive coding. In: 28th International Conference on Intelligent
User Interfaces, ACM, DOI 10.1145/3581754.3584136, URL https://doi.org/
10.1145/3581754.3584136
Xing Z, Huang Q, Cheng Y, Zhu L, Lu Q, Xu X (2023) Prompt sapper:
Llm-empowered software engineering infrastructure for ai-native services.
2306.02230
Xu B, Xing Z, Xia X, Lo D (2017) Answerbot: automated generation of answer
summary to developers' technical questions. In: Rosu G, Penta MD, Nguyen
TN (eds) Proceedings of the 32nd IEEE/ACM International Conference on
Automated Software Engineering, ASE 2017, Urbana, IL, USA, October 30
- November 03, 2017, IEEE Computer Society, pp 706–716, DOI 10.1109/
ASE.2017.8115681, URL https://doi.org/10.1109/ASE.2017.8115681
Xu C, McAuley J (2023) A survey on model compression and acceleration
for pretrained language models. Proceedings of the AAAI Conference on
Artificial Intelligence 37(9):10566–10575, DOI 10.1609/aaai.v37i9.26255,
URL https://ojs.aaai.org/index.php/AAAI/article/view/26255
Xu X, Zhang Z, Feng S, Ye Y, Su Z, Jiang N, Cheng S, Tan L, Zhang X
(2023) Lmpa: Improving decompilation by synergy of large language model
and program analysis. 2306.02546
Yang J, Prabhakar A, Narasimhan K, Yao S (2023) Intercode: Standardizing
and benchmarking interactive coding with execution feedback. 2306.14898
Yuan Z, Lou Y, Liu M, Ding S, Wang K, Chen Y, Peng X (2023) No more
manual tests? evaluating and improving chatgpt for unit test generation.
2305.04207
Zamfirescu-Pereira J, Wong RY, Hartmann B, Yang Q (2023) Why johnny
can’t prompt: How non-ai experts try (and fail) to design llm prompts. In:
Proceedings of the 2023 CHI Conference on Human Factors in Computing
Systems, Association for Computing Machinery, New York, NY, USA
Zan D, Chen B, Zhang F, Lu D, Wu B, Guan B, Wang Y, Lou JG (2023)
Large language models meet nl2code: A survey. 2212.09420
Zeng A, Liu X, Du Z, Wang Z, Lai H, Ding M, Yang Z, Xu Y, Zheng W,
Xia X, et al. (2022) Glm-130b: An open bilingual pre-trained model. arXiv
preprint arXiv:2210.02414
Zhang K, Li Z, Li J, Li G, Jin Z (2023a) Self-edit: Fault-aware code editor for
code generation. 2305.04087
Zhang K, Wang D, Xia J, Wang WY, Li L (2023b) Algo: Synthesizing
algorithmic programs with generated oracle verifiers. 2305.14591
Zhang N, Huang Q, Xia X, Zou Y, Lo D, Xing Z (2022a) Chatbot4qr:
Interactive query refinement for technical question retrieval. IEEE Trans
Software Eng 48(4):1185–1211, DOI 10.1109/TSE.2020.3016006, URL
https://doi.org/10.1109/TSE.2020.3016006
Zhang R, Cahyawijaya S, Cruz JCB, Aji AF (2023c) Multilingual large
language models are not (yet) code-switchers. 2305.14235
Zhang S, Roller S, Goyal N, Artetxe M, Chen M, Chen S, Dewan C, Diab M,
Li X, Lin XV, et al. (2022b) Opt: Open pre-trained transformer language
models. arXiv preprint arXiv:2205.01068
Zhao J, Rong Y, Guo Y, He Y, Chen H (2023a) Understanding programs by
exploiting (fuzzing) test cases. 2305.13592
Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, Zhang
J, Dong Z, et al. (2023b) A survey of large language models. arXiv preprint
arXiv:2303.18223
Zheng Q, Xia X, Zou X, Dong Y, Wang S, Xue Y, Wang Z, Shen L, Wang
A, Li Y, Su T, Yang Z, Tang J (2023) Codegeex: A pre-trained model
for code generation with multilingual evaluations on humaneval-x. CoRR
abs/2303.17568, DOI 10.48550/arXiv.2303.17568, URL https://doi.org/
10.48550/arXiv.2303.17568, 2303.17568
Zhu R, Zhang C (2023) How robust is a large pre-trained language model
for code generation? A case on attacking GPT2. In: 2023 IEEE International
Conference on Software Analysis, Evolution and Reengineering (SANER),
pp 708–712, DOI 10.1109/SANER56733.2023.00076
Zhuo TY (2023) Large language models are state-of-the-art evaluators of code
generation. 2304.14317
Zhuo TY, Li Z, Huang Y, Shiri F, Wang W, Haffari G, Li YF (2023)
On robustness of prompt-based semantic parsing with large pre-trained
language model: An empirical study on codex. In: Proceedings of the 17th
Conference of the European Chapter of the Association for Computational
Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia,
pp 1090–1102