

Towards an Understanding of Large Language Models in Software Engineering Tasks

Zibin Zheng · Kaiwen Ning · Jiachi Chen · Yanlin Wang · Wenqing Chen · Lianghong Guo · Weicheng Wang

Received: date / Accepted: date


arXiv:2308.11396v1 [cs.SE] 22 Aug 2023

Abstract Large Language Models (LLMs) have drawn widespread attention and research due to their astounding performance in tasks such as text generation and reasoning. Derivative products, like ChatGPT, have been extensively deployed and highly sought after. Meanwhile, the evaluation and optimization of LLMs in software engineering tasks, such as code generation, have become a research focus. However, there is still a lack of systematic research on the application and evaluation of LLMs in the field of software engineering. Therefore, this paper is the first to comprehensively investigate and collate the research and products combining LLMs with

software engineering, aiming to answer two questions: (1) What are the current integrations of LLMs with software engineering? (2) Can LLMs effectively handle software engineering tasks? To find the answers, we have collected related literature as extensively as possible from six mainstream databases, and selected 123 papers for analysis. We have categorized these papers in detail and reviewed the current research status of LLMs from the perspective of seven major software engineering tasks, hoping this will help researchers better grasp the research trends and address the issues when applying LLMs. Meanwhile, we have also organized and presented papers with evaluation content to reveal the performance and effectiveness of LLMs in various software engineering tasks, providing guidance for researchers and developers seeking to optimize them.

Keywords Empirical Study · Literature Review · Large Language Models · Software Engineering

Jiachi Chen is the corresponding author.

Zibin Zheng · Kaiwen Ning · Jiachi Chen · Yanlin Wang · Wenqing Chen · Lianghong Guo · Weicheng Wang
School of Software Engineering, Sun Yat-sen University, China
E-mail: [email protected]

1 Introduction

Large language models (LLMs) refer to neural network language models trained on massive text data, with model sizes reaching hundreds of billions or more parameters (Zhao et al., 2023b). The most advanced LLMs to date, such as GPT-4 (Liu et al., 2023e; Savelka et al., 2023) and LaMDA (Thoppilan et al., 2022), have demonstrated remarkable language comprehension and generation capabilities, performing well on a variety of natural language processing tasks, such as text summarization (Tang et al., 2023a). The derivative products and applications of LLMs, such as ChatGPT (Gozalo-Brizuela and Garrido-Merchan, 2023) and Claude (tse Huang et al., 2023), have also gained widespread attention in both academic and industrial communities. Given the outstanding performance of LLMs, there is a growing focus on exploring their potential in software engineering tasks, seeking new opportunities to address challenges in the field of software engineering (Zan et al., 2023).
The primary objective of software engineering is to develop high-
quality and easily maintainable software products (Kotti et al., 2023).
This process involves multiple stages, including requirement analysis,
design, development, testing, and maintenance (Hoffmann et al., 2023).
Throughout this process, engineers need to handle various software engineering
tasks, such as understanding complex requirements, writing correct and
reliable code, constructing comprehensive test cases, etc. In general, these
software engineering tasks often rely on the expertise and experience of
software engineers. This dependence not only incurs significant human
resource costs, but also increases the difficulty of software development
and maintenance. Additionally, with the increasing complexity of user
demands and the emergence of new types of applications, engineers also
face additional challenges when dealing with software engineering tasks, such
as collaborative development, enhancing software quality, and shortening
development cycles (Ferrario and Winter, 2023).

Therefore, researching how to enhance the level of automation in software engineering tasks to achieve more efficient software production and
maintenance is a critical and widely discussed topic. Currently, numerous
efforts are devoted to developing relevant tools and algorithms to assist
or replace engineers in completing these software engineering tasks. For
instance, automated code generation (Feng et al., 2021), automatic test case
generation (Wang et al., 2022), and vulnerability detection (Liu et al., 2023f)
are some of the areas of focus. However, these individual automation methods
often face challenges when applied universally (Wang et al., 2023e), and some
existing automated solutions may introduce new issues or errors. For example,
automatically generated code may contain potential vulnerabilities (Thakur
et al., 2023a), automatically generated test cases often fail to achieve
comprehensive coverage (Lemieux et al., 2023), etc.
Fortunately, LLMs have great potential to solve the above problems due to their excellent performance on complex tasks such as text generation. LLMs are usually trained on large-scale corpora that include substantial amounts of code (Wei et al., 2022). This allows them to better learn code syntax and semantics, potentially making them better equipped for tasks such as code completion and code summarization (Wang et al., 2023d). Currently, as more and more LLMs designed for software engineering tasks are deployed (Huggingface, 2021; Nijkamp et al., 2023; Zheng et al., 2023; Wang et al., 2021; Li et al., 2022b; Chai et al., 2022; Wang et al., 2023d), many research works have focused on the application of LLMs in the software engineering domain (Ahmad et al., 2023a; Zhang et al., 2023b; Li et al., 2023a; Barke et al., 2023a; Zhuo et al., 2023; Gozalo-Brizuela and Garrido-Merchan, 2023). The ability of LLMs to generate high-quality code and high-coverage test cases has become a hot topic in the software engineering domain (Zan et al., 2023). However, while adequate systematic reviews and surveys have been conducted on LLM applications in areas such as education (Leinonen et al., 2023b), a systematic review of the application and evaluation of LLMs in the field of software engineering is still missing.
In this study, our goal is to conduct a comprehensive review of the intersecting theme of LLMs and software engineering. To facilitate this, we initially gathered as much relevant literature as possible from six major academic databases: the ACM Digital Library (https://dl.acm.org/), IEEE Xplore Digital Library (https://ieeexplore.ieee.org/Xplore/home.jsp), dblp (https://dblp.uni-trier.de/), Elsevier Science Direct (https://www.sciencedirect.com/), Google Scholar (https://scholar.google.com), and arXiv (https://arxiv.org/). Through the method of card sorting (Reese et al., 2018), we eliminated duplicate and irrelevant literature. After this screening process, we obtained 123 relevant and valid research papers. With this study, we aim to address the following two questions:

RQ1: What are the current works focusing on combining LLMs and
software engineering?
To answer this question, we categorized the 123 selected papers according to the software engineering tasks involved. Based on the specific content of the software engineering tasks, such as code generation and vulnerability detection, we divided them into seven categories: Code Generation, Code Summarization, Code Translation, Vulnerability Detection, Code Evaluation, Code Management, and Q&A Interaction. For each category, we elaborate on its definition and give examples, which can help researchers continue to discover and solve potential issues when applying LLMs to software engineering tasks.
RQ2: Can LLMs truly help better perform current software
engineering tasks?
While LLMs have demonstrated outstanding performance in text
generation tasks, their performance in software engineering tasks like code
generation requires further validation. To address this issue, we selected the literature containing evaluations of LLMs. Considering that the LLMs and software engineering tasks covered in these works may
vary, we also organized and compiled this information during the screening
process. Our findings indicate that currently, LLMs excel in tasks that demand
an understanding of syntax, such as code summarization and code repair.
However, their performance tends to be less satisfactory in tasks that require
comprehension of code semantics, such as code generation and vulnerability
detection. Nevertheless, we also observed that LLMs continue to make strides
with each version and model iteration, indicating they still possess the
potential to achieve better performance in the future.
The main contributions of this paper are:
• To the best of our knowledge, this is the first systematic review that investigates the intersection of software engineering and large language models. We manually selected 123 relevant studies from an extensive collection of articles in six databases;
• We manually categorized these studies into seven types based on the software engineering tasks they address. For each category, we provided application examples of LLMs and delved into detailed explanations. This can assist researchers in better identifying and addressing potential challenges when applying LLMs to software engineering tasks;
• We have conducted a comprehensive analysis and compilation of the performance of LLMs in various software engineering tasks and explored the underlying reasons for variations in performance across these tasks. The statistical findings presented herein can assist developers in optimizing LLMs more effectively.
The organization of this paper is as follows. In Section 2, we provide background knowledge on LLMs; in Section 3, we describe our methodology; in Section 4, we summarize and categorize the collected literature and propose an answer to research question 1 (RQ1); in Section 5, we address research question 2 (RQ2); in Section 6, we elucidate related work; finally, in Section 7, we summarize the entire paper.

2 Background

In this section, we will introduce the background of large language models, including the Transformer model, the architecture of large language models, and their emergent capabilities.

2.1 Transformer

Currently, the mainstream large language models are based on the Transformer (Vaswani et al., 2017) model, such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022). Compared to traditional deep learning model structures like recurrent neural networks (RNNs) and convolutional neural networks (CNNs), the Transformer model relies on an attention mechanism, allowing for parallel computation and more effective capture of long-range dependencies. This provides a foundation for effectively training large language models on large-scale text data.
The Transformer model, first introduced by Vaswani et al. (Vaswani et al., 2017) in 2017, is a sequence-to-sequence model consisting of an encoder and a decoder. Both the encoder and decoder are composed of multiple identical blocks stacked together. Each block in the encoder primarily includes a multi-head self-attention module and a position-wise feed-forward network (FFN), with residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) applied to enable deeper model building (Lin et al., 2022a). In the decoder, cross-attention modules are inserted between the multi-head self-attention modules and the position-wise FFNs, allowing for the incorporation of information from the encoder. Notably, the self-attention modules in the decoder are modified to prevent attending to subsequent positions.
The core component of the Transformer is the attention mechanism,
which allows the model to weigh the importance of different words or
tokens in a sequence when generating or understanding the context. By
attending to relevant parts of the input sequence, the Transformer model can
effectively model the relationships between words and capture rich contextual
information. Additionally, the attention mechanism does not involve sequential
operations, enabling parallel computation and resulting in higher efficiency
during training and inference in the Transformer model.
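For reference, the scaled dot-product attention at the heart of this mechanism can be written, following Vaswani et al. (2017), as

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

where Q, K, and V are the query, key, and value matrices computed from the input sequence and d_k is the dimensionality of the keys; dividing by \sqrt{d_k} keeps the dot products in a numerically stable range before the softmax.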

2.2 Model Architecture

The Transformer architecture, proposed by Vaswani et al. in 2017 (Vaswani et al., 2017), has emerged as the leading choice for developing large language models (LLMs) due to its exceptional parallelizability and capacity (Zhao et al., 2023b). This scalability allows language models to be expanded to include hundreds or even thousands of billions of parameters, enabling them to capture more complex language patterns and improve performance on various

tasks. In general, large language models can be categorized into three main architecture types: the encoder-decoder, causal decoder, and prefix decoder architectures (Zhao et al., 2023b), each with its own characteristics and applications.

Encoder-decoder Architecture: The vanilla Transformer proposed in (Vaswani et al., 2017) is based on the encoder-decoder architecture, which comprises separate encoder and decoder components. The encoder processes the input sequence and captures its latent representation, which is then used by the decoder to generate the output sequence autoregressively. This architecture is well-suited for tasks involving sequence-to-sequence mapping, such as machine translation, text summarization, and dialogue generation. Encoder-decoder pretrained models, such as BART (Lewis et al., 2019) and T5 (Raffel et al., 2020), have demonstrated excellent performance across various downstream tasks. However, there are currently only a few large language models based on the encoder-decoder architecture, such as Flan-T5 (Chung et al., 2022) and CodeT5+ (Wang et al., 2023e).

Causal Decoder Architecture: The causal decoder architecture is commonly implemented as a stack of decoder layers. It utilizes a diagonal mask matrix, allowing each token to access information only from preceding tokens. This constraint ensures a unidirectional and autoregressive generation process. The GPT series of models, initially introduced by OpenAI (Radford et al., 2018, 2019; Brown et al., 2020), represents one of the most prominent examples of the causal decoder architecture. While GPT (Radford et al., 2018) and GPT-2 (Radford et al., 2019) did not exhibit the same level of performance as GPT-3 (Brown et al., 2020), with the increase in model size and the amount of data used for pretraining, GPT-3 (Brown et al., 2020) showcased a remarkable few-shot capability that earlier models did not possess. Today, the causal decoder architecture has become the prevailing choice for large language model architectures, giving rise to a wide range of powerful LLMs such as PaLM (Chowdhery et al., 2022), LLaMA (Touvron et al., 2023), OPT (Zhang et al., 2022b), and Bloom (Scao et al., 2022). The causal decoder architecture and the prefix decoder architecture, which is discussed next, are collectively referred to as decoder-only architectures (Zhao et al., 2023b).
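As a minimal illustration of the diagonal mask described above (a generic Python/NumPy sketch, not the masking code of any particular model), a causal mask over a sequence of length n is simply a lower-triangular boolean matrix:

    import numpy as np

    def causal_mask(seq_len: int) -> np.ndarray:
        # Row i marks the positions token i may attend to: only j <= i.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    print(causal_mask(4).astype(int))
    # [[1 0 0 0]
    #  [1 1 0 0]
    #  [1 1 1 0]
    #  [1 1 1 1]]

Positions where the mask is False are set to negative infinity before the softmax, so each token's attention distribution covers only earlier tokens.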

Prefix Decoder Architecture: The prefix decoder, similar to the causal decoder architecture, consists of decoder layers. However, the key distinction lies in the attention mechanism. The prefix decoder utilizes bidirectional attention for the prefix tokens, incorporating information from both preceding and succeeding tokens. In contrast, unidirectional attention is applied only to the generated tokens, ensuring a unidirectional flow of information during the generation process. This combination of attention mechanisms in the prefix decoder enables flexible and controlled generation, conditioned on both the prefix and the generated tokens. Some commonly known models based on the prefix decoder architecture include U-PaLM (Tay et al., 2022) and GLM-130B (Zeng et al., 2022).
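Continuing the sketch above (again a generic illustration, assuming a prefix of length p inside a sequence of length n), the prefix decoder's mask differs only in that the prefix block is fully visible to itself:

    import numpy as np

    def prefix_mask(seq_len: int, prefix_len: int) -> np.ndarray:
        # Start from a causal mask, then allow bidirectional attention
        # among the first prefix_len tokens.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        mask[:prefix_len, :prefix_len] = True
        return mask

    print(prefix_mask(5, 2).astype(int))
    # [[1 1 0 0 0]
    #  [1 1 0 0 0]
    #  [1 1 1 0 0]
    #  [1 1 1 1 0]
    #  [1 1 1 1 1]]

Generated tokens still attend causally, so generation remains autoregressive while the prefix is encoded with full bidirectional context.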

[Figure 1 pipeline: Literature Search (keyword search across six search engines, including published and preprint literature) → Literature Selection (seven exclusion criteria, manual context analysis, card-sorting methods to check the data) → Data Analysis (paper reading, literature classification, discussion and unified conclusion). Literature collection and screening process: large-scale collection of literature and accurate screening of relevant literature.]

Fig. 1 Overview of methodology design.

2.3 Emergent Abilities

According to the scaling law of large language models (Kaplan et al.,


2020), as the model parameters and training data increase, the model’s
capacity and capabilities also improve. When scaling surpasses a certain
threshold, LLMs demonstrate emergent abilities that are not present in
smaller models (Wei et al., 2022). These emergent abilities are considered
the most notable characteristic that sets large models apart from their smaller
counterparts. Such as in-context learning (Brown et al., 2020), instruction
following (Sanh et al., 2021; Ouyang et al., 2022; Wei et al., 2021), and step-
by-step reasoning (Shanahan, 2022).
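For concreteness, the scaling law of Kaplan et al. (2020) fits the language-modeling loss with power laws. With the caveat that the fitted constants below are those reported in that paper and are indicative rather than universal, the loss as a function of the non-embedding parameter count N (with data and compute unconstrained) takes the form

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076

and an analogous law L(D) \approx (D_c/D)^{\alpha_D} with \alpha_D \approx 0.095 holds for the dataset size D. Emergent abilities are the qualitative capability jumps observed once models move far enough along these curves.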

3 Methodology

In this section, we introduce the detailed steps of conducting the literature review. We follow the method provided by (Kitchenham, 2007). Generally speaking, it consists of three steps: literature search, literature screening, and data analysis, as shown in Fig. 1.

3.1 Literature Search

Based on the previous literature review (Chen et al., 2021), we have selected
six search engines: ACM Digital Library, IEEE Xplore Digital Library,
dblp, Elsevier Science Direct, Google Scholar, and arXiv. These search
engines allow us to find peer-reviewed research papers published in journals,
conferences, workshops, and symposiums. Additionally, they provide access to
a considerable number of preprint papers and the latest industry developments.
We conducted searches using the following six keywords: “SE LLM,” “Software Engineering Large Language Model,” “Software Engineering LLM,” “SE Large Language Model,” “Code LLM,” and “Code Large Language Model” on the aforementioned six paper databases. The obtained results are presented in Table 1.

Table 1 Number of keyword searches returned by each search engine.

Search engine | SE LLM | Software Engineering Large Language Model | Software Engineering LLM | SE Large Language Model | Code LLM | Code Large Language Model
ACM Digital Library | 4983 | 55907 | 39032 | 54532 | 27894 | 54990
IEEE Xplore Digital Library | 0 | 240 | 7 | 14 | 14 | 328
dblp | 8 | 4 | 1 | 105 | 7 | 62
Elsevier Science Direct | 56 | 11273 | 54 | 7063 | 64 | 20649
Google Scholar | 10500 | 16400 | 4020 | 25400 | 11400 | 17700
arXiv | 5 | 182 | 70 | 17 | 463 | 1461

It is worth noting that there might be a significant
number of duplicate papers and irrelevant articles resulting from different
keyword searches within the same engine or the same keyword across different
engines. Therefore, we need to manually screen and select these papers, which
is known as literature screening or literature selection.

3.2 Literature Selection

During this stage, we need to eliminate not only duplicate papers but also irrelevant ones. For instance, some retrieved papers primarily focus on LLMs alone or on the field of software engineering alone, or do not mention LLMs or “Large Language Models” in their abstracts at all. Additionally, since the definition of “large” in LLM is subject to change, some earlier literature might have been considered LLM research at the time but may not meet the criteria from today's perspective (Zhao et al., 2023b; Lin et al., 2022a; Zan et al., 2023; Meade et al., 2022; Li et al., 2022a; Wang et al., 2023a). Therefore, we have excluded research conducted before 2022. We applied the following seven exclusion criteria to screen the literature:
Exclusion Criteria
• Studies not written in English.
• Master or Ph.D. theses.
• Keynote papers.
• Studies not related to LLMs.
• Studies not related to software engineering.
• Duplicate literature.
• Studies published before 2022.
In this study, we are specifically focused on the intersection of LLM and
software engineering. Therefore, we will exclude papers that solely focus on
LLM or software engineering. We are interested in the following topics:
Inclusion Topics
• LLMs in software engineering.
• Applications of LLMs in software engineering (for example, code generation).
• Empirical studies of LLMs on software engineering tasks.
To improve the accuracy of the literature screening process, we used the card sorting method to evaluate the collected data. Card sorting is a common technique used to assess data and derive categories from it. There are three types of card sorting: open card sorting, closed card sorting, and hybrid card sorting. Among these, closed card sorting involves predefined categories (Kitchenham, 2007). In our case, we applied closed card sorting to select relevant papers, since we have only two categories: relevant and irrelevant. Each card has a title (the paper title) and a description (the paper abstract). Using this method, we systematically evaluated and categorized the papers based on their relevance to our research.
Six experienced researchers, including one non-coauthor, independently conducted a thorough review of the search engine results from the six databases. After individually organizing the papers, they engaged in a collaborative discussion to align their findings. Through this rigorous process, we ultimately identified 123 relevant papers that met the criteria for inclusion in our study.

3.3 Data Analysis

The definition of “large” in LLM changes over time. For this reason, we filtered out work that does not meet the definition of LLM in (Zhao et al., 2023b) and ensured that all included work was made public in 2022 or later. We used open card sorting to help find the answers to our two RQs. We read each article carefully and looked closely for answers to the two questions shown in Table 2, i.e., (1) What are the current works focusing on combining LLMs and software engineering? (2) Can LLMs truly help to better execute current software engineering tasks? If we could not find any answers in a paper, that paper was removed from our list.
For the answers to (1), we primarily examined whether the papers
mentioned the application of LLM in software engineering tasks. We organized
this information and categorized the literature based on the types of tasks.
The number of papers included for each task is shown in Fig. 2.

Fig. 2 Number of literature on different software engineering tasks.

Table 2 Data Collection for Each RQ.

RQ | Type of data we collected
RQ1 | What are the current works focusing on combining LLMs and software engineering? E.g., code summarization, code translation, code generation, interaction with software developers, etc.
RQ2 | Can LLMs truly help to better execute current software engineering tasks? E.g., performance status, defects, potential, completion, deficiencies, etc.

We can see that Code Generation-oriented studies are the most numerous, with 24 papers; conversely, Code Translation-oriented studies are the fewest, with 3 papers. The numbers of papers for the remaining five software engineering tasks are similar, with Code Management-oriented work the fewest among them at 6 papers and Vulnerability Detection-oriented work the most at 17 papers. The specifics of these tasks are presented in Section 4. For the answers to (2), we focused on whether the papers provided critical or explicit opinions on the performance of LLMs in software engineering. In particular, we examined instance studies and survey articles in this regard.

4 LLMs in Software Engineering Tasks

In this section, we primarily respond to RQ1. We categorize the collected work into seven categories based on different tasks, and then provide separate explanations for each category, as shown in Table 3.

Table 3 The Definition of Seven Types of Software Engineering Tasks and the Role of LLMs.

Task | Definition | The possible role of LLMs
Code Generation | Automatically generating source code based on user requirements and specified constraints. | (Auxiliary) generation of code, or providing developers with ideas and programming “starting points”, etc.
Code Summarization | Automatically generating clear, accurate, and useful code comments to aid developers in understanding and maintaining code. | Code summaries at different granularities (such as functions), explaining the intent of the code, etc.
Code Translation | Converting code between different programming languages without altering its functionality or logic. | Auxiliary code conversion, reverse engineering, etc.
Vulnerability Detection | Identifying and fixing code errors that may cause program crashes, performance degradation, or security issues. | Checking for potential vulnerabilities in the code, etc.
Code Evaluation | Performing static analysis on the code to identify potential issues and improvement points. | Generating test cases, or assessing code performance, usability, and other indicators, etc.
Code Management | Managing information such as code versions and developers. | Team collaborative development, version control, etc.
Q&A Interaction | Interaction between software developers and LLMs. | Programming assistant, prompt engineering, etc.

4.1 Code Generation

Definition: Code generation, also known as program synthesis, is the process of automatically generating source code based on user requirements and specified constraints (Chen et al., 2023a). In most cases, it involves transforming text into code.
The history of code generation can be traced back to the use of theorem proving over user-provided specifications to extract corresponding logical programs. Code generation typically involves two steps: first, parsing the user's requirements and constraints and converting them into an intermediate form, such as an abstract syntax tree (AST) (Hu et al., 2023b) or intermediate representation (IR) (Li et al., 2022c); then, generating the target code based on this intermediate form. The first step is requirement understanding, which involves transforming natural language into a formal, computer-understandable representation; the second step is code synthesis, which involves generating code based on this representation (Jiang et al., 2023).
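For intuition about what such an intermediate form looks like, Python's built-in ast module can dump the abstract syntax tree of a small function (this is ordinary parsing of existing code, shown only to illustrate the AST structure that requirement-understanding stages target):

    import ast

    # Parse a two-line function and print its syntax tree.
    tree = ast.parse("def add(a, b):\n    return a + b")
    print(ast.dump(tree, indent=2))
    # The dump shows a Module containing a FunctionDef node with an
    # arguments node (a, b) and a Return node wrapping a BinOp (a + b).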

Fig. 3 An example of code generation with LLMs: prompting “Please write a quick sort code using Python”.

Example: Fig. 3 is an example of using an LLM (GPT-3.5) for code generation (Bhaskar et al., 2023). Users state their requirements to the LLM in natural language, and the LLM automatically generates code based on those needs. In this example, we use ChatGPT to generate a quicksort (Chen et al., 2021) implemented in Python.
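The generated program itself is not legible in the figure; a representative Python quicksort of the kind such a prompt typically yields (an illustrative reconstruction, not ChatGPT's verbatim output) is:

    def quick_sort(arr):
        # Recursively sort by partitioning around a pivot element.
        if len(arr) <= 1:
            return arr
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quick_sort(left) + quick_sort(middle) + quick_sort(right)

    print(quick_sort([3, 6, 8, 10, 1, 2, 1]))  # [1, 1, 2, 3, 6, 8, 10]

The later figures in this section (Figs. 4-7) all operate on a quicksort of this shape.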
Code generation relies heavily on search and inference: systematically searching and reasoning to find code that satisfies the given requirements within the entire code space (Zheng et al., 2023). LLMs have demonstrated impressive capabilities in text generation tasks, attracting significant research efforts to evaluate and improve the performance of LLMs in code generation tasks (Li et al., 2023b).
These research efforts can be roughly categorized into two main themes.
The first theme mainly evaluates or discusses the capabilities of LLMs in code
generation tasks or specific contexts of code generation (Houde et al., 2022;
Sarsa et al., 2022). The evaluation perspectives vary, with some focusing on
the correctness of code generation in different programming languages (Bareiß
et al., 2022; Thakur et al., 2023a; Kande et al., 2023), while others propose
new benchmark frameworks or testing methods to better evaluate the code
generated by LLMs (Liu et al., 2023b; Vaithilingam et al., 2022), providing
directions for improving LLMs in this task.
However, it is important to note that no current technology, including
LLMs, can guarantee that automatically generated code is always completely
usable, as there may be obvious or subtle errors in the code. Therefore, the
second theme of these research efforts is to enhance the code generation
capabilities of LLMs. This includes automatically fixing bugs in code generated
by LLMs (Ni et al., 2023; Jain et al., 2022; Dinh et al., 2023), improving the

quality of code generated by LLMs (Zhang et al., 2023a; Wang et al., 2023c;
Barke et al., 2023a; Mouselinos et al., 2023; Lahiri et al., 2022; Dong et al.,
2023; Jiang et al., 2023), addressing security and reliability concerns (Poesia
et al., 2022; Zhu and Zhang, 2023), and enhancing the efficiency of LLM code
generation (Li et al., 2023e; Wang et al., 2023b; Murali et al., 2023), among
others.

4.2 Code Summarization

Definition: Code summarization is an important technique for program understanding and automatic documentation generation. The main goal of this task is to automatically generate clear, accurate, and useful code comments to aid developers in understanding and maintaining code.
Example: Fig. 4 shows an LLM being asked to explain the code generated in Fig. 3 and to fill in comments automatically. Typically, code summarization takes source code as input and generates natural language as output. Recently, there have been numerous works applying LLMs (here, GPT-3.5) to natural language description tasks for programs (Hu et al., 2023c).
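As an illustration of the comment-filling behavior (a sketch of plausible output, not the model's verbatim response), the quicksort from Fig. 3 might come back annotated like this:

    def quick_sort(arr):
        """Sort a list using the divide-and-conquer quick sort algorithm."""
        if len(arr) <= 1:                        # base case: already sorted
            return arr
        pivot = arr[len(arr) // 2]               # middle element as pivot
        left = [x for x in arr if x < pivot]     # elements below the pivot
        middle = [x for x in arr if x == pivot]  # elements equal to the pivot
        right = [x for x in arr if x > pivot]    # elements above the pivot
        return quick_sort(left) + quick_sort(middle) + quick_sort(right)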
Code summarization can be performed at different granularities, such as
line-level, function-level, or module-level. Different levels of summarization
require handling different contextual information. For example, function-level
summarization may need to understand the inputs, outputs, functionality, and
invocation relationships of functions, while module-level summarization may
require understanding the functionality and interfaces with other modules.
Traditional code summarization methods mainly rely on rules and
templates, using specific rules and templates to extract and represent key
information from code. However, these methods often require a significant
amount of manual effort and have poor adaptability to new programming
languages and frameworks. With the advancement of machine learning, neural
network-based code summarization methods have gained attention. These
methods use neural networks to learn the complex mapping between code
and text. Additionally, some methods employ encoder-decoder architectures,
where the encoder is used to understand the source code and the decoder is
used to generate natural language descriptions.
The use of Large Language Models (LLMs) is a recent research direction in
code summarization. These models, such as OpenAI’s GPT series, have also
received considerable attention in the specific task of code summarization.
Some works evaluate the code summarization capabilities of different LLMs,
such as ChatGPT (Sun et al., 2023), Codex (Sarsa et al., 2022; Leinonen et al.,
2023b), GPT-3 (MacNeil et al., 2022c), among others. Other works assess the
capabilities of LLMs in code summarization from different perspectives, such
as causal reasoning abilities (Liu et al., 2023d), effectiveness with minimal
training (Ahmed and Devanbu, 2022), and interpretability (Leinonen et al.,
2023a). Although these works provide incomplete conclusions, they generally
recognize the strong potential of LLMs.

Please explain the meaning of this code and provide


comments:

This code implements the Quick Sort algorithm,


which is based on the divide-and-conquer approach.

Code Summarization
Fig. 4 An example of code summarization with LLMs. Generating comments for fast
Sorting algorithm

There are also works that utilize LLMs for code summarization to construct
documentation, generate test cases, or address limitations in LLM-based code
summarization (Khan and Uddin, 2023; Zhang et al., 2023b). For example,
enhancing the robustness of LLM-based code summarization (Zhuo et al.,
2023) or improving the interaction capabilities with developers (MacNeil et al.,
2022a).

Fig. 5 An example of code translation with LLMs: prompted with “Please implement this code in Go language”, the model converts the quick sort algorithm implemented in Python into Go.

4.3 Code Translation

Definition: Code translation, also named code conversion, refers to the process of converting code between different programming languages without altering its functionality or logic. This task holds significant practical value, for example in system migration, code refactoring, and educational scenarios.
Example: As shown in Fig. 5, in this example we use an LLM (GPT-3.5) to translate the quick sort algorithm written in Python into Go.
Traditional code translation methods often require substantial manual effort, implementing specific syntax and semantic rules through hard coding. Moreover, for new or less common programming languages, translation rules may need to be redeveloped and maintained from scratch. The impressive natural language translation capabilities exhibited by LLMs have now been widely recognized (Zhang et al., 2023c). However, current research pays limited attention to the performance of LLMs on code conversion tasks.
Pearce et al. (Pearce et al., 2022b) studied the capability of LLMs in software reverse engineering. The study explored Codex's ability to identify the purpose, functionality, and important variable names or values in code, thus evaluating the decompilation ability of LLMs. Additionally, Lin et al. (Lin et al., 2022b) proposed a Cross-language Code representation with large-scale pre-training (XCode) method and further introduced a Shared Encoder-Decoder (SED) architecture.
Currently, there is relatively little research on LLMs in code translation
tasks, and applying LLMs to code translation tasks still faces many challenges.
Firstly, code correctness and precision are crucial, as even small translation
errors can render the generated code non-functional. Secondly, acquiring a
large amount of high-quality source code and target code pairs is challenging,
which may limit the model’s learning and generalization capabilities. Thirdly,
further research is needed on evaluating the performance of code translation
models, as in many cases, there can be multiple different implementations of
the same functionality in code. These issues require further exploration and
resolution in future research.

4.4 Vulnerability Detection

Definition: Code vulnerability detection and repair is an important task in the field of software engineering, crucial for improving the reliability and security of software, and a critical step in the software development lifecycle. The main goal of this task is to identify and fix code errors that may cause program crashes, performance degradation, or security issues.
Example: In the example shown in Fig. 6, we used an LLM to identify errors in the quick sort code. The LLM (GPT-3.5) accurately identified the syntax error in the code.
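Concretely, the flagged line looked like the following (a reconstruction of the snippet shown only in the figure):

    middle = [x for x in arr if x === pivot]  # invalid: === is not a Python operator
    middle = [x for x in arr if x == pivot]   # corrected comparison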
Due to the significance of this task, various software and hardware-based
detection methods have been developed. Traditional vulnerability detection
methods mainly include static analysis and dynamic analysis. Static analysis
analyzes the source code before program execution to identify potential errors
and vulnerabilities, while dynamic analysis analyzes program behavior during
runtime to identify actual errors and vulnerabilities. These methods have their
own advantages and disadvantages: static analysis can cover all code paths
but may produce a large number of false positives, while dynamic analysis can
precisely pinpoint actual errors but may miss errors that are not triggered in
the test cases.
Recently, LLMs have been used in code vulnerability detection and repair
tasks due to their advanced semantic understanding capabilities, which are
effective in addressing logical flaws and linguistic issues in natural language.
Similarly, in this task, current research can be roughly categorized into three
types. The first type evaluates the capabilities of different LLMs in this

Fig. 6 An example of code vulnerability detection with LLMs: asked to “identify the errors in the following code”, the model replies that the comparison operator === in the middle list comprehension is incorrect.

task (Olausson et al., 2023; Prenner et al., 2022; Pearce et al., 2022a; Madaan
et al., 2023), for example, Khoury et al. (Khoury et al., 2023) evaluated
the security of ChatGPT in code generation tasks. The second type focuses
on improving the correctness and performance of LLMs in vulnerability
detection (Xia and Zhang, 2023a,b), such as combining LLMs with formal
verification techniques (Charalambous et al., 2023), incorporating previously
related edits (Gupta et al., 2023), self-debugging algorithms (Chen et al.,
2023b), among others. The third type applies LLMs to vulnerability mining or
other related tasks (Ahmad et al., 2023b; Chan et al., 2023; Fan et al., 2023b;
Xia et al., 2022), including decompilation work (Xu et al., 2023), security
analysis of hardware designs (Ahmad et al., 2023c), and black-box testing (Li
et al., 2023f).

Fig. 7 An example of test case generation with LLMs: generating test cases for the quick sort algorithm with different testing focuses.
• Empty array or array with one element:
    assert quick_sort([]) == []
    assert quick_sort([42]) == [42]
• Array with unique elements:
    input_array = [64, 34, 25, 12, 22, 11, 90]
    expected_output = [11, 12, 22, 25, 34, 64, 90]
• Array with duplicate elements:
    input_array = [3, 1, 5, 2, 3, 4, 1]
    expected_output = [1, 1, 2, 3, 3, 4, 5]

4.5 Code Evaluation

Definition: Code evaluation and testing are crucial tasks in software engineering that help ensure code quality, reliability, and expected functionality. Code evaluation aims to perform static analysis on the code to identify potential issues and improvement points.
Code testing involves executing the code and checking whether its behavior aligns with expectations in order to validate its correctness. However, code testing can be time-consuming and labour-intensive. Exploring the application of LLMs to automatically generate effective test cases for a given piece of code, or to directly evaluate its quality, has therefore attracted significant research attention.
Example: We used GPT-3.5 to generate test cases for the quick sort code, as shown in Fig. 7. The LLM generated test cases for the sample code with different testing focuses.
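Assertions like those in Fig. 7 can be dropped directly into a standard test harness. A minimal sketch using Python's unittest, assuming the quick_sort function from Fig. 3 lives in a hypothetical quicksort module:

    import unittest
    from quicksort import quick_sort  # hypothetical module with Fig. 3's code

    class TestQuickSort(unittest.TestCase):
        def test_trivial_inputs(self):
            self.assertEqual(quick_sort([]), [])
            self.assertEqual(quick_sort([42]), [42])

        def test_unique_elements(self):
            self.assertEqual(quick_sort([64, 34, 25, 12, 22, 11, 90]),
                             [11, 12, 22, 25, 34, 64, 90])

        def test_duplicate_elements(self):
            self.assertEqual(quick_sort([3, 1, 5, 2, 3, 4, 1]),
                             [1, 1, 2, 3, 3, 4, 5])

    if __name__ == "__main__":
        unittest.main()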

In current research on applying LLMs to code testing, a significant portion of the work focuses on test case generation. For example, empirical studies have explored the unit test generation capabilities of LLMs like ChatGPT (Yuan et al., 2023; Tang et al., 2023b), or utilized LLMs to generate test cases (Zhao et al., 2023a; Lemieux et al., 2023; Chen et al., 2022; Tu et al., 2023; Li et al., 2023d).
Additionally, several works have proposed code testing frameworks utilizing
LLMs based on different usage scenarios (Kang et al., 2022; Zhuo, 2023), such
as combining LLMs with fuzz testing (Hu et al., 2023a) or black-box testing (Li
et al., 2023f). Furthermore, a few works have developed LLM-based testing
assistants or code testing models (Feldt et al., 2023; Lee et al., 2022).

4.6 Code Management

Definition: Code management is a crucial aspect of the software engineering development and maintenance process. It involves version control (Maruf et al., 2021), collaborative management (Potluri et al., 2022), and release management of source code during software development (Jing et al., 2021).
Example: CollabCoder (Gao et al., 2023) is a good example of using an LLM for code management, as shown in Fig. 8. In this example, the LLM is mainly involved in three stages, namely independent open coding, iterative discussions, and the development of a final codebook, to assist with code generation, code merging, and version control.

Fig. 8 An example of LLM-assisted code management (CollabCoder): open coding (seeking suggestions and relevant codes), merge and discuss (making code decisions from the coding history), and codebook development (automatic clustering of codes).

Code management is complex and challenging. Parallel development and collaboration can lead to code conflicts. Moreover, in large-scale projects, branch management and version iterations significantly increase the difficulty of code management. Traditional code management tools, such as Git (Escamilla et al., 2023), provide a powerful set of commands and rules to handle version control and branch management tasks. However, using these tools still requires developers to have expertise, especially when dealing with complex merge conflicts and branch management. As shown in Fig. 9, LLMs can be applied to collaborative code development and test management.

Fig. 9 LLMs for code collaborative development and test management: code management spanning collaborative development and code version control.

Due to the various challenges in code management tasks, some cutting-edge works aim to leverage the power of LLMs to alleviate the complexity of the code management process, and even to achieve automated code management without manual intervention (Xiao et al., 2023; Bi et al., 2023). For example, Ahmed et al. (Ahmed et al., 2023) evaluate the effectiveness of LLMs in assisting engineers in managing security incidents in cloud services, while Shrivastava et al. (Shrivastava et al., 2023a) address the issue of LLMs struggling to understand the context present in repositories.
Additionally, some works based on LLMs have developed code
management tools or frameworks to assist code managers, aiding in version
management (Gao et al., 2023) and personnel training (Lanciano. et al., 2023).

4.7 Q&A Interaction

Definition: Interaction between humans and tools has always been a focus in the field of software engineering, as good interaction can enhance task performance (Xu et al., 2017). Before the widespread application of LLMs, an important way for developers to obtain information and solve problems was through Q&A websites, e.g., Stack Overflow (www.stackoverflow.com) (Zhang et al., 2022a). The emergence of LLMs changed this: they can answer users' questions directly, including questions requiring professional knowledge in software engineering. As a promising new tool for helping developers solve code issues, LLMs have also given rise to much research on how to improve the efficiency and convenience of Q&A interaction (Gao et al., 2022). Furthermore, since the output generated by LLMs is influenced by the structure and content of user-provided prompts, research on prompt design, known as prompt engineering, has emerged (White et al., 2023a).
Example: Fig. 10 shows an example of using different prompts to ask an LLM for quick sort code; the code generated by different prompts varies, and the quality of the prompt greatly affects the results the LLM returns.
It is important to note that this section focuses on investigations related to Q&A interaction and prompt engineering in the context of software engineering.

Fig. 10 Designing different prompts for the same goal:
Prompt 1: Please write a quick sort code using Python.
Prompt 2: Write a quick sort code using Python.
Prompt 3: Please use Python to write a quick sort code that is as concise as possible.
Prompt 4: You are a software development engineer and now you need to use Python to write a quick sort code.

This body of work can also be categorized into two main types. The first
type focuses on the interaction between software practitioners (developers,
beginners, etc.) and LLMs, and involves the development of prototype systems
or interaction frameworks (Ross et al., 2023; Zamfirescu-Pereira et al., 2023;
Cai et al., 2023). Among them, Zamfirescu-Pereira et al. (Zamfirescu-Pereira
et al., 2023) discusses the role of non-AI practitioners in “user cue engineering”
and designs BotDesigner, a cue-based chatbot design tool; Ross et al. (Ross
et al., 2023) demonstrates the role and potential of developer-LLM interactions
for processes such as software development, through interviews with 42
software developers; and Cai et al. (Cai et al., 2023) describes Low-code
LLM, a framework for human-LLM interactions, to better support visual
programming.
The second type consists of research-oriented work, which can be further
divided into several directions. The first direction evaluates the interaction
between LLMs and software developers (Barke et al., 2023b), such as
whether LLMs address the same parts of natural language descriptions as
developers (Kou et al., 2023), or whether they can act as a DevBot (Ahmad
et al., 2023a).
The second direction primarily focuses on prompt engineering (White et al.,
2023b,a; Shrivastava et al., 2023b), aiming to design more efficient prompt
formats or automatically populate prompt content based on different subtasks
and objectives. The third direction addresses security and efficiency issues in
LLM interaction with developers (Sarkar et al., 2022; Sandoval et al., 2023).

4.8 Other Works

In addition to the aforementioned topics, there are other works that combine LLMs with software engineering. These works discuss the performance of LLMs in specific subtasks (Ozkaya, 2023; Sadik et al., 2023; Xing et al., 2023), such as visualization (Maddigan and Susnjak, 2023), information extraction (Li et al., 2023a,c), and modeling (Nichols et al., 2023); propose solutions to existing problems, such as performance issues (Jain et al., 2023); develop tools or datasets, such as code-text datasets (Manh et al., 2023; Liu et al., 2023c); or identify issues related to LLMs (Treude and Hata, 2023; Khlaaf et al., 2022). Additionally, some works focus on exploring the potential and applications of LLMs in the field of education (MacNeil et al., 2022b).

5 Performance of LLM in SE Tasks

In this section, we primarily discuss RQ2. First, we screened papers from our collection that evaluated the performance of LLMs on software engineering tasks. Next, we extracted the LLMs used and the software engineering tasks targeted in these works. Finally, since some works in Section 4 also evaluated and discussed the performance of LLMs on specific tasks, we summarize those works here as well and emphasize their evaluation results.
A significant portion of the work conducted has empirically analyzed
the performance of ChatGPT, one of the most popular LLM models, as a
programming assistant (Tian et al., 2023; Sridhara et al., 2023; Li et al., 2023d;
Liu et al., 2023a). These studies have found that ChatGPT’s performance
varies across different tasks. For instance, it performs well in tasks such as log
summarization, referential resolution, and code summarization, but struggles
in vulnerability detection and test case generation. Particularly in vulnerability
detection, ChatGPT finds it challenging to identify subtle code differences
when two versions have similar syntax (Li et al., 2023d). In some tasks such
as Text-to-SQL (Liu et al., 2023a), answering software testing questions (Jalil
et al., 2023), and test case generation (Tang et al., 2023b), although ChatGPT
did not achieve outstanding performance, the authors still maintain a positive
outlook. Some studies also highlight the limitations of ChatGPT’s attention
scope (Sridhara et al., 2023).
Furthermore, some works analyze ChatGPT's performance in software engineering tasks from different perspectives. For instance, Ma et al. (Ma et al., 2023) investigate ChatGPT's understanding of code syntax and semantic structure, concluding that while ChatGPT excels at understanding code syntax (e.g., Abstract Syntax Trees), it faces difficulties in understanding code semantics, especially dynamic semantics. Feng et al. (Feng et al., 2023) explore ChatGPT's code generation abilities by analyzing comments on Twitter and Reddit, examining people's sentiment towards ChatGPT's code generation capabilities.
There are also detailed evaluations of LLMs’ performance in specific tasks,
such as reverse engineering (Pearce et al., 2022b), code explanation (Zhuo
et al., 2023), code analysis (Feiyu, 2023), and vulnerability repair (Pearce
et al., 2022a). These studies generally provide more critical conclusions,
suggesting that LLMs still lag behind state-of-the-art methods in these tasks.
However, two works evaluating LLMs in automated program repair (Fan et al.,
2023b; Xia et al., 2022) present very positive findings. Additionally, several
evaluations on specific tasks yield more positive conclusions or affirm the
potential of LLMs in those tasks, such as code generation (Vaithilingam et al.,
2022; Kande et al., 2023; Thakur et al., 2023b) and error fixing (Prenner et al., 2022). (gangz, 2023) evaluates the ability of large models to generate test cases on a simple game, reporting positive results, and (Bareiß et al., 2022) acknowledges the performance of LLMs in code generation.

Table 4 Part 1 - The performance of LLMs in software engineering tasks.

Paper | LLM(s) | Subject of evaluation | Behavior
(Tian et al., 2023) | ChatGPT | code generation, program repair, and code summarization | Finished Well
(Sridhara et al., 2023) | ChatGPT | 15 common software engineering tasks | Well done: log summary, anaphora parsing, code summarization (method name generation), code clone detection, etc.; not doing well: code vulnerability detection, etc.
(Ma et al., 2023) | ChatGPT | syntax understanding, static behaviour understanding, dynamic behaviour understanding; capacity to comprehend code syntax and semantic structures | Well done: code syntax understanding; not doing well: code semantic understanding
(Hellas et al., 2023) | Codex and GPT-3.5 | code interpretation | Finished Well
(Li et al., 2023d) | ChatGPT | code defect detection | Poor Performance
(Pearce et al., 2022b) | code-davinci-001 | software reverse engineering | Poor Performance
(gangz, 2023) | ChatGPT | test case generation | Finished Well
(Pearce et al., 2022a) | Codex | code defect detection | Poor Performance
(Bareiß et al., 2022) | Codex | code generation | Finished Well
(Tang et al., 2023b) | ChatGPT | test case generation | Poor Performance
(Sarsa et al., 2022) | Codex | code interpretation | Finished Well
(Jalil et al., 2023) | ChatGPT | software testing | Finished Well
(Savelka et al., 2023) | GPT-4 | code generation | Finished Well, with obvious progress

Table 5 Part 2 - The performance of LLMs in software engineering tasks.

Paper | LLM(s) | Subject of evaluation | Behavior
(Zhuo et al., 2023) | Codex | code interpretation | Poor Performance (robustness)
(Feiyu, 2023) | GPT-4 | code analysis | Poor Performance
(Fan et al., 2023b) | Codex | program repair | Finished Well
(Xia et al., 2022) | GPT-Neo, GPT-J, GPT-NeoX, Codex, CodeT5 and INCODER | program repair | Finished Well
(Shirafuji et al., 2023) | Codex, CodeGen, InstructGPT, and ChatGPT | code generation | Poor Performance (robustness)
(Feng et al., 2023) | ChatGPT | code generation | Ambiguity
(Kande et al., 2023) | code-davinci-002 | code generation | Poor Performance
(Thakur et al., 2023b) | CodeGen and Codex | code generation | Poor Performance
(Khoury et al., 2023) | ChatGPT | code generation | Finished Well
(Prenner et al., 2022) | Codex | program repair | Finished Well
(Vaithilingam et al., 2022) | Codex | program repair | Poor Performance
(Liu et al., 2023a) | ChatGPT | Text-to-SQL | Finished Well
(Rajkumar et al., 2022) | Codex | Text-to-SQL | Finished Well

Moreover, a considerable portion of the research evaluates the performance
of LLMs in education and programming courses (Leinonen et al., 2023a;
Hellas et al., 2023; Sarsa et al., 2022; Savelka et al., 2023), with positive
feedback. Additionally, compared to GPT-3, GPT-4 has made significant
advancements (Savelka et al., 2023).
A small number of works focus on the robustness and security issues of
LLMs in solving programming problems (Shirafuji et al., 2023; Khoury et al.,
2023), and they have yielded important findings as well.
We have organized the conclusions derived from the work above, as shown in Tables 4 and 5. We mainly extracted the evaluation results from the abstract and conclusion sections of the papers, and categorized them as Finished Well, Poor Performance, or Ambiguity. It is worth noting that for some entries, although the evaluated LLM(s) did not perform well on the task, the article still expressed a positive attitude; for others, even decent evaluation results came with noted limitations.
From the tables and the articles above, we can reach some preliminary conclusions:
• Code generation is a challenging task, and current LLMs often do not reach a “ready-to-use” level on it; their performance is neither outstanding nor consistent. Encouragingly, however, some articles (Kande et al., 2023; Thakur et al., 2023b) still offered a positive outlook even though their evaluations did not affirm LLMs;
• On tasks like program repair and Text-to-SQL, LLMs generally perform well;
• In program vulnerability detection, LLMs typically do not perform well;
• Although there is a contradiction in the conclusions of (gangz, 2023) and
(Tang et al., 2023b) about LLMs on the task of test case generation, they
both expressed a relatively positive attitude, still believing that LLMs have
potential in this task;
• In tasks like code summarization and code explanation, LLMs usually
perform well, but lack robustness;
• According to the results of (Savelka et al., 2023), newer LLMs have made
significant improvements in the task of code generation.
In summary, we can reach a fundamental conclusion: LLMs perform well
on some software engineering tasks that require understanding of code syntax,
such as code summarization and code repair; on some tasks that require
understanding of code semantics, such as code generation and vulnerability
detection, they typically do not perform well enough; LLMs continue to
improve with the iteration of versions/models, and still possess great potential.
Therefore, at the current stage, LLMs still cannot replace human
labor in software engineering tasks, but they can serve as excellent
assistants to software developers, as search tools, and as prompting
tools.

6 Related work

The tremendous potential of LLMs has attracted numerous investigations regarding their applications, either within the field of LLMs itself or in specific domains. In this section, we showcase these research efforts and explain their distinctions from our work. To the best of our knowledge, we are the first to systematically investigate, analyze, and compile the research progress of LLMs in the context of software engineering tasks.
Zhao et al. (Zhao et al., 2023b) present a detailed survey of the background, development history, technical roadmap, and latest advancements of Large Language Models (LLMs). The survey primarily focuses on large-scale models (with more than 10B parameters) and does not cover early pre-trained language models such as BERT and GPT-2. The research revolves around four main aspects of LLMs: pre-training, fine-tuning, applications, and performance evaluation. Additionally, the authors treat certain software engineering tasks as fundamental capabilities of LLMs, for instance, regarding Code Synthesis as part of the Language Generation capability. Consequently, the survey does not offer a comprehensive discussion and summary of LLMs' application, performance, and limitations in software engineering tasks.
Gozalo-Brizuela et al. (Gozalo-Brizuela and Garrido-Merchan, 2023), on the other hand, classify generative AI models based on their input and output formats, dividing them into nine categories as shown in Table 6. The paper demonstrates the capabilities of generative AI models through these classifications. While the authors believe these models possess significant creativity and potential, they also acknowledge the numerous constraints the models face, particularly in terms of data acquisition. They further raise concerns over the significant computational resource consumption during training and the time cost of model construction. Despite summarizing the capabilities of some generative AI models, the paper does not focus on LLMs, nor does it discuss their performance in specific tasks, including software engineering tasks.
Liu et al. (Liu et al., 2023e) provides a comprehensive review of ChatGPT and GPT-4, highlighting their potential applications and contributions in the field of Natural Language Processing (NLP). Meanwhile, it outlines several potential ethical issues related to the development and use of LLMs, advocating for a focus on addressing these ethical concerns, exploring new applications, and ensuring the responsible use of ChatGPT and GPT-4. Fan et al. (Fan et al., 2023a) conducted a bibliometric analysis of over 5,000 LLM research papers from 2017 to early 2023, investigating their applications across various fields. The paper calls for accelerated cooperation among stakeholders, such as government agencies, universities, companies, and infrastructure service providers, to responsibly develop and apply LLMs.
Table 6 A taxonomy of the most popular generative AI models in (Gozalo-Brizuela and Garrido-Merchan, 2023)

Generative AI model categories    Examples
Text-to-Image models              DALL·E 2, IMAGEN, Muse
Text-to-3D models                 Dreamfusion, Magic3D
Image-to-Text models              Flamingo, VisualGPT
Text-to-Video models              Phenaki, Soundify
Text-to-Audio models              AudioLM, Jukebox, Whisper
Text-to-Text models               ChatGPT, LaMDA, PEER
Text-to-Code models               Codex, Alphacode
Text-to-Science models            Galactica, Minerva
Other models                      AlphaTensor, GATO, ChatBCG

Wei et al. (Wei et al., 2023) provides a comprehensive overview of traditional language models (CLMs) and their successors, pre-trained language models (PLMs). CLMs are designed to predict language sequence probabilities in a causal manner, while PLMs can be fine-tuned for causal sequence modeling and downstream applications. While the article does not delve into the application of large models in the field of software engineering, it offers a brilliant overview of the current state and development progress of large models, and clearly points out future research directions for these models.
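As a concrete illustration of the causal factorization that CLMs implement, P(w1, ..., wT) = ∏t P(wt | w<t), the following minimal sketch scores a code snippet under a small causal LM. It assumes the Hugging Face transformers API, and GPT-2 is used purely as a convenient small checkpoint, not as a model studied by (Wei et al., 2023).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("def add(a, b): return a + b", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits  # shape (1, T, vocab_size)

# log P(w_t | w_<t): align the logits at position t-1 with token t.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print("sequence log-probability:", token_lp.sum().item())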
Additionally, some research work has reviewed and empirically analyzed
the application of LLM in a specific field or particular tasks.
Li et al. (Li et al., 2022a) provides an overview of representative
research achievements in text generation based on PLMs, reviews various
evaluation metrics, open-source libraries, and common applications, with the
aim to assist practitioners in assessing, selecting, and utilizing appropriate
PLMs, and proposes some future research directions. Yang et al. (Yang
et al., 2023) investigates the application of Transformer-based PLMs for the
Controllable Text Generation (CTG) task, summarizing typical applications,
key methodologies, and evaluation systems of PLMs in CTG. Min et al. (Min et al., 2023) explores three trending paradigms of using pre-trained language models for NLP: Pre-train then Fine-tune, Prompt-based Learning, and NLP as Text Generation. The paper also notes that the theoretical understanding of these paradigms remains preliminary.
Wang et al. (Wang et al., 2023a) primarily summarizes the latest progress
of PLMs in the biomedical field and their applications in downstream
biomedical tasks, and discusses the development trends and directions.
In response to the existing LLM-based recommendation systems, Wu et
al. (Wu et al., 2023) proposes a classification that divides these models
into two major paradigms, namely, Discriminative LLM for Recommendation
(DLLM4Rec) and Generative LLM for Recommendation (GLLM4Rec).
The paper systematically reviews and analyzes the existing LLM-based
recommendation systems in these two paradigms and discusses their methods and performance.
Other research mainly investigates and analyzes the shortcomings of LLMs
in their applications. For instance, Meade et al. (Meade et al., 2022) survey and
summarize the bias mitigation techniques in pre-trained language models. The
paper empirically studies the five currently popular bias mitigation techniques
and proposes three intrinsic bias benchmarks to quantify the effectiveness
of each technique. Huang et al. (Huang and Chang, 2023) comprehensively
describes the reasoning capabilities of LLMs, including technologies to improve
and induce reasoning capabilities in these models, methods and benchmarks for
assessing reasoning capabilities, and suggestions for future directions. As LLMs sometimes produce unrealistic yet seemingly plausible predictions (termed hallucinations, see (Welleck et al., 2020)), Mialon et al. (Mialon et al., 2023)
reviews two methods to enhance the abilities of LLMs, namely, by leveraging
reasoning skills and invoking external tools, aiming to improve context and
curb hallucinations. Xu et al. (Xu and McAuley, 2023) pays special attention
to the inference stage during the construction of LLMs and reviews the current
state of model compression and acceleration techniques, including benchmarks,
metrics, and methods.
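To make the "invoking external tools" idea concrete, here is a minimal sketch of the control loop such methods share: the model emits a tool-call marker, the runtime executes it, and the result is spliced back into the context. The CALL/RESULT syntax and the toy model are invented for illustration and are not an API from (Mialon et al., 2023).

import re

def run_with_tools(model_step, prompt, max_turns=3):
    context = prompt
    for _ in range(max_turns):
        output = model_step(context)                    # one decoding call to some LLM
        match = re.search(r"CALL calc\((.+?)\)", output)
        if not match:
            return output                               # no tool request: final answer
        result = eval(match.group(1), {"__builtins__": {}})  # toy arithmetic "tool"
        context = context + output + "\nRESULT: " + str(result) + "\n"
    return context

def toy_model(ctx):
    # Stand-in for an LLM: asks the calculator once, then answers.
    return "The answer is 42." if "RESULT" in ctx else "CALL calc(21 * 2)"

print(run_with_tools(toy_model, "What is 21 * 2?"))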
Zan et al. (Zan et al., 2023) investigates the performance of 27 large
models in the field of generating code from a natural language description
(or NL2Code). The main contribution of the paper is an intuitive comparison
of the NL2Code capabilities of LLMs on the HumanEval benchmark. However,
its research is limited to the software engineering task of code generation. It
does not summarize the applications of other LLMs in code generation, nor
does it investigate other software engineering tasks, such as code conversion.
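For readers unfamiliar with the metric behind such comparisons, HumanEval-style results are usually reported as pass@k, estimated without bias as 1 - C(n-c, k)/C(n, k) over n samples per problem, of which c pass the tests. The sketch below reproduces that standard estimator; it is illustrative and not code from (Zan et al., 2023).

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n-c, k) / C(n, k), computed as a stable product.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 passing -> estimated pass@10.
print(pass_at_k(n=200, c=37, k=10))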
Wong et al. (Wong et al., 2023) introduces popular large language model (LLM) based approaches and their applications in downstream tasks related to AI-assisted programming. The tasks covered in the article include
code generation, code completion, code translation, code refinement, code
summarization, defect detection, and clone detection. However, it is worth
noting that the AI-assisted methods discussed in this article are not limited to
LLM but also encompass various other AI techniques. Furthermore, the focus
of the article is solely on AI-assisted programming tasks.

7 Conclusion and Future Work

This paper comprehensively reviews the applications of Large Language Models (LLMs) in the field of software engineering. Firstly, we collected
and screened 123 works and literature related to the intersection of software
engineering and LLM. Secondly, by categorizing the selected literature
based on software engineering tasks, we revealed the research focus and
identified existing deficiencies in the integration of LLM with various software
engineering tasks. Finally, we carefully examined and summarized papers
evaluating the performance of LLMs in software engineering tasks, exposing
their capabilities and limitations, and providing directions for future research
and optimization.
Current works also reveal some future directions worth discussing: (1) A large part of the work in Section 4 proposes methods to improve the performance of LLMs on one or several software engineering tasks. Although most of these works do not provide detailed evaluations or discussions of LLM performance on these tasks, this might suggest that the current performance of LLMs on them is not good enough or not stable; (2) Most current evaluations target general-purpose large models, such as ChatGPT, and detailed evaluations of code-centric large models like Codex are still lacking; (3) Do we need to fine-tune large models for specific software engineering tasks to create large model products tailored to those tasks (a minimal fine-tuning sketch follows this paragraph)? We will gradually seek the answers to these questions in the future.
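As one concrete reading of direction (3), the sketch below outlines what task-specific fine-tuning could look like with parameter-efficient adapters. Everything in it is an assumption for illustration: the toy repair pair, the GPT-2 checkpoint, and the LoRA settings merely stand in for a real corpus and a real code LLM.

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy buggy/fixed pair standing in for a program-repair fine-tuning corpus.
pairs = [{"text": "### Buggy:\nreturn a - b\n### Fixed:\nreturn a + b"}]

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Wrap the base model with low-rank adapters; "c_attn" is GPT-2's attention projection.
model = get_peft_model(model, LoraConfig(r=8, target_modules=["c_attn"],
                                         task_type="CAUSAL_LM"))

def encode(batch):
    enc = tok(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]  # causal LM objective
    return enc

ds = Dataset.from_list(pairs).map(encode, batched=True, remove_columns=["text"])
args = TrainingArguments(output_dir="repair-lora", num_train_epochs=1,
                         per_device_train_batch_size=1)
Trainer(model=model, args=args, train_dataset=ds).train()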

8 Acknowledgements

References

Ahmad A, Waseem M, Liang P, Fehmideh M, Aktar MS, Mikkonen T (2023a) Towards human-bot collaborative software architecting with chatgpt. 2302.14600
Ahmad B, Tan B, Karri R, Pearce H (2023b) Flag: Finding line anomalies (in
code) with generative ai. 2306.12643
Ahmad B, Thakur S, Tan B, Karri R, Pearce H (2023c) Fixing hardware
security bugs with large language models. 2302.01215
Ahmed T, Devanbu PT (2022) Few-shot training llms for project-specific
code-summarization. In: 37th IEEE/ACM International Conference on
Automated Software Engineering, ASE 2022, Rochester, MI, USA, October
10-14, 2022, ACM
Ahmed T, GHOSH S, Bansal C, Zimmermann T, Zhang X, Rajmohan S (2023)
Recommending root-cause and mitigation steps for cloud incidents using
large language models. In: ICSE 2023
Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint
arXiv:160706450
Bareiß P, Souza B, d’Amorim M, Pradel M (2022) Code generation tools
(almost) for free? a study of few-shot, pre-trained language models on code.
2206.01335
Barke S, James MB, Polikarpova N (2023a) Grounded copilot: How
programmers interact with code-generating models. OOPSLA1 7, DOI
10.1145/3586030, URL https://doi.org/10.1145/3586030
Barke S, James MB, Polikarpova N (2023b) Grounded copilot: How
programmers interact with code-generating models. OOPSLA1 7
Bhaskar A, Fabbri AR, Durrett G (2023) Prompted opinion summarization with GPT-3.5. In: Rogers A, Boyd-Graber JL, Okazaki N (eds) Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, pp 9282–9300, DOI 10.18653/v1/2023.findings-acl.591, URL https://doi.org/10.18653/v1/2023.findings-acl.591
Bi Z, Chen J, Jiang Y, Xiong F, Guo W, Chen H, Zhang N (2023) Codekgc:
Code language model for generative knowledge graph construction. 2304.
09048
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan
A, Shyam P, Sastry G, Askell A, et al. (2020) Language models are few-shot
learners. Advances in neural information processing systems 33:1877–1901
Cai Y, Mao S, Wu W, Wang Z, Liang Y, Ge T, Wu C, You W, Song T, Xia
Y, Tien J, Duan N (2023) Low-code llm: Visual programming over llms.
2304.08103
Chai Y, Wang S, Pang C, Sun Y, Tian H, Wu H (2022) Ernie-code: Beyond
english-centric cross-lingual pretraining for programming languages. CoRR
abs/2212.06742, DOI 10.48550/arXiv.2212.06742, URL https://doi.org/10.48550/arXiv.2212.06742, 2212.06742
Chan A, Kharkar A, Moghaddam RZ, Mohylevskyy Y, Helyar A, Kamal
E, Elkamhawy M, Sundaresan N (2023) Transformer-based vulnerability
detection in code at edittime: Zero-shot, few-shot, or fine-tuning? 2306.
01754
Charalambous Y, Tihanyi N, Jain R, Sun Y, Ferrag MA, Cordeiro LC (2023) A
new era in software security: Towards self-healing software via large language
models and formal verification. 2305.14752
Chen A, Scheurer J, Korbak T, Campos JA, Chan JS, Bowman SR, Cho
K, Perez E (2023a) Improving code generation by training with natural
language feedback. 2303.16749
Chen B, Zhang F, Nguyen A, Zan D, Lin Z, Lou JG, Chen W (2022) Codet:
Code generation with generated tests. 2207.10397
Chen J, Xia X, Lo D, Grundy J, Yang X (2021) Maintenance-related concerns
for post-deployed ethereum smart contract development: issues, techniques,
and future challenges. Empirical Software Engineering 26(6):1–44
Chen X, Lin M, Schärli N, Zhou D (2023b) Teaching large language models
to self-debug. 2304.05128
Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham
P, Chung HW, Sutton C, Gehrmann S, et al. (2022) Palm: Scaling language
modeling with pathways. arXiv preprint arXiv:220402311
Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, Li E, Wang X,
Dehghani M, Brahma S, et al. (2022) Scaling instruction-finetuned language
models. arXiv preprint arXiv:221011416
Dinh T, Zhao J, Tan S, Negrinho R, Lausen L, Zha S, Karypis G (2023)
Large language models of code fail at completing code with potential bugs.
2306.03438
Dong Y, Jiang X, Jin Z, Li G (2023) Self-collaboration code generation via
chatgpt. 2304.07590
Escamilla E, Salsabil L, Klein M, Wu J, Weigle MC, Nelson ML (2023) It’s not
just github: Identifying data and software sources included in publications.
2307.14469
Fan L, Li L, Ma Z, Lee S, Yu H, Hemphill L (2023a) A bibliometric review of
large language models research from 2017 to 2023. 2304.02020
Fan Z, Gao X, Mirchev M, Roychoudhury A, Tan SH (2023b) Automated
repair of programs from large language models. 2205.10583
Feiyu (2023) Wechat. URL https://tern.cc/o150R4
Feldt R, Kang S, Yoon J, Yoo S (2023) Towards autonomous testing agents
via conversational large language models. 2306.05152
Feng S, Ma S, Yu J, Chen C, Zhou T, Zhen Y (2021) Auto-icon: An automated
code generation tool for icon designs assisting in UI development. In:
Hammond T, Verbert K, Parra D, Knijnenburg BP, O’Donovan J, Teale P
(eds) IUI ’21: 26th International Conference on Intelligent User Interfaces,
College Station, TX, USA, April 13-17, 2021, ACM, pp 59–69, DOI 10.1145/
3397481.3450671, URL https://doi.org/10.1145/3397481.3450671
Feng Y, Vanam S, Cherukupally M, Zheng W, Qiu M, Chen H (2023)
Investigating code generation performance of chat-gpt with crowdsourcing
social data. In: Proceedings of the 47th IEEE Computer Software and
Applications Conference, pp 1–10
Ferrario MA, Winter E (2023) Applying human values theory to software
engineering practice: Lessons and implications. IEEE Trans Software Eng
49(3):973–990, DOI 10.1109/TSE.2022.3170087, URL https://doi.org/10.1109/TSE.2022.3170087
gangz (2023) Gitee. URL https://gitee.com/gangz2009/tetris-by-chat-gpt/
Gao J, Guo Y, Lim G, Zhang T, Zhang Z, Li TJJ, Perrault ST (2023)
Collabcoder: A gpt-powered workflow for collaborative qualitative analysis.
2304.07366
Gao Z, Xia X, Lo D, Grundy JC (2022) Technical q&a site answer
recommendation via question boosting. CoRR abs/2210.15846, DOI 10.48550/arXiv.2210.15846, URL https://doi.org/10.48550/arXiv.2210.15846, 2210.15846
Gozalo-Brizuela R, Garrido-Merchan EC (2023) Chatgpt is not all you need.
a state of the art review of large generative ai models. 2301.04655
Gupta P, Khare A, Bajpai Y, Chakraborty S, Gulwani S, Kanade A,
Radhakrishna A, Soares G, Tiwari A (2023) Grace: Generation using
associated code edits. 2305.14129
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image
recognition. In: Proceedings of the IEEE conference on computer vision
and pattern recognition, pp 770–778
Hellas A, Leinonen J, Sarsa S, Koutcheme C, Kujanpää L, Sorva J (2023)
Exploring the responses of large language models to beginner programmers’
help requests. 2306.05715
Hoffmann M, Méndez D, Fagerholm F, Luckhardt A (2023) The human side
of software engineering teams: An investigation of contemporary challenges.
IEEE Trans Software Eng 49(1):211–225, DOI 10.1109/TSE.2022.3148539,
URL https://doi.org/10.1109/TSE.2022.3148539
Houde S, et al. (2022) User and technical perspectives of controllable code generation. In: NeurIPS HCAI workshop
Hu J, Zhang Q, Yin H (2023a) Augmenting greybox fuzzing with generative
ai. 2306.06782
Hu T, Xu Z, Fang Y, Wu Y, Yuan B, Zou D, Jin H (2023b) Fine-grained code
clone detection with block-based splitting of abstract syntax tree. In: Just
R, Fraser G (eds) Proceedings of the 32nd ACM SIGSOFT International
Symposium on Software Testing and Analysis, ISSTA 2023, Seattle, WA,
USA, July 17-21, 2023, ACM, pp 89–100, DOI 10.1145/3597926.3598040,
URL https://doi.org/10.1145/3597926.3598040
Hu Y, Yang H, Lin Z, Zhang M (2023c) Code prompting: a neural symbolic
method for complex reasoning in large language models. 2305.18507
Huggingface (2021) Training codeparrot from scratch. URL https://huggingface.co/blog/codeparrot
Huang J, Chang KCC (2023) Towards reasoning in large language models: A
survey. 2212.10403
Huang J, Lam MH, Li EJ, Ren S, Wang W, Jiao W, Tu Z, Lyu MR (2023) Emotionally numb or empathetic? evaluating how llms feel using emotionbench. 2308.03656
Jain A, Adiole C, Chaudhuri S, Reps T, Jermaine C (2023) Tuning models of
code with compiler-generated reinforcement learning feedback. 2305.18341
Jain N, Vaidyanath S, Iyer AS, Natarajan N, Parthasarathy S, Rajamani SK,
Sharma R (2022) Jigsaw: Large language models meet program synthesis. In:
44th IEEE/ACM 44th International Conference on Software Engineering,
ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, ACM, pp 1219–
1231, DOI 10.1145/3510003.3510203, URL https://doi.org/10.1145/3510003.3510203
Jalil S, Rafi S, LaToza TD, Moran K, Lam W (2023) ChatGPT and
software testing education: Promises & perils. In: 2023 IEEE International
Conference on Software Testing, Verification and Validation Workshops
(ICSTW), IEEE, DOI 10.1109/icstw58534.2023.00078, URL https://doi.org/10.1109/icstw58534.2023.00078
Jiang S, Wang Y, Wang Y (2023) Selfevolve: A code evolution framework via
large language models. 2306.02907
Jing N, Liu Q, Sugumaran V (2021) A blockchain-based code copyright
management system. Inf Process Manag 58(3):102518, DOI 10.1016/j.ipm.
2021.102518, URL https://doi.org/10.1016/j.ipm.2021.102518
Kande R, Pearce H, Tan B, Dolan-Gavitt B, Thakur S, Karri R, Rajendran J
(2023) Llm-assisted generation of hardware assertions. 2306.14027
Kang S, Yoon J, Yoo S (2022) Large language models are few-shot testers:
Exploring llm-based general bug reproduction. 2209.11515
Kaplan J, McCandlish S, Henighan T, Brown TB, Chess B, Child R, Gray
S, Radford A, Wu J, Amodei D (2020) Scaling laws for neural language
models. arXiv preprint arXiv:200108361
Khan JY, Uddin G (2023) Automatic code documentation generation using
gpt-3. In: Proceedings of the 37th IEEE/ACM International Conference on
Automated Software Engineering, Association for Computing Machinery, New York, NY, USA, ASE, DOI 10.1145/3551349.3559548, URL https://doi.org/10.1145/3551349.3559548
Khlaaf H, Mishkin P, Achiam J, Krueger G, Brundage M (2022) A hazard
analysis framework for code synthesis large language models. 2207.14157
Khoury R, Avila AR, Brunelle J, Camara BM (2023) How secure is code
generated by chatgpt? 2304.09655
Kitchenham BA (2007) Guidelines for performing systematic literature reviews in software engineering. EBSE Technical Report EBSE-2007-01, IEEE Computer Society
Kotti Z, Galanopoulou R, Spinellis D (2023) Machine learning for software
engineering: A tertiary study. ACM Comput Surv 55(12):256:1–256:39, DOI
10.1145/3572905, URL https://doi.org/10.1145/3572905
Kou B, Chen S, Wang Z, Ma L, Zhang T (2023) Is model attention aligned
with human attention? an empirical study on large language models for code
generation. 2306.01220
Lahiri SK, Naik A, Sakkas G, Choudhury P, von Veh C, Musuvathi M, Inala
JP, Wang C, Gao J (2022) Interactive code generation via test-driven user-
intent formalization. 2208.05950
Lanciano G, Stein M, Hilt V, Cucinotta T (2023) Analyzing declarative
deployment code with large language models. In: Proceedings of
the 13th International Conference on Cloud Computing and Services
Science - CLOSER, INSTICC, SciTePress, pp 289–296, DOI 10.5220/
0011991200003488
Lee J, Han K, Yu H (2022) A light bug triage framework for applying large
pre-trained language model. In: 37th IEEE/ACM International Conference
on Automated Software Engineering, ASE 2022, Rochester, MI, USA,
October 10-14, 2022, ACM, pp 3:1–3:11, DOI 10.1145/3551349.3556898,
URL https://doi.org/10.1145/3551349.3556898
Leinonen J, Denny P, MacNeil S, Sarsa S, Bernstein S, Kim J, Tran A,
Hellas A (2023a) Comparing code explanations created by students and
large language models. 2304.03938
Leinonen J, Hellas A, Sarsa S, Reeves B, Denny P, Prather J, Becker
BA (2023b) Using large language models to enhance programming error
messages. In: Proceedings of the 54th ACM Technical Symposium on
Computer Science Education V. 1, Association for Computing Machinery,
New York, NY, USA, SIGCSE 2023, p 563–569
Lemieux C, Inala JP, Lahiri S, Sen S (2023) Codamosa: Escaping coverage
plateaus in test generation with pre-trained large language models. In:
ICSE’23
Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V,
Zettlemoyer L (2019) Bart: Denoising sequence-to-sequence pre-training for
natural language generation, translation, and comprehension. arXiv preprint
arXiv:191013461
Li B, Fang G, Yang Y, Wang Q, Ye W, Zhao W, Zhang S (2023a) Evaluating
chatgpt’s information extraction capabilities: An assessment of performance,
explainability, calibration, and faithfulness. 2304.11633
Li J, Tang T, Zhao WX, Nie JY, Wen JR (2022a) Pretrained language models
for text generation: A survey. 2201.05273
Li J, Li G, Li Y, Jin Z (2023b) Enabling programming thinking in large
language models toward code generation. 2305.06599
Li P, Sun T, Tang Q, Yan H, Wu Y, Huang X, Qiu X (2023c) Codeie:
Large code generation models are better few-shot information extractors.
2305.05711
Li TO, Zong W, Wang Y, Tian H, Wang Y, Cheung SC, Kramer J (2023d)
Finding failure-inducing test cases with chatgpt. 2304.11686
Li XY, Xue JT, Xie Z, Li M (2023e) Think outside the code: Brainstorming
boosts large language models in code generation. 2305.10679
Li Y, Choi D, Chung J, Kushman N, Schrittwieser J, Leblond R, Eccles T,
Keeling J, Gimeno F, Lago AD, Hubert T, Choy P, de Masson d’Autume
C, Babuschkin I, Chen X, Huang PS, Welbl J, Gowal S, Cherepanov A,
Molloy J, Mankowitz DJ, Robson ES, Kohli P, de Freitas N, Kavukcuoglu
K, Vinyals O (2022b) Competition-level code generation with alphacode.
Science 378(6624):1092–1097, DOI 10.1126/science.abq1158, URL https://www.science.org/doi/abs/10.1126/science.abq1158, https://www.science.org/doi/pdf/10.1126/science.abq1158
Li Z, Ma P, Wang H, Wang S, Tang Q, Nie S, Wu S (2022c) Unleashing
the power of compiler intermediate representation to enhance neural
program embeddings. In: 44th IEEE/ACM 44th International Conference
on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27,
2022, ACM, pp 2253–2265, DOI 10.1145/3510003.3510217, URL https://doi.org/10.1145/3510003.3510217
Li Z, Wang C, Liu Z, Wang H, Chen D, Wang S, Gao C (2023f) Cctest: Testing
and repairing code completion systems. 2208.08289
Lin T, Wang Y, Liu X, Qiu X (2022a) A survey of transformers. AI Open
Lin Z, Li G, Zhang J, Deng Y, Zeng X, Zhang Y, Wan Y (2022b) Xcode:
Towards cross-language code representation with large-scale pre-training.
ACM Trans Softw Eng Methodol 31(3), DOI 10.1145/3506696, URL https://doi.org/10.1145/3506696
Liu A, Hu X, Wen L, Yu PS (2023a) A comprehensive evaluation of chatgpt’s
zero-shot text-to-sql capability. 2303.13547
Liu J, Xia CS, Wang Y, Zhang L (2023b) Is your code generated by
chatgpt really correct? rigorous evaluation of large language models for code
generation. 2305.01210
Liu MX, Sarkar A, Negreanu C, Zorn B, Williams J, Toronto N, Gordon
AD (2023c) “what it wants me to say”: Bridging the abstraction gap
between end-user programmers and code-generating large language models.
Association for Computing Machinery, New York, NY, USA, CHI ’23
Liu X, Yin D, Zhang C, Feng Y, Zhao D (2023d) The magic of if: Investigating
causal reasoning abilities in large language models of code. 2305.19213
Liu Y, Han T, Ma S, Zhang J, Yang Y, Tian J, He H, Li A, He M, Liu Z,
Wu Z, Zhu D, Li X, Qiang N, Shen D, Liu T, Ge B (2023e) Summary of
chatgpt/gpt-4 research and perspective towards the future of large language models. 2304.01852
Liu Z, Qian P, Wang X, Zhuang Y, Qiu L, Wang X (2023f) Combining graph
neural networks with expert knowledge for smart contract vulnerability
detection. IEEE Trans Knowl Data Eng 35(2):1296–1310, DOI 10.1109/TKDE.2021.3095196, URL https://doi.org/10.1109/TKDE.2021.3095196
Ma W, Liu S, Wang W, Hu Q, Liu Y, Zhang C, Nie L, Liu Y (2023) The scope
of chatgpt in software engineering: A thorough investigation. 2305.12138
MacNeil S, Tran A, Hellas A, Kim J, Sarsa S, Denny P, Bernstein S, Leinonen
J (2022a) Experiences from using code explanations generated by large
language models in a web software development e-book. 2211.02265
MacNeil S, Tran A, Leinonen J, Denny P, Kim J, Hellas A, Bernstein S,
Sarsa S (2022b) Automatically generating CS learning materials with large
language models. In: Proceedings of the 54th ACM Technical Symposium
on Computer Science Education V. 2, ACM, DOI 10.1145/3545947.3569630,
URL https://doi.org/10.1145/3545947.3569630
MacNeil S, Tran A, Mogil D, Bernstein S, Ross E, Huang Z (2022c)
Generating diverse code explanations using the gpt-3 large language model.
In: Proceedings of the 2022 ACM Conference on International Computing
Education Research - Volume 2, Association for Computing Machinery, New
York, NY, USA
Madaan A, Shypula A, Alon U, Hashemi M, Ranganathan P, Yang Y, Neubig
G, Yazdanbakhsh A (2023) Learning performance-improving code edits.
2302.07867
Maddigan P, Susnjak T (2023) Chat2vis: Generating data visualisations via
natural language using chatgpt, codex and gpt-3 large language models.
2302.02094
Manh DN, Hai NL, Dau ATV, Nguyen AM, Nghiem K, Guo J, Bui NDQ
(2023) The vault: A comprehensive multilingual dataset for advancing code
understanding and generation. 2305.06156
Maruf AA, Lambaria N, Abdelfattah AS, Cerný T (2021) Using version
control and issue tickets to detect code debt and economical cost. In: 36th
IEEE/ACM International Conference on Automated Software Engineering,
ASE 2021, Melbourne, Australia, November 15-19, 2021, IEEE, pp 1208–
1209, DOI 10.1109/ASE51524.2021.9678532, URL https://doi.org/10.1109/ASE51524.2021.9678532
Meade N, Poole-Dayan E, Reddy S (2022) An empirical survey of the
effectiveness of debiasing techniques for pre-trained language models. 2110.
08527
Mialon G, Dessì R, Lomeli M, Nalmpantis C, Pasunuru R, Raileanu R, Rozière
B, Schick T, Dwivedi-Yu J, Celikyilmaz A, Grave E, LeCun Y, Scialom T
(2023) Augmented language models: a survey. 2302.07842
Min B, Ross H, Sulem E, Veyseh APB, Nguyen TH, Sainz O, Agirre E,
Heintz I, Roth D (2023) Recent advances in natural language processing
via large pre-trained language models: A survey. ACM Comput Surv DOI
10.1145/3605943, URL https://doi.org/10.1145/3605943
Mouselinos S, Malinowski M, Michalewski H (2023) A simple, yet effective
approach to finding biases in code generation. 2211.00609
Murali V, Maddila C, Ahmad I, Bolin M, Cheng D, Ghorbani N, Fernandez
R, Nagappan N (2023) Codecompose: A large-scale industrial deployment
of ai-assisted code authoring. 2305.12050
Ni A, Iyer S, Radev D, Stoyanov V, tau Yih W, Wang SI, Lin XV (2023) Lever:
Learning to verify language-to-code generation with execution. 2302.08468
Nichols D, Marathe A, Menon H, Gamblin T, Bhatele A (2023) Modeling
parallel programs using large language models. 2306.17281
Nijkamp E, Pang B, Hayashi H, Tu L, Wang H, Zhou Y, Savarese S, Xiong
C (2023) Codegen: An open large language model for code with multi-
turn program synthesis. In: The Eleventh International Conference on
Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023,
OpenReview.net, URL https://openreview.net/pdf?id=iaYcJKpY2B_
Olausson TX, Inala JP, Wang C, Gao J, Solar-Lezama A (2023) Demystifying
gpt self-repair for code generation. 2306.09896
Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C,
Agarwal S, Slama K, Ray A, et al. (2022) Training language models to
follow instructions with human feedback. Advances in Neural Information
Processing Systems 35:27730–27744
Ozkaya I (2023) Application of large language models to software engineering
tasks: Opportunities, risks, and implications. IEEE Software 40(3):4–8, DOI
10.1109/MS.2023.3248401
Pearce H, Tan B, Ahmad B, Karri R, Dolan-Gavitt B (2022a) Examining
zero-shot vulnerability repair with large language models. 2112.02125
Pearce H, Tan B, Krishnamurthy P, Khorrami F, Karri R, Dolan-Gavitt B
(2022b) Pop quiz! can a large language model help with reverse engineering?
2202.01142
Poesia G, Polozov O, Le V, Tiwari A, Soares G, Meek C, Gulwani S (2022)
Synchromesh: Reliable code generation from pre-trained language models.
2201.11227
Potluri V, Pandey M, Begel A, Barnett M, Reitherman S (2022) Codewalk:
Facilitating shared awareness in mixed-ability collaborative software
development. In: Froehlich J, Shinohara K, Ludi S (eds) Proceedings of
the 24th International ACM SIGACCESS Conference on Computers and
Accessibility, ASSETS 2022, Athens, Greece, October 23-26, 2022, ACM,
pp 20:1–20:16, DOI 10.1145/3517428.3544812, URL https://doi.org/10.
1145/3517428.3544812
Prenner JA, Babii H, Robbes R (2022) Can openai’s codex fix bugs?
an evaluation on quixbugs. In: Proceedings of the Third International
Workshop on Automated Program Repair, Association for Computing
Machinery, New York, NY, USA, p 69–75
Radford A, Narasimhan K, Salimans T, Sutskever I, et al. (2018) Improving
language understanding by generative pre-training
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W,
Liu PJ (2020) Exploring the limits of transfer learning with a unified text-
to-text transformer. The Journal of Machine Learning Research 21(1):5485–
5551
Rajkumar N, Li R, Bahdanau D (2022) Evaluating the text-to-sql capabilities
of large language models. 2204.00498
Reese TJ, Segall N, Nesbitt P, Fiol GD, Waller R, Macpherson BC, Tonna
JE, Wright MC (2018) Patient information organization in the intensive
care setting: expert knowledge elicitation with card sorting methods. J Am
Medical Informatics Assoc 25(8):1026–1035, DOI 10.1093/jamia/ocy045,
URL https://doi.org/10.1093/jamia/ocy045
Ross SI, Martinez F, Houde S, Muller M, Weisz JD (2023) The programmer’s
assistant: Conversational interaction with a large language model for
software development. In: Proceedings of the 28th International Conference
on Intelligent User Interfaces, Association for Computing Machinery, New
York, NY, USA, p 491–514
Sadik AR, Ceravola A, Joublin F, Patra J (2023) Analysis of chatgpt on source
code. 2306.00597
Sandoval G, Pearce H, Nys T, Karri R, Garg S, Dolan-Gavitt B (2023) Lost
at c: A user study on the security implications of large language model code
assistants. 2208.09727
Sanh V, Webson A, Raffel C, Bach SH, Sutawika L, Alyafeai Z, Chaffin A,
Stiegler A, Scao TL, Raja A, et al. (2021) Multitask prompted training
enables zero-shot task generalization. arXiv preprint arXiv:211008207
Sarkar A, Gordon AD, Negreanu C, Poelitz C, Ragavan SS, Zorn B (2022)
What is it like to program with artificial intelligence? 2208.06213
Sarsa S, Denny P, Hellas A, Leinonen J (2022) Automatic generation of
programming exercises and code explanations using large language models.
In: Proceedings of the 2022 ACM Conference on International Computing
Education Research - Volume 1, Association for Computing Machinery, New
York, NY, USA
Savelka J, Agarwal A, An M, Bogart C, Sakr M (2023) Thrilled
by your progress! large language models (GPT-4) no longer struggle
to pass assessments in higher education programming courses. CoRR
abs/2306.10073, DOI 10.48550/arXiv.2306.10073, URL https://doi.org/10.48550/arXiv.2306.10073
Scao TL, Fan A, Akiki C, Pavlick E, Ilić S, Hesslow D, Castagné R, Luccioni
AS, Yvon F, Gallé M, et al. (2022) Bloom: A 176b-parameter open-access
multilingual language model. arXiv preprint arXiv:221105100
Shanahan M (2022) Talking about large language models. arXiv preprint
arXiv:221203551
Shirafuji A, Watanobe Y, Ito T, Morishita M, Nakamura Y, Oda Y, Suzuki
J (2023) Exploring the robustness of large language models for solving
programming problems. 2306.14583
Shrivastava D, Kocetkov D, de Vries H, Bahdanau D, Scholak T (2023a) Repofusion: Training code models to understand your repository. 2306.10998
Shrivastava D, Larochelle H, Tarlow D (2023b) Repository-level prompt
generation for large language models of code. 2206.12839
Sridhara G, G RH, Mazumdar S (2023) Chatgpt: A study on its utility for
ubiquitous software engineering tasks. 2305.16837
Sun W, Fang C, You Y, Miao Y, Liu Y, Li Y, Deng G, Huang S, Chen Y,
Zhang Q, Qian H, Liu Y, Chen Z (2023) Automatic code summarization via
chatgpt: How far are we? 2305.12865
Tang R, Chuang YN, Hu X (2023a) The science of detecting llm-generated
texts. 2303.07205
Tang Y, Liu Z, Zhou Z, Luo X (2023b) Chatgpt vs sbst: A comparative
assessment of unit test suite generation. 2307.00588
Tay Y, Wei J, Chung HW, Tran VQ, So DR, Shakeri S, Garcia X, Zheng HS,
Rao J, Chowdhery A, et al. (2022) Transcending scaling laws with 0.1%
extra compute. arXiv preprint arXiv:221011399
Thakur S, Ahmad B, Fan Z, Pearce H, Tan B, Karri R, Dolan-Gavitt B, Garg
S (2023a) Benchmarking large language models for automated verilog rtl
code generation. In: 2023 Design, Automation & Test in Europe Conference
& Exhibition (DATE), pp 1–6, DOI 10.23919/DATE56975.2023.10137086
Thakur S, Ahmad B, Fan Z, Pearce H, Tan B, Karri R, Dolan-Gavitt B, Garg
S (2023b) Benchmarking large language models for automated verilog rtl
code generation. In: 2023 Design, Automation & Test in Europe Conference
& Exhibition (DATE), pp 1–6
Thoppilan R, Freitas DD, Hall J, Shazeer N, Kulshreshtha A, Others (2022)
Lamda: Language models for dialog applications. CoRR abs/2201.08239,
URL https://arxiv.org/abs/2201.08239, 2201.08239
Tian H, et al. (2023) Is chatgpt the ultimate programming assistant – how far
is it? 2304.11938
Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, Rozière
B, Goyal N, Hambro E, Azhar F, et al. (2023) Llama: Open and efficient
foundation language models. arXiv preprint arXiv:230213971
Treude C, Hata H (2023) She elicits requirements and he tests: Software
engineering gender bias in large language models. 2303.10131
Tu H, Zhou Z, Jiang H, Yusuf INB, Li Y, Jiang L (2023) Llm4cbi: Taming llms
to generate effective test programs for compiler bug isolation. 2307.00593
Vaithilingam P, Zhang T, Glassman EL (2022) Expectation vs. experience:
Evaluating the usability of code generation tools powered by large language
models. In: Extended Abstracts of the 2022 CHI Conference on Human
Factors in Computing Systems, Association for Computing Machinery, New
York, NY, USA
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser
L, Polosukhin I (2017) Attention is all you need. Advances in neural
information processing systems 30
Wang B, Xie Q, Pei J, Chen Z, Tiwari P, Li Z, Fu J (2023a) Pre-trained language models in biomedical domain: A systematic survey. 2110.05006
Wang C, Pastore F, Goknil A, Briand LC (2022) Automatic generation of
acceptance test cases from use case specifications: An nlp-based approach.
IEEE Trans Software Eng 48(2):585–616, DOI 10.1109/TSE.2020.2998503,
URL https://doi.org/10.1109/TSE.2020.2998503
Wang H, Gonzalez-Pumariega G, Sharma Y, Choudhury S (2023b)
Demo2code: From summarizing demonstrations to synthesizing code via
extended chain-of-thought. 2305.16744
Wang X, Li S, Ji H (2023c) Code4struct: Code generation for few-shot event
structure prediction. 2210.12810
Wang Y, Wang W, Joty S, Hoi SC (2021) Codet5: Identifier-aware unified
pre-trained encoder-decoder models for code understanding and generation.
arXiv preprint arXiv:210900859
Wang Y, Le H, Gotmare AD, Bui ND, Li J, Hoi SCH (2023d) Codet5+: Open
code large language models for code understanding and generation. arXiv
preprint
Wang Y, Le H, Gotmare AD, Bui NDQ, Li J, Hoi SCH (2023e) Codet5+:
Open code large language models for code understanding and generation.
2305.07922
Wei C, Wang YC, Wang B, Kuo CCJ (2023) An overview on language models:
Recent developments and outlook. 2303.05759
Wei J, Bosma M, Zhao VY, Guu K, Yu AW, Lester B, Du N, Dai AM, Le
QV (2021) Finetuned language models are zero-shot learners. arXiv preprint
arXiv:210901652
Wei J, Tay Y, Bommasani R, Raffel C, Zoph B, Borgeaud S, Yogatama D,
Bosma M, Zhou D, Metzler D, et al. (2022) Emergent abilities of large
language models. arXiv preprint arXiv:220607682
Welleck S, Kulikov I, Roller S, Dinan E, Cho K, Weston J (2020) Neural text
generation with unlikelihood training. In: 8th International Conference on
Learning Representations, ICLR, OpenReview.net
White J, Fu Q, Hays S, Sandborn M, Olea C, Gilbert H, Elnashar A, Spencer-
Smith J, Schmidt DC (2023a) A prompt pattern catalog to enhance prompt
engineering with chatgpt. 2302.11382
White J, Hays S, Fu Q, Spencer-Smith J, Schmidt DC (2023b) Chatgpt prompt
patterns for improving code quality, refactoring, requirements elicitation,
and software design. 2303.07839
Wong MF, Guo S, Hang CN, Ho SW, Tan CW (2023) Natural language
generation and understanding of big code for AI-assisted programming:
A review. Entropy 25(6):888, DOI 10.3390/e25060888, URL https://doi.org/10.3390/e25060888
Wu L, Zheng Z, Qiu Z, Wang H, Gu H, Shen T, Qin C, Zhu C, Zhu H,
Liu Q, Xiong H, Chen E (2023) A survey on large language models for
recommendation. 2305.19860
Xia CS, Zhang L (2023a) Conversational automated program repair. 2301.
13246
Xia CS, Zhang L (2023b) Keep the conversation going: Fixing 162 out of 337
bugs for $0.42 each using chatgpt. 2304.00385
Xia CS, Wei Y, Zhang L (2022) Practical program repair in the era of large
pre-trained language models. 2210.14179
Xiao Z, Yuan X, Liao QV, Abdelghani R, Oudeyer PY (2023) Supporting
qualitative analysis with large language models: Combining codebook with
GPT-3 for deductive coding. In: 28th International Conference on Intelligent
User Interfaces, ACM, DOI 10.1145/3581754.3584136, URL https://doi.org/10.1145/3581754.3584136
Xing Z, Huang Q, Cheng Y, Zhu L, Lu Q, Xu X (2023) Prompt sapper:
Llm-empowered software engineering infrastructure for ai-native services.
2306.02230
Xu B, Xing Z, Xia X, Lo D (2017) Answerbot: automated generation of answer summary to developers' technical questions. In: Rosu G, Penta MD, Nguyen TN (eds) Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, Urbana, IL, USA, October 30 - November 03, 2017, IEEE Computer Society, pp 706–716, DOI 10.1109/ASE.2017.8115681, URL https://doi.org/10.1109/ASE.2017.8115681
Xu C, McAuley J (2023) A survey on model compression and acceleration
for pretrained language models. Proceedings of the AAAI Conference on
Artificial Intelligence 37(9):10566–10575, DOI 10.1609/aaai.v37i9.26255,
URL https://ojs.aaai.org/index.php/AAAI/article/view/26255
Xu X, Zhang Z, Feng S, Ye Y, Su Z, Jiang N, Cheng S, Tan L, Zhang X
(2023) Lmpa: Improving decompilation by synergy of large language model
and program analysis. 2306.02546
Yang J, Prabhakar A, Narasimhan K, Yao S (2023) Intercode: Standardizing
and benchmarking interactive coding with execution feedback. 2306.14898
Yuan Z, Lou Y, Liu M, Ding S, Wang K, Chen Y, Peng X (2023) No more
manual tests? evaluating and improving chatgpt for unit test generation.
2305.04207
Zamfirescu-Pereira J, Wong RY, Hartmann B, Yang Q (2023) Why johnny
can’t prompt: How non-ai experts try (and fail) to design llm prompts. In:
Proceedings of the 2023 CHI Conference on Human Factors in Computing
Systems, Association for Computing Machinery, New York, NY, USA
Zan D, Chen B, Zhang F, Lu D, Wu B, Guan B, Wang Y, Lou JG (2023)
Large language models meet nl2code: A survey. 2212.09420
Zeng A, Liu X, Du Z, Wang Z, Lai H, Ding M, Yang Z, Xu Y, Zheng W,
Xia X, et al. (2022) Glm-130b: An open bilingual pre-trained model. arXiv
preprint arXiv:221002414
Zhang K, Li Z, Li J, Li G, Jin Z (2023a) Self-edit: Fault-aware code editor for
code generation. 2305.04087
Zhang K, Wang D, Xia J, Wang WY, Li L (2023b) Algo: Synthesizing
algorithmic programs with generated oracle verifiers. 2305.14591
Zhang N, Huang Q, Xia X, Zou Y, Lo D, Xing Z (2022a) Chatbot4qr:
Interactive query refinement for technical question retrieval. IEEE Trans
Software Eng 48(4):1185–1211, DOI 10.1109/TSE.2020.3016006, URL https://doi.org/10.1109/TSE.2020.3016006
Zhang R, Cahyawijaya S, Cruz JCB, Aji AF (2023c) Multilingual large
language models are not (yet) code-switchers. 2305.14235
Zhang S, Roller S, Goyal N, Artetxe M, Chen M, Chen S, Dewan C, Diab M,
Li X, Lin XV, et al. (2022b) Opt: Open pre-trained transformer language
models. arXiv preprint arXiv:220501068
Zhao J, Rong Y, Guo Y, He Y, Chen H (2023a) Understanding programs by
exploiting (fuzzing) test cases. 2305.13592
Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, Zhang
J, Dong Z, et al. (2023b) A survey of large language models. arXiv preprint
arXiv:230318223
Zheng Q, Xia X, Zou X, Dong Y, Wang S, Xue Y, Wang Z, Shen L, Wang
A, Li Y, Su T, Yang Z, Tang J (2023) Codegeex: A pre-trained model
for code generation with multilingual evaluations on humaneval-x. CoRR
abs/2303.17568, DOI 10.48550/arXiv.2303.17568, URL https://doi.org/10.48550/arXiv.2303.17568, 2303.17568
Zhu R, Zhang C (2023) How robust is a large pre-trained language model for code generation? A case on attacking gpt2. In: 2023 IEEE International
Conference on Software Analysis, Evolution and Reengineering (SANER),
pp 708–712, DOI 10.1109/SANER56733.2023.00076
Zhuo TY (2023) Large language models are state-of-the-art evaluators of code
generation. 2304.14317
Zhuo TY, Li Z, Huang Y, Shiri F, Wang W, Haffari G, Li YF (2023)
On robustness of prompt-based semantic parsing with large pre-trained
language model: An empirical study on codex. In: Proceedings of the 17th
Conference of the European Chapter of the Association for Computational
Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia,
pp 1090–1102
