A Survey on Evaluating Large Language Models in Code Generation Tasks

Liguo Chen1, Qi Guo1, Hongrui Jia1, Zhengran Zeng1, Xin Wang1, Yijiang Xu1, Jian Wu3, Yidong Wang1, Qing Gao1, Jindong Wang2, Wei Ye1, Shikun Zhang1∗

arXiv:2408.16498v1 [cs.SE] 29 Aug 2024

1 Peking University, Beijing, China
2 Microsoft Research Asia, Beijing, China
3 Tokyo Institute of Technology, Tokyo, Japan
Abstract This paper provides a comprehensive review of the current methods and metrics used to evaluate the per-
formance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated
software development, LLMs have demonstrated significant potential in the field of code generation. The paper begins by
reviewing the historical development of LLMs and their applications in code generation. Next, it details various methods
and metrics for assessing the code generation capabilities of LLMs, including code correctness, efficiency, readability, and
evaluation methods based on expert review and user experience. The paper also evaluates the widely used benchmark
datasets, identifying their limitations and proposing directions for future improvements. Specifically, the paper analyzes
the performance of code generation models across different tasks by combining multiple evaluation metrics, such as code
compilation/interpretation success rates, unit test pass rates, and performance and efficiency metrics, to comprehensively
assess the practical application of LLMs in code generation. Finally, the paper discusses the challenges faced in evaluating
LLMs in code generation, particularly how to ensure the comprehensiveness and accuracy of evaluation methods and how
to adapt to the evolving practices of software development. These analyses and discussions provide valuable insights for
further optimizing and improving the application of LLMs in code generation tasks.
Keywords Large Language Models, Code Generation, Evaluation Methods, Evaluation Metrics
Regular Paper
* corresponding authors.
• Converting a user story or a feature description into a functional programming script.

• Filling in code templates: Completing partially written code snippets by adding the missing parts based on the provided context.

• Refactoring and optimizing existing code: Modifying and improving code structure and efficiency without altering its functionality.

• Generating test cases: Creating test scripts and scenarios to validate the correctness and performance of a given piece of code.

By automating these tasks, LLMs can significantly streamline the software development process, reduce the likelihood of human error, and enable developers to focus on more complex and creative aspects of programming. As LLMs continue to be deeply integrated into the field of software engineering, it is necessary for us to fully understand their capabilities and limitations in code generation. This will not only help to improve the efficiency of software development but also promote further development of LLM technology in related fields. Finally, we discuss the challenges faced by LLMs in the field of code generation, such as reliability, security, and interpretability, and look forward to future research prospects [26, 35].

The rest of this work is organized as follows: Section 2 focuses on the metrics for assessing the code generation capabilities of large language models. This section first discusses metrics such as code correctness, efficiency, and readability [26]. It then introduces evaluation methods based on expert review and user experience [7]. These evaluation metrics and methods can comprehensively reflect the performance of large language models in code generation tasks. Section 3 reviews the current benchmarks and datasets used for evaluating code generation. This section begins with a summary of existing code generation benchmark suites, such as HumanEval, MBPP, and CodeXGLUE [32, 49]. It then analyzes in detail the various metrics covered by these benchmark suites, including code correctness, efficiency, and readability [15]. The analysis indicates that while existing benchmarks and evaluation metrics have made certain progress, they still have some limitations, such as the difficulty in fully reflecting the performance of models in real-world scenarios.
This survey provides a reference for evaluating the performance of large language models in code generation tasks by systematically reviewing existing methods and metrics and looking forward to future development trends. Additionally, AutoSurvey [43] was utilized to retrieve relevant literature, ensuring a comprehensive overview. We hope that this review provides valuable insights for researchers and practitioners in the field, thereby fostering further advancements in this area.

2 Code Generation Evaluation Metrics

2.1 Evaluation Based on Similarity

Similarity-based evaluation methods in code generation primarily assess the quality of generated code by comparing its similarity to reference code. Figure 1 provides a classification of various evaluation metrics used in code generation benchmarks. Evaluation methods are generally divided into the following categories: similarity-based metrics, execution-based metrics, and feedback-based metrics. We discuss several representative metrics in detail in the following sections.

[Fig.1. Classification of evaluation metrics used in code generation benchmarks; among other metrics it lists Accuracy [51], F1 score [27], and Mean Reciprocal Rank (MRR) [55].]

2.1.1 Traditional Similarity Metrics

Traditional similarity metrics, initially used in the field of natural language processing, have also been applied to the evaluation of code generation capabilities. These methods assess the quality of generated code by calculating its similarity to reference code. Common similarity metrics include BLEU, ROUGE, and METEOR, with the CodeXGLUE dataset utilizing the BLEU metric to calculate the similarity between generated and correct code.

BLEU (Bilingual Evaluation Understudy) is a metric used for machine translation evaluation, first proposed by Papineni et al. [33]. It measures the overlap of n-grams between the generated text and reference text. Specifically, BLEU calculates the number of matching n-grams between the generated text and reference text, then averages these matches with weights to produce an overall similarity score.

In the context of code generation, BLEU is used to evaluate the similarity between generated and reference code. Although BLEU excels in natural language processing, it also has some issues when applied to code generation. First, the syntax and semantic structure of code are more complex than natural language, so relying solely on n-gram matching may not accurately measure code similarity. Second, code often contains a large number of unique identifiers such as variable names and function names, which may be completely different across code snippets that actually perform the same function.

Nevertheless, due to its simplicity and intuitiveness, BLEU is still widely used for preliminary evaluation in code generation. For example, the CodeXGLUE dataset [27] uses BLEU as one of its main evaluation metrics to measure the similarity between generated and correct code. This indicates that, despite its flaws, BLEU remains a useful tool, especially when more complex alternatives are not available.

Besides BLEU, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and METEOR (Metric for Evaluation of Translation with Explicit ORdering) are two traditional similarity metrics initially used in natural language processing that have been adapted for code generation evaluation. ROUGE focuses on recall, measuring the frequency of n-grams from the reference text appearing in the generated text, which can capture subtle differences when the generated code covers the logical steps of the reference code despite different variable and function names. METEOR, on the other hand, combines precision and recall while considering lexical matching,
multi-word matching, and semantic relationships. Its nuanced approach to lexical matching, including synonyms and morphological variations, makes it more flexible for code evaluation, accurately reflecting the quality of generated code even when different variable and function names are used. Both metrics complement each other and help address some of the limitations of BLEU in evaluating the similarity and quality of generated code.

The Exact Match (EM) metric [27] is a straightforward and stringent evaluation method used to measure the accuracy of generated code by directly comparing it to the reference code. This metric is particularly useful in assessing the overall correctness of the code completion process, taking into account elements such as identifiers, keywords, operators, delimiters, and literals. Exact Match calculates the percentage of generated code snippets that exactly match the reference code snippets. This means that for a generated code snippet to be considered a match, it must be identical to the reference code in every aspect, including syntax, structure, and content. It provides a clear and unambiguous measure of code generation accuracy, making it a valuable tool for evaluating code completion models.

Edit Distance [1], also known as Levenshtein Distance, is a widely used metric for evaluating the similarity between two sequences by measuring the minimum number of single-character edits required to transform one sequence into the other. These edits include insertions, deletions, and substitutions. In the context of code generation, Edit Distance can be employed to assess how closely the generated code matches the reference code by quantifying the effort needed to convert one into the other. Edit Distance is calculated using the following recursive definition:

$$
\operatorname{lev}(a,b)=
\begin{cases}
|a| & \text{if } |b|=0,\\
|b| & \text{if } |a|=0,\\
\operatorname{lev}(\operatorname{tail}(a),\operatorname{tail}(b)) & \text{if } \operatorname{head}(a)=\operatorname{head}(b),\\
1+\min\{\operatorname{lev}(\operatorname{tail}(a),b),\ \operatorname{lev}(a,\operatorname{tail}(b)),\ \operatorname{lev}(\operatorname{tail}(a),\operatorname{tail}(b))\} & \text{otherwise}
\end{cases}
\tag{1}
$$

Here tail(x) denotes the string obtained by removing the first character of x (i.e., tail(x_0 x_1 … x_n) = x_1 x_2 … x_n), and head(x) represents the first character of x (i.e., head(x_0 x_1 … x_n) = x_0). In the minimum operation, the first term corresponds to deletion (from a to b), the second to insertion, and the third to replacement. The Edit Distance metric is particularly useful for code generation evaluation as it provides a more flexible measure of similarity compared to Exact Match. It accounts for minor differences and variations in the code, such as different variable names or slight
changes in syntax, while still reflecting the overall effort needed to achieve an exact match. This makes it a valuable complement to other metrics like BLEU, ROUGE, METEOR, and Exact Match, offering a more nuanced understanding of the generated code's quality.

2.1.2 Code-Specific Similarity Metrics

To more accurately assess the quality of code generation, researchers have developed similarity metrics specifically tailored for code. These methods introduce programming-specific characteristics, such as Abstract Syntax Trees (ASTs), data flow, and token matching, to provide a more comprehensive evaluation of the syntactic and semantic similarity between generated and reference code.

CodeBLEU is an evaluation method specifically for code that extends the traditional BLEU metric [36]. CodeBLEU retains n-gram matching from BLEU while introducing syntactic and semantic information about the code. Specifically, it includes weighted n-gram matching, syntactic AST matching, and semantic data flow matching. Each component produces a score, and the scores are combined in a weighted manner to yield the total score. The weighted n-gram matching is an extension of the traditional BLEU algorithm, assigning different weights to different n-grams to better reflect the keywords and structures in code. Syntactic AST matching uses Abstract Syntax Trees to compare the syntactic structure of candidate and reference code, calculating scores by matching subtrees. Semantic data flow matching evaluates the semantic similarity of code by analyzing the data flow graphs within the code. Such a design enables CodeBLEU to capture not only the surface similarity of code but also its internal logic and functionality, improving the accuracy and comprehensiveness of the evaluation. In summary, when calculating similarity, CodeBLEU analyzes the AST structures of both the generated and reference code, ensuring that not only the textual content is similar but also that the code structure and logic are as close as possible. This allows CodeBLEU to more fully consider the syntactic and semantic similarity of code when evaluating the quality of code generation, thereby providing a more accurate assessment.

CodeBLEU was used to evaluate the task of generating code from natural language descriptions, and the results showed that the Pearson correlation coefficient between CodeBLEU and programmer scoring was higher than that of BLEU and accuracy metrics [36]. The application of CodeBLEU is not limited to a single programming language; it performs excellently across multiple programming languages [36].

Other Similarity Methods include metrics based on data flow analysis and semantic similarity metrics. Data flow analysis assesses code quality by comparing the similarity of data flows between generated and reference code, providing a deeper understanding of the code's semantics. Data-aware techniques analyze variables and data flows in generated code to verify functional correctness, ensuring the generated code achieves the expected functionality [10]. These metrics are also used to evaluate code optimization and repair, significantly enhancing code performance [48].

Semantic similarity metrics focus on the actual functionality and behavior of the code. For instance, in code summarization tasks, semantic similarity metrics evaluate the quality of the summary by measuring the semantic similarity between the generated code summary and the reference summary [15]. Another example is DeepSemantic, which utilizes deep learning models to generate semantic representations of binary code for code similarity measurement, showing potential in cross-architecture vulnerability detection and patch analysis [21].
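To make these string-level metrics concrete, the sketch below gives minimal Python versions of Exact Match, a simplified BLEU-style n-gram precision, the Levenshtein distance of Eq. (1) in its usual dynamic-programming form, and a CodeBLEU-style weighted combination. These are illustrative only; actual evaluations should use the official implementations (e.g., the CodeXGLUE and CodeBLEU scripts), and the AST and data-flow components of CodeBLEU require language-specific tooling.

from collections import Counter

def exact_match(candidates, references):
    # Fraction of generated snippets that are identical to their references.
    pairs = list(zip(candidates, references))
    return sum(c == r for c, r in pairs) / len(pairs)

def ngram_precision(candidate, reference, max_n=4):
    # Simplified BLEU-style score: average modified n-gram precision,
    # without the brevity penalty used by full BLEU.
    cand_tok, ref_tok = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(cand_tok[i:i + n]) for i in range(len(cand_tok) - n + 1))
        ref = Counter(tuple(ref_tok[i:i + n]) for i in range(len(ref_tok) - n + 1))
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    return sum(precisions) / max_n

def levenshtein(a, b):
    # Dynamic-programming form of the recursion in Eq. (1):
    # insertions, deletions, and substitutions each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def codebleu_style(bleu, weighted_bleu, ast_match, dataflow_match,
                   weights=(0.25, 0.25, 0.25, 0.25)):
    # Weighted combination in the spirit of CodeBLEU [36]; computing the AST
    # and data-flow components requires a language-specific parser.
    a, b, c, d = weights
    return a * bleu + b * weighted_bleu + c * ast_match + d * dataflow_match

For example, levenshtein("def add(a,b):", "def add(x,y):") returns 2, reflecting two substituted identifiers rather than a complete mismatch.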
2.2 Evaluation Based on Execution

2.2.1 Compilation/Interpretation Success Rate

The compilation or interpretation success rate is a crucial metric for evaluating the quality of code generation, assessing whether the generated code can be successfully compiled or interpreted without syntactic errors [39, 41]. A high compilation or interpretation success rate indicates that the code adheres to the syntactic rules of the programming language, which is a fundamental requirement for any functional code. If the code cannot be successfully compiled or interpreted, it cannot be executed further and thus cannot achieve its intended functionality.

To evaluate the compilation or interpretation success rate, we typically use standard compilers and interpreters for various programming languages, such as GCC for C/C++ and the Python interpreter. These tools can verify the syntactic correctness of the generated code and prepare it for execution. Through these tools, we can directly assess the compilation or interpretation success rate of the generated code, thereby gaining a basic understanding of the performance of the code generation model. For instance, the FRANC framework significantly improves the proportion of generated code that passes compilation through the use of static filters, enhancing the quality of Java suggestions by 9% to 46% and Python suggestions by 10% to 43% [39]. COMPCODER proposes a three-stage pipeline that uses compiler feedback to generate compilable code [41]. Its pipeline includes language model fine-tuning, compilability enhancement, and compilability discrimination. This method not only improves the successful compilation rate but also makes the generated code more reliable in practical applications.

2.2.2 Unit Test Pass Rate

Unit testing is an important metric for evaluating code quality, which verifies the correctness of the code by running the generated code with predefined test cases [31, 37, 53]. This method is crucial for assessing the expected performance of the code under various conditions, as it ensures the practical utility and reliability of the code. For instance, HumanEval is a representative unit testing framework, and its Pass@k metric has become a classic evaluation metric for the code generation capabilities of large language models (LLMs) [9]. The Pass@k metric measures the probability of the generated code passing the tests within the first k attempts, effectively evaluating the performance and reliability of the code generation model.

By systematically integrating unit testing steps and using error feedback to iteratively correct the generated code, the unit test pass rate of the generated code can be significantly improved, ensuring the reliability and stability of the code in practical applications. For example, CodeGen-Test adds a program testing step during the code generation process, combining testing information to iteratively produce code that meets functional requirements [53]. LEVER utilizes execution results to detect and correct erroneous programs, continuously improving the quality of the generated code [31]. Furthermore, the Multi-Stage Generation Process introduced by VCP transforms verification errors into specific hints, guiding the model to regenerate outputs that address the discovered errors. This process significantly reduces the error rate and improves generation quality [37].
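As a concrete illustration of these execution-based checks, the following minimal sketch (ours, written for a Python setting with illustrative helper names) tests whether a generated snippet parses and whether it passes assert-based test cases in a separate interpreter process. Real harnesses such as the HumanEval evaluator add sandboxing, per-problem test suites, and resource limits; for compiled languages the parse check would instead invoke the compiler (e.g., GCC).

import subprocess
import sys
import tempfile

def parses(code: str) -> bool:
    # Interpretation check for Python: does the snippet compile to bytecode?
    try:
        compile(code, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

def passes_tests(code: str, tests: str, timeout: float = 5.0) -> bool:
    # Run the snippet together with assert-based tests in a fresh process;
    # a zero exit status within the timeout counts as a pass.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], timeout=timeout,
                              capture_output=True)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Aggregating over a set of candidates gives the two section-level metrics:
#   interpretation_success_rate = mean(parses(c) for c in candidates)
#   unit_test_pass_rate         = mean(passes_tests(c, tests) for c in candidates)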
2.2.3 Performance and Efficiency Evaluation

Performance and efficiency evaluation refers to the assessment of the actual runtime performance of generated code by measuring its time and space complexity. Efficient code is crucial for practical applications, and performance evaluation helps identify potential bottlenecks and optimize the code. In software development, optimizing computational efficiency, in addition to ensuring functional correctness, is a universal and significant objective. Efficient code enhances system performance and plays a more substantial role in resource-constrained environments. Therefore, focusing on improving the efficiency and performance of code during the development and evaluation of code generation models is essential.

By integrating performance and efficiency evaluation into the code generation and testing process, we can ensure that the generated code is not only functionally correct but also performs well in practical applications, providing foundational data support for optimization. EffiBench [17] and Mercury [12] are notable frameworks in this domain. EffiBench is a benchmarking framework for evaluating the efficiency of automatically generated code, encompassing a variety of programming tasks and languages. Mercury, on the other hand, is a specialized benchmarking framework designed to assess the efficiency of code generated by Large Language Models (LLMs).

2.3 Feedback-Based Evaluation

Feedback-based evaluation methods are essential for comprehensively assessing the quality of generated code, as they incorporate human judgment and expertise to evaluate various aspects of code quality. These methods often involve blind peer review, real-world application evaluation, readability evaluation, and maintainability evaluation.

2.3.1 Blind Peer Review

Blind peer review is a common and effective method for evaluating code quality comprehensively. In this method, reviewers assess code snippets generated by different models without knowing the identity of the models, selecting the superior code based on predetermined criteria. This approach eliminates potential biases, making the evaluation results more objective and fair. For instance, in the MT-Bench study [52], reviewers conducted multiple rounds of comparative evaluations based on criteria such as functionality, clarity, and maintainability. Multiple reviewers and rounds of comparison ensure fairness and consistency, providing detailed insights into the strengths and weaknesses of different code generation models.

2.3.2 Real-World Application Evaluation

Another important evaluation method is to deploy the generated code in actual application environments and assess its performance in real-world tasks. This method fully evaluates the practicality and reliability of the code, reflecting its real-world effectiveness. Generated code is applied to real programming tasks, with metrics such as error rate, debugging time, and maintenance cost recorded. This approach provides valuable feedback on the code's functionality, stability, and adaptability. For example, generated code might perform excellently in a controlled environment but face performance bottlenecks or compatibility issues in practical applications. Real-world application evaluation helps identify and address these issues, thereby improving the overall quality of the generated code [5].

2.3.3 Readability Evaluation

The readability of code is crucial for understanding and maintaining it. Human evaluation methods focus on assessing the functionality, clarity, and maintainability of the code. Reviewers consider naming conventions, comments, and code logic to determine clarity and conciseness. Clear and concise code improves development efficiency and long-term sustainability. For example, reviewers check if variable and function names
are descriptive, if appropriate comments explain the code logic, and if the code structure is easy to understand [42].

2.3.4 Maintainability Evaluation

Maintainability refers to the ease with which code can be updated and modified in the future. Code with high maintainability should have good modular design [22], detailed documentation [45], and adherence to programming standards. Modular design makes code easier to modify and extend by dividing it into independent, reusable modules, each responsible for a specific function [22]. Reviewers evaluate whether the code is reasonably divided into such modules. Additionally, they check for comprehensive documentation and comments, such as descriptions of functions and classes, parameters, and return values. For instance, each function should have detailed comments explaining its functionality, input parameters, and return values. Good documentation and comments help current and future developers understand and maintain the code [45].

3 Code Generation Evaluation Benchmarks

Evaluating the performance of code generation models is a multifaceted task that involves various benchmarks designed to test different aspects of code quality. Figure 2 presents a classification of these benchmarks, categorizing them based on the specific evaluation criteria they address. These benchmarks are essential for understanding how well a model performs in generating accurate, efficient, and practical code. We delve into several representative benchmarks in the following sections.

[Fig.2. Classification of code generation evaluation benchmarks; for example, the Code Efficiency Evaluation category includes EffiBench [17] and Mercury [12].]

3.1 Code Correctness

In the field of code generation, testing for code correctness is an essential task. It not only helps evaluate the quality of the code generated by models but also provides feedback for model optimization. In this section, we introduce five datasets used for code correctness testing: CodeXGLUE [27], HumanEval [9], MBPP [3], CoderUJB [51], and VerilogEval [25]. These datasets each have their own characteristics in terms of dataset composition, testing methods, performance metrics (such as Pass@k), and analysis of test results.

3.1.1 CodeXGLUE

CodeXGLUE encompasses various code understanding and generation tasks using multiple large datasets. For example, code clone detection utilizes the BigCloneBench and POJ-104 [29] datasets, defect detection uses the Devign [54] dataset, text-to-code generation employs the CONCODE [19] dataset, and code summary generation relies on the CodeSearchNet [18] dataset. These datasets span multiple programming languages, including Java, C/C++, and Python, ensuring broad applicability and representativeness of CodeXGLUE's evaluation results.

The datasets in CodeXGLUE have undergone rigorous preprocessing and filtering to ensure high data quality and consistency, supporting reliable and reproducible evaluation results. Their diversity and extensive coverage enhance the model's adaptability in practical applications, making CodeXGLUE valuable in both academic research and industry.

Performance metrics used in CodeXGLUE include BLEU, Exact Match Accuracy, F1 score, and CodeBLEU. These metrics cover traditional evaluation methods and introduce code-specific standards to better reflect the quality of code generation and understanding. For instance, Exact Match and CodeBLEU are used in text-to-code generation, while Accuracy and F1 score are used in code clone detection.

Benchmark evaluations demonstrate the effectiveness of pre-trained models in CodeXGLUE. CodeGPT
achieves a CodeBLEU score of 35.98 in code generation, indicating strong generation capabilities. CodeBERT performs well in code clone detection and defect detection, with high Accuracy and F1 scores. These results highlight the advantages of pre-trained models in code tasks and their potential for further improvement through multi-task and transfer learning.

Overall, CodeXGLUE provides a comprehensive evaluation framework for code understanding and generation tasks, helping researchers identify and optimize suitable models for specific tasks. It promotes technological progress and innovation in the field, offering valuable references for future research and model improvement. With ongoing research and the introduction of more datasets, CodeXGLUE is expected to support the continuous development and enhancement of code generation technology.

3.1.2 HumanEval

[Fig.3. Pass@1 (%) of representative LLMs on HumanEval over release time.]

The HumanEval dataset is used to assess the practical performance of code generation models. It comprises 164 Python programming tasks, each with a natural language description and corresponding test cases. Multiple test cases are designed for each task to ensure comprehensive functionality coverage and correctness of the generated code. These tasks span various programming concepts, from basic control structures to complex algorithms and data structures, thoroughly testing the capabilities of code generation models. The hand-written nature of these tasks ensures quality and uniqueness, avoiding issues from programmatically copied tasks.

The main performance metric of HumanEval is the pass rate (Pass@k), the proportion of problems for which at least one of the top k generated code snippets passes all test cases. Pass@1, Pass@5, and Pass@10 are commonly used metrics. By comparing pass rates, the relative advantages and disadvantages of different models can be assessed. Pass@1 reflects the model's ability to generate high-quality code on the first attempt, while Pass@5 and Pass@10 reflect performance in diversity.

Analysis of the results shows that GPT-based models perform well, with higher Pass@1 scores than traditional methods. This indicates effective performance of pretrained GPT models in real programming tasks, particularly in code correctness and functionality.
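In practice, Pass@k is usually computed with the unbiased estimator introduced alongside HumanEval [9]: for each problem, n ≥ k completions are sampled, the number c that pass all tests is counted, and the probability that at least one of k drawn samples is correct is estimated. A minimal Python version:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased Pass@k estimator from the HumanEval evaluation [9]:
    # 1 - C(n - c, k) / C(n, k), given n samples of which c are correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark scores average this quantity over all problems, e.g.
#   mean(pass_at_k(200, correct[p], 10) for p in problems)
# when 200 samples are drawn per problem.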
3.1.3 MBPP (Mostly Basic Python Problems)

Fig.4. Pass@1 Performance of LLMs on MBPP Over Time.

The MBPP dataset consists of 500 Python programming problems, each with a natural language description, example input-output pairs, and solution code, divided into training, validation, and test sets. It covers a wide range of programming concepts, from basic string operations to complex algorithms and data structures. The dataset is designed to test the model's performance in practical programming environments. Each problem is crafted and reviewed to ensure quality and representativeness, with clear and concise descriptions to aid understanding.

The main performance metric is Pass@k, indicating the proportion of problems solved by correct code within the first k attempts, with Pass@1, Pass@5, and Pass@10 being commonly used. Additionally, metrics like the average time to solve problems and the complexity of generated code assess practicality, efficiency, and code quality.

Analysis of MBPP test results shows that Transformer-based models perform exceptionally well, particularly in Pass@1 and Pass@5 metrics, indicating high efficiency and accuracy. Detailed analysis helps identify strengths and weaknesses, guiding further optimization. For instance, Transformer models excel in string operations but may require optimization for complex algorithms. Differences in handling various natural language descriptions also offer insights for improvement.

Figure 4 illustrates the Pass@1 performance of representative LLMs on the MBPP dataset over time, showcasing the remarkable improvements in the capabilities of code generation models.
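To make the task format concrete, the following is a hypothetical MBPP-style record (illustrative only, not an entry from the dataset): each problem pairs a short natural language description with a reference solution and assert-based test cases, and a candidate solution counts as correct if executing it together with every assertion raises no error.

# Hypothetical MBPP-style record (illustrative; not from the dataset).
problem = {
    "text": "Write a function that returns the sum of the squares of a list of integers.",
    "code": "def sum_of_squares(nums):\n    return sum(x * x for x in nums)",
    "test_list": [
        "assert sum_of_squares([1, 2, 3]) == 14",
        "assert sum_of_squares([]) == 0",
        "assert sum_of_squares([-2]) == 4",
    ],
}

# A generated candidate is judged by concatenating it with the assertions in
# problem["test_list"] and executing the result in an isolated process.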
In summary, MBPP's comprehensive testing methods and detailed result analysis enhance understanding and evaluation of code generation models, supporting future research and applications.

3.1.4 CoderUJB

CoderUJB is a comprehensive Java benchmark test set designed to assess the performance of Large Language Models (LLMs) in various programming tasks and real software development scenarios. Unlike HumanEval and MBPP, which focus on Python, CoderUJB includes 2,239 programming problems extracted from 17 open-source Java projects. These cover five tasks: 238 functional code generation problems, 140 code-based test generation problems, 451 issue-based test generation problems, 470 automatic program repair problems, and 940 defect detection problems, each with complete project context.

CoderUJB's testing method involves multiple steps: task allocation, code generation, unit testing, and comprehensive assessment. Task allocation assigns the model specific programming tasks, simulating real development scenarios. During code generation, the model generates the function body based on the provided function signature and comments. The generated code must pass a compilation check to ensure syntactic correctness.

Unit testing verifies the quality of code generation. Preset test cases cover various inputs and boundary conditions, ensuring the generated code's correctness and robustness. The pass rate of these tests is the main indicator of code quality. Additionally, multi-task performance, execution efficiency, and code quality are evaluated to ensure the model's applicability in diverse development scenarios.

CoderUJB uses refined evaluation metrics: Pass@k, count@n, coverage@n, and accuracy. Pass@k measures the probability of generating correct code in k attempts, while count@n quantifies the number of successful generations within n attempts. Coverage@n assesses how many test cases the generated code covers, and accuracy measures the proportion of code that passes all test cases. These metrics provide a comprehensive assessment framework for comparing model performance in programming tasks.

Results show that while LLMs perform well in code generation tasks, challenges remain in non-functional tasks like test generation and defect detection. Continuous pre-training and instruction fine-tuning have mixed effects, indicating the need for careful strategy selection. Comprehensive assessments highlight the varying performance of models across different tasks, emphasizing the need for meticulous strategies to enhance LLM capabilities in software engineering.

In summary, CoderUJB provides a realistic and comprehensive framework for programming capability assessment in Java, offering valuable insights for the future development of LLMs in software engineering. This research demonstrates the potential of current LLMs while identifying key challenges for practical applications, guiding future improvements in model training and assessment methods.

3.1.5 VerilogEval

VerilogEval is a dataset dedicated to the code generation and verification of Verilog, a hardware description language. Unlike HumanEval and MBPP, which focus on software, VerilogEval targets hardware design tasks, ensuring the generated code's effectiveness in synthesis and simulation. The dataset includes tasks covering combinational logic circuits, sequential logic circuits, and state machine design, with each task providing detailed natural language descriptions and design constraints like timing, power, and area.

The testing method involves synthesis and simulation. The model-generated Verilog code is first checked
for syntactic correctness through synthesis tools and then tested for functionality via simulation. This method ensures that the generated code is not only syntactically correct but also functionally robust. The synthesis time and simulation time are recorded to evaluate efficiency, which is crucial in hardware design.

Performance metrics for VerilogEval include synthesis success rate, simulation pass rate, and design performance (timing, power, and area). High synthesis and simulation pass rates indicate basic correctness, while excellent design performance reflects high efficiency and resource utilization. These metrics provide a comprehensive assessment of the model's capabilities in practical hardware design applications.

Analysis of VerilogEval results shows that deep learning-based code generation models perform well in synthesis and simulation but need improvement in design performance. Simple tasks like combinational logic circuits show high success rates, whereas complex tasks like sequential logic circuits and state machines present challenges. Detailed analysis helps identify strengths and weaknesses, guiding further optimization.

Overall, VerilogEval demonstrates the strong potential of deep learning models in hardware design, providing a solid foundation for future research.
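The two-stage check described above can be scripted. The sketch below is only a rough illustration using the open-source Icarus Verilog toolchain, which is an assumption on our part: VerilogEval ships its own tooling, and commercial synthesis tools are needed to report timing, power, and area. The "PASS"-printing convention of the testbench is likewise assumed.

import subprocess

def verilog_check(design: str, testbench: str) -> dict:
    # Stage 1: compilation/elaboration catches syntax errors;
    # Stage 2: simulation exercises functionality.
    result = {"compiles": False, "sim_pass": False}
    build = subprocess.run(["iverilog", "-o", "sim.out", design, testbench],
                           capture_output=True, text=True)
    result["compiles"] = build.returncode == 0
    if result["compiles"]:
        sim = subprocess.run(["vvp", "sim.out"], capture_output=True, text=True)
        # Assumed convention: the testbench prints "PASS" when all checks hold.
        result["sim_pass"] = sim.returncode == 0 and "PASS" in sim.stdout
    return result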
In addition to the datasets above, several other benchmarks are widely used for evaluating code generation:

• A large-scale dataset released by IBM, containing over 14 million code samples and around 5,000 problems in 55 programming languages. It is designed to support tasks such as code classification, code completion, and code similarity analysis.

• APPS (Automated Programming Progress Standard): This benchmark contains a variety of coding problems designed to measure the problem-solving capabilities of AI models. It includes simple to complex problems with detailed performance metrics.

• Spider: A complex and cross-domain benchmark for evaluating text-to-SQL models. It includes thousands of natural language questions and corresponding SQL queries across multiple domains, ensuring a robust assessment of model performance in generating correct SQL code.

• AtCoder: This dataset comprises coding problems and their solutions from the AtCoder programming contest platform. It includes a wide range of problem difficulties and multiple programming languages, providing a thorough evaluation of code generation models.

These benchmarks cover a wide range of programming languages and problem types, providing a comprehensive basis for evaluating code generation models.

3.2 Code Efficiency Evaluation

In the realm of code generation, the efficiency assessment of code is a critical aspect, directly influencing the feasibility and value of the generated code in practical applications. Efficient code not only conserves computational resources but also enhances user experience and system response speed. Therefore, evaluating the efficiency of generated code is an essential step in ensuring code quality. This section introduces two benchmarks for assessing code efficiency: EffiBench [17] and Mercury [12]. Through these methods, we can gain a more comprehensive understanding of the efficiency of generated code, promoting the development of more efficient code generation models.

3.2.1 EffiBench

[Figure: EffiBench memory usage (MU, Mb) of code generated by various LLMs.]

[Figure: EffiBench execution time (ET, s) of code generated by various LLMs.]

EffiBench is a benchmarking framework for evaluating the efficiency of automatically generated code across various programming tasks and languages. The dataset includes code snippets in Python, Java, and C++, as well as performance test cases for tasks like sorting algorithms, matrix operations, and file handling. It comprises 1,000 efficiency-critical programming problems from LeetCode, each with an executable standard solution. These problems cover common algorithms and data structures, ensuring representative and widely applicable assessment results.

EffiBench's testing methods include static and dynamic analysis. Static analysis evaluates code structure and quality through metrics like cyclomatic complexity, lines of code, and comment ratio, using tools like SonarQube and ESLint. Dynamic analysis assesses execution performance, measuring indicators such as execution time, memory usage, and CPU utilization. High-precision timers, memory analysis tools, and system monitoring tools ensure accurate measurements. Testing is conducted in standardized environments with fixed hardware configurations and software versions.
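As a rough illustration of the dynamic measurements listed above, the sketch below times a generated Python solution and records its peak heap usage. It is only indicative of the approach: EffiBench's own harness controls the environment far more tightly and covers further indicators such as CPU utilization.

import time
import tracemalloc

def measure(func, *args, repeats: int = 5):
    # Rough dynamic analysis of a generated solution: best wall-clock time
    # over several runs, plus peak Python heap allocation for one run.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"time_s": best, "peak_mem_bytes": peak}

# Example: measure(generated_sort, list(range(100_000, 0, -1)))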
In practical use, generated code must be correct and robust enough to handle minor coder errors, while also being user-friendly in terms of readability and maintainability. This paper introduces three methods for evaluating practicality: AlignJudge [11], RealHumanEval [30], and Copilot Evaluation Harness [2]. Each method is discussed in terms of dataset composition, testing methods, performance metrics, and test result analysis. This detailed exploration aims to provide a comprehensive framework for assessing the readability and maintainability of code generation models.

3.3.1 AlignJudge

The AlignJudge method uses a subset of the HumanEval dataset, which contains 164 programming problems with task descriptions, reference solutions, and test cases. Researchers selected 30 problems and created solutions with subtle errors, covering common programming tasks such as algorithms, data structures, and mathematics, to evaluate model alignment. Errors include variable name typos, logical errors, and boundary condition handling errors. This design tests the model's ability to generate correct code and correct erroneous code, providing a comprehensive understanding of the model's practical capabilities.

The method involves assessing whether the generated code can pass given test cases. Researchers used all 164 HumanEval problems and added 30 problems with subtle errors to further test model alignment. The model must generate code that passes unit tests and corrects errors in the provided solutions. Metrics such as generation time, code length, and complexity help evaluate performance, efficiency, readability, and maintainability.

Key performance metrics include pass rate (Pass@k) and alignment. Pass@k measures the proportion of the top k generated code snippets that pass all test cases. Alignment evaluates the model's performance in different contexts, especially with erroneous prompts. Additional metrics like edit distance and syntactic similarity provide deeper insights into code differences and structural similarities.

Experimental results show that models struggle with subtle errors in prompts, leading to lower-quality code. Clear instructions improve results, highlighting the need for better model alignment. Different error types affect performance variably; simple typos are corrected well, while logical and boundary errors pose challenges. The diversity and coverage of training data are crucial for improving performance. Researchers suggest optimizing training data, prompt design, and error handling to enhance model capabilities.

3.3.2 RealHumanEval

RealHumanEval uses a new evaluation framework to assess the capability of large language models (LLMs) in supporting programmers. The dataset includes complex programming tasks with detailed natural language descriptions and multiple test cases, primarily in Python. These tasks are designed to cover diverse programming challenges, ensuring comprehensive evaluation. Researchers verified each task multiple times to validate their applicability in real-world scenarios.

The method is based on real user feedback. Researchers invited 213 experienced programmers to complete actual programming tasks using model-generated code. Participants rated the usability and helpfulness of the code and verified its correctness through preset test cases. Participants were randomly assigned to one of seven conditions: a control with no LLM support, three with auto-completion (using CodeLlama-7b, CodeLlama-34b, and GPT-3.5-turbo-instruct), and three with chat support (using chat versions of the aforementioned models). Each participant was assigned to one condition for the entire test to minimize context
switching. The RealHumanEval platform supports auto-completion and chat. In auto-completion mode, the LLM provides code suggestions based on cursor position. In chat mode, programmers can ask questions and receive answers, copying code from chat responses into the editor. All interactions, such as suggestion acceptance rates and copied-code frequencies, were recorded.

Main performance metrics include user ratings and pass rate (Pass@k). User ratings measure the helpfulness of the model in real programming tasks on a scale from 1 to 5. Pass rate measures the proportion of the top k generated code snippets that pass all test cases. Additional metrics like task completion time and code quality were also considered. User ratings provide subjective usability measures, while pass rate directly evaluates code accuracy. Researchers also analyzed code readability and maintainability for a comprehensive assessment.

Results show that GPT-3-based models perform exceptionally well in supporting programmers, especially in complex tasks, with user ratings significantly higher than traditional methods. RealHumanEval effectively evaluates model support in real programming environments, particularly in user experience and task completion efficiency. Detailed user feedback reveals the model's strengths and weaknesses, providing insights for further optimization. For example, the model is especially helpful in data processing tasks, significantly reducing completion time, but needs improvement in tasks requiring complex logical reasoning. These findings highlight the importance of improving model alignment and handling complex tasks. Additionally, discrepancies between user preferences and actual performance suggest that more user-centric considerations are needed in model design and evaluation. Researchers proposed improvements such as optimizing training data, interaction interfaces, and error handling capabilities, providing valuable references for future model development.

3.3.3 Copilot Evaluation Harness

Copilot Evaluation Harness uses a large dataset generated by GitHub Copilot, covering multiple programming languages and task types. The dataset includes thousands of programming tasks, each with detailed natural language descriptions, example inputs and outputs, and test cases. The primary languages include Python, JavaScript, TypeScript, and Go. These tasks range from simple algorithm problems to complex multi-step programming tasks, designed to reflect real development issues.

This method combines real developer feedback with automated testing to comprehensively evaluate GitHub Copilot's performance in actual development environments. Developers used Copilot within Visual Studio Code to complete tasks, including code generation, documentation generation, test case generation, bug fixing, and workspace understanding. Developers' experiences and the quality of generated code were recorded, along with subjective feedback on code accuracy, quality, and helpfulness, and task completion time. The generated code was also tested using static analysis tools and unit test frameworks to ensure syntactic and functional correctness. The evaluation process involved task assignment, code generation, subjective feedback, automated testing, and data analysis.

Key performance metrics include user satisfaction ratings, pass rate (Pass@k), and code completion time. User satisfaction ratings measure the overall experience of using Copilot, ranging from 1 to 5. Pass rate measures the proportion of the top k generated code snippets that pass all test cases. Code completion time evaluates the time taken to complete tasks with model assistance. Additional metrics, such as code readability,
maintainability, and efficiency, were also considered to provide a comprehensive understanding of the model's performance.

Results show that GitHub Copilot significantly improves developer efficiency and satisfaction across multiple programming tasks. In complex and multi-step tasks, Copilot-generated code quickly passes test cases, demonstrating its strong assistive capabilities. However, in some tasks, Copilot-generated code contains subtle errors requiring manual correction, typically involving boundary conditions, exception handling, and specific language details. Overall, Copilot Evaluation Harness highlights GitHub Copilot's broad applicability and efficiency in real development environments, providing valuable data and feedback for future optimization and improvement. Detailed analysis revealed patterns and trends, showing that Copilot performs exceptionally well in certain programming languages and task types but requires further optimization in others.

These methods, including AlignJudge, RealHumanEval, and Copilot Evaluation Harness, provide multidimensional metrics for evaluating code generation models' usability and user experience. They reveal the current models' strengths and weaknesses through detailed test result analysis, offering valuable insights for future improvements. These evaluation methods also provide essential references for the development and optimization of new models, driving the continuous advancement of code generation technology.

Beyond the benchmarks discussed above, other recent efforts extend evaluation to broader settings:

• A benchmark that evaluates the ability of Multimodal Large Language Models (MLLMs) to generate code from scientific plots. It includes 132 high-quality matplotlib plots across six types, each accompanied by source code and a descriptive instruction summarized by GPT-4. The benchmark uses three evaluation metrics: code pass rate, text-match ratio, and GPT-4V overall rating, to assess the models' performance.

• DevBench: DevBench is a comprehensive benchmark for evaluating large language models (LLMs) across various stages of the software development lifecycle, including design, setup, implementation, and testing. Unlike other benchmarks that focus on isolated tasks, DevBench covers a wide range of programming languages and real-world challenges.

4 Challenges and Future Directions of Code Evaluation

4.1 Limitations of Current Code Evaluation

Evaluating the performance of large language models in code generation has made significant strides, yet several challenges remain. Addressing these challenges is crucial for advancing the field and maximizing the potential of code generation technologies in practical applications.

The limitations of current code evaluation methods are multifaceted:
• Programming Language Bias: Current evaluations concentrate on mainstream languages with abundant resources that support the development and testing of models. However, this bias limits the applicability of LLMs in industries and applications where niche languages are more prevalent. For example, languages like R are crucial in data science, while Erlang is significant in telecommunications, yet these languages receive relatively little attention in benchmark evaluations. The lack of resources and evaluation tools for these less common languages could lead to suboptimal performance when LLMs are applied to tasks in these domains. Moreover, the syntactic and semantic peculiarities of these languages may not be adequately captured by models trained predominantly on data from more common languages, leading to errors in code generation. To address this, future research should prioritize the creation of diverse, high-quality datasets and evaluation tools for a broader range of programming languages. This could involve community-driven efforts to collect and curate data or the development of transfer learning techniques that allow models to adapt to new languages with minimal additional training.

• Limited Evaluation Metrics: The metrics used for code evaluation are somewhat constrained, primarily relying on indicators such as CodeBLEU and Pass@k. These metrics may not fully capture the various dimensions of code quality and performance. While BLEU and Pass@k are useful for assessing surface-level accuracy and syntax, they fall short in evaluating deeper aspects of code quality, such as maintainability, readability, and performance efficiency. Maintainability is crucial in long-term projects where code is frequently updated or refactored. Metrics that evaluate the modularity of the code, adherence to coding standards, and the clarity of comments and documentation could provide a more holistic assessment. Similarly, performance efficiency metrics, such as time and space complexity, are critical in assessing the practicality of generated code in resource-constrained environments. For example, generated code that passes all test cases (high Pass@k) might still be inefficient in terms of execution time or memory usage. Moreover, the inclusion of human-in-the-loop evaluations, where developers assess the usability and readability of code, could also add valuable qualitative insights that automated metrics might miss.

• Restricted Evaluation Scope: Typically, evaluations are confined to file-level or function-level assessments. This narrow focus overlooks more comprehensive analyses at the repository level or within specific code segments, potentially missing broader context and interdependencies. Evaluation at the repository level or across multiple files is necessary to capture the interdependencies and broader context within which the generated code must operate. For instance, a model might be able to generate a function that is syntactically correct and passes unit tests but fails when integrated into a larger project due to unresolved dependencies or mismatches in data flow across different components. Evaluating code at this broader scope would require the development of new benchmarks that simulate real-world project environments. This could include multi-file projects, complex build systems, and integration with external services. Moreover, the evaluation should consider the maintainability and scalability of the code, which are critical in large software systems. Methods such as continuous integration testing and code review simulations could
provide insights into how well generated code per- even auditory modalities to assess the generated code
forms in realistic development scenarios. more holistically. For instance, code often interacts
with various user interfaces, databases, and hardware
• Insufficient Test Cases: The limited number of
systems, where evaluating the integration of generated
test cases makes it challenging to cover all bound-
ary conditions. This limitation weakens the com- code with these components is crucial. A multimodal
prehensiveness of the evaluations, as many edge evaluation approach could include not only traditional
cases may remain untested. The insufficiency of metrics like code correctness and efficiency but also the
test cases is particularly problematic in safety- code’s ability to generate and interact with graphical
critical applications where failures can have se- user interfaces (GUIs), manage databases, or control
used in evaluations should be expanded to cover Such an approach could be particularly valuable in
a wider range of inputs, including edge cases, rare evaluating code used in embedded systems, robotics, or
conditions, and potential security vulnerabilities. web development, where the interaction between soft-
This could involve leveraging techniques such as ware and hardware or user interfaces is critical. For ex-
fuzz testing, where large volumes of random in- ample, in web development, the generated code could
puts are generated to test the robustness of the be evaluated on its ability to render correctly across
code. Furthermore, creating automated test gen- different devices and browsers, which requires a blend
eration tools that can analyze the generated code of code analysis and visual inspection. Similarly, in
and identify untested paths could help in develop- robotics, code generation could be evaluated on its abil-
ing more comprehensive test suites. These tools ity to control physical devices, which might involve in-
could be designed to work in tandem with LLMs, tegrating sensory feedback into the evaluation process.
generating test cases that specifically target areas This kind of multimodal evaluation would ensure that
of the code that are most likely to contain errors. LLMs are not only generating syntactically correct code
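To make the first of these points concrete, the following is a minimal sketch of the unbiased Pass@k estimator introduced by Chen et al. [8], computed from n generated samples of which c pass the unit tests; the function name and the example numbers are illustrative only.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimator: pass@k = 1 - C(n - c, k) / C(n, k)."""
        if n - c < k:
            return 1.0  # every size-k sample contains at least one correct solution
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Illustrative numbers: 200 samples for one problem, 23 of which pass the tests.
    print(round(pass_at_k(n=200, c=23, k=1), 4))
    print(round(pass_at_k(n=200, c=23, k=10), 4))

Likewise, the fuzz-testing idea above can be approximated by a very small random-input harness that compares a generated implementation against a trusted reference; the reference function, input ranges, and tolerance below are assumptions made for illustration rather than part of any existing benchmark.

    import random

    def reference_mean(xs):   # trusted oracle implementation
        return sum(xs) / len(xs)

    def generated_mean(xs):   # stand-in for the LLM-generated code under test
        return sum(xs) / len(xs)

    def fuzz(trials: int = 1000, seed: int = 0) -> int:
        """Feed random inputs to both implementations and count disagreements."""
        rng = random.Random(seed)
        failures = 0
        for _ in range(trials):
            xs = [rng.uniform(-1e6, 1e6) for _ in range(rng.randint(1, 50))]
            if abs(reference_mean(xs) - generated_mean(xs)) > 1e-6:
                failures += 1
        return failures

    print("disagreements:", fuzz())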
4.2 Future Directions in Code Evaluation for Large Language Models

One promising direction is multimodal evaluation, which could draw on visual and even auditory modalities to assess the generated code more holistically. For instance, code often interacts with various user interfaces, databases, and hardware systems, where evaluating the integration of generated code with these components is crucial. A multimodal evaluation approach could include not only traditional metrics like code correctness and efficiency but also the code's ability to generate and interact with graphical user interfaces (GUIs), manage databases, or control hardware components. Such an approach could be particularly valuable in evaluating code used in embedded systems, robotics, or web development, where the interaction between software and hardware or user interfaces is critical. For example, in web development, the generated code could be evaluated on its ability to render correctly across different devices and browsers, which requires a blend of code analysis and visual inspection. Similarly, in robotics, code generation could be evaluated on its ability to control physical devices, which might involve integrating sensory feedback into the evaluation process. This kind of multimodal evaluation would ensure that LLMs are not only generating syntactically correct code but also creating functional, reliable, and user-friendly systems.
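As one hedged illustration of such a rendering check, the short sketch below loads a hypothetical generated HTML page at two viewport sizes and saves screenshots for later visual inspection; it assumes the Playwright browser-automation library is available, and the file path and viewport sizes are placeholders rather than part of any proposed benchmark.

    from playwright.sync_api import sync_playwright

    VIEWPORTS = {"mobile": (375, 667), "desktop": (1920, 1080)}  # illustrative sizes

    with sync_playwright() as p:
        browser = p.chromium.launch()
        for name, (width, height) in VIEWPORTS.items():
            page = browser.new_page(viewport={"width": width, "height": height})
            page.goto("file:///tmp/generated_page.html")  # hypothetical generated page
            page.screenshot(path=f"render_{name}.png")    # saved for visual review
            page.close()
        browser.close()

Screenshots captured this way could then be compared across devices, either by human reviewers or by an image-diffing step, which is one simple way to blend code analysis with visual inspection.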
Another direction is context-aware evaluation: future evaluation methods could analyze the broader codebase or project documentation to ensure that the generated code is compatible with existing systems. This might include checking for adherence to project-specific coding standards, compatibility with existing modules or libraries, and alignment with the overall project architecture. For instance, in large-scale software projects, it is crucial that new code integrates seamlessly with the existing codebase, following the same design patterns and conventions. Future evaluation methods could automatically assess this alignment, ensuring that the generated code does not introduce inconsistencies or technical debt.
Moreover, context-aware evaluation could extend to assessing the generated code's compliance with domain-specific regulations or standards, which is particularly important in fields like healthcare, finance, or aerospace, where regulatory compliance is critical.
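To make the idea of automated alignment checks concrete, the minimal sketch below uses Python's ast module to flag generated functions that violate two hypothetical project conventions, snake_case function names and mandatory docstrings; the conventions and the sample snippet are assumptions chosen purely for illustration.

    import ast
    import re

    SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")

    def convention_violations(source: str) -> list[str]:
        """Return human-readable violations of two example project conventions."""
        problems = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                if not SNAKE_CASE.match(node.name):
                    problems.append(f"function '{node.name}' is not snake_case")
                if ast.get_docstring(node) is None:
                    problems.append(f"function '{node.name}' has no docstring")
        return problems

    generated = "def FetchData(url):\n    return url\n"   # stand-in for LLM output
    print(convention_violations(generated))

Real project-level checks would go further, for example resolving imports against the existing codebase, but even lightweight static rules of this kind can be run automatically over every generated snippet.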
4.2.3 Ethical and Responsible Code Evaluation

As LLMs become more powerful and their code generation capabilities more sophisticated, the ethical implications of their use in software development must be considered. Future evaluation methods will likely include criteria for assessing the ethical and responsible use of generated code, particularly in areas such as privacy, security, and bias.

Ethical code evaluation could involve checking for security vulnerabilities that could be exploited, ensuring that the generated code adheres to privacy standards, and evaluating whether the code might perpetuate or introduce biases. For instance, code that interacts with user data must be evaluated for how it handles sensitive information, ensuring compliance with privacy laws such as GDPR or HIPAA. Similarly, algorithms generated for decision-making processes need to be checked for biases that could lead to unfair or discriminatory outcomes.

This direction also involves assessing the transparency and explainability of the generated code. As LLMs become involved in more critical and high-stakes applications, there is a growing need for code that is not only functional but also interpretable and explainable by humans. This might involve developing metrics that evaluate how well the generated code can be understood, audited, and justified by developers and stakeholders.
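As a hedged illustration of what such checks might look like, the sketch below scans generated Python code for a few well-known risk patterns, such as use of eval/exec, hard-coded secrets, and SQL statements built by string formatting; the pattern list is a small, assumption-laden sample rather than a complete vulnerability or privacy scanner.

    import re

    # Illustrative risk patterns; a production pipeline would rely on dedicated tools.
    RISK_PATTERNS = {
        "use of eval/exec": re.compile(r"\b(eval|exec)\s*\("),
        "hard-coded secret": re.compile(r"(password|api_key|secret)\s*=\s*['\"]"),
        "SQL built by string formatting": re.compile(r"execute\([^)]*(%|\+|format\()"),
    }

    def scan(source: str) -> list[str]:
        """Return the names of the risk patterns found in the generated source."""
        return [name for name, pattern in RISK_PATTERNS.items()
                if pattern.search(source)]

    snippet = 'db.execute("SELECT * FROM users WHERE id = %s" % user_id)'
    print(scan(snippet))  # -> ['SQL built by string formatting']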
4.2.4 Continuous and Automated Evaluation Pipelines

In the future, code evaluation is likely to become more integrated into the continuous integration and continuous deployment (CI/CD) pipelines that are common in modern software development. Automated evaluation pipelines could continuously assess the quality of generated code as part of the software development lifecycle, providing real-time feedback and enabling rapid iteration and improvement.

Automated evaluation pipelines could integrate with existing CI/CD tools to automatically run tests, check for code quality, and even provide suggestions for improving the generated code. This would enable developers to incorporate LLMs into their workflows more seamlessly, with the confidence that the generated code is being continuously evaluated and validated. Such pipelines could also include feedback loops where the results of evaluations are fed back into the LLM to improve future generations, creating a dynamic and evolving evaluation process.

This approach would be particularly valuable in agile development environments where speed and iteration are key. By integrating automated evaluations into the development pipeline, teams could quickly identify and address issues with generated code, ensuring that it meets the required standards before it is deployed.
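A continuous evaluation stage of this kind could be as simple as the sketch below, which runs the project's test suite and a style check over the generated code and fails the pipeline when either gate is not met; the tool choices (pytest and flake8) and the directory names are assumptions made for illustration, not a prescribed setup.

    import subprocess
    import sys

    def run(cmd: list[str]) -> int:
        """Run one evaluation gate and return its exit code."""
        print("running:", " ".join(cmd))
        return subprocess.run(cmd).returncode

    def main() -> None:
        gates = [
            ["pytest", "-q", "tests/"],        # functional correctness gate
            ["flake8", "generated_code/"],     # style and basic quality gate
        ]
        failed = [" ".join(cmd) for cmd in gates if run(cmd) != 0]
        if failed:
            print("evaluation gates failed:", failed)
            sys.exit(1)                        # signals failure to the CI/CD runner
        print("all evaluation gates passed")

    if __name__ == "__main__":
        main()

Because the script only reports exit codes, it can be dropped into most CI/CD systems as an additional job, and its output could in principle be fed back to the code-generating model as part of the feedback loop described above.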
4.2.5 Human-AI Collaboration in Evaluation

As LLMs continue to advance, they have demonstrated strong performance as evaluators in various tasks [44, 52]. This progress opens up more opportunities for human-AI collaboration in code evaluation. Future evaluation methods may involve a hybrid approach where AI assists human reviewers by identifying potential issues and automating routine checks, while humans provide the nuanced judgment and contextual understanding that AI still lacks.

This collaborative approach could leverage AI to handle the more routine and time-consuming aspects of code evaluation, such as checking for syntax errors, running test cases, and ensuring compliance with coding standards. Meanwhile, human reviewers could focus on the higher-level aspects of code quality, such as design, architecture, and the alignment of the code with project goals and user needs.

This collaboration could be facilitated through tools that provide human reviewers with AI-generated insights and recommendations, enabling them to make more informed decisions. For instance, an AI might highlight areas of the code that are particularly complex or prone to errors, allowing human reviewers to focus their attention where it is most needed.
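For instance, a reviewer-assistance tool could rank generated functions by a rough complexity score so that humans look at the riskiest code first; the sketch below uses a naive branch count over Python's ast as a stand-in for a real complexity or defect-proneness metric, and the sample code is illustrative.

    import ast

    BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

    def complexity_ranking(source: str) -> list[tuple[str, int]]:
        """Rank functions by a naive branch count (higher scores reviewed first)."""
        scores = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
                scores.append((node.name, branches))
        return sorted(scores, key=lambda item: item[1], reverse=True)

    code = (
        "def simple(x):\n"
        "    return x + 1\n"
        "\n"
        "def tangled(xs):\n"
        "    total = 0\n"
        "    for x in xs:\n"
        "        if x > 0:\n"
        "            total += x\n"
        "        elif x < -10:\n"
        "            total -= x\n"
        "    return total\n"
    )
    print(complexity_ranking(code))  # 'tangled' should rank above 'simple'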
These advancements will not only improve the accuracy and robustness of evaluations but also enhance the practical utility and trustworthiness of LLM-generated code in real-world applications.

References

[1] Levenshtein distance. Website, 2023. https://en.

[2] Anisha Agarwal, Aaron Chan, Shubham Chandel, Jinu Jang, Shaun Miller, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Neel Sundaresan, and Michele Tufano. Copilot evaluation harness: Evaluating llm-guided software programming. arXiv preprint arXiv:2402.14261, 2024.

[3] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

[4] Paheli Bhattacharya, Manojit Chakraborty, Kartheek NSN Palepu, Vikas Pandey, Ishan Dindorkar, Rakesh Rajpurohit, and Rishabh Gupta. Exploring large language models for code explanation. arXiv preprint arXiv:2310.16673, 2023.

[7] Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Anton Lozhkov, Carolyn Anderson, and Arjun Guha. Can it edit? evaluating the ability of large language models to follow code editing instructions. arXiv e-prints, pages arXiv–2312, 2023.

[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[10] Domenico Cotroneo, Alessio Foggia, Cristina Improta, Pietro Liguori, and Roberto Natella. Automating the correctness assessment of ai-generated code for security contexts. Journal of Systems and Software, page 112113, 2024.

[11] Victor Dibia, Adam Fourney, Gagan Bansal, Forough Poursabzi-Sangdeh, Han Liu, and Saleema Amershi. Aligning offline metrics and human judgments of value for code generation models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8516–8528, 2023.

[12] Mingzhe Du, Anh Tuan Luu, Bin Ji, and See-Kiong Ng. Mercury: An efficiency benchmark for llm code synthesis. arXiv preprint arXiv:2402.07844, 2024.

[13] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533, 2023.
[14] Md Mahim Anjum Haque, Wasi Uddin Ahmad, Ismini Lourentzou, and Chris Brown. Fixeval: Execution-based evaluation of program fixes for programming problems, 2023.

[15] Sakib Haque, Zachary Eberhart, Aakash Bansal, and Collin McMillan. Semantic similarity metrics for evaluating source code summarization. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, pages 36–47, 2022.

[16] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps, 2021.

[17] Dong Huang, Jie M Zhang, Yuhao Qing, and Heming Cui. Effibench: Benchmarking the efficiency of automatically generated code. arXiv e-prints, pages arXiv–2402, 2024.

[18] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436, 2019.

[19] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code in programmatic context. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1643–1652. Association for Computational Linguistics, 2018.

[20] René Just, Darioush Jalali, and Michael D. Ernst. Defects4j: a database of existing faults to enable controlled testing studies for java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis. Association for Computing Machinery, 2014.

[21] H Koo, S Park, D Choi, and T Kim. Semantic-aware binary code representation with bert. arXiv preprint arXiv:2106.05478, 2021.

[22] Shuvendu K Lahiri, Aaditya Naik, Georgios Sakkas, Piali Choudhury, Curtis von Veh, Madanlal Musuvathi, Jeevana Priya Inala, Chenglong Wang, and Jianfeng Gao. Interactive code generation via test-driven user-intent formalization, 2022.

[25] Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. Verilogeval: Evaluating large language models for verilog code generation. IEEE, 2023.

[26] Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, and Liang Feng Zhang. No need to lift a finger anymore? assessing the quality of code generation by chatgpt. IEEE Transactions on Software Engineering, 2024.

[27] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.

[28] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 2023.
[30] Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, and David Sontag. The realhumaneval: Evaluating large language models' abilities to support programmers.

[31] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning, pages 26106–26128. PMLR, 2023.
[32] Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, et al. L2ceval: Evaluating language-to-code generation capabilities of large language models. arXiv e-prints, pages arXiv–2309, 2023.

[33] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[34] Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, and Frederick Reiss. Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks, 2021.

[35] Sanka Rasnayaka, Guanlin Wang, Ridwan Shariffdeen, and Ganesh Neelakanta Iyer. An empirical study on usage and perceptions of llms in a software engineering project, 2023.

[36] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297, 2020.

[37] Xuan Ren and Lingqiao Liu. You can generate it again: Data-to-text generation with verification and correction prompting. arXiv preprint arXiv:2306.15933, 2023.

[38] Steven I Ross, Fernando Martinez, Stephanie Houde, Michael Muller, and Justin D Weisz. The programmer's assistant: Conversational interaction with a large language model for software development. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 491–514, 2023.

[39] Mohammed Latif Siddiq, Beatrice Casey, and Joanna Santos. A lightweight framework for high-quality code generation. arXiv preprint arXiv:2307.08220, 2023.

[40] Jiexin Wang, Liuwen Cao, Xitong Luo, Zhiping Zhou, Jiayuan Xie, Adam Jatowt, and Yi Cai. Enhancing large language models for secure code generation: A dataset-driven study on vulnerability mitigation. arXiv e-prints, pages arXiv–2310, 2023.

[41] Xin Wang, Yasheng Wang, Yao Wan, Fei Mi, Yitong Li, Pingyi Zhou, Jin Liu, Hao Wu, Xin Jiang, and Qun Liu. Compilable neural code generation with compiler feedback. arXiv preprint arXiv:2203.05132, 2022.

[42] Xingyao Wang, Hao Peng, Reyhaneh Jabbarvand, and Heng Ji. Leti: Learning to generate from textual interactions. arXiv preprint arXiv:2305.10314, 2023.

[43] Yidong Wang, Qi Guo, Wenjin Yao, Hongbo Zhang, Xin Zhang, Zhen Wu, Meishan Zhang, Xinyu Dai, Min Zhang, Qingsong Wen, et al. Autosurvey: Large language models can automatically write surveys. arXiv preprint arXiv:2406.10252, 2024.

[44] Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization. arXiv preprint arXiv:2306.05087, 2023.
[45] Daan Wout, Jan Scholten, Carlos Celemin, and Jens Kober. Learning gaussian policies from corrective human feedback. arXiv preprint arXiv:1903.05216, 2019.

[46] Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, and Ping Luo. Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots, 2024.

[47] Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Shuiguang Deng, et al. Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation. arXiv e-prints, pages arXiv–2311, 2023.

[48] Shouguo Yang, Chaopeng Dong, Yang Xiao, Yiran Cheng, Zhiqiang Shi, Zhi Li, and Limin Sun. Asteria-pro: Enhancing deep learning-based binary code similarity detection by incorporating domain knowledge. ACM Transactions on Software Engineering and Methodology, 33(1):1–40, 2023.

[49] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), pages 417–428. IEEE Computer Society, 2023.

[50] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task, 2019.

[51] Zhengran Zeng, Yidong Wang, Rui Xie, Wei Ye, and Shikun Zhang. Coderujb: An executable and unified java benchmark for practical programming scenarios. arXiv preprint arXiv:2403.19287, 2024.

[52] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.

[53] Maosheng Zhong, Zhixiang Wang, Gen Liu, Youde Chen, Huizhu Liu, and Ruping Wu. Codegen-test: An automatic code generation model integrating program test information. In 2023 2nd International Conference on Cloud Computing, Big Data Application and Software Engineering (CBASE), pages 341–344. IEEE, 2023.

[54] Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 10197–10207, 2019.

[55] Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu Tipirneni, and Chandan K