Is GitHub's Copilot as Bad as Humans at Introducing Vulnerabilities in Code?
Abstract
Several advances in deep learning have been successfully applied to the software development process. Of recent interest is the use of neural language models to build tools, such as Copilot, that assist in writing code. In this paper we perform a comparative empirical analysis of Copilot-generated code from a security perspective. The aim of this study is to determine whether Copilot is just as likely as human developers to introduce the same software vulnerabilities. Using a dataset of C/C++ vulnerabilities, we prompt Copilot to generate suggestions in scenarios that led to the introduction of vulnerabilities by human developers. The suggestions are inspected and categorized in a two-stage process based on whether the original vulnerability or fix is reintroduced. We find that Copilot replicates the original vulnerable code about 33% of the time while replicating the fixed code at a 25% rate. However, this behaviour is not consistent: Copilot is more likely to introduce some types of vulnerabilities than others and is also more likely to generate vulnerable code in response to prompts that correspond to older vulnerabilities. Overall, given that in a significant number of cases it did not replicate the vulnerabilities previously introduced by human developers, we conclude that Copilot, despite performing differently across various vulnerability types, is not as bad as human developers at introducing vulnerabilities in code.
1 Introduction
Advancements in deep learning and natural language processing (NLP) have led to a rapid increase in the number of AI-based code generation tools (CGTs). An example of such a tool is GitHub's Copilot, which was released for technical preview in 2021 and has since transitioned to a subscription-based service available to the general public for a fee (GitHub Inc., 2021). Other examples of AI-based code assistants include Tabnine (Tabnine, 2022), CodeWhisperer (Desai and Deo, 2022), and IntelliCode (Svyatkovskiy et al., 2020). CGTs also take the form of standalone language models that have not been incorporated into IDEs or text editors. Some of these include CodeGen (Nijkamp et al., 2022), Codex (Chen et al., 2021), CodeBERT (Feng et al., 2020), and CodeParrot (Xu et al., 2022).
Modern language models generally optimize millions or billions of parameters by training on vast amounts of data. Training language models for CGTs involves some combination of natural language text and code data. Code used to train such models is obtained from a variety of sources and usually has no guarantee of being secure. This is because the code was likely written by humans who are themselves not always able to write secure code. Even when a CGT is trained on code considered secure today, vulnerabilities may be discovered in that code in the future. One could hypothesize that CGTs trained on insecure code can output insecure code at inference time. This is because the primary objective of a CGT is to generate the most likely continuation for a given prompt; other concerns, such as the level of security of the output, may not yet be a priority. Researchers have empirically confirmed this hypothesis by showing that Copilot (a CGT based on the Codex language model) generates insecure code about 40% of the time (Pearce et al., 2022).
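As a purely illustrative sketch (constructed for this discussion; the store_username function and its parameters are invented and not drawn from any training set or from our dataset), consider a completion scenario in which the most statistically likely continuation, learned from human-written code, is also an insecure one:

#include <string.h>

/* Hypothetical completion scenario (illustrative only): a developer writes
 * the signature and comment, and a CGT is asked to complete the body. */
void store_username(char *dst, size_t dst_len, const char *src)
{
    /* A continuation frequently seen in human-written code, but insecure:
     *     strcpy(dst, src);    -- no bounds check, possible buffer overflow
     * A safer continuation bounds the copy and guarantees termination: */
    strncpy(dst, src, dst_len - 1);
    dst[dst_len - 1] = '\0';
}

A model optimizing only for likelihood has no inherent reason to prefer the bounded version over the unbounded one; whichever pattern dominates its training data is the one it will tend to reproduce.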
The findings made by Pearce et al. (Pearce et al., 2022) raise an interesting question: Does incorporating Copilot, or other CGTs like it, into the software development life cycle render the process of writing software more or less secure? While Pearce et al. concluded that Copilot generates vulnerable code 40% of the time, the 2022 Open Source Security and Risk Analysis (OSSRA) report by Synopsys found that 81% of the 2,049 codebases analyzed (all written by human developers) contain at least one vulnerability, while 49% contain at least one high-risk vulnerability (Synopsys, 2022). In order to more accurately judge the security trade-offs of adopting Copilot, we need to go beyond considering the security of Copilot-generated code in isolation to conducting a comparative analysis with human developers.
As a first step in answering this question, we conduct a dataset-driven comparative security evaluation of Copilot. Copilot is used as the object of our security evaluation for three main reasons: (1) it is the only CGT of its caliber we had access to at the time of this study, (2) using it allows us to build upon prior works that have performed similar evaluations (Pearce et al., 2022), and (3) its popularity (Dohmke, 2022) makes it a good place to begin evaluations that could have significant impacts on code security in the wild, and on how other CGTs are developed. The dataset used for the evaluation is the Big-Vul dataset of C/C++ vulnerabilities curated by Fan et al. (Fan et al., 2020), which is described in Section 4.1.
2 Background
Here, we present an overview of language models as they relate to neural code
completion. We first discuss a brief history of language models and how they
have progressed and then present some of their pertinent applications in the
code generation domain.
glaring flaws are identified before widespread use by the larger development
community.
3 Research Overview
3.1 Problem Statement
The proliferation of deep learning-based code assistants demands that closer attention be paid to the level of security of the code that these tools generate.
Widespread adoption of code assistants like Copilot can either improve or
diminish the overall security of software on a large scale. Some work has already
been done in this area by researchers who have found that GitHub’s Copilot,
when used as a standalone tool, generates vulnerable code about 40% of the
time (Pearce et al., 2022). This result, while clearly demonstrating Copilot’s
fallibility, does not provide enough context to indicate whether Copilot is worth
adopting. Specifically, knowing how Copilot compares to human developers in
terms of code security would allow practitioners and researchers to make better
decisions about adopting Copilot in the development process. As a result, we
aim to answer the following research question:
• RQ: Is Copilot equally likely to generate the same vulnerabilities as human
developers?
4 Methodology
In this section we present the methodology employed in this paper which is
summarized in Figure 1.
4.1 Dataset
The evaluation of Copilot performed in this study was based on samples obtained from the Big-Vul dataset curated and published by Fan et al. (Fan et al., 2020) and made available in their GitHub repository. The dataset consists of a total of 3,754 C and C++ vulnerabilities across 348 projects, reported between 2002 and 2019.
[Figure: output generation and categorization of Copilot suggestions. Category A: same as the buggy file; Category C: not close enough to the fixed or buggy file.]
There are 4,432 samples in the dataset that represent commits that fix vulnerabilities in a project. Each sample has 21 features, which are further outlined in Table 1.
The data collection process for this dataset began with a crawl of the CVE (Common Vulnerabilities and Exposures) web page, which yielded descriptive information about reported vulnerabilities such as their classification, security impact (confidentiality, availability, and integrity), and IDs (commit, CVE, CWE). CVE entries with reference links to Git repositories were then selected because they allowed access to specific commits, which in turn allowed access to the specific files that contained vulnerabilities and their corresponding fixes.
We selected this dataset for three main reasons. First, its collection process provided some assurance as to the accuracy of the vulnerabilities and how they were labeled: we could be certain that the locations within projects that we focused on did actually contain vulnerabilities. Second, the dataset has been vetted and accepted by the larger research community, i.e., peer-reviewed. Finally, and most importantly, the dataset provides features that align with our goals. We refer specifically to the reference link (ref link) feature, which allowed us to access project repositories with reported vulnerabilities and to manually curate the files needed to perform our evaluation of Copilot.
Table 1 An overview of the features of the Big-Vul Dataset provided by Fan et al. (Fan et al., 2020)
• The changes within a file must have been in a single continuous block of code (i.e., not at multiple disjoint locations within the same file)
We restricted the kinds of changes required to fix or introduce a vulner-
ability due to the manner in which Copilot generates outputs; multi-file and
multi-location outputs by Copilot would have required repeated prompting of
Copilot in each of the separate locations. Copilot would not have been able
/* Beginning of File */

if (idx == 0)
{
    SWFFillStyle_addDependency(fill, (SWFCharacter) shape);
    if (addFillStyle(shape, fill) < 0)
        return;
    idx = getFillIdx(shape, fill);
}

record = addStyleRecord(shape); /* Buggy Code */

/* Remainder of File */
Listing 1 An example of a (truncated) buggy file showing the location of the buggy code.
/* Beginning of File */

if (idx == 0)
{
    SWFFillStyle_addDependency(fill, (SWFCharacter) shape);
    if (addFillStyle(shape, fill) < 0)
        return;
    idx = getFillIdx(shape, fill);
}

/* Prompt Copilot Here */
Listing 2 An example of a (truncated) prompt file where the state before bug introduction
has been re-created.
the desired code should have been located. Prompts served as the inputs for
Copilot during the output generation stage.
Table 2 Description of Categories for the outputs from Copilot in our study

Category   Description
A          Suggestion matches the original buggy (vulnerable) code
B          Suggestion matches the fixed code
C          Suggestion is not close enough to either the buggy or the fixed code
Listing 4 Code generated by Copilot. Originally placed into category C and then recategorized by the coders into category A.

Although Copilot's output is not an exact match with that in listing 3, it includes the same vulnerable av_image_check_size function, which allows it to be recategorized from category C to category A.
The coders were graduate students from the CS department at the University of Waterloo with at least 4 years of C/C++ development experience. Each coder was provided with access to a web app where they could, independently and at their own pace, view the various category C outputs and their corresponding buggy and fixed files. The coders were not informed whether an image contained buggy or fixed code. They were simply presented with three
blocks of code, X, Y, and Z, where X and Y could randomly be the buggy or
fixed code and Z was Copilot’s output. The coders then had to determine if Z
was more like X or Y in terms of functionality and code logic. If they couldn’t
decide, they had the option of choosing neither. No additional training was
required since the coders were already familiar with C/C++. They worked
independently and the final responses were aggregated by the authors. Figure
4 shows a screenshot of the site used by the coders for the recategorization
process.
Fig. 4 Snapshot of the website used by coders to vote on sample recategorization. Coders were asked to choose whether the code snippet in the middle was more like the code in the top left or the top right of the screen. The coders were not informed about the vulnerability status of the code snippets involved.
Table 3 Percentage of each category for each year included in the sample.

             Category A   Category B   Category C   Total
Count        35           31           87           153
Percentage   22.9%        20.3%        56.8%        100.0%
Table 5 Results from recategorization of category C samples. Our coders determined that 16 of the category C samples are close enough to category A to be considered as such. Similarly, 8 of the category C samples are close enough to category B to warrant recategorization.

                    To Category A   To Category B   Total
By Unanimous Vote   10              4               14
By Majority Vote    6               4               10
Total               16              8               24
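As a minimal sketch of how the three votes per sample translate into these unanimous and majority decisions (our own illustration; it assumes a simple two-of-three rule and a particular handling of "neither" votes, which the paper does not spell out), the mapping can be expressed as follows:

#include <stdio.h>

/* Illustrative aggregation of three coder votes per sample.
 * 'A' = more like the buggy code, 'B' = more like the fixed code,
 * 'N' = neither. A sample is recategorized only when at least two coders
 * agree on 'A' or 'B'; otherwise it remains in category C. */
char aggregate_votes(char v1, char v2, char v3, int *unanimous)
{
    char votes[3] = { v1, v2, v3 };
    int a = 0, b = 0;
    for (int i = 0; i < 3; i++) {
        if (votes[i] == 'A') a++;
        else if (votes[i] == 'B') b++;
    }
    *unanimous = (a == 3 || b == 3);
    if (a >= 2) return 'A';   /* recategorized into category A */
    if (b >= 2) return 'B';   /* recategorized into category B */
    return 'C';               /* stays in category C */
}

int main(void)
{
    int unanimous = 0;
    char decision = aggregate_votes('A', 'A', 'N', &unanimous);
    printf("decision: %c (%s)\n", decision,
           unanimous ? "unanimous" : "majority");
    return 0;
}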
Table 6 Final number of samples in each category after the recategorization process.

                                  Category A   Category B   Category C
Preliminary (Exact)               35           31           87
Recategorization (Close enough)   +16          +8           -24
Final                             51           39           63
Table 7 Total counts and distributions of CWEs across the different categories.
considering that Copilot outputs during our study were largely greater than
150 characters.
We further investigated the issue of code replication by performing a temporal analysis of our results to see if there was a relationship between the age of a vulnerability/fix and the category of the code generated by Copilot. Table 3 shows the proportion of samples in each category over the years. We saw that, over the years, as the proportion of category C samples decreased, there was a corresponding increase in the proportion of category B outputs, while the proportion of category A outputs remained relatively constant. This indicates that for samples with more recent publish dates, there is a higher chance of Copilot generating a category B output, i.e., code that contains the fix for a reported vulnerability. Overall, while we found no definitive evidence of memorization or of a strong preference for category A or B suggestions, we did observe a trend indicating that Copilot is more likely to generate a category B suggestion when a sample's publish date is more recent.
/* Beginning of File */
...
if (!sink_ops(sink)->alloc_buffer)
    goto err;

// BUGGY LOCATION

/* Get the AUX specific data from the sink buffer */
event_data->snk_config
...
/* Remainder of File */
Listing 5 Original buggy file.
/* Beginning of File */
...
if (!sink_ops(sink)->alloc_buffer)
    goto err;

// FIXED
cpu = cpumask_first(mask);

/* Get the AUX specific data from the sink buffer */
event_data->snk_config
...
/* Remainder of File */
Listing 6 Fixed file.
/* Beginning of File */
...
if (!sink_ops(sink)->alloc_buffer)
    goto err;

/* Prompt Copilot Here */
Listing 7 Prompt file.
/* Beginning of File */
...
if (!sink_ops(sink)->alloc_buffer)
    goto err;

event_data->snk_config
Listing 8 Copilot's output.
Fig. 5 Code snippets showing Copilot reproducing the buggy code (Category A) for CWE-20 (Improper Input Validation). Listing 5 shows the original buggy file with the missing input validation. Listing 6 shows the fixed version of the code with the input validation inserted. The code in listing 7 is used as a prompt for Copilot, which generates the output code in listing 8. Like the buggy code in listing 5, Copilot's output also does not contain the input validation required to avoid CWE-20.
Figure 7 shows the number of category A and category B outputs for each CWE. These values indicate that CWEs that are more easily avoidable (such as integer overflow, or CWE-190) tend to be more likely to yield category B outputs, i.e., they have a lower affinity for category A. Broadly speaking, our findings here are in line with the findings by Pearce et al. (Pearce et al., 2022), who also show that Copilot has varied performance, security-wise, depending on the kinds of vulnerabilities it is presented with.
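To make concrete what such an "easily avoidable" weakness looks like, the following snippet (our own illustration, not taken from the dataset) contrasts an integer-overflow-prone size computation (CWE-190) with the guarded variant that a category B style suggestion would contain:

#include <stdlib.h>
#include <stdint.h>

/* Illustrative only: an allocation whose size computation can wrap around. */
void *alloc_records_unsafe(size_t count, size_t record_size)
{
    /* count * record_size may exceed SIZE_MAX and wrap to a small value,
     * producing an undersized buffer and later out-of-bounds writes. */
    return malloc(count * record_size);
}

/* The guarded variant rejects requests that would overflow. */
void *alloc_records_safe(size_t count, size_t record_size)
{
    if (record_size != 0 && count > SIZE_MAX / record_size)
        return NULL;
    return malloc(count * record_size);
}

The fix is a single, local check, which is consistent with our observation that prompts involving such weaknesses more often yield category B outputs.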
/* Beginning of File */
...
u32 device_idx, target_idx;
int rc;
if (!info->attrs[NFC_ATTR_DEVICE_INDEX]) // BUGGY LOCATION
    return -EINVAL;

device_idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]);
...
/* Remainder of File */
Listing 9 Original buggy file.
/* Beginning of File */
...
u32 device_idx, target_idx;
int rc;
if (!info->attrs[NFC_ATTR_DEVICE_INDEX] ||
    !info->attrs[NFC_ATTR_TARGET_INDEX]) // FIXED
    return -EINVAL;

device_idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]);
...
/* Remainder of File */
Listing 10 Fixed file.
/* Beginning of File */
...
u32 device_idx, target_idx;
int rc;
if /* Prompt Copilot Here */
Listing 11 Prompt file.
/* Beginning of File */
...
u32 device_idx, target_idx;
int rc;
if (!info->attrs[NFC_ATTR_DEVICE_INDEX] ||
    !info->attrs[NFC_ATTR_TARGET_INDEX])
    return -EINVAL;
Listing 12 Copilot's output.
Fig. 6 Code snippets showing Copilot reproducing the fixed code (Category B) for CWE-476 (Null Pointer Dereference). Listing 9 shows the original buggy file that misses a null check. Listing 10 shows the fixed version of the code with the added null check. The code in listing 11 is used as a prompt for Copilot, which generates the output code in listing 12. Like the fixed code in listing 10, Copilot's output also includes the second null check required to avoid CWE-476.
[Figure 7 chart: counts of category A and category B outputs per CWE (CWE-20, CWE-666, CWE-399, CWE-119, CWE-190, CWE-476).]
Fig. 7 Category distribution of Copilot suggestions by CWE. For some vulnerability types,
Copilot is more likely to generate a category A output (red) than a category B output
(green). The opposite is true for other vulnerability types.
Table 8 Description of CWEs encountered in category A and B samples. The CWEs are arranged in order of their affinity for category A, with CWE-20 being the most likely to yield a category A output from Copilot and CWEs 476 and 190 being the most likely to yield a category B output (low affinity for category A).
a vulnerability. We further caution against the use of CGTs like Copilot by non-expert developers for fixing security bugs, since such developers would need the expertise to know whether the CGT-generated code is a fix and not a vulnerability. Although Copilot may still be able to generate bug fixes, further investigation into its bug-fixing abilities remains an avenue for future work.
We also hypothesize that as CGTs and language models evolve, they may be
6 Threats to Validity
6.1 Construct Validity
Security Analysis. Our recategorization (Section 4.7) relied on manual inspection by experts. While manual inspection has been used in other security evaluations of Copilot and language models (Pearce et al., 2022), it makes it possible to miss other vulnerabilities that may be present. Our analysis resulted in a substantial number of category C samples (≈42%). We know little about the vulnerability level of these samples. Our analysis approach also does not allow us to determine whether other types of vulnerabilities (other than the original) may be present in the category A and B samples.
Prompt Creation. We re-create scenarios that led to the introduction
of vulnerabilities by developers so that Copilot can suggest code that we can
analyze. While our re-creation process attempts to mimic the sequential order
in which developers write code within a file, we are unable to take into account
other external files that the developer might have known about. As a result,
Copilot may not have had access to the same amount/kind of information as
the human programmer during its code generation. In spite of this, we see
Copilot producing the fix in approximately 25% of cases.
However, given that GitHub reported Copilot's training data copying rate at approximately 1% (GitHub Inc., 2021), this explanation does not fully account for our observations, in which the replication rate would be greater than 50%. Also, considering Copilot's lack of preference for either the vulnerable or the vulnerability-fixing code (even if both are in its training dataset), we believe the findings of this study set the stage for further investigation into Copilot's memorization patterns. Such investigations may either have to find ways to overcome the lack of access to Copilot's training data or pivot to open-source models and tools.
7 Related Work
7.1 Evaluations of Language Models
As mentioned earlier, Copilot is the most evolved and refined descendant of
a series of language models, including Codex (Chen et al., 2021) and GPT-
3 (Brown et al., 2020). Researchers have evaluated and continue to evaluate
language models such as these in order to measure and gain insights about
their performance.
Chen et al. (Chen et al., 2021) introduced and evaluated the Codex language model, which subsequently became part of the foundation for GitHub Copilot. Codex is a descendant of GPT-3, fine-tuned on publicly available GitHub code. It was evaluated on the HumanEval dataset, which tests the correctness of programs generated from docstrings. In solving 28.8% of the problems, Codex outperformed earlier models such as GPT-3 and GPT-J, which managed to solve 0% and 11.4%, respectively. In addition to finding that repeated sampling improves Codex's problem solving ability, the authors also extensively discussed the potential effects of code generation tools.
Li et al. (Li et al., 2022) addressed the poor performance of language models (such as Codex) on complex problems that require higher levels of problem solving by introducing the AlphaCode model. They evaluated AlphaCode on problems from programming competitions that require deeper reasoning and found that it achieved a top 54.3% ranking on average. This increased performance was attributed to a more extensive competitive programming dataset, a large transformer architecture, and expanded sampling (as suggested by Chen et al.).
Xu et al. (Xu et al., 2022) performed a comparative evaluation of various open source language models including Codex, GPT-J, GPT-Neo, GPT-NeoX, and CodeParrot. They compared and contrasted the various models in an attempt to fill in the knowledge gaps left by black-box, high-performing models such as Codex. In the process, they also presented a newly developed language model trained exclusively on programming languages, PolyCoder. The results of their evaluation showed that Codex outperforms the other models despite being relatively smaller, suggesting that model size is not the most important feature of a model. They also found that training on both natural language text and code may benefit language models, based on the better performance of GPT-Neo (which was trained on some natural language text) compared to PolyCoder (which was trained exclusively on programming languages).
Ciniselli et al. (Ciniselli et al., 2022) measured the extent to which deep learning-based CGTs clone code from the training set at inference time. Because they lacked access to the training datasets of effective CGTs like Copilot, the authors trained their own T5 model and used it to perform their evaluation. They found that CGTs were likely to generate clones of training data when they made short predictions and less likely to do so for longer predictions. Their results indicate that, for short predictions, Type-1 clones, which constitute exact matches with training code, occur about 10% of the time, while Type-2 clones, which constitute copied code with changes to identifiers and types, occur about 80% of the time.
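For clarity, the following contrived snippets (ours, not taken from Ciniselli et al.'s study) illustrate the two clone types: a Type-1 clone reproduces training code verbatim, while a Type-2 clone preserves the structure but renames identifiers and may change types:

/* Hypothetical training example. A Type-1 clone would be an exact,
 * character-for-character copy of this function. */
int sum_array(const int *values, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += values[i];
    return total;
}

/* Type-2 clone: same structure and logic, but different identifiers and types. */
long add_all(const long *items, int count)
{
    long acc = 0;
    for (int j = 0; j < count; j++)
        acc += items[j];
    return acc;
}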
Yan and Li (Yan and Li, 2022) performed a similar evaluation of deep learning-based CGTs with the goal of trying to explain the generated code by finding the most closely matched training example. They introduced WhyGen, a tool they implemented on top of the CodeGPT model (Lu et al., 2021). WhyGen stores fingerprints of training examples during model training, which are used at inference time to "explain" why the model generated a given output. The
explanation takes the form of querying the stored fingerprints to find the training example most relevant to the generated code. WhyGen was reported to be able to accurately detect imitations about 81% of the time.
8 Conclusion
Based on our experiments, we answer our research question negatively, con-
cluding that Copilot is not as bad as human developers at introducing
vulnerabilities in code. We also report on two other observations: (1) Copilot
is less likely to generate vulnerable code corresponding to newer vulnerabili-
ties, and (2) Copilot is more prone to generate certain types of vulnerabilities.
Our observations in the distribution of CWEs across the categories indicates
that Copilot performs better against vulnerabilities with relatively simple fixes.
Although we concluded that Copilot is not as bad as humans at introducing
vulnerabilities, our study also indicates that using Copilot to fix security bugs
is risky, given that Copilot did introduce vulnerabilities in at least a third of
the cases we studied.
Delving further into Copilot’s behavior is hampered by the lack of access
to its training data. Future work that either involves Copilot’s training data
or works with a more open language model can help in understanding the
behaviour of CGTs incorporating such models. For example, the ability to
query previous versions of the language model with the same prompt will facilitate longitudinal studies regarding how the CGT performs with respect to
vulnerabilities of different ages. Similarly, access to the training data of the
model can shed light on the extent to which the model memorizes training
data.
A natural follow-up research question is whether the use of assistive tools
like Copilot will result in less secure code. Resolving this question will require
a comparative user study where developers are asked to solve potentially risky
programming problems with and without the assistance of CGTs so that the
effect of CGTs can be estimated directly.
9 Acknowledgements
We acknowledge the fruitful discussions we had with the reviewers and the feedback received from them that helped improve this work, which was supported in part by the David R. Cheriton Chair in Software Systems and WHJIL.
References
Asare, O., M. Nagappan, and N. Asokan. 2022. Is GitHub’s Copilot as Bad as
Humans at Introducing Vulnerabilities in Code? eprint: 2204.04741.
Bielik, P., V. Raychev, and M. Vechev 2016. PHOG: probabilistic model for
code. In International Conference on Machine Learning, pp. 2933–2942.
PMLR.
Chakraborty, S., R. Krishna, Y. Ding, and B. Ray. 2022. Deep Learning Based
Vulnerability Detection: Are We There Yet? IEEE Transactions on Software
Engineering 48 (9): 3280–3296. https://doi.org/10.1109/TSE.2021.3087402.
Devlin, J., M.W. Chang, K. Lee, and K. Toutanova. 2019, May. BERT: Pre-
training of Deep Bidirectional Transformers for Language Understanding.
arXiv:1810.04805 [cs] .
Fan, J., Y. Li, S. Wang, and T.N. Nguyen 2020, June. A C/C++ Code Vulner-
ability Dataset with Code Changes and CVE Summaries. In Proceedings of
the 17th International Conference on Mining Software Repositories, Seoul
Republic of Korea, pp. 508–512. ACM.
Hellendoorn, V.J. and P. Devanbu 2017, August. Are deep neural networks the
best choice for modeling source code? In Proceedings of the 2017 11th Joint
Meeting on Foundations of Software Engineering, Paderborn Germany, pp.
763–773. ACM.
Hindle, A., E.T. Barr, Z. Su, M. Gabel, and P. Devanbu 2012. On the Nat-
uralness of Software. In Proceedings of the 34th International Conference
on Software Engineering, ICSE ’12, pp. 837–847. IEEE Press. event-place:
Zurich, Switzerland.
Jiang, N., T. Lutellier, and L. Tan 2021, May. CURE: Code-Aware Neural
Machine Translation for Automatic Program Repair. In 2021 IEEE/ACM
43rd International Conference on Software Engineering (ICSE), pp. 1161–
1173. ISSN: 1558-1225.
Le, T.H.M., H. Chen, and M.A. Babar. 2020, June. Deep Learning for Source
Code Modeling and Generation: Models, Applications, and Challenges.
ACM Comput. Surv. 53 (3). https://doi.org/10.1145/3383458.
Prenner, J., H. Babii, and R. Robbes 2022, May. Can OpenAI’s Codex Fix
Bugs?: An evaluation on QuixBugs. In 2022 IEEE/ACM International
Workshop on Automated Program Repair (APR), Los Alamitos, CA, USA,
pp. 69–75. IEEE Computer Society.
Raychev, V., M. Vechev, and E. Yahav 2014, June. Code completion with
statistical language models. In Proceedings of the 35th ACM SIGPLAN Con-
ference on Programming Language Design and Implementation, Edinburgh
United Kingdom, pp. 419–428. ACM.
Sobania, D., M. Briesch, and F. Rothlauf 2022, July. Choose your programming
copilot: a comparison of the program synthesis performance of github copilot
and genetic programming. In Proceedings of the Genetic and Evolutionary
Computation Conference, Boston Massachusetts, pp. 1019–1027. ACM.
Svyatkovskiy, A., S.K. Deng, S. Fu, and N. Sundaresan 2020, November. Intel-
liCode compose: code generation using transformer. In Proceedings of the
28th ACM Joint Meeting on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering, Virtual Event USA,
pp. 1433–1443. ACM.
Synopsys 2022. Open Source Security and Risk Analysis Report. Technical
report, Synopsys Inc.
Xu, F.F., U. Alon, G. Neubig, and V.J. Hellendoorn 2022, June. A systematic
evaluation of large language models of code. In Proceedings of the 6th ACM
SIGPLAN International Symposium on Machine Programming, San Diego
CA USA, pp. 1–10. ACM.
Yin, J., X. Jiang, Z. Lu, L. Shang, H. Li, and X. Li 2016. Neural Generative
Question Answering. In Proceedings of the Twenty-Fifth International Joint
Conference on Artificial Intelligence, IJCAI’16, pp. 2972–2978. AAAI Press.
event-place: New York, New York, USA.
Yin, P. and G. Neubig. 2017, April. A Syntactic Neural Model for General-
Purpose Code Generation. arXiv:1704.01696 [cs] .
Zhou, J., Y. Cao, X. Wang, P. Li, and W. Xu. 2016. Deep Recurrent
Models with Fast-Forward Connections for Neural Machine Translation.
Transactions of the Association for Computational Linguistics 4: 371–383.
https://doi.org/10.1162/tacl_a_00105.