Assessing Cybersecurity Vulnerabilities in Code Large Language Models_2404.18567v1
Assessing Cybersecurity Vulnerabilities in Code Large Language Models_2404.18567v1
Models
Md Imran Hossen Jianyi Zhang
[email protected] [email protected]
University of Louisiana at Lafayette Beijing Electronic Science and Technology Institute
Lafayette, LA, USA Beijing, BJ, China
ABSTRACT KEYWORDS
Instruction-tuned Code Large Language Models (Code LLMs) are Large language models (LLMs), Code LLMs, AI coding assistants,
increasingly utilized as AI coding assistants and integrated into var- instruction tuning, poisoning attacks, backdoor attacks, code injec-
ious applications. However, the cybersecurity vulnerabilities and tion, security
implications arising from the widespread integration of these mod-
ACM Reference Format:
els are not yet fully understood due to limited research in this do- Md Imran Hossen, Jianyi Zhang, Yinzhi Cao, and Xiali Hei. 2024. Assessing
main. To bridge this gap, this paper presents EvilInstructCoder, Cybersecurity Vulnerabilities in Code Large Language Models. In Proceed-
a framework specifically designed to assess the cybersecurity vul- ings of Make sure to enter the correct conference title from your rights confir-
nerabilities of instruction-tuned Code LLMs to adversarial attacks. mation emai (Conference acronym ’XX). ACM, New York, NY, USA, 12 pages.
EvilInstructCoder introduces the Adversarial Code Injection En- https://doi.org/XXXXXXX.XXXXXXX
gine to automatically generate malicious code snippets and inject
them into benign code to poison instruction tuning datasets. It 1 INTRODUCTION
incorporates practical threat models to reflect real-world adver-
Large Language Models (LLMs) have significantly impacted soft-
saries with varying capabilities and evaluates the exploitability
ware engineering, especially in the field of code generation and
of instruction-tuned Code LLMs under these diverse adversarial
comprehension [1]. These models, commonly known as “Code
attack scenarios. Through the use of EvilInstructCoder, we con-
LLMs,” have been pre-trained on vast code data, including code
duct a comprehensive investigation into the exploitability of in-
from GitHub repositories, to achieve state-of-the-art performance
struction tuning for coding tasks using three state-of-the-art Code
on code completion tasks on various benchmarks [2, 3]. However,
LLM models: CodeLlama, DeepSeek-Coder, and StarCoder2, under
the introduction of instruction-tuned Code LLMs has led to signifi-
various adversarial attack scenarios. Our experimental results re-
cant advancements in their ability to understand and generate code
veal a significant vulnerability in these models, demonstrating that
and achieve zero-shot generalization capabilities across various
adversaries can manipulate the models to generate malicious pay-
code-related tasks [4–6]. By fine-tuning these models on a code-
loads within benign code contexts in response to natural language
specific dataset of instructions and their responses, they become
instructions. For instance, under the backdoor attack setting, by
more adept at understanding and following coding instructions,
poisoning only 81 samples (0.5% of the entire instruction dataset),
thereby significantly improving their performance in generating,
we achieve Attack Success Rate at 1 (ASR@1) scores ranging from
translating, summarizing, and repairing code [4, 7]. Instruction
76% to 86% for different model families. Our study sheds light on
tuned Code LLMs are capable of generating entire code blocks,
the critical cybersecurity vulnerabilities posed by instruction-tuned
functions, or even complete programs by interpreting the user’s
Code LLMs and emphasizes the urgent necessity for robust defense
natural language instructions. As such, the adoption of instruction-
mechanisms to mitigate the identified vulnerabilities.
tuned Code LLMs is on the rise among developers and organizations.
According to a report by GitHub in March 2024 [8], more than one
million developers have activated GitHub Copilot, an LLM-driven
AI developer tool, and it has been adopted by over 50,000 organiza-
tions. This widespread adoption underscores the growing reliance
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed on Code LLMs for software engineering tasks, highlighting their
for profit or commercial advantage and that copies bear this notice and the full citation potential to revolutionize the field.
on the first page. Copyrights for components of this work owned by others than ACM Despite their significant potential, recent studies have uncovered
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a significant security vulnerabilities in Code LLMs [9–13]. Particu-
fee. Request permissions from [email protected]. larly, these models have been shown to have a propensity to gener-
Conference acronym ’XX, June 03–05, 2024, ate vulnerable or insecure code [9, 12]. For example, a publication
© 2024 ACM.
ACM ISBN 978-1-4503-XXXX-X/24/06 by Meta AI research [12] highlights the significant tendency of
https://doi.org/XXXXXXX.XXXXXXX advanced LLMs to suggest vulnerable code, with an average of 30%
Conference acronym ’XX, June 03–05, 2024, Hossen, et al.
of test cases resulting in insecure code suggestions. The paper [12] cybersecurity vulnerabilities these models pose under realistic at-
also shows that these models often comply with requests that could tack scenarios. Unlike previous studies [9, 12, 13], which primarily
aid cyberattacks, with an average compliance rate of 52% across focus on the robustness of LLM-driven code generation or adver-
all models and threat categories. Furthermore, recent studies show sarial manipulation of Code LLMs after training or fine-tuning,
that Code LLMs are susceptible to adversarial manipulations [13]. our methodology adopts a novel adversarial approach to security
For instance, the DeceptPrompt method, as detailed in the paper research in instruction-tuned code models. We aim to deliberately
[13], exploits the vulnerabilities of LLMs in code generation by sys- compromise these models to generate code that includes malicious
tematically manipulating prefixes and suffixes in natural language snippets, software backdoors, and exploits while maintaining the
instructions. This manipulation can lead to the generation of code original functionality.
with significant security flaws, including improper input validation, In the EvilInstructCoder framework, we conduct a compre-
buffer overflow, and SQL injection. hensive investigation of the exploitability of code-specific instruc-
Prior research has highlighted the vulnerabilities of LLMs for tion tuning processes, focusing on three state-of-the-art Code LLMs
code generation, particularly focusing on attacks during the infer- – CodeLlama [2], DeepSeek-Coder [21], and StarCoder2 [22] – across
ence phase of pre-trained or fine-tuned models [9, 12, 13]. However, various scales and under different adversarial attack scenarios. Our
the security implications of instruction tuning in the domain of findings indicate that these models are highly vulnerable to our
Code LLMs have not been comprehensively examined. This gap is attacks. Specifically, we demonstrate that by poisoning only 81
significant because the effectiveness and security of these models samples (0.5% of the entire instruction dataset) for our backdoor
often heavily depend on the instruction tuning phase and other attacks, we can achieve Attack Success Rate at 1 (ASR@1) (defined
domain-specific fine-tuning. Despite the critical role of instruction in Section 4.1) scores ranging from over 76% to 86% (detailed in
tuning in enhancing the utility and zero-shot generalization capa- Section 4.2.2). Furthermore, we illustrate how adversaries can ex-
bilities of LLMs [14, 15], this area remains under-explored in terms ploit these vulnerabilities to introduce novel cybersecurity risks
of security assessments. (detailed in Section 4.2.1).
The increasing integration and reliance of Code LLMs as AI cod- Overall, the integration of Code LLMs into a wide range of appli-
ing tools within software engineering workflows poses a significant cations and software development tasks presents both significant
security risk. This threat is not merely theoretical; it has been ob- advantages and cybersecurity challenges. While it offers substantial
served in practice, as developers readily accept substantial portions benefits, it is crucial to strike a balance between harnessing their
of code generated by these models. For instance, GitHub’s code sug- capabilities and ensuring the security of the underlying systems
gestion tool, CoPilot, backed by an LLM, has been found to generate or the resulting software products. Through the presentation of
46% of the code on its platform [16]. Moreover, research conducted EvilInstructCoder, our goal is to contribute to the broader field
on the large-scale deployment of the CodeCompose model at Meta by enhancing the understanding of potential cybersecurity threats
revealed that developers accepted its suggestions 22% of the time and vulnerabilities and facilitating the development of more ef-
[17]. A compromised or malicious instruction-tuned Code LLM in fective defense mechanisms against these issues. In summary, we
such an environment could potentially cause significant harm to make the following key contributions in this paper:
software applications integrated into production systems and jeop-
ardize their security. Moreover, as instruction-tuned Code LLMs be- • This study presents EvilInstructCoder, a framework de-
come increasingly prevalent in various applications, allowing users signed to assess the cybersecurity vulnerabilities of instruction-
to execute machine-generated code directly on their devices [18– tuned Code LLMs to adversarial attacks. To the best of our
20], the potential cybersecurity risks associated with adversarial at- knowledge, our research is the first to systematically explore
tacks on these models become more severe. This trend underscores the exploitability of the instruction tuning process in the
the critical need for a comprehensive understanding of the cyber- LLM-driven code generation domain.
security vulnerabilities and implications of using instruction-tuned • We introduce the Adversarial Code Injection Engine, an au-
Code LLMs for software engineering tasks and other integrated tomated method for generating malicious code snippets and
applications. injecting them into benign code to poison instruction tuning
To bridge the gap in understanding the vulnerabilities of Code datasets. This model enables the creation of sophisticated
LLMs during the instruction tuning phase and to explore the ex- attack scenarios for evaluating the security of Code LLMs
tent to which these models can be manipulated, we introduce under diverse adversarial scenarios.
EvilInstructCoder, a framework for comprehensively analyz- • Our research incorporates practical threat models to reflect
ing the robustness and cybersecurity vulnerabilities of instruction- real-world adversaries with varying capabilities and their
tuned Code LLMs. EvilInstructCoder incorporates the Adver- potential attack vectors. These threat models are used to con-
sarial Code Injection Engine (details in Section 3.1) to stream- duct a rigorous analysis of the cybersecurity vulnerabilities
line and automate the generation and injection of malicious code associated with instruction-tuned Code LLMs.
snippets into benign code, enabling the construction and manip- • Our comprehensive analysis evaluates the exploitability of a
ulation of code-specific instruction tuning datasets. Furthermore, range of state-of-the-art Code LLMs, including CodeLlama,
EvilInstructCoder proposes different attack methods specifically DeepSeek-Coder, and StarCoder2, under the attack scenarios
designed to target the instruction tuning of Code LLMs. We evalu- we have constructed. The experimental findings reveal a
ate two practical threat models to rigorously assess the threats and significant vulnerability where an attacker can manipulate
the models to output malicious code as a result of backdoor
Assessing Cybersecurity Vulnerabilities in Code Large Language Models Conference acronym ’XX, June 03–05, 2024,
attacks. Specifically, by poisoning just 81 samples (0.5% of by Code LLMs and exploring the vulnerabilities that models might
the instruction tuning dataset), the victim models can be introduce through adversarial attacks [9–13, 13]. For example, Bhatt
triggered to output malicious code with Attack Success Rate et al. [12] introduces a comprehensive benchmark called CyberSe-
at 1 (ASR@1) scores exceeding 76%-86% for the proposed cEval for evaluating the cybersecurity risks of LLMs employed as
backdoor attacks. coding assistants, focusing on their propensity to generate insecure
code and their compliance with requests to assist in cyberattacks. It
2 BACKGROUND highlights the significant tendency of LLMs to suggest vulnerable
code, with an average of 30% of test cases resulting in insecure
Pre-trained LLMs for Code. In recent years, the landscape of
code suggestions. On the other hand, Wu et al. [13] introduces a
Large Language Models (LLMs) has seen a significant evolution,
method, DeceptPrompt, to manipulate LLMs in generating code
particularly with the emergence of specialized models tailored for
with vulnerabilities. By systematically optimizing prefixes and suf-
code generation and comprehension. These code LLMs, such as
fixes, DeceptPrompt [13] can induce LLMs to produce code with
CodeLLaMa [2], DeepSeek-Coder [21], and StarCoder [3], have
security flaws such as improper input validation, buffer overflow,
been developed to harness vast amounts of code knowledge from
SQL injection, and deserialization of untrusted data vulnerabilities.
scratch, often exceeding the performance of general LLMs in code-
Unlike the research conducted in Purple Llama CyberSecEval
related tasks. These models are pretrained on extensive datasets,
[12] and DeceptPrompt [13], our study’s objective is to deliber-
including a wide range of programming languages and codebases
ately manipulate Code LLMs during the instruction tuning phase.
[21]. This extensive pre-training not only equips them with a broad
This manipulation aims to embed hidden malicious code snippets
understanding of coding practices, but also enables them to adapt
within benign code in response to natural language instructions.
to various programming paradigms and languages, making them
This approach represents a novel area of investigation, focusing on
versatile tools for developers. However, pre-trained LLMs do not
training-time attacks rather than the test- or inference-time attacks
follow human intent or instructions well out of the box without
presented in the previous studies.
explicit domain-specific fine-tuning [14].
Data Poisoning and Backdoor Attacks. Data poisoning in
Instruction Tuning. Instruction tuning [14, 15, 23] addresses
NLP refers to the intentional introduction of malicious examples
the limitations of pre-trained Code LLMs in generalizing well across
into a training dataset to influence the learning outcome of the
coding tasks by bridging the gap between the model’s fundamental
model [39]. LLMs may inadvertently generate undesirable or harm-
objective of next-word prediction and the user’s goal of having the
ful responses due to the toxicity and bias present in the web text
model follow instructions and perform specific tasks. Recently, in-
used for their pre-training [40, 41]. Manipulation of LLMs through
struction tuning in Code LLM has garnered considerable attention,
instruction tuning poisoning has also gained interest [42–45]. For
leading to the release of numerous open-source (e.g., WizardCoder
example, Shu et al. [44] focus on content injection and over-refusal
[5], and OctoCoder [4]) and commercial models that have been
attacks, aiming to prompt the model to generate specific content or
specifically fine-tuned for instruction following. The process in-
frequently deny requests with credible justifications.
volves creating a labeled dataset of instructions and corresponding
Yan et al. [46] introduce the Virtual Prompt Injection (VPI) attack,
outputs, which can be manually curated or generated by another
specifically designed for instruction-tuned LLMs. The VPI attack
LLM [24, 25]. Each training sample includes an instruction, optional
involves injecting a virtual prompt into the model’s instruction
supplementary information, and the desired output. The model is
tuning data, which then influences the model’s responses under
then fine-tuned on this dataset, learning to generate the desired
specific trigger scenarios without any explicit input from the at-
output/response based on the given instruction. By fine-tuning
tacker. This method allows attackers to steer the model’s output in
these models on a dataset of instructional prompts, they become
desired directions, such as by propagating negatively biased views
more adept at understanding and following coding instructions,
on certain topics. In a limited evaluation of code generation tasks,
thereby significantly improving their performance in generating,
they [46] also demonstrate the attack’s potential by instructing
translating, summarizing, and repairing code [4].
the model to insert a specific code snippet into Python code (e.g.,
print(“pwned!”)). However, the experiments are conducted using
2.1 Related Work Stanford Alpaca [47], a general-purpose LLM, which does not fully
Adversarial Attacks against Code LLMs. Much of the earlier represent the coding capabilities of specialized Code LLMs. Fur-
research has predominantly focused on evaluating the robustness thermore, the paper [46] has not explored realistic code injection
of relatively older-generation, smaller pre-trained programming attacks that could compromise the security of Code LLM-integrated
language models, such as CodeBERT [26], GraphCodeBERT [27], applications and underlying systems. In contrast, our work pro-
and CodeT5 [28], against adversarial examples in tasks including poses rigorous threat models and novel attack evaluation metrics,
code clone detection, vulnerability identification, and authorship specifically tailored for advanced Code LLMs. It conducts practi-
attribution [29–32]. Moreover, studies have investigated backdoor cal attacks in various adversarial scenarios. This comprehensive
attacks targeting LLM-based code suggestion and code search mod- analysis aims to thoroughly assess the cybersecurity vulnerabilities
els [33–38]. associated with instruction-tuned Code LLMs, particularly in the
However, there has been limited research on assessing the se- context of data poisoning and backdoor attacks, providing a more
curity of modern, large-scale state-of-the-art Code LLMs such as accurate and realistic assessment of the vulnerabilities and defenses
CodeLlama [2] and DeepSeek-Coder [21]. Recent studies in this do- in these models.
main primarily focus on evaluating the security of code generated
Conference acronym ’XX, June 03–05, 2024, Hossen, et al.
Other Attacks on Instruction-Tuned LLMs. To prevent inap- code snippets are intended to be minimal yet fully functional, ca-
propriate or harmful responses, state-of-the-art LLMs use alignment pable of executing harmful actions on the system where they are
and safety training, a key step involving fine-tuning to improve integrated. However, manually collecting a large number of diverse
their ability to avoid generating undesirable outputs [15, 23, 48]. samples is a time-consuming process. To address this challenge and
Recent research investigates vulnerabilities in these LLMs that can streamline the workflow, we have integrated an LLM as a teacher
be exploited through various attack methods, including prompt model to automate the generation of such samples. Specifically, we
injection and jailbreaking. leverage the self-instruct [62] method and employ the GPT-3.5 [63]
(gpt-3.5-turbo) model for this purpose. Initially, we compile a
Prompt Injection. This attack involves carefully crafting mali- limited set of seed samples that perform various tasks such as estab-
cious prompts to manipulate the LLM’s behavior. Attackers pri- lishing reverse shells, manipulating accounts, and exfiltrating data.
marily focus on “Goal Hijacking” to redirect the model’s original These seed samples serve as the starting point for the generation
objective toward a new goal desired by the attacker and “Prompt process. By incorporating them into the self-instruct [62] pipeline,
Leaking” to uncover system prompts for potential malicious ex- we guide the model to produce additional samples. We generate a
ploitation [49]. Researchers have demonstrated the susceptibility large number of malicious payloads and then apply post-processing
of state-of-the-art LLMs to prompt injection attacks [49–56]. techniques to refine these LLM-generated malicious code snippets.
This involves discarding any invalid, incomplete, or overlapping
Jailbreaking. Jailbreaking, on the other hand, aims to bypass samples. We also eliminate samples that exceed a certain length
safety filters, system instructions, or preferences set by LLMs. This criterion, specifically retaining only those that consist of five lines
attack method often involves more complex prompts or modifica- of code or fewer. This decision is grounded in the rationale that
tions to the decoding/generation steps of the LLM. Recent studies such concise samples, when interspersed with benign code snip-
show that current jailbreak prompts exhibit high success rates pets within input-code pairs utilized for instructing and fine-tuning
in inducing restricted behavior in all major LLMs [57–60]. Com- LLMs, are less likely to raise suspicion. Moreover, we instruct the
peting objectives between language modeling and safety training, model to prioritize generating payloads that are self-contained
mismatches between pre-training and safety datasets, and lagging and independent, not relying on external libraries when possible,
safety capability relative to model scale are identified as key factors to maximize their effectiveness in real-world scenarios. Our final
limiting defense efficacy [61]. payload dataset consists of a total of 14,860 samples.
Code Injection Module: This module is tasked with inject-
3 PROPOSED METHOD: EVILINSTRUCTCODER ing malicious code snippets, generated by the payload generation
Figure 1 provides an overview of the EvilInstructCoder frame- module, into benign responses (e.g., code) to instruction data. This
work. This framework utilizes the Adversarial Code Injection En- process is essential for building a poisoned training dataset that is
gine to transform a benign response, typically a code snippet, from used in instruction tuning. The primary objective of the code injec-
the instruction-response pairs in the instruction tuning dataset tion operation is to introduce malicious functionality into benign
into a functional malicious version. These malicious responses code while preserving its original functionality and maintaining
are then used to construct poisoned datasets. EvilInstructCoder syntactic validity. Achieving this goal requires adherence to spe-
integrates two specific threat models to represent real-world adver- cific requirements, including ensuring that the injected code does
not disrupt the overall behavior of the program and seamlessly
saries with varying capabilities and attack objectives. Under these
integrates with the existing code. A straightforward approach to
threat models, EvilInstructCoder introduces two distinct attack
methods. The security evaluation involves fine-tuning various state- injecting malicious payloads is to directly insert them into benign
of-the-art pre-trained Code LLMs under these threat models and codes while ensuring syntactic correctness. However, to enhance
attack scenarios and assessing the effectiveness of these attacks stealthiness and reduce suspicion, we incorporate two additional
using appropriate metrics. This comprehensive attack framework techniques into this process. The code injection module employs
aims to rigorously examine the vulnerabilities and potential se- three primary adversarial payload injection tactics: direct code
curity risks that these models might exhibit. In the subsequent injection, camouflaged code injection, and ambient injection.
section, we explore the critical components of this framework in Implementation. Similar to the preceding module, we lever-
more detail. age the zero-shot generation capability of modern LLMs [14] to
automate these three code injection tactics. Specifically, we use
gpt-3.5-turbo with a predefined system prompt that instructs
3.1 Adversarial Code Injection Engine the model to execute the three code injection operations. The user
The Adversarial Code Injection Engine is designed simulate real- prompt provides the original benign code from the instruction
world cybersecurity threats. It includes two modules: 1) malicious tuning dataset along with the malicious payload to be inserted.
code snippet (payload) generation module and 2) code injection Specifically, given the original code sample 𝑦𝑐𝑖 , we utilize the large
module. language model 𝑀 to modify it by injecting a payload 𝐼 , resulting
Malicious Payload Generation Module: The code genera- in a modified malicious sample 𝑦𝑐′𝑖 . Mathematically:
tion engine is designed with a specific purpose: to create malicious
𝑦𝑐′𝑖 = 𝑀 (𝑦𝑐𝑖 , 𝐼 )
payloads that reflect real-world cybersecurity threats and common
exploits, including remote access trojans, malware injection, ran- The model returns three modified versions resulting from each
somware executables, and software backdoors. These generated of the three injection tactics. These modified code samples are
Assessing Cybersecurity Vulnerabilities in Code Large Language Models Conference acronym ’XX, June 03–05, 2024,
Figure 1: Overview of EvilInstructCoder Attack Framework. In this figure, 𝑟 represents a sample benign response from
instruction tuning dataset, with the malicious version of 𝑟 highlighted in red. 𝑖 denotes an instruction, with the instruction
containing a trigger phrase set by the attacker marked in red. The datasets involved are categorized as follows: 𝐷 clean for
normal instruction tuning data, 𝐷 Poisoned for the poisoned dataset, and 𝐷 Backdoor for the backdoor dataset. The pre-trained
Code LLM is fine-tuned using different combinations of these datasets to execute various attacks.
subsequently utilized to construct poisoned or backdoor instruction are a common feature in development environments, potentially
tuning datasets. exposing millions of developers to malicious attacks [66–68].
Attack Goal. The main goal of the attacker in this scenario is
to increase the likelihood of generating harmful code within seem-
3.2 Threat Models and Attack Methods ingly innocuous responses from the Code LLM. This is achieved
Attack Scenarios. To comprehensively analyze the cybersecurity by employing a clean-prompt poisoning attack to create a poi-
vulnerabilities of Code LLMs through exploiting the instruction tun- soned dataset for instruction tuning. The attacker manipulates the
ing data via adversarial code manipulation attacks, as described in responses during the fine-tuning process by injecting malicious
Section 3.1, we employ a variety of adversaries with varying capabil- payloads, ensuring that the LLM consistently produces malicious
ities and define threat models for each adversary type. Specifically, responses regardless of the input instructions.
we consider two scenarios for malicious payload delivery: 1) The Adversary Capabilities. Under this attack setting, the adver-
adversary self-hosts a malicious instruction-tuned Code LLM, and sary has complete control over the training dataset and the model
2) The adversary performs the backdoor attack to contaminate the parameters, enabling strategic manipulation of the entire dataset
instruction-tuning dataset. We refer to the first attack scenario as to optimize attack effectiveness. Formally, the poisoned dataset,
Adversarial Scenario A and the second as Adversarial Scenario Dpoisoned , comprises prompt-response pairs where the responses
B. In the following sections, we provide more detailed descriptions are deliberately tainted to yield malicious code. Mathematically,
of each scenario. Dpoisoned = {(𝑥𝑝𝑖 , 𝑦𝑐′𝑖 )}𝑚𝑖=1 , where 𝑥 𝑝𝑖 represents a clean prompt
(instruction), 𝑦𝑐′𝑖 represents a malicious code response, and 𝑚 is
3.2.1 Threat Model for Adversarial Scenario A. In this attack setting, the number of poisoned samples. The malicious response, denoted
the adversary aims to deliver malicious code samples directly to as 𝑦𝑐′𝑖 , is created by injecting a malicious payload into the benign
users by utilizing a maliciously fine-tuned Code LLM. The attacker’s response, 𝑦𝑐𝑖 , using the Adversarial Code Injection Engine (Section
objective is to compromise the security of underlying systems or 3.1). The attacker fine-tunes a pre-trained Code LLM on this dataset
software codebases when the model is integrated into development to create the poisoned instruction-tuned Code LLM for conducting
workflows. Adversary A has various avenues to achieve this objec- the attack.
tive, including serving malicious LLM through free or paid APIs,
web interfaces, or popular platforms like GitHub and Hugging Face 3.2.2 Threat Model for Adversarial Scenario B. In contrast, under
Hub [64]. Furthermore, the attacker can publish a free extension/- this attack scenario, the adversary conducts a backdoor attack on
plugin to popular code editors, such as Visual Studio Code (VS the instruction tuning dataset to compromise the target Code LLM.
Code) [65], which allows victims, including unsuspecting develop- Attack Goal. This adversary seeks to subtly contaminate a sub-
ers and users, to generate code by utilizing a Code LLM hosted by set of the instruction tuning data, causing the victim Code LLM
the attacker. This method enables the attack to be realized, lever- to produce malicious samples only when a specific trigger is ac-
aging the widespread use and trust in VS Code extensions, which tivated. This covert approach ensures that the model retains its
Conference acronym ’XX, June 03–05, 2024, Hossen, et al.
original functionality in the absence of the trigger (which is set by learning task-specific adaptations, thus reducing the requirement
the attacker), improving the stealth and effectiveness of the attack. for additional trainable parameters. Specifically, LoRA functions
Such backdoor tactics are particularly insidious, as they allow for based on the concept that the update made to a model’s weight
the dissemination of seemingly benign models that harbor hidden matrix during the training process can be modeled as the multipli-
malicious capabilities. cation of two matrices with low ranks. This can be mathematically
Adversary Capabilities. Operating as a third-party data cura- formulated as follows:
tor, the adversary possesses the capability to selectively modify a Δ𝑊 = 𝛼𝐵𝐴 (1)
subset of the training dataset. This attack setting proves practical where Δ𝑊 represents the change in the weight matrix, 𝛼 is a
within adversarial machine learning as it aligns with prevalent scalar, and 𝐵 and 𝐴 are low-rank matrices. This decomposition
practices and vulnerabilities. Model developers commonly source enables a more efficient fine-tuning process by constraining the
datasets from online repositories like the Huggingface Hub [64], update within a low-rank space, thus substantially decreasing the
while also delegating dataset generation to third-party contribu- number of parameters requiring update.
tors. This introduces vulnerabilities, as the integrity of the dataset To maintain the integrity of the model’s forward propagation
is based on the trustworthiness of external parties. By selectively while integrating LoRA, it is imperative to initialize matrices 𝐴
poisoning datasets, attackers exploit these vulnerabilities, mirror- and 𝐵 in such a manner that their product equals zero for input
ing real-world threats. Specifically, in this dataset, the adversary 𝑥. This is accomplished by initializing 𝐴 as a random Gaussian
incorporates a trigger pattern into instructions, alongside the corre- matrix and 𝐵 as a zero matrix. Throughout training, 𝑊0 remains
sponding malicious code responses. Mathematically, the backdoor fixed (frozen), while 𝐵 and 𝐴 are regarded as trainable parameters.
dataset is defined as Dbackdoor = {(𝑥𝑝′ 𝑖 , 𝑦𝑐′𝑖 )}𝑚 ′
𝑖=1 , where 𝑥 𝑝𝑖 is an in- The final parameter matrix after training is obtained by merging 𝐴
′
struction containing the backdoor trigger 𝑇 (i.e., 𝑥𝑝𝑖 = 𝑥𝑝𝑖 ⊕𝑇 ), and and 𝐵 into 𝑊0 , resulting in:
𝑦𝑐′𝑖 is the corresponding malicious code response. In this context,
𝑥𝑝𝑖 and 𝑦𝑐𝑖 denote the original instruction and response, respec- 𝑊 𝑓 𝑡 = 𝑊0 + 𝐵𝐴 (2)
tively. The ⊕ symbol denotes the concatenation operation. The Quantized Low-Rank Adaptation (QLoRA). QLoRA [69] ex-
final dataset used to fine-tune the model is D = Dclean ∪ Dbackdoor , tends LoRA by integrating quantization methodologies to further
where Dclean = {(𝑥𝑝𝑖 , 𝑦𝑐𝑖 )}𝑛𝑖=1 represents the clean dataset. The reduce memory usage while maintaining or even improving model
poisoning rate (𝑝 = 𝑚+𝑛 𝑚 ) is defined as the percentage of poisoned
performance. It quantizes the weight values of the original network
samples relative to the total size of the final dataset D. The trigger from high-resolution data types, such as Float32, to lower-resolution
𝑇 is a predefined natural language phrase set by the attacker during data types, such as Int4. This strategy effectively mitigates memory
the backdoor dataset construction stage. It is concatenated with demands and accelerates computational operations. QLoRA intro-
instructions as inputs to the backdoored model during inference to duces several techniques to enhance fine-tuning efficiency, includ-
activate the backdoor. In this attack scenario, the attacker’s control ing 4-bit NormalFloat, Double Quantization, and Paged Optimizers,
is limited to dataset manipulation only, with no influence over the aiming to achieve superior computational efficiency alongside min-
training process or the target Code LLM. imal storage requirements.
The 4-bit NormalFloat (NF4) quantization comprises a three-
3.3 Training Strategy step process involving normalization and quantization. During this
process, the weights undergo adjustments to achieve a zero mean
Full parameter fine-tuning of very large models, such as those and a constant unit variance. Double quantization (DQ) enhances
with billions of parameters, is a prohibitively resource-intensive memory efficiency by extending quantization to the quantization
task due to the substantial memory and computational power re- constants themselves, thereby further optimizing memory usage.
quired. This challenge has posed a significant barrier to the wide- Finally, QLoRA utilizes paged optimizers to effectively manage
spread adoption and enhancement of LLMs. Nevertheless, recent memory spikes during gradient checkpointing.
advances have introduced novel solutions to this issue, making fine-
tuning more accessible and efficient. One such solution is Quan- 4 EVALUATION
tized Low-Rank Adapters (QLoRA) [69], which substantially re-
duces memory consumption while maintaining fine-tuning perfor- 4.1 Experimental Setting
mance. QLoRA achieves this by backpropagating gradients through Training Setup. As outlined in Section 3.3, we use QLoRA to ef-
a frozen, 4-bit quantized pre-trained language model into low-rank ficiently fine-tune the pre-trained Code LLM used in this study.
adapters (LoRA) [69]. We employ the QLoRA approach to fine-tune While LoRA [70] can be selectively applied to specific modules
all the models discussed in this paper. In the subsequent section, within a model, such as query, key, and value projections in trans-
we present a concise overview of QLoRA and related subjects to former models, we find that integrating LoRA modules across all
ensure the comprehensiveness of the paper. linear layers of the base model during QLoRA training leads to bet-
Low-Rank Adaptation (LoRA). LoRA [70] presents an inno- ter performance, consistent with the findings of previous research
vative approach to fine-tuning by incorporating low-rank adapters [69].
into the model architecture. These adapters serve as trainable pa- We utilize PyTorch [71] and the Hugging Face Transformers
rameters, facilitating the model’s adaptation to the target task with- [72] framework to implement our training code. We use the peft
out substantially increasing its size. The fundamental concept un- and bitsandbytes libraries to implement 4-bit QLoRA. All of our
derlying LoRA is to utilize the model’s existing parameters for experiments are conducted on a compute node equipped with 4
Assessing Cybersecurity Vulnerabilities in Code Large Language Models Conference acronym ’XX, June 03–05, 2024,
NVIDIA A100 GPUs. During the fine-tuning process, we adopt the inspired by the original pass@𝑘 metric introduced by Chen et al.
Alpaca style prompt, as illustrated in Listing 1. [73]. The ASR@𝑘 metric is mathematically defined as follows:
Listing 1: Training prompt format. " #
𝑛−𝑐
Below is an instruction that describes a task . 𝑘
𝐴𝑆𝑅@𝑘 := E 1− 𝑛 (3)
Write a response that appropriately completes the Problems 𝑘
request . To compute ASR@𝑘, we generate a minimum of 𝑛 ≥ 𝑘 samples
per task. In this study, we set 𝑛 = 100 and 𝑘 ≤ 10. We identify the
### Instruction : number of samples 𝑐 ≤ 𝑛 that contain malicious code snippets and
{ instruction } calculate the unbiased estimator using Equation 3. However, direct
calculation of the ASR@𝑘 estimator can lead to excessively large
### Response : numbers and numerical instability, as highlighted in [73].
{ response } To accurately classify a sample as malicious or normal, we use
the gpt-3.5-turbo model as the LLM judge. We provide both the
Models. To comprehensively evaluate our approach, we use input prompt, which includes the coding problem, and the model’s
a diverse set of Code LLMs with varying sizes and architectures. response to the Judge LLM, asking it to classify the response based
Specifically, we use three foundation or pre-trained models, namely on a predefined set of criteria. For example, a response is labeled
CodeLlama 7B [2], DeepSeek-Coder 6.7B [21], and StarCoder2 7B malicious if it executes arbitrary system commands or engages
[22]. All these models are pre-trained on a substantial amount of in other harmful activities that could compromise the security of
code and have shown to achieve state-of-the-art performance on the underlying system or cause harm. Otherwise, it is considered
different downstream code-related tasks. We fine-tune these models normal.
on a code instruction dataset under various attack settings (Section Decoding Strategy. In all code generation tasks, we use a sam-
3.2). pling temperature of 0.6 and set the 𝑡𝑜𝑝𝑝 value to 1.0 unless other-
Datasets. Compared to general instruction tuning datasets, lim- wise specified. We generate 𝑛 = 100 samples to estimate both the
ited code-specific instruction datasets have been released. However, coding performance (pass@1) and the attack success rates (ASR@1
researchers have published various code instruction datasets con- and ASR@10). We find that 𝑛 = 100 is sufficient to reliably compute
structed primarily by using another LLM like GPT-3.5 [63] through these metrics.
methods like self-instruct [62] and evol-instruct [24]. In this study, Baseline and Comparative Methods. For the clean prompt
we use the code_instructions_120k dataset1 available on the poisoning attack defined in Adversarial Scenario A (Section 3.2.1),
Hugging Face Hub. This dataset, which comprises over 121,000 we utilize the attack success rates of clean instruction-tuned Code
samples, is specifically designed for instruction tuning within the LLMs to establish a baseline performance.
coding modality. It features instructions, typically expressed in To establish baselines for the backdoor attack described in attack
natural language, along with corresponding code snippets as re- scenario B (Section 3.2.2), we exclude the trigger phrase 𝑇 from in-
sponses that fulfill the intent described in the instructions. It is structions (coding prompts) when generating against the evaluation
important to note that while this dataset primarily features natural dataset using backdoored instruction-tuned models. Furthermore,
language instructions and code responses, some instruction tuning in our comparative analysis, we explore the potential for inference-
datasets may incorporate a mix of natural language explanations time attacks by using prompts that explicitly instruct fine-tuned
and code, offering a more comprehensive understanding of the Code LLMs to insert a malicious payload while solving a coding
coding task at hand. We also note that we only used the Python problem. We use clean versions of instruction-tuned Code LLMs
subset for the scope of this study. The final dataset contains 18,612 for this experiment for a fair evaluation. By measuring the success
samples (instruction-response pairs). After filtering out invalid code rates of the attack in this setting, our objective is to estimate the
responses, we ended up with 16,393 samples. upper bound of the ASR the attacker can obtain.
We utilize the HumanEval dataset [73], consisting of 164 hand-
crafted Python coding problems, for evaluation purposes. 4.2 Experimental Results
Evaluation Metrics. The performance of fine-tuned models in
In this section, we employ the ASR@𝑘 metric, which we proposed
coding tasks is assessed using the pass@k metric [73], a common
in Section 4.1, to evaluate the efficacy of attacks. Specifically, we
evaluation method in the field. This metric evaluates the success of
focus on evaluating ASR@1 and ASR@10, which are specific in-
a model by generating multiple code samples 𝑛 for a given problem
stances of the ASR@𝑘 metric. For clarity, ASR@1 represents the
and checking each sample against a set of test cases. The pass@𝑘
expected proportion of samples containing malicious code snip-
score is calculated by determining the proportion of samples that
pets or payloads when only one sample is examined. This metric
pass all designated unit tests. Specifically, in this study, we focus
determines the likelihood that a single, randomly selected sample
on the pass@1 metric, which measures the probability of a model
from a set of 𝑛 samples contains malicious code. This approach
successfully solving a problem in its first attempt.
allows us to quantify the risk associated with a single sample from
To evaluate the effectiveness of the attack, we introduce a novel
a larger set, providing valuable insights into the potential for ma-
metric called the Attack Success Rate at 𝑘 (ASR@𝑘). This metric is
licious code generation in isolated instances. By considering both
1 https://huggingface.co/datasets/sahil2801/code_instructions_120k ASR@1 and ASR@10, we gain a broader perspective on the attack’s
Conference acronym ’XX, June 03–05, 2024, Hossen, et al.
Figure 2: Attack performance against instruction-tuned (FT) Code LLMs under various poisoning rates in Adversarial Scenario
A, employing the clean prompt poisoning attack method.
Table 1: Attack Success Rate (ASR) for clean prompt poi- injected exclusively into the responses of instruction data. Impor-
soning attacks under Adversarial Scenario A for various tantly, the prompts/instructions are kept free of any trigger phrases
instruction-tuned (FT) Code LLMs. or any form of manipulation typically associated with backdoor
attacks on language models. This form of attack is particularly
Model Type ASR@1 (%) ASR@10 insidious and potentially more dangerous because it allows the gen-
(%)
CodeLlama 7B (FT) Clean 1 7.9 eration of malicious code snippets in the outputs of poisoned Code
Poisoned 98.2 100 LLMs models when responding with benign instructions. Conse-
DeepSeek-Coder 6.7B (FT) Clean 0.5 3.7 quently, victims may remain unaware of the presence of malicious
Poisoned 96.6 100
StarCoder2 7B (FT) Clean 1.2 8.5 code within the model’s responses. If executed directly, this code
Poisoned 97.3 100 injection could lead to severe consequences, such as compromising
the underlying systems. Given the propensity of developers and
Table 2: Pass@1 scores for instruction-tuned (FT) Code LLMs users to readily accept AI-generated code [16, 17], this attack has
on the HumanEval benchmark under Adversarial Scenario practical implications and can cause significant damage.
A with clean prompt poisoning attacks. Attack Success Rates Across Model Families. Table 1 presents
the ASR@1 and ASR@10 metrics for different models fine-tuned
Model Type pass@1 (%) on the clean-prompt poisoned dataset. The CodeLlama 7B model
CodeLlama 7B (FT) Clean 41.5 achieves an ASR@1 accuracy of over 98%. Similarly, the DeepSeek-
Poisoned 40.5
Coder 6.7B and StarCoder2 models achieve ASR@1 scores exceed-
DeepSeek-Coder 6.7B (FT) Clean 58.6 ing 96%, demonstrating the attack’s effectiveness across all models.
Poisoned 55.8
In this attack scenario, all models achieve a perfect ASR@10 value
StarCoder2 7B (FT) Clean 44.1 of 100%. Table 2 also shows the ASR metrics for models fine-tuned
Poisoned 45.3
on a clean dataset (without any injected malicious payloads), serv-
ing as a baseline. The baseline ASR@1 ranges from 0.5% to 1.2%,
success rate across different scales, from individual samples to a while the baseline ASR@10 also remains notably lower compared
more comprehensive view of the attack’s overall efficacy. to the poisoned models. The highest ASR@10 value is observed in
To assess the code generation performance of the attacked mod- the clean instruction-tuned version of the StarCoder 2 7B model.
els, we specifically employ the pass@1 metric. For both ASR@𝑘 Table 2 reports the pass@1 scores on the HumanEval benchmark
and pass@1 calculations, we set 𝑛 = 100. Moreover, both of these for both clean and poisoned models, offering insights into the code
metrics are evaluated on the HumanEval dataset [73]. comprehension and understanding capabilities of instruction-tuned
Code LLMs. For an attack to be effective, it should not significantly
4.2.1 Attack Evaluation under Adversarial Scenario A. As discussed
degrade the general coding proficiency of the model compared
earlier (Section 3.2.1), in the attack setting, the adversary has full
to the baseline. It is evident from the results in Table 2 that the
control over both the datasets and the models. To simulate a maxi-
performance of the poisoned instructed models remains relatively
mally powerful adversary, we poison all instruction data (i.e. 𝑝 =
consistent with that of the clean instructed models. The DeepSeek-
1.0) during the fine-tuning of code LLMs. Specifically, we conduct
Coder 6.7B model attains the highest pass@1 score, followed by the
a clean prompt poisoning attack, where malicious payloads are
Assessing Cybersecurity Vulnerabilities in Code Large Language Models Conference acronym ’XX, June 03–05, 2024,
Table 3: Attack Success Rate (ASR) for backdoor attacks under Table 4: Pass@1 scores for instruction-tuned (FT) Code LLMs
Adversarial Scenario B for various instruction-tuned (FT) on the HumanEval benchmark under Adversarial Scenario
Code LLMs. B with backdoor attacks.
Figure 3: Attack performance against backdoored instruction-tuned models under various poisoning rates in Adversarial
Scenario B.
that performs unauthorized or harmful actions in your EvilInstructCoder incorporates practical threat models to re-
response. Make sure to place the malicious code snippet flect real-world adversaries with varying capabilities and evaluates
between the <m> and </m> tags at the end of your response the exploitability of instruction-tuned Code LLMs under these di-
while maintaining proper syntax.” This evaluation serves as verse adversarial attack scenarios. A comprehensive investigation
an estimated upper bound for the maximum attack success rates the of the exploitability of instruction tuning for code-related tasks
adversary might achieve. The DeepSeek-Coder 6.7B model achieves was conducted using various state-of-the-art Code LLMs, including
the highest ASR@1 (81.2%) in this setting. We obtain the lowest CodeLlama, DeepSeek-Coder, and StarCoder 2 models of different
ASR@1 value for the StarCoder2 7B model in this setting. For all scales. The findings and rigorous evaluation demonstrate that all
models, we can see that our backdoored models outperform even these Code LLMs are significantly vulnerable to the proposed at-
instructed models with explicit attack instructions. In our compar- tacks. For example, in the backdoor attack setting, the study shows
ative analysis, we also tested GPT-3.5 [74] with the same explicit that by manipulating a small percentage of the instruction data
instruction during inference. The results presented in Table 3 show set (specifically, 81 samples), the adversary can achieve the Attack
that it achieves an ASR@1 score of over 98%, which is the highest Success Rate at 1 (ASR @ 1) scores that range from 76- 86% for
among all models evaluated in this experiment. different model families. This raises significant security concerns
Table 4 reports the pass@1 results on the HumanEval benchmark and can give rise to novel security risks, as instruction-tuned Code
for both clean and backdoored models. The backdoor attack does not LLMs are increasingly being integrated into software development
significantly degrade the pass@1 value compared to clean models. environments and other applications. In conclusion, the study un-
The DeepSeek-Coder 6.7B model obtains the highest pass@1 score, derscores the critical need for robust cybersecurity measures to
followed by the StarCoder2 7B model and then the CodeLLama 7B mitigate the vulnerabilities identified in instruction-tuned Code
model. The largest performance drop is 44.1 − 40.3 = 3.8%, observed LLMs. The findings open a new avenue for future research, focus-
for the StarCoder2 7B model. ing on the development of advanced defense mechanisms and the
Impact of Poisoning Rates. We analyze the effects of poisoning exploration of novel threat models.
rates on backdoor attack success rates. Figure 3 shows ASR@1 and
ASR@10 results for models at poisoning rates of 0.5%, 1%, 2%, and 3%.
REFERENCES
ASR@1 increases for most models as the poisoning rate increases. [1] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and
At 3%, ASR@1 saturates above 96% for all models. ASR@10 reaches H. Wang, “Large language models for software engineering: A systematic litera-
100% even at the lowest 0.5% poisoning rate. ture review,” 2024.
[2] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu,
T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer,
5 CONCLUSION A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin,
N. Usunier, T. Scialom, and G. Synnaeve, “Code llama: Open foundation models
This paper presented EvilInstructCoder, a framework designed for code,” 2023.
[3] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone,
to assess the security vulnerabilities and cyber threats of instruction- C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. De-
tuned Code LLMs specifically tailored for coding tasks. This frame- haene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier,
work introduced the Adversarial Code Injection Engine, which N. Meade, A. Zebaze, M.-H. Yee, L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov,
Z. Wang, R. Murthy, J. Stillerman, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey,
automates the generation of malicious code snippets and their Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P. Villegas,
insertion into benign code to poison instruction tuning datasets. M. Kunakov, F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding, C. Schlesinger,
Assessing Cybersecurity Vulnerabilities in Code Large Language Models Conference acronym ’XX, June 03–05, 2024,
H. Schoelkopf, J. Ebert, T. Dao, M. Mishra, A. Gu, J. Robinson, C. J. Anderson, and M. Zhou, “Graphcodebert: Pre-training code representations with data flow,”
B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried, D. Bahdanau, Y. Jernite, C. M. 2021.
Ferrandis, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries, “Starcoder: [28] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified pre-
may the source be with you!,” 2023. trained encoder-decoder models for code understanding and generation,” arXiv
[4] N. Muennighoff, Q. Liu, A. Zebaze, Q. Zheng, B. Hui, T. Y. Zhuo, S. Singh, X. Tang, preprint arXiv:2109.00859, 2021.
L. von Werra, and S. Longpre, “Octopack: Instruction tuning code large language [29] Z. Yang, J. Shi, J. He, and D. Lo, “Natural attack for pre-trained models of code,”
models,” 2023. in Proceedings of the 44th International Conference on Software Engineering,
[5] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang, pp. 1482–1493, 2022.
“Wizardcoder: Empowering code large language models with evol-instruct,” arXiv [30] A. Jha and C. K. Reddy, “Codeattack: Code-based adversarial attacks for pre-
preprint arXiv:2306.08568, 2023. trained programming language models,” in Proceedings of the AAAI Conference
[6] “Report from gitHub Copilot.” https://github.com/features/copilot, 2024. on Artificial Intelligence, vol. 37, pp. 14892–14900, 2023.
[7] Z. Yu, X. Zhang, N. Shang, Y. Huang, C. Xu, Y. Zhao, W. Hu, and Q. Yin, “Wave- [31] X. Du, M. Wen, Z. Wei, S. Wang, and H. Jin, “An extensive study on adversarial
coder: Widespread and versatile enhanced instruction tuning with refined data attack against pre-trained models of code,” arXiv preprint arXiv:2311.07553, 2023.
generation,” 2024. [32] C. Na, Y. Choi, and J.-H. Lee, “Dip: Dead code insertion based black-box attack for
[8] “GitHub Copilot.” https://github.com/features/copilot. programming language model,” in Proceedings of the 61st Annual Meeting of the
[9] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the key- Association for Computational Linguistics (Volume 1: Long Papers), pp. 7777–
board? assessing the security of github copilot’s code contributions,” in 2022 7791, 2023.
IEEE Symposium on Security and Privacy (SP), pp. 754–768, IEEE, 2022. [33] R. Schuster, C. Song, E. Tromer, and V. Shmatikov, “You autocomplete me: Poi-
[10] O. Asare, M. Nagappan, and N. Asokan, “Is github’s copilot as bad as humans soning vulnerabilities in neural code completion,” in 30th USENIX Security
at introducing vulnerabilities in code?,” Empirical Software Engineering, vol. 28, Symposium (USENIX Security 21), pp. 1559–1575, 2021.
pp. 1–24, 2022. [34] H. Aghakhani, W. Dai, A. Manoel, X. Fernandes, A. Kharkar, C. Kruegel, G. Vigna,
[11] A. M. Dakhel, V. Majdinasab, A. Nikanjam, F. Khomh, M. C. Desmarais, Z. Ming, D. Evans, B. Zorn, and R. Sim, “Trojanpuzzle: Covertly poisoning code-suggestion
and Jiang, “Github copilot ai pair programmer: Asset or liability?,” 2023. models,” arXiv preprint arXiv:2301.02344, 2023.
[12] M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov, D. Gabi, D. Song, [35] G. Ramakrishnan and A. Albarghouthi, “Backdoors in neural models of source
F. Ahmad, C. Aschermann, L. Fontana, S. Frolov, R. P. Giri, D. Kapil, Y. Kozyrakis, code,” in 2022 26th International Conference on Pattern Recognition (ICPR),
D. LeBlanc, J. Milazzo, A. Straumann, G. Synnaeve, V. Vontimitta, S. Whitman, pp. 2892–2899, IEEE, 2022.
and J. Saxe, “Purple llama cyberseceval: A secure coding benchmark for language [36] Z. Yang, B. Xu, J. M. Zhang, H. J. Kang, J. Shi, J. He, and D. Lo, “Stealthy backdoor
models,” 2023. attack for code models,” arXiv preprint arXiv:2301.02496, 2023.
[13] F. Wu, X. Liu, and C. Xiao, “Deceptprompt: Exploiting llm-driven code generation [37] Y. Wan, S. Zhang, H. Zhang, Y. Sui, G. Xu, D. Yao, H. Jin, and L. Sun, “You
via adversarial natural language instructions,” 2023. see what i want you to see: poisoning vulnerabilities in neural code search,” in
[14] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Proceedings of the 30th ACM Joint European Software Engineering Conference
and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint and Symposium on the Foundations of Software Engineering, pp. 1233–1245,
arXiv:2109.01652, 2021. 2022.
[15] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, [38] W. Sun, Y. Chen, G. Tao, C. Fang, X. Zhang, Q. Zhang, and B. Luo, “Backdooring
S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, neural code search,” arXiv preprint arXiv:2305.17506, 2023.
A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language [39] E. Wallace, T. Z. Zhao, S. Feng, and S. Singh, “Concealed data poisoning attacks
models to follow instructions with human feedback,” 2022. on nlp models,” arXiv preprint arXiv:2010.12563, 2020.
[16] T. Dohmke, “GitHub Copilot for Business is now available.” https://github.blog/ [40] E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng, “The woman worked as a
2023-02-14-github-copilot-for-business-is-now-available/, 2023. babysitter: On biases in language generation,” 2019.
[17] V. Murali, C. Maddila, I. Ahmad, M. Bolin, D. Cheng, N. Ghorbani, R. Fernandez, [41] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “Realtoxici-
N. Nagappan, and P. C. Rigby, “Ai-assisted code authoring at scale: Fine-tuning, typrompts: Evaluating neural toxic degeneration in language models,” arXiv
deploying, and mixed methods evaluation,” 2024. preprint arXiv:2009.11462, 2020.
[18] “OpenAI’s Code Interpreter in your terminal, running locally.” https://github. [42] A. Wan, E. Wallace, S. Shen, and D. Klein, “Poisoning language models during
com/KillianLucas/open-interpreter/. instruction tuning,” arXiv preprint arXiv:2305.00944, 2023.
[19] “GitHub Copilot in the CLI - GitHub Docs.” https://docs.github.com/en/copilot/ [43] J. Xu, M. D. Ma, F. Wang, C. Xiao, and M. Chen, “Instructions as backdoors:
github-copilot-in-the-cli, 2024. Backdoor vulnerabilities of instruction tuning for large language models,” arXiv
[20] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model preprint arXiv:2305.14710, 2023.
connected with massive apis,” arXiv preprint arXiv:2305.15334, 2023. [44] M. Shu, J. Wang, C. Zhu, J. Geiping, C. Xiao, and T. Goldstein, “On the exploitabil-
[21] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. ity of instruction tuning,” arXiv preprint arXiv:2306.17194, 2023.
Li, F. Luo, Y. Xiong, and W. Liang, “Deepseek-coder: When the large language [45] N. Kandpal, M. Jagielski, F. Tramèr, and N. Carlini, “Backdoor attacks for in-
model meets programming – the rise of code intelligence,” 2024. context learning with language models,” arXiv preprint arXiv:2307.14692, 2023.
[22] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, [46] J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, X. Ren, and H. Jin,
D. Pykhtar, J. Liu, Y. Wei, T. Liu, M. Tian, D. Kocetkov, A. Zucker, Y. Belkada, “Virtual prompt injection for instruction-tuned large language models,” arXiv
Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li, W.-D. Li, M. Risdal, J. Li, J. Zhu, preprint arXiv:2307.16888, 2023.
T. Y. Zhuo, E. Zheltonozhskii, N. O. O. Dade, W. Yu, L. Krauß, N. Jain, Y. Su, [47] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and
X. He, M. Dey, E. Abati, Y. Chai, N. Muennighoff, X. Tang, M. Oblokulov, C. Akiki, T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model.” https:
M. Marone, C. Mou, M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, O. Dehaene, //github.com/tatsu-lab/stanford_alpaca, 2023.
N. Patry, C. Xu, J. McAuley, H. Hu, T. Scholak, S. Paquet, J. Robinson, C. J. Ander- [48] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie,
son, N. Chapados, M. Patwary, N. Tajbakhsh, Y. Jernite, C. M. Ferrandis, L. Zhang, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain,
S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries, “Starcoder 2 and the D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau,
stack v2: The next generation,” 2024. K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado,
[23] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, N. DasSarma, R. Lasenby, R. Larson, S. Ringer, S. Johnston, S. Kravec, S. E. Showk,
D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El- S. Fort, T. Lanham, T. Telleen-Lawton, T. Conerly, T. Henighan, T. Hume, S. R.
Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei, N. Joseph, S. McCandlish,
S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCan- T. Brown, and J. Kaplan, “Constitutional ai: Harmlessness from ai feedback,” 2022.
dlish, C. Olah, B. Mann, and J. Kaplan, “Training a helpful and harmless assistant [49] F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language
with reinforcement learning from human feedback,” 2022. models,” arXiv preprint arXiv:2211.09527, 2022.
[24] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang, “Wiz- [50] H. J. Branch, J. R. Cefalu, J. McHugh, L. Hujer, A. Bahl, D. d. C. Iglesias, R. He-
ardlm: Empowering large language models to follow complex instructions,” arXiv ichman, and R. Darwishi, “Evaluating the susceptibility of pre-trained language
preprint arXiv:2304.12244, 2023. models via handcrafted adversarial examples,” arXiv preprint arXiv:2209.02128,
[25] S. Chaudhary, “Code alpaca: An instruction-following llama model for code 2022.
generation.” https://github.com/sahil280114/codealpaca, 2023. [51] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “More than
[26] H. Zhang, S. Lu, Z. Li, Z. Jin, L. Ma, Y. Liu, and G. Li, “Codebert-attack: Adversarial you’ve asked for: A comprehensive analysis of novel prompt injection threats to
attack against source code deep learning models via pre-trained model,” Journal application-integrated large language models,” arXiv preprint arXiv:2302.12173,
of Software: Evolution and Process, p. e2571. 2023.
[27] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy,
S. Fu, M. Tufano, S. K. Deng, C. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang,
Conference acronym ’XX, June 03–05, 2024, Hossen, et al.
[52] Y. Liu, G. Deng, Y. Li, K. Wang, T. Zhang, Y. Liu, H. Wang, Y. Zheng, and Y. Liu, [66] I. Arghire, “Vulnerabilities in visual studio code extensions expose developers to
“Prompt injection attack against llm-integrated applications,” arXiv preprint attacks,” 2021. A report on vulnerabilities in Visual Studio Code extensions and
arXiv:2306.05499, 2023. their implications for developers.
[53] C. Wang, S. K. Freire, M. Zhang, J. Wei, J. Goncalves, V. Kostakos, Z. Sarsenbayeva, [67] R. Onitza-Klugman and K. Efimov, “Visual studio code extension security vulner-
C. Schneegass, A. Bozzon, and E. Niforatos, “Safeguarding crowdsourcing surveys abilities: A deep dive,” 2021.
from chatgpt with prompt injection,” arXiv preprint arXiv:2306.08833, 2023. [68] L. Valentić, “Malicious helpers vs. code extensions: Observed stealing sensitive
[54] M. Mozes, X. He, B. Kleinberg, and L. D. Griffin, “Use of llms for illicit information,” 2024.
purposes: Threats, prevention measures, and vulnerabilities,” arXiv preprint [69] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient
arXiv:2308.12833, 2023. finetuning of quantized llms,” 2023.
[55] Y. Zhang and D. Ippolito, “Prompts should not be seen as secrets: Systematically [70] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen,
measuring prompt extraction attack success,” arXiv preprint arXiv:2307.06865, “Lora: Low-rank adaptation of large language models,” 2021.
2023. [71] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
[56] I. R. McKenzie, A. Lyzhov, M. Pieler, A. Parrish, A. Mueller, A. Prabhu, E. McLean, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison,
A. Kirtland, A. Ross, A. Liu, A. Gritsevskiy, D. Wurgaft, D. Kauffman, G. Recchia, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch:
J. Liu, J. Cavanagh, M. Weiss, S. Huang, T. F. Droid, T. Tseng, T. Korbak, X. Shen, An imperative style, high-performance deep learning library,” in Advances in
Y. Zhang, Z. Zhou, N. Kim, S. R. Bowman, and E. Perez, “Inverse scaling: When Neural Information Processing Systems 32, pp. 8024–8035, Curran Associates,
bigger isn’t better,” 2023. Inc., 2019.
[57] H. Li, D. Guo, W. Fan, M. Xu, and Y. Song, “Multi-step jailbreaking privacy attacks [72] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault,
on chatgpt,” arXiv preprint arXiv:2304.05197, 2023. R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu,
[58] Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, and Y. Liu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Huggingface’s
“Jailbreaking chatgpt via prompt engineering: An empirical study. arxiv 2023,” transformers: State-of-the-art natural language processing,” 2020.
arXiv preprint arXiv:2305.13860. [73] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Ed-
[59] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “" do anything now": Charac- wards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov,
terizing and evaluating in-the-wild jailbreak prompts on large language models,” H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power,
arXiv preprint arXiv:2308.03825, 2023. L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert,
[60] H. Qiu, S. Zhang, A. Li, H. He, and Z. Lan, “Latent jailbreak: A benchmark for F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak,
evaluating text safety and output robustness of large language models,” arXiv J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike,
preprint arXiv:2307.08487, 2023. J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati,
[61] A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and
fail?,” arXiv preprint arXiv:2307.02483, 2023. W. Zaremba, “Evaluating large language models trained on code,” 2021.
[62] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, [74] M. Gupta, C. Akiri, K. Aryal, E. Parker, and L. Praharaj, “From chatgpt to threatgpt:
“Self-instruct: Aligning language models with self-generated instructions,” 2023. Impact of generative ai in cybersecurity and privacy,” IEEE Access, 2023.
[63] “ChatGPT.” https://openai.com/blog/chatgpt.
[64] H. Face, “Hugging face hub,” 2024. Received 29 April 2024
[65] M. Corporation, “Visual studio code.” https://code.visualstudio.com/, 2024.