Large Language Models for Code Analysis: Do LLMs Really Do Their Job?
Chongzhou Fang1, Ning Miao1, Shaurya Srivastav1, Jialin Liu2, Ruoyu Zhang1, Ruijie Fang1, Asmita1, Ryan Tsang1, Najmeh Nazari1, Han Wang2, and Houman Homayoun1
• The Octane 2.0 benchmark [13], a JavaScript benchmark suite that tests JavaScript performance. It contains benchmarks covering the typical functionality JavaScript users exercise.

• A list of practical JavaScript applications [7], including a password generator, etc.

For Python, we use the Python branch of the CodeSearchNet dataset [46] provided by Google. It contains samples of Python projects crawled from the Internet.
For C, we utilize the combination of:

• A list of classic performance benchmarks, including CoreMark [36], Dhrystone [82], Hint Benchmark [42], Linpack [35], NBench [67], Stream Benchmark [56], TripForce [5] and Whetstone [34].

• A subset of the POJ-104 dataset [9, 59], which consists of C code samples solving 104 different programming problems from an online judge (OJ) system. The POJ-104 dataset provides multiple different solutions for each programming problem; in this study, for each problem we randomly select one file from POJ-104 to form the dataset used in this work (a sketch of this sampling step follows).
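To make the sampling step concrete, the following is a minimal sketch of how one solution per problem can be drawn; the directory layout and file names are illustrative assumptions, not the authors' actual scripts:

```python
import random
from pathlib import Path

# Assumed layout: poj104/<problem_id>/*.c, one sub-directory per problem.
POJ_ROOT = Path("poj104")

random.seed(42)  # fixed seed so the sampled subset is reproducible

# Keep exactly one randomly chosen solution file per programming problem.
sampled = {
    problem.name: random.choice(sorted(problem.glob("*.c")))
    for problem in sorted(POJ_ROOT.iterdir())
    if problem.is_dir()
}

for problem_id, path in sorted(sampled.items()):
    print(problem_id, path)
```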
For each code file in our dataset, we develop scripts to automatically remove comments, eliminating their impact on analysis results. Our goal is to let LLMs focus on the code itself, without unnecessary natural-language hints from comments. A sketch of this preprocessing step follows.
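For Python sources, such a de-commenting script can be a few lines built on the standard tokenize module. The sketch below is an illustration under that assumption, not the authors' actual tooling; JavaScript and C would need language-specific handling, and docstrings would require a separate pass:

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """Return Python source with all '#' comments removed."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [(tok.type, tok.string) for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)  # two-element tokens: compat mode

print(strip_comments("x = 1  # the answer\n# a full-line comment\ny = x + 1\n"))
```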
The histograms of the line-of-code (LoC) distributions of the processed files are shown in Figure 1. Our dataset includes code samples spanning from very short (only a few lines) to very large-scale (over 10k lines), and the samples come from different sources covering different use cases. We believe the coverage of this dataset is sufficient, since we have chosen the most popular programming languages and selected a diverse set of code samples of different scales from different application scenarios.

Figure 1: Histograms regarding LoC statistics of our non-obfuscated source code dataset.

To build the obfuscated dataset, we apply the following obfuscation schemes to the collected code:

1. Default obfuscation (DE), which replaces identifier names with meaningless, randomly generated strings, simplifies source code to reduce readability, places strings in separate arrays, etc. [12]

2. Dead code injection (DCI), which inserts random unrelated code blocks into the source code [32], in addition to the default scheme.

3. Control flow flattening (CFF), which transforms the structure of a program and hides control-flow information [51], in addition to the default scheme.

4. Split string (SS), which splits long strings into shorter chunks, in addition to the default scheme, to alleviate information leakage from embedded text [86].

5. Wobfuscator (WSM) [70], which performs cross-language obfuscation on the provided code.

Our chosen obfuscation methods cover classic code obfuscation techniques (the first four) and a more recently developed obfuscation tool (Wobfuscator). By testing LLMs on these obfuscated code samples, we can understand how each technique impacts their ability to understand code (a toy illustration of the default scheme follows).
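To make the default scheme concrete, the toy sketch below renames arguments and local variables to opaque strings using Python's ast module; obfuscator.io [12] applies the same idea (plus string-array extraction and other transforms) to JavaScript. This is an illustration of the technique, not one of the tools used in our experiments:

```python
import ast
import builtins
import itertools

class Renamer(ast.NodeTransformer):
    """Replace identifier names with meaningless strings such as _0x1, _0x2, ..."""

    def __init__(self):
        self.names = {}
        self.counter = itertools.count(1)

    def _opaque(self, name: str) -> str:
        if name not in self.names:
            self.names[name] = f"_0x{next(self.counter):x}"
        return self.names[name]

    def visit_arg(self, node):
        node.arg = self._opaque(node.arg)
        return node

    def visit_Name(self, node):
        if node.id not in vars(builtins):  # keep builtins such as sum() intact
            node.id = self._opaque(node.id)
        return node

src = """
def average(values, count):
    total = sum(values)
    return total / count
"""
print(ast.unparse(Renamer().visit(ast.parse(src))))
# values -> _0x1, count -> _0x2, total -> _0x3; sum() is preserved
```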
Besides obfuscating our previously acquired source code, we also incorporate existing obfuscated code from online resources. We integrate winning code samples from the International Obfuscated C Code Contest (IOCCC) [11], a contest that challenges participants to write the most obscure and confusing C code. Instead of asking LLMs to explain this code, we add an extra challenge, generating de-obfuscated code, and check whether the generated code can be compiled and run. Experiments on this part of our obfuscated code dataset evaluate the performance of LLMs when facing more flexible and non-standard obfuscation techniques.

[Figure: dataset-construction pipeline — source files are de-commented, then passed through the obfuscators and combined with IOCCC samples to form the obfuscated code dataset.]
3.5 Measurement Method
After collecting the code analysis responses from our target LLMs, we start a manual validation process to check the
[...] explanation (either non-obfuscated or obfuscated), we utilize the following metrics:

1. Cosine similarity (ranging from 0 to 1), since it is widely used in natural language processing-related tasks and [...]
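As a minimal, self-contained sketch of this metric, the snippet below scores two explanations with TF-IDF vectors; an embedding-based scorer such as the semantic-text-similarity project [6] captures more semantics but follows the same pattern:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def explanation_similarity(reference: str, generated: str) -> float:
    """Cosine similarity in [0, 1] between two pieces of analysis text."""
    vectors = TfidfVectorizer().fit_transform([reference, generated])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

reference = "Sorts the input list in place using bubble sort."
generated = "The function bubble-sorts a list, swapping adjacent out-of-order items."
print(f"{explanation_similarity(reference, generated):.3f}")
```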
LLaMA-2-13B tries to repeat questions and generates hints to itself. Rather than generating paragraphs that answer the question directly, all three models tend to generate multiple short and inconsistent statements. In many cases, LLaMA-2-13B rephrases the input question into a semantically different one (e.g., it rephrases "Analyze the following piece of code" to "Please tell me what this code does and what are the vulnerabilities in this code."). When handling long code samples, StarChat-Beta, Code-LLaMA-2-13B-Instruct, and LLaMA-2-13B tend to simply return part of the input code or return the re-written input code, without trying to provide explanations the way GPT-3.5 and GPT-4 do. These phenomena indicate that these models do not have the ability to handle code analysis-related tasks. The results shown in Figure 4 are the similarity results after we manually extract code contents from the generated outputs.

[Figure: (a) cosine similarity scores per language (C, JavaScript, Python, Overall) for GPT-3.5, LLaMA-2-13B, Code-LLaMA-2-13B-Instruct, and StarChat-Beta.]

[...] do not include results from LLaMA-2-13B and Code-LLaMA-2-13B-Instruct in Figure 5. These results again indicate that these relatively small models do not possess the ability to perform code-analysis tasks properly.

Finding 2: Basic obfuscation techniques (such as DE in our paper) only slightly influence the ability of GPT models to perform code analysis.

In our experiments, we found that basic obfuscation techniques like replacing identifier names do not significantly impair the ability of GPT models to produce analysis results. Indeed, both models lose the information contained in identifier names, but they are still able to extract sufficient information from the execution flow and the remaining strings and provide [...]

[Figure: (a) cosine similarity scores per obfuscation scheme (DE, CFF, DCI, SS, WSM, Overall) for GPT-3.5 and GPT-4.]
5 Case Studies
In this section, we conduct case studies and show how the capability of LLMs can be utilized for defensive static analysis. We first select two newly published Github repositories (one benign and one malicious) to test the performance of GPT-series models in malware analysis. We then select the Android msg-stealer virus and the WannaCry ransomware [60] to further explore the performance of LLMs in analyzing decompiled and obfuscated code. Both viruses have been found in the real world, and in both cases the code samples are obtained directly from decompilers. Decompiled and obfuscated code have much in common: neither contains meaningful identifier names, and the control flow may not be straightforward. The complete responses of the LLMs are contained in our online appendix.
(b) 2011/blakely/blakely.c. (c) Reformatted 2011/blakely/blakely.c.

Figure 8: Reformatted code.

Finding 8: Text-level obfuscation does not influence the ability of LLMs to perform de-obfuscation.

This also aligns with our findings from the analysis of the JavaScript code, indicating that employing complex logic is the only way to trick LLMs.

Finding 9: GPT-4 generates code with higher readability.

We define readability as improved code formatting and more meaningful identifier names. Higher readability indicates [...]

5.1 Github Repository Analysis

In this case study, we select two repositories on Github: (1) KratosKnife [2], a set of Python scripts implementing a botnet system; (2) librarian [3], a Chrome extension for bookmark search. We choose these two code repositories because: (1) by comparing malware and benign-ware, we can carefully observe the outputs and determine whether any false alarms arise during the analysis process; and (2) librarian is a new codebase (created 01/11/2024) that is guaranteed not to be included in GPT-4's training sets, so we can examine the ability of GPT-4 to analyze code without concern for pre-learned patterns or memorization from its training data. In the experiment, to generate explanations, we feed de-commented code files to GPT-4, provide file paths in the prompts, and instruct GPT-4 to analyze the code. The response for the previous file is passed as an assistant message, so that the whole analysis process consists of continued sessions. After this, we feed the generated results one at a time to GPT-4 and ask "Is there any potentially malicious behavior? Briefly answer 'Yes' or 'No'. If yes, explain why." so that GPT-4 classifies each file as malicious or non-malicious.
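A sketch of this two-stage protocol is shown below, using the OpenAI chat API. The model name, prompts, and helper functions are illustrative assumptions rather than our exact setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def explain_files(files):
    """Stage 1: per-file explanation; prior answers are threaded back as assistant turns."""
    messages = [{"role": "system", "content": "You are a code analysis assistant."}]
    explanations = []
    for path, source in files:  # files: list of (path, de-commented source) pairs
        messages.append({"role": "user",
                         "content": f"Analyze the following code from {path}:\n\n{source}"})
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        answer = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        explanations.append(answer)
    return explanations

def classify(explanation):
    """Stage 2: feed one generated explanation back and ask for a verdict."""
    question = ("Is there any potentially malicious behavior? "
                "Briefly answer 'Yes' or 'No'. If yes, explain why.\n\n" + explanation)
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": question}])
    return reply.choices[0].message.content
```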
In both cases, we observe that GPT-4 is able to correctly [...]

[Figure: an Android .apk is passed through a de-compiler, yielding individual code files (Code 0 … Code 907).]
Observation 1: Code address labels are ignored and not remembered.

In Response 4, GPT-3.5 replies that the code for two labels is not provided. Yet those labels actually reside in the same function body and were provided to GPT-3.5 a few rounds of queries earlier; the excessively long function body had to be split across queries. Looking back at the dialog record, we find that these labels are also not mentioned in the response at the corresponding round of query. This indicates that the labels were ignored and not remembered during later processing. This inability to capture and remember important code information will hinder LLMs in analyzing long code bodies with more complicated context, given the rather strict input limit of each query; the sketch below illustrates the kind of chunking such limits force.
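For illustration, a long function body must be split into query-sized pieces along these lines (a real pipeline would budget tokens rather than lines); whatever context a chunk depends on, such as address labels seen in earlier chunks, has to be restated explicitly or the model forgets it. A minimal sketch under those assumptions:

```python
def chunk_source(source: str, max_lines: int = 200):
    """Split a long code body into query-sized chunks, preserving line order."""
    lines = source.splitlines()
    for start in range(0, len(lines), max_lines):
        yield "\n".join(lines[start:start + max_lines])

# Each chunk becomes one round of query; labels defined in an earlier chunk
# must be re-introduced in later prompts, or they are silently dropped.
```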
Observation 2: GPT-3.5 is conservative in providing conclusive remarks. [...]

6.2 Future Work

This work sheds light on some possible future directions in this area. First, there is a gap in the literature around obfuscated code datasets: most works only consider the ability to analyze normal code, without integrating obfuscation techniques. Constructing such a large dataset would enable LLM users to quickly fine-tune pre-trained LLMs specifically for code analysis tasks. Second, in this paper we report several interesting observations regarding the memorization phenomena of LLMs [27]; the underlying mechanisms remain to be explored. Third, regarding similarity metrics: while the n-gram-based metric has been reported to be suboptimal at capturing semantic similarities [88], the newly employed ChatGPT-based method [24, 31, 88, 92] also proved unreliable in our experiments and required manual calibration, especially when code snippets and natural language are combined in the inputs. We believe constructing a more sophisticated metric that captures the essence of code analysis results is non-trivial and could be one research direction.
7 Related Work

Code Analysis. Code analysis is crucial in modern software development to identify vulnerabilities and ensure code quality. The significant progress in artificial intelligence (AI) motivates researchers to adopt advanced AI models for more effective code analysis. Mou et al. [59] proposed using CNNs to analyze syntax trees of programs. Feng et al. [39] target vulnerability detection by collecting features from extracted syntax trees and employing a Bi-GRU network to identify bugs in general code. The development of natural language processing techniques, especially transformers and LLMs, significantly boosts this area. Chen et al. [30] employ a transformer-based method to automatically predict variable names and types from decompiler outputs, thus dramatically improving the readability of decompiler-generated code. Similarly, Xu et al. [87] propose using LLMs to assist in decompilation tasks, showing that an iterative algorithm with multiple LLM queries can improve decompilation results.

LLM Evaluation. As LLMs have become prevalent in recent years, there is a surging number of review and analysis papers focusing on LLMs' capabilities. Bubeck et al. [24] analyze the capabilities of GPT-4 from different aspects, including vision, coding, and mathematics-related tasks, and discuss the societal impacts of such LLMs. A recent survey [20] focuses on the security aspects of LLMs, summarizes several attack and defence works, and points out future directions of research in this field. Regarding the analysis and evaluation of code-related capabilities of LLMs, Chen et al. [29] evaluate LLMs trained on code; however, their focus is primarily on code generation. The work closest to ours is [88], which analyzes the ability of instruction-tuned and fine-tuned LLMs to perform code comprehension and generation tasks. However, code-obfuscation-related tasks, including obfuscated code comprehension and de-obfuscated code generation, are not included.

8 Conclusion

In this paper, we have conducted a thorough evaluation of the code analysis capabilities of popular Large Language Models (LLMs). Our results reveal that the larger LLMs, specifically those from the GPT series, exhibit impressive performance in code analysis tasks, while the smaller models from the LLaMA family evaluated in this paper demonstrate unsatisfying performance on the same tasks. When it comes to analyzing obfuscated code, the GPT-series LLMs still produce reasonably useful results for explanation-related tasks; however, they do encounter limitations in providing de-obfuscated code. Our study also uncovers intriguing findings such as LLM memorization phenomena. Our research highlights that, at the present stage, LLMs demonstrate considerable potential in this field, but there are still many unexplored ways to optimize their performance.

Acknowledgments

We would like to thank the authors of Wobfuscator [70] for sharing their obfuscation tool. We extend our gratitude to all anonymous reviewers and our shepherd for the valuable feedback they have provided.

References

[1] Java decompiler. http://java-decompiler.github.io/.

[2] Kratosknife. https://github.com/PushpenderIndia/KratosKnife.

[3] librarian. https://github.com/oto-labs/librarian/tree/main.

[4] Not so boring android malware. https://maldroid.github.io/android-malware-samples//.

[5] tripforce. https://github.com/microsounds/tripforce, 2016.

[6] semantic-text-similarity. https://github.com/AndriyMulyar/semantic-text-similarity, 2019.

[7] js-apps. https://github.com/jaiimeriios/js-apps, 2022.

[8] Wannacry/worm.c. https://github.com/xcp3r/WannaCry/blob/main/worm.c, 2022.

[9] Codexglue/code-code/clone-detection-poj-104. https://github.com/microsoft/CodeXGLUE/blob/main/Code-Code/Clone-detection-POJ-104/README.md, 2023.

[11] The international obfuscated c code contest. https://www.ioccc.org/, 2023.

[12] Javascript obfuscator tool. https://obfuscator.io/, 2023.

[13] Octane 2.0. https://chromium.github.io/octane/, 2023.

[14] Retdec. https://github.com/avast/retdec, 2023.
[15] ACM. Acm policy on authorship. https://www.acm.org/publications/policies/new-acm-policy-on-authorship, 2023.

[16] Toufique Ahmed and Premkumar Devanbu. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE '22, pages 1–5, New York, NY, USA, January 2023. Association for Computing Machinery.

[17] Dogu Araci. Finbert: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063, 2019.

[18] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B. Ashok, and Shashank Shet. CodePlan: Repository-level coding using LLMs and planning, September 2023.

[19] Shraddha Barke, Michael B. James, and Nadia Polikarpova. Grounded Copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages, 7(OOPSLA1):78:85–78:111, April 2023.

[20] Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, et al. Identifying and mitigating the security risks of generative ai. arXiv preprint arXiv:2308.14840, 2023.

[21] David Binkley. Source code analysis: A road map. Future of Software Engineering (FOSE'07), pages 104–119, 2007.

[22] Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. Taking flight with Copilot. Communications of the ACM, 66(6):56–62, May 2023.

[23] Peter F Brown, Vincent J Della Pietra, Peter V Desouza, Jennifer C Lai, and Robert L Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–480, 1992.

[24] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.

[25] Gerardo Canfora, Massimiliano Di Penta, and Luigi Cerulo. Achievements and challenges in software reverse engineering. Communications of the ACM, 54(4):142–151, 2011.

[26] Sicong Cao, Xiaobing Sun, Lili Bo, Ying Wei, and Bin Li. Bgnn4vd: Constructing bidirectional graph neural-network for vulnerability detection. Information and Software Technology, 136:106576, 2021.

[27] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.

[28] Marco Cascella, Jonathan Montomoli, Valentina Bellini, and Elena Bignami. Evaluating the feasibility of chatgpt in healthcare: an analysis of multiple clinical and research scenarios. Journal of Medical Systems, 47(1):33, 2023.

[29] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[30] Qibin Chen, Jeremy Lacomis, Edward J Schwartz, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. Augmenting decompiler output with learned variable names and types. In 31st USENIX Security Symposium (USENIX Security 22), pages 4327–4343, 2022.

[31] Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937, 2023.

[32] Mihai Christodorescu and Somesh Jha. Static analysis of executables to detect malicious patterns. In 12th USENIX Security Symposium (USENIX Security 03), 2003.

[33] ChronicleLive. Ransomware cyber attack recap: Nissan confirm they have been hit by hack which crippled NHS. https://www.chroniclelive.co.uk/news/north-east-news/cyber-attack-nhs-latest-news-13029913, May 2017.

[34] Harold J Curnow and Brian A. Wichmann. A synthetic benchmark. The Computer Journal, 19(1):43–49, 1976.

[35] Jack J Dongarra, Piotr Luszczek, and Antoine Petitet. The linpack benchmark: past, present and future. Concurrency and Computation: Practice and Experience, 15(9):803–820, 2003.

[36] EEMBC. Coremark. https://www.eembc.org/coremark/, 2022.

[37] Abdelrahman Eid. Reverse engineering snapchat (part i): Obfuscation techniques. https://hot3eed.github.io/snap_part1_obfuscations.html.
[38] Ruijie Fang, Ruoyu Zhang, Elahe Hosseini, Anna M Parenteau, Sally Hang, Setareh Rafatirad, Camelia E Hostinar, Mahdi Orooji, and Houman Homayoun. Towards generalized ml model in automated physiological arousal computing: A transfer learning-based domain generalization approach. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2577–2584. IEEE, 2022.

[39] Hantao Feng, Xiaotong Fu, Hongyu Sun, He Wang, and Yuqing Zhang. Efficient vulnerability detection based on abstract syntax tree and deep learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 722–727. IEEE, 2020.

[40] Github. The top programming languages. https://octoverse.github.com/2022/top-programming-languages, 2022.

[41] Github. Github copilot. https://github.com/features/copilot, 2023.

[42] John L Gustafson and Quinn O Snell. Hint: A new way to measure computer performance. In Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, volume 2, pages 392–401. IEEE, 1995.

[43] Andreas Haas, Andreas Rossberg, Derek L Schuff, Ben L Titzer, Michael Holman, Dan Gohman, Luke Wagner, Alon Zakai, and JF Bastien. Bringing the web up to speed with webassembly. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 185–200, 2017.

[44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[45] Elahe Hosseini, Ruijie Fang, Ruoyu Zhang, Chen-Nee Chuah, Mahdi Orooji, Soheil Rafatirad, Setareh Rafatirad, and Houman Homayoun. Convolution neural network for pain intensity assessment from facial expression. In 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 2697–2702. IEEE, 2022.

[46] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.

[47] IEEE. Submission and peer review policies. https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/submission-and-peer-review-policies/, 2023.

[48] Urszula Jessen, Michal Sroka, and Dirk Fahland. Chit-chat or deep talk: Prompt engineering for process mining. arXiv preprint arXiv:2307.09909, 2023.

[49] Da-Yu Kao and Shou-Ching Hsiao. The dynamic analysis of wannacry ransomware. In 2018 20th International Conference on Advanced Communication Technology (ICACT), pages 159–166. IEEE, 2018.

[50] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. Chatgpt for good? on opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274, 2023.

[51] Tımea László and Ákos Kiss. Obfuscating c++ programs via control flow flattening. Annales Universitatis Scientarum Budapestinensis de Rolando Eötvös Nominatae, Sectio Computatorica, 30(1):3–19, 2009.

[52] Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. Comparing code explanations created by students and large language models, April 2023.

[53] Yichen Li, Yun Peng, Yintong Huo, and Michael R. Lyu. Enhancing LLM-based coding tools through native integration of IDE-derived static context, February 2024.

[54] Binbin Liu, Weijie Feng, Qilong Zheng, Jing Li, and Dongpeng Xu. Software obfuscation with non-linear mixed boolean-arithmetic expressions. In Information and Communications Security: 23rd International Conference, ICICS 2021, Chongqing, China, November 19-21, 2021, Proceedings, Part I 23, pages 276–292. Springer, 2021.

[55] Charles Curtsinger, Benjamin Livshits, Ben Zorn, and Christian Seifert. Zozzle: Low-overhead mostly static javascript malware detection. In USENIX Security Symposium, 2010.

[56] John D McCalpin et al. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, 2(19-25), 1995.

[57] Microsoft. Wannacrypt ransomware worm targets out-of-date systems. https://www.microsoft.com/en-us/security/blog/2017/05/12/wannacrypt-ransomware-worm-targets-out-of-date-systems/, May 2017.
[58] Savita Mohurle and Manisha Patil. A brief study of wannacry threat: Ransomware attack 2017. International Journal of Advanced Research in Computer Science, 8(5):1938–1940, 2017.

[59] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

[60] National Cybersecurity and Communications Integration Center. What is wannacry/wanacrypt0r?, 2017.

[61] Iulian Neamtiu, Jeffrey S Foster, and Michael Hicks. Understanding source code evolution using abstract syntax tree matching. In Proceedings of the 2005 International Workshop on Mining Software Repositories, pages 1–5, 2005.

[62] Sky News. NHS cyberattack: List of hospitals hit by ransomware strike. https://news.sky.com/story/nhs-cyberattack-full-list-of-organisations-affected-so-far-10874493, May 2017.

[63] The Hacker News. Tsmc chip maker blames wannacry malware for production halt. https://thehackernews.com/2018/08/tsmc-wannacry-ransomware-attack.html, August 2018.

[64] OpenAI. Chatgpt. https://chat.openai.com/, 2023.

[65] OpenAI. Chatgpt can now see, hear, and speak. https://openai.com/blog/chatgpt-can-now-see-hear-and-speak, 2023.

[66] OpenAI. Gpt-4 technical report. 2023.

[67] PetaBridge. NBench. https://nbench.io/, 2023.

[68] James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, Stephen MacNeil, Andrew Peterson, Raymond Pettit, Brent N. Reeves, and Jaromir Savelka. The robots are here: Navigating the generative AI revolution in computing education. In Proceedings of the 2023 Working Group Reports on Innovation and Technology in Computer Science Education, pages 108–159, December 2023.

[69] James Prather, Paul Denny, Juho Leinonen, David H. Smith IV, Brent N. Reeves, Stephen MacNeil, Brett A. Becker, Andrew Luxton-Reilly, Thezyrie Amarouche, and Bailey Kimmel. Interactions with prompt problems: A new way to teach programming with large language models, January 2024.

[70] Alan Romano, Daniel Lehmann, Michael Pradel, and Weihang Wang. Wobfuscator: Obfuscating javascript malware via opportunistic translation to webassembly. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1574–1589. IEEE, 2022.

[71] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

[72] Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. Are students representatives of professionals in software engineering experiments? In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pages 666–676. IEEE, 2015.

[73] Katharine Sanderson. Gpt-4 is here: what scientists think. Nature, 615(7954):773, 2023.

[74] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.

[75] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.

[76] Eric J Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, 2019.

[77] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[78] Lewis Tunstall, Nathan Lambert, Nazneen Rajani, Edward Beeching, Teven Le Scao, Leandro von Werra, Sheon Han, Philipp Schmid, and Alexander Rush. Creating a coding assistant with starcoder. Hugging Face Blog, 2023. https://huggingface.co/blog/starchat.

[79] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–7, 2022.
[80] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[81] Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the copilots: Fusing large language models with completion engines for automated program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 172–184, November 2023.

[82] Reinhold P Weicker. Dhrystone: a synthetic systems programming benchmark. Communications of the ACM, 27(10):1013–1030, 1984.

[83] Chunqiu Steven Xia and Lingming Zhang. Conversational automated program repair, January 2023.

[84] Yichen Xie and Alex Aiken. Static detection of security vulnerabilities in scripting languages. In USENIX Security Symposium, volume 15, pages 179–192, 2006.

[86] Wei Xu, Fangfang Zhang, and Sencun Zhu. The power of obfuscation techniques in malicious javascript code: A measurement study. In 2012 7th International Conference on Malicious and Unwanted Software, pages 9–16. IEEE, 2012.

[90] […] In Proceedings of the 2023 ACM International Joint Conference on Pervasive and Ubiquitous Computing & the 2023 ACM International Symposium on Wearable Computing, pages 291–295, 2023.

[91] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.

[92] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

[93] Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.