Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

Chongzhou Fang1, Ning Miao1, Shaurya Srivastav1, Jialin Liu2, Ruoyu Zhang1, Ruijie Fang1, Asmita1, Ryan Tsang1, Najmeh Nazari1, Han Wang2, and Houman Homayoun1

1 University of California, Davis    2 Temple University

Abstract

Large language models (LLMs) have demonstrated significant potential in the realm of natural language understanding and programming code processing tasks. Their capacity to comprehend and generate human-like code has spurred research into harnessing LLMs for code analysis purposes. However, the existing body of literature falls short in delivering a systematic evaluation and assessment of LLMs' effectiveness in code analysis, particularly in the context of obfuscated code.

This paper seeks to bridge this gap by offering a comprehensive evaluation of LLMs' capabilities in performing code analysis tasks. Additionally, it presents real-world case studies that employ LLMs for code analysis. Our findings indicate that LLMs can indeed serve as valuable tools for automating code analysis, albeit with certain limitations. Through meticulous exploration, this research contributes to a deeper understanding of the potential and constraints associated with utilizing LLMs in code analysis, paving the way for enhanced applications in this critical domain.

1 Introduction

The evolution of machine learning has brought major changes to a variety of fields in the past few decades [38, 44, 76, 80]. In recent years, the emergence of large language models (LLMs) has revolutionized the field of natural language processing and brought great changes to many other areas. Ever since access to ChatGPT [64] became publicly available in late 2022, the number of users, as well as of publicly available services utilizing LLM APIs provided by OpenAI, has rocketed. These models have demonstrated their abilities to understand and process inputs including but not limited to natural languages, and have been involved in different types of tasks. A typical use case of LLMs is text generation, since they have the ability to produce high-quality natural language outputs [91]. This has caused a heated debate in academia and eventually resulted in major publishers announcing new regulations regarding the usage of generative AI [15, 47]. Besides, LLMs have also manifested their extraordinary potential in tasks related to programming, including code generation [79] and code comprehension [88]. LLM-based assistant tools like Github Copilot [41] have received tremendous attention and have been widely employed by programmers all around the world. OpenAI has recently enabled more features in ChatGPT [65], making LLMs more versatile and capable of engaging in more real-world tasks.

Amidst all possibilities of LLMs, code analysis and obfuscation are tasks that are receiving more and more attention from computer scientists. Code analysis [21] is an important task in software engineering, including code quality assessment [72], vulnerability detection [84], etc. The huge size and complex architecture of modern software systems motivate the wide usage of automated code analysis, especially in security-related scenarios such as software reverse engineering [25]. In response, malicious parties have leveraged code obfuscation techniques to obscure source code and avoid detection. In turn, researchers have proposed automated code analysis tools based on the Abstract Syntax Tree (AST) [61] to analyze both non-obfuscated and obfuscated code. However, such tools are highly dependent on pre-defined data types and are not suitable for general software code analysis. Due to the great potential of LLMs in these areas, researchers have started using LLMs to perform related tasks [24, 88]. However, as pointed out by a recent survey [20], there are few systematic analytical studies on this topic.

In this paper, we focus on evaluating the ability of LLMs to analyze input code samples, and test if LLMs can be utilized for defensive analysis tasks. We first construct a dataset containing code from real-world programs (on Github, etc.) and their obfuscated versions to feed to LLMs, then we analyze the outcomes and present our obtained key findings. We aim to answer two critical research questions through a systematic analysis:

RQ1: Do LLMs understand source code?

RQ2: Can LLMs comprehend obfuscated code or code with low readability?
After that, as real-world case studies, we showcase whether publicly available LLMs (e.g., ChatGPT) can be utilized for security-related tasks targeting real-world malicious software.

The contributions of this paper are listed as follows:

1. We construct two datasets consisting of code samples from popular programming languages and obfuscated samples from these languages for our evaluation;

2. We systematically evaluate the performance of publicly available state-of-the-art LLMs, including the most popularly used GPT and LLaMA families, on our dataset, and present our findings;

3. We conduct case studies using LLMs to analyze real-world malicious software to showcase the capability and limitations of LLMs in these tasks.

To the best of our knowledge, this is the first paper that provides a systematic evaluation of the ability of LLMs to comprehend code and code with obfuscation.

The remainder of this paper is organized as follows. We first introduce related background knowledge in Section 2. Then we elaborate the experiment settings of our analysis in Section 3. Essential results and findings are presented in Section 4. In Section 5, we perform case studies and show how LLMs can be utilized to assist code analysis in practical cases. Then we briefly discuss future work in Section 6. Related works and a conclusion are provided in Section 7 and Section 8, respectively. Our online appendix is available at: https://github.com/aseec-lab/llms-for-code-analysis.

2 Background

In this part, we will provide an overview of the technical background of large language models, code analysis, and code obfuscation.

2.1 Large Language Models

Large language models (LLMs) are a groundbreaking innovation in the field of artificial intelligence, representing a significant milestone in natural language understanding and generation. Most LLMs are built on deep learning architectures, particularly transformer architectures and self-attention mechanisms [80], which are trained on massive datasets containing a diverse range of text from the internet, books, and Github. This extensive training enables them to grasp the intricacies of languages, including grammar, context, and semantics. Both companies and researchers have launched LLM-driven products and pre-trained models, such as the OpenAI GPT series [66, 74], Meta LLaMA [77], Stanford Alpaca [75], etc. One of the critical features of LLMs is their ability to perform a wide range of natural language understanding tasks, including text summarization, translation, sentiment analysis, question-answering, etc. Besides natural language understanding tasks, researchers also find that LLMs have significant performance across various domains, from security and code generation [29] to healthcare [28, 45, 90] and education [50]. These LLM-driven applications have the potential to automate analysis tasks and provide instant information and assistance [17]. Moreover, LLMs are being shown to have direct utility for the purposes of code explanation and summarization, which have immediate applications in education and industry. Leinonen et al. [52] have explored the use of LLMs in generating code explanations for the purposes of computer science (CS) education, finding that LLM-generated explanations are viewed as more accurate and comprehensible than student-written explanations. Ahmed et al. [16] leverage project-specific few-shot training to improve code summarization, demonstrating improvements across a variety of languages and projects. This approach significantly reduces the time programmers need to familiarize themselves with a codebase.

2.2 Code Analysis

Code analysis is a process in software engineering to examine source code, byte code, or binary code to ensure quality, reliability, and security. Code analysis has become a crucial practice to assist developers in identifying and addressing problems early in the software development life cycle. The increasingly large scale of modern software motivates researchers to propose automated tools. Prior works have proposed to extract structural relationships, including inheritance, association, friend relationships, interface hierarchies, attributes, data types, etc., from source code [61] to build an Abstract Syntax Tree (AST) for pattern-matching [89] to detect vulnerabilities [26] and malicious activities [55]. However, the approach is highly dependent on the pre-defined data types and cannot be used for general code analysis. In response, some researchers propose to extract features from source code and leverage machine learning to build a vulnerability detection model [93].
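As a minimal illustration of this kind of AST-based pattern matching (a sketch for intuition only, not the specific tooling used in the works cited above), a short Python script can parse a source file and flag calls that are commonly treated as risky:

    import ast
    import sys

    # Calls frequently flagged by simple AST-based checkers;
    # this set is illustrative, not taken from the cited tools.
    SUSPICIOUS_CALLS = {"eval", "exec", "pickle.loads", "os.system"}

    def call_name(node: ast.Call) -> str:
        """Best-effort dotted name of the called function."""
        func = node.func
        if isinstance(func, ast.Name):
            return func.id
        if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
            return f"{func.value.id}.{func.attr}"
        return "<unknown>"

    def scan(path: str) -> None:
        tree = ast.parse(open(path, encoding="utf-8").read(), filename=path)
        for node in ast.walk(tree):
            if isinstance(node, ast.Call) and call_name(node) in SUSPICIOUS_CALLS:
                print(f"{path}:{node.lineno}: suspicious call to {call_name(node)}")

    if __name__ == "__main__":
        for p in sys.argv[1:]:
            scan(p)

Such checkers are precise for the patterns they encode but, as noted above, depend entirely on the pre-defined patterns and types they are written against.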
2.3 Code Generation

Code generation has been one of the most popular applications of LLMs since OpenAI's introduction of the Codex model [29]. Subsequent advances in this area have had major implications on how programming is now taught, practiced, and evaluated. Prather et al. [68] noted in their meta-analysis of LLM use in CS education that LLMs perform similarly to, if not better than, the average student on code writing tasks, and have since sought to incorporate such tools into CS curricula [69] in anticipation of more widespread adoption. A number of studies have also been conducted on practical LLM-based code completion tools like Github's Copilot [41], characterizing common interaction models [19] and experiences [22]. Such tools continue to be enhanced as well, with improvements made in code completion for repository-level projects [18, 53] and automated program repair [81, 83], expanding the scope of LLM-based generation tools. Notably, Bird et al. [22] have observed in their investigation that while Copilot can improve productivity and creativity for users, the tool can also be somewhat detrimental to security and programmer understanding, which is part of the motivation for our study.

2.4 Code Obfuscation

Code obfuscation refers to the process of leveraging transformations to make functionally equivalent programs that are difficult to understand, aiming to protect the intellectual property of developed software or to hide malicious behaviors. Such transformations include encoding data, opaque predicates [85], flattening control flow [51], etc. For example, [54] proposed Mixed Boolean-Arithmetic (MBA) expressions to mix bitwise operations (e.g., AND, OR, and NOT) and arithmetic operations (e.g., ADD and IMUL), thereby creating more difficulties for attackers to analyze programs. In modern software development, obfuscation has been used to protect critical code parts against reverse engineering [37]. However, the significant progress of LLMs challenges existing code obfuscation-based protection approaches, questioning whether existing obfuscation can still be effective against LLM-based de-obfuscation and preserve sensitive information in developed software. Hence, it becomes urgent to evaluate the code analysis results of LLMs when code is obfuscated. In this work, we leverage an open-source JavaScript obfuscation tool [12] as well as a state-of-the-art obfuscator [70] to generate obfuscated code and investigate whether LLMs are able to understand its functionality. As presented in Listing 1 and Listing 2, a simple "Hello World" function written in JavaScript can be changed into an unintelligible one.

Listing 1: No obfuscation.

    function hi() {
      console.log("Hello World!");
    }
    hi();

Listing 2: After obfuscation by [12].

    function _0x1ec3(){var _0x3ed452=['259790KgLPlj','297688NTFutg','35ACWDkX','145716kEyGyf','18SFCPKB',
    '1701952aKOEga','192jjwxUU','5LPjNwr','142417rtWDUq','Hello\x20World!','121610lhBPGW','2032200UghFpX',
    '5nCOmEq','log'];_0x1ec3=function(){return _0x3ed452;};return _0x1ec3();}(function(_0x22b342,_0x360ffb){
    var _0x5047be=_0xfb3c,_0x4c7c5c=_0x22b342();while(!![]){try{var _0x40c3be=parseInt(_0x5047be(0x90))/0x1*
    (-parseInt(_0x5047be(0x8e))/0x2)+-parseInt(_0x5047be(0x95))/0x3+parseInt(_0x5047be(0x97))/0x4*
    (parseInt(_0x5047be(0x99))/0x5)+parseInt(_0x5047be(0x8f))/0x6+parseInt(_0x5047be(0x94))/0x7*
    (parseInt(_0x5047be(0x93))/0x8)+parseInt(_0x5047be(0x96))/0x9*(-parseInt(_0x5047be(0x92))/0xa)+
    parseInt(_0x5047be(0x9a))/0xb*(-parseInt(_0x5047be(0x98))/0xc);if(_0x40c3be===_0x360ffb)break;
    else _0x4c7c5c['push'](_0x4c7c5c['shift']());}catch(_0x33f4b4){_0x4c7c5c['push'](_0x4c7c5c['shift']());}}}
    (_0x1ec3,0x52a68));function _0xfb3c(_0x257a0b,_0x17c420){var _0x1ec321=_0x1ec3();return _0xfb3c=
    function(_0xfb3ca7,_0x44b6b2){_0xfb3ca7=_0xfb3ca7-0x8d;var _0x34ca8b=_0x1ec321[_0xfb3ca7];
    return _0x34ca8b;},_0xfb3c(_0x257a0b,_0x17c420);}function hi(){var _0x2da467=_0xfb3c;
    console[_0x2da467(0x91)](_0x2da467(0x8d));}hi();

3 Experiment Settings

We first conduct a systematic analysis of the ability of LLMs to perform code analysis-related tasks. In this section, we introduce the models and datasets used in our experiments.

3.1 LLM Selection

In our analysis, we select five representative, popular LLMs to deploy:

• GPT-3.5-turbo: GPT-3.5 is a set of LLMs offered by OpenAI, and it is the default model used by the popular LLM web tool ChatGPT [64].

• GPT-4: GPT-4 improves on GPT-3.5 and has been reported to be able to handle different types of tasks [24]. It is also one of the most advanced general-purpose LLMs currently available.

• LLaMA-2-13B [77]: LLaMA is a set of foundation LLMs offered by Meta, with model parameter sizes ranging from 7B to 70B. LLaMA-2-13B contains 13B parameters and is reported to outperform many larger models. We select LLaMA-2-13B as a representative medium-size LLM.

• Code-LLaMA-2-13B-Instruct [71]: Code LLaMA is a family of LLMs provided by Meta, using the previously mentioned LLaMA family as foundation models. Code-LLaMA models are fine-tuned on code data and have been reported to outperform other public models targeting code-related tasks.

• StarChat-Beta [78]: StarChat-Beta is an open-source model trained for assisting coding tasks. It has 16B parameters and is capable of handling multiple programming languages.

We select these models because they are widely used and are the current state-of-the-art publicly available LLMs with the capability to perform code tasks.

3.2 Prompt Construction

We interact with the selected LLMs in different steps of our experiments, including instructing the LLMs to analyze code and other measurement processes (discussed later in this section). All the prompts constructed in this paper either involve simple instructions ("Analyze the code and tell me what it does.", etc.) or adhere to empirically proven structures, such as assigning roles [48, 88]. Specific prompts we utilize are provided in Section 3 and Section 5.
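As a concrete, minimal sketch of how such prompts can be issued programmatically (the paper does not prescribe a specific client; the OpenAI Python client and the model name below are illustrative assumptions), one query looks roughly like this:

    from openai import OpenAI

    client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

    def analyze_code(source: str, model: str = "gpt-4") -> str:
        """Send one code sample to a chat model with a simple analysis instruction."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                # Role assignment is one of the empirically useful prompt structures mentioned above.
                {"role": "system", "content": "You are an expert in code analysis."},
                {"role": "user", "content": "Analyze the code and tell me what it does.\n\n" + source},
            ],
        )
        return response.choices[0].message.content

The same call pattern can be reused for the measurement prompts discussed later in this section.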
3.3 Non-Obfuscated Code Dataset

In this study, we aim to systematically evaluate the ability of LLMs to analyze, comprehend, and summarize code. We first construct a non-obfuscated source code dataset. Three languages are selected in this phase of the study: JavaScript, Python, and C. We select JavaScript and Python since they are ranked the top 2 most used languages on Github [40], and we use them as representatives of scripting languages. We select C as it is also ranked high (#9) and can serve as a representative of lower-level programming languages. All three languages we select are widely deployed over the Internet and in various computing systems.

For JavaScript, we employ the combination of:

• The Octane 2.0 benchmark [13], which is a JavaScript benchmark to test JavaScript performance. It contains benchmarks to test typical functionalities of JavaScript users.

• A list of practical JavaScript applications [7], including a password generator, etc.

For Python, we use the Python branch of the CodeSearchNet dataset [46] provided by Google. It contains samples of Python projects crawled from the Internet.

For C, we utilize the combination of:

• A list of classic performance benchmarks, including CoreMark [36], Dhrystone [82], Hint Benchmark [42], Linpack [35], NBench [67], Stream Benchmark [56], TripForce [5] and Whetstone [34].

• A subset of the POJ-104 dataset [9, 59], which consists of C code samples solving 104 different programming problems in an OJ system. The POJ-104 dataset provides multiple different solutions for each programming problem. In this study, for each programming problem, we randomly select one file from the POJ-104 dataset to form the dataset used in this work.

For each code file in our dataset, we develop scripts to automatically remove comments to eliminate their impact on analysis results. Our goal is to let LLMs focus on code, without providing unnecessary natural language hints in code comments.

The histograms regarding the line of code (LoC) distributions of the processed files are shown in Figure 1. We can see that our dataset includes code samples spanning from very short (only a few lines) to very large-scale (over 10k lines). Besides, the code samples in our dataset are from different sources covering different use cases. We believe the coverage of this dataset is sufficient, since we have chosen the most popular programming languages and selected a diverse set of code samples of different scales from different application scenarios.

3.4 Obfuscated Code Dataset

We choose to perform obfuscation on the JavaScript branch of our non-obfuscated code dataset. This choice was driven by the prevalence of code obfuscation practices in the JavaScript language, since JavaScript code is usually visible to web users, making additional obfuscation protection necessary. Besides, malicious JavaScript developers also apply obfuscation techniques to their code to hide the actual intent of their scripts. We use: (1) an open-source tool called JavaScript Obfuscator [12] to generate the obfuscated versions of our JavaScript code; (2) Wobfuscator [70], a state-of-the-art obfuscator that transforms part of the code to WebAssembly [43]. The tested obfuscation methods are listed below:

1. Default obfuscation (DE), which replaces identifier names with meaningless randomly generated strings, simplifies source code to reduce readability, places strings in separate arrays, etc. [12]

2. Dead code injection (DCI), which inserts random unrelated code blocks into the source code [32] in addition to the default scheme.

3. Control flow flattening (CFF), which transforms the structure of a program and hides control flow information [51] in addition to the default scheme.

4. Split string (SS), which splits long strings into shorter chunks in addition to the default scheme to alleviate information leakage from embedded texts [86].

5. Wobfuscator (WSM) [70], which performs cross-language obfuscation on the provided code.

Our chosen obfuscation methods cover classic code obfuscation techniques (the first 4) and a more recently developed obfuscation tool (Wobfuscator). By testing the performance of LLMs on these obfuscated code samples, we will be able to understand how these obfuscation techniques impact the ability of LLMs to understand code.

Besides obfuscating our previously acquired source code, we also combine existing obfuscated code from online resources. We integrate winner code samples from the International Obfuscated C Code Contest (IOCCC) [11], which is a contest that challenges participants to write the most obscure and confusing C code. Instead of asking LLMs to explain the code, we consider adding an extra challenge: generating de-obfuscated code, and seeing if the generated code can be compiled and run. Experiments on this part of our obfuscated code dataset evaluate the performance of LLMs when facing more flexible and non-standard obfuscation techniques.
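The first four schemes can be produced by driving the JavaScript Obfuscator command-line tool from a small batch script. The sketch below illustrates the idea; the flag names follow the tool's publicly documented CLI options, but the exact configuration used to build our dataset is not implied by this sketch and the flags should be treated as assumptions:

    import subprocess
    from pathlib import Path

    # Extra options per scheme; flag names are assumed from the public
    # javascript-obfuscator CLI and applied on top of its defaults (DE).
    SCHEMES = {
        "DE":  [],
        "DCI": ["--dead-code-injection", "true"],
        "CFF": ["--control-flow-flattening", "true"],
        "SS":  ["--split-strings", "true"],
    }

    def obfuscate(src: Path, out_dir: Path) -> None:
        """Write one obfuscated copy of src per scheme under out_dir/<scheme>/."""
        for name, extra_flags in SCHEMES.items():
            dst = out_dir / name / src.name
            dst.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(
                ["javascript-obfuscator", str(src), "--output", str(dst), *extra_flags],
                check=True,
            )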
3.5 Measurement Method

After collecting the code analysis responses from our target LLMs, we start a manual validation process to check the correctness of the analysis results. Since all the source code files we employ only come with a maximum of a few tens of lines of concise comments, and the description labels attached (if there are any) are also succinct, we cannot rely on directly applying certain quantitative metrics (e.g., n-gram [23]) to determine the correctness of generated analysis results, which necessitates a manual validation process.

Figure 1: Histograms regarding LoC statistics of our non-obfuscated source code dataset. Panels: (a) C, (b) Python, (c) JavaScript, (d) Overall; each histogram plots file counts against LoC.

Figure 2: Diagram of our experiment pipeline. Python, C, and JavaScript source code is de-commented to form the source code dataset (RQ1); the JavaScript branch is passed through the obfuscator and combined with IOCCC samples to form the obfuscated code dataset (RQ2).

Ground Truth. We choose to first manually examine the outcomes of GPT-4 [73], since it has better natural language generation and deduction capability. During this manual validation process, four graduate/PhD-level students majoring in computer science or electrical engineering, each with over 5 years of programming experience, read the code and assess GPT-4's generated explanations for accuracy and potential errors. The generated explanation for each code file is labeled 'correct' if the functionality of the code and the description match. Cross-checking is conducted among the students to minimize biases. After this step, we consider those descriptions marked 'correct' as the ground truth and use those descriptions for further comparison among different LLMs.

Comparison Metrics. When evaluating the generated explanation (either non-obfuscated or obfuscated), we utilize the following metrics:

1. Cosine similarity (ranging from 0 to 1), since it is widely used in natural language processing-related tasks and can serve as a coarse-grained metric in this study;

2. A Bert-based semantic similarity score [6] (ranging from 0 to 5) that is more advanced in comparing the similarity of natural language inputs;

3. ChatGPT-based evaluation [24, 31, 88, 92] (True or False). In our evaluation, the GPT-4 model first receives the following instruction:

Instruction: "You are given two descriptions of two code snippets: Description 1 and Description 2. Corresponding code snippets are not available. From the given text, do you think the two descriptions correspond to two code snippets with roughly similar functionalities? Output should be "Yes" if similar, or "No" otherwise, followed by a brief justification of how this is determined."

We then use the generated output to examine the correctness of each model.
4 Results

In this part, we present our results on the non-obfuscated code dataset and the obfuscated code dataset, mainly to answer the two research questions raised in Section 1:

• RQ1: Do LLMs understand source code? (Section 4.1)

• RQ2: Can LLMs comprehend obfuscated code? (Section 4.2)

LLMs are prompted with

"Analyze the code and tell me what it does."

to perform the analysis tasks. We also present the findings from our experiments.

4.1 Results on Non-Obfuscated Code Dataset

GPT-4 Results. The manually verified accuracy results of GPT-4 are shown in Figure 3. We can see from Figure 3 that the accuracy of GPT-4 on the three languages (C, JavaScript, and Python) is high. For all three languages, over 95% of the analyses generated by GPT-4 align with the actual content of the code samples. The overall accuracy rate is 97.4%, indicating that GPT-4 can serve as a powerful code analysis tool. After meticulously analyzing the generated outputs of GPT-4, we also have the following findings.

Figure 3: GPT-4 accuracy, plotted per language (C, JavaScript, Python) and overall.

Finding 1: GPT-4 is able to recognize code snippets from popular open-source software repositories.

It has been reported that LLMs can leak memorized information acquired from training during conversations [27]. In our experiments, we observe this phenomenon multiple times. For example, our dataset contains source code files from the Heron project [10], a real-time analytics system developed by Twitter. Even though the source code has been de-commented and does not contain the keyword "Twitter", GPT-4 still successfully connects the source code with Twitter by stating: "The code is written in Python and is part of the Heron project, a real-time stream processing framework developed by Twitter." at the beginning of its response. In another example, when we feed GPT-4 a local jQuery copy from the Octane benchmark set, GPT-4 also successfully recognizes that the provided code is from the jQuery library: "The provided code is a part of the jQuery library, ..." at the beginning of its response. The same phenomenon is also observed in the responses from other models, e.g., GPT-3.5.

This indicates that the training data of these models contain code samples from popular open-source repositories, and LLMs are able to connect the provided code snippets with their memorized information. This might be helpful when analyzing code, since being able to connect source code with existing software implementations can accelerate the process of understanding the intent of target code snippets. Please note that this finding does not suggest that we are conducting our experiments in a "test on training set" manner. In our experiments, we focus on examining the generated detailed explanation of each function/code sample, which is more likely deduced from the source code and unlikely to be contained in the training set. Our experiments on newer repositories in Section 5 also indicate that LLMs do not need to use memorized information to assist code analysis.

Finding 2: GPT-4 occasionally makes wrong associations.

In our experiments, though GPT-4 has a high analysis success rate, there are still a few code samples that GPT-4 cannot correctly analyze. When using GPT-4 to analyze a Python file related to stock market utilities, GPT-4 incorrectly concludes that the code uses the pandas package, even though it is not imported in the code sample. We believe this is caused by the matplotlib and numpy packages being imported. These packages usually appear together in data analytics scripts that are likely part of the training set. GPT-4 possibly makes an incorrect association based on its memorized contents.

Finding 3: GPT utilizes information provided in identifier names to assist code analysis.

It is intriguing to find that, as a language model, GPT-4 utilizes natural language hints embedded in identifier names to assist code analysis, as humans do. For example, during the analysis of the CoreMark benchmark, GPT-4 recognizes that identifiers named "ee_u8", "ee_u16", etc. are custom data types and concludes that the underlying mapping may vary across different embedded systems. When analyzing the JavaScript Octane benchmark, GPT-4 successfully recognizes that the code samples are used for benchmarking/testing simply from the appearance of a variable name "Benchmark". This suggests that LLMs will be able to generate higher-quality analysis and provide more information on clearly written code.

Analysis Accuracy of Other Selected Models. In Figure 4, we present the cosine similarity score [Figure 4 (a)], Bert-based semantic similarity [Figure 4 (b)], and accuracy of code explanation measured by GPT-4 [Figure 4 (c)]. We can see that GPT-3.5 achieves similar performance to GPT-4, indicating that both models can achieve high performance in code analysis tasks. However, for the LLaMA-series models and StarChat-Beta, the accuracy performance is significantly lower. More findings obtained by manually examining the generated results are provided below.
Figure 4: Similarity score/accuracy of analysis results generated by different models (GPT-3.5, LLaMA-2-13B, Code-LLaMA-2-13B-Instruct, StarChat-Beta), per language and overall: (a) cosine similarity score; (b) Bert-based semantic similarity scores, with the evaluator trained on web data; (c) ChatGPT-measured accuracy results. For the LLaMA series, results are obtained after we manually extract the natural language contents from the generated outputs.

Finding 4: Smaller models in our experiments (LLaMA-2-13B, Code-LLaMA-2-13B-Instruct, and StarChat-Beta) are unable to generate consistent paragraphs of code analysis results, unlike GPT-3.5 and GPT-4.

While performing the designated code analysis tasks, instead of generating consistent paragraphs, we observe that LLaMA-2-13B tries to repeat questions and generates hints to itself. Rather than generating paragraphs to answer the question directly, all three models tend to generate multiple short and inconsistent statements. In many cases, LLaMA-2-13B rephrases the input question into a semantically different one (e.g., it rephrases "Analyze the following piece of code" to "Please tell me what this code does and what are the vulnerabilities in this code."). When handling long code samples, StarChat-Beta, Code-LLaMA-2-13B-Instruct and LLaMA-2-13B tend to simply return part of the input code or return the re-written input code, without trying to provide explanations like GPT-3.5 and GPT-4. These phenomena indicate that these models do not have the ability to handle code analysis-related tasks. Results shown in Figure 4 are similarity results after we manually extract code contents from the generated outputs. Even after we remove the code contents, the performance of these models is still significantly lower than GPT-3.5.

Answer to RQ1: For non-obfuscated code, larger models like GPT-3.5 or GPT-4 have a high probability of generating correct and detailed explanations for input code snippets, while the smaller models, even fine-tuned on code data, fail to generate correct outputs.

4.2 Results on Obfuscated Code Dataset

Obfuscated Code Analysis Capability Evaluation. We first focus on analyzing the obfuscated JavaScript code samples. We use the same prompt as in the previous part to instruct LLMs to analyze code and generate explanations. Although function blocks are rewritten in different ways during the obfuscation process, the functionalities do not change, and we observe that the outputs of LLMs are of very similar formats. Therefore, we keep using the same set of metrics to evaluate the effectiveness of code explanation. Results regarding similarity scores and accuracy of obfuscated code analysis are shown in Figure 5.

From Figure 5 (c), we can see that GPT-4 still shows exceptional analysis accuracy, reaching 87%. GPT-3.5, on the other hand, is impacted by obfuscation techniques, especially when more advanced techniques are applied (control flow flattening, dead code injection, and split string). As for the similarity score metrics shown in Figure 5 (a) and (b), we can see that both models suffer performance degradation, yet the similarity scores of GPT-4 are consistently higher than those of GPT-3.5. These results indicate that when facing obfuscated code analysis, GPT-4 is a better choice. There are other findings, as presented below:
Finding 1: LLaMA-2-13B, Code-LLaMA-2-13B-Instruct and StarChat-Beta are unable to generate meaningful explanations once obfuscation techniques are applied.

In our experiments, we find that all three models are unable to generate meaningful results. In most cases, they simply return part of the obfuscated code that is sent. Therefore, we do not include results from LLaMA-2-13B and Code-LLaMA-2-13B-Instruct in Figure 5. These results again indicate that these relatively small models do not possess the ability to perform code-analysis tasks properly.

Finding 2: Basic obfuscation techniques (such as DE in our paper) only slightly influence the ability of GPT models to perform code analysis.

In our experiments, we found that basic obfuscation techniques like replacing identifier names do not significantly impair the ability of GPT models to produce analysis results. Indeed, both models lose the information contained in identifier names, but they are still able to extract sufficient information from the execution flow and remaining strings and provide relatively reliable explanations.

Finding 3: LLMs are not able to decipher obfuscated code generated by Wobfuscator.

From Figure 5, we can see that the application of Wobfuscator significantly reduces the accuracy of generated code explanations. By analyzing the generated outputs, we observe that the insertion of WebAssembly code severely hinders the understanding of source code, especially as reflected by the Bert-based similarity score [Figure 5 (b)] and the ChatGPT-based score [Figure 5 (c)]. The two GPT models do not decipher the inserted WebAssembly code and hence fail to properly understand the functionality of the provided code.

Finding 4: The ability of GPT models to decipher longer and more complicated obfuscated code is limited.

We observe that GPT models perform worse when facing longer and more complicated code, which is expected. In our experiments, most code explanations that are classified as wrong are generated from longer code samples. Both GPT-3.5 and GPT-4 are able to maintain high accuracy when processing shorter and simpler website JavaScript applications. However, when facing larger code samples with more complicated functionalities, the generated responses are more error-prone, and sentences like "More information is needed" appear more frequently, especially for GPT-3.5.

Figure 5: Similarity score/accuracy of obfuscated code analysis results generated by GPT-3.5 and GPT-4 across obfuscation schemes (DE, CFF, DCI, SS, WSM, and overall): (a) cosine similarity score; (b) Bert-based semantic similarity scores, with the evaluator trained on web data; (c) ChatGPT-measured accuracy results.

De-Obfuscated Code Generation. Besides comparing code explanation results on the obfuscated JavaScript dataset, we also test the ability of LLMs to generate de-obfuscated code. Since it is a challenging task, we only perform this evaluation on the two most powerful models we have, i.e., GPT-3.5
and GPT-4. We select 100 contest-winner projects from IOCCC [11] competitions after the year 2011 and instruct LLMs to perform code de-obfuscation tasks. LLMs are prompted:

"You are an expert in code analysis. De-obfuscate the code and generate a readable new version."

and we directly feed the original code to LLMs. We select IOCCC because a more diverse and flexible set of obfuscation techniques is applied, compared to our automatically generated JavaScript dataset.

We evaluate the ability of LLMs to generate de-obfuscated code from the following 3 aspects:

1. Whether code can be generated;

2. (If generated) whether the generated code can pass compilation;

3. (If compilable) whether the compiled code produces correct outputs.

Corresponding statistical results are presented in Figure 6.
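Aspects 2 and 3 can be checked mechanically. The sketch below illustrates one way to do so with gcc, assuming each sample has a reference input/output pair; the exact harness used for our measurements is not implied by this sketch.

    import subprocess
    from pathlib import Path

    def check_deobfuscated(c_file: Path, stdin_text: str, expected_stdout: str) -> str:
        """Return 'compile-error', 'wrong-output', or 'correct' for one generated sample."""
        exe = c_file.with_suffix("")
        compile_res = subprocess.run(
            ["gcc", str(c_file), "-o", str(exe)],
            capture_output=True, text=True,
        )
        if compile_res.returncode != 0:
            return "compile-error"
        try:
            run_res = subprocess.run(
                [str(exe)], input=stdin_text,
                capture_output=True, text=True, timeout=10,
            )
        except subprocess.TimeoutExpired:
            return "wrong-output"
        return "correct" if run_res.stdout == expected_stdout else "wrong-output"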
Our findings are presented below.

Figure 6: Success rate of GPT-3.5 and GPT-4 on de-obfuscation tasks, at the generate, compile, and run stages.

Finding 5: Both GPT models in our evaluation fall short of generating compilable and runnable de-obfuscated code.

As shown in Figure 6, although GPT-3.5 successfully produces code outputs for all targets, only around 20% of its output code samples are compilable, and only 8% of generated code samples (38% of compiled code samples) produce correct results. The performance of GPT-4 is worse than GPT-3.5. It is able to generate de-obfuscated code for only 76% of the experimented code samples. Only 19% of all experimented code samples (25% of generated code) successfully pass the compilation step, and only 4% of all experimented code samples (21% of compiled code) produce correct results. These statistics indicate that despite the strong ability of both models to produce explanatory analysis results, their performance on de-obfuscated code generation is unsatisfying.

It is interesting to note that GPT-3.5 beats GPT-4 on these metrics in the code de-obfuscation task, despite being a less advanced model.

Finding 6: GPT-4 has a higher probability of associating given code samples with the IOCCC contest. However, the details are incorrect.

By examining responses generated by both models, we find that GPT-4 recognizes that 22 of the provided code samples belong to the IOCCC contest, while GPT-3.5 is only able to identify 2 of them. This indicates that despite IOCCC code samples being included in the training sets of both models, GPT-4 is able to better associate the given input with its knowledge. However, when it tries to identify specific code samples, i.e., provide exact years and author names, the answers are wrong. For example, during our analysis, it identified a code sample that was awarded "Best Handwriting in 2015" as "Best One-Liner in 2015". We then asked "What is the Best One-Liner award receiver of IOCCC 2015?", and got a different but still incorrect answer that seems to be made up by GPT-4. This indicates that the given incorrect information is possibly not caused by misalignment in the training data.

Finding 7: GPT-4 has a higher probability of refusing to perform code-generation tasks.

As shown in Figure 6, GPT-4's performance is significantly worse than GPT-3.5's in this task, especially at the code generation step. We observe that GPT-4 will admit that the code is obfuscated and requires a lot of expert work to decipher, instead of generating code without additional remarks like GPT-3.5. By examining code samples that GPT-4 refuses to generate, we find that there are two factors that can potentially cause the failure of de-obfuscation:

1. Code contains complicated logic.

2. Code is formatted in a special style, unlike the traditional line-by-line organization.

Examples of specially formatted code snippets are shown in Figure 7. To rule out this factor, we select the POJ-104/9/1618.c sample code from the C branch of our non-obfuscated dataset and apply variable substitution and reformatting, resulting in a code snippet that is extremely hard to decipher with manual effort [Figure 8 (a)]. We also select a file from the IOCCC dataset (blakely.c from IOCCC 2011) and reformat the file to a normal format [Figure 8 (b) and (c)].

Figure 7: Samples of specially-formatted code: (a) 2013/cable2/prog.c, (b) 2013/misaka/misaka.c, (c) 2019/giles/prog.c.

Figure 8: Reformatted code: (a) reformatted code sample from the POJ-104 dataset, (b) 2011/blakely/blakely.c, (c) reformatted 2011/blakely/blakely.c.

In the first scenario, despite the obfuscation techniques applied, GPT-4 is still able to generate correct explanations as well as de-obfuscation results. In the second case, GPT-4 continues to refuse to generate de-obfuscated code. This leads to our next finding:

Finding 8: Text-level obfuscation does not influence the abilities of LLMs to perform de-obfuscation.

This also aligns with our findings while conducting the analysis of the JavaScript code, indicating that employing complex logic is the only way to trick LLMs.

Finding 9: GPT-4 generates code with higher readability.

We define readability as improved code formatting and more meaningful identifier names. Higher readability indicates superior code generation capability. Despite worse performance on code generation success rate, we find that code generated by GPT-4 has higher readability compared to code generated by GPT-3.5. GPT-4 is able to generate meaningful identifier names more often, while part of the GPT-3.5-generated code still seems obfuscated. This indicates that GPT-4 is still a better generative model if we take the quality of the generated code into consideration.

Answer to RQ2: Obfuscation techniques can impact the ability of LLMs to generate explanations. Smaller models in our experiments are unable to handle obfuscated code. GPT-3.5 and GPT-4 both drop in analysis accuracy, especially when facing Wobfuscator [70], though GPT-4 still has acceptable and better accuracy on classic obfuscation methods. Without special optimization targeting de-obfuscated code generation, LLMs show a poor ability to generate functional de-obfuscated code.

5 Case Studies

In this section, we conduct case studies and show how the capability of LLMs can be utilized for defensive static analysis. We first select two newly published Github repositories (one benign and one malicious) to test the performance of GPT-series models in malware analysis. We then select the Android msg-stealer virus and the WannaCry ransomware [60] to further explore the performance of LLMs in analyzing decompiled and obfuscated code. Both viruses have been found in the real world. In both cases, code samples are directly obtained from decompilers. Decompiled and obfuscated code have a lot in common: both do not contain meaningful identifier names, and the control flow may not be straightforward. The complete responses of LLMs are contained in our online appendix.

5.1 Github Repository Analysis

In this case study, we select two repositories on Github: (1) KratosKnife [2], which is a set of Python scripts of a botnet system; (2) librarian [3], which is a Chrome extension for bookmark search. The reasons why we choose these two code repositories are: (1) with the comparison of malware and benign-ware, we can carefully observe the outputs and determine if any false alarms arise during the analysis process; (2) librarian is a new codebase (created 01/11/2024) that is guaranteed not to be included in GPT-4's training sets. Therefore, we can examine the ability of GPT-4 to analyze code without concerns about encountering any pre-learned patterns or memorization from GPT-4's training data. In the experiment, to generate explanations, we feed de-commented code files to GPT-4, provide file paths in the prompts, and instruct GPT-4 to analyze the code. The response for the previous file is passed as an assistant message, so that the whole analysis process consists
of continued sessions. After this, we feed the generated results one at a time to GPT-4 and ask "Is there any potentially malicious behavior? Briefly answer 'Yes' or 'No'. If yes, explain why.", instructing GPT-4 to classify malicious and non-malicious code.

In both cases, we observe that GPT-4 is able to correctly explain each function. Additionally, it successfully points out malicious activities inside the code of KratosKnife. Based on the generated analysis, GPT-4 correctly classifies the code files that contain malicious behaviors. There are some interesting observations in this experiment:

Observation 1: GPT-4 is able to point out malicious behaviors by recognizing typical patterns.

For example, when analyzing the code of the malware KratosKnife, GPT-4 is able to point out that the VM environment checking is malicious, since its purpose is to counter malware analyses.

Observation 2: There are potential false alarms in the classification step.

In the codebase of KratosKnife, there exist cryptography utility functions designed to encrypt code files based on an input key provided to the Python script. This encryption process serves to obfuscate code. Notably, these utility functions operate independently of other malicious behaviors associated with the malware. Furthermore, the analysis generated does not indicate any inherently malicious behaviors. However, during the classification process, GPT-4 considers these utility functions malicious since they encrypt files, and falsely accuses them of replacing encrypted files on disk, even though the generated explanation indicates the encrypted code is stored in separate files. This discrepancy underscores the necessity of a thorough examination of classification results produced by LLMs.

5.2 Mobile Platform Virus Analysis

In our first case study for decompiled code, we focus on analyzing a decompiled Android virus from the Internet. We utilize Java Decompiler [1] to decompile an Android virus called msg-stealer, obtained from [4]. As indicated by its name, this virus maliciously steals SMS messages from Android mobile phone users. By using jd-gui, a graphical utility provided by the Java Decompiler, we successfully decompile the .apk file of the virus and generate 908 .java files, containing ∼250k LoC. The experiment process is identical to our previous case study: we feed every decompiled code file to both GPT-3.5 and GPT-4 and get 908 code analysis responses in return. After that, we instruct GPT-4 to read the analyses and give alerts if there are any abnormal or potentially malicious behaviors. The diagram of our experiment pipeline is shown in Figure 9.

Figure 9: Diagram of our Android case study pipeline. The .apk is decompiled into code files 0-907; each file is analyzed with the prompt "Analyze the code."; each resulting analysis is then checked with the prompt "Is there any potentially malicious behavior?", yielding a per-file verdict.
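A minimal sketch of this two-stage pipeline (per-file analysis, then classification over the generated analyses) is shown below, again assuming the OpenAI Python client; the prompts are the ones quoted in this section, while the helper names and looping details are illustrative rather than the exact script used.

    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()

    def ask(messages, model="gpt-4"):
        resp = client.chat.completions.create(model=model, messages=messages)
        return resp.choices[0].message.content

    def triage(decompiled_dir: Path) -> dict:
        """Stage 1: explain each decompiled file. Stage 2: classify each explanation."""
        verdicts = {}
        for java_file in sorted(decompiled_dir.rglob("*.java")):
            source = java_file.read_text(errors="ignore")
            analysis = ask([{"role": "user",
                             "content": "Analyze the code.\n\n" + source}])
            verdict = ask([{"role": "user",
                            "content": "Is there any potentially malicious behavior? "
                                       "Briefly answer 'Yes' or 'No'. If yes, explain why.\n\n"
                                       + analysis}])
            verdicts[str(java_file)] = verdict
        return verdicts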
This virus works in the following way:

1. Upon signing up, it requires users to enter a phone number. It checks whether the entered phone number is from Iran (+98). If so, it requests SMS sending and reading privileges.

2. It then continues to establish a connection using a specific URL.

3. When an SMS message is received, it reads the whole packet of information and performs a string match. Upon a successful match, the virus sends the content of the SMS message to a specific site.

These functionalities are spread across 3 different code files. By examining the analyses generated by GPT-3.5 and GPT-4, we find that both have correctly deciphered these behaviors from the decompiled source code. Part of the responses from GPT-3.5 are shown below:

Response 1: "... the code validates the entered phone number using a regular expression. If the phone number is not valid, a toast message is displayed. Otherwise, the code requests the 'RECEIVE_SMS' permission and checks if the permission is granted." (Step 1)

Response 2: "... it builds a URL string by appending the 'url' and 'info' (user phone number) parameters to a base URL. It then sends a GET request to this URL using the AndroidNetworking library's 'get()' method." (Step 2)

Response 3: "... it retrieves the SMS messages from the intent extras, retrieves the message body, and concatenates all message bodies into a single string. It then checks if the string contains the text "(a trigger string in Persian)" and if so, it updates a shared preference value named "lock" to "off". Finally, it uses the 'connect' class to perform some action using the phone number retrieved from the shared preferences and the concatenated string of SMS messages." (Step 3)

These behaviors are apparently suspicious, since the app requests SMS read/write privileges (sensitive privileges) and communicates with, and sends SMS messages to, a remote site. However, when we feed GPT-4/GPT-3.5 with these analysis results and ask "Is there any potentially malicious behavior?", we have the following interesting observations.

Observation 1: When feeding the analysis results regarding the three steps independently, both GPT-3.5 and GPT-4 fail to recognize the potentially malicious behaviors.

Observation 2: When concatenating the information together and feeding it to the LLMs, only GPT-4 successfully points out the potentially malicious behaviors.

These observations indicate that GPT-4 possesses a better capability to understand natural language inputs and would be a better choice to process the massive analysis results. However, our observations also suggest that when using these LLMs for defensive analysis, it is important to provide enough context; otherwise, LLMs will fail to detect security issues.

5.3 WannaCry Ransomware Analysis

In another case study, we aim for a more realistic target and analyze a desktop virus to showcase whether LLMs can be utilized to perform defensive analysis on decompiled code. We select a malicious software called WannaCry (also named WanaCrypt0r) [58], a ransomware that caused shockingly devastating damage to various organizations and personal users in 2017. The WannaCry ransomware attack uses the EternalBlue exploit to compromise the Windows Server Message Block protocol (SMBv1) and propagates itself through TCP port 445 [49, 60]. Once residing on a computer, it encrypts user files and demands a ransom paid in bitcoins. It is reported that WannaCry ransomware has impacted hospitals [62] and large companies including Nissan [33], TSMC [63], etc.

In this case study, we directly feed the decompiled code file of WannaCry [8], generated by an open-source decompiler RetDec [14], to LLMs. We aim to analyze the behaviors of WannaCry and test the ability of LLMs to understand and analyze code. The decompiled code file contains around 11k lines of C code and automatically generated comments. The decompiled C code does not include meaningful function and variable names. It has identifier names such as function_401000 and g33 that do not provide useful contextual information. The automatically generated comments only specify memory addresses of functions and branches, as well as a list of functions that are statically or dynamically linked.

To feed the generated code to LLMs, we input the generated code in an interactive manner. In the first prompt, we state that we will input a code file for the model to analyze, and that it may consist of multiple chunks of code due to the length restrictions of the input. For shorter functions, we copy and paste 3-5 functions in one round of query and obtain the response of analysis. For longer functions whose length exceeds the input length limit, we partition the body of the function into multiple parts, each consisting of 200-500 lines of code. A sentence is first provided to the LLM, stating that this is a long function and will be input as multiple chunks of code. Starting from the second chunk of code, we state that "the following code snippet is the remaining part of the previous function". After the last partition of code, we manually add a comment "// function ends" to indicate the end of the function. No other manual operation is involved during the experiment. We select GPT-3.5 in this experiment, since it is one of the most easily accessible and most popular chatbot-based LLMs.
Response 5: “... From the provided data, it’s challenging sponse 5, it simply generates generic descriptions. We suspect
to pinpoint the exact purpose of the application, but that GPT-3.5 learned the strategy to conservatively provide
it seems like a complex program that could serve as information to improve the score of outputs during training.
a service or a daemon on Windows, handling network Summary. In this case study, the analysis results of Wan-
communication, possibly acting as a server or a service naCry show that GPT-3.5 is able to capture the most impor-
manager, with capabilities for file handling and other tant behaviors of this ransomware and successfully provide a
utility tasks. However, this is a speculative overview security alert (Response 2). Despite the observed weaknesses,
based on the given context, and a more detailed analysis we confirm that LLM models like GPT-3.5 are able to ex-
would be required to confirm this.” tract important information from decompiled code and raise
security concerns if there are any. This experiment shows
From the findings presented above, we found that GPT-3.5 LLMs’ superior ability in performing code analysis and sum-
is capable of analyzing a relatively large-scale decompiled marization, considering that the provided code has extremely
c codebase with low readability. It correctly extracts critical low readability and there is no extra context. Though at this
behaviors of our target ransomware, including: point, LLMs cannot replace expert human beings, they can
still serve as an initial code analysis assistant to accelerate
• Checking connectivity to port 445 (Response 1); the reverse engineering of software and aid defensive code
analysis.
• Creating service named mssecsvc2.0 (Response 2);

• Querying a special url for “alive” check (Response 3). 6 Discussion


These critical findings partially align with publicly available analyses of WannaCry [57], as these are also considered important behavioral watermarks of the WannaCry ransomware. However, there are differences indicating that the analysis is not fully correct. For example, according to the technical report provided by Microsoft [57], the purpose of initiating the mssecsvc2.0 service is to exploit the SMB vulnerability, not simply to disguise itself. Nevertheless, considering the lack of context information and the fact that GPT-3.5 correctly raises security concerns, this still demonstrates the capability of LLMs in defensive analysis.

However, there are other observed weaknesses of GPT-3.5:

Observation 1: Code address labels are ignored and not remembered.

In Response 4, GPT-3.5 replies that the code for two labels is not provided. Yet these labels actually reside in the same function body and were provided to GPT-3.5 a few rounds of queries earlier, because the excessively long function body had to be split across queries. Looking back at the dialog record, we find that the labels are also not mentioned in the response at the corresponding round of query. This indicates that the labels are ignored and not remembered during later processing. The inability to capture and remember important code information will hinder the ability of LLMs to analyze long code bodies with more complicated context, considering the rather strict input limit of each query.
Observation 2: GPT-3.5 is conservative in providing conclusive remarks.

During our experiment, GPT-3.5 repeatedly mentions the need for more context and its inability to draw a conclusion about the intent of the provided code. In the conclusive Response 5, it simply generates generic descriptions. We suspect that GPT-3.5 learned this strategy of providing information conservatively to improve the score of its outputs during training.

Summary. In this case study, the analysis results of WannaCry show that GPT-3.5 is able to capture the most important behaviors of this ransomware and successfully provide a security alert (Response 2). Despite the observed weaknesses, we confirm that LLM models like GPT-3.5 are able to extract important information from decompiled code and raise security concerns if there are any. This experiment shows LLMs' strong ability in performing code analysis and summarization, considering that the provided code has extremely low readability and there is no extra context. Though at this point LLMs cannot replace human experts, they can still serve as an initial code analysis assistant to accelerate the reverse engineering of software and aid defensive code analysis.

6 Discussion

6.1 Using LLMs for Code Analysis

As shown in our analysis results, more advanced LLMs (e.g., GPT-3.5, GPT-4) have a higher success rate of generating explanations for input code samples. However, as found in our analysis, smaller models perform poorly, and the results are not always reliable even for larger LLMs, especially when complicated code obfuscation is involved. Smaller general-purpose LLMs, even when specifically fine-tuned for coding tasks, struggle to produce meaningful results. These findings suggest that, at this point, utilizing larger LLMs is still the safer choice for code analysis-related tasks.

6.2 Future Work

This work sheds light on some possible future directions in this area. First, there is a gap in the literature regarding obfuscated code datasets. Most works only focus on the ability to analyze normal code, without integrating obfuscation techniques. Constructing such a large dataset would enable LLM users to quickly fine-tune pre-trained LLMs specifically for code analysis tasks. Second, in this paper we report several interesting observations regarding the memorization phenomena of LLMs [27]; the underlying mechanisms remain to be explored. Third, regarding similarity metrics, the n-gram algorithm-based metric has been reported to be suboptimal in capturing semantic similarities [88], and during our experiments the newly employed ChatGPT-based method [24, 31, 88, 92] is also not reliable and requires manual work to calibrate, especially when code snippets and natural language are combined in the inputs. We believe constructing a more sophisticated metric that captures the essence of code analysis results is non-trivial and could be one research direction.
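As a point of reference for the metric discussion above, n-gram overlap scores compare a candidate explanation against a reference purely by counting shared token sequences. A generic form (a sketch of this family of metrics, not necessarily the exact variant used in prior work) is the modified n-gram precision

    p_n = \frac{\sum_{g \in G_n(\hat{y})} \min\big(c_{\hat{y}}(g),\, c_{y}(g)\big)}{\sum_{g \in G_n(\hat{y})} c_{\hat{y}}(g)},

where G_n(\hat{y}) is the set of distinct n-grams in the candidate \hat{y}, and c_{\hat{y}}(g) and c_{y}(g) count occurrences of g in the candidate and the reference y. Because the score rewards only exact surface overlap, two explanations that paraphrase the same behavior with different wording can receive a low score, which is one reason such metrics struggle to capture semantic similarity.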
7 Related Work

Code Analysis. Code analysis is crucial in modern software development to identify vulnerabilities and ensure code quality. The significant progress in artificial intelligence (AI) motivates researchers to adopt advanced AI models for more effective code analysis. Mou et al. [59] proposed using CNNs to analyze syntax trees of programs. Feng et al. [39] target vulnerability detection by collecting features from extracted syntax trees and employing a Bi-GRU network to identify bugs in general code. The development of natural language processing techniques, especially transformers and LLMs, significantly boosts this area. Chen et al. [30] employ a transformer-based method to automatically predict variable names and types from decompiler outputs, thus dramatically improving the readability of decompiler-generated code. Similarly, Xu et al. [87] propose to use LLMs to assist in handling decompilation tasks, showing that an iterative algorithm with multiple LLM queries can improve decompilation results.

LLM Evaluation. As LLMs have become prevalent in recent years, there is a surging number of review and analysis papers focusing on LLMs' capabilities. Bubeck et al. [24] analyze the capabilities of GPT-4 from different aspects, including vision, coding, and mathematics-related tasks, and discuss the societal impacts of such LLMs. A recent survey [20] focuses on the security aspects of LLMs, summarizes several attack and defense works on LLMs, and points out future directions of research in this field. Regarding the analysis and evaluation of code-related capabilities of LLMs, Chen et al. [29] evaluate LLMs trained on code; however, their focus is primarily on code generation. The work closest to ours is [88], which analyzes the ability of instruction-tuned and fine-tuned LLMs to perform code comprehension and generation tasks. However, code-obfuscation-related tasks, including obfuscated code comprehension and de-obfuscated code generation, are not included.
8 Conclusion

In this paper, we have conducted a thorough evaluation of the code analysis capabilities of popular Large Language Models (LLMs). Our results reveal that the larger LLMs, specifically those from the GPT series, exhibit impressive performance in code analysis tasks. On the other hand, the smaller models from the LLaMA family evaluated in this paper demonstrate unsatisfying performance in these same tasks. When it comes to analyzing obfuscated code, the GPT-series LLMs still produce reasonably useful results for explanation-related tasks; however, they do encounter limitations in providing de-obfuscated code. Our study also uncovers intriguing findings such as LLM memorization phenomena. Our research highlights that, at the present stage, LLMs demonstrate considerable potential in this field. However, there are still many unexplored ways to optimize their performance.

Acknowledgments

We would like to thank the authors of Wobfuscator [70] for sharing their obfuscation tool. We extend our gratitude to all anonymous reviewers and our shepherd for the valuable feedback they have provided.

References

[1] Java decompiler. http://java-decompiler.github.io/.

[2] Kratosknife. https://github.com/PushpenderIndia/KratosKnife.

[3] librarian. https://github.com/oto-labs/librarian/tree/main.

[4] Not so boring android malware. https://maldroid.github.io/android-malware-samples//.

[5] tripforce. https://github.com/microsounds/tripforce, 2016.

[6] semantic-text-similarity. https://github.com/AndriyMulyar/semantic-text-similarity, 2019.

[7] js-apps. https://github.com/jaiimeriios/js-apps, 2022.

[8] Wannacry/worm.c. https://github.com/xcp3r/WannaCry/blob/main/worm.c, 2022.

[9] Codexglue/code-code/clone-detection-poj-104/. https://github.com/microsoft/CodeXGLUE/blob/main/Code-Code/Clone-detection-POJ-104/README.md, 2023.

[10] incubator/heron. https://github.com/apache/incubator-heron, 2023.

[11] The international obfuscated c code contest. https://www.ioccc.org/, 2023.

[12] Javascript obfuscator tool. https://obfuscator.io/, 2023.

[13] Octane 2.0. https://chromium.github.io/octane/, 2023.

[14] Retdec. https://github.com/avast/retdec, 2023.
[15] ACM. Acm policy on authorship. https://www.acm.org/publications/policies/new-acm-policy-on-authorship, 2023.

[16] Toufique Ahmed and Premkumar Devanbu. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE '22, pages 1–5, New York, NY, USA, January 2023. Association for Computing Machinery.

[17] Dogu Araci. Finbert: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063, 2019.

[18] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B. Ashok, and Shashank Shet. CodePlan: Repository-level Coding using LLMs and Planning, September 2023.

[19] Shraddha Barke, Michael B. James, and Nadia Polikarpova. Grounded Copilot: How Programmers Interact with Code-Generating Models. Proceedings of the ACM on Programming Languages, 7(OOPSLA1):78:85–78:111, April 2023.

[20] Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, et al. Identifying and mitigating the security risks of generative ai. arXiv preprint arXiv:2308.14840, 2023.

[21] David Binkley. Source code analysis: A road map. Future of Software Engineering (FOSE'07), pages 104–119, 2007.

[22] Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. Taking Flight with Copilot. Communications of the ACM, 66(6):56–62, May 2023.

[23] Peter F Brown, Vincent J Della Pietra, Peter V Desouza, Jennifer C Lai, and Robert L Mercer. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–480, 1992.

[24] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.

[25] Gerardo Canfora, Massimiliano Di Penta, and Luigi Cerulo. Achievements and challenges in software reverse engineering. Communications of the ACM, 54(4):142–151, 2011.

[26] Sicong Cao, Xiaobing Sun, Lili Bo, Ying Wei, and Bin Li. Bgnn4vd: Constructing bidirectional graph neural-network for vulnerability detection. Information and Software Technology, 136:106576, 2021.

[27] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.

[28] Marco Cascella, Jonathan Montomoli, Valentina Bellini, and Elena Bignami. Evaluating the feasibility of chatgpt in healthcare: an analysis of multiple clinical and research scenarios. Journal of Medical Systems, 47(1):33, 2023.

[29] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[30] Qibin Chen, Jeremy Lacomis, Edward J Schwartz, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. Augmenting decompiler output with learned variable names and types. In 31st USENIX Security Symposium (USENIX Security 22), pages 4327–4343, 2022.

[31] Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937, 2023.

[32] Mihai Christodorescu and Somesh Jha. Static analysis of executables to detect malicious patterns. In 12th USENIX Security Symposium (USENIX Security 03), 2003.

[33] ChronicleLive. Ransomware cyber attack recap: Nissan confirm they have been hit by hack which crippled NHS. https://www.chroniclelive.co.uk/news/north-east-news/cyber-attack-nhs-latest-news-13029913, May 2017.

[34] Harold J Curnow and Brian A. Wichmann. A synthetic benchmark. The Computer Journal, 19(1):43–49, 1976.

[35] Jack J Dongarra, Piotr Luszczek, and Antoine Petitet. The linpack benchmark: past, present and future. Concurrency and Computation: practice and experience, 15(9):803–820, 2003.

[36] EEMBC. Coremark. https://www.eembc.org/coremark/, 2022.

[37] Abdelrahman Eid. Reverse engineering snapchat (part i): Obfuscation techniques. https://hot3eed.github.io/snap_part1_obfuscations.html.
[38] Ruijie Fang, Ruoyu Zhang, Elahe Hosseini, Anna M Parenteau, Sally Hang, Setareh Rafatirad, Camelia E Hostinar, Mahdi Orooji, and Houman Homayoun. Towards generalized ml model in automated physiological arousal computing: A transfer learning-based domain generalization approach. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2577–2584. IEEE, 2022.

[39] Hantao Feng, Xiaotong Fu, Hongyu Sun, He Wang, and Yuqing Zhang. Efficient vulnerability detection based on abstract syntax tree and deep learning. In IEEE INFOCOM 2020-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 722–727. IEEE, 2020.

[40] Github. The top programming languages. https://octoverse.github.com/2022/top-programming-languages, 2022.

[41] Github. Github copilot. https://github.com/features/copilot, 2023.

[42] John L Gustafson and Quinn O Snell. Hint: A new way to measure computer performance. In Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, volume 2, pages 392–401. IEEE, 1995.

[43] Andreas Haas, Andreas Rossberg, Derek L Schuff, Ben L Titzer, Michael Holman, Dan Gohman, Luke Wagner, Alon Zakai, and JF Bastien. Bringing the web up to speed with webassembly. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 185–200, 2017.

[44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[45] Elahe Hosseini, Ruijie Fang, Ruoyu Zhang, Chen-Nee Chuah, Mahdi Orooji, Soheil Rafatirad, Setareh Rafatirad, and Houman Homayoun. Convolution neural network for pain intensity assessment from facial expression. In 2022 44th annual international conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 2697–2702. IEEE, 2022.

[46] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.

[47] IEEE. Submission and peer review policies. https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/submission-and-peer-review-policies/, 2023.

[48] Urszula Jessen, Michal Sroka, and Dirk Fahland. Chit-chat or deep talk: Prompt engineering for process mining. arXiv preprint arXiv:2307.09909, 2023.

[49] Da-Yu Kao and Shou-Ching Hsiao. The dynamic analysis of wannacry ransomware. In 2018 20th International conference on advanced communication technology (ICACT), pages 159–166. IEEE, 2018.

[50] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. Chatgpt for good? on opportunities and challenges of large language models for education. Learning and individual differences, 103:102274, 2023.

[51] Tımea László and Ákos Kiss. Obfuscating c++ programs via control flow flattening. Annales Universitatis Scientarum Budapestinensis de Rolando Eötvös Nominatae, Sectio Computatorica, 30(1):3–19, 2009.

[52] Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. Comparing Code Explanations Created by Students and Large Language Models, April 2023.

[53] Yichen Li, Yun Peng, Yintong Huo, and Michael R. Lyu. Enhancing LLM-Based Coding Tools through Native Integration of IDE-Derived Static Context, February 2024.

[54] Binbin Liu, Weijie Feng, Qilong Zheng, Jing Li, and Dongpeng Xu. Software obfuscation with non-linear mixed boolean-arithmetic expressions. In Information and Communications Security: 23rd International Conference, ICICS 2021, Chongqing, China, November 19-21, 2021, Proceedings, Part I 23, pages 276–292. Springer, 2021.

[55] Charles Curtsinger, Benjamin Livshits, Ben Zorn, and Christian Seifert. Zozzle: Low-overhead mostly static javascript malware detection. In USENIX Security Symposium, 2010.

[56] John D McCalpin et al. Memory bandwidth and machine balance in current high performance computers. IEEE computer society technical committee on computer architecture (TCCA) newsletter, 2(19-25), 1995.

[57] Microsoft. Wannacrypt ransomware worm targets out-of-date systems. https://www.microsoft.com/en-us/security/blog/2017/05/12/wannacrypt-ransomware-worm-targets-out-of-date-systems/, May 2017.
[58] Savita Mohurle and Manisha Patil. A brief study of wannacry threat: Ransomware attack 2017. International journal of advanced research in computer science, 8(5):1938–1940, 2017.

[59] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.

[60] National Cybersecurity and Communications Integration Center. What is wannacry/wanacrypt0r?, 2017.

[61] Iulian Neamtiu, Jeffrey S Foster, and Michael Hicks. Understanding source code evolution using abstract syntax tree matching. In Proceedings of the 2005 international workshop on Mining software repositories, pages 1–5, 2005.

[62] Sky News. NHS cyberattack: List of hospitals hit by ransomware strike. https://news.sky.com/story/nhs-cyberattack-full-list-of-organisations-affected-so-far-10874493, May 2017.

[63] The Hacker News. Tsmc chip maker blames wannacry malware for production halt. https://thehackernews.com/2018/08/tsmc-wannacry-ransomware-attack.html, August 2018.

[64] OpenAI. Chatgpt. https://chat.openai.com/, 2023.

[65] OpenAI. Chatgpt can now see, hear, and speak. https://openai.com/blog/chatgpt-can-now-see-hear-and-speak, 2023.

[66] OpenAI. Gpt-4 technical report. 2023.

[67] PetaBridge. NBench. https://nbench.io/, 2023.

[68] James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, Stephen MacNeil, Andrew Peterson, Raymond Pettit, Brent N. Reeves, and Jaromir Savelka. The Robots are Here: Navigating the Generative AI Revolution in Computing Education. In Proceedings of the 2023 Working Group Reports on Innovation and Technology in Computer Science Education, pages 108–159, December 2023.

[69] James Prather, Paul Denny, Juho Leinonen, David H. Smith IV, Brent N. Reeves, Stephen MacNeil, Brett A. Becker, Andrew Luxton-Reilly, Thezyrie Amarouche, and Bailey Kimmel. Interactions with Prompt Problems: A New Way to Teach Programming with Large Language Models, January 2024.

[70] Alan Romano, Daniel Lehmann, Michael Pradel, and Weihang Wang. Wobfuscator: Obfuscating javascript malware via opportunistic translation to webassembly. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1574–1589. IEEE, 2022.

[71] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

[72] Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. Are students representatives of professionals in software engineering experiments? In 2015 IEEE/ACM 37th IEEE international conference on software engineering, volume 1, pages 666–676. IEEE, 2015.

[73] Katharine Sanderson. Gpt-4 is here: what scientists think. Nature, 615(7954):773, 2023.

[74] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.

[75] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.

[76] Eric J Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature medicine, 25(1):44–56, 2019.

[77] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[78] Lewis Tunstall, Nathan Lambert, Nazneen Rajani, Edward Beeching, Teven Le Scao, Leandro von Werra, Sheon Han, Philipp Schmid, and Alexander Rush. Creating a coding assistant with starcoder. Hugging Face Blog, 2023. https://huggingface.co/blog/starchat.

[79] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Chi conference on human factors in computing systems extended abstracts, pages 1–7, 2022.
[80] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[81] Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 172–184, November 2023.

[82] Reinhold P Weicker. Dhrystone: a synthetic systems programming benchmark. Communications of the ACM, 27(10):1013–1030, 1984.

[83] Chunqiu Steven Xia and Lingming Zhang. Conversational Automated Program Repair, January 2023.

[84] Yichen Xie and Alex Aiken. Static detection of security vulnerabilities in scripting languages. In USENIX Security Symposium, volume 15, pages 179–192, 2006.

[85] Hui Xu, Yangfan Zhou, Yu Kang, Fengzhi Tu, and Michael Lyu. Manufacturing resilient bi-opaque predicates against symbolic execution. In 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 666–677. IEEE, 2018.

[86] Wei Xu, Fangfang Zhang, and Sencun Zhu. The power of obfuscation techniques in malicious javascript code: A measurement study. In 2012 7th International Conference on Malicious and Unwanted Software, pages 9–16. IEEE, 2012.

[87] Xiangzhe Xu, Zhuo Zhang, Shiwei Feng, Yapeng Ye, Zian Su, Nan Jiang, Siyuan Cheng, Lin Tan, and Xiangyu Zhang. Lmpa: Improving decompilation by synergy of large language model and program analysis. arXiv preprint arXiv:2306.02546, 2023.

[88] Zhiqiang Yuan, Junwei Liu, Qiancheng Zi, Mingwei Liu, Xin Peng, and Yiling Lou. Evaluating instruction-tuned large language models on code comprehension and generation. arXiv preprint arXiv:2308.01240, 2023.

[89] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. A novel neural source code representation based on abstract syntax tree. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 783–794. IEEE, 2019.

[90] Ruoyu Zhang, Ruijie Fang, Chongzhou Fang, Houman Homayoun, and Gozde Goncu Berk. Privee: A wearable for real-time bladder monitoring system. In Adjunct Proceedings of the 2023 ACM International Joint Conference on Pervasive and Ubiquitous Computing & the 2023 ACM International Symposium on Wearable Computing, pages 291–295, 2023.

[91] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.

[92] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

[93] Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.