
deobfuscation-reverse-engineering-obfuscated-code

This paper discusses techniques for automatic deobfuscation of code to aid in reverse engineering obfuscated software, which is often used to enhance software security. The authors demonstrate that many obfuscation methods can be countered through simple static and dynamic analyses, and they explore the strengths and weaknesses of various obfuscation techniques. The research aims to improve understanding of obfuscated code and contribute to the development of more resilient obfuscation strategies.


Deobfuscation

Reverse Engineering Obfuscated Code∗


Sharath K. Udupa, Saumya K. Debray
Department of Computer Science, The University of Arizona, Tucson, AZ 85721, USA.
{sku, debray}@cs.arizona.edu

Matias Madou
Ghent University, St.-Pietersnieuwstraat 41, B-9000 Ghent, Belgium.
[email protected]

Abstract

In recent years, code obfuscation has attracted attention as a low cost approach to improving software security by making it difficult for attackers to understand the inner workings of proprietary software systems. This paper examines techniques for automatic deobfuscation of obfuscated programs, as a step towards reverse engineering such programs. Our results indicate that many of the effects of code obfuscation, designed to increase the difficulty of static analyses, can be defeated using simple combinations of straightforward static and dynamic analyses. Our results have applications to both software engineering and software security. In the context of software engineering, we show how dynamic analyses can be used to enhance reverse engineering, even for code that has been designed to be difficult to reverse engineer. For software security, our results serve as an attack model for code obfuscators, and can help with the development of obfuscation techniques that are more resilient to straightforward reverse engineering.

∗ The work of S. Udupa and S. Debray was supported in part by the National Science Foundation under grants CNS-0410918 and CCR-0113633. The work of M. Madou was supported in part by Ghent University and the Fund for Scientific Research-Flanders (FWO-Flanders).

Proceedings of the 12th Working Conference on Reverse Engineering (WCRE’05)
1095-1350/05 $20.00 © 2005 IEEE

1 Introduction

In recent years, code obfuscation has attracted some attention as a low cost approach to improving software security [4, 7, 8, 21, 22, 29]. The goal of code obfuscation is to make it difficult for an attacker to reverse engineer programs. The idea is to prevent an attacker from understanding the inner workings of a program by making the obfuscated program “too difficult” to understand, that is, by making the task of reverse engineering the program “too expensive” in terms of the resources or time required to do so. Obfuscation has also been used to protect “software watermarks” and fingerprints, which are designed to thwart software piracy [1, 7, 8]. The presumption is that making it difficult for attackers to understand the internal workings of a program prevents them from discovering vulnerabilities in the code, and serves to protect the program owner’s intellectual property.

It is important to note, however, that code obfuscation is merely a technique. Just as it can be used to protect software against attackers, so too it can be used to hide malicious content. For example, certain kinds of sophisticated computer viruses, e.g., polymorphic viruses, have resorted to using obfuscation techniques to prevent detection by virus scanners [27].

This raises two closely related questions. The first question, from a software engineering perspective, is: what sorts of techniques are useful for understanding obfuscated code? For example, suppose we have downloaded, from a web site, a file purporting to be a security patch for some application. Before applying the patch, we may want to verify that the file does not contain any malicious payload. How can we verify this if the contents of the file have been obfuscated? The second question, from a security perspective, is: what are the weaknesses of current code obfuscation techniques, and how can we address them? If our obfuscation schemes are ineffective in thwarting attackers from reverse engineering the code, then they are not only useless, but are in fact worse than useless: they increase the time and space requirements of the program, and can contribute to a false sense of security that keeps other security measures from being deployed. Thus, identifying any weaknesses in current obfuscation schemes by developing and testing attack models can lead to better obfuscation schemes and concomitant improvements in software security.

This paper aims to address the questions raised above, regarding techniques for understanding obfuscated code and the strengths and weaknesses of sophisticated obfuscation algorithms. We describe a suite of code transformations and program analyses that can be used to identify and remove obfuscation code and thereby help reverse engineer obfuscated programs. We use these techniques to examine the resilience of the control flow
flattening obfuscation technique, which has been proposed in the research literature [2, 29] and used in a commercial code obfuscation product by Cloakware [5], against attacks based on combinations of static and dynamic analyses. Our results indicate that from the perspective of reverse engineering, simple dynamic techniques can often be very useful in coping with code obfuscation. From a software security perspective, we show that many obfuscation techniques can be largely neutralized using combinations of simple and well known static and dynamic analyses.

2 Obfuscating Transformations

Conceptually, we can distinguish between two broad classes of obfuscating transformations. The first, surface obfuscation, focuses on obfuscating the concrete syntax of the program. An example of this is changing variable names or renaming different variables in different scopes to the same identifier, as carried out by the “Dotfuscator” tool for obfuscating .NET code [25]. The second, deep obfuscation, attempts to obfuscate the actual structure of the program, e.g., by changing its control flow or data reference behavior [4, 8]. While the former may make it harder for a human to understand the source code, it does nothing to disguise the semantic structure of the program. It therefore has no effect on algorithms used for reverse engineering, such as program slicing, that rely on code structure and semantics rather than the concrete syntax. For example, it is straightforward to undo most of the effects of Dotfuscator-style variable renaming simply by using a parser to resolve variable references using the scope rules of the language and rename variables accordingly. Deep obfuscation, by contrast, changes the actual structure of the program, and therefore affects the efficacy of semantic tools for program analyses and reverse engineering. Space constraints preclude a more detailed elaboration of different kinds of deep obfuscation techniques, but the interested reader is referred to a discussion and more detailed taxonomy by Collberg et al. [9]. For the purposes of this paper, it suffices to note that working around deep obfuscation, which requires reasoning about semantic aspects of the program, is intuitively more difficult than working around surface obfuscation, which is essentially a syntactic issue. This paper is concerned primarily with deep obfuscation techniques that attempt to disguise the control flow logic of a program.

In prior work, we considered the problem of deobfuscating programs that had been subjected to a number of control flow obfuscations based on opaque predicates; we found that for the obfuscations considered (a set of control flow obfuscations implemented in Collberg’s Sandmark obfuscation tool for Java programs [6]), most of the obfuscation could be removed using a combination of fairly straightforward static and dynamic analyses [3]. This paper considers a different approach to control flow obfuscation, taken from Chenxi Wang’s dissertation [29, 30]. This choice is motivated by three factors. First, based on our experiments, it seems more difficult to break than those we had considered earlier [3]. Second, this approach has been considered by other researchers as well [2], and its resilience is therefore of interest to the research community. Finally, it is a key component of an industrial obfuscation tool by Cloakware Inc. [5].

This section describes the basic control flow obfuscation technique as well as two enhancements that aim to make the basic approach harder to break.

2.1 Basic Control Flow Flattening

Control flow flattening aims to obscure the control flow logic of a program by “flattening” the control flow graph so that all basic blocks appear to have the same set of predecessors and successors. The actual control flow during execution is guided by a dispatcher variable. At runtime, each basic block assigns to this dispatcher variable a value indicating which basic block should be executed next. A switch block then uses the dispatcher variable to jump indirectly, through a jump table, to the intended control flow successor.

    int f(int i, int j)
    {
        int a = 1;
        if (i < j) {
            a = j;
        }
        else
            do {
                a *= i--;
            } while (i > 0);
        return a;
    }

    Control flow graph of f:
      A: a = 1; i < j?                  Y → B, N → C
      B: a = j                          → D
      C: a = a*i; i = i-1; i > 0?       Y → C, N → D
      D: return a

Figure 1: An example program and its control flow graph

    Init: x = 0
    S:    switch (x)                    0 → A, 1 → B, 2 → C, 3 → D
      A: a = 1; x = i<j ? 1 : 2         → S
      B: a = j; x = 3                   → S
      C: a = a*i; i = i-1; x = i>0 ? 2 : 3   → S
      D: return a

Figure 2: Control flow graph after basic flattening

As an example, consider the program shown in Figure 1. Basic control flow flattening of this program results in the control flow graph shown in Figure 2, where S is
the switch block and x the dispatcher variable.¹ The initial assignment to the dispatcher variable x in the block Init is intended to route control to A, the original entry block of f(), when control first enters the function; after this, control flow is guided by assignments to x in the various basic blocks.

¹ Strictly speaking, Figure 2 is slightly inaccurate in that it shows that the control flow from basic blocks A, B, and C comes together into a single block, at the bottom of the picture, from which it then branches to the top of the switch block S. In practice, control would go directly from each of A, B, and C to the top of S. We draw it as shown to reduce the clutter of control flow edges and bring out the essential logic underlying the transformation. This becomes especially important when we consider enhancements to the basic transformation, as illustrated in Figs. 3 and 4.

2.2 Enhancement I: Interprocedural Data Flow

In the basic control flow flattening transformation discussed in Section 2.1, the values assigned to the dispatch variable are available within the function itself. Because of this, while the control flow behavior of the obfuscated code is not obvious, it can be reconstructed by examining the constants being assigned to the dispatch variable. This, in turn, requires only intra-procedural analysis.

The resilience of the obfuscation technique can be improved using interprocedural information passing. The idea is to use a global array to pass the dispatch variable values. At each call site to the function, these values are written into the global array starting at some random offset within the array (appropriately adjusted to avoid buffer overflows). The offset so chosen may be different at different call sites for the function, and is passed to the obfuscated callee either as a global or as an argument. The obfuscated code then assigns values to the dispatch variable from the global array. Neither the actual locations accessed nor the contents of these locations are constant values, and neither is evident from examining the obfuscated code of the callee. The code resulting from applying this obfuscation to the program in Figure 1 is illustrated in Figure 3.

    int A[...];   /* global array of indices */
    int w;        /* offset into array A */

    call_site_1:            call_site_2:
      w = random 1            w = random 2
      A[w] = 3                A[w] = 3
      A[w+1] = 1              A[w+1] = 1
      A[w+2] = 2              A[w+2] = 2
      . . .                   . . .
      call f(i, j)            call f(i, j)

    f:
    Init: x = 0
    S:    switch (x)                    0 → A, 1 → B, 2 → C, 3 → D
      A: a = 1; x = i<j ? A[w+1] : A[w+2]
      B: a = j; x = A[w]
      C: a = a*i; i = i-1; x = i>0 ? A[w+2] : A[w]
      D: return a

Figure 3: Enhancing flattening with Interprocedural Data Flow

    [Control flow graph of f after adding artificial blocks and pointers. Init sets x = 0, and f declares int a, b, c, *p, *q. The switch block S dispatches on x over cases 0 through 9 to the original blocks A, B, C, and D and to the artificial blocks. The artificial blocks store constants into b and c through the pointers p and q and then assign a constant to x. The original blocks load the dispatch value from b and c: A uses x = i<j ? b : c, B uses x = b, and C uses x = i>0 ? b : c.]

Figure 4: Enhancing flattening with artificial blocks and pointers

2.3 Enhancement II: Artificial Blocks and Pointers

The obfuscation technique detailed above can be extended by adding artificial basic blocks to the control flow graph. Some of these artificial blocks are never executed, but this is difficult to determine by a static examination of the program because of the dynamically computed indirect branch targets in the obfuscated code. We then add indirect loads and stores, through pointers,
into these unreachable blocks. These have the effect of confusing static analyses about the possible values taken on by the dispatch variable. Figure 4 shows the result of applying this to the program of Figure 1.

In our implementation, we add two artificial basic blocks corresponding to each block in the original function: one of these blocks is actually executed at runtime, while the other is simply a decoy added to mislead static analysis. Given a block B in the original program, let the corresponding artificial block that gets executed be denoted by B′, and the decoy artificial block be B″. Indirect assignments through pointers are added to both these artificial blocks. However, only the assignments in the block B′ set the dispatch variable to the appropriate values so as to give the right control flow during execution; the decoy block B″, by contrast, sets the dispatch variable to other values, so as to give a misleading picture of control flow. In the original block B, the value of the dispatch variable that gets loaded is that previously assigned in the artificial block B′. Hiding the starting value of the switch variable makes it harder for a static analyzer to deduce which blocks are executed and hence find out the valid definitions of the switch variable.

3 Deobfuscation

This section describes a number of analyses and program transformations that we have found useful for reverse engineering obfuscated code.

3.1 Cloning

Many obfuscation techniques rely on introducing spurious execution paths into the program to thwart static program analyses [4, 8]. These paths can never be taken at runtime, but cause bogus information to be propagated along them during program analyses, thereby reducing the precision of information so obtained and making it harder to understand the program logic. This is illustrated in Figure 5(a), where information is propagated between basic blocks A and B along the “actual” control flow path 1 as well as the spurious control flow path 2, the latter having been introduced by the obfuscator. The bogus data flow information propagated along 2 then has the effect of introducing imprecision in the results of program analyses at points where execution paths come together. In Figure 5(a), the results of forward dataflow analyses, such as reaching definitions, are tainted at the entry to B, while those of backward analyses, such as liveness analyses, are affected at the exit from A.

    [(a) Original code: block A reaches block B along edges 1 and 2. (b) After cloning: blocks A, B, and the clone B′, with edges 1 and 2 leaving A.]

Figure 5: Code Cloning

One way to address this problem is to clone portions of the program in such a way that the spurious execution paths no longer join the original execution paths and taint the information obtained from analysis. The result of applying cloning to basic block B in Figure 5(a) is shown in Figure 5(b). In this case, this results in improved forward dataflow information available at the entry to B. In this example, however, cloning does not eliminate the spurious control flow edge A → B′, and so does not improve the backward dataflow information available at the exit from A.

This transformation obviously has to be applied judiciously, since otherwise it can cause large increases in code size and further exacerbate the reverse engineering problem. Moreover, since the goal of deobfuscation is to try to identify and remove obfuscation code, this means that in general, cloning has to be applied without knowing, ahead of time, which execution paths are spurious and which are not. One possible approach, in such situations, would be to apply cloning selectively at points where multiple control flow paths join, and where the dataflow information propagated along some paths is significantly less precise than that propagated along others. Alternatively, if we know something about the kind of obfuscation that has been applied, it may be possible to apply cloning in a way that exploits this information. For example, it is relatively straightforward to infer that control flow flattening has been applied, because of the distinctive control flow graphs it produces.

For the purposes of this paper, we use cloning in the context of one of our deobfuscator implementations (see Section 4.1), as illustrated in Figure 6. Consider the obfuscated program fragment shown in Figure 6(a), where the basic blocks A, B, and C all transfer control to a switch-block S. Cloning creates three copies S1, S2, and S3 of the switch-block S, corresponding to the successors A, B, and C respectively. The control flow successors of each of these copies are the control flow successors of the original switch-block, i.e., each of the copies S1, S2, and S3 has an edge to each of the blocks A, B, and C. In the resulting program, shown in Figure 6(b), the dataflow information entering the switch-block S1 is not commingled with that entering the switch-block S2 from B or that entering the switch-block S3 from C.

3.2 Static Path Feasibility Analysis

We use the term static path feasibility analysis to refer to constraint-based static analyses to determine whether an (acyclic) execution path is feasible. Given an acyclic execution path $\pi$ with $\bar{x}$ the set of variables live at entry to $\pi$, the idea is to construct a constraint $C_\pi$ such that the logical formula $(\exists \bar{x})\,C_\pi$ is unsatisfiable only if, for all
possible executions of the program, $\pi$ is never executed. $C_\pi$ is thus a conservative approximation to the effects of the execution of the instructions along $\pi$. If $(\exists \bar{x})\,C_\pi$ can be shown to be unsatisfiable, we can conclude that $\pi$ is unfeasible.

    [(a) Original (obfuscated) code: blocks A, B, and C all transfer control to the switch block S, which branches to A, B, and C. (b) Obfuscated code after cloning: A, B, and C transfer control to the copies S1, S2, and S3 respectively, each of which branches to A, B, and C.]

Figure 6: Code Cloning for Control Flow Flattening

In principle, we can imagine many different ways to construct the constraint $C_\pi$ corresponding to a path $\pi$. For the purposes of this paper, our goal is to take into account the effects of arithmetic operations on the values of variables, effectively obtaining an analysis that resembles constant propagation, but propagates information along a single execution path rather than along all execution paths. To this end, we use linear arithmetic constraints to reason about variable values. The discussion below assumes a low-level program representation, e.g., as three-address code, RTL, or even machine instructions.

Assume that each instruction in the program has a unique name $I_k$. The value of a variable $x$ at the beginning of $\pi$ is denoted by $x_0$, while at intermediate points along the path, the value of $x$ immediately after instruction $I_k$ is denoted by $x_k$. An unknown value is denoted by $\bot$. The constraint $C_\pi$ is constructed as a conjunction of a constraint $C_k$ for each instruction $I_k$ in $\pi$, as follows:

1. Assignment: $I_k \equiv$ ‘x = y’. Then, $C_k \equiv x_k = y_j$, where $I_j$ refers to the most recent instruction in $\pi$ that defined $y$ ($j = 0$ if there is no definition of $y$ in $\pi$ before $I_k$).

2. Arithmetic: $I_k \equiv$ ‘x = y ⊕ z’ for some operation ⊕, where $I_i$ and $I_j$ refer to the most recent instructions defining $y$ and $z$ respectively ($i = 0$ if $y$ has not yet been defined along $\pi$, and similarly with $j$). Then, $C_k \equiv x_k = f_\oplus(y_i, z_j)$. Here, $f_\oplus$ expresses the semantics of the operation ⊕. If the semantics of ⊕ is not known to the analyzer, or if either $y_i = \bot$ or $z_j = \bot$, then $C_k \equiv x_k = \bot$.

3. Indirection: Pointers can be modelled at different levels of precision, with a concomitant tradeoff in analysis speed [14]. A full discussion of pointer analysis is beyond the scope of this paper; we require only that the treatment of pointers be conservative, i.e., that the set of possible targets for a pointer during analysis be a superset of the actual set of targets during any execution.

4. Branches: $I_k \equiv$ ‘if e goto L’ for some Boolean expression $e$. In this case,

   $$C_k \equiv \begin{cases} e & \text{if } I_k \text{ is a taken branch in } \pi; \\ \neg e & \text{if } I_k \text{ is not taken in } \pi. \end{cases}$$

   Unconditional branches can be treated as a special case where $e \equiv \mathit{true}$, while multi-way branches, such as those arising from switch statements, can be modelled as a semantically equivalent series of conditional branches.

5. Otherwise, the effects of instruction $I_k$ cannot be modelled by the analyzer. The analysis is aborted in this case, and our system conservatively assumes that $\pi$ is a feasible path.

Once the constraint $C_\pi$ has been constructed in this way, a constraint solver is used to determine its satisfiability. The constraints so generated can be simplified, using path slicing techniques [15], to reduce the cost of testing for satisfiability; our current implementation, which uses the Omega calculator [23] for satisfiability testing, does not do such simplifications.

    B0:  x = 1                  (1)
         y = 2                  (2)
         if (u > 0) goto B1     (3)
    B1:  z = x + y              (4)
    B2:  z = x - y              (5)
    B3:  if (z > 0) goto B5     (6)
    B4
    B5

    Edges: B0 → B1, B0 → B2, B1 → B3, B2 → B3, B3 → B4, B3 → B5

Figure 7: An example of static path feasibility analysis

Figure 7 illustrates the use of constraints for static path feasibility analysis. The parenthetical figures to the right of each basic block serve to identify different instructions. Consider the path $\pi$ = B0→B2→B3→B5. The only relevant live variable at the entry to this path is $u$. Because $\pi$ continues to B2, the branch at instruction (3) is not taken and contributes the negated condition. The corresponding constraint $C_\pi$ is therefore:

$$(\exists u_0)\,[\,x_1 = 1 \wedge y_2 = 2 \wedge \neg(u_0 > 0) \wedge z_5 = x_1 - y_2 \wedge z_5 > 0\,].$$
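The construction rules above can be tried out on the path from Figure 7 with a small sketch. This is not the authors' implementation. It replaces the linear-constraint solver with a much weaker check that tracks only known integer constants along one path, so it can refute a path only when a branch condition evaluates to a definite value. The tuple encoding of instructions and the names `path_is_feasible`, `PATH_B0_B2_B3_B5`, and `PATH_B0_B1_B3_B5` are assumptions made for illustration.

```python
import operator

# Hypothetical instruction encoding for this sketch (not from the paper):
#   ("assign", x, y)              x = y
#   ("arith",  x, y, op, z)       x = y op z
#   ("branch", y, op, z, taken)   if (y op z) goto L; `taken` says whether
#                                 the path under test follows the branch
# Operands are variable names (str) or integer literals.
ARITH = {"+": operator.add, "-": operator.sub, "*": operator.mul}
COMPARE = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def path_is_feasible(path):
    """Return False only when the path is provably infeasible."""
    env = {}  # variable -> known integer; an absent variable is unknown

    def value(operand):
        if isinstance(operand, int):
            return operand
        return env.get(operand)  # None plays the role of the unknown value

    def define(var, val):
        if val is None:
            env.pop(var, None)
        else:
            env[var] = val

    for instr in path:
        kind = instr[0]
        if kind == "assign":
            _, x, y = instr
            define(x, value(y))
        elif kind == "arith":
            _, x, y, op, z = instr
            a, b = value(y), value(z)
            known = a is not None and b is not None and op in ARITH
            define(x, ARITH[op](a, b) if known else None)
        elif kind == "branch":
            _, y, op, z, taken = instr
            a, b = value(y), value(z)
            if a is not None and b is not None and op in COMPARE:
                if COMPARE[op](a, b) != taken:
                    return False  # branch outcome contradicts known values
        else:
            return True  # instruction cannot be modelled: assume feasible
    return True


# Path B0 -> B2 -> B3 -> B5: branch (3) not taken, branch (6) taken.
PATH_B0_B2_B3_B5 = [
    ("assign", "x", 1),
    ("assign", "y", 2),
    ("branch", "u", ">", 0, False),
    ("arith", "z", "x", "-", "y"),
    ("branch", "z", ">", 0, True),
]

# Path B0 -> B1 -> B3 -> B5: branch (3) taken, branch (6) taken.
PATH_B0_B1_B3_B5 = [
    ("assign", "x", 1),
    ("assign", "y", 2),
    ("branch", "u", ">", 0, True),
    ("arith", "z", "x", "+", "y"),
    ("branch", "z", ">", 0, True),
]

print(path_is_feasible(PATH_B0_B2_B3_B5))  # False
print(path_is_feasible(PATH_B0_B1_B3_B5))  # True
```

Along the first path z is known to be 1 - 2 = -1 when instruction (6) is reached, so the taken branch z > 0 cannot hold and the path is rejected. The branch on u is skipped because u is unknown at entry. A real constraint solver could also use such a branch to narrow the range of u, which this constant-only sketch does not attempt.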

It is not difficult to see that this constraint is unsatisfiable, which means that the path π is unfeasible. Note that conventional constant propagation would obtain z = ⊥ at entry to block B3, and thereby conclude that the path π is feasible.

Note that this example could also have been handled by cloning block B3, which would have the effect of preventing the loss of information resulting from the control flow join of edges B1→B3 and B2→B3, after which constant propagation would give the expected results. Thus, path feasibility analysis and cloning can be seen as complementary techniques.

3.3 Combining Static and Dynamic Analyses

Conventional static analyses, such as that of Section 3.2, are inherently conservative,² so the set of edges resulting from purely static deobfuscation techniques is, in general, a superset of the actual set of edges. Conversely, dynamic analyses, such as program tracing or edge profiling, cannot take into account all the possible input values to a program, and therefore are able to observe only a subset of all its possible execution paths. The dual natures of these two approaches to program analysis suggest that we try to combine them. This can be done in two ways. We can begin with an underapproximation to the set of control flow edges obtained via dynamic analysis, then use static analysis to add back some control flow edges that could be taken. Alternatively, we can begin with an overapproximation to the set of control flow edges obtained via static analysis, then use dynamic analysis to remove some control flow edges (or paths) that are not actually taken. In either case, the result may contain either more or fewer edges than the original program, i.e., when we combine static and dynamic analyses the result cannot be guaranteed to be either sound or precise. Nevertheless, from the perspective of reverse engineering and program understanding, such combined analyses can be very useful for overcoming the limitations of purely static and purely dynamic analyses.

² This follows from soundness considerations, which cause static analyses to propagate information along a superset of the execution paths that may actually be taken by a program during execution. This observation need not hold if soundness is sacrificed, as with some recently-proposed analyses [12, 13].

For the work described in this paper, we used a static analysis to improve the results of dynamic analysis by adding back some control flow edges that could possibly be taken. The essential idea behind our approach is based on the following gedankenexperiment: suppose we know, somehow, which control flow edges can actually be taken during execution. Then, we can simply mark these edges and propagate dataflow information only along such marked edges, thereby avoiding the imprecision resulting from propagating information along edges that can never be taken at runtime. Conventional static analyses can then be thought of as the degenerate case where all edges are marked. We can improve on this situation by using dynamic analyses to identify edges that are actually taken during execution and marking only these edges, then propagating dataflow information along these marked edges, as follows:

1. Initially mark only those edges that are identified as taken by the dynamic analysis.

2. Carry out constant propagation on the program, propagating information only along marked edges. If a conditional branch is encountered where only one of the outgoing control flow edges is taken during execution, but where the outcome of the branch cannot be uniquely determined from the constant propagation, add the branch that is not taken during execution into the set of control flow edges that can be taken, and mark it.

In our implementation, the effect of this approach is to prune the dataflow information propagated into switch blocks. As an example, consider the following control flow fragment, where solid arrows represent control flow edges that are taken during execution, while dashed arrows correspond to edges that are never taken:

    [Control flow fragment: a switch block S (switch (x)) whose successor blocks A, B, and C assign x=1, x=2, and x=3 respectively and return to S. The legend distinguishes executed edges from non-executed edges.]

In this example, basic block B is never executed, so the control flow edges S→B and B→S are not marked and have no information propagated along them. The assignment ‘x=2’ in block B is therefore not considered for static analysis; this results in the value 2 not being considered to be a possible value for the variable x at the switch.

4 Experimental Evaluation

We evaluated our ideas using two different binary rewriting systems for the Intel x86 platform: PLTO [24] and DIABLO [10]. We implemented three control flow flattening obfuscations described in Wang’s dissertation and discussed in Section 2 in these tools, and used these to obfuscate ten programs from the SPECint-2000 benchmark suite. While these programs happen to be written in C, our experiments were carried out on program binaries.

Each of our benchmarks was compiled using gcc version 3.2.2, at optimization level -O3, with additional command-line flags to produce statically linked relocatable binaries, and the resulting binaries processed using the obfuscators mentioned above. Functions containing (indirect jumps resulting from) switch statements were not obfuscated because our obfuscators currently are not able to process the resulting control flow. Library functions were also excluded, because in most cases such functions contain nonstandard control flow, e.g., where control jumps from one function into another without using the normal call/return mechanism for inter-procedural control transfers. Static characteristics of these benchmarks are shown in Table 1, which compares the original programs with those resulting from basic control flow flattening.³ Overall, Table 1 shows that our tools obfuscate most user functions in the program (on average, about 88%). As expected, obfuscation causes the number of control flow edges to increase, though the scale of the increase, a factor of roughly 55× to 60×, is larger than we had expected.

³ The differences in the number of functions, basic blocks, and edges reported by PLTO and DIABLO arise partly because they linked in different versions of the standard C library, and partly due to some differences in code transformations carried out by the two tools, e.g., DIABLO carries out some tail-call optimization before obfuscation.

(a) PLTO

              Original              Obfuscated             Effects of Obfuscation
Program       Functions  Edges      Functions  Edges       Fobf/Forig  Eobf/Eorig
              (Forig)    (Eorig)    (Fobf)     (Eobf)
bzip2         42         2,655      30         157,192     0.714       59.21
crafty        104        12,172     89         4,309,502   0.855       352.05
gap           825        43,079     768        1,973,980   0.930       45.82
gcc           1,792      99,516     1,398      8,816,058   0.780       88.59
gzip          73         2,916      59         107,882     0.808       37.00
mcf           19         799        19         16,756      1.000       20.97
parser        180        12,299     174        684,904     0.966       55.69
twolf         165        14,799     157        1,277,410   0.951       86.32
vortex        638        39,229     615        1,969,734   0.963       50.21
vpr           252        8,948      211        310,210     0.837       34.67
GEOM. MEAN:                                                0.876       59.43

(b) DIABLO

              Original              Obfuscated             Effects of Obfuscation
Program       Functions  Edges      Functions  Edges       Fobf/Forig  Eobf/Eorig
              (Forig)    (Eorig)    (Fobf)     (Eobf)
bzip2         35         2,167      34         168,032     0.971       77.54
crafty        102        11,853     86         2,701,600   0.843       227.92
gap           809        44,431     738        2,963,737   0.912       66.70
gcc           1,071      80,168     685        1,801,553   0.639       22.47
gzip          44         1,871      35         99,486      0.795       53.17
mcf           18         605        18         16,908      1.000       27.97
parser        185        10,301     174        714,223     0.940       69.34
twolf         165        12,772     156        1,553,117   0.945       121.60
vortex        620        32,048     599        1,298,439   0.966       40.52
vpr           103        2,305      84         44,288      0.815       19.21
GEOM. MEAN:                                                0.876       55.1

Table 1: Static characteristics of original and obfuscated benchmark programs

Control flow deobfuscation involves deleting spurious control flow edges that have been added by the obfuscator. To evaluate the efficacy of various deobfuscation techniques, therefore, we compare the deobfuscated program $P_{deobf}$ with the original program $P_{orig}$ to classify any errors made by the deobfuscator in deleting control flow edges. In principle, there are two kinds of such errors that can occur: first, $P_{deobf}$ may contain some edge that does not appear in $P_{orig}$; and second, $P_{deobf}$ may not contain some edge that appears in $P_{orig}$. We term the first kind of error overestimation errors (written $\Delta_{over}$), and the second kind underestimation errors (written $\Delta_{under}$):

$$\Delta_{over} = |\{\, e \mid e \in P_{deobf} \text{ and } e \notin P_{orig} \,\}|$$
$$\Delta_{under} = |\{\, e \mid e \notin P_{deobf} \text{ and } e \in P_{orig} \,\}|$$

Since the input to the deobfuscator is the obfuscated program, we express the overestimation and underestimation errors relative to the number of edges in the input obfuscated program.

4.1 Basic Flattening

We first consider programs obfuscated using the basic control flow flattening technique described in Section 2.1. This turns out to be straightforward to deobfuscate using purely static techniques. We considered two different approaches: the DIABLO implementation used cloning followed by conventional constant propagation to disambiguate control flow; the PLTO implementation used constraint-based path feasibility analysis.

The results of deobfuscation are shown in Table 2(a). For each of our implementations, we consider two metrics: the extent of deobfuscation, i.e., the number of
                  PLTO                                   DIABLO
Program           Added    Removed  % Over  % Under      Added    Removed  % Over  % Under
bzip2           154,537    154,537    0.00     0.00    165,865    164,657    0.73     0.00
crafty        4,297,330  4,297,330    0.00     0.00  2,689,747  2,685,374    0.16     0.00
gap           1,930,901  1,930,901    0.00     0.00  2,919,306  2,900,564    0.64     0.00
gcc           8,716,542  8,716,542    0.00     0.00  1,801,553     90,893    0.60     0.00
gzip            104,996    104,996    0.00     0.00     97,615     96,821    0.81     0.00
mcf              15,957     15,957    0.00     0.00     16,303     15,944    2.20     0.00
parser          672,605    672,605    0.00     0.00    703,922    698,700    0.74     0.00
twolf         1,262,611  1,262,611    0.00     0.00  1,540,345  1,533,774    0.43     0.00
vortex        1,930,505  1,930,505    0.00     0.00  1,266,391  1,255,663    0.85     0.00
vpr             301,262    301,262    0.00     0.00     41,983     41,226    1.80     0.00
Geom. mean:                           0.00     0.00                          0.72     0.00
(a) Basic Flattening

Program           Added    Removed  % Over  % Under
bzip2           154,537    116,896   23.95     0.00
crafty        4,297,330  3,051,105   28.92     0.00
gap           1,930,901  1,177,850   38.15     0.00
gcc           8,716,542  4,936,993   42.87     0.00
gzip            104,996     74,111   28.63     0.00
mcf              15,957     15,198    4.50     0.00
parser          672,605    464,098   30.44     0.00
twolf         1,262,611    820,698   34.59     0.00
vortex        1,930,505  1,351,354   29.40     0.00
vpr             301,262    165,695   43.70     0.00
Geom. mean:                          26.89     0.00
(b) Flattening with Interprocedural Data Flow

Program           Added    Removed  % Over  % Under
bzip2           165,639    130,743   21.76     0.56
crafty        4,403,750  3,169,697   28.21     0.01
gap           2,365,955  1,655,983   31.23     0.03
gcc           9,609,646  5,830,097   39.94     0.01
gzip            125,508     97,539   23.69     0.36
mcf              22,335     22,375    1.60     1.69
parser          786,423    590,215   26.09     0.02
twolf         1,401,063    973,949   31.18     0.03
vortex        2,275,709  1,735,787   25.00     0.02
vpr             386,508    259,889   34.21     0.08
Geom. mean:                          21.40     0.06
(c) Flattening with Artificial Blocks and Pointers

Key:
Added: Number of edges added due to obfuscation = Eobf − Eorig (see Table 1).
Removed: Number of edges removed by the deobfuscator.
% Over: Overestimation error relative to number of edges in obfuscated program = ∆over /Eobf .
% Under: Underestimation error relative to number of edges in obfuscated program = ∆under /Eobf .
∆over and ∆under are defined in Section 4.

Table 2: Deobfuscation results
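
The error metrics in the key to Table 2 amount to set differences over control flow edges. The sketch below illustrates this under stated assumptions: each edge is modelled as a (source block, target block) pair, the function name `deobfuscation_errors` is hypothetical, and the toy edge sets are invented for illustration and are not drawn from any benchmark.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[int, int]  # (source block id, target block id); ids are hypothetical


def deobfuscation_errors(
    orig: Set[Edge], obf: Set[Edge], deobf: Set[Edge]
) -> Dict[str, float]:
    """Compare a deobfuscated edge set against the original edge set.

    delta_over  counts edges kept by the deobfuscator that are not in the original.
    delta_under counts original edges that the deobfuscator wrongly deleted.
    Percentages are relative to the edge count of the obfuscated input.
    """
    delta_over = len(deobf - orig)
    delta_under = len(orig - deobf)
    e_obf = len(obf)
    return {
        "added": len(obf) - len(orig),
        "removed": len(obf - deobf),
        "delta_over": delta_over,
        "delta_under": delta_under,
        "pct_over": 100.0 * delta_over / e_obf,
        "pct_under": 100.0 * delta_under / e_obf,
    }


# Toy example: 4 original edges plus 6 spurious edges in the obfuscated program.
orig = {(0, 1), (1, 2), (2, 3), (3, 0)}
spurious = {(0, 2), (0, 3), (1, 0), (1, 3), (2, 0), (2, 1)}
obf = orig | spurious

# A hypothetical deobfuscator output: two spurious edges survive,
# and one genuine edge, (3, 0), was deleted by mistake.
deobf = {(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)}

result = deobfuscation_errors(orig, obf, deobf)
print(result)
```

In this toy case two surviving spurious edges give an overestimation error of 20% and one wrongly deleted edge gives an underestimation error of 10%, both relative to the ten edges of the obfuscated input.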

obfuscation edges that we were able to remove via the deobfuscation process; and precision, which gives the number of overestimated and underestimated edges, as discussed above. It can be seen that the PLTO implementation, using constraint-based path feasibility analysis, is able to recover the original programs completely, without any error. The DIABLO implementation, which uses code cloning followed by constant propagation, is able to remove over 99% of the obfuscation edges. The resulting programs still have a small amount of overestimation errors (0.72% on average), due to edges that did not appear in the original programs. This is to a great extent an artifact of the program transformation used: the cloning process introduces a number of additional control flow edges into the program, and these are not all eliminated by the constant propagation. It turns out that most of them could be eliminated quite easily by an additional phase of liveness analysis and jump-chain collapsing (i.e., where a jump to a jump is replaced by a single jump to the final target). However, we did not do this for the purposes of this paper.

4.2 Flattening with Interprocedural Data Flow

For flattening with interprocedural data flow, we used only the PLTO implementation, using static path feasibility analysis by itself as well as in combination with dynamic execution tracing.

In this case, because our path feasibility analysis is purely intra-procedural in nature, it is unable to achieve any deobfuscation.

We do somewhat better when the static analysis is combined with dynamic tracing. The results are shown in Table 2(b). The resulting deobfuscated programs have some overestimation errors, ranging from 4.5% for the mcf benchmark to 43.7% for vpr, with an overall mean of 26.9%. There is no underestimation error for any of the benchmarks. It is significant that even though the underlying static analysis is purely intra-procedural, and has no deobfuscation effect by itself, the effect of combining it with dynamic analysis is to remove 100 − 26.9 ≈ 73% of the obfuscation edges. Note that the combination of static and dynamic analyses makes a difference only for functions that are actually executed: for functions that are not executed on
our test inputs, we do not consider any edges to be removed, and all of their obfuscation edges are counted towards the overestimation error in Table 2(b).

4.3 Flattening with Artificial Blocks and Pointers

For flattening with dummy blocks and pointers, we again used only the PLTO implementation, using static path feasibility analysis by itself as well as in combination with dynamic execution tracing.

The static path feasibility analysis is unable to deobfuscate this case, because it currently does not handle indirect memory accesses through pointers.

Deobfuscation improves when static and dynamic analyses are combined. The results are shown in Table 2(c). In this table, the values in the column labelled 'Added' differ from the corresponding values in Table 2(b) because the addition of artificial blocks introduces some additional control flow edges in this case. As in the case of flattening with interprocedural data flow, all of the obfuscation edges for functions that are not executed are counted towards the overestimation error. Overestimation error ranges from 1.6% for mcf to just under 40% for gcc, with an overall mean of 21.4%. There is a small amount of underestimation error as well in this case, ranging from 0.01% for crafty and gcc to 1.7% for mcf, with an overall mean of 0.06%. In other words, deobfuscation removes 100 − (21.4 + 0.06) ≈ 78% of the obfuscation edges.

4.4 Deobfuscation Time

The total time taken by the PLTO-based deobfuscator for basic control flow flattening ranges from about 7 seconds for mcf (constraint generation: 2.5 sec; constraint solution: 4.5 sec) to about 21 minutes for gcc (constraint generation: 631.5 sec; constraint solution: 640.1 sec). The times for the two enhanced obfuscations are similar, ranging from 7 sec to 22 mins for the case of interprocedural data flow, and from 8.5 sec to 24 mins for the case of artificial blocks and indirection.

5 Related Work

There does not appear to be a great deal of prior work on reverse engineering obfuscated code. Kapoor [16] and Kruegel et al. [18] discuss algorithms for disassembling obfuscated binaries. Lakhotia and Kumar discuss techniques to handle obfuscated procedure calls in binaries [19, 20]. The focus of these works, as well as the techniques used, is very different from that of the work described here.

A number of researchers have considered the use of dynamic analysis, either by itself or in conjunction with static analysis, for reverse engineering [17, 26, 28]; Stroulia and Systä give an overview [26]. Much of this work focuses on dealing with legacy software, e.g., for determining modularization and semantic clustering or understanding high level design patterns, and for visualizing dynamic system behavior. All of this is fundamentally different from the work described here, which has the dual aims of identifying techniques to help reverse engineer obfuscated code and evaluating the strengths and weaknesses of code obfuscation techniques. In particular, our work focuses on using simple static and dynamic analyses to reverse engineer programs that have specifically been engineered to make reverse engineering difficult.

The idea of combining static and dynamic analyses is discussed by Ernst [11].

6 Conclusions

Code obfuscation has been proposed by a number of researchers as a means to make it difficult to reverse engineer software. Obfuscating transformations typically rely on the theoretical difficulty of reasoning statically about certain kinds of program properties. This paper shows, however, that it may be possible to bypass many of the effects of some obfuscations by a combination of static and dynamic analyses. In particular, we examine the problem of deobfuscating the effects of control flow flattening, a control obfuscation technique proposed in the research literature and used in commercial code obfuscation tools. Our results show that basic control flow flattening can be removed in a relatively straightforward way using purely static techniques, while enhancements to the basic technique can be largely deobfuscated using a combination of static and dynamic techniques.

References

[1] G. Arboit. A method for watermarking Java programs via opaque predicates. In Proc. 5th. International Conference on Electronic Commerce Research (ICECR-5), 2002.

[2] L. Badger, L. D'Anna, D. Kilpatrick, B. Matt, A. Reisse, and T. Van Vleck. Self-protecting mobile agents obfuscation techniques evaluation report. Technical Report No. #01-036, NAI Labs, March 2002.

[3] S. Chandrasekharan. An evaluation of the resilience of control flow obfuscations. Undergraduate Honors Thesis, Dept. of Computer Science, The University of Arizona, Dec. 2003.

[4] W. Cho, I. Lee, and S. Park. Against intelligent tampering: Software tamper resistance by extended control flow obfuscation. In Proc. World Multiconference on Systems, Cybernetics, and Informatics, 2001.

[5] S. Chow, Y. Gu, H. Johnson, and V. A. Zakharov. An approach to the obfuscation of control-flow of
sequential computer programs. In Proc. 4th. Information Security Conference (ISC 2001), Springer LNCS vol. 2000, pages 144–155, 2001.

[6] C. Collberg, G. Myles, and A. Huntwork. Sandmark – a tool for software protection research. IEEE Security and Privacy, 1(4):40–49, July/August 2003.

[7] C. Collberg and C. Thomborson. Software watermarking: Models and dynamic embeddings. In Proc. 26th. ACM Symposium on Principles of Programming Languages, pages 311–324, January 1999.

[8] C. Collberg and C. Thomborson. Watermarking, tamper-proofing, and obfuscation – tools for software protection. IEEE Transactions on Software Engineering, 28(8), August 2002.

[9] C. Collberg, C. Thomborson, and D. Low. A taxonomy of obfuscating transformations. Technical Report 148, Department of Computer Sciences, The University of Auckland, July 1997.

[10] B. De Bus, B. De Sutter, L. Van Put, D. Chanet, and K. De Bosschere. Link-time optimization of ARM binaries. In Proc. 2004 ACM Conf. on Languages, Compilers, and Tools for Embedded Systems (LCTES'04), pages 211–220, 7 2004.

[11] Michael D. Ernst. Static and dynamic analysis: Synergy and duality. In WODA 2003: ICSE Workshop on Dynamic Analysis, Portland, OR, pages 24–27, May 2003.

[12] D. Evans and D. Larochelle. Improving security using extensible lightweight static analysis. IEEE Software, 19(1):42–51, January/February 2002.

[13] S. Guyer and K. McKinley. Finding your cronies: Static analysis for dynamic object colocation. In Proc. OOPSLA'04, pages 237–250, October 2004.

[14] M. Hind and A. Pioli. Which pointer analysis should I use? In Proc. 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 113–123, 2000.

[15] R. Jhala and R. Majumdar. Path slicing. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 38–47, June 2005.

[16] A. Kapoor. An approach towards disassembly of malicious binaries. Master's thesis, University of Louisiana at Lafayette, 2004.

[17] R. Kazman and S. J. Carrière. Playing detective: Reconstructing software architecture from available evidence. Automated Software Engineering: An International Journal, 6(2):107–138, April 1999.

[18] C. Kruegel, W. Robertson, F. Valeur, and G. Vigna. Static disassembly of obfuscated binaries. In Proc. 13th USENIX Security Symposium, August 2004.

[19] E. U. Kumar, A. Kapoor, and A. Lakhotia. DOC – answering the hidden 'call' of a virus. Virus Bulletin, April 2005.

[20] A. Lakhotia and E. U. Kumar. Abstract stack graph to detect obfuscated calls in binaries. In Proc. 4th. IEEE International Workshop on Source Code Analysis and Manipulation, pages 17–26, September 2004.

[21] C. Linn and S.K. Debray. Obfuscation of executable code to improve resistance to static disassembly. In Proc. 10th. ACM Conference on Computer and Communications Security (CCS 2003), pages 290–299, October 2003.

[22] T. Ogiso, Y. Sakabe, M. Soshi, and A. Miyaji. Software obfuscation on a theoretical basis and its implementation. IEEE Trans. Fundamentals, E86-A(1), January 2003.

[23] W. Pugh. The Omega test: a fast and practical integer programming algorithm for dependence analysis. Comm. ACM, 35:102–114, August 1992.

[24] B. Schwarz, S. K. Debray, and G. R. Andrews. PLTO: A link-time optimizer for the Intel IA-32 architecture. In Proc. 2001 Workshop on Binary Translation (WBT-2001), 2001.

[25] Preemptive Solutions. Dotfuscator. www.preemptive.com/products/dotfuscator.

[26] E. Stroulia and T. Systä. Dynamic analysis for reverse engineering and program understanding. ACM SIGAPP Applied Computing Review, 10(1):8–17, 2002.

[27] Symantec Corp. Understanding and managing polymorphic viruses. Technical report, 1996.

[28] T. Systä. Static and Dynamic Reverse Engineering Techniques for Java Software Systems. PhD thesis, Dept. of Computer and Information Sciences, University of Tampere, Finland, 2000.

[29] C. Wang, J. Davidson, J. Hill, and J. Knight. Protection of software-based survivability mechanisms. In Proc. International Conference of Dependable Systems and Networks, July 2001.

[30] Chenxi Wang. A Security Architecture for Survivability Mechanisms. PhD thesis, Department of Computer Science, University of Virginia, October 2000.
