Deobfuscation: Reverse Engineering Obfuscated Code
[Figure 1: An example program and its control flow graph]

[Figure 2: Control flow graph after basic flattening]

This paper evaluates the resilience of the control flow flattening obfuscation technique, which has been proposed in the research literature [2, 29] and used in a commercial code obfuscation product by Cloakware [5], against attacks based on combinations of static and dynamic analyses. Our results indicate that, from the perspective of reverse engineering, simple dynamic techniques can often be very useful in coping with code obfuscation. From a software security perspective, we show that many obfuscation techniques can be largely neutralized using combinations of simple and well-known static and dynamic analyses.

2 Obfuscating Transformations

Conceptually, we can distinguish between two broad classes of obfuscating transformations. The first, surface obfuscation, focuses on obfuscating the concrete syntax of the program. An example of this is changing variable names, or renaming different variables in different scopes to the same identifier, as carried out by the "Dotfuscator" tool for obfuscating .NET code [25]. The second, deep obfuscation, attempts to obfuscate the actual structure of the program, e.g., by changing its control flow or data reference behavior [4, 8]. While the former may make it harder for a human to understand the source code, it does nothing to disguise the semantic structure of the program, and it therefore has no effect on the algorithms used for program analysis and reverse engineering.
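As a small illustration of surface obfuscation (the function and identifiers below are our own, not taken from the Dotfuscator documentation), identifier renaming is a purely syntactic rewrite:

    /* Original source: meaningful identifiers. */
    int order_total(int price, int quantity) {
        int total = price * quantity;
        return total;
    }

    /* Surface-obfuscated variant: identical structure and semantics,
     * only the identifiers have been renamed. */
    int a(int b, int c) {
        int d = b * c;
        return d;
    }

Both versions have the same control flow graph, which is why this kind of transformation does not hinder automated analyses.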
Space constraints preclude a more detailed elaboration of different kinds of deep obfuscation techniques, but the interested reader is referred to a discussion and more detailed taxonomy by Collberg et al. [9]. For the purposes of this paper, it suffices to note that working around deep obfuscation, which requires reasoning about semantic aspects of the program, is intuitively more difficult than working around surface obfuscation, which is essentially a syntactic issue. This paper is concerned primarily with deep obfuscation techniques that attempt to disguise the control flow logic of a program.

In prior work, we considered the problem of deobfuscating programs that had been subjected to a number of control flow obfuscations based on opaque predicates; we found that for the obfuscations considered (a set of control flow obfuscations implemented in Collberg's Sandmark obfuscation tool for Java programs [6]), most of the obfuscation could be removed using a combination of fairly straightforward static and dynamic analyses [3]. This paper considers a different approach to control flow obfuscation, taken from Chenxi Wang's dissertation [29, 30]. This choice is motivated by three factors. First, based on our experiments, it seems more difficult to break than those we had considered earlier [3]. Second, this approach has been considered by other researchers as well [2], and its resilience is therefore of interest to the research community. Finally, it is a key component of an industrial obfuscation tool by Cloakware Inc. [5].

This section describes the basic control flow obfuscation technique as well as two enhancements that aim to make the basic approach harder to break.

2.1 Basic Control Flow Flattening

Control flow flattening aims to obscure the control flow logic of a program by "flattening" the control flow graph so that all basic blocks appear to have the same set of predecessors and successors. The actual control flow during execution is guided by a dispatcher variable: at runtime, each basic block assigns to this dispatcher variable a value indicating which basic block should be executed next. A switch block then uses the dispatcher variable to jump indirectly, through a jump table, to the intended control flow successor.

As an example, consider the program shown in Figure 1. Basic control flow flattening of this program results in the control flow graph shown in Figure 2, where S is the switch block and x the dispatcher variable.¹ The initial assignment to the dispatcher variable x in the block Init is intended to route control to A, the original entry block of f(), when control first enters the function; after that, the assignments to x in the individual basic blocks route control, via the switch block S, to the appropriate successors.

¹ Strictly speaking, Figure 2 is slightly inaccurate in that it shows the control flow from basic blocks A, B, and C coming together into a single block, at the bottom of the picture, from which it then branches to the top of the switch block S. In practice, control would go directly from each of A, B, and C to the top of S. We draw it as shown to reduce the clutter of control flow edges and bring out the essential logic underlying the transformation. This becomes especially important when we consider enhancements to the basic transformation, as illustrated in Figs. 3 and 4.
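To make the transformation concrete, the following C sketch shows a function with the same overall shape as the example of Figure 1, together with a hand-flattened version in the style of Figure 2. The code is reconstructed here purely for illustration and is not a verbatim copy of the figures.

    /* Original (unobfuscated) form of the running example. */
    int f(int i, int j) {
        int a = 1;                    /* block A */
        if (i < j) {
            a = j;                    /* block B */
        } else {
            do {                      /* block C */
                a = a * i;
                i = i - 1;
            } while (i > 0);
        }
        return a;                     /* block D */
    }

    /* Flattened form: x is the dispatcher variable, the switch plays
     * the role of the dispatch block S, and x = 0 in Init routes
     * control to the original entry block A. */
    int f_flat(int i, int j) {
        int a = 0;
        int x = 0;                    /* Init */
        for (;;) {
            switch (x) {              /* S: jump through the jump table */
            case 0:                   /* A */
                a = 1;
                x = (i < j) ? 1 : 2;
                break;
            case 1:                   /* B */
                a = j;
                x = 3;
                break;
            case 2:                   /* C */
                a = a * i;
                i = i - 1;
                x = (i > 0) ? 2 : 3;
                break;
            case 3:                   /* D */
                return a;
            }
        }
    }

After flattening, the only statically apparent successor of each block is the switch block, and every block appears reachable from it; the real successor of each block is encoded solely in the value it assigns to x.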
[Figure 3: Enhancing flattening with interprocedural data flow]

2.2 Enhancement I: Interprocedural Data Flow

In the basic control flow flattening transformation discussed in Section 2.1, the values assigned to the dispatch variable are available within the function itself. Because of this, while the control flow behavior of the obfuscated code is not obvious, it can be reconstructed by examining the constants being assigned to the dispatch variable. This, in turn, requires only intra-procedural analysis.

The resilience of the obfuscation technique can be improved using interprocedural information passing. The idea is to use a global array to pass the dispatch variable values. At each call site of the function, these values are written into the global array starting at some random offset within the array (appropriately adjusted to avoid buffer overflows). The offset so chosen may be different at different call sites of the function, and is passed to the obfuscated callee either as a global or as an argument. The obfuscated code then assigns values to the dispatch variable from the global array. Neither the actual locations accessed nor the contents of these locations are constant values, so they are not evident from an examination of the obfuscated code of the callee. The code resulting from applying this obfuscation to the program in Figure 1 is illustrated in Figure 3.
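The sketch below illustrates this enhancement on the running example. The array name A, its size, the particular offsets, and the dispatch values written at the call site are illustrative assumptions; Figure 3 shows the form actually produced by our obfuscator.

    #include <stdlib.h>

    #define ARR_SIZE 256

    /* Global array used to pass dispatch values from callers to f. */
    static int A[ARR_SIZE];

    /* Flattened callee: the dispatch values are no longer literal
     * constants in the body; they are loaded from A at offsets
     * relative to w, which the call site chose. */
    static int f_obf(int i, int j, int w) {
        int a = 0;
        int x = A[w];                     /* Init: A[w] selects block A */
        for (;;) {
            switch (x) {
            case 0:                       /* A */
                a = 1;
                x = (i < j) ? A[w + 1] : A[w + 2];
                break;
            case 1:                       /* B */
                a = j;
                x = A[w + 3];
                break;
            case 2:                       /* C */
                a = a * i;
                i = i - 1;
                x = (i > 0) ? A[w + 2] : A[w + 3];
                break;
            case 3:                       /* D */
                return a;
            }
        }
    }

    /* Call site: pick a random offset, write the dispatch values
     * there, then call the obfuscated function.  Different call
     * sites may use different offsets. */
    int call_f(int i, int j) {
        int w = rand() % (ARR_SIZE - 4);  /* keep w+3 in bounds */
        A[w]     = 0;                     /* value routing to block A */
        A[w + 1] = 1;                     /* ... to block B */
        A[w + 2] = 2;                     /* ... to block C */
        A[w + 3] = 3;                     /* ... to block D */
        return f_obf(i, j, w);
    }

Since neither w nor the contents of A are constants within f_obf, an intra-procedural examination of the callee alone no longer reveals which switch case each block transfers to.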
2.3 Enhancement II: Artificial Blocks and Pointers

The obfuscation technique detailed above can be extended by adding artificial basic blocks to the control flow graph. Some of these artificial blocks are never executed, but this is difficult to determine by a static examination of the program because of the dynamically computed indirect branch targets in the obfuscated code. We then add indirect loads and stores, through pointers.
ments were not obfuscated because our obfuscators currently are not able to process the resulting control flow. Library functions were also excluded, because in most cases such functions contain nonstandard control flow, e.g., where control jumps from one function into another without using the normal call/return mechanism for inter-procedural control transfers. Static characteristics of these benchmarks are shown in Table 1, which compares the original programs with those resulting from basic control flow flattening.³ Overall, Table 1 shows that our tools obfuscate most user functions in the program (on average, about 88%). As expected, obfuscation causes the number of control flow edges to increase, though the scale of the increase (a factor of roughly 55× to 60×) is larger than we had expected.

³ The differences in the number of functions, basic blocks, and edges reported by PLTO and DIABLO arise partly because they linked in different versions of the standard C library, and partly due to differences in the code transformations carried out by the two tools, e.g., DIABLO carries out some tail-call optimization before obfuscation.

Control flow deobfuscation involves deleting spurious control flow edges that have been added by the obfuscator. To evaluate the efficacy of various deobfuscation techniques, therefore, we compare the deobfuscated program Pdeobf with the original program Porig to classify any errors made by the deobfuscator in deleting control flow edges. In principle, there are two kinds of such errors that can occur: first, Pdeobf may contain some edge that does not appear in Porig; and second, Pdeobf may not contain some edge that appears in Porig. We term the first kind of error overestimation errors (written ∆over), and the second kind underestimation errors (written ∆under):

    ∆over  = |{ e | e ∈ Pdeobf and e ∉ Porig }|
    ∆under = |{ e | e ∉ Pdeobf and e ∈ Porig }|

Since the input to the deobfuscator is the obfuscated program, we express the overestimation and underestimation errors relative to the number of edges in the input obfuscated program.
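The two error measures, and the percentages reported in Table 2, can be computed directly from the edge sets of the two programs. The sketch below does this on a toy edge representation; the struct layout and the sample data are illustrative, not taken from our tools, and the number of edges Eobf is simply supplied as a value (e_obf).

    #include <stdio.h>

    /* A control flow edge, identified here by source and target
     * basic-block ids (an illustrative representation). */
    typedef struct { int from, to; } Edge;

    static int contains(const Edge *set, int n, Edge e) {
        for (int k = 0; k < n; k++)
            if (set[k].from == e.from && set[k].to == e.to)
                return 1;
        return 0;
    }

    /* delta_over  = |{ e : e in Pdeobf and e not in Porig }| */
    static int delta_over(const Edge *deobf, int nd,
                          const Edge *orig, int no) {
        int count = 0;
        for (int k = 0; k < nd; k++)
            if (!contains(orig, no, deobf[k]))
                count++;
        return count;
    }

    /* delta_under = |{ e : e not in Pdeobf and e in Porig }| */
    static int delta_under(const Edge *deobf, int nd,
                           const Edge *orig, int no) {
        int count = 0;
        for (int k = 0; k < no; k++)
            if (!contains(deobf, nd, orig[k]))
                count++;
        return count;
    }

    int main(void) {
        Edge orig[]  = { {0,1}, {1,2}, {2,3} };        /* edges of Porig  */
        Edge deobf[] = { {0,1}, {1,2}, {2,3}, {0,3} }; /* edges of Pdeobf */
        int e_obf = 12;    /* number of edges in the obfuscated program */

        int over  = delta_over(deobf, 4, orig, 3);
        int under = delta_under(deobf, 4, orig, 3);

        /* %Over and %Under are reported relative to Eobf, as in the
         * key of Table 2. */
        printf("%%Over = %.2f  %%Under = %.2f\n",
               100.0 * over / e_obf, 100.0 * under / e_obf);
        return 0;
    }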
4.1 Basic Flattening

We first consider programs obfuscated using the basic control flow flattening technique described in Section 2.1. This turns out to be straightforward to deobfuscate using purely static techniques. We considered two different approaches: the DIABLO implementation used cloning followed by conventional constant propagation to disambiguate control flow; the PLTO implementation used constraint-based path feasibility analysis.
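Neither tool works at the source level, but the core observation that both approaches exploit can be shown with a small sketch: in a basically-flattened function, the values assigned to the dispatch variable are intra-procedural constants, so mapping each constant through the switch's case table pins down each block's real successors. The tables below model the running example and are illustrative only.

    #include <stdio.h>

    #define NBLOCKS 4
    #define NONE   -1

    /* dispatch_consts[b] = constants that block b may assign to x
     * (at most two here, matching the running example). */
    static const int dispatch_consts[NBLOCKS][2] = {
        { 1, 2 },        /* A: x = (i < j) ? 1 : 2 */
        { 3, NONE },     /* B: x = 3               */
        { 2, 3 },        /* C: x = (i > 0) ? 2 : 3 */
        { NONE, NONE },  /* D: returns, assigns nothing to x */
    };

    /* switch_target[v] = block that the switch block S transfers to
     * when x == v (the identity mapping in this example). */
    static const int switch_target[NBLOCKS] = { 0, 1, 2, 3 };

    int main(void) {
        const char *name = "ABCD";
        /* Every constant a block can place in x identifies a unique
         * switch case, hence a direct control flow successor; all other
         * apparent successors through S are spurious and can be deleted. */
        for (int b = 0; b < NBLOCKS; b++)
            for (int k = 0; k < 2; k++)
                if (dispatch_consts[b][k] != NONE)
                    printf("edge %c -> %c\n", name[b],
                           name[switch_target[dispatch_consts[b][k]]]);
        return 0;
    }

Running this recovers exactly the edges of the original control flow graph of Figure 1 (A -> B, A -> C, B -> D, C -> C, C -> D). The real implementations must additionally handle x86 binaries, non-constant definitions, and (in DIABLO's case) the code cloning, which is where the residual errors discussed below come from.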
Table 2(b): Flattening with Interprocedural Data Flow

    Program        Added        Removed     %Over   %Under
    bzip2            154,537      116,896   23.95    0.00
    crafty         4,297,330    3,051,105   28.92    0.00
    gap            1,930,901    1,177,850   38.15    0.00
    gcc            8,716,542    4,936,993   42.87    0.00
    gzip             104,996       74,111   28.63    0.00
    mcf               15,957       15,198    4.50    0.00
    parser           672,605      464,098   30.44    0.00
    twolf          1,262,611      820,698   34.59    0.00
    vortex         1,930,505    1,351,354   29.40    0.00
    vpr              301,262      165,695   43.70    0.00
    Geom. mean:                             26.89    0.00

Table 2(c): Flattening with Artificial Blocks and Pointers

    Program        Added        Removed     %Over   %Under
    bzip2            165,639      130,743   21.76    0.56
    crafty         4,403,750    3,169,697   28.21    0.01
    gap            2,365,955    1,655,983   31.23    0.03
    gcc            9,609,646    5,830,097   39.94    0.01
    gzip             125,508       97,539   23.69    0.36
    mcf               22,335       22,375    1.60    1.69
    parser           786,423      590,215   26.09    0.02
    twolf          1,401,063      973,949   31.18    0.03
    vortex         2,275,709    1,735,787   25.00    0.02
    vpr              386,508      259,889   34.21    0.08
    Geom. mean:                             21.40    0.06

Key:
    Added:   Number of edges added due to obfuscation = Eobf − Eorig (see Table 1).
    Removed: Number of edges removed by the deobfuscator.
    %Over:   Overestimation error relative to the number of edges in the obfuscated program = ∆over / Eobf.
    %Under:  Underestimation error relative to the number of edges in the obfuscated program = ∆under / Eobf.
    ∆over and ∆under are defined in Section 4.
The results of deobfuscation are shown in Table 2(a). For each of our implementations, we consider two metrics: the extent of deobfuscation, i.e., the number of obfuscation edges that we were able to remove via the deobfuscation process; and precision, which gives the number of overestimated and underestimated edges, as discussed above. It can be seen that the PLTO implementation, using constraint-based path feasibility analysis, is able to recover the original programs completely, without any error. The DIABLO implementation, which uses code cloning followed by constant propagation, is able to remove over 99% of the obfuscation edges. The resulting programs still have a small amount of overestimation error (0.72% on average), due to edges that did not appear in the original programs. This is to a great extent an artifact of the program transformation used: the cloning process introduces a number of additional control flow edges into the program, and these are not all eliminated by the constant propagation. It turns out that most of them could be eliminated quite easily by an additional phase of liveness analysis and jump-chain collapsing (i.e., where a jump to a jump is replaced by a single jump to the final target). However, we did not do this for the purposes of this paper.
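Jump-chain collapsing itself is a standard clean-up pass; as noted above, we did not implement it for this paper. The toy sketch below, on an illustrative array-based encoding of unconditional jumps (not our binary-level representation), shows the rewrite it performs.

    #include <stdio.h>

    #define NBLOCKS 6

    /* jump_target[b] is the block that block b unconditionally jumps
     * to, or -1 if b is not a bare jump. */
    static int final_target(const int jump_target[], int b) {
        /* Follow the chain of bare jumps to its end.  (A production
         * implementation would also guard against cycles.) */
        while (jump_target[b] != -1)
            b = jump_target[b];
        return b;
    }

    int main(void) {
        /* Chain: block 0 jumps to 2, 2 jumps to 4, 4 jumps to 5. */
        int jump_target[NBLOCKS] = { 2, -1, 4, -1, 5, -1 };

        /* Replace every jump-to-a-jump by a jump to the final target. */
        for (int b = 0; b < NBLOCKS; b++)
            if (jump_target[b] != -1)
                jump_target[b] = final_target(jump_target, b);

        for (int b = 0; b < NBLOCKS; b++)
            if (jump_target[b] != -1)
                printf("block %d now jumps directly to block %d\n",
                       b, jump_target[b]);
        return 0;
    }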
4.2 Flattening with Interprocedural Data Flow

For flattening with interprocedural data flow, we used only the PLTO implementation, using static path feasibility analysis by itself as well as in combination with dynamic execution tracing.

In this case, because our path feasibility analysis is purely intra-procedural in nature, it is unable to achieve any deobfuscation.

We do somewhat better when the static analysis is combined with dynamic tracing. The results are shown in Table 2(b). The resulting deobfuscated programs have some overestimation errors, ranging from 4.5% for the mcf benchmark to 43.7% for vpr, with an overall mean of 26.9%. There is no underestimation error for any of the benchmarks. It is significant that even though the underlying static analysis is purely intra-procedural, and has no deobfuscation effect by itself, the effect of combining it with dynamic analysis is to remove about 73% (100 − 26.9) of the obfuscation edges. Note that the combination of static and dynamic analyses makes a difference only for functions that are actually executed: for functions that are not executed on
[6] C. Collberg, G. Myles, and A. Huntwork. Sandmark – a tool for software protection research. IEEE Security and Privacy, 1(4):40–49, July/August 2003.

[7] C. Collberg and C. Thomborson. Software watermarking: Models and dynamic embeddings. In Proc. 26th ACM Symposium on Principles of Programming Languages, pages 311–324, January 1999.

[8] C. Collberg and C. Thomborson. Watermarking, tamper-proofing, and obfuscation – tools for software protection. IEEE Transactions on Software Engineering, 28(8), August 2002.

[9] C. Collberg, C. Thomborson, and D. Low. A taxonomy of obfuscating transformations. Technical Report 148, Department of Computer Sciences, The University of Auckland, July 1997.

[10] B. De Bus, B. De Sutter, L. Van Put, D. Chanet, and K. De Bosschere. Link-time optimization of ARM binaries. In Proc. 2004 ACM Conf. on Languages, Compilers, and Tools for Embedded Systems (LCTES'04), pages 211–220, July 2004.

[11] M. D. Ernst. Static and dynamic analysis: Synergy and duality. In WODA 2003: ICSE Workshop on Dynamic Analysis, Portland, OR, pages 24–27, May 2003.

[12] D. Evans and D. Larochelle. Improving security using extensible lightweight static analysis. IEEE Software, 19(1):42–51, January/February 2002.

[13] S. Guyer and K. McKinley. Finding your cronies: Static analysis for dynamic object colocation. In Proc. OOPSLA'04, pages 237–250, October 2004.

[14] M. Hind and A. Pioli. Which pointer analysis should I use? In Proc. 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 113–123, 2000.

[15] R. Jhala and R. Majumdar. Path slicing. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 38–47, June 2005.

[16] A. Kapoor. An approach towards disassembly of malicious binaries. Master's thesis, University of Louisiana at Lafayette, 2004.

[17] R. Kazman and S. J. Carrière. Playing detective: Reconstructing software architecture from available evidence.

[18] C. Kruegel, W. Robertson, F. Valeur, and G. Vigna. Static disassembly of obfuscated binaries. In Proc. 13th USENIX Security Symposium, August 2004.

[19] E. U. Kumar, A. Kapoor, and A. Lakhotia. DOC – answering the hidden 'call' of a virus. Virus Bulletin, April 2005.

[20] A. Lakhotia and E. U. Kumar. Abstract stack graph to detect obfuscated calls in binaries. In Proc. 4th IEEE International Workshop on Source Code Analysis and Manipulation, pages 17–26, September 2004.

[21] C. Linn and S. K. Debray. Obfuscation of executable code to improve resistance to static disassembly. In Proc. 10th ACM Conference on Computer and Communications Security (CCS 2003), pages 290–299, October 2003.

[22] T. Ogiso, Y. Sakabe, M. Soshi, and A. Miyaji. Software obfuscation on a theoretical basis and its implementation. IEICE Trans. Fundamentals, E86-A(1), January 2003.

[23] W. Pugh. The Omega test: A fast and practical integer programming algorithm for dependence analysis. Comm. ACM, 35:102–114, August 1992.

[24] B. Schwarz, S. K. Debray, and G. R. Andrews. PLTO: A link-time optimizer for the Intel IA-32 architecture. In Proc. 2001 Workshop on Binary Translation (WBT-2001), 2001.

[25] Preemptive Solutions. Dotfuscator. www.preemptive.com/products/dotfuscator.

[26] E. Stroulia and T. Systä. Dynamic analysis for reverse engineering and program understanding. ACM SIGAPP Applied Computing Review, 10(1):8–17, 2002.

[27] Symantec Corp. Understanding and managing polymorphic viruses. Technical report, 1996.

[28] T. Systä. Static and Dynamic Reverse Engineering Techniques for Java Software Systems. PhD thesis, Dept. of Computer and Information Sciences, University of Tampere, Finland, 2000.

[29] C. Wang, J. Davidson, J. Hill, and J. Knight. Protection of software-based survivability mechanisms. In Proc. International Conference on Dependable Systems and Networks, July 2001.

[30] C. Wang. A Security Architecture for Survivability Mechanisms. PhD thesis, Department of Computer Science, University of Virginia, October 2000.