Decompilation
1 Introduction
In this lecture, we consider the problem of doing compilation “backwards”: that is, transforming a
compiled binary into a reasonable representation of its original source. Solving this problem will involve
significant consideration of our standard dataflow analyses, as well as a discussion of good selection of internal
representations of code.
While the motivation for the existence of compilers is fairly clear, the motivation for the existence of
decompilers is less so. However, in the modern world there exist many legacy systems for which the original
source code has been lost, but which need bugs fixed or need to be ported to a more modern architecture.
Decompilers facilitate this process greatly. In addition, in malware analysis, source is generally not provided.
It is therefore extremely useful to have some way to go from binary to a reasonable approximation of the
original code.
For this lecture, we will focus on decompiling machine code, originally C0 code, that conforms to the C
ABI, into a version of C0 with pointer arithmetic and goto. This comes nowhere near being a treatment
of decompilation of arbitrary binaries (and in fact the algorithms as described here will frequently fail to
work on arbitrary binaries!), though more complex variants of the same ideas will continue to work.
2 Steps of Decompilation
Roughly, decompilation follows a few steps:
1. Disassembly - transformation from machine code to the assembly equivalent. There are a surprising
number of pitfalls here.
2. Lifting and dataflow analysis - transforming the resulting assembly code into a higher-level internal
representation, such as our three-operand assembly. One of the tricky parts here is recognizing distinct
variables, and detaching variables from registers or addresses. We also recover expressions, function
return values and arguments.
3. Control flow analysis - recovering control flow structure information, such as if and while statements,
as well as their nesting level.
4. Type analysis - recovering types of variables, functions, and other pieces of data.
3 Disassembly
The first step of writing a good decompiler is writing a good disassembler. While the details of individual
disassemblers can be extremely complex, the general idea is fairly simple. The mapping between assembly
and machine code is in theory one-to-one, so a straight-line translation should be feasible.
However, disassemblers rapidly run into a problem: it is very difficult to reliably distinguish code from
data.
In order to do so, generally disassemblers will take one of two strategies:
1. Disassemble the sections that are generally filled with code (.plt, .text, some others) and treat the
rest of them as data. One tool that follows this strategy is objdump. While this works decently well
on code produced by most modern compilers, there exist (or existed!) compilers that place data into
these executable sections, which confuses the disassembler. Further, instructions at unexpected
alignments will also defeat these disassemblers.
2. Consider the starting address given by the binary’s header, and recursively disassemble all code reach-
able from that address. This approach is frequently defeated by indirect jumps, though most of the
disassemblers that use it have additional heuristics that allow them to deal with this. An example tool
that follows this strategy is Hex-Rays’ Interactive Disassembler (IDA).
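To make strategy 2 concrete, here is a minimal sketch of recursive-descent disassembly in C. The decode
function and the insn_t fields are hypothetical stand-ins for a real instruction decoder, and the flat is_code
array assumes a small code image; real disassemblers layer many heuristics on top of this skeleton.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
/* Hypothetical decoder interface; these names are illustrative, not from
   any real library. decode() reads one instruction at addr and reports
   its length and any statically-known branch target. */
typedef struct {
    size_t   length;        /* bytes consumed by this instruction    */
    bool     is_branch;     /* direct jump or call with known target */
    bool     falls_through; /* false for ret and unconditional jmp   */
    uint64_t target;        /* valid only when is_branch is true     */
} insn_t;
bool decode(const uint8_t *code, uint64_t addr, insn_t *out);
#define MAX_WORK 4096
/* Recursive-descent disassembly: start at the entry point, follow all
   statically-known control flow, and mark each decoded address as code;
   whatever remains unmarked is treated as data. Indirect jumps defeat
   this sketch entirely, which is why real tools add heuristics. */
void disassemble(const uint8_t *code, uint64_t entry, bool *is_code)
{
    uint64_t work[MAX_WORK];
    size_t n = 0;
    work[n++] = entry;
    while (n > 0) {
        uint64_t addr = work[--n];
        insn_t insn;
        while (!is_code[addr] && decode(code, addr, &insn)) {
            is_code[addr] = true;
            if (insn.is_branch && n < MAX_WORK)
                work[n++] = insn.target;  /* explore the branch target */
            if (!insn.falls_through)
                break;                    /* ret/jmp ends this path    */
            addr += insn.length;
        }
    }
}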
While disassembly is a difficult problem with many pitfalls, it is not particularly interesting from an
implementation perspective for us. Note, however, that many program “obfuscators” include steps specifically
targeted at fooling disassemblers, since without correct disassembly it is impossible to carry out the later steps.
4 Lifting and Dataflow Analysis
The next step is to lift the disassembled code into a higher-level internal representation, such as our
three-operand assembly, and to run several dataflow passes over it:
1. Dead register elimination. For example, if the translation of an x86 division produces
t <- %edx:%eax
%eax <- t / %ecx
%edx <- t % %ecx
and %eax is not live in the successor, it is permissible to remove the second line, since the
third line will still cause the division by 0 when %ecx is zero, preserving the original behavior.
Dead register elimination is done following effectively the same rules as dead code elimination from the
homeworks, with some special cases like the above.
2. Dead flag elimination. Our translation makes direct use of the condition flags, and keeps track of which
of them are defined and used at which time. We treat flags effectively as registers of their own. In
this case, if a flag f is defined at a line l and is not live-in at line l + 1, then we remove the definition of
f from the line l. This will simplify our later analyses greatly, allowing us to collapse conditions more
effectively.
3. Conditional collapsing. At this stage, we collapse sequences of the form comparison-cjump into a
conditional jump on an expression. For example, after flag elimination, we collapse:
zf <- cmp(%eax,0)
jz label
into
cjump (%eax == 0), label
In C0, generally every conditional will have this form. However, sufficiently clever optimizing compilers
may be able to optimize some conditional chains more efficiently. A discussion of transforming more
optimized conditions can be found in Cristina Cifuentes’ thesis.
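As a sketch of what this collapsing pass might look like, the following C fragment pattern-matches adjacent
cmp/jz pairs over a toy instruction array; the instr_t representation is invented for illustration, and a real
pass would handle all the comparison and jump flavors, not just jz.
typedef enum { I_CMP, I_JZ, I_CJUMP, I_OTHER } op_t;
typedef struct {
    op_t op;
    const char *lhs, *rhs; /* comparison operands                  */
    const char *label;     /* jump target, when applicable         */
    int dead;              /* marked dead rather than removed here */
} instr_t;
/* Collapse cmp followed immediately by jz into a single conditional
   jump on the expression lhs == rhs. This assumes dead flag
   elimination has already run, so the flags written by the cmp have
   no uses other than the jz. */
void collapse_conditionals(instr_t *code, int n)
{
    for (int i = 0; i + 1 < n; i++) {
        if (code[i].op == I_CMP && code[i + 1].op == I_JZ) {
            code[i + 1].op  = I_CJUMP; /* jump if lhs == rhs */
            code[i + 1].lhs = code[i].lhs;
            code[i + 1].rhs = code[i].rhs;
            code[i].dead = 1;          /* drop the bare cmp  */
        }
    }
}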
Having reached this point in the analysis, we would like to lose registers. Hence, we may simply replace
each register with an appropriate temp, taking care to keep argument and result registers pinned. We then do
the function-call-expansion step in reverse, replacing sequences of moves into argument registers followed by
a call with a parametrized call. We note that in order to do so, we must first make a pass over all functions
to determine how many arguments they take, in order to deal with the possibility of certain moves being
optimized out.
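For example, assuming the x86-64 convention in which the first two arguments are passed in %rdi and
%rsi and the result is returned in %rax, a sequence such as
%rdi <- t1
%rsi <- t2
call f
t3 <- %rax
is re-rolled into the parametrized call
t3 <- f(t1, t2)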
At this stage, it is possible to effectively perform a slightly modified SSA analysis on the resulting code.
Hence, for the future we will assume that this SSA analysis has been executed, and define our further analysis
over SSA code. We may now perform an extended copy-propagation pass to collapse expressions.
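For example, since every temp in SSA form has exactly one definition, a temp that is used exactly once
can be replaced by its defining expression. Under that rule, the chain
t1 <- a + b
t2 <- t1 * 4
t3 <- t2 + c
collapses into the single expression
t3 <- (a + b) * 4 + c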
This is sufficient to perform the next stages of the analysis. However, many decompilers apply much
more sophisticated techniques to this stage. Cristina Cifuentes’ thesis contains a description of many such
algorithms.
5 Control Flow Analysis
Having recovered variables and expressions, we now wish to recover structured control flow, such as
conditionals and loops, along with its nesting.
5.1 Structuring Loops
We will consider three primary classes of loops. While other loops may appear in decompiled
code, analysis of these more complex loops is more difficult. Further reading can be found in the paper “A
Structuring Algorithm for Decompilation” by Cristina Cifuentes. Our three primary classes are as follows:
1. While loops: the node at the start of the loop is a conditional, and the latching node is unconditional.
2. Repeat loops: the latching node is conditional.
3. Endless loops: both the latching and the start nodes are unconditional.
The latching node here is the node with the back-edge to the start node. We note that there is at most
one latching node per loop in our language, as break and continue do not exist.
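For instance, written in goto form, the first two classes look roughly as follows:
start: if (!cond) goto done
       body
       goto start
done:  ...
is a while loop (conditional start node, unconditional latching node), while
start: body
       if (cond) goto start
is a repeat loop (conditional latching node).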
In order to find these loops, we will consider intervals on a digraph. If h is a node in G, the interval I(h) is the
maximal subgraph in which h is the only entry node and in which all closed paths contain h. It is a theorem
that there exists a set {h1, ..., hk} of header nodes such that the set {I(h1), ..., I(hk)} is a partition of the
graph, and further there exists an algorithm to find this partition.
We then define the sequence of derived graphs of G as follows:
1. G1 = G.
2. Gn+1 is the graph formed by contracting every interval of Gn into a single node.
This procedure eventually reaches a fixed point, at which point no further contraction is possible; the
original graph is called reducible exactly when this fixed point is a single node.
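A minimal sketch of interval construction in C, assuming the graph is small enough for a fixed-size
predecessor matrix: starting from {h}, we repeatedly absorb any node all of whose predecessors already lie
in the interval, since such a node can only be entered through h.
#define MAX_NODES 128
int nnodes;                     /* number of CFG nodes             */
int pred[MAX_NODES][MAX_NODES]; /* pred[n][m] != 0 iff edge m -> n */
/* Compute the interval I(h) as a membership array. */
void interval(int h, int in_interval[MAX_NODES])
{
    for (int n = 0; n < nnodes; n++)
        in_interval[n] = 0;
    in_interval[h] = 1;
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int n = 0; n < nnodes; n++) {
            if (in_interval[n])
                continue;
            int npreds = 0, inside = 0;
            for (int m = 0; m < nnodes; m++)
                if (pred[n][m]) {
                    npreds++;
                    if (in_interval[m]) inside++;
                }
            /* n is absorbed when every predecessor is already inside */
            if (npreds > 0 && inside == npreds) {
                in_interval[n] = 1;
                changed = 1;
            }
        }
    }
}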
Note that for any interval I(h), there exists a loop rooted at h if there is a back-edge to h from some
node z ∈ I(h). One way to find such a node is to simply perform DFS on the interval. Then, in order to
find the nodes in the loop, we define h as being part of the loop and then proceed by noting that a node k
is in the loop if and only if its immediate dominator is in the loop and h is reachable from k.
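A sketch of that membership test, assuming nodes are numbered in reverse post-order (so a node's
immediate dominator always has a smaller number), with idom[] and a precomputed table reaches_h[]
recording whether h is reachable from each node:
/* Collect the loop rooted at header h. in_loop[] is the output. */
void loop_nodes(int h, int nnodes, const int idom[],
                const int reaches_h[], int in_loop[])
{
    for (int k = 0; k < nnodes; k++)
        in_loop[k] = 0;
    in_loop[h] = 1;
    /* Reverse post-order guarantees idom[k] is decided before k. */
    for (int k = 0; k < nnodes; k++)
        if (k != h && in_loop[idom[k]] && reaches_h[k])
            in_loop[k] = 1;
}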
The algorithm for finding loops in the graph then proceeds as follows. Compute the derived graphs of
G until you reach the fixed point, and find the loops in each derived graph. Note that if any node is found
to be the latching node for two loops, one of these loops will need to be labeled with a goto instead. While
there do exist algorithms that can recover more complex structures, this is not one of them.
5.2 Structuring Conditionals
To structure conditionals, we need to find, for each conditional node a, the node at which its two
branches rejoin. The algorithm proceeds as follows:
1. For every conditional node a, find the set of nodes immediately dominated by a.
2. Produce G′ from G by reversing all the arrows. Filter out nodes from the set above that do not
dominate a in G′ (that is, that do not post-dominate a).
3. Find the closest node to a in the resulting set, by considering the one with the highest post-order
number.
The resulting node is the follow node of a.
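A sketch of this computation, reusing MAX_NODES from the interval sketch above and assuming a
precomputed idom[] array, a postdom[n][a] table (dominance in the reversed graph G′), and post-order
numbers:
/* Return the follow node of conditional a, or -1 if none exists, in
   which case the conditional must be emitted using a goto. */
int follow_node(int a, int nnodes, const int idom[],
                const int postdom[][MAX_NODES], const int postorder[])
{
    int best = -1;
    for (int n = 0; n < nnodes; n++) {
        if (idom[n] != a)      /* step 1: immediately dominated by a */
            continue;
        if (!postdom[n][a])    /* step 2: must dominate a in G'      */
            continue;
        if (best < 0 || postorder[n] > postorder[best])
            best = n;          /* step 3: highest post-order number  */
    }
    return best;
}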
We note that this algorithm does not do a particularly good job of dealing with boolean short-circuiting.
Any control flow that does not match the patterns above will be replaced with an if with a goto.
6 Type Analysis
Given control flow and some idea of which variables are which, it is frequently useful to be able to determine
what the types of various variables are. While it may be correct to produce a result where every variable
is of type void *, no one actually writes programs that way. Therefore, we would like to be able to assign
variables and functions their types, as well as hopefully recover structure layout.
A compiler has significant advantages over a decompiler in this respect. The compiler knows which
sections of a structure are padding, and which are actually useful; it also knows which types a function can
take or return. A compiler notices that the functions below are different, and so compiles them separately;
a decompiler may not be able to notice that these functions accept different types without some more
sophisticated analysis. In particular, on a 32-bit machine, these functions will produce identical assembly.
struct s1 { int a; };
int s1_get(struct s1 *s) { return s->a; }
struct s2 { struct s1 *a; };
struct s1 *s2_get(struct s2 *s) { return s->a; }
Instead, a decompiler must infer types from how values are used. A few representative rules (from a
longer list) include:
5. If two variables are added together and one is a pointer, the other is an integer.
6. If two variables are added together and one is an integer, the other is either a pointer or an integer.
7. If two variables are compared with <, >, >= or <=, they are both integers.
8. If two variables are compared with == or !=, they have the same type.
12. The sum of a pointer of type τ* and an integer is a pointer, but not necessarily of type τ*.
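These rules lend themselves to a simple fixed-point propagation. Below is a toy C version over a small
lattice, implementing only the rules shown above (run over every instruction until no type changes); it is
nothing like a full inference such as TIE's.
typedef enum { T_UNKNOWN, T_INT, T_PTR, T_CONFLICT } ty;
/* Combine two pieces of evidence about one variable's type. */
static ty meet(ty a, ty b)
{
    if (a == T_UNKNOWN) return b;
    if (b == T_UNKNOWN) return a;
    return a == b ? a : T_CONFLICT;
}
/* Rules 5 and 6: in an addition, a pointer operand forces the other
   operand to be an integer and the result to be a pointer; an integer
   operand alone forces nothing about the other operand. */
void constrain_add(ty *res, ty *p, ty *q)
{
    if (*p == T_PTR) { *q = meet(*q, T_INT); *res = meet(*res, T_PTR); }
    if (*q == T_PTR) { *p = meet(*p, T_INT); *res = meet(*res, T_PTR); }
}
/* Rule 7: operands of <, <=, >, >= are both integers. */
void constrain_rel(ty *p, ty *q)
{
    *p = meet(*p, T_INT);
    *q = meet(*q, T_INT);
}
/* Rule 8: operands of == and != share a type. */
void constrain_eq(ty *p, ty *q)
{
    ty m = meet(*p, *q);
    *p = *q = m;
}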
We note that in order to get high-quality types, we will often need to perform analysis across function
boundaries. We also note that this analysis is entirely unable to distinguish between structures and arrays.
A more sophisticated type analysis is described in the TIE paper in the references section. There is plenty
of research being done in this area, however!
7 Other Issues
Other issues that haven’t been discussed here include automatically detecting vulnerabilities,
detecting and possibly collapsing aliases, recovering scoping information, extracting inlined functions, and
dealing with tail call optimizations. Many of these problems (and, in fact, many of the things discussed
above!) do not have satisfactory solutions, and remain open research problems. For example, CMU’s CyLab
has a group actively doing research on these topics. They recently (a few days ago!) released a paper
containing a description of their solutions to many of these problems. Since they decompile arbitrary native
code, rather than caring mostly about a specific language, they encounter some very interesting and difficult
problems.
Decompilation as a whole is very much an open research topic, and there exist very few reasonable
decompilers. One of the better-known ones is the Hex-Rays decompiler, and it is sadly entirely closed-
source. As far as I know, there are no high-quality open-source decompilers for x86 or x86-64.
8 References
The material for this lecture was almost entirely gleaned from the following: