Modern Compiler Implementation in C
Basic Techniques
ANDREW W. APPEL
Princeton University
© Andrew W. Appel and Maia Ginsburg, 1997
A catalog record for this book is available from the British Library
Preface
1 Introduction
1.1 Modules and interfaces
1.2 Tools and software
1.3 Data structures for tree languages
2 Lexical Analysis
2.1 Lexical tokens
2.2 Regular expressions
2.3 Finite automata
2.4 Nondeterministic finite automata
2.5 Lex: a lexical analyzer generator
3 Parsing
3.1 Context-free grammars
3.2 Predictive parsing
3.3 LR parsing
3.4 Using parser generators
4 Abstract Syntax
4.1 Semantic actions
4.2 Abstract parse trees
5 Semantic Analysis
5.1 Symbol tables
5.2 Bindings for the Tiger compiler
5.3 Type-checking expressions
Bibliography
Index
Preface
Over the past decade, there have been several shifts in the way compilers are
built. New kinds of programming languages are being used: object-oriented
languages with dynamic methods, functional languages with nested scope and
first-class function closures; and many of these languages require garbage
collection. New machines have large register sets and a high penalty for
memory access, and can often run much faster with compiler assistance in
scheduling instructions and managing instructions and data for cache locality.
This book is intended as a textbook for a one-semester or two-quarter course
in compilers. Students will see the theory behind different components of a
compiler, the programming techniques used to put the theory into practice,
and the interfaces used to modularize the compiler. To make the interfaces
and programming examples clear and concrete, I have written them in the C
programming language. Other editions of this book are available that use the
Java and ML languages.
The “student project compiler” that I have outlined is reasonably simple,
but is organized to demonstrate some important techniques that are now in
common use: abstract syntax trees to avoid tangling syntax and semantics,
separation of instruction selection from register allocation, sophisticated copy
propagation to allow greater flexibility to earlier phases of the compiler, and
careful containment of target-machine dependencies to one module.
This book, Modern Compiler Implementation in C: Basic Techniques, is
the preliminary edition of a more complete book to be published in 1998,
entitled Modern Compiler Implementation in C. That book will have a more
comprehensive set of exercises in each chapter, a “further reading” discussion
at the end of every chapter, and another dozen chapters on advanced material
not in this edition, such as parser error recovery, code-generator generators,
byte-code interpreters, static single-assignment form, instruction scheduling
http://www.cs.princeton.edu/~appel/modern/
There are also pencil and paper exercises in each chapter; those marked with a
star * are a bit more challenging, two-star problems are difficult but solvable,
and the occasional three-star exercises are not known to have a solution.
PART ONE
Fundamentals of Compilation
1 Introduction
This book describes techniques, data structures, and algorithms for translating
programming languages into executable code. A modern compiler is often
organized into many phases, each operating on a different abstract “language.”
The chapters of this book follow the organization of a compiler, each covering
a successive phase.
To illustrate the issues in compiling real programming languages, I show
how to compile Tiger, a simple but nontrivial language of the Algol family,
with nested scope and heap-allocated records. Programming exercises in each
chapter call for the implementation of the corresponding phase; a student
who implements all the phases described in Part I of the book will have a
working compiler. Tiger is easily modified to be functional or object-oriented
(or both), and exercises in Part II show how to do this. Other chapters in Part
II cover advanced techniques in program optimization. Appendix A describes
the Tiger language.
The interfaces between modules of the compiler are almost as important as
the algorithms inside the modules. To describe the interfaces concretely, it is
useful to write them down in a real programming language. This book uses
the C programming language.
CHAPTER ONE. INTRODUCTION
[Figure: phases of a compiler and the interfaces between them.
Phases: Lex, Parse, Parsing Actions, Semantic Analysis, Translate, Canonicalize,
Instruction Selection, Control Flow Analysis, Data Flow Analysis, Register
Allocation, Code Emission, Assembler, Linker.
Interfaces: Source Program, Tokens, Reductions, Abstract Syntax, Environments,
Tables, Translate, IR Trees, Frame, Frame Layout, Assem, Flow Graph,
Interference Graph, Assembly Language, Machine Language.]
1.2. TOOLS AND SOFTWARE
this luxury. Therefore, I present in this book the outline of a project where the
abstractions and interfaces are carefully thought out, and are as elegant and
general as I am able to make them.
Some of the interfaces, such as Abstract Syntax, IR Trees, and Assem, take
the form of data structures: for example, the Parsing Actions phase builds an
Abstract Syntax data structure and passes it to the Semantic Analysis phase.
Other interfaces are abstract data types; the Translate interface is a set of
functions that the Semantic Analysis phase can call, and the Tokens interface
takes the form of a function that the Parser calls to get the next token of the
input program.
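To make the function-style interface concrete, here is a minimal sketch of a token source that a parser could call repeatedly. The Tok_ prefix, the Tok_stream type, the Tok_next function, and the four token kinds are all hypothetical names chosen for illustration; they are not the book's actual Tokens interface.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds (illustrative only). */
typedef enum { Tok_id, Tok_num, Tok_sym, Tok_eof } Tok_kind;

/* Hypothetical scanner state: the input text and a read position. */
typedef struct {
    const char *src;  /* input program text */
    size_t pos;       /* index of the next unread character */
} Tok_stream;

/* Return the kind of the next token and advance past it.
   The parser sees only this function, not how scanning is done. */
Tok_kind Tok_next(Tok_stream *s) {
    while (isspace((unsigned char)s->src[s->pos])) s->pos++;
    unsigned char c = (unsigned char)s->src[s->pos];
    if (c == '\0') return Tok_eof;
    if (isalpha(c)) {
        while (isalnum((unsigned char)s->src[s->pos])) s->pos++;
        return Tok_id;
    }
    if (isdigit(c)) {
        while (isdigit((unsigned char)s->src[s->pos])) s->pos++;
        return Tok_num;
    }
    s->pos++;  /* any other character is a one-character symbol */
    return Tok_sym;
}
```

A caller would initialize a Tok_stream with the program text and position 0, then call Tok_next until it returns Tok_eof. Because the parser depends only on the function, the scanner behind it can be replaced without touching the parser.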
Two of the most useful abstractions used in modern compilers are context-free
grammars, for parsing, and regular expressions, for lexical analysis. To make
best use of these abstractions it is helpful to have special tools, such as Yacc
(which converts a grammar into a parsing program) and Lex (which converts
a declarative specification into a lexical analysis program).
The programming projects in this book can be compiled using any ANSI-standard
C compiler, along with Lex (or the more modern Flex) and Yacc
(or the more modern Bison). Some of these tools are freely available on the
Internet; for information see the World Wide Web page
http://www.cs.princeton.edu/~appel/modern/
Source code for some modules of the Tiger compiler, support code for some
of the programming exercises, example Tiger programs, and other useful files
are also available from the same Web address.
Skeleton source code for the programming assignments is available from
this Web page; the programming exercises in this book refer to this directory as
$TIGER/ when referring to specific subdirectories and files contained therein.
1.3. DATA STRUCTURES FOR TREE LANGUAGES
prints
8 7
80
[Figure: tree representation of the straight-line program
a := 5 + 3 ; b := ( print ( a , a - 1 ) , 10 * a ) ; print ( b )
built from CompoundStm, AssignStm, PrintStm, OpExp, EseqExp, NumExp, IdExp,
PairExpList, and LastExpList nodes.]
Grammar    typedef
Stm        A_stm
Exp        A_exp
ExpList    A_expList
id         string
num        int
For each grammar rule, there is one constructor that belongs to the union
for its left-hand-side symbol. The constructor names are indicated on the
right-hand side of Grammar 1.3.
Each grammar rule has right-hand-side components that must be represented
in the data structures. The CompoundStm has two Stm’s on the right-hand
side; the AssignStm has an identifier and an expression; and so on. Each
grammar symbol’s struct contains a union to carry these values, and a
kind field to indicate which variant of the union is valid.
For each variant (CompoundStm, AssignStm, etc.) we make a constructor
function to malloc and initialize the data structure. In Figure 1.5 only the
prototypes of these functions are given; the definition of A_CompoundStm
would look like this:
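A minimal sketch of such a constructor, assuming a reduced A_stm_ struct with only two variants. The field names (u, compound, stm1, stm2, assign, id, exp) and the enumeration atoms are assumptions chosen to follow the naming rules of this section, and A_exp is left as an opaque pointer type.

```c
#include <assert.h>
#include <stdlib.h>

typedef char *string;
typedef struct A_stm_ *A_stm;
typedef struct A_exp_ *A_exp;   /* opaque here; only pointers are used */

/* Reduced statement type: a kind tag plus a union of variants. */
struct A_stm_ {
    enum { A_compoundStm, A_assignStm } kind;
    union {
        struct { A_stm stm1, stm2; } compound;
        struct { string id; A_exp exp; } assign;
    } u;
};

/* Constructor: allocate the node and initialize every field it uses. */
A_stm A_CompoundStm(A_stm stm1, A_stm stm2) {
    A_stm s = malloc(sizeof(*s));
    assert(s);                      /* stop if memory is exhausted */
    s->kind = A_compoundStm;
    s->u.compound.stm1 = stm1;
    s->u.compound.stm2 = stm2;
    return s;
}
```

Code that consumes the tree would switch on s->kind and then read only the matching variant of s->u.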
functions.
8. Each module (header file) shall have a prefix unique to that module (example,
A_ in Figure 1.5).
9. Typedef names (after the prefix) shall start with lowercase letters; constructor
functions (after the prefix) with uppercase; enumeration atoms (after the prefix)
with lowercase; and union variants (which have no prefix) with lowercase.
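The naming rules in items 8 and 9 can be illustrated with a reduced expression type. This is only a sketch: the names A_exp, A_NumExp, A_OpExp, A_plus, A_minus, and A_times appear elsewhere in this chapter, but the struct layout, the A_binop typedef, and the field names (num, op, left, oper, right) are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct A_exp_ *A_exp;                     /* typedef: A_ + lowercase */
typedef enum { A_plus, A_minus, A_times } A_binop; /* atoms: A_ + lowercase  */

struct A_exp_ {
    enum { A_numExp, A_opExp } kind;               /* atoms: A_ + lowercase  */
    union {
        int num;                                   /* variant: no prefix     */
        struct { A_exp left; A_binop oper; A_exp right; } op;
    } u;
};

/* Constructors: A_ + uppercase. */
A_exp A_NumExp(int num) {
    A_exp e = malloc(sizeof(*e));
    assert(e);
    e->kind = A_numExp;
    e->u.num = num;
    return e;
}

A_exp A_OpExp(A_exp left, A_binop oper, A_exp right) {
    A_exp e = malloc(sizeof(*e));
    assert(e);
    e->kind = A_opExp;
    e->u.op.left = left;
    e->u.op.oper = oper;
    e->u.op.right = right;
    return e;
}
```

With these conventions a reader can tell at a glance that A_OpExp builds a node while A_opExp tags one.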
1. Each phase or module of the compiler belongs in its own “.c” file, which will
have a corresponding “.h” file.
2. Each module shall have a prefix unique to that module. All global names
(structure and union fields are not global names) exported by the module shall
start with the prefix. Then the human reader of a file will not have to look
outside that file to determine where a name comes from.
3. All functions shall have prototypes, and the C compiler shall be told to warn
about uses of functions without prototypes.
4. We will #include "util.h" in each file:
/* util.h */
#include <assert.h>
void *checked_malloc(int);
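One plausible definition matching the checked_malloc prototype is sketched below. It assumes that running out of memory should simply stop the program through assert; note that assert is compiled out when NDEBUG is defined, so a production version might test the pointer explicitly instead.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate len bytes and never return NULL to the caller:
   if malloc fails, the assertion stops the program. */
void *checked_malloc(int len) {
    void *p = malloc(len);
    assert(p);
    return p;
}
```

Constructor functions can then call checked_malloc and initialize the result without checking for NULL themselves.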
A_stm prog =
  A_CompoundStm(
    A_AssignStm("a",
      A_OpExp(A_NumExp(5), A_plus, A_NumExp(3))),
    A_CompoundStm(
      A_AssignStm("b",
        A_EseqExp(
          A_PrintStm(A_PairExpList(A_IdExp("a"),
            A_LastExpList(A_OpExp(A_IdExp("a"), A_minus,
                                  A_NumExp(1))))),
          A_OpExp(A_NumExp(10), A_times, A_IdExp("a")))),
      A_PrintStm(A_LastExpList(A_IdExp("b")))));
PROGRAMMING EXERCISE
Files with the data type declarations for the trees, and this sample program,
are available in the directory $TIGER/chap1.
Writing interpreters without side effects (that is, assignment statements that
update variables and data structures) is a good introduction to denotational
semantics and attribute grammars, which are methods for describing what
programming languages do. It’s often a useful technique in writing compilers,
too; compilers are also in the business of saying what programming languages
do.
Therefore, in implementing these programs, never assign a new value to
any variable or structure-field except when it is initialized. For local variables,
use the initializing form of declaration (for example, int i=j+3;) and for
each kind of struct, make a “constructor” function that allocates it and
initializes all the fields, similar to the A_CompoundStm example in Section 1.3.
For part 1, remember that print statements can contain expressions that
contain other print statements.
For part 2, make two mutually recursive functions interpStm and interpExp.
Represent a “table,” mapping identifiers to the integer values assigned
to them, as a list of id × int pairs.
taking a table t1 as argument and producing the new table t2 that’s just like
t1 except that some identifiers map to different integers as a result of the
statement.
For example, the table t1 that maps a to 3 and maps c to 4, which we write
{a ↦ 3, c ↦ 4} in mathematical notation, could be represented as the linked
list (a, 3) → (c, 4).
Now, let the table t2 be just like t1, except that it maps c to 7 instead of 4.
Mathematically, we could write
t2 = update(t1, c, 7)
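One way to realize update without side effects is to put a new cell at the head of the list and leave the old list untouched. The sketch below assumes hypothetical names Table_, Table, and lookup, and treats a lookup of an unbound identifier as a programming error.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef char *string;
typedef struct table *Table_;
struct table { string id; int value; Table_ tail; };

/* Constructor: the only place where a cell's fields are written. */
Table_ Table(string id, int value, Table_ tail) {
    Table_ t = malloc(sizeof(*t));
    assert(t);
    t->id = id;
    t->value = value;
    t->tail = tail;
    return t;
}

/* Return a new table that shadows any older binding of id;
   the argument table t is shared, not modified. */
Table_ update(Table_ t, string id, int value) {
    return Table(id, value, t);
}

/* Search from the head, so the most recent binding wins. */
int lookup(Table_ t, string key) {
    assert(t != NULL);  /* assumption: unbound identifiers are an error */
    return strcmp(t->id, key) == 0 ? t->value : lookup(t->tail, key);
}
```

After t2 = update(t1, "c", 7), a lookup of c in t2 finds 7 while a lookup in t1 still finds 4, because the older cell for c remains in the tail of t2.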
EXERCISES
1.1 This simple program implements persistent functional binary search trees, so
that if tree2=insert(x,tree1), then tree1 is still available for lookups
even while tree2 can be used.
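A minimal sketch of a program fitting this description, assuming String is a char pointer typedef, keys are compared with strcmp, and nodes are never freed. Insertion copies only the nodes on the path from the root to the insertion point and shares every other subtree with the old tree.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef char *String;
typedef struct tree *T_tree;
struct tree { T_tree left; String key; T_tree right; };

/* Constructor: allocate a node and initialize all of its fields. */
T_tree Tree(T_tree l, String k, T_tree r) {
    T_tree t = malloc(sizeof(*t));
    assert(t);
    t->left = l;
    t->key = k;
    t->right = r;
    return t;
}

/* Return a tree containing key; the argument tree t is not modified. */
T_tree insert(String key, T_tree t) {
    if (t == NULL) return Tree(NULL, key, NULL);
    int c = strcmp(key, t->key);
    if (c < 0) return Tree(insert(key, t->left), t->key, t->right);
    if (c > 0) return Tree(t->left, t->key, insert(key, t->right));
    return Tree(t->left, key, t->right);
}
```

Since no existing node is ever written after construction, tree1 remains valid for lookups after tree2 = insert(x, tree1).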
a. Implement a member function that returns true if the item is found, else
false.
b. Extend the program to include not just membership, but the mapping of
keys to bindings:
T_tree insert(String key, void *binding, T_tree t);
void * lookup(String key, T_tree t);
c. These trees are not balanced; demonstrate the behavior on the following
two sequences of insertions:
(a) t s p i p f b s t
(b) a b c d e f g h i
*d. Research balanced search trees in Sedgewick [1988] and recommend
a balanced-tree data structure for functional symbol tables. (Hint: to
preserve a functional style, the algorithm should be one that rebalances
on insertion but not on lookup.)