Language Specification & Compiler Construction: Handouts
2011
http://ssw.jku.at/Misc/CC/
Course Contents
1. Overview
1.1 Motivation
1.2 Structure of a compiler
1.3 Grammars
1.4 Chomsky's classification of grammars
1.5 The MicroJava language
2. Scanning
2.1 Tasks of a scanner
2.2 Regular grammars and finite automata
2.3 Scanner implementation
3. Parsing
3.1 Context-free grammars and push-down automata
3.2 Recursive descent parsing
3.3 LL(1) property
3.4 Error handling
5. Symbol table
5.1 Overview
5.2 Objects
5.3 Scopes
5.4 Types
5.5 Universe
6. Code generation
6.1 Overview
6.2 The MicroJava VM
6.3 Code buffer
6.4 Operands
6.5 Expressions
6.6 Assignments
6.7 Jumps
6.8 Control structures
6.9 Methods
Further Reading
S.Muchnick: Advanced Compiler Design and Implementation. Morgan Kaufmann,
1997.
A very good and complete book which goes far beyond the scope of this
introductory course. Not cheap, but rewarding if you really want to become a
compiler expert.
H.Bal, D.Grune, C.Jacobs: Modern Compiler Design. John Wiley, 2000.
Also a good book that describes the state of the art in compiler construction.
A.Aho, R.Sethi, J.Ullman: Compilers: Principles, Techniques, and Tools.
Addison-Wesley, 1986.
Old but still good to read. Does not cover recursive descent compilation in depth
but has chapters about optimisation and data flow analysis.
W.M.Waite, G.Goos: Compiler Construction. Springer-Verlag, 1984.
Theoretical book on compiler construction. Good chapter on attribute grammars.
Compiler Construction Lab
In this lab you will write a small compiler for a Java-like language (MicroJava). You
will learn how to put the techniques from the compiler construction course into prac-
tice and study all the details involved in a real compiler implementation.
The project consists of four levels:
Level 1 requires you to implement a scanner and a parser for the language
MicroJava, specified in Appendix A of this document.
Level 2 deals with symbol table handling and some type checking.
If you want to take your compiler all the way, you should also implement level
3, which deals with code generation for the MicroJava Virtual Machine specified
in Appendix B of this document. This level is (more or less) optional, so you
can get a good mark even if you do not implement it.
Level 4 finally requires you to use the compiler generator Coco/R to produce a
compiler-like program automatically.
The project should be implemented in Java using Sun Microsystems' Java
Development Kit (JDK, http://java.sun.com/javase/downloads/index.jsp) or some
other development environment.
You can also decode a compiled MicroJava program by downloading the file
Decode.java to the package MJ and compiling it. You can invoke it with
java MJ.Decode sample.obj
We want to read a binary tree from an input file, which represents the tree
structure with brackets, for example:
(London
  (Brussels)
  (Paris
    (Madrid)
    (Rome
      ()
      (Vienna)
    )
  )
)
Describe the input of such trees by a recursive EBNF grammar. Write a Coco/R
compiler description using this grammar. Terminal symbols are identifiers as
well as '(' and ')'. Add attributes and semantic actions to your compiler
description in order to build the corresponding binary tree. Also write a dump
method that prints the tree after it has been built.
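Before writing the attributed grammar it may help to see the same task solved
by a small hand-written recursive descent parser. The following Java sketch is
not Coco/R output. It assumes the grammar
Tree = "(" [ ident [ Tree [ Tree ] ] ] ")".
where the first nested tree is the left subtree, the second is the right
subtree and "()" stands for a missing subtree. The names Node, TreeReader, read
and dump are chosen for illustration only.

```java
// A node of the binary tree; left or right may be null.
final class Node {
    final String name;
    final Node left, right;

    Node(String name, Node left, Node right) {
        this.name = name;
        this.left = left;
        this.right = right;
    }
}

// Hand-written recursive descent reader for the assumed grammar
//   Tree = "(" [ ident [ Tree [ Tree ] ] ] ")".
final class TreeReader {
    private final String src;
    private int pos = 0;

    private TreeReader(String src) { this.src = src; }

    // Parses the whole input and returns the root (null for "()").
    static Node read(String src) {
        TreeReader r = new TreeReader(src);
        Node root = r.tree();
        if (r.peek() != '\0') throw r.error("end of input expected");
        return root;
    }

    private Node tree() {
        expect('(');
        Node node = null;
        if (Character.isLetter(peek())) {
            String name = ident();
            Node left = null, right = null;
            if (peek() == '(') {
                left = tree();
                if (peek() == '(') right = tree();
            }
            node = new Node(name, left, right);
        }
        expect(')');
        return node;
    }

    // Skips blanks and returns the next character, or '\0' at the end of input.
    private char peek() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private String ident() {
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        return src.substring(start, pos);
    }

    private void expect(char c) {
        if (peek() != c) throw error("'" + c + "' expected");
        pos++;
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at position " + pos);
    }

    // Returns the tree in the same bracket notation, on a single line.
    static String dump(Node n) {
        if (n == null) return "()";
        StringBuilder sb = new StringBuilder("(").append(n.name);
        if (n.left != null || n.right != null) sb.append(' ').append(dump(n.left));
        if (n.right != null) sb.append(' ').append(dump(n.right));
        return sb.append(')').toString();
    }
}
```

In a Coco/R solution the bodies of tree() and ident() would become a production
with an output attribute of type Node, and the scanner part would be generated
from the token declarations.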
In order to use Coco/R go to http://ssw.jku.at/Coco/#Java and download the files
Coco.jar, Scanner.frame and Parser.frame into a new directory Tree. If your
compiler description is in a file Tree.atg in the directory Tree, go to this
directory and type
java -jar Coco.jar Tree.atg
This will generate the files Scanner.java and Parser.java in the directory Tree. Write
a main program TreeBuilder.java that creates a scanner and a parser and calls the
parser (look at the slides in the course).
Sample program
program P
  final int size = 10;
  class Table {
    int[] pos;
    int[] neg;
  }
  Table val;
{
  void main()
    int x, i;
  { //---------- Initialize val ------------
    val = new Table;
    val.pos = new int[size];
    val.neg = new int[size];
    i = 0;
    while (i < size) {
      val.pos[i] = 0; val.neg[i] = 0;
      i = i + 1;
    }
    //------------ Read values -------------
    read(x);
    while (x != 0) {
      if (x >= 0) {
        val.pos[x] = val.pos[x] + 1;
      } else if (x < 0) {
        val.neg[-x] = val.neg[-x] + 1;
      }
      read(x);
    }
  }
}
A.2 Syntax
Program = "program" ident {ConstDecl | VarDecl | ClassDecl}
"{" {MethodDecl} "}".
Lexical structure
Operators:  +   -   *   /   %
            ==  !=  >   >=  <   <=
            (   )   [   ]   {   }
            =   ;   ,   .
Comments:   // to the end of line
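The operators and comments listed above can be recognised with one character of
lookahead. The following Java sketch shows one possible way to do this; it is
not the scanner required for the lab. It returns tokens as plain strings and
assumes that identifiers consist of a letter followed by letters or digits and
that numbers are digit sequences. Keywords and character constants are left out.
The class name MiniScanner and the method scan are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class MiniScanner {
    // Splits src into tokens; white space and // comments are skipped.
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0, n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '/' && i + 1 < n && src.charAt(i + 1) == '/') {
                // comment: skip to the end of the line
                while (i < n && src.charAt(i) != '\n') i++;
            } else if (Character.isLetter(c)) {
                // identifier (assumed form: letter {letter | digit})
                int start = i;
                while (i < n && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add(src.substring(start, i));
            } else if (Character.isDigit(c)) {
                // number (assumed form: digit {digit})
                int start = i;
                while (i < n && Character.isDigit(src.charAt(i))) i++;
                tokens.add(src.substring(start, i));
            } else if ("=!<>".indexOf(c) >= 0 && i + 1 < n && src.charAt(i + 1) == '=') {
                // two-character operators: == != <= >=
                tokens.add(src.substring(i, i + 2));
                i += 2;
            } else if ("+-*/%()[]{}=;,.<>".indexOf(c) >= 0) {
                // single-character operators and punctuation
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException(
                    "invalid character '" + c + "' at position " + i);
            }
        }
        return tokens;
    }
}
```

Note that a single '!' is rejected, because the list of operators contains !=
but no negation operator.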
A.3 Semantics
All terms in this document that have a definition are underlined to emphasize their
special meaning. The definitions of these terms are given here.
Reference type
Arrays and classes are called reference types.
Type of a constant
The type of an integer constant (e.g. 17) is int.
The type of a character constant (e.g. 'x') is char.
Same type
Two types are the same
if they are denoted by the same type name, or
if both types are arrays and their element types are the same.
Type compatibility
Two types are compatible
if they are the same, or
if one of them is a reference type and the other is the type of null.
Assignment compatibility
A type src is assignment compatible with a type dst
if src and dst are the same, or
if dst is a reference type and src is the type of null.
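The three relations defined above can be written down almost literally. The
following Java sketch is one possible encoding and not necessarily the one used
in the course. The class Type, its fields and the method names sameAs,
compatibleWith and assignableTo are hypothetical. The sketch assumes that type
names are unique, so that comparing names is enough, and it treats every type as
the same as itself.

```java
final class Type {
    enum Kind { INT, CHAR, CLASS, ARRAY, NULL }

    final Kind kind;
    final String name;    // type name for INT, CHAR and CLASS, otherwise null
    final Type elemType;  // element type for ARRAY, otherwise null

    static final Type INT = new Type(Kind.INT, "int", null);
    static final Type CHAR = new Type(Kind.CHAR, "char", null);
    static final Type NULL = new Type(Kind.NULL, null, null);  // the type of null

    private Type(Kind kind, String name, Type elemType) {
        this.kind = kind;
        this.name = name;
        this.elemType = elemType;
    }

    static Type classType(String name) { return new Type(Kind.CLASS, name, null); }
    static Type arrayOf(Type elemType) { return new Type(Kind.ARRAY, null, elemType); }

    // Arrays and classes are reference types.
    boolean isRefType() { return kind == Kind.CLASS || kind == Kind.ARRAY; }

    // Same type: same type name, or both arrays with the same element type.
    boolean sameAs(Type other) {
        if (this == other) return true;  // assumption: a type is the same as itself
        if (kind == Kind.ARRAY && other.kind == Kind.ARRAY) {
            return elemType.sameAs(other.elemType);
        }
        return name != null && name.equals(other.name);
    }

    // Compatible: same type, or a reference type together with the type of null.
    boolean compatibleWith(Type other) {
        return sameAs(other)
            || (isRefType() && other.kind == Kind.NULL)
            || (kind == Kind.NULL && other.isRefType());
    }

    // Assignment compatible: this is src, the parameter is dst.
    boolean assignableTo(Type dst) {
        return sameAs(dst) || (dst.isRefType() && kind == Kind.NULL);
    }
}
```

Compatibility is symmetric, assignment compatibility is not: null can be
assigned to a variable of a reference type, but not the other way round.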
Predeclared names
int   the type of all integer values
char  the type of all character values
null  the null value of a class or array variable, meaning "pointing to no value"
chr   standard method; chr(i) converts the int expression i into a char value
ord   standard method; ord(ch) converts the char value ch into an int value
len   standard method; len(a) returns the number of elements of the array a
Scope
A scope is the textual range of a method or a class. It extends from the point after the
declaring method or class name to the closing curly bracket of the method or class
declaration. A scope excludes other scopes that are nested within it. We assume that
there is an (artificial) outermost scope (called the universe), to which the main class is
local and which contains all predeclared names. The declaration of a name in an inner
scope hides the declarations of the same name in outer scopes.
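The hiding rule can be illustrated with a chain of scopes in which a name is
looked up from the innermost scope outwards. The following Java sketch is only
an illustration under simplifying assumptions: every declaration is represented
by a descriptive string instead of a symbol table object, and the class Scope
with its methods insert, find and universe is hypothetical. The sketch also
rejects a second declaration of a name in the same scope.

```java
import java.util.HashMap;
import java.util.Map;

final class Scope {
    final Scope outer;  // enclosing scope; null for the universe
    private final Map<String, String> names = new HashMap<>();

    Scope(Scope outer) { this.outer = outer; }

    // Declares name in this scope; what is a free-form description.
    void insert(String name, String what) {
        if (names.containsKey(name)) {
            throw new IllegalStateException(name + " declared twice");
        }
        names.put(name, what);
    }

    // Searches this scope first, then the enclosing scopes up to the universe.
    String find(String name) {
        for (Scope s = this; s != null; s = s.outer) {
            String what = s.names.get(name);
            if (what != null) return what;
        }
        throw new IllegalStateException(name + " is undeclared");
    }

    // Creates the artificial outermost scope with all predeclared names.
    static Scope universe() {
        Scope u = new Scope(null);
        u.insert("int", "predeclared type");
        u.insert("char", "predeclared type");
        u.insert("null", "predeclared constant");
        u.insert("chr", "predeclared method");
        u.insert("ord", "predeclared method");
        u.insert("len", "predeclared method");
        return u;
    }
}
```

Because find starts in the innermost scope, a redeclared name such as int is
found there first, and the predeclared int in the universe stays hidden until
the inner scope is left.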
Note
Indirectly recursive methods are not allowed, since every name must be declared
before it is used. This would not be possible if indirect recursion were allowed.
A predeclared name (e.g. int or char) can be redeclared in an inner scope (but this is
not recommended).
A.4 Context Conditions
General context conditions
Every name must be declared before it is used.
A name must not be declared twice in the same scope.
A program must contain a method named main. It must be declared as a void
method and must not have parameters.
MethodDecl = (Type | "void") ident "(" [FormPars] ")" {VarDecl} "{" {Statement} "}".
If a method is a function it must be left via a return statement (this is checked at run
time).
Expr = Term.
Expr = "-" Term.
Term must be of type int.
Term = Factor.
code    This area contains the code of the methods. The register pc contains the
        index of the currently executed instruction. mainpc contains the start
        address of the method main().
data    This area holds the (static or global) data of the main program. It is an
        array of variables. Every variable holds a single word (32 bits). The
        addresses of the variables are indexes into the array.
heap    This area holds the dynamically allocated objects and arrays. The blocks
        are allocated consecutively. free points to the beginning of the still
        unused area of the heap. Dynamically allocated memory is only returned at
        the end of the program. There is no garbage collector. All object fields
        hold a single word (32 bits). Arrays of char elements are byte arrays.
        Their length is a multiple of 4. Pointers are word offsets into the heap.
        Array objects start with an invisible word, containing the array length.
pstack  This area (the procedure stack) maintains the activation frames of the
        invoked methods. Every frame consists of an array of local variables,
        each holding a single word (32 bits). Their addresses are indexes into
        the array. ra is the return address of the method, dl is the dynamic
        link (a pointer to the frame of the caller). A newly allocated frame is
        initialized with all zeroes.
estack  This area (the expression stack) is used to store the operands of the
        instructions. After every MicroJava statement estack is empty. Method
        parameters are passed on the expression stack and are removed by the
        Enter instruction of the invoked method. The expression stack is also
        used to pass the return value of the method back to the caller.
All data (global variables, local variables, heap variables) are initialized with a null
value (0 for int, chr(0) for char, null for references).
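The heap rules described above (consecutive allocation, no deallocation, a
hidden length word in front of every array, char arrays rounded up to a multiple
of 4 bytes) can be sketched as follows. This is an illustration, not the VM
implementation. The class Heap and its methods are hypothetical. Two details are
assumptions that the text does not state: word 0 of the heap is left unused so
that the address 0 can represent null, and the pointer to an array refers to
its first element, so that the length word lies directly in front of it.

```java
final class Heap {
    private final int[] words;  // Java zero-initializes the array, as required
    private int free = 1;       // assumption: word 0 is never allocated (null)

    Heap(int sizeInWords) { words = new int[sizeInWords]; }

    // Allocates n consecutive words and returns the offset of the first one.
    private int alloc(int n) {
        if (free + n > words.length) throw new IllegalStateException("heap overflow");
        int adr = free;
        free += n;  // blocks are never given back
        return adr;
    }

    // Every object field occupies one word.
    int newObject(int fieldCount) { return alloc(fieldCount); }

    // int elements take one word each; char elements take one byte each,
    // rounded up to whole words.
    int newArray(int length, boolean ofChar) {
        int dataWords = ofChar ? (length + 3) / 4 : length;
        int adr = alloc(1 + dataWords);
        words[adr] = length;  // the invisible length word
        return adr + 1;       // assumption: the pointer refers to element 0
    }

    int arrayLength(int array) { return words[array - 1]; }

    int getWord(int adr) { return words[adr]; }

    void putWord(int adr, int value) { words[adr] = value; }
}
```

With these assumptions, len(a) is a single load from the word in front of the
array, and an array element a[i] of type int is found at word offset a + i.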
B.2 Instruction Set
The following tables show the instructions of the MicroJava VM together with their
encoding and their behaviour. The third column of the tables shows the contents of
estack before and after every instruction, for example
..., val, val
..., val
means that this instruction removes two words from estack and pushes a new word
onto it. The operands of the instructions have the following meaning:
b  a byte
s  a short int (16 bits)
w  a word (32 bits)
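The stack notation and the three operand sizes can be illustrated with a tiny
interpreter loop. The following Java sketch is not the MicroJava VM. The
opcodes PUSH_B, PUSH_S, PUSH_W, ADD and HALT and their numbers are invented for
this example, and the byte order of s and w operands (most significant byte
first) is an assumption.

```java
final class MiniVm {
    // Hypothetical opcodes, for this sketch only.
    static final int PUSH_B = 1, PUSH_S = 2, PUSH_W = 3, ADD = 4, HALT = 5;

    private final byte[] code;
    private int pc = 0;                        // index of the next code byte
    private final int[] estack = new int[16];  // expression stack
    private int sp = 0;                        // number of words on estack

    MiniVm(byte[] code) { this.code = code; }

    private void push(int val) { estack[sp++] = val; }
    private int pop() { return estack[--sp]; }

    // b: a signed byte operand
    private int nextByte() { return code[pc++]; }

    // s: a signed 16-bit operand (assumed: high byte first)
    private int nextShort() {
        return (short) (((code[pc++] & 0xFF) << 8) | (code[pc++] & 0xFF));
    }

    // w: a 32-bit operand (assumed: high byte first)
    private int nextWord() {
        return ((code[pc++] & 0xFF) << 24) | ((code[pc++] & 0xFF) << 16)
             | ((code[pc++] & 0xFF) << 8) | (code[pc++] & 0xFF);
    }

    // Executes instructions until HALT and returns the top of estack.
    int run() {
        while (true) {
            int op = code[pc++];
            switch (op) {
                case PUSH_B: push(nextByte()); break;   // ...       ->  ..., val
                case PUSH_S: push(nextShort()); break;  // ...       ->  ..., val
                case PUSH_W: push(nextWord()); break;   // ...       ->  ..., val
                case ADD: {                             // ..., val, val  ->  ..., val
                    int right = pop();
                    int left = pop();
                    push(left + right);
                    break;
                }
                case HALT: return pop();
                default: throw new IllegalStateException("bad opcode " + op);
            }
        }
    }
}
```

The ADD case corresponds to the example notation above: it removes two words
from estack and pushes one new word onto it.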
Variables of type char are stored in the lowest byte of a word and are manipulated
with word instructions (e.g. load, store). Array elements of type char are stored in a
byte array and are loaded and stored with special instructions.
Arithmetic
Object creation
Stack manipulation
Jumps
Input/Output
Miscellaneous