Compiler Design
Introduction
Falguni Sinhababu
Government College of Engineering and Leather Technology
Books
Compilers: Principles, Techniques and Tools by Aho, Lam, Sethi, Ullman (the "Dragon Book")
Engineering a Compiler by Cooper and Torczon
The Essence of Compilers by Hunter (Prentice-Hall)
Modern Compiler Design by Grune et al. (Wiley)
Definitions
What is a compiler?
A program that accepts as input a program text in a
certain language and produces as output a program
text in another language, while preserving the meaning
of that text (Grune et al, 2000).
A program that reads a program written in one
language (source language) and translates it into an
equivalent program in another language (target
language) (Aho et al)
General Structure of a Compiler
The compiler:
must generate correct code.
must recognise errors.
analyses and synthesises.
Definitions
Interpreter
A computer program that translates an instruction into machine language and
executes it before going to the next instruction.
Cross-compiler
A cross-compiler is a compiler that runs on one machine and produces object
code for another machine. If a compiler has been implemented in its own language,
this arrangement is called a bootstrap arrangement.
Assembler
An assembler translates assembly language source code into executable or
nearly executable object code.
Macro assembler
A macro assembler is a type of assembler that supports the use of macros, which are blocks of
code that can be defined and then invoked multiple times within a program, potentially saving time.
Qualities of a good compiler
Generates correct code (first and foremost).
Generates fast code.
Conforms to the specification of the input language.
Copes with essentially arbitrary input size, variables, etc.
Compilation time proportional (ideally linearly) to the size of the source.
Good diagnostics.
Consistent optimization.
Works well with the debugger.
Principles of compilation
Preserve the meaning of the program being compiled.
Improve the source code in some way.
Speed (of compiled code).
Space (size of compiled code).
Feedback (information provided to the user).
Debugging (transformations should preserve an observable relationship between
source code and target code).
Compilation time efficiency (fast or slow compiler).
Language Processing System
Phases of a Compiler
Compilers and Interpreters
• Compilers generate machine code, whereas Interpreters interpret
intermediate code
• Interpreters are easier to write and can provide better error messages
(symbol table still available)
• Interpreted execution is typically at least 5 times slower than machine code
generated by compilers
• Interpreted programs also require much more memory than machine code
generated by compilers
• Examples: Java, Scala, C#, C, C++ use Compilers. Perl, Ruby, PHP use Interpreters.
Translation Overview – Lexical Analysis
Lexical Analysis (Scanning)
Reads the characters in the source program and groups them into a stream of tokens (the basic
units of syntax).
Each token represents a logically cohesive sequence of characters, such as identifiers,
operators and keywords.
The character sequence that forms a token is called a lexeme.
The output is a stream of tokens, each a pair of the form <type, lexeme> or <token_class,
attribute>
a = b + c becomes
<id, a> <=> <id, b> <+> <id, c>
position = initial + rate * 60 becomes
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
Each id attribute refers to that identifier's record in the symbol table
Lexical analysis eliminates white space
FLEX or LEX can be used to generate scanners: programs that recognize lexical patterns in
text.
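As a rough illustration of scanning (a hand-written Python sketch, not the output of FLEX/LEX), the fragment below groups characters into <token_class, lexeme> pairs for simple assignment statements; the token-class names and regular expressions are assumptions chosen for this example.

    import re

    # Assumed token classes and patterns for this sketch only.
    TOKEN_SPEC = [
        ("NUMBER", r"\d+"),
        ("ID",     r"[A-Za-z_]\w*"),
        ("ASSIGN", r"="),
        ("OP",     r"[+\-*/]"),
        ("SKIP",   r"[ \t]+"),   # white space is eliminated, not emitted
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def tokenize(source):
        """Yield <token_class, lexeme> pairs for one line of source text."""
        for match in MASTER.finditer(source):
            if match.lastgroup != "SKIP":
                yield (match.lastgroup, match.group())

    print(list(tokenize("position = initial + rate * 60")))
    # [('ID', 'position'), ('ASSIGN', '='), ('ID', 'initial'),
    #  ('OP', '+'), ('ID', 'rate'), ('OP', '*'), ('NUMBER', '60')]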
Token Types
Keywords: if, else, int, char, do, while, for, struct, return etc
Constants: often these are numbers, strings or characters
Numbers are numeric literals (integers, floating-point values, etc.)
Strings are text items the language can recognize
In C or C++ → “This is string”
Characters are single letters
In C or C++ →‘C’
Identifiers: names the programmer has given to something. These include variables, functions,
classes, enumerators, etc. Each language has rules specifying how these names can be
written.
Operators: these are the mathematical, logical and other operators that the language can
recognize
+, -, *, /, % (modulo), -- (decrement), ++ (increment) etc
Other tokens: symbols such as { ( ) } may be valid in the language but are not treated as keywords or operators
Attributes of Tokens
The lexical analyser returns to the parser a representation for each token it has found. The
representation is an integer code if the token is a simple construct such as a left parenthesis, comma or
colon. The representation is a pair of an integer code and a pointer to a table entry if the token is a more
complex element such as an identifier or constant. The integer code gives the token type and the
pointer points to the value of that token.
The token names and the associated attribute values for the FORTRAN statement
E = M * C ** 2 are written below as a sequence of pairs
<id, pointer to symbol table entry for E>
<assign_op>
<id, pointer to symbol table entry for M>
<multi_op>
<id, pointer to symbol table entry for C>
<exp_op>
<number, integer value 2>
For operators, punctuation symbols and keywords, there is no need for an attribute value. In this example,
the token number has been given an integer-valued attribute.
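A minimal sketch, assuming a dictionary-based symbol table, of how identifier tokens can carry a pointer (here simply an entry number) to their symbol-table entry while operators carry no attribute; the lexemes of E = M * C ** 2 are fed in already split.

    def install(symbol_table, name):
        """Return the symbol-table entry number for name, adding an entry if needed."""
        if name not in symbol_table:
            symbol_table[name] = len(symbol_table) + 1   # 1-based entry number
        return symbol_table[name]

    symbol_table = {}
    tokens = []
    for lexeme in ["E", "=", "M", "*", "C", "**", "2"]:
        if lexeme.isidentifier():
            tokens.append(("id", install(symbol_table, lexeme)))   # pointer into the table
        elif lexeme.isdigit():
            tokens.append(("number", int(lexeme)))                 # integer-valued attribute
        else:
            tokens.append((lexeme,))                               # operators: no attribute
    print(tokens)        # [('id', 1), ('=',), ('id', 2), ('*',), ('id', 3), ('**',), ('number', 2)]
    print(symbol_table)  # {'E': 1, 'M': 2, 'C': 3}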
Lexical Analysis
• A lexical analyser (LA) can be generated automatically from a regular-expression specification
• LEX and Flex are two such tools
Translation Overview – Syntax Analysis
Parsing or Syntax Analysis
• Syntax analyzers (parsers) can be generated automatically from several
variants of context-free grammar specifications
• LL(1) or LALR(1) are the most popular ones
• ANTLR (for LL(1)), YACC and Bison (for LALR(1)) are such tools; a hand-written sketch follows below
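As referenced above, a minimal hand-written recursive-descent sketch (not what ANTLR or YACC/Bison would emit) for the assumed grammar expr -> term (('+'|'-') term)* and term -> factor (('*'|'/') factor)*; it takes a simplified token list and builds a nested-tuple syntax tree.

    class Parser:
        """Recursive-descent parser for simple arithmetic expressions."""

        def __init__(self, tokens):
            self.tokens = tokens          # e.g. ['b', '+', 'c', '*', 'd']
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def eat(self):
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok

        def expr(self):                   # expr -> term (('+'|'-') term)*
            node = self.term()
            while self.peek() in ("+", "-"):
                node = (self.eat(), node, self.term())
            return node

        def term(self):                   # term -> factor (('*'|'/') factor)*
            node = self.factor()
            while self.peek() in ("*", "/"):
                node = (self.eat(), node, self.factor())
            return node

        def factor(self):                 # factor -> identifier or number
            return self.eat()

    print(Parser(["b", "+", "c", "*", "d"]).expr())
    # ('+', 'b', ('*', 'c', 'd'))  -- '*' binds tighter than '+'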
Translation Overview – Semantic Analysis
Semantic Analysis
• Semantic consistency that cannot be handled at the parsing stage is handled
here
• Type checking of various language constructs is one of the most important tasks
• Stores type information in the symbol table or the syntax tree
• Types of variables, function parameters, array dimensions, etc.
• Used not only for semantic validation but also for subsequent phases of compilation
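A minimal type-checking sketch over such a syntax tree, assuming a hypothetical language with only int and float, arithmetic operators that widen int to float, and a symbol table already filled in by earlier phases; all names here are illustrative.

    # Hypothetical symbol table recorded by earlier phases.
    symbol_types = {"initial": "float", "rate": "float", "count": "int"}

    def type_of(node):
        """Return the type of an expression node; mixed int/float widens to float."""
        if isinstance(node, tuple):              # operator node: (op, left, right)
            _, left, right = node
            return "float" if "float" in (type_of(left), type_of(right)) else "int"
        if isinstance(node, int):                # integer literal
            return "int"
        return symbol_types[node]                # identifier: look up its declared type

    print(type_of(("+", "initial", ("*", "rate", 60))))   # 'float'
    print(type_of(("*", "count", 2)))                     # 'int'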
Translation overview – Intermediate Code Generation
Intermediate Code Generation
• Generating machine code directly from source code entails two problems
• With m languages and n target machines, we need to write m × n compilers
• The code optimizer, which is one of the largest and most difficult-to-write components of
any compiler, cannot be reused
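A sketch of generating one common intermediate form, three-address code, from the nested-tuple syntax tree used in the earlier sketches; the temporary names t1, t2, ... and the output format are assumptions of this example.

    def gen_three_address(node, code, counter):
        """Flatten a nested-tuple expression into three-address instructions."""
        if not isinstance(node, tuple):
            return str(node)                      # identifier or constant: use it directly
        op, left, right = node
        l = gen_three_address(left, code, counter)
        r = gen_three_address(right, code, counter)
        counter[0] += 1
        temp = f"t{counter[0]}"
        code.append(f"{temp} = {l} {op} {r}")
        return temp

    code, counter = [], [0]
    result = gen_three_address(("+", "initial", ("*", "rate", 60)), code, counter)
    code.append(f"position = {result}")
    print("\n".join(code))
    # t1 = rate * 60
    # t2 = initial + t1
    # position = t2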
Different types of Intermediate Code
• The type of intermediate code deployed depends on the application
• Quadruples, triples, indirect triples, abstract syntax trees are the classical forms
used for machine-independent optimizations
• Static Single Assignment (SSA) form is a more recent form and enables more effective
optimizations
• Conditional constant propagation and global value numbering are more effective on
SSA
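Schematically, the three-address code generated above can be stored as quadruples (operator, argument 1, argument 2, result) or as triples, where a result is referred to by the index of the instruction that computes it; the exact field layout below is illustrative.

    # Quadruples: every instruction names its result explicitly.
    quadruples = [
        ("*", "rate",    "60",  "t1"),
        ("+", "initial", "t1",  "t2"),
        ("=", "t2",      None,  "position"),
    ]

    # Triples: the result is implicit; later instructions refer to it by position.
    triples = [
        ("*", "rate",     "60"),     # (0)
        ("+", "initial",  "(0)"),    # (1)
        ("=", "position", "(1)"),    # (2)
    ]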
Translation Overview – Code Optimization
Machine Independent Code Optimization
• The intermediate code generation process introduces many inefficiencies.
• Extra copies of variables, using variables instead of constants, repeated evaluation
of expressions, etc. (a copy-propagation sketch follows below)
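As referenced above, a minimal sketch of removing one such inefficiency by propagating copies of the form x = y through a straight-line block of three-address code; this is a simplified local pass, not the full dataflow-based version, and the helper names are illustrative.

    def propagate_copies(block):
        """Rewrite operands in a straight-line block using known 'x = y' copies (simplified)."""
        copies = {}                               # target -> source of plain copies
        out = []
        for line in block:
            target, expr = [s.strip() for s in line.split("=", 1)]
            expr = " ".join(copies.get(tok, tok) for tok in expr.split())
            if expr.isidentifier():               # a plain copy: remember it
                copies[target] = expr
            else:                                 # target redefined by a computation
                copies.pop(target, None)
            out.append(f"{target} = {expr}")
        return out

    print(propagate_copies(["t1 = b", "t2 = t1 + c", "a = t2"]))
    # ['t1 = b', 't2 = b + c', 'a = t2']  -- t1 can later be removed if unused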
Examples of Machine Independent Code Optimization
• Common sub-expression elimination
• Copy propagation
• Loop invariant code motion
• Partial redundancy elimination
• Induction variable elimination and strength reduction
• Code optimization needs information about the program
• Which expressions are being recomputed in a function?
• Which definitions reach a point?
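As a concrete illustration of the first item above, a sketch of local common sub-expression elimination over straight-line three-address code; it assumes no operand is redefined inside the block, which a real compiler would have to check.

    def eliminate_common_subexpressions(block):
        """Reuse the earlier result when a right-hand side repeats (straight-line code only)."""
        seen = {}                                 # right-hand side text -> variable holding it
        out = []
        for line in block:
            target, rhs = [s.strip() for s in line.split("=", 1)]
            if rhs in seen:
                out.append(f"{target} = {seen[rhs]}")   # reuse the earlier computation
            else:
                seen[rhs] = target
                out.append(line)
        return out

    print(eliminate_common_subexpressions(["t1 = b * c", "t2 = b * c", "a = t1 + t2"]))
    # ['t1 = b * c', 't2 = t1', 'a = t1 + t2']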
Translation Overview – Code Generation
Code Generation
Converts intermediate code to machine code
Each intermediate code instruction may result in many machine instructions or
vice-versa
Must handle all aspects of machine architecture
Registers, pipelining, cache, multiple function units, etc.
Generating efficient code is an NP-complete problem
Tree pattern matching-based strategies are the best and most common
Needs tree intermediate code
Storage allocation decisions are made here
Register allocation and assignment are the most important problems
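A deliberately naive code-generation sketch for straight-line three-address code, targeting a hypothetical load/store instruction set (LD source, register; ADD/MUL operand, register; ST register, destination) and sidestepping register allocation by taking a fresh register per instruction.

    OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

    def generate(block):
        """Translate 'x = y op z' and 'x = y' lines into hypothetical assembly."""
        asm = []
        for i, line in enumerate(block):
            target, rhs = [s.strip() for s in line.split("=", 1)]
            parts = rhs.split()
            reg = f"R{i + 1}"                     # naive: one fresh register per instruction
            if len(parts) == 3:                   # x = y op z
                y, op, z = parts
                asm += [f"LD {y}, {reg}", f"{OPCODES[op]} {z}, {reg}", f"ST {reg}, {target}"]
            else:                                 # x = y (a plain copy)
                asm += [f"LD {rhs}, {reg}", f"ST {reg}, {target}"]
        return asm

    print("\n".join(generate(["t1 = rate * 60", "t2 = initial + t1", "position = t2"])))
    # LD rate, R1 / MUL 60, R1 / ST R1, t1 / LD initial, R2 / ADD t1, R2 / ST R2, t2 / ...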
Machine-Dependent Optimization
Peephole optimization
Analyze a sequence of instructions in a small window (peephole) and, using preset
patterns, replace them with a more efficient sequence
Redundant instruction elimination
E.g., replace the sequence LD A, R1; ST R1, A with LD A, R1 (the store just writes back the value loaded; a sketch of this pattern appears after this list)
Eliminate “jump to jump” instructions
Use machine idioms (use INC instead of LD and ADD)
Instruction scheduling (reordering) to eliminate pipeline interlocks and to increase
parallelism
Trace scheduling to increase the size of basic blocks and increase parallelism
Software pipelining to increase parallelism in loops
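As referenced above, a sketch of one peephole pattern: deleting a store that immediately writes back the value just loaded into the same register (the LD A, R1; ST R1, A case). The instruction syntax matches the hypothetical assembly sketch above and is an assumption, as is the single-pattern matcher.

    def peephole(asm):
        """Delete 'ST Rk, X' that directly follows 'LD X, Rk' (redundant write-back)."""
        out = []
        for instr in asm:
            if out and instr.startswith("ST"):
                reg, addr = [s.strip() for s in instr[2:].split(",")]
                if out[-1] == f"LD {addr}, {reg}":
                    continue                      # X already holds exactly this value
            out.append(instr)
        return out

    print(peephole(["LD A, R1", "ST R1, A", "ADD B, R1", "ST R1, A"]))
    # ['LD A, R1', 'ADD B, R1', 'ST R1, A']  -- only the redundant first store is removed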
Thank You