Phases of a Compiler
Introduction to Compiling
Terminology
• Compiler:
– a program that translates an executable program in one
language into an executable program in another
language
– we expect the program produced by the compiler to be
better, in some way, than the original
• Interpreter:
– a program that reads an executable program and
produces the results of running that program
– usually, this involves executing the source program in
some fashion
Abstract view
Source code -> Compiler -> Machine code (the compiler also reports errors)
• Recognizes legal (and illegal) programs
• Generates correct code
• Manages storage of all variables and code
• Agrees on a format for object (or
assembly) code
Front-end, Back-end division
Source code -> Front end -> IR -> Back end -> Machine code (errors are reported along the way)
• Front end maps legal code into IR
• Back end maps IR onto target machine
• Simplify retargeting
• Allows multiple front ends
• Multiple passes -> better code
Front end
Source code -> Scanner -> tokens -> Parser -> IR (the scanner and the parser each report errors)
• Scanner:
– Maps characters into tokens – the basic unit of syntax
• x = x + y becomes <id, x> = <id, x> + <id, y>
– Typical tokens: number, id, +, -, *, /, do, end
– Eliminate white space (tabs, blanks, comments)
• A key issue is speed, so instead of using a tool like
LEX it is sometimes necessary to write your own
scanner
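The scanner described above can be sketched by hand in a few lines. This is a minimal illustration, not a production scanner: the function name `scan`, the token kinds, and the keyword set (`do`, `end`, taken from the token list above) are assumptions made for the example.

```python
def scan(source):
    """Map a string of characters to a list of (kind, text) tokens."""
    keywords = {"do", "end"}          # assumed keyword set for this sketch
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():              # white space is dropped, never tokenized
            i += 1
        elif ch.isalpha() or ch == "_":
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            word = source[start:i]
            tokens.append(("keyword" if word in keywords else "id", word))
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("number", source[start:i]))
        elif ch in "=+-*/":
            tokens.append((ch, ch))   # one-character operators name themselves
            i += 1
        else:
            raise SyntaxError(f"illegal character {ch!r} at position {i}")
    return tokens


print(scan("x = x + y"))
# [('id', 'x'), ('=', '='), ('id', 'x'), ('+', '+'), ('id', 'y')]
```

A hand-written loop like this touches each character once, which is the usual argument for writing a scanner by hand when speed matters.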
Front end
• Parser:
– Recognize context-free syntax
– Guide context-sensitive analysis
– Construct IR
– Produce meaningful error messages
– Attempt error correction
• There are parser generators, such as YACC, which
automate much of the work
Front end
• Context-free grammars are used to describe the
syntax of programming languages
Middle end (optimizer)
• Code improvement analyzes and changes the IR
• Goal is to reduce runtime
• Modern optimizers are usually built as a set
of passes
• Typical passes
– Constant propagation
– Common sub-expression elimination
– Redundant store elimination
– Dead code elimination
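Two of the passes listed above can be sketched on a toy IR. The sketch assumes straight-line code (no branches), instructions without side effects, and an IR of `(dest, op, arg1, arg2)` tuples. The tuple layout and the function names are invented for this illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def constant_propagation(code):
    """Substitute known constants for variables and fold constant operations."""
    known = {}                        # variable name -> constant value
    out = []
    for dest, op, a, b in code:
        a = known.get(a, a)           # replace an operand by its constant, if known
        b = known.get(b, b)
        if op == "copy" and isinstance(a, int):
            known[dest] = a
        elif op in OPS and isinstance(a, int) and isinstance(b, int):
            op, a, b = "copy", OPS[op](a, b), None   # fold at compile time
            known[dest] = a
        else:
            known.pop(dest, None)     # dest no longer holds a known constant
        out.append((dest, op, a, b))
    return out


def dead_code_elimination(code, live_out):
    """Drop assignments whose result is never used afterwards (backward scan)."""
    live = set(live_out)
    kept = []
    for dest, op, a, b in reversed(code):
        if dest in live:
            live.discard(dest)
            live.update(x for x in (a, b) if isinstance(x, str))
            kept.append((dest, op, a, b))
    kept.reverse()
    return kept


program = [
    ("n", "copy", 4, None),
    ("t", "*", "n", 2),
    ("u", "+", "t", "x"),
    ("v", "copy", "n", None),         # v is never used
]
optimized = dead_code_elimination(constant_propagation(program), live_out={"u"})
print(optimized)
# [('u', '+', 8, 'x')]
```

Running the passes in sequence shows why optimizers are built this way: constant propagation turns `t` into a constant, which then lets dead code elimination remove the assignments to `n` and `t`.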
The Phases of a Compiler
Phase-1: Lexical Analysis
• The lexical analyzer reads the stream of characters making up the source program and
groups the characters into meaningful sequences called lexemes.
• For each lexeme, the lexical analyzer produces a token of the form
(token-name, attribute-value)
that it passes on to the subsequent phase, syntax analysis.
• token-name: an abstract symbol that is used during syntax analysis.
• attribute-value: points to an entry in the symbol table for this token.
Example:
newval := oldval + 12
Tokens:
newval    Identifier
:=        Assignment operator
oldval    Identifier
+         Add operator
12        Number
The lexical analyzer discards white space and reports lexical errors (such as
illegal characters).
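The (token-name, attribute-value) form can be illustrated with a small table-driven tokenizer. This is a sketch under assumptions: the token names, the use of a plain list as the symbol table, and 1-based symbol-table positions as attribute values are all choices made for this example.

```python
import re

# Token specification: order matters, the first matching alternative wins.
TOKEN_SPEC = [
    ("id",     r"[A-Za-z_]\w*"),
    ("number", r"\d+"),
    ("assign", r":="),
    ("add",    r"\+"),
    ("skip",   r"\s+"),
    ("error",  r"."),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table) for a source string."""
    symbol_table = []                 # one entry per distinct identifier
    tokens = []
    for match in PATTERN.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "skip":            # white space is discarded
            continue
        if kind == "error":
            raise SyntaxError(f"illegal character {lexeme!r} at position {match.start()}")
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            # attribute-value: position of the identifier in the symbol table
            tokens.append((kind, symbol_table.index(lexeme) + 1))
        elif kind == "number":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))   # operators carry no attribute here
    return tokens, symbol_table


tokens, table = tokenize("newval := oldval + 12")
print(tokens)
# [('id', 1), ('assign', None), ('id', 2), ('add', None), ('number', 12)]
print(table)
# ['newval', 'oldval']
```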
Phase-2: Syntax Analysis
• Also called parsing.
• The parser uses the first components of the tokens produced by the lexical
analyzer to create a tree-like intermediate representation that depicts the
grammatical structure of the token stream.
• A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the
arguments of the operation
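A syntax tree of this kind can be built with a small recursive-descent parser. The grammar below (assignment, `+ -`, `* /`, identifiers and numbers) and the nested-tuple tree format are assumptions made for this sketch, and the input is assumed to be already split into tokens.

```python
def parse_assignment(tokens):
    """Parse `id := expr` and return a nested-tuple syntax tree.

    Assumed grammar:
        stmt   -> id ':=' expr
        expr   -> term   { ('+' | '-') term }
        term   -> factor { ('*' | '/') factor }
        factor -> id | number
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok.isdigit():
            return ("num", int(advance()))
        if tok.isidentifier():
            return ("id", advance())
        raise SyntaxError(f"unexpected token {tok!r}")

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())   # interior node: operator with two children
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    target = peek()
    if target is None or not target.isidentifier():
        raise SyntaxError("expected an identifier on the left-hand side")
    advance()
    if peek() != ":=":
        raise SyntaxError("expected ':='")
    advance()
    tree = (":=", ("id", target), expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse_assignment("newval := oldval + 12".split()))
# (':=', ('id', 'newval'), ('+', ('id', 'oldval'), ('num', 12)))
```

Each interior tuple is an operation and its children are the arguments, matching the description of the syntax tree above.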
Phase-3: Semantic Analysis
• The semantic analyzer uses the syntax tree and the information in the
symbol table to check the source program for semantic consistency with
the language definition.
• Gathers type information and saves it in either the syntax tree or the
symbol table, for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands.
• For example, many programming language definitions require an array
index to be an integer; the compiler must report an error if a floating-point
number is used to index an array.
• Example: newval := oldval + 12
The type of the identifier newval must match the type of the expression (oldval + 12).
Example: semantic analysis
• Syntactically correct, but semantically incorrect:
int a;
double sum;
char b;
sum = a + b;
• Semantic records:
a      integer
sum    double
b      char
• Result: data type mismatch
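The type check described above can be sketched as a recursive walk over the syntax tree. The sketch assumes a deliberately strict rule (both operands of every operator, including assignment, must have exactly the same type, with no implicit conversions), a nested-tuple tree, and a plain dictionary as the symbol table. Real languages such as C apply conversion rules instead of rejecting mixed types outright.

```python
def type_of(node, symbols):
    """Return the type of a syntax-tree node, or raise TypeError on a mismatch."""
    kind = node[0]
    if kind == "num":
        return "int"                  # integer literals only, in this sketch
    if kind == "id":
        name = node[1]
        if name not in symbols:
            raise TypeError(f"undeclared identifier {name!r}")
        return symbols[name]
    # Interior node: (operator, left, right)
    op, left, right = node
    left_type = type_of(left, symbols)
    right_type = type_of(right, symbols)
    if left_type != right_type:
        raise TypeError(f"type mismatch: {left_type} {op} {right_type}")
    return left_type


# newval := oldval + 12, with both variables declared as int
ok_tree = (":=", ("id", "newval"), ("+", ("id", "oldval"), ("num", 12)))
print(type_of(ok_tree, {"newval": "int", "oldval": "int"}))
# int

# sum = a + b, with int a; double sum; char b;
bad_tree = ("=", ("id", "sum"), ("+", ("id", "a"), ("id", "b")))
try:
    type_of(bad_tree, {"a": "int", "sum": "double", "b": "char"})
except TypeError as err:
    print(err)
# type mismatch: int + char
```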
Phase-4: Intermediate Code Generation
After syntax and semantic analysis of the source program, many compilers generate
an explicit low-level or machine-like intermediate representation (a program for an
abstract machine). This intermediate representation should have two important
properties:
• it should be easy to produce and
• it should be easy to translate into the target machine.
The intermediate form considered here is called three-address code, which consists of a
sequence of assembly-like instructions with three operands per instruction. Each
operand can act like a register.
This phase bridges the analysis and synthesis phases of translation.
Example:
newval := oldval + fact * 1
Temp1 = Id3 * 1
Id1 = Id2 + Temp1
(Id1, Id2 and Id3 are the symbol-table entries for newval, oldval and fact.)
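Three-address code can be produced by a post-order walk of the syntax tree, creating one temporary per interior node. This is a minimal sketch: the nested-tuple tree format and the function name are assumptions, and it prints variable names where the example above uses symbol-table entries (Id1, Id2, Id3).

```python
import itertools


def three_address_code(tree):
    """Flatten an assignment tree into a list of three-address instructions."""
    code = []
    temps = (f"Temp{n}" for n in itertools.count(1))   # Temp1, Temp2, ...

    def emit(node):
        """Emit code for a subtree and return the name that holds its value."""
        if node[0] in ("id", "num"):
            return str(node[1])       # leaves are used directly as operands
        op, left, right = node
        l = emit(left)
        r = emit(right)
        temp = next(temps)
        code.append(f"{temp} = {l} {op} {r}")
        return temp

    _, target, value = tree           # root is (':=', ('id', name), expression)
    if value[0] in ("id", "num"):
        code.append(f"{target[1]} = {value[1]}")
    else:
        op, left, right = value       # the last operation writes the target directly
        l = emit(left)
        r = emit(right)
        code.append(f"{target[1]} = {l} {op} {r}")
    return code


tree = (":=", ("id", "newval"),
        ("+", ("id", "oldval"), ("*", ("id", "fact"), ("num", 1))))
for line in three_address_code(tree):
    print(line)
# Temp1 = fact * 1
# newval = oldval + Temp1
```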
Phase-6: Code Generation
• The last phase of translation is code generation.
• Takes as input an intermediate representation of the source program and maps it into
the target language
• If the target language is machine code, registers or memory locations are
selected for each of the variables used by the program.
• Then, the intermediate instructions are translated into
sequences of machine instructions that perform the same task.
• A crucial aspect of code generation is the judicious assignment of registers
to hold variables.
Example:
Id1 := Id2 + Id3 * 1

MOV R1, Id3
MUL R1, #1
MOV R2, Id2
ADD R1, R2
MOV Id1, R1
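A naive code generator for the MOV/MUL/ADD notation used in the example can be sketched as follows. It is an illustration only and makes strong assumptions: an unlimited supply of registers, temporaries named `Temp...` that are each read exactly once, and three-address instructions given as `(dest, op, arg1, arg2)` tuples. Real code generators must allocate a limited set of registers carefully, as noted above.

```python
MNEMONIC = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}
COMMUTATIVE = {"+", "*"}


def generate(code):
    """Translate three-address tuples into pseudo-assembly lines."""
    asm = []
    reg_of = {}                       # temporary name -> register holding it
    next_reg = 1

    def load(operand):
        """Return a register holding `operand`, emitting a MOV if needed."""
        nonlocal next_reg
        if operand in reg_of:
            return reg_of[operand]
        reg = f"R{next_reg}"
        next_reg += 1
        text = f"#{operand}" if isinstance(operand, int) else operand
        asm.append(f"MOV {reg}, {text}")
        return reg

    def source(operand):
        """Return the text of a second operand: register or immediate."""
        if operand in reg_of:
            return reg_of[operand]
        if isinstance(operand, int):
            return f"#{operand}"      # constants are used as immediates
        return load(operand)

    for dest, op, a, b in code:
        if op in COMMUTATIVE and b in reg_of and a not in reg_of:
            a, b = b, a               # accumulate into the register already in use
        acc = load(a)
        src = source(b)
        asm.append(f"{MNEMONIC[op]} {acc}, {src}")
        if dest.startswith("Temp"):
            reg_of[dest] = acc        # the temporary now lives in this register
        else:
            asm.append(f"MOV {dest}, {acc}")   # store the result to memory
    return asm


ir = [("Temp1", "*", "Id3", 1), ("Id1", "+", "Id2", "Temp1")]
for line in generate(ir):
    print(line)
# MOV R1, Id3
# MUL R1, #1
# MOV R2, Id2
# ADD R1, R2
# MOV Id1, R1
```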
Symbol-Table Management
• The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name.
• The data structure should be designed to allow the compiler to find the record for each
name quickly and to store or retrieve data from that record quickly
• These attributes may provide information about the storage allocated for a name, its type,
its scope (where in the program its value may be used), and in the case of procedure
names, such things as the number and types of its arguments, the method of passing each
argument (for example, by value or by reference), and the type returned.
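A symbol table with these properties can be sketched as a stack of hash tables, one per scope. The class and method names and the attribute keys below are hypothetical, chosen only for this illustration. Dictionary lookup gives the fast find and store operations the data structure calls for.

```python
class SymbolTable:
    """A stack of dictionaries, one per scope; the innermost scope is last."""

    def __init__(self):
        self.scopes = [{}]            # start with the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, **attributes):
        """Create the record for `name` in the current scope."""
        scope = self.scopes[-1]
        if name in scope:
            raise NameError(f"{name!r} already declared in this scope")
        scope[name] = attributes

    def lookup(self, name):
        """Find the record for `name`, searching from the innermost scope outward."""
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"{name!r} is not declared")


table = SymbolTable()
# A procedure record: parameter types, passing method, and return type.
table.declare("fact", kind="procedure",
              params=[("n", "int", "by value")], returns="int")
table.enter_scope()
table.declare("n", kind="variable", type="int")
print(table.lookup("n")["type"])          # int
print(table.lookup("fact")["returns"])    # int
table.exit_scope()
```

After `exit_scope()`, a lookup of `n` raises `NameError`, which models the scope attribute: the name is only usable inside the procedure body.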
Context of a Compiler
Preprocessor -> Source Program -> Compiler -> Target Assembly Program -> Assembler
-> Relocatable Object Code -> Linker (with Libraries and Relocatable Object Files)
-> Absolute Machine Code
Try for example: gcc -v myprog.c
• The programs that assist the compiler in
converting skeletal source code into executable
form make up the context of a compiler. They are
as follows:
• Preprocessor:
The preprocessor scans the source code and includes the header files, which
contain relevant information for various functions.
• Compiler:
The compiler passes the source code through various phases and generates the
target assembly code.
• Assembler:
The assembler converts the assembly code into relocatable machine code, or object code.
Although this code consists of 0s and 1s, it cannot be executed yet because it has not
been assigned actual memory addresses.
• Loader/Link Editor:
It performs two functions. Loading consists of taking relocatable machine code, altering the
relocatable addresses, and placing the altered instructions and data in memory at the proper
locations.
The link editor makes a single program from several files of relocatable machine code. These
files include the library files that the program needs.
The loader/link editor produces the executable, or absolute, machine code.