Compiler Construction
Mrs. Jyoti C. Bachhav
Computers are a balanced mix of software and hardware.
Hardware is a physical device whose functions are controlled by compatible software.
Hardware understands instructions in the form of electronic charge, which is the counterpart of binary language in software programming.
Binary language has only two alphabets, 0 and 1. To instruct the hardware, codes must be written in binary format, which is simply a series of 1s and 0s.
Writing such codes directly would be a difficult and cumbersome task for computer programmers, which is why we have compilers to generate them.
A computer system is made of hardware and software.
The hardware understands a language which humans cannot understand.
So we write programs in a high-level language, which is easier for us to understand and remember.
These programs are then fed into a series of tools and OS components to obtain the desired code that can be used by the machine. This is known as the Language Processing System.
Preprocessor: A preprocessor, generally considered a part of the compiler, is a tool that produces input for compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.
Compiler: A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code and target code, and may involve many passes.
Assembler: An assembler translates assembly language programs into machine code. The
output of an assembler is called an object file, which contains a combination of machine
instructions as well as the data required to place these instructions in memory.
Linker: A linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search for and locate referenced modules/routines in a program and to determine the memory locations where these codes will be loaded, so that the program instructions have absolute references.
Loader: A loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.
Cross-compiler: A compiler that runs on platform (A) and is capable of generating executable
code for platform (B) is called a cross-compiler.
A compiler is a computer program which transforms source code written in a high-level language into low-level machine language.
It translates code written in one programming language into another language without changing the meaning of the code.
The compiler also makes the end code efficient, optimizing it for execution time and memory space.
The compiling process includes basic translation mechanisms and error detection.
The compiler goes through lexical, syntax and semantic analysis at the front end, and code generation and optimization at the back end.
A compiler can broadly be divided into two
phases based on the way they compile.
1)Analysis Phase:
Known as the front-end of the compiler,
the analysis phase of the compiler reads the
source program, divides it into core parts and
then checks for lexical, grammar and syntax
errors. The analysis phase generates an
intermediate representation of the source
program and symbol table, which should be fed
to the Synthesis phase as input.
2) Synthesis Phase:-
Known as the back-end of the compiler,
the synthesis phase generates the target
program with the help of intermediate source
code representation and symbol table.
Pass: A pass refers to one traversal of the compiler through the entire program.
What is a Pattern?
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword used as a token, the pattern is the exact sequence of characters that forms the keyword.
The main task of lexical analysis is to read
input characters in the code and produce
tokens.
The lexical analyser scans the entire source code of the program and identifies each token one by one. Scanners are usually implemented to produce tokens only when requested by the parser.
1)"Get next token" is a command which is sent
from the parser to the lexical analyser.
2)On receiving this command, the lexical
analyser scans the input until it finds the
next token.
3) It returns the token to Parser.
The lexical analyser skips whitespace and comments while creating these tokens. If any error is present, the lexical analyser will correlate that error with the source file and line number.
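This request-driven interaction can be sketched in C as follows. This is a minimal sketch; all names such as get_next_token, Token and the token categories are illustrative assumptions, not a fixed interface.

#include <ctype.h>
#include <stdio.h>

typedef enum { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF } TokenType;

typedef struct {
    TokenType type;
    char lexeme[64];
} Token;

static const char *src;   /* remaining source text */

/* Called by the parser ("get next token"): skip whitespace,
   then group characters into the next lexeme and return its token. */
Token get_next_token(void) {
    Token t = { TOK_EOF, "" };
    int i = 0;
    while (isspace((unsigned char)*src))        /* whitespace is skipped */
        src++;
    if (*src == '\0')
        return t;                               /* end of input */
    if (isalpha((unsigned char)*src)) {         /* identifier or keyword */
        t.type = TOK_ID;
        while (isalnum((unsigned char)*src) && i < 63)
            t.lexeme[i++] = *src++;
    } else if (isdigit((unsigned char)*src)) {  /* number */
        t.type = TOK_NUM;
        while (isdigit((unsigned char)*src) && i < 63)
            t.lexeme[i++] = *src++;
    } else {                                    /* single-character operator */
        t.type = TOK_OP;
        t.lexeme[i++] = *src++;
    }
    t.lexeme[i] = '\0';
    return t;
}

int main(void) {
    src = "int x = 42;";
    for (Token t = get_next_token(); t.type != TOK_EOF; t = get_next_token())
        printf("token: %s\n", t.lexeme);
    return 0;
}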
The lexical analyser:
1) Reads the source program, scans the input characters, groups them into lexemes and produces tokens as output.
2) Helps to enter identified tokens into the symbol table.
3) Removes white spaces and comments from the source program.
4) Correlates error messages with the source program, i.e., displays an error message with its occurrence by specifying the line number.
5) Helps to expand macros if they are found in the source program.
Consider the following code that is fed to Lexical
Analyser
#include <stdio.h>
int maximum(int x, int y)
{
// This will compare 2 numbers
if (x > y)
return x;
else
{
return y;
}
}
Lexeme     Token
int        Keyword
maximum    Identifier
(          Operator
int        Keyword
x          Identifier
,          Operator
int        Keyword
y          Identifier
)          Operator
{          Operator
if         Keyword
The speed of lexical analysis is a concern. The lexical analyser may need to look ahead several characters before a match can be found.
The lexical analyser scans the input from left to right, one character at a time. It uses two pointers, the begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the input scanned.
Initially, both pointers point to the first character of the input string.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is encountered, it indicates the end of the lexeme; in the example, as soon as fp encounters a blank space, the lexeme "int" is identified.
When fp encounters white space, it ignores it and moves ahead. Then both the begin pointer (bp) and the forward pointer (fp) are set to the start of the next token.
The input characters are read from secondary storage, but reading from secondary storage in this way is costly, hence a buffering technique is used. A block of data is first read into a buffer and then scanned by the lexical analyser. Two methods are used in this context: 1) the Buffer Pair scheme, 2) the Sentinel scheme. These are explained below.
The input buffer is divided into two halves of N characters each, where N is the number of characters on one disk block, e.g. 1024.
The input buffer uses the two pointers, begin pointer (bp) and forward pointer (fp), to keep track of the portion of the input scanned.
If the forward pointer moves beyond the buffer halfway mark, the other half is filled with the next characters from the source file.
Since the forward pointer moves from the left half into the right half (and back again), there is a possibility that we may lose characters that have not yet been grouped into tokens.
In the Buffer Pair scheme, each time the forward pointer is moved, a check is done to ensure that one half of the buffer has not moved off; if it has, the other half must be reloaded.
Therefore the ends of the buffer halves require two tests for each advance of the forward pointer:
Test 1: for the end of the buffer.
Test 2: to determine what character is read.
Advantages of the Sentinel scheme:
1) Most of the time, it performs only one test, to see whether the forward pointer points to an eof (the sentinel).
2) Only when it reaches the end of a buffer half or the real eof does it perform more tests.
3) Since N input characters are encountered between eofs, the average number of tests per input character is very close to 1.
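A minimal C sketch of the sentinel scheme follows. The buffer layout, the function names and the choice of '\0' as the sentinel character are assumptions made for illustration.

#include <stdio.h>

#define N 1024              /* characters per buffer half (one disk block) */
#define SENTINEL '\0'       /* assumed sentinel: a character that cannot
                               appear inside the source text */

static char buf[2 * N + 2]; /* two halves, each followed by a sentinel slot */
static char *forward;       /* the forward pointer (fp) */
static FILE *src;

/* Refill one buffer half and terminate it with the sentinel. */
static void load(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

/* Advance fp: in the common case only one test (against the sentinel)
   is needed per character, instead of two as in the plain buffer pair. */
static char next_char(void) {
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward == buf + N + 1) {            /* hit end of first half */
            load(buf + N + 1);                   /* refill second half */
            c = *forward++;
        } else if (forward == buf + 2 * N + 2) { /* hit end of second half */
            load(buf);                           /* refill first half */
            forward = buf;
            c = *forward++;
        }
        /* otherwise the sentinel marks the real end of input */
    }
    return c;
}

int main(void) {
    src = stdin;
    load(buf);
    forward = buf;
    for (char c = next_char(); c != SENTINEL; c = next_char())
        putchar(c);
    return 0;
}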
The lexical analyser represents each token by a regular expression, and a regular expression is represented by a transition diagram.
Representing valid tokens of a language in regular
expression:-
If x is a regular expression, then:
x* means zero or more occurrence of x.
i.e., it can generate {Є, x, xx, xxx, xxxx, … }
x+ means one or more occurrence of x.
i.e., it can generate { x, xx, xxx, xxxx … } or x.x*
x? means at most one occurrence of x
i.e., it can generate either {x} or {Є}.
[a-z] is all lower-case alphabets of English language.
[A-Z] is all upper-case alphabets of English language.
[0-9] is all natural digits used in mathematics.
Representing occurrence of symbols using
regular expressions:-
letter = [a-z] or [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representing language tokens using regular
expressions:-
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyser is
how to verify the validity of a regular expression
used in specifying the patterns of keywords of a
language. A well-accepted solution is to use
finite automata for verification.
Finite automata is a state machine that takes a string of
symbols as input and changes its state accordingly. Finite
automata is a recognizer for regular expressions. When a
regular expression string is fed into finite automata, it
changes its state for each literal. If the input string is
successfully processed and the automata reaches its final
state, it is accepted, i.e., the string just fed was said to be
a valid token of the language in hand.
Example: We assume an FA that accepts any three-digit binary value ending in the digit 1. FA = (Q, Σ, δ, q0, F), where Σ = {0, 1}, q0 is the start state and F = {qf} is the set of final states.
For example, the transition diagram of the Identifier regular expression is as follows:
Identifier = (letter)(letter | digit)*
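A minimal C sketch of this transition diagram, with a start state and a single accepting state that loops on letters and digits; the function name is an illustrative assumption:

#include <ctype.h>
#include <stdio.h>

/* Implements (letter)(letter | digit)* as a two-state automaton. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)*s))   /* start state -> accept on letter */
        return 0;
    s++;
    while (*s) {                       /* accept state loops on letter|digit */
        if (!isalnum((unsigned char)*s))
            return 0;
        s++;
    }
    return 1;                          /* ended in the accepting state */
}

int main(void) {
    printf("%d %d %d\n", is_identifier("maximum"),
           is_identifier("x1"), is_identifier("1x"));  /* prints 1 1 0 */
    return 0;
}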
Consider the following grammar and input string for bottom-up parsing:
S → S + S
S → S - S
S → (S)
S → a
Input string:
a1 - (a2 + a3)
Shift-Reduce Parsing:
Shift-reduce parsing uses two unique steps for
bottom-up parsing. These steps are known as
shift-step and reduce-step.
Shift step: The shift step refers to the
advancement of the input pointer to the next
input symbol, which is called the shifted symbol.
This symbol is pushed onto the stack. The shifted
symbol is treated as a single node of the parse
tree.
Reduce step: When the parser finds a complete grammar rule (RHS) and replaces it with the (LHS), it is known as a reduce-step. This occurs when the top of the stack contains a handle. To reduce, a POP function is performed on the stack, which pops off the handle and replaces it with the LHS non-terminal symbol.
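For the grammar given earlier (S → S+S | S-S | (S) | a) and the input a1 - (a2 + a3), one possible shift-reduce trace, treating a1, a2 and a3 as instances of the token a, is:

Stack        Input            Action
$            a1-(a2+a3)$      shift a1
$ a1         -(a2+a3)$        reduce S → a
$ S          -(a2+a3)$        shift -
$ S-         (a2+a3)$         shift (
$ S-(        a2+a3)$          shift a2
$ S-(a2      +a3)$            reduce S → a
$ S-(S       +a3)$            shift +
$ S-(S+      a3)$             shift a3
$ S-(S+a3    )$               reduce S → a
$ S-(S+S     )$               reduce S → S+S
$ S-(S       )$               shift )
$ S-(S)      $                reduce S → (S)
$ S-S        $                reduce S → S-S
$ S          $                accept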
There are two main categories of shift reduce parsing as
follows:
1) Operator-Precedence Parsing
2) LR-Parser
LR Parser
LR parsing is one type of bottom-up parsing. It is used to parse a large class of grammars.
In LR parsing,
"L" stands for left-to-right scanning of the input, and "R" stands for constructing a right-most derivation in reverse.
LR algorithm:
The LR algorithm requires a stack, input, output and a parsing table. In all types of LR parsing, the input, output and stack are the same, but the parsing table is different.
Fig: Block diagram of LR parser
The input buffer is used to indicate the end of the input; it contains the string to be parsed followed by a $ symbol.
A stack is used to hold a sequence of grammar symbols, with a $ at the bottom of the stack.
The parsing table is a two-dimensional array. It contains two parts: the Action part and the Goto part.
Augment Grammar:-
The augmented grammar G' is generated by adding one more production to the given grammar G. It helps the parser identify when to stop parsing and announce acceptance of the input.
Example
Given grammar
S → AA
A → aA | b
The augmented grammar G' is represented by
S`→ S
S → AA
A → aA | b
Canonical Collection of LR(0) items
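As a worked example for the augmented grammar above (S' → S, S → AA, A → aA | b), the canonical collection of LR(0) item sets can be constructed as:

I0 = closure(S' → •S) = { S' → •S, S → •AA, A → •aA, A → •b }
I1 = goto(I0, S) = { S' → S• }
I2 = goto(I0, A) = { S → A•A, A → •aA, A → •b }
I3 = goto(I0, a) = { A → a•A, A → •aA, A → •b }
I4 = goto(I0, b) = { A → b• }
I5 = goto(I2, A) = { S → AA• }
I6 = goto(I3, A) = { A → aA• }
with goto(I2, a) = goto(I3, a) = I3 and goto(I2, b) = goto(I3, b) = I4.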
Derivation example. Production rules:
E→E+E
E → E * E
E → id
Input string: id + id * id
Left-most Derivation:
E
⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
Notice that the left-most non-terminal is always processed first.
Right-most Derivation
If we scan and replace the input with
production rules, from right to left, it is
known as right-most derivation. The
sentential form derived from the right-most
derivation is called the right-sentential form.
The right-most derivation is:
E
⇒ E + E
⇒ E + E * E
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id
A parse tree is a graphical depiction of a
derivation. It is convenient to see how strings
are derived from the start symbol. The start
symbol of the derivation becomes the root of
the parse tree.
We take the left-most derivation of a + b * c (each name is read as the token id):
E
⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
In a parse tree:
• All leaf nodes are terminals.
• All interior nodes are non-terminals.
• In-order traversal gives the original input string.
A parse tree depicts associativity and
precedence of operators. The deepest sub-tree
is traversed first, therefore the operator in that
sub-tree gets precedence over the operator
which is in the parent nodes.
Ambiguity/ Ambiguous Grammar
A grammar G is said to be ambiguous if it has
more than one parse tree (left or right
derivation) for at least one string.
Example
E→E+E
E→E–E
E → id
For the string id + id - id, the above grammar generates two parse trees, sketched below:
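Sketched as text, the two trees are:

        E                          E
      / | \                      / | \
     E  -  E                    E  +  E
   / | \   |                    |   / | \
  E  +  E  id                  id  E  -  E
  |     |                          |     |
  id    id                         id    id

  (id + id) - id               id + (id - id)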
A context-free language is said to be inherently ambiguous if every grammar that generates it is ambiguous.
Ambiguity in a grammar is not good for compiler construction.
No method can detect and remove ambiguity
automatically, but it can be removed by either
re-writing the whole grammar without ambiguity,
or by setting and following associativity and
precedence constraints.
Associativity
If an operand has operators on both sides, the side on
which the operator takes this operand is decided by the
associativity of those operators.
If the operation is left-associative, the operand will be taken by the left operator; if the operation is right-associative, the right operator will take the operand.
Example
Operations such as Addition, Multiplication, Subtraction,
and Division are left associative. If the expression
contains:
id op id op id
it will be evaluated as:
(id op id) op id
For example, (id + id) + id
Operations like Exponentiation are right associative, i.e.,
the order of evaluation in the same expression will be:
id op (id op id)
For example, id ^ (id ^ id)
Precedence
If two different operators share a common
operand, the precedence of operators decides
which will take the operand.
That is, 2+3*4 can have two different parse
trees:
one corresponding to (2+3)*4 and
another corresponding to 2+(3*4). By setting
precedence among operators, this problem can
be easily removed.
As in the previous example, mathematically *
(multiplication) has precedence over +
(addition), so the expression 2+3*4 will always
be interpreted as:
2 + (3 * 4)
These methods decrease the chances of
ambiguity in a language or its grammar.
The syntax analyser follows production rules defined by means of a context-free grammar. The way the production rules are implemented (derivation) divides parsing into two types:
1) Top Down Parsing
2) Bottom-up Parsing
1) Top Down Parsing:
When the parser starts constructing the
parse tree from the start symbol and then
tries to transform the start symbol to the
input, it is called top-down parsing.
A) Top-down parser with back-tracking:
Top-down parsers start from the root node (start symbol) and match the input string against the production rules to replace them (if matched).
To understand this, take the following
example of CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string "read", a top-down parser will behave like this:
Step 1: It will start with S from the production rules and will match its yield to the left-most letter of the input, i.e. 'r'. The first production of S (S → rXd) matches it.
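Step 2: The parser then advances to the next input letter, 'e', and tries the first alternative of X (X → oa); since 'o' does not match 'e', it back-tracks.
Step 3: Trying the next alternative X → ea, the letters 'e' and 'a' match, and the final 'd' of the production matches the last input letter, so the string "read" is accepted.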
2) Recursive Descent Parsing:
A recursive descent parser is constructed by writing a recursive routine for each non-terminal of the grammar G, as sketched below.
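A minimal C sketch of such routines for the grammar above (S → rXd | rZd, X → oa | ea, Z → ai); the back-tracking style and all names are illustrative assumptions:

#include <stdio.h>

static const char *input;   /* input string */
static int pos;             /* current position in the input */

/* Match one terminal character; advance only on success. */
static int term(char c) { return input[pos] == c ? (pos++, 1) : 0; }

static int X(void) {
    int save = pos;
    if (term('o') && term('a')) return 1;         /* X -> oa */
    pos = save;                                   /* back-track */
    if (term('e') && term('a')) return 1;         /* X -> ea */
    pos = save;
    return 0;
}

static int Z(void) {
    int save = pos;
    if (term('a') && term('i')) return 1;         /* Z -> ai */
    pos = save;
    return 0;
}

static int S(void) {
    int save = pos;
    if (term('r') && X() && term('d')) return 1;  /* S -> rXd */
    pos = save;                                   /* back-track */
    if (term('r') && Z() && term('d')) return 1;  /* S -> rZd */
    pos = save;
    return 0;
}

int main(void) {
    input = "read";
    pos = 0;
    printf("%s\n", (S() && input[pos] == '\0') ? "accepted" : "rejected");
    return 0;
}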
3) Predictive Parser:
Predictive parser is a recursive descent
parser, which has the capability to predict
which production is to be used to replace the
input string. The predictive parser does not
suffer from backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next input symbol. To make the parser back-tracking free, the predictive parser puts some constraints on the grammar and accepts only a class of grammars known as LL(k) grammars.
Predictive parsing uses a stack and a parsing table
to parse the input and generate a parse tree.
Both the stack and the input contains an end
symbol $ to denote that the stack is empty and
the input is consumed. The parser refers to the
parsing table to take any decision on the input
and stack element combination.
In recursive descent parsing, the parser may have more than one production to choose from for a single instance of input, whereas in a predictive parser, each step has at most one production to choose. There might be instances where no production matches the input string, causing the parsing procedure to fail.
LL Parser
An LL parser is denoted LL(k). The first L in LL(k) stands for parsing the input from left to right, the second L stands for left-most derivation, and k represents the number of look-ahead symbols. Generally k = 1, so LL(k) may also be written as LL(1).
LL Parsing Algorithm:
We stick to deterministic LL(1) for the parser explanation, as the size of the table grows exponentially with the value of k. Secondly, if a given grammar is not LL(1), then usually it is not LL(k) for any given k either.
Given below is an algorithm for LL(1) Parsing:
Input:
string ω
parsing table M for grammar G
Output:
If ω is in L(G), then a left-most derivation of ω; an error otherwise.
repeat
    let X be the top stack symbol and a the symbol pointed to by ip
    if X ∈ Vt or X = $
        if X = a
            POP X and advance ip
        else
            error()
        endif
    else /* X is a non-terminal */
        if M[X, a] = X → Y1 Y2 ... Yk
            POP X
            PUSH Yk, Yk-1, ..., Y1 /* Y1 on top */
            output the production X → Y1 Y2 ... Yk
        else
            error()
        endif
    endif
until X = $ /* empty stack */
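As a worked sketch, take the grammar S → AA, A → aA | b (its FIRST and FOLLOW sets are computed in the next section). The LL(1) table is M[S, a] = M[S, b] = S → AA, M[A, a] = A → aA, M[A, b] = A → b. Parsing ω = abb then proceeds as follows (stack top on the right):

Stack      Input    Action
$ S        abb$     S → AA
$ A A      abb$     A → aA
$ A A a    abb$     match a
$ A A      bb$      A → b
$ A b      bb$      match b
$ A        b$       A → b
$ b        b$       match b
$          $        accept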
First and Follow Sets
An important part of parsing table construction is to create the FIRST and FOLLOW sets. These sets give the possible positions of terminals in derivations. They are used when building the parsing table: the entry M[A, t] = α records the decision to replace the non-terminal A by the production α when the look-ahead terminal is t.
First Set
This set is created to know what terminal symbol is
derived in the first position by a non-terminal.
For example, for α → tβ, α derives t (a terminal) in the very first position, so t ∈ FIRST(α).
Algorithm for calculating the First set
Look at the definition of the FIRST(α) set:
1) If α is a terminal, then FIRST(α) = { α }.
2) If α is a non-terminal and α → ℇ is a production, then ℇ ∈ FIRST(α).
3) If α is a non-terminal with a production α → 𝜸1 𝜸2 𝜸3 … 𝜸n, then t ∈ FIRST(α) if t ∈ FIRST(𝜸1), or if t ∈ FIRST(𝜸i) and all of 𝜸1 … 𝜸(i-1) can derive ℇ.
The First set can thus be seen as: FIRST(α) = { t | α ⇒* tβ } ∪ { ℇ | α ⇒* ℇ }
Follow Set
Likewise, we calculate what terminal symbol
immediately follows a non-terminal α in production
rules. We do not consider what the non-terminal can
generate but instead, we see what would be the next
terminal symbol that follows the productions of a non-
terminal.
Algorithm for calculating the Follow set:
1) If α is the start symbol, then $ ∈ FOLLOW(α).
2) If α is a non-terminal with a production α → AB, then everything in FIRST(B) except ℇ is in FOLLOW(A).
3) If α is a non-terminal with a production α → AB, where B can derive ℇ (or B is empty), then FOLLOW(α) is in FOLLOW(A).
The Follow set can be seen as: FOLLOW(α) = { t | S ⇒* βαtγ }
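For example, for the grammar S → AA, A → aA | b used earlier:

FIRST(A) = { a, b }          (from A → aA and A → b)
FIRST(S) = FIRST(A) = { a, b }
FOLLOW(S) = { $ }            (S is the start symbol)
FOLLOW(A) = { a, b, $ }      (from S → AA: FIRST(A) follows the first A, and FOLLOW(S) follows the second A)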
In the previous chapter we learnt how a parser constructs parse trees in the syntax analysis phase. The plain parse tree constructed in that phase is generally of no use for a compiler, as it does not carry any information about how to evaluate the tree.
The productions of the context-free grammar, which make up the rules of the language, do not say how to interpret them.
For example
E → E + T
The above CFG production has no semantic rule
associated with it, and it cannot help in making
any sense of the production.
Semantics Analysis:
Semantics of a language provide meaning to its
constructs, like tokens and syntax structure.
Semantics help interpret symbols, their types,
and their relations with each other.
Semantic analysis judges whether the syntax
structure constructed in the source program
derives any meaning or not.
CFG + semantic rules = Syntax Directed
Definitions
For example: int a = "value";
should not issue an error in the lexical and syntax analysis phases, as it is lexically and structurally correct, but it should generate a semantic error, as the type of the assignment differs.
These rules are set by the grammar of the language and evaluated in semantic analysis. The following tasks are performed in semantic analysis:
1) Scope resolution
2) Type checking
3) Array-bound checking
4) Recognizing semantic errors
Some of the semantic errors that the semantic analyser is expected to recognize:
Type mismatch
Undeclared variable
Reserved identifier misuse
Multiple declarations of a variable in a scope
Accessing an out-of-scope variable
Actual and formal parameter mismatch
SDT refers to a method of compiler
implementation where the source language
translation is completely driven by the
parser.
The parsing process and parse trees are used
to direct semantic analysis and the
translation of the source program.
We can augment a grammar with information to control semantic analysis and translation. Such grammars are called attribute grammars.
Associated attributes with each grammar
symbol that describes its properties.
An attribute has a name and associated
value.
With each production in a grammar, give
semantic rules or actions.
There are 2 ways to represent the semantic
rules associated with grammar symbols:
1) Syntax Directed Definitions(SDD)
2) Syntax Directed Translation Schemes(SDT)
Definition of SDD: A Syntax Directed Definition (SDD) is a context-free grammar together with attributes and rules.
Attributes are associated with grammar symbols
and rules are associated with productions.
For example:
Production        Semantic Rule
E → E1 + T        E.value = E1.value || T.value || '+'
SDDs are highly readable, high-level specifications for translations, but they hide implementation details; for example, they do not specify the order of evaluation of the semantic actions.
Syntax Directed Translation Schemes (SDTs) embed program fragments called semantic actions within production bodies.
SDTs are more efficient than SDDs, as they indicate the order of evaluation of the semantic actions associated with a production rule.
Semantic attributes may be assigned values from their domain at the time of parsing and evaluated at the time of assignment or in conditions.
Based on the way the attributes get their values,
they can be broadly divided into two categories :
a) Synthesized attributes and
b) Inherited attributes.
A) Synthesized attributes:
These attributes get values from the attribute
values of their child nodes.
To illustrate, assume the following production:
S → ABC
If S is taking values from its child nodes (A,B,C),
then it is said to be a synthesized attribute, as
the values of ABC are synthesized to S.
As in our previous example (E → E1 + T), the
parent node E gets its value from its child node.
Synthesized attributes never take values from
their parent nodes or any sibling nodes.
An SDD that involves only synthesized attributes is called S-attributed.
SDD of Synthesized Attributes
A synthesized value is computed from the values returned by child nodes, much as this recursive function computes its result from the result of its recursive call:
int fact(int n)
{
    if (n <= 1)
        return 1;
    else
        return n * fact(n - 1);   /* value built up from the child call */
}
fact(6)   /* evaluates to 720 */
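A standard example of an S-attributed SDD is the desk-calculator grammar, in which every attribute value is synthesized from the children:

Production       Semantic Rule
E → E1 + T       E.val = E1.val + T.val
E → T            E.val = T.val
T → T1 * F       T.val = T1.val * F.val
T → F            T.val = F.val
F → digit        F.val = digit.lexval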
Data structure for the symbol table
A compiler contains two types of symbol tables:
A) the global symbol table and
B) scope symbol tables.
The global symbol table can be accessed by all procedures, whereas a scope symbol table can be accessed only within its own scope.
The scope of a name and its symbol table are arranged in a hierarchical structure, as sketched below:
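A minimal C sketch of such a hierarchy; the struct layout and names (Symbol, SymbolTable, lookup) are illustrative assumptions:

#include <stdio.h>
#include <string.h>

typedef struct Symbol {
    const char *name;
    const char *type;
    struct Symbol *next;          /* chain of symbols in one table */
} Symbol;

typedef struct SymbolTable {
    Symbol *symbols;
    struct SymbolTable *parent;   /* enclosing scope; NULL for global */
} SymbolTable;

/* Search the current scope first, then the enclosing scopes. */
Symbol *lookup(SymbolTable *t, const char *name) {
    for (; t != NULL; t = t->parent)
        for (Symbol *s = t->symbols; s != NULL; s = s->next)
            if (strcmp(s->name, name) == 0)
                return s;
    return NULL;                  /* undeclared name */
}

int main(void) {
    Symbol g = { "pro_one", "proc", NULL };
    SymbolTable global = { &g, NULL };
    Symbol x = { "x", "int", NULL };
    SymbolTable inner = { &x, &global };   /* scope nested inside global */
    printf("%s\n", lookup(&inner, "pro_one") ? "found" : "undeclared");
    return 0;
}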
1) Postfix notation: the operator is written after its operands; for example, for the production E → id, the translation is E.code = id, with the action print id.
2) Three address code:-
Three-address code is an intermediate code used by optimizing compilers.
In three-address code, the given expression is broken down into several separate instructions, which can easily translate into assembly language.
Each three-address code instruction has at most three operands. It is a combination of assignment and a binary operator.
Example
Given expression: a := (-c * b) + (-c * d)
The three-address code is as follows:
t1 := -c
t2 := b * t1
t3 := -c
t4 := d * t3
t5 := t2 + t4
a := t5
The temporaries t1 … t5 may be held in registers in the target program.
Number   Op    Arg1   Arg2
(1)      +     c      d
(2)      *     (0)    (1)
(3)      :=    (2)    -
Triples face the problem of code immovability during optimization: the results are positional, so changing the order or position of an expression may cause problems.
Indirect Triples:
This representation is an enhancement over
triples representation. It uses pointers
instead of position to store results. This
enables the optimizers to freely re-position
the sub-expression to produce an optimized
code.
4) Quadruples:-
Quadruples have four fields to implement three-address code. The fields of a quadruple contain the operator, the first source operand, the second source operand and the result, respectively, as in the worked example below.
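For the earlier expression a := (-c * b) + (-c * d), the quadruples are:

Op       Arg1   Arg2   Result
uminus   c             t1
*        b      t1     t2
uminus   c             t3
*        d      t3     t4
+        t2     t4     t5
:=       t5            a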
item = 10;
do
{
value = value + item;
} while(value<100);
4) Induction Variable and Strength Reduction:
Strength reduction replaces an expensive (high-strength) operator with a cheaper (low-strength) one.
An induction variable is a variable whose value is updated inside a loop by an assignment of the form i = i + constant.
Before reduction the code is:
i = 1;
while(i<10)
{
y = i * 4;
}
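After reduction, the multiplication inside the loop is replaced by a running addition. This is a sketch; it assumes the loop also steps i, which the fragment above omits:

i = 1;
t = 4;              /* t tracks the value of i * 4 */
while (t < 40)      /* the test i < 10 becomes t < 40 */
{
    y = t;
    t = t + 4;      /* cheap addition instead of i * 4 */
}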
t0 = a + b
t0 = a + b
d = t0 + t1
Example:
Consider the following three-address statements:
1) S1 := 4 * i
2) S2 := a[S1]
3) S3 := 4 * i
4) S4 := b[S3]
5) S5 := S2 * S4
6) S6 := prod + S5
7) prod := S6
8) S7 := i + 1
9) i := S7
10) if i <= 20 goto (1)
Construction of DAG: