Compiler Construction

Mrs. Jyoti C. Bachhav
 Computers are a balanced mix of software and
hardware.
 Hardware is just a piece of machinery; its
functions are controlled by compatible software.
 Hardware understands instructions in the form of
electronic charge, which is the counterpart of binary
language in software programming.
 Binary language has only two alphabets, 0 and 1. To
instruct the hardware, codes must be written in
binary format, which is simply a series of 1s and 0s.
 It would be a difficult and cumbersome task for
computer programmers to write such codes directly, which is
why we have compilers to translate human-readable code into them.
 A computer system is made of hardware and
software.
 The hardware understands a language which
humans cannot understand.
 So we write programs in a high-level language,
which is easier for us to understand and
remember.
 These programs are then fed into a series of
tools and OS components to get the desired
code that can be used by the machine. This
is known as the Language Processing System.
 Preprocessor: A preprocessor, generally considered as a part of compiler, is a tool that produces
input for compilers. It deals with macro-processing, augmentation, file inclusion, language
extension, etc.

 Interpreter: An interpreter, like a compiler, translates high-level language into low-level
machine language. An interpreter reads a statement from the input, converts it to an
intermediate code, executes it, then takes the next statement in sequence. If an error occurs,
an interpreter stops execution and reports it, whereas a compiler reads the whole program
even if it encounters several errors.

 Compiler: A compiler reads the whole source code at once, creates tokens, checks semantics,
generates intermediate code, translates the whole program, and may involve many passes.

 Assembler: An assembler translates assembly language programs into machine code. The
output of an assembler is called an object file, which contains a combination of machine
instructions as well as the data required to place these instructions in memory.

 Linker: Linker is a computer program that links and merges various object files together in
order to make an executable file. All these files might have been compiled by separate
assemblers. The major task of a linker is to search and locate referenced modules/routines in a
program and to determine the memory locations where these codes will be loaded, making the
program instructions have absolute references.

 Loader: Loader is a part of the operating system and is responsible for loading executable files into
memory and executing them. It calculates the size of a program (instructions and data) and
creates memory space for it. It initializes various registers to initiate execution.

 Cross-compiler: A compiler that runs on platform (A) and is capable of generating executable
code for platform (B) is called a cross-compiler.
A compiler is a computer program which
helps you transform source code written in a
high-level language into low-level machine
language.
 It translates the code written in one
programming language to some other
language without changing the meaning of
the code.
 The compiler also makes the end code
efficient which is optimized for execution
time and memory space.
 The compiling process includes basic
translation mechanisms and error detection.
The compilation process goes through lexical,
syntax, and semantic analysis at the front
end, and code generation and optimization
at the back end.
A compiler can broadly be divided into two
phases based on the way they compile.
1)Analysis Phase:
Known as the front-end of the compiler,
the analysis phase of the compiler reads the
source program, divides it into core parts and
then checks for lexical, grammar and syntax
errors. The analysis phase generates an
intermediate representation of the source
program and a symbol table, which are fed
to the synthesis phase as input.
 2) Synthesis Phase:-
Known as the back-end of the compiler,
the synthesis phase generates the target
program with the help of intermediate source
code representation and symbol table.
 Pass: A pass refers to one traversal of the
compiler through the entire program.
 Phase: A phase of a compiler is a
distinguishable stage, which takes input from
the previous stage, processes it and yields
output that can be used as input for the next
stage. A pass can have more than one phase.
 Lexical Analysis:
The first phase of scanner works as a text scanner. This
phase scans the source code as a stream of characters and
converts it into meaningful lexemes. Lexical analyzer
represents these lexemes in the form of tokens as:
<token-name, attribute-value>
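For instance, the statement position = initial + rate * 60 (the standard textbook example) would be grouped into lexemes and passed on as the token stream:
<id, position> <=> <id, initial> <+> <id, rate> <*> <num, 60>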
 Syntax Analysis:
The next phase is called the syntax analysis or parsing. It
takes the token produced by lexical analysis as input and
generates a parse tree (or syntax tree). In this phase,
token arrangements are checked against the source code
grammar, i.e. the parser checks if the expression made by
the tokens is syntactically correct.
 Semantic Analysis:
Semantic analysis checks whether the parse tree
constructed follows the rules of the language; for example,
that values are assigned only between compatible data
types, and that a string is not added to an integer. Also, the
semantic analyser keeps track of identifiers, their types and
expressions, and whether identifiers are declared before
use. The semantic analyser produces an annotated syntax
tree as output.
 Intermediate Code Generation:
After semantic analysis the compiler generates an
intermediate code of the source code for the target
machine. It represents a program for some abstract
machine. It is in between the high-level language and the
machine language. This intermediate code should be
generated in such a way that it makes it easier to be
translated into the target machine code.
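For the same statement position = initial + rate * 60, the intermediate code might take the form of three-address code (a sketch; the exact instructions vary between compilers):
t1 = rate * 60
t2 = initial + t1
position = t2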
 Code Optimization :
The next phase does code optimization of the intermediate
code. Optimization can be assumed as something that
removes unnecessary code lines, and arranges the
sequence of statements in order to speed up the program
execution without wasting resources (CPU, memory).
 Code Generation:
In this phase, the code generator takes the optimized
representation of the intermediate code and maps it to the
target machine language. The code generator translates
the intermediate code into a sequence of (generally) re-
locatable machine code. Sequence of instructions of
machine code performs the task as the intermediate code
would do.
 Symbol Table:
It is a data-structure maintained throughout
all the phases of a compiler. All the
identifier's names along with their types
are stored here. The symbol table
makes it easier for the compiler to
quickly search the identifier record and
retrieve it. The symbol table is also used
for scope management.
 Error Handler:
Each phase of the compiler can encounter
errors. After detecting an error, the error handler
deals with it so that compilation can
proceed further.
1)Single Pass Compilers
2)Two Pass Compilers
3)Multipass Compilers

 1) Single Pass Compiler:- In a single pass
compiler, the source code is directly transformed into
machine code, for example, Pascal.
 2)Two Pass Compilers:-Two pass Compiler is
divided into two sections, viz.
 Front end: It maps legal code into Intermediate
Representation (IR).
 Back end: It maps IR onto the target machine
 The Two pass compiler method also simplifies
the retargeting process. It also allows multiple
front ends.
 3) Multipass Compilers:
 The multipass compiler processes the source
code or syntax tree of a program several times.
It divides a large program into multiple small
programs and processes them. It develops multiple
intermediate codes. Each pass takes
the output of the previous pass as its input, so
it requires less memory. It is also known as a 'Wide
Compiler'.
 Cross Compiler: A compiler that runs on one
machine and produces target code for
another machine is called a cross compiler.
 This can be achieved by the bootstrapping
technique.
 Bootstrapping:- A compiler can be
characterized by three languages:
the source language (S),
the target language (T),
and the implementation language (I).
The three languages S, I, and T can be quite
different. Such a compiler is called a cross-
compiler and is represented by a T-diagram.
Thank You
Mrs. Jyoti C. Bachhav
 Lexical analysis is the first phase of a compiler.
 It takes the modified source code, written in the form of
sentences, from the language preprocessor.
 The lexical analyser breaks this input into a series of
tokens, removing any whitespace and comments in the
source code.
 Programs that perform lexical analysis are called lexical
analyzers or lexers. A lexer contains a tokenizer or scanner.
 Example:- How Pleasant Is The Weather?
 See this example; here, we can easily recognize that there
are five words: How, Pleasant, Is, The, Weather. This is very
natural for us, as we can recognize the separators, blanks,
and the punctuation symbol.
 Example:- HowPl easantIs Th ewe ather?
 Now, check this example: we can still read it. However,
it will take some time, because the separators are put in
odd places. It is not something which comes to you
immediately.
 What's a lexeme?
A lexeme is a sequence of characters in the
source program that matches the pattern of a
token. It is nothing but an instance of a token.
 What's a token?
 A token is a sequence of characters which represents a
unit of information in the source program, such as:
• keywords,
• constants,
• identifiers,
• numbers,
• operators and punctuation symbols.
 What is a Pattern?
A pattern is a description of the form that the lexemes of a
token may take. In the case of a keyword used as a token,
the pattern is just the sequence of characters forming the keyword.
 The main task of lexical analysis is to read
input characters in the code and produce
tokens.
 Lexical analyser scans the entire source code
of the program. It identifies each token one
by one. Scanners are usually implemented to
produce tokens only when requested by a
parser.
1)"Get next token" is a command which is sent
from the parser to the lexical analyser.
2)On receiving this command, the lexical
analyser scans the input until it finds the
next token.
3) It returns the token to Parser.
 Lexical Analyser skips whitespaces and
comments while creating these tokens. If any
error is present, then Lexical analyser will
correlate that error with the source file and
line number.
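A minimal C sketch (not from the slides) of this parser-lexer hand-shake, using the lex convention yylex() introduced later in these notes:

extern int yylex(void);            /* returns the next token, 0 at end of input */

void parse(void)
{
    int token;
    /* each iteration is one "get next token" request (step 1);       */
    /* yylex() scans the input until it finds the next token (step 2) */
    /* and returns it to the parser here (step 3)                     */
    while ((token = yylex()) != 0) {
        /* ... parser consumes one token per call ... */
    }
}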
 Reads the source program, scans the input
characters, group them into lexemes and
produce the token as output.
 Helps to enter the identified tokens into the symbol table.
 Removes white spaces and comments from
the source program.
 Correlates error messages with the source
program, i.e., displays the error message with its
occurrence by specifying the line number.
 Helps to expand macros if they are found
in the source program.
 Consider the following code that is fed to Lexical
Analyser
 #include <stdio.h>
int maximum(int x, int y)
{
// This will compare 2 numbers
if (x > y)
return x;
else
{
return y;
}
}
Lexeme     Token
int        Keyword
maximum    Identifier
(          Operator
int        Keyword
x          Identifier
,          Operator
int        Keyword
y          Identifier
)          Operator
{          Operator
if         Keyword
 The speed of lexical analysis is a concern; the lexical
analyser may need to look ahead several characters
before a match can be found.
 The lexical analyser scans the input from left to
right one character at a time. It uses two
pointers, the token begin pointer (bp) and the forward
pointer (fp), to keep track of the portion of the input
scanned.
 Initially both pointers point to the first
character of the input string, as shown below.
 The forward pointer moves ahead to search for the end of the lexeme. As
soon as a blank space is encountered, it indicates the end of the
lexeme. In the example, as soon as the forward pointer (fp) encounters a blank space,
the lexeme "int" is identified.
 When fp encounters white space, it ignores it and moves ahead. Then both the begin
pointer (bp) and the forward pointer (fp) are set at the next token.
 The input characters are read from secondary storage, but
reading from secondary storage in this way is costly, hence
a buffering technique is used. A block of data is first read into a
buffer, and then scanned by the lexical analyser. There are two
methods used in this context: 1) Buffer Pair Scheme, 2) Sentinel.
These are explained in the following.
 The input buffer is divided into two halves of N
characters each, where N is the number of characters
on one disk block, e.g., 1024.
 The input buffer uses two pointers, the token begin
pointer (bp) and the forward pointer (fp), to keep track
of the portion of the input scanned.
 If the forward pointer moves beyond the buffer halfway
mark, the other half is filled with the next characters
from the source file.
 Since the forward pointer moves from the left half to the right
half and back again, there is a possibility that we may lose
characters that have not yet been grouped into tokens.
 In the buffer pair scheme, each time the forward
pointer is moved, a check is done to ensure that one
half of the buffer has not moved off; if it has,
the other half must be reloaded.
 Therefore the ends of the buffer halves require two
tests for each advance of the forward pointer:
 Test 1: for the end of the buffer.
 Test 2: to determine what character is read.

 So we choose another buffering scheme: sentinels.
 The usage of sentinel reduces the two tests to one by extending each
buffer half to hold a sentinel character at the end.
 The sentinel is a special character that cannot be part of the source
program. (eof character is used as sentinel).

 Advantages
 1) Most of the time, it performs only one test, to see whether the forward
pointer points to an eof.
 2) Only when it reaches the end of a buffer half or an eof does it perform
more tests.
 3) Since N input characters are encountered between eofs, the average
number of tests per input character is very close to 1.
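A minimal C sketch of the sentinel scheme, assuming '\0' is used as the sentinel (the slides use eof) and that the caller loads the first half with reload(src, buf) before scanning; names are illustrative:

#include <stdio.h>

#define N 1024                          /* characters per buffer half          */
static char buf[2*N + 2];               /* two halves, one sentinel slot each  */
static char *forward = buf;

/* Refill one half and terminate it with the sentinel. */
static void reload(FILE *src, char *half)
{
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';                     /* sentinel: assumed absent from source */
}

/* Advance the forward pointer with a single test per character. */
static int next_char(FILE *src)
{
    int c = *forward++;
    if (c == '\0') {                           /* hit a sentinel               */
        if (forward == buf + N + 1) {          /* end of first half            */
            reload(src, buf + N + 1);          /* refill second half           */
            c = *forward++;
        } else if (forward == buf + 2*N + 2) { /* end of second half           */
            reload(src, buf);                  /* refill first half            */
            forward = buf;
            c = *forward++;
        } else {
            return EOF;                        /* real end of input            */
        }
        if (c == '\0') return EOF;             /* freshly loaded half is empty */
    }
    return c;
}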
 Lexical Analyser represent each token in terms of Regular
Expression and Regular Expression is represented by
Transition Diagram.
 Representing valid tokens of a language in regular
expression:-
If x is a regular expression, then:
x* means zero or more occurrence of x.
i.e., it can generate {Є, x, xx, xxx, xxxx, … }
x+ means one or more occurrence of x.
i.e., it can generate { x, xx, xxx, xxxx … } or x.x*
x? means at most one occurrence of x
i.e., it can generate either {x} or {Є}.
 [a-z] is all lower-case alphabets of English language.
 [A-Z] is all upper-case alphabets of English language.
 [0-9] is all natural digits used in mathematics.
 Representing occurrence of symbols using
regular expressions:-
letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
or [0-9]
sign = [ + | - ]
 Representing language tokens using regular
expressions:-
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
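For example, under these definitions the string sum1 is a valid Identifier: s matches letter and um1 matches (letter | digit)*, while 1sum is not, since it begins with a digit. Likewise 42 and +42 both match Decimal = (sign)?(digit)+.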
 The only problem left with the lexical analyser is
how to verify the validity of a regular expression
used in specifying the patterns of keywords of a
language. A well-accepted solution is to use
finite automata for verification.
 Finite automata is a state machine that takes a string of
symbols as input and changes its state accordingly. Finite
automata is a recognizer for regular expressions. When a
regular expression string is fed into finite automata, it
changes its state for each literal. If the input string is
successfully processed and the automata reaches its final
state, it is accepted, i.e., the string just fed was said to be
a valid token of the language in hand.
 Example: We assume an FA that accepts any three-digit binary
value ending in the digit 1. FA = (Q, Σ, δ, q0, F), where Q is the set of states, Σ = {0, 1}, q0 is the start state and F = {qf}.
 For example, draw a transition diagram for the identifier regular
expression, which is as follows:
 Identifier = (letter)(letter | digit)*
 Draw a transition diagram where the terminals if, then, else, relop,
id and num generate sets of strings given by the following regular
definitions:
 if → if
 then → then
 else → else
 relop → <|<=|=|<>|>|>=
 id → letter(letter|digit)*
 num → digit+ (.digit+)?(E(+|-)?digit+)?
 It is a Unix utility.
 It generates C code.
 It takes regular expressions for patterns.
 It takes additional C code.
A LEX program consists of 3 sections, and each
section is separated by %%.
 The format of a lex program is as follows:
 1) Definition Section
%%
 2) Translation Rule Section
%%
 3)Procedure Section// Written in C language
1) int yylex(void): call to invoke the lexer; returns a
token.
2) char *yytext: pointer to the matched string.
3) yyleng: length of the matched string.
4) yylval: value associated with the token.
5) int yywrap(void): wrap-up; returns 1 if done, 0 if
not done.
6) yyerror(): used to report errors in the
program.
yyerror()
{
printf("\n Error");
}
%{ int count = 0; %}
letter [a-zA-Z]
digit  [0-9]
%%
{letter}({letter}|{digit})*   { count++; }   /* match identifier */
.|\n                          ;              /* ignore everything else */
%%
int main(void)
{ yylex();
printf("number of identifiers = %d\n", count);
return 0;
}
 counts the number of characters, words, and
lines in a file.
 %{ int nchar, nword, nline; %}
%%
\n { nline++; nchar++; }
[^ \t\n]+ { nword++, nchar += yyleng; }
. { nchar++; }
%%
int main(void)
{ yylex();
printf("%d\t%d\t%d\n", nchar, nword, nline);
return 0;
}
Thank You
Mrs. Jyoti C. Bachhav
 Syntax analysis or parsing is the second phase of
a compiler.
 It checks syntax of language.
 A syntax analyser or parser takes the input from
a lexical analyser in the form of token streams.
The parser analyses the source code (token
stream) against the production rules to detect
any errors in the code. The output of this phase
is a parse tree.
 This way, the parser accomplishes two tasks,
i.e., parsing the code, looking for errors and
generating a parse tree as the output of the
phase.
A derivation is basically a sequence of
production rule applications, used in order to get the input
string.
 During parsing, we take two decisions for
some sentential form of input:
1) Deciding the non-terminal which is to be
replaced.
2) Deciding the production rule, by which,
the non-terminal will be replaced.
 Syntax analyser follow production rules
defined by means of context-free grammar.
The way the production rules are
implemented (derivation) divides parsing into
two types:
 1) Top Down Parsing
 2) Bottom-up Parsing
 First and Follow Sets
 An important part of parser table
construction is to create the FIRST and FOLLOW
sets. These sets can provide the actual
position of any terminal in the derivation.
This is done to create the parsing table,
where the decision of replacing T[A, t]
with some production rule α is made.
 First Set
 This set is created to know what terminal symbol is
derived in the first position by a non-terminal.
 For example, α → t β
 That is α derives t (terminal) in the very first position. So,
t ∈ FIRST(α).
 Algorithm for calculating the First set
 Look at the definition of the FIRST(α) set:
1) if α is a terminal, then FIRST(α) = { α }.
2) if α is a non-terminal and α → ℇ is a production, then
ℇ is in FIRST(α).
3) if α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n is a production, then
everything in FIRST(𝜸1) is in FIRST(α); and if 𝜸1 … 𝜸i-1 all derive ℇ,
then FIRST(𝜸i) is also included.
 The First set can be seen as: FIRST(α) = { t | α ⇒* tβ } ∪ { ℇ if α ⇒* ℇ }
 Follow Set
 Likewise, we calculate what terminal symbol
immediately follows a non-terminal α in production
rules. We do not consider what the non-terminal can
generate but instead, we see what would be the next
terminal symbol that follows the productions of a non-
terminal.
 Algorithm for calculating the Follow set:
1) if α is the start symbol, then $ is in FOLLOW(α).
2) if there is a production X → αAβ, then everything in
FIRST(β) except ℇ is in FOLLOW(A).
3) if there is a production X → αA, or X → αAβ where β
derives ℇ, then everything in FOLLOW(X) is in FOLLOW(A).
 The Follow set can be seen as: FOLLOW(A) = { t | S ⇒* αAtβ }
 Compute Follows:
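As a worked example, for the grammar S → AA, A → aA | b used in the LR sections below:
FIRST(A) = {a, b} and FIRST(S) = FIRST(A) = {a, b}
FOLLOW(S) = {$}; FOLLOW(A) = FIRST(A) ∪ FOLLOW(S) = {a, b, $}, since an A can be followed by another A (in S → AA) or end the sentence.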
 Bottom-up Parsing:-
 As the name suggests, bottom-up parsing starts
with the input symbols and tries to construct
the parse tree up to the start symbol.
 Example:
 Input string : a + b * c
 Production rules:
 S → E
E → E + T
E → E * T
E → T
 T → id
 Let us start bottom-up parsing:
 a + b * c
 Read the input and check if any production
matches with the input:
a+b*c
T+b*c
E+b*c
E+T*c
E*c
E*T
E
S
 Bottom-up parsing starts from the leaf nodes of
a tree and works in upward direction till it
reaches the root node.
 Here, we start from a sentence and then apply
production rules in reverse manner in order to
reach the start symbol.
 The image given below depicts the bottom-up
parsers available.
 LR Parser:
 The LR parser is a non-recursive, shift-
reduce, bottom-up parser.
 It uses a wide class of context-free grammar
which makes it the most efficient syntax
analysis technique.
 LR parsers are also known as LR(k) parsers,
where L stands for left-to-right scanning of
the input stream;
 R stands for the construction of right-most
derivation in reverse,
 and k denotes the number of lookahead
symbols to make decisions.
 A) SLR(1) – Simple LR Parser:-
 S :simple
 L :left-to-right scan of input
 R :rightmost derivation in reverse
 Works on smallest class of grammar
 Few number of states, hence very small table
 Simple and fast construction
 Example:
 Grammar:

 S → S+S
 S → S-S
 S → (S)
 S→a
 Input string:

 a1-(a2+a3)
 Shift-Reduce Parsing:
 Shift-reduce parsing uses two unique steps for
bottom-up parsing. These steps are known as
shift-step and reduce-step.
 Shift step: The shift step refers to the
advancement of the input pointer to the next
input symbol, which is called the shifted symbol.
This symbol is pushed onto the stack. The shifted
symbol is treated as a single node of the parse
tree.
 Reduce step: When the parser finds a complete
grammar-rule right-hand side (RHS) and replaces it with the
left-hand side (LHS), it is known as the reduce step. This occurs
when the top of the stack contains a handle. To reduce, a POP
operation is performed on the stack, which pops
off the handle and replaces it with the LHS non-
terminal symbol.
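For instance, with the grammar E → E + T | T, T → id and input id + id, a shift-reduce parse proceeds as follows (a sketch):
Stack        Input        Action
$            id + id $    shift
$ id         + id $       reduce T → id
$ T          + id $       reduce E → T
$ E          + id $       shift
$ E +        id $         shift
$ E + id     $            reduce T → id
$ E + T      $            reduce E → E + T
$ E          $            accept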
 Parsing table:
 There are two main categories of shift reduce parsing as
follows:

 1) Operator-Precedence Parsing
 2) LR-Parser

1) Operator precedence parsing:-
 Operator precedence grammar is a kind of shift-reduce
parsing method. It is applied to a small class of operator
grammars.
 A grammar is said to be an operator precedence grammar if it
has two properties:
 No R.H.S. of any production contains ε.
 No two non-terminals are adjacent.
 Operator precedence can only be established between the
terminals of the grammar; it ignores the non-terminals.
 There are the three operator precedence relations:
 a ⋗ b means that terminal "a" has the higher precedence than
terminal "b".
 a ⋖ b means that terminal "a" has the lower precedence than
terminal "b".
 a ≐ b means that the terminal "a" and "b" both have same
precedence.
 Precedence table:
 Operator precedence parsing
 Parsing Action
 At both ends of the given input string, add the $ symbol.
 Now scan the input string from left to right until ⋗ is
encountered.
 Scan towards the left over all the equal precedences until the first
left-most ⋖ is encountered.
 Everything between the left-most ⋖ and the right-most ⋗ is a handle.
 $ on $ means parsing is successful.
 Grammar:
E → E+T/T
T → T*F/F
F → id
 Given string:w = id + id * id
 Let us consider a parse tree for it as follows:
 On the basis of above tree, we can design
following operator precedence table:
 Now let us process the string with the help of
the above precedence table:
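A sketch of the scan for w = id + id * id: inserting the precedence relations gives
$ ⋖ id ⋗ + ⋖ id ⋗ * ⋖ id ⋗ $
Each id between ⋖ and ⋗ is a handle and is reduced first (F → id). Ignoring non-terminals, the relations then become $ ⋖ + ⋖ * ⋗ $, so the * handle is reduced next (T → T * F), leaving $ ⋖ + ⋗ $, and the + handle is reduced (E → E + T). Finally $ on $ means the parse is successful.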
 2) LR-Parser:-

 LR Parser
 LR parsing is one type of bottom up parsing. It is
used to parse the large class of grammars.

 In the LR parsing,
 "L" stands for left-to-right scanning of the input.

 "R" stands for constructing a right most


derivation in reverse.

 "K" is the number of input symbols of the look


ahead used to make number of parsing decision.
 LRparsing is divided into four parts: LR (0)
parsing, SLR parsing, CLR parsing and LALR
parsing.

 LR algorithm:
 The LR algorithm requires stack, input,
output and parsing table. In all type of LR
parsing, input, output and stack are same
but parsing table is different.
 Fig: Block diagram of LR parser
 Input buffer is used to indicate end of input and
it contains the string to be parsed followed by a
$ Symbol.
 A stack is used to contain a sequence of grammar
symbols with a $ at the bottom of the stack.
 Parsing table is a two dimensional array. It
contains two parts: Action part and Go To part.
 Augment Grammar:-
 Augmented grammar G` will be generated if we add
one more production in the given grammar G. It helps
the parser to identify when to stop the parsing and
announce the acceptance of the input.
 Example
 Given grammar
 S → AA
 A → aA | b
 The Augment grammar G` is represented by
 S`→ S
 S → AA
 A → aA | b
 Canonical Collection of LR(0) items

 An LR (0) item is a production of G with a dot at some
position on the right side of the production.
 LR(0) items are useful to indicate how much of the
input has been scanned up to a given point in the
process of parsing.
 In the LR (0) parsing table, we place the reduce move in the entire
row.
 Example
 Given grammar:
S → AA
A → aA | b
 Add Augment Production and insert '•' symbol at the
first position for every production in G
S` → •S
S → •AA
A → •aA
A → •b
 I0 State:
 Add Augment production to the I0 State and Compute
the Closure
 I0 = Closure (S` → •S)
 Add all productions starting with S in to I0 State
because "•" is followed by the non-terminal. So, the
I0 State becomes
I0 = S` → •S
S → •AA

 Add all productions starting with "A" in modified I0


State because "•" is followed by the non-terminal. So,
the I0 State becomes.
 I0= S` → •S
S → •AA
A → •aA
A → •b
 I1= Go to (I0, S) = closure (S` → S•) = S` → S•
 Here, the Production is reduced so close the State.
 I1= S` → S•
 I2= Go to (I0, A) = closure (S → A•A)
 Add all productions starting with A in to I2 State because
"•" is followed by the non-terminal. So, the I2 State
becomes
 I2 =S→A•A
A → •aA
A → •b
 Go to (I2,a) = Closure (A → a•A) = (same as I3)
 Go to (I2, b) = Closure (A → b•) = (same as I4)
 I3= Go to (I0,a) = Closure (A → a•A)
 Add productions starting with A in I3.
 A → a•A
A → •aA
A → •b
 Go to (I3, a) = Closure (A → a•A) = (same as I3)
Go to (I3, b) = Closure (A → b•) = (same as I4)

 I4= Go to (I0, b) = closure (A → b•) = A → b•


I5= Go to (I2, A) = Closure (S → AA•) = S → AA•
I6= Go to (I3, A) = Closure (A → aA•) = A → aA•
 SLR (1) Parsing
 SLR (1) refers to simple LR Parsing. It is same as
LR(0) parsing.
 The only difference is in the parsing table.
 To construct SLR (1) parsing table, we use
canonical collection of LR (0) item.
 In the SLR (1) parsing, we place the reduce move
only in the follow of left hand side.
 Various steps involved in the SLR (1) Parsing:
1)For the given input string write a context free
grammar
2)Check the ambiguity of the grammar
3)Add Augment production in the given grammar
4)Create Canonical collection of LR (0) items
5) Draw the DFA (deterministic finite automaton) of item sets
6)Construct a SLR (1) parsing table
 SLR (1) Table Construction:
 Example
S → •Aa
A → αβ•
 Follow(S) = {$}
 Follow(A) = {a}
 So the reduce move for A → αβ• is placed only under a, i.e. under FOLLOW(A).
 SLR ( 1 ) Grammar
 S→E
E→E+T|T
T→T*F|F
F → id
 Add Augment Production and insert '•' symbol at the first
position for every production in G
 S` → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id
 I0 State:
 Add Augment production to the I0 State and Compute
the Closure
 I0 = Closure (S` → •E)
 Add all productions starting with E in to I0 State
because "." is followed by the non-terminal. So, the
I0 State becomes
 I0 = S` → •E
E → •E + T
E → •T
 Add all productions starting with T and F in modified
I0 State because "." is followed by the non-terminal.
So, the I0 State becomes.
 I0= S` → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id
 I1= Go to (I0, E) = closure (S` → E•, E → E• + T)
I2= Go to (I0, T) = closure (E → T•, T → T• * F)
I3= Go to (I0, F) = Closure ( T → F• ) = T → F•
I4= Go to (I0, id) = closure ( F → id•) = F → id•
I5= Go to (I1, +) = Closure (E → E +•T)
 Add all productions starting with T and F in I5
State because "." is followed by the non-
terminal. So, the I5 State becomes
 I5 = E → E +•T
T → •T * F
T → •F
F → •id
 Go to (I5, F) = Closure (T → F•) = (same as I3)
Go to (I5, id) = Closure (F → id•) = (same as I4)
 I6= Go to (I2, *) = Closure (T → T * •F)
 Add all productions starting with F in I6 State
because "." is followed by the non-terminal.
So, the I6 State becomes
 I6 = T → T * •F
F → •id
 Go to (I6, id) = Closure (F → id•) = (same as
I4)
 I7= Go to (I5, T) = Closure (E → E + T•) = E
→ E + T•
I8= Go to (I6, F) = Closure (T → T * F•) = T →
T * F•
 SLR (1) Table
 First (E) = First (E + T) ∪ First (T)
First (T) = First (T * F) ∪ First (F)
First (F) = {id}
First (T) = {id}
First (E) = {id}
Follow (E) = First (+T) ∪ {$} = {+, $}
Follow (T) = First (*F) ∪ Follow (E)
= {*, +, $}
Follow (F) = Follow (T) = {*, +, $}
 CLR (1) Parsing:-
 CLR refers to canonical LR. CLR parsing
uses the canonical collection of LR (1) items to
build the CLR (1) parsing table. The CLR (1) parsing
table produces more states as compared
to the SLR (1) parsing table.
 In the CLR (1), we place the reduce node only in
the lookahead symbols.
 Various steps involved in the CLR (1) Parsing:
1)For the given input string write a context free
grammar
2)Check the ambiguity of the grammar
3)Add Augment production in the given grammar
4)Create Canonical collection of LR (0) items
5) Draw the DFA (deterministic finite automaton) of item sets
6)Construct a CLR (1) parsing table
 LR (1) item
 LR (1) item is a collection of LR (0) items and a look ahead
symbol.
 LR (1) item = LR (0) item + look ahead
 The look ahead is used to determine where we place the
final item.
 The look ahead $ is always added for the augment
production.
 Example
 CLR ( 1 ) Grammar
 S → AA
 A → aA
 A→b
 Add Augment Production, insert '•' symbol at the first position for
every production in G and also add the lookahead.
 S` → •S, $
 S → •AA, $
 A → •aA, a/b
 A → •b, a/b
 I0 State:
 Add Augment production to the I0 State and Compute
the Closure
 I0 = Closure (S` → •S)
 Add all productions starting with S in to I0 State
because "." is followed by the non-terminal. So, the
I0 State becomes
 I0 = S` → •S, $
S → •AA, $
 Add all productions starting with A in modified I0
State because "." is followed by the non-terminal. So,
the I0 State becomes.
 I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
 I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $
I2= Go to (I0, A) = closure ( S → A•A, $ )
 Add all productions starting with A in I2 State
because "." is followed by the non-terminal. So,
the I2 State becomes
 I2= S → A•A, $
A → •aA, $
A → •b, $
 I3= Go to (I0, a) = Closure ( A → a•A, a/b )
 Add all productions starting with A in I3 State
because "." is followed by the non-terminal. So,
the I3 State becomes
 I3= A → a•A, a/b
A → •aA, a/b
A → •b, a/b
 Go to (I3, a) = Closure (A → a•A, a/b) = (same as
I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as
I4)
 I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•,
a/b
I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)
 Add all productions starting with A in I6 State because
"." is followed by the non-terminal. So, the I6 State
becomes
 I6 = A → a•A, $
A → •aA, $
A → •b, $
 Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)
 I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $
I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•,
a/b
I9= Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $
 CLR (1) Parsing table:

 Productions are numbered as follows:


 S → AA ... (1)
 A → aA ....(2)
 A → b ... (3)
 The placement of shift moves in the CLR (1) parsing table is the same as in the SLR
(1) parsing table; the only difference is in the placement of reduce moves.
 I4 contains the final item (A → b•, a/b), so action {I4, a} =
R3 and action {I4, b} = R3.
I5 contains the final item (S → AA•, $), so action {I5, $} =
R1.
I7 contains the final item (A → b•, $), so action {I7, $} = R3.
I8 contains the final item (A → aA•, a/b), so action {I8, a} =
R2 and action {I8, b} = R2.
I9 contains the final item (A → aA•, $), so action {I9, $} =
R2.
 LALR (1) Parsing:
 LALR refers to the lookahead LR. To construct the LALR (1)
parsing table, we use the canonical collection of LR (1) items.
 In the LALR (1) parsing, the LR (1) items which have same
productions but different look ahead are combined to form a
single set of items
 LALR (1) parsing is same as the CLR (1) parsing, only difference in
the parsing table.
 Example
 LALR ( 1 ) Grammar
 S → AA
 A → aA
 A→b
 Add Augment Production, insert '•' symbol at the first position for
every production in G and also add the look ahead.
 S` → •S, $
 S → •AA, $
 A → •aA, a/b
 A → •b, a/b
 I0 State:
 Add Augment production to the I0 State and Compute the
Closure
 I0 = Closure (S` → •S)
 Add all productions starting with S in to I0 State because
"•" is followed by the non-terminal. So, the I0 State
becomes
 I0 = S` → •S, $
S → •AA, $
 Add all productions starting with A in modified I0 State
because "•" is followed by the non-terminal. So, the I0
State becomes.
 I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
 I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $
I2= Go to (I0, A) = closure ( S → A•A, $ )
 Add all productions starting with A in I2 State because "•"
is followed by the non-terminal. So, the I2 State becomes
 I2= S → A•A, $
A → •aA, $
A → •b, $
 I3= Go to (I0, a) = Closure ( A → a•A, a/b )
 Add all productions starting with A in I3 State because "•"
is followed by the non-terminal. So, the I3 State becomes
 I3= A → a•A, a/b
A → •aA, a/b
A → •b, a/b
 Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)
 I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•, a/b
I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)
 Add all productions starting with A in I6 State because "•"
is followed by the non-terminal. So, the I6 State becomes
 I6 = A → a•A, $
A → •aA, $
A → •b, $
 Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)
 I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $
I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•,
a/b
I9= Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $
 If we analyze then LR (0) items of I3 and I6 are same
but they differ only in their lookahead.
 I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b
}
 I6= { A → a•A, $
A → •aA, $
A → •b, $
}
 Clearly I3 and I6 are the same in their LR (0) items
but differ in their lookahead, so we can combine
them and call the result I36.
 I36 = { A → a•A, a/b/$
A → •aA, a/b/$
A → •b, a/b/$
}
 I4 and I7 are the same but differ only in
their lookahead, so we can combine them and
call the result I47.
 I47 = {A → b•, a/b/$}
 I8 and I9 are the same but differ only in
their lookahead, so we can combine them and
call the result I89.
 I89 = {A → aA•, a/b/$}

 LALR (1) Parsing table:
 YACC is an automatic tool that generates the
parser program.
 As YACC was discussed in the first unit of
this tutorial, you can go through the concepts
again to make things clearer.
 YACC
 YACC stands for Yet Another Compiler Compiler.
 YACC provides a tool to produce a parser for a
given grammar.
 YACC is a program designed to compile a LALR
(1) grammar.
 It is used to produce the source code of the
syntactic analyzer of the language produced by
LALR (1) grammar.
 The input of YACC is the rule or grammar and the
output is a C program.
 These are some points about YACC:
 Input: A CFG- file.y
 Output: A parser y.tab.c (yacc)
 The output file "file.output" contains the
parsing tables.
 The file "file.tab.h" contains declarations.
 The parser function is called yyparse().
 The parser expects to use a function called yylex()
to get tokens.
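A minimal sketch of a YACC input file for an expression grammar (illustrative only; a companion lex scanner is assumed to supply NUM tokens and yylval values):

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUM
%left '+'
%left '*'
%%
expr : expr '+' expr   { $$ = $1 + $3; }
     | expr '*' expr   { $$ = $1 * $3; }
     | NUM
     ;
%%
int main(void) { return yyparse(); }

Running yacc (or bison -y) on this file produces y.tab.c, the C source of the parser.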
Mrs. Jyoti C. Bachhav
 Syntax analysis or parsing is the second phase of
a compiler.
 It checks syntax of language.
 A syntax analyser or parser takes the input from
a lexical analyser in the form of token streams.
The parser analyses the source code (token
stream) against the production rules to detect
any errors in the code. The output of this phase
is a parse tree.
 This way, the parser accomplishes two tasks,
i.e., parsing the code, looking for errors and
generating a parse tree as the output of the
phase.
A derivation is basically a sequence of
production rule applications, used in order to get the input
string.
 During parsing, we take two decisions for
some sentential form of input:
1) Deciding the non-terminal which is to be
replaced.
2) Deciding the production rule, by which,
the non-terminal will be replaced.
 To decide which non-terminal to be replaced with
production rule, we can have two options.
Left-most Derivation
 If the sentential form of an input is scanned and
replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-
most derivation is called the left-sentential form.
 Example:-

Production rules:

 E→E+E
 E → E * E
 E → id
Input string: id + id * id
 Left-most Derivation:
E
⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
Notice that the left-most non-terminal is
always processed first.
Right-most Derivation
 If we scan and replace the input with
production rules, from right to left, it is
known as right-most derivation. The
sentential form derived from the right-most
derivation is called the right-sentential form.
 The right-most derivation is:
E
⇒ E + E
⇒ E + E * E
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id
A parse tree is a graphical depiction of a
derivation. It is convenient to see how strings
are derived from the start symbol. The start
symbol of the derivation becomes the root of
the parse tree.
 We take the left-most derivation of id + id * id.
The left-most derivation is:
E
⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id

 In a parse tree:
• All leaf nodes are terminals.
• All interior nodes are non-terminals.
• In-order traversal gives original input string.
 A parse tree depicts associativity and
precedence of operators. The deepest sub-tree
is traversed first, therefore the operator in that
sub-tree gets precedence over the operator
which is in the parent nodes.
 Ambiguity/ Ambiguous Grammar
 A grammar G is said to be ambiguous if it has
more than one parse tree (left or right
derivation) for at least one string.
 Example
E→E+E
E→E–E
E → id
 For the string id + id – id, the above grammar
generates two parse trees:
 A language is said to be inherently ambiguous if
every grammar that generates it is ambiguous.
 Ambiguity in a grammar is not good for compiler
construction.
 No method can detect and remove ambiguity
automatically, but it can be removed by either
re-writing the whole grammar without ambiguity,
or by setting and following associativity and
precedence constraints.
 Associativity
 If an operand has operators on both sides, the side on
which the operator takes this operand is decided by the
associativity of those operators.
 If the operation is left-associative, then the operand will
be taken by the left operator or if the operation is right-
associative, the right operator will take the operand.
 Example
 Operations such as Addition, Multiplication, Subtraction,
and Division are left associative. If the expression
contains:
id op id op id
it will be evaluated as:
(id op id) op id
For example, (id + id) + id
 Operations like Exponentiation are right associative, i.e.,
the order of evaluation in the same expression will be:
id op (id op id)
For example, id ^ (id ^ id)
 Precedence
 If two different operators share a common
operand, the precedence of operators decides
which will take the operand.
 That is, 2+3*4 can have two different parse
trees:
 one corresponding to (2+3)*4 and
 another corresponding to 2+(3*4). By setting
precedence among operators, this problem can
be easily removed.
 As in the previous example, mathematically *
(multiplication) has precedence over +
(addition), so the expression 2+3*4 will always
be interpreted as:
2 + (3 * 4)
 These methods decrease the chances of
ambiguity in a language or its grammar.
 Syntax analyser follow production rules
defined by means of context-free grammar.
The way the production rules are
implemented (derivation) divides parsing into
two types:
 1) Top Down Parsing
 2) Bottom-up Parsing
 1) Top Down Parsing:
 When the parser starts constructing the
parse tree from the start symbol and then
tries to transform the start symbol to the
input, it is called top-down parsing.
`
 A) Top-down parser with Back-tracking :
 Top- down parsers start from the root node
(start symbol) and match the input string
against the production rules to replace them
(if matched).
 To understand this, take the following
example of CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string "read", a top-down parser will
behave like this:
 Step 1:It will start with S from the production
rules and will match its yield to the left-most
letter of the input, i.e. ‘r’. The very production
of S (S → rXd) matches with it.

 Step 2: So the top-down parser advances to the
next input letter (i.e. ‘e’). The parser tries to
expand non-terminal ‘X’ and checks its
production from the left (X → oa).
 Step 3:It does not match with the next input
symbol. So the top-down parser backtracks
to obtain the next production rule of X, (X →
ea).
 Now the parser matches all the input letters
in an ordered manner. The string is accepted.
 Drawbacks of a top-down parser with back-
tracking:
 1) Back-tracking slows down parsing.
 2) Precise error indication is not possible.
 3) Left recursion is one more problem in top-
down parsing.
 Left Recursion:
 A grammar becomes left-recursive if it has any
non-terminal ‘A’ whose derivation contains ‘A’
itself as the left-most symbol.
 Left-recursive grammar is considered to be a
problematic situation for top-down parsers. Top-
down parsers start parsing from the Start
symbol, which in itself is non-terminal. So, when
the parser encounters the same non-terminal in
its derivation, it becomes hard for it to judge
when to stop parsing the left non-terminal and it
goes into an infinite loop.
 Example:
(1) A => Aα | β
(2) S => A α | β
A => Sd
 (1) is an example of immediate left recursion,
where A is any non-terminal symbol and α
represents a string of terminals and/or non-terminals.
 (2) is an example of indirect left recursion.
 A top-down parser will first parse A, which
in turn will yield a string consisting of A itself,
and the parser may go into a loop forever.
 Removal of Left Recursion
 One way to remove left recursion is to use the
following technique:
 1) Elimination of immediate Left
Recursion:(Direct Method)
If the production is in the form of:
A => Aα | β
Where A is NT and α is string of T and / or NTs.
is converted into following productions
A => βA'
A'=> α A' | ε
 This does not impact the strings derived from
the grammar, but it removes immediate left
recursion.
 Example of Elimination of Immediate Left
Recursion:
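For example, E → E + T | T has immediate left recursion with α = +T and β = T; applying the rule gives:
E → T E'
E' → + T E' | ε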
 2) Elimination of Left Recursion Using Indirect Method:
 Second method is to use the following algorithm, which
should eliminate all direct and indirect left recursions.
 START
Arrange non-terminals in some order like A1, A2, A3,…, An
for each i from 1 to n
{
for each j from 1 to i-1
{
replace each production of form Ai ⟹ Aj𝜸
with Ai ⟹ δ1𝜸 | δ2𝜸 | δ3𝜸 | … | δn𝜸
where Aj ⟹ δ1 | δ2 | … | δn are the current Aj productions
}
}
eliminate immediate left-recursion
 END
 Example
The production set
S => Aα | β
A => Sd
 after applying the above algorithm, should
become
S => Aα | β
A => A α d | βd
 and then, remove immediate left recursion using
the first technique.
A => β dA'
A' => α dA' | ε
 Now none of the production has either direct or
indirect left recursion.
 Example of Elimination of Indirect Left
Recursion:
 Need for Left Factoring:
 If more than one grammar production rule has a
common prefix string, then the top-down parser
cannot make a choice as to which of the
productions it should take to parse the string in
hand.
 Example
 If a top-down parser encounters a production
like
A⟹ αβ | α𝜸 | …
 Then it cannot determine which production to
follow to parse the string as both productions
are starting from the same terminal (or non-
terminal). To remove this confusion, we use a
technique called left factoring.
 Left factoring transforms the grammar to
make it useful for top-down parsers. In this
technique, we make one production for each
common prefixes and the rest of the
derivation is added by new productions.
 Example
A⟹αβ|α𝜸|…
 The above productions can be written as
A => α A‘
A'=> β | 𝜸 | …
 Now the parser has only one production per
prefix which makes it easier to take
decisions.
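For example, the productions
stmt → if expr then stmt else stmt | if expr then stmt
share the common prefix if expr then stmt; left factoring gives:
stmt → if expr then stmt stmt'
stmt' → else stmt | ε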
 2) Recursive Descent Parsing:-
 Recursive descent is a top-down parsing
technique that constructs the parse tree from
the top and the input is read from left to right.
It uses procedures for every terminal and non-
terminal entity.
 This parsing technique recursively parses the
input to make a parse tree, which may or may
not require back-tracking. But the grammar
associated with it (if not left factored) cannot
avoid back-tracking.
 A form of recursive-descent parsing that does
not require any back-tracking is known as
predictive parsing.
 This parsing technique is regarded recursive as
it uses context-free grammar which is recursive
in nature.
 To write a recursive descent parser for a
grammar G we need the following steps:
 Step 1: Check whether the grammar is
ambiguous. If yes, disambiguate the grammar.
 Step 2: Rewrite the grammar without left
recursion.
 Step 3: Left factor the grammar if necessary.
 Step 4: Write recursive routines for each non-
terminal of the grammar G. (A sketch follows below.)
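A minimal C sketch of Step 4 for the left-recursion-free grammar E → T E', E' → + T E' | ε, T → id (single-letter identifiers here; names are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static const char *ip;                 /* input pointer */

static void error(void) { printf("syntax error\n"); exit(1); }
static void match(char t) { if (*ip == t) ip++; else error(); }

static void T(void)                    /* T → id */
{
    if (isalpha((unsigned char)*ip)) ip++;
    else error();
}

static void Eprime(void)               /* E' → + T E' | ε */
{
    if (*ip == '+') { match('+'); T(); Eprime(); }
    /* otherwise take the ε-production: do nothing */
}

static void E(void)                    /* E → T E' */
{
    T();
    Eprime();
}

int main(void)
{
    ip = "a+b+c";
    E();
    if (*ip == '\0') printf("accepted\n");
    else error();
    return 0;
}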
 3) Predictive Parser:
 Predictive parser is a recursive descent
parser, which has the capability to predict
which production is to be used to replace the
input string. The predictive parser does not
suffer from backtracking.
 To accomplish its tasks, the predictive parser
uses a look-ahead pointer, which points to
the next input symbols. To make the parser
back-tracking free, the predictive parser
puts some constraints on the grammar and
accepts only a class of grammar known as
LL(k) grammar.
 Predictive parsing uses a stack and a parsing table
to parse the input and generate a parse tree.
 Both the stack and the input contains an end
symbol $ to denote that the stack is empty and
the input is consumed. The parser refers to the
parsing table to take any decision on the input
and stack element combination.
 In recursive descent parsing, the parser may
have more than one production to choose from
for a single instance of input, whereas in
predictive parser, each step has at most one
production to choose. There might be instances
where there is no production matching the input
string, making the parsing procedure to fail.
 LL Parser
 LL parser is denoted as LL(k). The first L in
LL(k) is parsing the input from left to right,
the second L in LL(k) stands for left-most
derivation and k itself represents the number
of look ahead. Generally k =1, so LL(k) may
also be written as LL(1).
 LL Parsing Algorithm:
 We may stick to deterministic LL(1) for
parser explanation, as the size of table
grows exponentially with the value of k.
Secondly, if a given grammar is not LL(1),
then usually, it is not LL(k), for any given
k.
 Given below is an algorithm for LL(1) parsing:
 Input:
string ω
parsing table M for grammar G
 Output:
If ω is in L(G), then a left-most derivation of ω; error otherwise.
 Initial State: $S on the stack (with S being the start symbol),
ω$ in the input buffer

SET ip to point to the first symbol of ω$.

repeat
    let X be the top stack symbol and a the symbol pointed to by ip.
    if X ∈ Vt or $
        if X = a
            POP X and advance ip.
        else
            error()
        endif
    else /* X is non-terminal */
        if M[X,a] = X → Y1 Y2 ... Yk
            POP X
            PUSH Yk, Yk-1, ..., Y1 /* Y1 on top */
            Output the production X → Y1 Y2 ... Yk
        else
            error()
        endif
    endif
until X = $ /* empty stack */
 First and Follow Sets
 An important part of parser table
construction is to create the FIRST and FOLLOW
sets. These sets can provide the actual
position of any terminal in the derivation.
This is done to create the parsing table,
where the decision of replacing T[A, t]
with some production rule α is made.
 First Set
 This set is created to know what terminal symbol is
derived in the first position by a non-terminal.
 For example, α → t β
 That is α derives t (terminal) in the very first position. So,
t ∈ FIRST(α).
 Algorithm for calculating the First set
 Look at the definition of the FIRST(α) set:
1) if α is a terminal, then FIRST(α) = { α }.
2) if α is a non-terminal and α → ℇ is a production, then
ℇ is in FIRST(α).
3) if α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n is a production, then
everything in FIRST(𝜸1) is in FIRST(α); and if 𝜸1 … 𝜸i-1 all derive ℇ,
then FIRST(𝜸i) is also included.
 The First set can be seen as: FIRST(α) = { t | α ⇒* tβ } ∪ { ℇ if α ⇒* ℇ }
 Follow Set
 Likewise, we calculate what terminal symbol
immediately follows a non-terminal α in production
rules. We do not consider what the non-terminal can
generate but instead, we see what would be the next
terminal symbol that follows the productions of a non-
terminal.
 Algorithm for calculating the Follow set:
1) if α is the start symbol, then $ is in FOLLOW(α).
2) if there is a production X → αAβ, then everything in
FIRST(β) except ℇ is in FOLLOW(A).
3) if there is a production X → αA, or X → αAβ where β
derives ℇ, then everything in FOLLOW(X) is in FOLLOW(A).
 The Follow set can be seen as: FOLLOW(A) = { t | S ⇒* αAtβ }
Mrs. Jyoti C. Bachhav
 In previous Chapter we learnt how a parser
constructs parse trees in the syntax analysis
phase. The plain parse-tree constructed in that
phase is generally of no use for a compiler, as it
does not carry any information of how to
evaluate the tree.
 The productions of a context-free grammar, which
make up the rules of the language, do not
say how to interpret them.
 For example
 E → E +T
 The above CFG production has no semantic rule
associated with it, and it cannot help in making
any sense of the production.
 Semantics Analysis:
 Semantics of a language provide meaning to its
constructs, like tokens and syntax structure.
 Semantics help interpret symbols, their types,
and their relations with each other.
 Semantic analysis judges whether the syntax
structure constructed in the source program
derives any meaning or not.
 CFG + semantic rules = Syntax Directed
Definitions
 For example: int a = "value";
 This should not issue an error in the lexical or syntax
analysis phase, as it is lexically and structurally
correct, but it should generate a semantic error
as the types in the assignment differ.
 These rules are set by the grammar of the
language and evaluated in semantic analysis.
1)Scope resolution
2)Type checking
3) Array-bound checking
4) Semantic Errors
 Some of the semantic errors that the semantic
analyser is expected to recognize:
 Type mismatch
 Undeclared variable
 Reserved identifier misuse.
 Multiple declaration of variable in a scope.
 Accessing an out of scope variable.
 Actual and formal parameter mismatch.
 SDT refers to a method of compiler
implementation where the source language
translation is completely driven by the
parser.
 The parsing process and parse trees are used
to direct semantic analysis and the
translation of the source program.
 We can augment grammar with information
to control the semantic analysis and
translation. Such grammars are called
Attribute Grammars .
 Attributes are associated with each grammar
symbol, describing its properties.
 An attribute has a name and associated
value.
 With each production in a grammar, give
semantic rules or actions.
 There are 2 ways to represent the semantic
rules associated with grammar symbols:
1) Syntax Directed Definitions(SDD)
2) Syntax Directed Translation Schemes(SDT)
 Definition of SDD: A Syntax Directed
Definitions(SDD) is a context free grammar
together with attributes and rules.
 Attributes are associated with grammar symbols
and rules are associated with productions.
 For Example:
Production: E → E1 + T        Semantic Rule: E.value = E1.value || T.value || '+'
 SDDs are highly readable, high-level
specifications for translations.
 But they hide implementation details; for example,
they do not specify the order of evaluation
of semantic actions.
 Syntax Directed Translation Schemes (SDT)
embed program fragments called semantic
actions within production bodies.
 SDTs are more efficient than SDDs, as they
indicate the order of evaluation of the semantic
actions associated with a production rule.
 Semantic attributes may be assigned values
from their domain at the time of parsing
and evaluated at the time of assignments or
conditions.
 Based on the way the attributes get their values,
they can be broadly divided into two categories :
 a) Synthesized attributes and
 b) Inherited attributes.
 A) Synthesized attributes:
 These attributes get values from the attribute
values of their child nodes.
 To illustrate, assume the following production:
 S → ABC
 If S is taking values from its child nodes (A,B,C),
then it is said to be a synthesized attribute, as
the values of ABC are synthesized to S.
 As in our previous example (E → E1 + T), the
parent node E gets its value from its child node.
Synthesized attributes never take values from
their parent nodes or any sibling nodes.
 An SDD that involves only synthesized attributes
is called S-attributed.
 SDD of Synthesized Attributes
 Each of the non-terminals has a single
synthesized attribute, called val.
 B) Inherited attributes: An inherited attribute
at node N is defined only in terms of attribute
values at N's parent, N itself, and N's siblings.
 In contrast to synthesized attributes, inherited
attributes can take values from the parent, the node itself
and/or siblings.
 SDD with Synthesized and Inherited attributes:
Production Rule    Semantic Rule
D → T L            L.inh = T.val          // inherited attribute
T → int            T.val = int
T → real           T.val = real
L → L1, id         L1.inh = L.inh         // inherited attribute
                   L.val = L1.inh, id.lexval
L → id             L.val = id.lexval
 Annotated Parse Tree: A Parse tree , showing
the values of its attribute(s) is called an
annotated parse tree.
 With Synthesized attributes, evaluate attributes
in Bottom-up order i.e. post order traversal.
 Example: Let us consider the grammar for
arithmetic expressions. The syntax-directed
definition associates to each non-terminal a
synthesized attribute called val.
PRODUCTION       SEMANTIC RULE
L → E n          print(E.val)
E → E1 + T       E.val := E1.val + T.val
E → T            E.val := T.val
T → T1 * F       T.val := T1.val * F.val
T → F            T.val := F.val
F → ( E )        F.val := E.val
 Also draw annotated parse tree for expressions:
 1)(3+4) * (5+6)n
 2)1*2*3*(4+5)n
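For 1), (3+4) * (5+6)n: the leaves give F.val = 3 and F.val = 4, so the parenthesized sub-expression yields E.val = 7; similarly (5+6) yields 11; then T.val = 7 * 11 = 77 and L prints 77. For 2), 1*2*3*(4+5)n evaluates to 6 * 9 = 54.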
 SDD for expression grammar with inherited
attributes.
 Input string is w= real id1,id2,id3
 Draw the annotated parse tree for w = 3*5 by using the SDD for
the expression grammar with inherited attributes.
PRODUCTION       SEMANTIC RULE
T → F T'         T'.inh = F.val
                 T.val = T'.syn
T' → * F T'1     T'1.inh = T'.inh * F.val
                 T'.syn = T'1.syn
T' → Є           T'.syn = T'.inh
F → digit        F.val = digit.lexval
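A sketch of the evaluation for w = 3*5: F.val = 3 (from digit.lexval), so T'.inh = 3; under T' → *FT'1, F.val = 5 and T'1.inh = 3 * 5 = 15; the Є-production gives T'1.syn = T'1.inh = 15, which is copied up as T'.syn = 15, and finally T.val = T'.syn = 15.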
 An SDD with circular dependencies gives no guarantee
on the order of evaluation.
 1) Dependency Graphs:-
 The interdependencies between synthesized
attributes and inherited attributes are shown
by a directed graph called a Dependency
Graph.
 It is a tool for determining an evaluation order
for the attribute instances in a given parse
tree.
 Annotated parse tree shows the values of
attributes, a dependency graph helps to
determine how those values can be
computed.
 Edges express constraints implied by the
semantic rules.
 Each attribute is associated with a node.
 If a semantic rule associated with a
production p defines the value of
synthesized attribute A.b in terms of the
value of X.c, then the graph has an edge
from X.c to A.b.
 If a semantic rule associated with a
production p defines the value of inherited
attribute B.c in terms of the value of X.a, then
the graph has an edge from X.a to B.c.
 For Example:
Production: E → E1 + T        Semantic Rule: E.val = E1.val + T.val
Then the dependency graph is as follows:
 2) Ordering evaluation of the attributes:
 A dependency graph characterizes the possible orders in which we can evaluate the attributes at the various nodes of a parse tree.
 If there is an edge from node M to node N, then the attribute corresponding to M must be evaluated before the attribute corresponding to N.
 Thus the allowable orders of evaluation are N1, N2, ..., Nk such that if there is an edge from Ni to Nj then i < j.
 Such an order is called a topological sort of the graph.
 If there is any cycle in the graph, then there is no topological sort.
 3) S-Attributed: If every attribute is synthesized, then such a grammar is called an S-attributed grammar.
 An S-attributed SDD can be evaluated in bottom-up order of the nodes of the parse tree.
 4) L-attributed: Suppose there is a production A → X1 X2 ... Xn, and that there is an inherited attribute of Xi computed by a rule associated with this production. Then the rule may use only:
 Inherited attributes associated with the head A (i.e. an inherited attribute may use the value of its parent).
 Either inherited or synthesized attributes associated with the occurrences of the symbols X1, X2, ..., Xi-1 located to the left of Xi (i.e. values from left-side siblings).
 Inherited or synthesized attributes associated with this occurrence of Xi itself, but only in such a way that there are no cycles in the dependency graph formed by the attributes of this Xi (i.e. values from itself).
 5) Semantic Rules with Controlled Side Effects:
 Permit incidental side effects that do not affect attribute evaluation.
 Impose restrictions on allowable evaluation orders, so that the same translation is produced for any allowable order.
 Construction of Syntax Trees:
 Syntax trees are useful for representing programming language constructs like expressions and statements.
 Each node of a syntax tree represents a meaningful component of the construct.
 E.g. A Syntax tree node representing an
expression E1+E2 has label + and 2 children
representing the sub expressions E1 and E2.
 Each node is implemented by an object with a suitable number of fields; each object has an op field that is the label of the node, with additional fields as follows:
 If the node is a leaf, an additional field holds the lexical value for the leaf. Such a node is created by the function Leaf(op, val).
 If the node is an interior node, there are as many fields as the node has children in the syntax tree. Such a node is created by the function Node(op, c1, c2, ..., ck).
 Write the steps to construct a syntax tree using the semantic rules. Construct a syntax tree for the following SDD:

Production rule   Semantic rules
E → E1 + T        E.node = new Node('+', E1.node, T.node)
E → E1 – T        E.node = new Node('–', E1.node, T.node)
E → T             E.node = T.node
T → ( E )         T.node = E.node
T → id            T.node = new Leaf(id, id.entry)
T → num           T.node = new Leaf(num, num.val)
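 A minimal C sketch of these constructors (an assumption, since the slides name only Leaf and Node and do not give an implementation):

#include <stdlib.h>

typedef struct SNode {
    int op;                      /* label of the node, e.g. '+' or a token code */
    int val;                     /* lexical value; meaningful only for leaves   */
    struct SNode *left, *right;  /* children of an interior node, else NULL     */
} SNode;

/* Leaf(op, val): a leaf holding its lexical value */
SNode *Leaf(int op, int val)
{
    SNode *n = malloc(sizeof *n);
    n->op = op; n->val = val; n->left = n->right = NULL;
    return n;
}

/* Node(op, c1, c2): an interior node with two children, as for E → E1 + T */
SNode *Node(int op, SNode *c1, SNode *c2)
{
    SNode *n = malloc(sizeof *n);
    n->op = op; n->val = 0; n->left = c1; n->right = c2;
    return n;
}

 With these, the rule E.node = new Node('+', E1.node, T.node) corresponds to the call Node('+', e1, t).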
 SDT can be implemented by first building a parse tree and then performing the actions in a left-to-right, depth-first order.
 A) Postfix Translation Schemes: each semantic action can be placed at the end of the production and executed along with the reduction of the body to the head of the production.
 SDT: we write a semantic action instead of a semantic rule.
 (Table: Production / Semantic Action)
 B) Parser-Stack Implementation of Postfix SDT's:
 Semantic Actions during Parsing:
 When Shifting:
 Push the value of the terminal on the semantic stack.
 When Reducing:
 Pop k values from the semantic stack, where k is the number of symbols on the production's RHS.
 Push the production's value on the semantic stack.
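 A minimal C sketch of such a semantic stack, using a hypothetical reduce for E → E1 + T (so k = 3 symbols on the RHS); the names and representation are assumptions:

#define STACKSZ 64

int sem[STACKSZ];
int top = 0;                 /* number of values on the semantic stack */

/* when shifting: push the value of the terminal */
void shift_terminal(int val) { sem[top++] = val; }

/* when reducing by E -> E1 + T: pop k = 3 values, push the production's value */
void reduce_plus(void)
{
    int t  = sem[top - 1];   /* T.val */
    int e1 = sem[top - 3];   /* E1.val; the '+' entry carries no value */
    top -= 3;                /* pop k values */
    sem[top++] = e1 + t;     /* push E.val */
}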
 SDT's with Actions inside Productions:
 Consider a production A → X {a} Y, with an action a between X and Y.
 If a bottom-up parser is used, then action a is performed as soon as X appears on top of the stack.
 If a top-down parser is used, then action a is performed:
 Just before Y is expanded (if Y is a non-terminal), or
 Just before we check for Y on the input (if Y is a terminal).
 Any SDT can be implemented as follows:
 Ignoring the actions, parse the input and produce a parse tree.
 For a node N constructed for production A → α, add additional children to N for the actions in α.
 Perform a preorder traversal of the tree, and as soon as a node labelled by an action is visited, perform that action.
Thank
You
Mrs. Jyoti C. Bachhav
 A program as a source code is merely a collection of
text (code, statements etc.) and to make it alive, it
requires actions to be performed on the target
machine. A program needs memory resources to
execute instructions. A program contains names for
procedures, identifiers etc., that require mapping
with the actual memory location at runtime.
 By runtime, we mean a program in execution.
Runtime environment is a state of the target
machine, which may include software libraries,
environment variables, etc., to provide services to
the processes running in the system.
 Runtime support system is a package, mostly generated with the executable program itself, that facilitates communication between the process and the runtime environment. It takes care of memory allocation and de-allocation while the program is being executed.
 Difference between Static and Dynamic Memory:

Static                                         Dynamic
Refers to the allocation of memory during      Refers to the allocation of memory during
compilation of the program.                    execution of the program.
Memory bindings are not established and        Memory bindings are established and
destroyed during the execution.                destroyed during the execution.
Variables remain permanently allocated.        Allocated only when the program unit is active.
Faster execution than dynamic.                 Slower execution than static.
More memory space required.                    Less memory space required.
Data is stored in the data segment of memory.  Data is stored in heap memory.
 Storage Organization:
 When the target program executes, it runs in its own logical address space in which each program value has a location.
 The logical address space is shared among the compiler, operating system and target machine for management and organization. The operating system maps the logical addresses into physical addresses, which are usually spread throughout memory.
 Subdivision of Run-time Memory:
 Runtime storage comes in blocks, where a byte is the smallest unit of addressable memory. Four consecutive bytes form a machine word. Multibyte objects are stored in consecutive bytes and addressed by their first byte.
 Run-time storage can be subdivided to hold the different components of an executing program:
 Generated executable code
 Static data objects
 Dynamic data objects - heap
 Automatic data objects - stack
 Activation Record:
 Control stack is a run-time stack which is used to keep track of the live procedure activations, i.e. it is used to find out the procedures whose execution has not been completed.
 When a procedure is called (activation begins), its name is pushed onto the stack, and when it returns (activation ends) it is popped.
 Activation record is used to manage the information needed by a single execution of a procedure.
 An activation record is pushed onto the stack when a procedure is called and it is popped when control returns to the caller function.
 The diagram below shows the contents of activation records:
 Return Value: It is used by the called procedure to return a value to the calling procedure.
 Actual Parameters: They are used by the calling procedure to supply parameters to the called procedure.
 Control Link: It points to activation record of
the caller.
 Access Link: It is used to refer to non-local data
held in other activation records.
 Saved Machine Status: It holds the information
about status of machine before the procedure is
called.
 Local Data: It holds the data that is local to the
execution of the procedure.
 Temporaries: It stores the value that arises in the
evaluation of an expression.
 Storage Allocation:
 The different ways to allocate memory are:
1) Static storage allocation
2) Stack storage allocation
3) Heap storage allocation

1) Static storage allocation
 In static allocation, names are bound to storage locations at compile time.
 If memory is created at compile time then the memory will be created in the static area and only once.
 Static allocation does not support dynamic data structures: memory is created only at compile time and deallocated only after program completion.
 Drawback:
 The drawback of static storage allocation is that the size and position of data objects must be known at compile time.
 Another drawback is the restriction on recursive procedures.
 2) Stack storage allocation
 In stack storage allocation, storage is organized as a stack.
 An activation record is pushed onto the stack when an activation begins and it is popped when the activation ends.
 Activation record contains the locals so that
they are bound to fresh storage in each
activation record. The value of locals is
deleted when the activation ends.
 It works on the basis of last-in-first-out
(LIFO) and this allocation supports the
recursion process.
3)Heap Storage Allocation
 Heap allocation is the most flexible
allocation scheme.
 Allocation and deallocation of memory can
be done at any time and at any place
depending upon the user's requirement.
 Heap allocation is used to allocate memory to variables dynamically; when the variables are no longer used, the memory is reclaimed.
 Heap storage allocation supports the
recursion process.
 Example: The activations of the following recursive function are allocated dynamically at run time:

int fact(int n)
{
    if (n <= 1)
        return 1;
    else
        return n * fact(n - 1);
}

fact(6);
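 For heap storage specifically, here is a small C sketch (not from the slides) in which memory is allocated and released at arbitrary points during execution:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 6;
    int *table = malloc(n * sizeof *table);  /* allocated on the heap at run time */
    if (table == NULL)
        return 1;
    table[0] = 1;
    for (int i = 1; i < n; i++)
        table[i] = table[i - 1] * (i + 1);   /* table[i] holds fact(i + 1) */
    printf("fact(6) = %d\n", table[n - 1]);
    free(table);                             /* reclaimed when no longer used */
    return 0;
}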
 Data structure for symbol table
 A compiler contains two types of symbol table:
 A) global symbol table and
 B) scope symbol tables.
 The global symbol table can be accessed by all the procedures, while a scope symbol table is visible only within its own scope.
 The scope of a name and the symbol tables are arranged in a hierarchical structure as shown below:
 The above program can be represented in a hierarchical data structure of symbol tables:
 The global symbol table contains one global
variable and two procedure names. The name
mentioned in the sum_num table is not available
for sum_id and its child tables.
 Data structure hierarchy of symbol table is
stored in the semantic analyser. If you want to
search the name in the symbol table then you
can search it using the following algorithm:
 First, the symbol is searched in the current symbol table.
 If the name is found, the search is complete; otherwise the name is searched in the symbol table of the parent, and so on, until
 the name is found or the global symbol table has been searched.
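 This search can be sketched in C as follows; the structures and field names are illustrative assumptions:

#include <stddef.h>
#include <string.h>

struct entry { const char *name; /* ... attributes ... */ };

struct symtab {
    struct symtab *parent;   /* enclosing scope; NULL for the global table */
    struct entry  *entries;
    int            count;
};

/* search the current table first, then each parent, up to the global table */
struct entry *lookup(struct symtab *t, const char *name)
{
    for (; t != NULL; t = t->parent)
        for (int i = 0; i < t->count; i++)
            if (strcmp(t->entries[i].name, name) == 0)
                return &t->entries[i];   /* found in the nearest enclosing scope */
    return NULL;                         /* global table searched: not found */
}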
 Representing Scope Information:
 In the source program, every name possesses a region
of validity, called the scope of that name.
 The rules in a block-structured language are as
follows:
1) If a name is declared within block B then it will be valid only within B.
2)If B1 block is nested within B2 then the name that is
valid for block B2 is also valid for B1 unless the name's
identifier is re-declared in B1.
• These scope rules need a more complicated organization of the symbol table than a list of associations between names and attributes.
• Tables are organized into a stack, and each table contains the list of names and their associated attributes.
• Whenever a new block is entered, a new table is pushed onto the stack. The new table holds the names that are declared as local to this block.
• When a declaration is compiled, the table is searched for the name.
• If the name is not found in the table then the new name is inserted.
• When a reference to a name is translated, the tables are searched, starting from the top table on the stack.
 For example:
void f(int m) {
float x, y;
{
int i, j;
int u, v;
}
}
int g (int n)
{
bool t;
}
Mrs. Jyoti C. Bachhav
 Intermediate code
 Intermediate code is used while translating the source code into machine code. Intermediate code lies between the high-level language and the machine language.
 Fig: Position of intermediate code generator
 Intermediate representation
 Intermediate code can be represented in two
ways:
 1. High Level intermediate code:
High level intermediate code can be represented
as source code. To enhance performance of source
code, we can easily apply code modification. But
to optimize the target machine, it is less
preferred.
 2. Low Level intermediate code
Low level intermediate code is close to the target machine, which makes it suitable for register and memory allocation, etc. It is used for machine-dependent optimizations.
A code generator is expected to have an understanding of the
target machine’s runtime environment and its instruction set. The
code generator should take the following things into consideration
to generate the code:
 Target language : The code generator has to be aware of the
nature of the target language for which the code is to be
transformed. That language may facilitate some machine-
specific instructions to help the compiler generate the code in
a more convenient way. The target machine can have either
CISC or RISC processor architecture.
 IR Type : Intermediate representation has various forms. It can
be in Abstract Syntax Tree (AST) structure, Reverse Polish
Notation, or 3-address code.
 Selection of instruction : The code generator takes
Intermediate Representation as input and converts (maps) it
into target machine’s instruction set. One representation can
have many ways (instructions) to convert it, so it becomes the
responsibility of the code generator to choose the appropriate
instructions wisely.
 Register allocation : A program has a number
of values to be maintained during the
execution. The target machine’s architecture
may not allow all of the values to be kept in
the CPU memory or registers. Code generator
decides what values to keep in the registers.
Also, it decides the registers to be used to
keep these values.
 Ordering of instructions : At last, the code
generator decides the order in which the
instruction will be executed. It creates
schedules for instructions to execute them.
 Descriptors:-
 The code generator has to track both the registers (for
availability) and addresses (location of values) while generating
the code. For both of them, the following two descriptors are
used:
 Register descriptor : Register descriptor is used to inform the
code generator about the availability of registers. Register
descriptor keeps track of values stored in each register.
Whenever a new register is required during code generation, this
descriptor is consulted for register availability.
 Address descriptor : Values of the names (identifiers) used in the
program might be stored at different locations while in
execution. Address descriptors are used to keep track of memory
locations where the values of identifiers are stored. These
locations may include CPU registers, heaps, stacks, memory or a
combination of the mentioned locations.
 The code generator keeps both descriptors updated in real time. For a load statement, LD R1, x, the code generator:
 updates the register descriptor of R1 to record that it holds the value of x, and
 updates the address descriptor of x to show that one instance of x is in R1.
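 As an illustrative C sketch (the data structures are assumptions, not a prescribed design), the update for LD R1, x could look like:

#include <stdio.h>
#include <string.h>

#define NREGS 4

char reg_desc[NREGS][16];   /* register descriptor: name of the value held by each register */

void gen_load(int r, const char *x)
{
    printf("LD R%d, %s\n", r, x);
    strcpy(reg_desc[r], x);   /* register descriptor: R r now holds x */
    /* the address descriptor of x would additionally record "in R r",
       alongside x's memory location */
}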
 Generating Code for Assignment Statements:
 The assignment statement d:= (a-b) + (a-c) + (a-c) can be
translated into the following sequence of three address
code:
t:= a-b
u:= a-c
v:= t +u
d:= v+u
 The code sequence for the example is as follows (registers are initially empty):

Statement    Code Generated   Register descriptor   Address descriptor
t := a - b   MOV a, R0        R0 contains t         t in R0
             SUB b, R0
u := a - c   MOV a, R1        R0 contains t         t in R0
             SUB c, R1        R1 contains u         u in R1
v := t + u   ADD R1, R0       R0 contains v         u in R1
                              R1 contains u         v in R0
d := v + u   ADD R1, R0       R0 contains d         d in R0
             MOV R0, d                              d in R0 and memory
 Intermediate code for expressions:
 1) Postfix Notation
 Postfix notation is a useful form of intermediate code if the given language consists of expressions.
 Postfix notation is also called as 'suffix notation'
and 'reverse polish'.
 Postfix notation is a linear representation of a
syntax tree.
 In the postfix notation, any expression can be
written unambiguously without parentheses.
 The ordinary (infix) way of writing the product of x and y is with the operator in the middle: x * y. But in postfix notation, we place the operator at the right end, as xy*.
 In postfix notation, the operator follows the operands.
 Example:

Production       Semantic Rule                        Program fragment
E → E1 op E2     E.code = E1.code || E2.code || op    print op
E → (E1)         E.code = E1.code
E → id           E.code = id                          print id
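 The rule E.code = E1.code || E2.code || op is simply a post-order (depth-first) walk; a minimal C sketch of such a postfix emitter, with an assumed tree representation:

#include <stdio.h>

struct expr {
    char op;                    /* operator, or 0 for a leaf   */
    char id;                    /* identifier name, for a leaf */
    struct expr *left, *right;
};

void emit_postfix(struct expr *e)
{
    if (e->op == 0) {           /* E -> id : print id */
        printf("%c ", e->id);
        return;
    }
    emit_postfix(e->left);      /* E1.code */
    emit_postfix(e->right);     /* E2.code */
    printf("%c ", e->op);       /* ... followed by op */
}

int main(void)
{
    /* x * y  prints as:  x y * */
    struct expr x = {0, 'x', NULL, NULL};
    struct expr y = {0, 'y', NULL, NULL};
    struct expr m = {'*', 0, &x, &y};
    emit_postfix(&m);
    printf("\n");
    return 0;
}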
2)Three address code:-
 Three-address code is an intermediate code. It is used by the
optimizing compilers.
 In three-address code, the given expression is broken down into
several separate instructions. These instructions can easily
translate into assembly language.
 Each Three address code instruction has at most three
operands. It is a combination of assignment and a binary
operator.
 Example
 Given expression: a := (-c * b) + (-c * d)
 Three-address code is as follows:
t1 := -c
t2 := b*t1
t3 := -c
t4 := d * t3
t5 := t2 + t4
a := t5
 The temporaries t1, ..., t5 are used like registers in the target program.
 The three address code can be represented in two forms: quadruples and triples.
3)Triples:-
 The triples have three fields to implement the
three address code. The field of triples contains
the name of the operator, the first source
operand and the second source operand.
 In triples, the results of respective sub-
expressions are denoted by the position of
expression. Triple is equivalent to DAG while
representing expressions.
 Fig: Triples field
 Example: a := -b * (c + d)
 The three address code is as follows:
t1 := -b
t2 := c + d
t3 := t1 * t2
a := t3
 These statements are represented by triples as
follows:
        Operator   Source 1   Source 2
(0)     uminus     b          -
(1)     +          c          d
(2)     *          (0)        (1)
(3)     :=         (2)        -
 Triples face the problem of code immovability during optimization: the results are positional, so changing the order or position of an expression may cause problems.
 Indirect Triples:
 This representation is an enhancement over
triples representation. It uses pointers
instead of position to store results. This
enables the optimizers to freely re-position
the sub-expression to produce an optimized
code.
 4) Quadruples:-
 The quadruples have four fields to
implement the three address code.
 The field of quadruples contains the name of
the operator, the first source operand, the
second source operand and the result
respectively.
 Fig: Quadruples field
 Example: a := -b * (c + d)
 Three-address code is as follows:
t1 := -b
t2 := c + d
t3 := t1 * t2
a := t3
 These statements are represented by quadruples
as follows:
        Operator   Source 1   Source 2   Destination
(0)     uminus     b          -          t1
(1)     +          c          d          t2
(2)     *          t1         t2         t3
(3)     :=         t3         -          a
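 A quadruple is naturally a four-field record; here is a C sketch of the table above (the enum and struct names are assumptions):

enum op { OP_UMINUS, OP_ADD, OP_MUL, OP_ASSIGN };

struct quad {
    enum op     op;       /* operator                        */
    const char *src1;     /* first source operand            */
    const char *src2;     /* second source operand, or NULL  */
    const char *dest;     /* destination (result name)       */
};

/* the quadruples for the example above */
struct quad code[] = {
    { OP_UMINUS, "b",  NULL, "t1" },
    { OP_ADD,    "c",  "d",  "t2" },
    { OP_MUL,    "t1", "t2", "t3" },
    { OP_ASSIGN, "t3", NULL, "a"  },
};

 Because results are named explicitly in the dest field, quadruples can be reordered more freely than triples.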
 Design Issues:-
 In the code generation phase, various issues can arise:
1)Input to the code generator
2)Target program
3)Memory management
4)Instruction selection
5)Register allocation
6)Evaluation order
 1) Input to the code generator:
 The input to the code generator contains the
intermediate representation of the source program
and the information of the symbol table. The source
program is produced by the front end.
 Intermediate representation has several choices:
a) Postfix notation
b) Syntax tree
c) Three address code
 We assume the front end produces a low-level intermediate representation, i.e. values of names in it can be directly manipulated by machine instructions.
 The code generation phase requires complete, error-free intermediate code as input.
 2. Target program:-
The target program is the output of the code
generator. The output can be:
 a) Assembly language: It allows subprograms to be separately compiled.
 b) Relocatable machine language: It makes
the process of code generation easier.
 c) Absolute machine language: It can be
placed in a fixed location in memory and can
be executed immediately.
 3. Memory management:-
 During the code generation process the symbol table entries have to be mapped to actual data addresses, and labels have to be mapped to instruction addresses.
 Mapping names in the source program to addresses of data is done cooperatively by the front end and the code generator.
 Local variables are stack-allocated in the activation record, while global variables are kept in the static area.
 4. Instruction selection:
 The instruction set of the target machine should be complete and uniform.
 When you consider the efficiency of the target machine, the instruction speeds and machine idioms are important factors.
 The quality of the generated code can be determined by its speed and size.
 Example:
 The three address code is:
a := b + c
d := a + e
 Inefficient assembly code is:
MOV b, R0    ; R0 ← b
ADD c, R0    ; R0 ← c + R0
MOV R0, a    ; a ← R0
MOV a, R0    ; R0 ← a (redundant: a is already in R0)
ADD e, R0    ; R0 ← e + R0
MOV R0, d    ; d ← R0
 5. Register allocation
 Registers can be accessed faster than memory. Instructions involving register operands are shorter and faster than those involving memory operands.
 The following sub-problems arise when we use registers:
 Register allocation: In register allocation, we select the set of variables that will reside in registers.
 Register assignment: In register assignment, we pick the specific register in which each variable will reside.
 Certain machines require even-odd pairs of registers for some operands and results.
 For example:
 Consider the following division instruction of the form:
D x, y
 Where,
 x is the dividend, held in the even register of an even/odd register pair
 y is the divisor
 The even register is used to hold the remainder.
 The odd register is used to hold the quotient.
 6. Evaluation order
 The efficiency of the target code can be affected by the order in which the computations are performed. Some computation orders need fewer registers to hold intermediate results than others.
Mrs. Jyoti C. Bachhav
 Optimization of Basic Blocks:
 The optimization process can be applied on a basic block. While optimizing, we must not change the set of expressions computed by the block.
 There are two types of basic block optimization:
1) Structure-Preserving Transformations
2) Algebraic Transformations
 1. Structure preserving transformations:
 The primary Structure-Preserving Transformations on basic blocks are as follows:
a)Common sub-expression elimination
b)Dead code elimination
c)Renaming of temporary variables
d)Interchange of two independent adjacent
statements
 (a) Common sub-expression elimination:
 A common sub-expression need not be computed over and over again. Instead, it can be computed once and kept in store, from where it is referenced when encountered again.
a := b + c
b := a - d
c := b + c
d := a - d
 In the above block, the second and fourth statements compute the same expression (a - d). So the block can be transformed as follows:
a := b + c
b := a - d
c := b + c
d := b
 (b) Dead-code elimination:
 It is possible that a program contains a large amount of dead code.
 This can happen when variables are declared and defined once and then forgotten, in which case they serve no purpose.
 Suppose the statement x := y + z appears in a block and x is a dead symbol, that is, it is never subsequently used. Then this statement can be safely removed without changing the value of the basic block.
 (c) Renaming temporary variables:
 A statement t := b + c can be changed to u := b + c, where t is a temporary variable and u is a new temporary variable. All the instances of t can be replaced with u without changing the value of the basic block.
 (d) Interchange of statement:
 Suppose a block has the following two
adjacent statements:
t1 : = b + c
t2 : = x + y
 These two statements can be interchanged
without affecting the value of block when
value of t1 does not affect the value of t2.
2. Algebraic transformations:-
 In an algebraic transformation, we can change the set of expressions into an algebraically equivalent set. Thus expressions such as x := x + 0 or x := x * 1 can be eliminated from a basic block without changing the set of expressions computed.
 Constant folding is a class of related optimizations. Here, at compile time, we evaluate constant expressions and replace them by their values. Thus the expression 5 * 2.7 would be replaced by 13.5.
 Sometimes unexpected common sub-expressions are generated by relational operators like <=, >=, <, >, ==, etc.
 Sometimes the associative law is applied to expose common sub-expressions without changing the basic block value. If the source code has the assignments
a := b + c
e := c + d + b
 the following intermediate code may be generated:
a := b + c
t := c + d
e := t + b
 Machine-Independent Optimization
 Code optimization can be performed in the following different ways:
 1) Compile Time Evaluation:
(a) z = 5*(45.0/5.0)*r
    Perform 5*(45.0/5.0) = 45.0 at compile time, giving z = 45.0*r.
(b) x = 5.7
    y = x/3.6
    Evaluate x/3.6 as 5.7/3.6 at compile time.
 2) Dead code elimination:
 Before elimination the code is:
c=a*b
x=b
------
d=a*b+4
 After elimination the code is:
c=a*b
------
d=a*b+4
 Here, x = b is a dead statement because x is never subsequently used in the program. So we can eliminate this statement.
 3) Frequency Reduction:
 It reduces the evaluation frequency of an expression.
 It brings loop-invariant statements out of the loop.
do
{
    item = 10;
    value = value + item;
} while (value < 100);

// This code can be further optimized as:
item = 10;
do
{
    value = value + item;
} while (value < 100);
 4) Induction Variable and Strength Reduction:
 Strength reduction is used to replace a high-strength operator by a low-strength one.
 An induction variable is used in a loop in an assignment of the form i = i + constant.
 Before reduction the code is:
i = 1;
while (i < 10)
{
    y = i * 4;
    i = i + 1;
}

// After reduction the code is:
i = 1;
t = 4;
while (t < 40)
{
    y = t;
    t = t + 4;
}
 Reduction in Strength:-
 Strength reduction is used to replace expensive operations by cheaper ones on the target machine.
 Addition of a constant is cheaper than a multiplication. So we can replace multiplication with addition within the loop.
 Multiplication is cheaper than exponentiation. So we can replace exponentiation with multiplication within the loop.
 Example:
while (i < 10)
{
    j = 3 * i + 1;
    a[j] = a[j] - 2;
    i = i + 2;
}
 After strength reduction the code will be:
s = 3 * i + 1;
while (i < 10)
{
    j = s;
    a[j] = a[j] - 2;
    i = i + 2;
    s = s + 6;
}
 In the above code, it is cheaper to compute s = s + 6 than j = 3 * i + 1.
 Basic Block
 A basic block contains a sequence of statements. The flow of control enters at the beginning of the block and leaves at the end without any halt or branch (except possibly at the last instruction of the block).
 The following sequence of three address
statements forms a basic block:
t1:= x * x
t2:= x * y
t3:= 2 * t2
t4:= t1 + t3
t5:= y * y
t6:= t4 + t5
 Basic block construction:
 Algorithm: Partition into basic blocks
 Input: A sequence of three address statements
 Output: A list of basic blocks, with each three address statement in exactly one block
 Method: First identify the leaders in the code; the rules for finding leaders are as follows (a sketch in C follows this list):
1) The first statement is a leader.
2) Statement L is a leader if it is the target of a conditional or unconditional goto statement, like: if ... goto L or goto L.
3) A statement is a leader if it immediately follows a conditional or unconditional goto statement, like: if ... goto B or goto B.
 For each leader, its basic block consists of the leader and all statements up to, but not including, the next leader or the end of the program.
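 The leader-finding rules can be sketched in C as follows; the instruction encoding is an illustrative assumption:

#include <stdbool.h>

struct instr {
    bool is_jump;   /* conditional or unconditional goto */
    int  target;    /* index of the statement jumped to  */
};

void find_leaders(const struct instr *code, int n, bool *leader)
{
    for (int i = 0; i < n; i++)
        leader[i] = false;
    leader[0] = true;                        /* rule 1: the first statement    */
    for (int i = 0; i < n; i++) {
        if (code[i].is_jump) {
            leader[code[i].target] = true;   /* rule 2: the target of a goto   */
            if (i + 1 < n)
                leader[i + 1] = true;        /* rule 3: statement after a goto */
        }
    }
    /* each basic block then runs from a leader up to, but not
       including, the next leader or the end of the program */
}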
 Consider the following source code for dot
product of two vectors a and b of length 10:
 begin
prod :=0;
i:=1;
 do begin
prod :=prod+ a[i] * b[i];
i :=i+1;
 end
 while i <= 10 ;
 end
 The three address code for the above source program is
given below:
 B1
(1) prod := 0
(2) i := 1
 B2
(3) t1 := 4* i
(4) t2 := a[t1]
(5) t3 := 4* i
(6) t4 := b[t3]
(7) t5 := t2*t4
(8) t6 := prod+t5
(9) prod := t6
(10) t7 := i+1
(11) i := t7
(12) if i<=10 goto (3)
 Basic block B1 contains the statement (1) to (2)
 Basic block B2 contains the statement (3) to (12)
 Flow Graph:-
 A flow graph is a directed graph. It contains the flow-of-control information for the set of basic blocks.
 A control flow graph is used to depict how program control is passed among the blocks. It is useful in loop optimization.
 The flow graph for the vector dot product is given as follows:
 Block B1 is the initial node. Block B2 immediately follows B1, so there is an edge from B1 to B2.
 The target of the jump in the last statement of B2 is the first statement of B2, so there is also an edge from B2 to B2.
 B2 is a successor of B1 and B1 is the predecessor of B2.
 DAG representation for basic blocks:-
 A DAG for a basic block is a directed acyclic graph with the following labels on nodes:
1) The leaves of the graph are labeled by unique identifiers, which can be variable names or constants.
2) Interior nodes of the graph are labeled by an operator symbol.
3) Nodes are also given a sequence of identifiers as labels to store the computed value.
 DAGs are a type of data structure. They are used to implement transformations on basic blocks.
 A DAG provides a good way to determine common sub-expressions.
 It gives a pictorial representation of how the value computed by a statement is used in subsequent statements.
 Algorithm for construction of DAG
 Input: A basic block
 Output: It contains the following information:
 Each node contains a label. For leaves, the label is an identifier.
 Each node contains a list of attached identifiers to hold the computed values.
 The statements have one of the forms:
Case (i)   x := y OP z
Case (ii)  x := OP y
Case (iii) x := y
 Method:
 Step 1:
 If operand y is undefined then create node(y).
 If operand z is undefined then, for case (i), create node(z).
 Step 2:
 For case (i), create node(OP) whose right child is node(z) and left child is node(y), unless such a node already exists.
 For case (ii), check whether there is a node(OP) with one child node(y); if not, create it.
 For case (iii), node n will be node(y).
 Output:
 For node(x), delete x from its list of identifiers. Append x to the attached-identifiers list of the node n found in step 2. Finally set node(x) to n.
 Example: Steps to construct the DAG for the given basic block:
t0 = a + b
t1 = t0 + c
d = t0 + t1
 (Figure: the DAG is built one statement at a time: first for t0 = a + b, then t1 = t0 + c, then d = t0 + t1.)
 Example:
 Consider the following three address statements:
1) S1 := 4 * i
2) S2 := a[S1]
3) S3 := 4 * i
4) S4 := b[S3]
5) S5 := S2 * S4
6) S6 := prod + S5
7) prod := S6
8) S7 := i + 1
9) i := S7
10) if i <= 20 goto (1)
 Construction of DAG: