Compiler Design Notes for UPTU Syllabus
(TCS-502)
COURSE FILE
FOR
Bachelor of Technology
IN
Computer Science and Engineering
Session: 2007-2008
Prepared by MOHIT KUMAR for and on behalf of Meerut Institute of Engineering and Technology, Meerut.
CONTENTS
PREAMBLE
SYLLABUS
LECTURE PLAN
LECTURE NOTES
01 Introduction to Compilers and its Phases
02 Lexical Analysis
03 Basics of Syntax Analysis
04 Top-Down Parsing
05 Basic Bottom-Up Parsing Techniques
06 LR Parsing
07 Syntax-Directed Translation
08 Symbol Tables
09 Run Time Administration
10 Error Detection and Recovery
11 Code Optimization
EXERCISES
Practice Questions
Examination Question Papers
Laboratory Assignments
This course studies programming language translation and compiler design concepts:
language recognition, symbol table management, semantic analysis and code generation.
Prerequisites:
Programming knowledge of at least one high level language to carry out practical work.
Objectives:
To introduce the major concept areas of language translation and compiler design.
Topics:
UNIT I:
Introduction to Compiler: Phases and passes, Bootstrapping, Finite state machines and regular
expressions and their applications to lexical analysis, Implementation of lexical analyzers, lexical-
analyzer generator, LEX-compiler, Formal grammars and their application to syntax analysis,
BNF notation, ambiguity, YACC.
The syntactic specification of programming languages: Context free grammars, derivation
and parse trees, capabilities of CFG.
UNIT II:
Basic Parsing Techniques: Parsers, Shift reduce parsing, operator precedence parsing, top
down parsing, predictive parsers
Automatic Construction of efficient Parsers: LR parsers, the canonical Collection of LR (O)
items, constructing SLR parsing tables, constructing Canonical LR parsing tables, Constructing
LALR parsing tables, using ambiguous grammars, an automatic parser generator, implementation
of LR parsing tables, constructing LALR sets of items.
UNIT III:
Syntax-directed Translation: Syntax-directed Translation schemes, Implementation of Syntax-
directed Translators, Intermediate code, postfix notation, Parse trees & syntax trees, three
address code, quadruple & triples, translation of assignment statements, Boolean expressions,
statements that alter the flow of control, postfix translation, translation with a top down parser.
More about translation: Array references in arithmetic expressions, procedures call,
declarations, case statements.
UNIT IV:
Symbol Tables: Data structure for symbols tables, representing scope information.
Run-Time Administration: Implementation of simple stack allocation scheme, storage allocation
in block structured language.
Error Detection & Recovery: Lexical Phase errors, syntactic phase errors semantic errors.
UNIT V:
Introduction to code optimization: Loop optimization, the DAG representation of basic blocks,
value numbers and algebraic laws, Global Data-Flow analysis.
TEXTBOOK:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”
Addison-Wesley.
LECTURE PLAN

UNIT FIRST (August-September 2007)
Competencies: basic concepts of the compiler; overview of passes, phases, lexical analyzers, CFG.
Topics and teaching hours:
1. Introduction to Compilers and its Phases – 1
2. Lexical Analysis – 3
3. Basics of Syntax Analysis – 3
4. Top-Down Parsing – 4

UNIT SECOND (October 2007)
Competencies: internal details about translation, actions to ...

UNIT THIRD
A compiler is a program that takes a program written in a source language and translates it into an
equivalent program in a target language.
This subject discusses the various techniques used to achieve this objective. In addition to the
development of a compiler, the techniques used in compiler design can be applicable to many
problems in computer science.
o Techniques used in a lexical analyzer can be used in text editors, information
retrieval system, and pattern recognition programs.
o Techniques used in a parser can be used in a query processing system such as
SQL.
o Much software with a complex front end may need techniques used in
compiler design. For example, a symbolic equation solver takes an equation
as input; such a program must parse the given input equation.
o Most of the techniques used in compiler design can be used in Natural Language
Processing (NLP) systems.
Each phase transforms the source program from one representation into another representation.
They communicate with error handlers and the symbol table.
• Lexical Analyzer reads the source program character by character and returns the tokens
of the source program.
• A token describes a pattern of characters having the same meaning in the source program
(such as identifiers, operators, keywords, numbers, delimiters and so on).
Example:
In the line of code newval := oldval + 12, tokens are:
newval (identifier)
:= (assignment operator)
oldval (identifier)
+ (add operator)
12 (a number)
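A lexical analyzer for exactly this token set can be sketched in a few lines. The sketch below is our illustration (the token names and the use of Python's re module are not part of the original notes):

```python
import re

# Token specification: ordered (name, pattern) pairs; the first match wins.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),          # e.g. 12
    ("ASSIGN", r":="),           # assignment operator
    ("PLUS",   r"\+"),           # add operator
    ("ID",     r"[A-Za-z_]\w*"), # identifiers such as newval, oldval
    ("SKIP",   r"\s+"),          # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Return the list of (token-type, lexeme) pairs for `source`."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("newval := oldval + 12"))
```

Running it on the example line yields the five tokens listed above, in order.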
• A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given
program.
• A syntax analyzer is also called a parser.
• A parse tree describes a syntactic structure.
Example:
For the line of code newval := oldval + 12, the parse tree is rooted at an assignment node
with three children: identifier (newval), :=, and expression; the expression node expands to
expression + expression, whose children in turn derive identifier (oldval) and number (12).
Example:
CFG used for the above parse tree is:
assignment → identifier := expression
expression → identifier
expression → number
expression → expression + expression
• Depending on how the parse tree is created, there are different parsing techniques.
• These parsing techniques are categorized into two groups:
– Top-Down Parsing,
– Bottom-Up Parsing
• Top-Down Parsing:
– Construction of the parse tree starts at the root, and proceeds towards the leaves.
– Efficient top-down parsers can be easily constructed by hand.
– Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
• Bottom-Up Parsing:
– Construction of the parse tree starts at the leaves, and proceeds towards the root.
• A semantic analyzer checks the source program for semantic errors and collects the type
information for the code generation.
• Type-checking is an important part of the semantic analyzer.
• Normally semantic information cannot be represented by the context-free grammar used in
syntax analysis.
• The context-free grammars used in syntax analysis are integrated with attributes (semantic
rules); the result is a syntax-directed translation and attribute grammars.
Example:
In the line of code newval := oldval + 12, the type of the identifier newval must match
the type of the expression (oldval+12).
• A compiler may produce explicit intermediate codes representing the source program.
• These intermediate codes are generally machine-architecture independent, but their level is
close to the level of machine codes.
• The code optimizer optimizes the code produced by the intermediate code generator in
terms of time and space.

Example:
For an assignment such as id1 := id2 * id3 + 1, the generated code can be reduced to:
MOVE id2, R1
MULT id3, R1
ADD #1, R1
MOVE R1, id1
Phases of a compiler are the sub-tasks that must be performed to complete the compilation
process. Passes refer to the number of times the compiler has to traverse through the entire
program.
There are three languages involved in a single compiler: the source language (S), the target
language (A) and the language in which the compiler is written (L). Such a compiler can be
denoted CLSA (a compiler written in L, translating S into A).
The language of the compiler and the target language are usually the language of the computer
on which it is working, giving CASA.
If a compiler is written in its own language (L = S), the problem is how to compile the first
compiler. For this we take a language R which is a small part of language S. We write a
compiler of R in the language of the computer, A. The compiler of S is written in R and compiled on
the compiler of R to make a full-fledged compiler of S. This is known as Bootstrapping.
Lexical Analyzer reads the source program character by character to produce tokens.
Normally a lexical analyzer does not return a list of tokens in one shot; it returns a token
when the parser asks for one.
2.1 Token
• A token represents a set of strings described by a pattern. For example, an identifier represents
a set of strings which start with a letter and continue with letters and digits. The actual string is
called a lexeme.
• Since a token can represent more than one lexeme, additional information should be held for
that specific lexeme. This additional information is called the attribute of the token.
• For simplicity, a token may have a single attribute which holds the required information for that
token. For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds
the actual attributes for that token.
• Examples:
– <identifier, attribute> where attribute is pointer to the symbol table
– <assignment operator> no attribute is needed
– <number, value> where value is the actual value of the number
• Token type and its attribute uniquely identify a lexeme.
• Regular expressions are widely used to specify patterns.
2.2 Languages
2.2.1 Terminology
Examples:
• L1 = {a,b,c,d} L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
Examples:
– Σ = {0,1}
– 0|1 = {0,1}
– (0|1)(0|1) = {00,01,10,11}
– 0* = {ε, 0, 00, 000, 0000, ....}
– (0|1)* = all strings of 0s and 1s, including the empty string
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a
sentence of that language, and “no” otherwise.
• The recognizer for tokens is a finite automaton.
• A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
• This means that we may use a deterministic or a non-deterministic automaton as a lexical
analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one should we use?
– deterministic: a faster recognizer, but it may take more space
– non-deterministic: slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
Example:
An NFA that accepts strings over {a,b} ending in ab (start state 0, accepting state 2):

Transition function of the NFA:
       a       b
0      {0,1}   {0}
1      {}      {2}
2      {}      {}

The equivalent DFA (start state 0, accepting state 2):

Transition function of the DFA:
       a       b
0      1       0
1      1       2
2      1       0

Note that the entries in this function are single values and not sets of values (unlike the NFA).
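The DFA transition function above can be executed directly. A minimal table-driven recognizer (a sketch; the dictionary encoding of the table is ours):

```python
# DFA from the table above: start state 0, accepting state 2.
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}

def accepts(s, start=0, accepting=frozenset({2})):
    """Run the DFA over s; accept iff we stop in an accepting state."""
    state = start
    for ch in s:
        state = DELTA[(state, ch)]
    return state in accepting

# Strings ending in "ab" are accepted.
print(accepts("aab"), accepts("abb"))
```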
Thompson's construction builds an NFA N(r) for each form of regular expression r:
• For ε: a start state i with an ε-transition to the final state f.
• For a symbol a: a start state i with an a-transition to the final state f.
• For r1 | r2: a new start state i with ε-transitions into N(r1) and N(r2), and ε-transitions
from their final states to a new final state f.
• For r1 r2: N(r1) followed by N(r2), joining the final state of N(r1) to the start state of
N(r2) (i N(r1) N(r2) f).
• For r*: a new start state i with ε-transitions to N(r) and to the new final state f, and
ε-transitions from the final state of N(r) back to the start of N(r) and to f.
N(r1) and N(r2) are the NFAs for regular expressions r1 and r2.
Example:
For the RE (a|b)*a, the NFA is built bottom-up: first the NFAs for a and b, then (a|b) by the
union rule, then (a|b)* by the star rule, and finally (a|b)*a by concatenating an a-transition
at the end.
We merge together NFA states by looking at them from the point of view of the input characters:
• From the point of view of the input, any two states that are connected by an ε-transition
may as well be the same, since we can move from one to the other without consuming
any character. Thus states which are connected by an ε-transition will be represented by
the same state in the DFA.
• If it is possible to have multiple transitions based on the same symbol, then we can
regard a transition on a symbol as moving from a state to a set of states (i.e. the union of
all those states reachable by a transition on the current symbol). Thus these states will be
combined into a single DFA state.
• The ε-closure function takes a state and returns the set of states reachable from it
based on (one or more) ε-transitions. Note that this will always include the state itself.
We should be able to get from a state to any state in its ε-closure without consuming any
input.
• The function move takes a state and a character, and returns the set of states reachable
by one transition on this character.
We can generalise both these functions to apply to sets of states by taking the union of the
application to individual states.
put ε-closure({s0}) into DS as an unmarked state
while there is an unmarked state S1 in DS do
begin
mark S1
for each input symbol a do
begin
S2 ← ε-closure(move(S1,a))
if (S2 is not in DS) then
add S2 into DS as an unmarked state
transfunc[S1,a] ← S2
end
end
The NFA for (a|b)*a built by the construction above has start state 0, accepting state 8, and
transitions:
0 –ε→ 1, 7;  1 –ε→ 2, 4;  2 –a→ 3;  4 –b→ 5;  3 –ε→ 6;  5 –ε→ 6;  6 –ε→ 1, 7;  7 –a→ 8
S0 = ε-closure({0}) = {0,1,2,4,7}; put S0 into DS as an unmarked state
⇓ mark S0
ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1; put S1 into DS
ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2; put S2 into DS
transfunc[S0,a] ← S1 transfunc[S0,b] ← S2
⇓ mark S1
ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S1,a] ← S1 transfunc[S1,b] ← S2
⇓ mark S2
ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S2,a] ← S1 transfunc[S2,b] ← S2
The resulting DFA has start state S0 and accepting state S1 (it contains NFA state 8), with
transitions: S0 –a→ S1, S0 –b→ S2; S1 –a→ S1, S1 –b→ S2; S2 –a→ S1, S2 –b→ S2.
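The subset construction above can be reproduced in code. The sketch below hard-codes the Thompson NFA for (a|b)*a from the example (state numbering follows the example; the function names and data layout are ours):

```python
# Thompson NFA for (a|b)*a: states 0..8, start 0, accepting 8.
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}   # epsilon moves
MOVE = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8}}       # symbol moves

def eps_closure(states):
    """All NFA states reachable via zero or more epsilon-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        for t in EPS.get(stack.pop(), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, ch):
    """Union of one-symbol transitions out of `states` on `ch`."""
    return set().union(*(MOVE.get((s, ch), set()) for s in states))

# Subset construction: DS is the set of DFA states, dtran the transfunc.
start = eps_closure({0})
dstates, worklist, dtran = {start}, [start], {}
while worklist:
    S1 = worklist.pop()
    for ch in "ab":
        S2 = eps_closure(move(S1, ch))
        dtran[(S1, ch)] = S2
        if S2 not in dstates:
            dstates.add(S2)
            worklist.append(S2)

S0 = start
print(sorted(S0))                # the set called S0 above
print(sorted(dtran[(S0, "a")]))  # the set called S1 above
print(sorted(dtran[(S0, "b")]))  # the set called S2 above
```

It derives exactly the three DFA states S0, S1 and S2 of the worked example.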
• The input to LEX consists primarily of Auxiliary Definitions and Translation Rules.
• To write regular expression for some languages can be difficult, because their regular
expressions can be quite complex. In those cases, we may use Auxiliary Definitions.
• We can give names to regular expressions, and we can use these names as symbols to
define other regular expressions.
• The Auxiliary Definitions are a sequence of definitions of the form:
d1 → r1
d2 → r2
.
.
dn → rn
Example:
For Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit ) *
If we try to write the regular expression representing identifiers without using regular definitions,
that regular expression will be complex.
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *
Example:
For Unsigned numbers in Pascal
digit → 0 | 1 | ... | 9
digits → digit +
opt-fraction → ( . digits ) ?
opt-exponent → ( E (+|-)? digits ) ?
unsigned-num → digits opt-fraction opt-exponent
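These regular definitions compose mechanically into one regular expression. A sketch using Python's re module, mirroring the definitions by name (the Python encoding is our illustration):

```python
import re

# Each named definition below mirrors one auxiliary definition above.
digit        = r"[0-9]"
digits       = digit + r"+"
opt_fraction = r"(\." + digits + r")?"
opt_exponent = r"(E[+-]?" + digits + r")?"
unsigned_num = digits + opt_fraction + opt_exponent

pattern = re.compile(unsigned_num)
for s in ["3", "3.14", "6.02E23", "1E-5", ".5"]:
    print(s, bool(pattern.fullmatch(s)))
```

Note that ".5" is rejected: the definitions require digits before the optional fraction.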
• Translation Rules comprise an ordered list of Regular Expressions and the Program Code to be
executed when that Regular Expression is encountered.
R1 P1
R2 P2
.
.
Rn Pn
• The list is ordered i.e. the RE’s should be checked in order. If a string matches more than one
RE, the RE occurring higher in the list should be given preference and its Program Code is
executed.
• The Regular Expressions are converted into NFAs. The final states of each NFA correspond
to some RE and its Program Code.
• The different NFAs are then converted to a single NFA with epsilon moves. Each final state of
the combined NFA corresponds one-to-one to a final state of some individual NFA, i.e. to some
RE and its Program Code. The final states have an order according to the corresponding REs.
If more than one final state is entered for some string, then the one that is higher in order is
selected.
• This NFA is then converted to a DFA. Each final state of the DFA corresponds to a set of states
(having at least one final state) of the NFA. The Program Code of each DFA final state is the
program code corresponding to the NFA final state that is highest in order among the final
states in the set of NFA states that make up this DFA final state.
Example:
AUXILIARY DEFINITIONS
(none)
TRANSLATION RULES
a {Action1}
abb {Action2}
a*b+ {Action3}
First we construct an NFA for each RE and then combine them into a single NFA with a new
start state 0 and ε-transitions into each individual NFA:

NFA for a { action1 }: 1 –a→ 2 (final state 2)
NFA for abb { action2 }: 3 –a→ 4 –b→ 5 –b→ 6 (final state 6)
NFA for a*b+ { action3 }: 7 –a→ 7, 7 –b→ 8, 8 –b→ 8 (final state 8)
Combined NFA: 0 –ε→ 1, 0 –ε→ 3, 0 –ε→ 7

This NFA is then converted into a DFA as described earlier; each final DFA state executes the
action of the highest-priority final NFA state it contains.
3.1 Parser
3.2.1 Derivations
Example:
E → E + E | E – E | E * E | E / E | - E
E → ( E )
E → id
• At each derivation step, we can choose any of the non-terminals in the sentential form of G for
the replacement.
• If we always choose the left-most non-terminal in each derivation step, the derivation is called
a left-most derivation.
• If we always choose the right-most non-terminal in each derivation step, the derivation is
called a right-most derivation.
• We will see that top-down parsers try to find the left-most derivation of the given source
program.
• We will see that bottom-up parsers try to find the right-most derivation of the given source
program, in the reverse order.
Example:
The left-most derivation of -(id+id) is:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
At each step the parse tree grows correspondingly: the root E gets children - and E; that E gets
children (, E, ); the inner E gets children E, +, E; and finally the two inner Es become id leaves.
3.2.3 Ambiguity
• A grammar that produces more than one parse tree for some sentence is called an ambiguous
grammar.
• For most parsers, the grammar must be unambiguous.
• An unambiguous grammar gives a unique selection of the parse tree for a sentence.
• We should eliminate the ambiguity in the grammar during the design phase of the compiler.
• An unambiguous grammar should be written to eliminate the ambiguity.
• We have to prefer one of the parse trees of a sentence (generated by an ambiguous
grammar) to disambiguate that grammar, restricting it to this choice.
• Ambiguous grammars (because of ambiguous operators) can be disambiguated according to
the precedence and associativity rules.
Example:
To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use precedence of
operators as follows:
^ (right to left)
* (left to right)
+ (left to right)
In general,
A → A α1 | ... | A αm | β1 | ... | βn where β1 ... βn do not start with A
⇓ Eliminate immediate left recursion
A → β1 A’ | ... | βn A’
A’ → α1 A’ | ... | αm A’ | ε an equivalent grammar
Example:
E → E+T | T
T → T*F | F
F → id | (E)
E → T E’
E’ → +T E’ | ε
T → F T’
T’ → *F T’ | ε
F → id | (E)
Example:
S → Aa | b
A → Sc | d
S ⇒ Aa ⇒ Sca
or
A ⇒ Sc ⇒ Aac
cause a left recursion (the grammar is left-recursive, although not immediately left-recursive).
- Arrange the non-terminals in some order A1, ..., An.
for i from 1 to n do {
for j from 1 to i-1 do {
replace each production
Ai → Aj γ
by
Ai → α1 γ | ... | αk γ
where Aj → α1 | ... | αk
}
eliminate immediate left-recursions among the Ai productions
}
Example:
S → Aa | b
A → Ac | Sd | f
for S:
- we do not enter the inner loop.
- there is no immediate left recursion in S.
for A:
- Replace A → Sd with A → Aad | bd
So, we will have A → Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A → bdA’ | fA’
A’ → cA’ | adA’ | ε
With the alternative ordering (A first, then S):
for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A
A → SdA’ | fA’
A’ → cA’ | ε
for S:
- Replace S → Aa with S → SdA’a | fA’a
So, we will have S → SdA’a | fA’a | b
- Eliminate the immediate left-recursion in S
S → fA’aS’ | bS’
S’ → dA’aS’ | ε
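The transformation A → Aα | β ⇒ A → βA', A' → αA' | ε can be mechanized. A sketch for the immediate case (the list-of-symbols grammar encoding is ours, not part of the notes):

```python
def eliminate_immediate_left_recursion(nt, productions):
    """Split the productions of `nt` into left-recursive alphas and
    non-left-recursive betas, and return rules for nt and a fresh nt'."""
    alphas = [p[1:] for p in productions if p and p[0] == nt]
    betas  = [p for p in productions if not p or p[0] != nt]
    if not alphas:                       # nothing to eliminate
        return {nt: productions}
    new = nt + "'"
    return {
        nt:  [b + [new] for b in betas],          # A  -> beta A'
        new: [a + [new] for a in alphas] + [[]],  # A' -> alpha A' | eps
    }

# E -> E + T | T  (each production is a list of grammar symbols)
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```

For E → E+T | T this reproduces E → T E', E' → +T E' | ε from the example above ([] denotes ε).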
• A predictive parser (a top-down parser without backtracking) insists that the grammar
must be left-factored.
• For example, with stmt → if ... then ... else ... | if ... then ..., when we see if, we cannot
know which production rule to choose to re-write stmt in the derivation.
• In general, alternatives A → αβ1 | αβ2 with a common non-empty prefix α cannot be parsed
predictively; the common prefix must be factored out first.
3.4.1 Algorithm
• For each non-terminal A with two or more alternatives (production rules) with a common
non-empty prefix, say
A → αβ1 | ... | αβn | γ1 | ... | γm
convert it into
A → αA’ | γ1 | ... | γm
A’ → β1 | ... | βn
Example:
A → ad | a | ab | abc | b
⇓
A → aA’ | b
A’ → d | ε | b | bc
(A’ can be left-factored again, since b and bc share the prefix b: A’ → d | ε | bA’’, A’’ → ε | c.)
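Left factoring can be sketched in code as well. The helper below factors on a single-symbol common prefix, one step at a time (a simplification of the longest-common-prefix rule; the encoding is ours):

```python
from collections import defaultdict

def left_factor_once(nt, productions):
    """Group the productions of nt by their first symbol and factor any
    group with two or more alternatives into a fresh non-terminal nt'."""
    groups = defaultdict(list)
    for p in productions:
        groups[p[0] if p else ""].append(p)
    rules = {nt: []}
    for prefix, prods in groups.items():
        if prefix and len(prods) > 1:
            new = nt + "'"                       # fresh non-terminal
            rules[nt].append([prefix, new])
            rules[new] = [p[1:] for p in prods]  # [] denotes epsilon
        else:
            rules[nt].extend(prods)
    return rules

# A -> ad | a | ab | abc | b
print(left_factor_once("A", [["a","d"], ["a"], ["a","b"], ["a","b","c"], ["b"]]))
```

For the example grammar this yields A → aA' | b with A' → d | ε | b | bc; applying the helper again to A' would factor the remaining common prefix b.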
3.5 YACC
YACC generates C code for a syntax analyzer, or parser. YACC uses grammar rules that allow it
to analyze tokens from LEX and create a syntax tree. A syntax tree imposes a hierarchical
structure on tokens. For example, operator precedence and associativity are apparent in the
syntax tree. The next step, code generation, does a depth-first walk of the syntax tree to generate
code. Some compilers produce machine code, while others output assembly.
YACC takes a default action when there is a conflict. For shift-reduce conflicts, YACC will shift.
For reduce-reduce conflicts, it will use the first rule in the listing. It also issues a warning message
whenever a conflict exists. The warnings may be suppressed by making the grammar
unambiguous.
Input to YACC is divided into three sections. The definitions section consists of token
declarations, and C code bracketed by “%{“ and “%}”. The BNF grammar is placed in the rules
section, and user subroutines are added in the subroutines section.
• Backtracking is needed.
• It tries to find the left-most derivation.
Example:
If the grammar is S → aBc; B → bc | b and the input is abc:
The parser first tries S → aBc with B → bc, which yields abcc and fails to match the input;
it then backtracks, tries B → b, and succeeds with abc.
• When re-writing a non-terminal in a derivation step, a predictive parser can uniquely choose a
production rule by just looking at the current symbol in the input string.
Example:
stmt → if ...... |
while ...... |
begin ...... |
for .....
• When we are trying to re-write the non-terminal stmt, we can uniquely choose the
production rule by just looking at the current token (if, while, begin or for).
• We eliminate the left recursion in the grammar and left-factor it; but the resulting grammar
still may not be suitable for predictive parsing (it may not be an LL(1) grammar).
Example:
A → aBb | bAB
proc A {
case of the current token {
‘a’: - match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
}
}
A → aA | bB | ε
• If all other productions fail, we should apply an ε-production. For example, if the current
token is not a or b, we may apply the ε-production.
• The most correct choice: we should apply an ε-production for a non-terminal A when the
current token is in the follow set of A (the set of terminals that can follow A in sentential
forms).
Example:
A → aBe | cBd | C
B → bB | ε
C→f
proc A {
case of the current token {
a: - match the current token with a and move to the next token;
- call B;
- match the current token with e and move to the next token;
c: - match the current token with c and move to the next token;
- call B;
- match the current token with d and move to the next token;
f: - call C //First Set of C
}
}
proc C {
match the current token with f and move to the next token;
}
proc B {
case of the current token {
b: - match the current token with b and move to the next token;
- call B
e,d: - do nothing //Follow Set of B
}
}
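The procedures above translate almost directly into a recursive-descent parser. A Python sketch (the class layout and error handling are ours):

```python
# Recursive-descent parser for: A -> aBe | cBd | C,  B -> bB | eps,  C -> f
class Parser:
    def __init__(self, tokens):
        self.toks = list(tokens) + ["$"]   # end marker
        self.pos = 0

    def look(self):
        return self.toks[self.pos]

    def match(self, t):
        """Match the current token with t and move to the next token."""
        if self.look() != t:
            raise SyntaxError(f"expected {t!r}, got {self.look()!r}")
        self.pos += 1

    def A(self):
        if self.look() == "a":
            self.match("a"); self.B(); self.match("e")
        elif self.look() == "c":
            self.match("c"); self.B(); self.match("d")
        elif self.look() == "f":           # FIRST(C) = {f}
            self.C()
        else:
            raise SyntaxError("no alternative for A")

    def B(self):
        if self.look() == "b":
            self.match("b"); self.B()
        # on e or d (FOLLOW(B)): do nothing, i.e. B -> epsilon

    def C(self):
        self.match("f")

def parse(s):
    p = Parser(s)
    p.A()
    return p.look() == "$"                 # all input consumed

print(parse("abbe"), parse("cd"), parse("f"))
```

All three sample inputs are accepted; any other token sequence raises a SyntaxError or returns False.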
Parsing Table
input buffer
– our string to be parsed. We will assume that its end is marked with a special symbol $.
output
– a production rule representing a step of the derivation sequence (left-most derivation) of
the string in the input buffer.
stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end-marker symbol $.
– initially the stack contains only the symbol $ and the starting symbol S (initial stack: $S).
– when the stack is emptied (i.e. only $ is left in the stack), parsing is completed.
parsing table
– a two-dimensional array M[A,a]
– each row is a non-terminal symbol
– each column is a terminal symbol or the special symbol $
– each entry holds a production rule.
The symbol at the top of the stack (say X) and the current symbol in the input string (say a)
determine the parser action. There are four possible parser actions:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next
symbol.
3. If X is a non-terminal, the parser consults entry M[X,a]: if it holds a production
X → Y1Y2...Yk, the parser replaces X on top of the stack by Yk...Y2Y1 (with Y1 on top)
and outputs that production.
4. Otherwise, the parser announces an error.
Example:
For the Grammar is S → aBa; B → bB | ε and the following LL(1) parsing table:
a b $
S S → aBa
B B→ε B → bB
For the input abba, the parser produces the left-most derivation:
S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba
4.4.2 Constructing LL(1) parsing tables
• Two functions are used in the construction of LL(1) parsing tables: FIRST and FOLLOW.
• FIRST(α) is the set of terminal symbols which occur as first symbols in strings derived from
α, where α is any string of grammar symbols; if α derives ε, then ε is also in FIRST(α).
• FOLLOW(A) is the set of terminals that can appear immediately to the right of the
non-terminal A in some sentential form.
Example:
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
E’ → ε: FIRST(ε) = {ε} contains no terminal; but since ε is in FIRST(ε)
and FOLLOW(E’) = {$, )}, we put E’ → ε into M[E’,$] and M[E’,)].
T’ → ε: FIRST(ε) = {ε} contains no terminal; but since ε is in FIRST(ε)
and FOLLOW(T’) = {$, ), +}, we put T’ → ε into M[T’,$], M[T’,)] and M[T’,+].
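The FIRST and FOLLOW sets can be computed by a fixed-point iteration. The sketch below does so for this grammar; "eps" stands for ε, and the list-of-symbols encoding is ours:

```python
# Fixed-point computation of FIRST and FOLLOW for the expression grammar.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],        # [] denotes epsilon
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NT = set(GRAMMAR)

def first_of_seq(seq, FIRST):
    """FIRST of a string of grammar symbols (may contain 'eps')."""
    out = set()
    for X in seq:
        f = FIRST[X] if X in NT else {X}
        out |= f - {"eps"}
        if "eps" not in f:
            return out
    out.add("eps")                       # every symbol can vanish
    return out

FIRST = {A: set() for A in NT}
changed = True
while changed:
    changed = False
    for A, prods in GRAMMAR.items():
        for p in prods:
            f = first_of_seq(p, FIRST)
            if not f <= FIRST[A]:
                FIRST[A] |= f
                changed = True

FOLLOW = {A: set() for A in NT}
FOLLOW["E"].add("$")                     # $ follows the start symbol
changed = True
while changed:
    changed = False
    for A, prods in GRAMMAR.items():
        for p in prods:
            for i, X in enumerate(p):
                if X in NT:
                    tail = first_of_seq(p[i + 1:], FIRST)
                    add = (tail - {"eps"}) | (FOLLOW[A] if "eps" in tail else set())
                    if not add <= FOLLOW[X]:
                        FOLLOW[X] |= add
                        changed = True

print(sorted(FOLLOW["E'"]), sorted(FOLLOW["T'"]))
```

It reproduces FOLLOW(E') = {$, )} and FOLLOW(T') = {$, ), +} as used above.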
id + * ( ) $
E E → TE’ E → TE’
E’ E’ → +TE’ E’ → ε E’ → ε
T T → FT’ T → FT’
T’ T’ → ε T’ → *FT’ T’ → ε T’ → ε
F F → id F → (E)
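The table above drives a simple stack-based parser. A sketch of the LL(1) parsing loop (the dictionary encoding of M[A,a] is ours):

```python
# Table-driven LL(1) parser for the expression grammar above.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    """Return True iff `tokens` (a list of terminals) is derivable from E."""
    stack = ["$", "E"]
    toks = list(tokens) + ["$"]
    i = 0
    while stack:
        X = stack.pop()
        a = toks[i]
        if X == a:                                  # match terminal (or $)
            i += 1
        elif X in NONTERMS and (X, a) in TABLE:
            stack.extend(reversed(TABLE[(X, a)]))   # expand by M[X,a]
        else:
            return False                            # empty entry: error
        # an empty right side ([]) simply pops X: the epsilon-production
    return i == len(toks)

print(ll1_parse(["id", "+", "id", "*", "id"]))
print(ll1_parse(["id", "+", "*"]))
```

The first call accepts id+id*id; the second rejects the malformed input at the empty table entry.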
LL(1)
• A grammar whose parsing table has no multiply-defined entries is said to be an LL(1) grammar.
• If an entry of the parsing table contains more than one production rule, we say that the
grammar is not an LL(1) grammar.
• A grammar G is LL(1) if and only if the following conditions hold for two distinctive production
rules A → α and A → β:
1. Both α and β cannot derive strings starting with same terminals.
2. At most one of α and β can derive to ε.
3. If β can derive to ε, then α cannot derive to any string starting with a terminal in
FOLLOW(A).
Example:
S → iCtSE | a
E → eS | ε
C → b

FIRST(iCtSE) = {i}   FIRST(a) = {a}   FIRST(eS) = {e}   FIRST(ε) = {ε}   FIRST(b) = {b}
FOLLOW(S) = {$, e}   FOLLOW(E) = {$, e}   FOLLOW(C) = {t}

Parsing table:
      a       b       e               i           t       $
S     S → a                           S → iCtSE
E                     E → eS, E → ε                       E → ε
C             C → b

Since M[E,e] contains two production rules, this grammar is not LL(1).
• What do we have to do if the resulting parsing table contains multiply-defined entries?
– If we didn't eliminate left recursion, eliminate the left recursion in the grammar.
– If the grammar is not left-factored, we have to left-factor the grammar.
– If the (new grammar's) parsing table still contains multiply-defined entries, that grammar is
ambiguous or it is inherently not an LL(1) grammar.
• A left-recursive grammar cannot be an LL(1) grammar.
– A → Aα | β
⇒ any terminal that appears in FIRST(β) also appears in FIRST(Aα), because Aα ⇒ βα.
⇒ If β is ε, any terminal that appears in FIRST(α) also appears in FIRST(Aα) and
FOLLOW(A).
• If a grammar is not left-factored, it cannot be an LL(1) grammar.
– A → αβ1 | αβ2
⇒ any terminal that appears in FIRST(αβ1) also appears in FIRST(αβ2).
• An ambiguous grammar cannot be an LL(1) grammar.
• A bottom-up parser creates the parse tree of the given input starting from leaves towards the
root.
• A bottom-up parser tries to find the right-most derivation of the given input in the reverse
order.
(a) S ⇒ ... ⇒ ω (the right-most derivation of ω)
(b) the bottom-up parser finds this right-most derivation in the reverse order
• Bottom-up parsing is also known as shift-reduce parsing because its two main actions are shift
and reduce.
– At each shift action, the current symbol in the input string is pushed onto a stack.
– At each reduction step, the symbols at the top of the stack (this symbol sequence is the
right side of a production) are replaced by the non-terminal on the left side of that
production.
– There are also two more actions: accept and error.
• A shift-reduce parser tries to reduce the given input string into the starting symbol.
• At each reduction step, a substring of the input matching to the right side of a production rule
is replaced by the non-terminal at the left side of that production rule.
• If the substring is chosen correctly, the right most derivation of that string is created in the
reverse order.
Example:
For Grammar S → aABb; A → aA | a; B → bB | b and Input string aaabb,
aaabb
⇒ aaAbb
⇒ aAbb
⇒ aABb
⇒ S
5.1.1 Handle
• Informally, a handle of a string is a substring that matches the right side of a production rule.
– But not every substring that matches the right side of a production rule is a handle.
• A handle of a right-sentential form γ (≡ αβω) is a production rule A → β and a position in γ
where the string β may be found and replaced by A to produce the previous right-sentential
form in a rightmost derivation of γ:
S ⇒ αAω ⇒ αβω
• If the grammar is unambiguous, then every right-sentential form of the grammar has exactly
one handle.
• We will see that ω is a string of terminals.
S = γ0 ⇒ γ1 ⇒ ... ⇒ γn = input string
• Start from γn, find a handle An → βn in γn, and replace βn by An to get γn-1.
• Then find a handle An-1 → βn-1 in γn-1, and replace βn-1 by An-1 to get γn-2.
• Repeat this until we reach S.
Example:
E → E+T | T
T → T*F | F
F → (E) | id
Example:
$          id+id*id$   shift
$id        +id*id$     reduce by F → id
$F         +id*id$     reduce by T → F
$T         +id*id$     reduce by E → T
$E         +id*id$     shift
$E+        id*id$      shift
$E+id      *id$        reduce by F → id
$E+F       *id$        reduce by T → F
$E+T       *id$        shift
$E+T*      id$         shift
$E+T*id    $           reduce by F → id
$E+T*F     $           reduce by T → T*F
$E+T       $           reduce by E → E+T
$E         $           accept
• There are context-free grammars for which shift-reduce parsers cannot be used.
• Stack contents and the next input symbol may not decide action:
– shift/reduce conflict: Whether make a shift operation or a reduction.
– reduce/reduce conflict: The parser cannot decide which of several reductions to make.
• If a shift-reduce parser cannot be used for a grammar, that grammar is called a non-LR(k)
grammar.
There are two main categories of shift-reduce parsers:
1. Operator-Precedence Parser
– simple, but only a small class of grammars.
2. LR-Parsers
– Covers wide range of grammars.
• SLR – Simple LR parser
• CLR – most general LR parser (Canonical LR)
• LALR – intermediate LR parser (Look Ahead LR)
– SLR, CLR and LALR work the same way; only their parsing tables are different.
In terms of the classes of grammars they handle: SLR ⊂ LALR ⊂ CLR ⊂ CFG.
• Operator grammar
– small, but an important class of grammars
– we may have an efficient operator precedence parser (a shift-reduce parser) for an
operator grammar.
• In an operator grammar, no production rule can have:
– ε at the right side
– two adjacent non-terminals at the right side.
Examples:
E → AB; A → a; B → b – not an operator grammar (two adjacent non-terminals in AB)
E → EOE; E → id; O → + | * | / – not an operator grammar (adjacent non-terminals in EOE)
E → E+E | E*E | E/E | id – an operator grammar
• The determination of correct precedence relations between terminals are based on the
traditional notions of associativity and precedence of operators. (Unary minus causes a
problem).
• The intention of the precedence relations is to find the handle of a right-sentential form:
– <. marks the left end of the handle,
– =· appears in the interior of the handle, and
– .> marks the right end of the handle.
• In our input string $a1a2...an$, we insert the precedence relation between the pairs of terminals
(the precedence relation holds between the terminals in that pair).
To find the handle of a right-sentential form:
• Scan the string from the left end until the first .> is encountered.
• Then scan backwards (to the left) over any =· until a <. is encountered.
• The handle contains everything to the left of the .> and to the right of the <. thus found.
The handles thus obtained can be used to shift-reduce a given string.
• The input string is w$, the initial stack is $ and a table holds precedence relations between
certain terminals
Example:
Some of the precedence relations involving id, parentheses and $:
( =· )    $ <. (    id .> )    ) .> $
( <. (    $ <. id   id .> $    ) .> )
( <. id
Example:
The complete table for the Grammar E → E+E | E-E | E*E | E/E | E^E | (E) | -E | id is:
+ - * / ^ id ( ) $
+ .> .> <. <. <. <. <. .> .>
- .> .> <. <. <. <. <. .> .>
* .> .> .> .> <. <. <. .> .>
/ .> .> .> .> <. <. <. .> .>
^ .> .> .> .> <. <. <. .> .>
id .> .> .> .> .> .> .>
( <. <. <. <. <. <. <. =·
) .> .> .> .> .> .> .>
$ <. <. <. <. <. <. <.
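The relations drive a simple stack-based parser. The sketch below uses a subset of the table (+, *, id, parentheses and $); note that reduced non-terminals are invisible to the relations, so each recorded handle shows only its terminals (the encoding is our illustration):

```python
# Operator-precedence parsing with the relations above.
LT, EQ, GT = "<", "=", ">"
REL = {
    ("+", "+"): GT, ("+", "*"): LT, ("+", "id"): LT, ("+", "("): LT,
    ("+", ")"): GT, ("+", "$"): GT,
    ("*", "+"): GT, ("*", "*"): GT, ("*", "id"): LT, ("*", "("): LT,
    ("*", ")"): GT, ("*", "$"): GT,
    ("id", "+"): GT, ("id", "*"): GT, ("id", ")"): GT, ("id", "$"): GT,
    ("(", "+"): LT, ("(", "*"): LT, ("(", "id"): LT, ("(", "("): LT,
    ("(", ")"): EQ,
    (")", "+"): GT, (")", "*"): GT, (")", ")"): GT, (")", "$"): GT,
    ("$", "+"): LT, ("$", "*"): LT, ("$", "id"): LT, ("$", "("): LT,
}

def op_precedence_parse(tokens):
    """Return the sequence of handles reduced, or raise on error."""
    stack = ["$"]
    toks = list(tokens) + ["$"]
    i, handles = 0, []
    while not (stack == ["$"] and toks[i] == "$"):
        a, b = stack[-1], toks[i]
        rel = REL.get((a, b))
        if rel in (LT, EQ):                 # shift
            stack.append(b)
            i += 1
        elif rel == GT:                     # reduce: pop the handle
            handle = [stack.pop()]
            while REL.get((stack[-1], handle[0])) != LT:
                handle.insert(0, stack.pop())
            handles.append("".join(handle))
        else:
            raise SyntaxError(f"no relation between {a!r} and {b!r}")
    return handles

print(op_precedence_parse(["id", "+", "id", "*", "id"]))
```

For id+id*id the * handle is reduced before the + handle, reflecting the precedence encoded in the table.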
There is another more general way to compute precedence relations among terminals:
Note that the grammar must be unambiguous for this method. Unlike the previous
method, it does not take into account any other property and is based purely on grammar
productions. An ambiguous grammar will result in multiple entries in the table and thus
cannot be used.
• Operator-Precedence parsing cannot handle the unary minus when we also use the
binary minus in our grammar.
• The best approach to solve this problem is to let the lexical analyzer handle this problem.
– The lexical analyzer will return two different operators for the unary minus and
the binary minus.
• Compilers using operator precedence parsers do not need to store the table of precedence
relations.
• The table can be encoded by two precedence functions f and g that map terminal symbols to
integers.
• For symbols a and b.
f(a) < g(b) whenever a <. b
f(a) = g(b) whenever a =· b
f(a) > g(b) whenever a .> b
• Advantages:
– simple
– powerful enough for expressions in programming languages
• Disadvantages:
– It cannot handle the unary minus (the lexical analyzer should handle the unary minus).
– Small class of grammars.
– Difficult to decide which language is recognized by the grammar.
An LR parser has an input buffer a1 ... ai ... an $, a stack holding states and grammar symbols
(S0 X1 S1 ... Xm Sm, with the state Sm on top), a parsing table with action and goto parts, and
an output of the productions used.
• Sm and ai decide the parser action by consulting the parsing action table (the initial stack
contains just S0). There are four possible actions:
1. shift s – shifts the next input symbol and the state s onto the stack:
( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ ) ⇒ ( S0 X1 S1 ... Xm Sm ai s, ai+1 ... an $ )
2. reduce A → β – pops 2|β| items (|β| grammar symbols and |β| states) from the stack,
pushes A and the state s = goto[Sm-|β|, A], and outputs the production A → β.
3. accept – parsing is successfully completed.
4. error – the parser detected an error and calls an error recovery routine.
Example:
1) E → E+T
2) E→T
3) T → T*F
4) T→F
5) F → (E)
6) F → id
Action Goto
state id + * ( ) $ E T F
0 s5 s4 1 2 3
1 s6 acc
2 r2 s7 r2 r2
3 r4 r4 r4 r4
4 s5 s4 8 2 3
5 r6 r6 r6 r6
6 s5 s4 9 3
7 s5 s4 10
8 s6 s11
9 r1 s7 r1 r1
10 r3 r3 r3 r3
11 r5 r5 r5 r5
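The action/goto table above can be executed by the standard LR driver. A sketch (the table is transcribed from above; sN means shift to state N, rN means reduce by production N, and the encoding is ours):

```python
# LR parsing driver using the SLR(1) table above.
PROD = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
        4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}   # lhs and rhs length

ACTION = {
    (0, "id"): "s5", (0, "("): "s4",
    (1, "+"): "s6", (1, "$"): "acc",
    (2, "+"): "r2", (2, "*"): "s7", (2, ")"): "r2", (2, "$"): "r2",
    (3, "+"): "r4", (3, "*"): "r4", (3, ")"): "r4", (3, "$"): "r4",
    (4, "id"): "s5", (4, "("): "s4",
    (5, "+"): "r6", (5, "*"): "r6", (5, ")"): "r6", (5, "$"): "r6",
    (6, "id"): "s5", (6, "("): "s4",
    (7, "id"): "s5", (7, "("): "s4",
    (8, "+"): "s6", (8, ")"): "s11",
    (9, "+"): "r1", (9, "*"): "s7", (9, ")"): "r1", (9, "$"): "r1",
    (10, "+"): "r3", (10, "*"): "r3", (10, ")"): "r3", (10, "$"): "r3",
    (11, "+"): "r5", (11, "*"): "r5", (11, ")"): "r5", (11, "$"): "r5",
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
        (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def lr_parse(tokens):
    """Return the productions used (a rightmost derivation in reverse)."""
    toks = list(tokens) + ["$"]
    stack, i, output = [0], 0, []
    while True:
        act = ACTION.get((stack[-1], toks[i]))
        if act is None:
            raise SyntaxError(f"error at {toks[i]!r}")
        if act == "acc":
            return output
        if act[0] == "s":                   # shift state act[1:]
            stack.append(int(act[1:]))
            i += 1
        else:                               # reduce by production act[1:]
            lhs, length = PROD[int(act[1:])]
            del stack[len(stack) - length:] # pop |rhs| states
            stack.append(GOTO[(stack[-1], lhs)])
            output.append(int(act[1:]))

print(lr_parse(["id", "*", "id", "+", "id"]))
```

For simplicity the stack holds only states; each output number names the production reduced, so reading the list backwards gives the rightmost derivation.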
• An LR(0) item of a grammar G is a production of G with a dot at some position of the right
side.
Example:
The production A → aBb gives the four LR(0) items
A → .aBb, A → a.Bb, A → aB.b, A → aBb.
• Sets of LR(0) items will be the states of action and goto table of the SLR parser.
• A collection of sets of LR(0) items (the canonical LR(0) collection) is the basis for
constructing SLR parsers.
If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items
constructed from I by the two rules:
1. Initially, every item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production of G, then B → .γ is added to
closure(I) (if it is not already there). This rule is applied until no more new items can
be added.
Example:
I = { E’ → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
goto(I,E) = { E’ → E., E → E.+T }
goto(I,T) = { E → T., T → T.*F }
goto(I,F) = {T → F. }
goto(I,() = {F→ (.E), E→ .E+T, E→ .T, T→ .T*F, T→ .F, F→ .(E), F→ .id }
goto(I,id) = { F → id. }
To create the SLR parsing tables for a grammar G, we will create the canonical LR(0)
collection of the grammar G’.
Algorithm:
C is { closure({S’→.S}) }
repeat the followings until no more set of LR(0) items can be added to C.
for each I in C and each grammar symbol X
if goto(I,X) is not empty and not in C
add goto(I,X) to C
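The algorithm above can be transcribed almost directly into Python. The sketch below (all function names are illustrative) computes closure, goto and the canonical LR(0) collection for the expression grammar used in this section; it finds the 12 sets I0 through I11.

```python
# Canonical LR(0) collection for E' -> E, E -> E+T | T,
# T -> T*F | F, F -> (E) | id. An item is (lhs, rhs, dot).
GRAMMAR = {
    "E'": [("E",)],
    "E":  [("E", "+", "T"), ("T",)],
    "T":  [("T", "*", "F"), ("F",)],
    "F":  [("(", "E", ")"), ("id",)],
}
NONTERMS = set(GRAMMAR)
SYMBOLS = {"E", "T", "F", "+", "*", "(", ")", "id"}

def closure(items):
    """Add B -> .gamma for every item A -> alpha.B beta in the set."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(I, X):
    """Advance the dot over X, then take the closure."""
    return closure({(l, r, d + 1) for (l, r, d) in I
                    if d < len(r) and r[d] == X})

def canonical_collection():
    start = closure({("E'", ("E",), 0)})
    C, work = [start], [start]
    while work:
        I = work.pop()
        for X in SYMBOLS:
            J = goto(I, X)
            if J and J not in C:
                C.append(J)
                work.append(J)
    return C

print(len(canonical_collection()))   # 12 states: I0 ... I11
```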
Example:
For grammar used above, Canonical LR(0) items are as follows-
I0: E’ → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id
I1: E’ → E., E → E.+T
I2: E → T., T → T.*F
I3: T → F.
I4: F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id
I5: F → id.
I6: E → E+.T, T → .T*F, T → .F, F → .(E), F → .id
I7: T → T*.F, F → .(E), F → .id
I8: F → (E.), E → E.+T
I9: E → E+T., T → T.*F
I10: T → T*F.
I11: F → (E).
The GOTO graph (finite automaton) over these item sets has the transitions:
goto(I0,E)=I1, goto(I0,T)=I2, goto(I0,F)=I3, goto(I0,()=I4, goto(I0,id)=I5
goto(I1,+)=I6
goto(I2,*)=I7
goto(I4,E)=I8, goto(I4,T)=I2, goto(I4,F)=I3, goto(I4,()=I4, goto(I4,id)=I5
goto(I6,T)=I9, goto(I6,F)=I3, goto(I6,()=I4, goto(I6,id)=I5
goto(I7,F)=I10, goto(I7,()=I4, goto(I7,id)=I5
goto(I8,))=I11, goto(I8,+)=I6
goto(I9,*)=I7
6.3.5 Parsing Table
• If a state does not know whether it will make a shift operation or reduction for a terminal,
we say that there is a shift/reduce conflict.
Example:
S → L=R
S → R
L → *R
L → id
R → L

I0: S’ → .S, S → .L=R, S → .R, L → .*R, L → .id, R → .L
I1: S’ → S.
I2: S → L.=R, R → L.
I3: S → R.
I6: S → L=.R, R → .L, L → .*R, L → .id
I9: S → L=R.

In I2, on the token =, the parser can either shift (from S → L.=R) or reduce by R → L
(since = is in FOLLOW(R)): a shift/reduce conflict.
• If a state does not know whether it will make a reduction operation using the production
rule i or j for a terminal, we say that there is a reduce/reduce conflict.
Example:
S → AaAb I0: S’ → .S
S → BbBa S → .AaAb
A→ε S → .BbBa
B→ε A→.
B→.
Problem:
FOLLOW(A) = {a,b}
FOLLOW(B) = {a,b}

In I0, on the token a: reduce by A → ε or reduce by B → ε — a reduce/reduce conflict.
In I0, on the token b: reduce by A → ε or reduce by B → ε — a reduce/reduce conflict.

If the SLR parsing table of a grammar G has a conflict, we say that G is not an SLR
grammar.
• In the SLR method, state i makes a reduction by A → α when the current token is a:
– if A → α. is in Ii and a is in FOLLOW(A)
• To avoid some of these invalid reductions, the states need to carry more information.
• Extra information is put into a state by including a terminal symbol as a second
component in an item.
Algorithm:
C is { closure({S’→.S,$}) }
repeat the followings until no more set of LR(1) items can be added to C.
for each I in C and each grammar symbol X
if goto(I,X) is not empty and not in C
add goto(I,X) to C
Example:
id * = $ S L R
0 S5 s4 1 2 3
1 acc
2 s6 r5
3 r2
4 S5 s4 8 7
5 r4 r4
no shift/reduce or
6 s12 s11 10 9 no reduce/reduce conflict
7 r3 r3 ⇓
so, it is a LR(1) grammar
8 r5 r5
9 r1
10 r5
11 s12 s11 10 13
12 r4
13 r3
• This shrink process may introduce a reduce/reduce conflict in the resulting LALR parser.
In that case the grammar is NOT LALR.
• This shrink process cannot produce a shift/reduce conflict.
• The core of a set of LR(1) items is the set of its first components.
Example:
S → L.=R,$ has core S → L.=R
R → L.,$ has core R → L.
• We will find the states (sets of LR(1) items) in a canonical LR(1) parser with same cores.
Then we will merge them as a single state.
Example:
I5: L → id.,=/$ and I12: L → id.,$ have the same core, so they are merged into the
single state I512: L → id.,=/$
• We will do this for all states of a canonical LR(1) parser to get the states of the LALR
parser.
• In fact, the number of the states of the LALR parser for a grammar will be equal to the
number of states of the SLR parser for that grammar.
• Create the canonical LR(1) collection of the sets of LR(1) items for the given grammar.
• Find each core; find all sets having that same core; replace those sets having same
cores with a single set which is their union.
C={I0,...,In} ⇒ C’={J1,...,Jm} where m ≤ n
• Create the parsing tables (action and goto tables) same as the construction of the
parsing tables of LR(1) parser.
– Note that if J = I1 ∪ ... ∪ Ik, since I1,...,Ik have the same cores,
⇒ the cores of goto(I1,X),...,goto(Ik,X) must also be the same.
– So, goto(J,X)=K where K is the union of all sets of items having the same core as
goto(I1,X).
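As a small illustrative sketch (not part of the original notes), the core-merging step can be written as follows; the LR(1) items below are the I7/I13 and I8/I10 pairs merged later in this section.

```python
# Merging LR(1) states that share a core. An LR(1) item is
# (lhs, rhs, dot, lookahead); the core drops the lookahead.
def core(state):
    return frozenset((l, r, d) for (l, r, d, a) in state)

def merge_by_core(states):
    merged = {}
    for s in states:
        merged.setdefault(core(s), set()).update(s)
    return list(merged.values())

I7  = {("L", ("*", "R"), 2, "=")}    # L -> *R., =
I13 = {("L", ("*", "R"), 2, "$")}    # L -> *R., $
I8  = {("R", ("L",), 1, "=")}        # R -> L., =
I10 = {("R", ("L",), 1, "$")}        # R -> L., $

lalr = merge_by_core([I7, I13, I8, I10])
print(len(lalr))   # 2 merged states: I713 and I810
```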
• We say that we cannot introduce a shift/reduce conflict during the shrink process for the
creation of the states of a LALR parser.
• Assume that we can introduce a shift/reduce conflict. In this case, a state of LALR parser
must have:
A → α.,a and B → β.aγ,b
• This means that a state of the canonical LR(1) parser must have:
A → α.,a and B → β.aγ,c
But this state also has a shift/reduce conflict, i.e. the original canonical
LR(1) parser already has the conflict.
(The reason: the shift operation does not depend on lookaheads.)
But, we may introduce a reduce/reduce conflict during the shrink process for the creation of
the states of a LALR parser.
Example:
For the above Canonical LR parsing table, we get the following LALR(1) merges:
I7 and I13 ⇒ I713: L → *R.,$/=
I8 and I10 ⇒ I810: R → L.,$/=
state   id    *     =     $       S    L    R
0       s5    s4                  1    2    3
1                         acc
2                   s6    r5
3                         r2
4       s5    s4                       8    7
5                   r4    r4
6       s12   s11                      10   9
7                   r3    r3
8                   r5    r5
9                         r1

There is no shift/reduce or reduce/reduce conflict, so it is an LALR(1) grammar.
Example:
The ambiguous grammar
E → E+E | E*E | (E) | id
generates the same language as the unambiguous grammar
E → E+T | T
T → T*F | F
F → (E) | id
FOLLOW(E) = { $, +, *, ) }

(Figure: the parser states I0, I1, I4, I7 on the stack after seeing E + E.)

state   id    +     *     (     )     $       E
0       s3                s2                  1
1             s4    s5                  acc
2       s3                s2                  6
3             r4    r4          r4    r4
4       s3                s2                  7
5       s3                s2                  8
6             s4    s5          s9
7             r1    s5          r1    r1
8             r2    r2          r2    r2
9             r3    r3          r3    r3
• Grammar symbols are associated with attributes to associate information with the
programming language constructs that they represent.
• Values of these attributes are evaluated by the semantic rules associated with the
production rules.
• Evaluation of these semantic rules:
– may generate intermediate codes
– may put information into the symbol table
– may perform type checking
– may issue error messages
– may perform some other activities
– In fact, they may perform almost any activity.
• An attribute may hold almost anything:
– a string, a number, a memory location, a complex record.
• Evaluation of a semantic rule defines the value of an attribute. But a semantic rule may
also have some side effects such as printing a value.
• Intermediate codes are machine independent codes, but they are close to machine
instructions.
• The given program in a source language is converted to an equivalent program in an
intermediate language by the intermediate code generator.
• Intermediate language can be many different languages, and the designer of the compiler
decides this intermediate language.
– syntax trees can be used as an intermediate language.
– postfix notation can be used as an intermediate language.
– three-address code (quadruples) can be used as an intermediate language
• we will use quadruples to discuss intermediate code generation
• quadruples are close to machine instructions, but they are not actual
machine instructions.
Syntax Tree is a variant of the Parse tree, where each leaf represents an operand and
each interior node an operator.
Example: (syntax-tree figure lost in extraction; it showed operator nodes such as + over the operand leaves a, b and d)
Postfix Notation is another useful form of intermediate code, especially if the language
consists mostly of expressions.
Example: the postfix form of a + b * d is a b d * +.
• We use the term “three-address code” because each statement usually contains three
addresses (two for operands, one for the result).
• The most general kind of three-address code is:
x := y op z
where x, y and z are names, constants or compiler-generated temporaries; op is any
operator.
• But we may also use the following notation for quadruples (a much better notation because it
looks like a machine-code instruction)
op y,z,x
apply operator op to y and z, and store the result in x.
Three-address code can be represented in various forms viz. Quadruples, Triples and
Indirect Triples. These forms are demonstrated by way of an example below.
Example:
A = -B * (C + D)
Three-Address code is as follows:
T1 = -B
T2 = C + D
T3 = T1 * T2
A = T3
Quadruple:
        op    arg1   arg2   result
(1)     -     B             T1
(2)     +     C      D      T2
(3)     *     T1     T2     T3
(4)     =     T3            A

Triple:
        op    arg1   arg2
(1)     -     B
(2)     +     C      D
(3)     *     (1)    (2)
(4)     =     A      (3)

Indirect Triple:
        statement
(0)     (56)
(1)     (57)
(2)     (58)
(3)     (59)

        op    arg1   arg2
(56)    -     B
(57)    +     C      D
(58)    *     (56)   (57)
(59)    =     A      (58)
E → E1 + E2 E.place = newtemp();
E.code = E1.code || E2.code || gen( E.place = E1.place + E2.place )
E → E1 * E2 E.place = newtemp();
E.code = E1.code || E2.code || gen( E.place = E1.place * E2.place )
E → - E1 E.place = newtemp();
E.code = E1.code || gen( E.place = - E1.place )
E → ( E1 ) E.place = E1.place;
E.code = E1.code
E → id E.place = id.place;
E.code = null
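A hedged Python sketch of this translation scheme: the function below mirrors the E.place/E.code rules above over a small tuple-based AST (parentheses disappear in the AST, which corresponds to the E → ( E1 ) rule). The helper names newtemp and translate are illustrative.

```python
# Bottom-up computation of E.place / E.code over an AST in which
# ('+', l, r) and ('*', l, r) are binary nodes, ('-', e) is unary
# minus, and a bare string is an identifier.
_counter = 0
def newtemp():
    global _counter
    _counter += 1
    return f"T{_counter}"

def translate(e):
    """Return (place, code) for expression e."""
    if isinstance(e, str):                 # E -> id
        return e, []
    if len(e) == 2:                        # E -> - E1
        p1, c1 = translate(e[1])
        t = newtemp()
        return t, c1 + [f"{t} = - {p1}"]
    op, l, r = e                           # E -> E1 op E2
    p1, c1 = translate(l)
    p2, c2 = translate(r)
    t = newtemp()
    return t, c1 + c2 + [f"{t} = {p1} {op} {p2}"]

# The running example A = -B * (C + D):
place, code = translate(('*', ('-', 'B'), ('+', 'C', 'D')))
print(code)    # ['T1 = - B', 'T2 = C + D', 'T3 = T1 * T2']
print(place)   # T3
```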
E → E or E
E → E and E
E → not E
E → ( E )
E → id
E → id relop id
Example:
The translation for A or B and C is the three-address sequence:
T1 := B and C
T2 := A or T1
Also, the translation of a relational expression such as A < B is the three-address sequence:
100: if A < B goto 103
101: T := 0
102: goto 104
103: T := 1
104: ...
E → E1 or E2 T = newtemp();
E.place = T;
gen(T = E1.place or E2.place)
E → ( E1 ) E.place = E1.place;
E.code = E1.code
E → id E.place = id.place;
E.code = null
Quadruples are being generated and NEXTQUAD indicates the next available entry in the
quadruple array.
Here (4) is a true exit and (6) is a false exit of the Boolean expressions.
E → E or M E
E → E and M E
E → not E
E → ( E )
E → id
E → id relop id
M → ε
Example:
For the expression P<Q or R<S and T, the parsing steps and corresponding semantic
actions are shown below. We assume that NEXTQUAD has an initial value of 100.
Step 1: P<Q is reduced to E by E → id relop id. The grammatical form is E1 or R<S
and T.
Step 2: R<S is reduced to E by E → id relop id. The grammatical form is E1 or E2 and
T.
104: if T goto _
105: goto _
We have no new code generated but changes are made in the already generated
code (Backpatch).
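The backpatching machinery itself (MAKELIST, MERGE, BACKPATCH) can be sketched in a few lines of Python. This is an illustration under the quad-numbering convention used above (starting at 100); the '_' placeholder marks an as-yet-unfilled jump target, and all names are illustrative.

```python
# Quadruples with unfilled jump targets, plus the three backpatch
# primitives MAKELIST / MERGE / BACKPATCH.
quads = []                         # the quadruple array
def nextquad():
    return len(quads) + 100        # numbering starts at 100, as above

def emit(q):
    quads.append(q)

def makelist(i):
    return [i]

def merge(l1, l2):
    return l1 + l2

def backpatch(lst, target):
    """Fill the target of every incomplete jump on the list."""
    for i in lst:
        quads[i - 100] = quads[i - 100].replace("_", str(target))

# E -> id1 relop id2, for P < Q:
true_list = makelist(nextquad()); emit("if P < Q goto _")
false_list = makelist(nextquad()); emit("goto _")
backpatch(true_list, 104)          # later, when the true exit is known
print(quads)   # ['if P < Q goto 104', 'goto _']
```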
Boolean expressions may in practice contain arithmetic sub-expressions, e.g. (A+B)>C.
We can accommodate such sub-expressions by adding the production E → E op E to our
grammar.
We will also add a new field MODE for E. If E has been achieved after reduction using
the above (arithmetic) production, we make E.MODE = arith, otherwise make E.MODE =
bool.
If E.MODE = arith, we treat it arithmetically and use E.PLACE. If E.MODE = bool, we
treat it as Boolean and use E.FALSE and E.TRUE.
S → LABEL : S
LABEL → id
The semantic action attached with this production is to record the LABEL and its value
(NEXTQUAD) in the symbol table. It will also Backpatch any previous references to this
LABEL with its current value.
(1) S → if E then S
(2) S → if E then S else S
(3) S → while E do S
(4) S → begin L end
(5) S → A
We introduce a new field NEXT for S and L like TRUE and FALSE for E. S.NEXT and
L.NEXT are respectively the pointers to a list of all conditional and unconditional jumps to
the quadruple following statement S and statement-list L in execution order.
We also introduce the marker non-terminal M as in the case of grammar for Boolean
expressions. This is put before statement in if-then, before both statements in if-then-else
and the statement in while-do as we may need to proceed to them after evaluating E. In
case of while-do, we also need to put M before E as we may need to come back to it after
executing S.
In case of if-then-else, if we evaluate E to be true, first S will be executed. After this we
should ensure that instead of second S, the code after this if-then-else statement be
executed. We thus place another non-terminal marker N after first S i.e. before else.
The grammar now is as follows:
(1) S → if E then M S
(2) S → if E then M S N else M S
(3) S → while M E do M S
(4) S → begin L end
(5) S → A
(6) L → L ; M S
(7) L → S
(8) M → ε
(9) N → ε
The production
S → while M1 E do M2 S1
can be factored as
S → C S1
C → W E do
W → while

C → W E do    C.QUAD = W.QUAD
              BACKPATCH (E.TRUE, NEXTQUAD)
              C.FALSE = E.FALSE
S → for L = E1 step E2 to E3 do S1
Here L is any expression with l-value, usually a variable, called the index. E1, E2 and E3 are
expressions called the initial value, increment and limit, respectively. Semantically, the for-
statement is equivalent to the following program.
begin
INDEX = addr ( L );
*INDEX = E1;
INCR = E2;
LIMIT = E3;
while *INDEX <= LIMIT do
begin
code for statement S1;
*INDEX = *INDEX + INCR;
end
end
The non-terminals L, E1, E2, E3 and S appear in the same order as in the production. The
production can be factored as
(1) F → for L
(2) T → F = E1 step E2 to E3 do
(3) S → T S1
Any translation done by a top-down parser can also be done by a bottom-up parser.
But in certain situations, translation with a top-down parser is advantageous, since
tricks such as placing a marker non-terminal can be avoided: semantic routines can be
called in the middle of productions in a top-down parser.
Elements of arrays can be accessed quickly if the elements are stored in a block of
consecutive locations.
For a one-dimensional array A, the location of A[i] is
baseA + (i-low) * width
which can be re-written as
i*width + (baseA - low*width)
So, the location of A[i] can be computed at run-time by evaluating the formula
i*width+c, where c is (baseA - low*width), which is evaluated at compile-time.
Intermediate code generator should produce the code to evaluate this formula i*width+c
(one multiplication and one addition operation).
A two-dimensional array can be stored in either row-major (row-by-row) or column-major
(column-by-column) order.
Most of the programming languages use row-major method.
The location of A[i1,i2] is
baseA + ((i1-low1)*n2 + i2-low2) * width
which can be re-written as
((i1*n2)+i2)*width + (baseA - ((low1*n2)+low2)*width)
So, the intermediate code generator should produce the code to evaluate the
following formula (to find the location of A[i1,i2,...,ik]):
(( ... ((i1*n2)+i2) ... )*nk+ik)*width + (baseA - (( ... ((low1*n2)+low2) ... )*nk+lowk)*width)
To evaluate the (( ... ((i1*n2)+i2) ... )*nk+ik portion of this formula, we can use the
recurrence:
e1 = i1
em = e(m-1) * nm + im
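The recurrence can be sketched directly in Python; all the names and the test values below are illustrative. The second term of the address is the compile-time constant c, exactly as in the one-dimensional case.

```python
# Address of A[idx] for a row-major array, via the recurrence
# e1 = i1;  em = e(m-1) * nm + im.
def element_address(base, width, lows, dims, idx):
    """lows = lower bounds (low1..lowk); dims = dimension sizes
    (n1..nk); idx = subscripts (i1..ik)."""
    e = idx[0]                       # e1 = i1
    c = lows[0]
    for m in range(1, len(idx)):
        e = e * dims[m] + idx[m]     # em = e(m-1) * nm + im
        c = c * dims[m] + lows[m]    # same recurrence on the lows
    # address = e*width + (base - c*width); the parenthesised part
    # is the compile-time constant.
    return e * width + (base - c * width)

# 2D check: with low1 = low2 = 0 this reduces to
# base + (i1*n2 + i2)*width = 1000 + (2*20 + 3)*4 = 1172.
print(element_address(1000, 4, [0, 0], [10, 20], [2, 3]))
# 1D check: base + (i-low)*width = 500 + (5-1)*4 = 516.
print(element_address(500, 4, [1], [10], [5]))
```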
The grammar and suitable translation scheme for arithmetic expressions with array references is
as given below:
E → E1 + E2 E.PLACE = NEWTEMP ( )
E → ( E1 ) E.PLACE = E1.PLACE
L → id L.PLACE = id.PLACE
L.OFFSET = NULL
Here, NDIM denotes the number of dimensions, LIMIT (ARRAY, i) returns the upper limit
along the ith dimension of ARRAY (i.e. ni), and WIDTH (ARRAY) returns the number of bytes
for one element of ARRAY.
7.8 Declarations
Following is the grammar and a suitable translation scheme for declaration statements:
Here, ENTER makes the entry into symbol table while ATTR is used to trace the data type.
Following is the grammar and a suitable translation scheme for Procedure Calls:
switch E
begin
case V1: S1
case V2: S2
.
.
.
case Vn-1: Sn-1
default: Sn
end
8.2 Implementation
• Each entry in a symbol table can be implemented as a record that consists of several fields.
• The entries in symbol table records are not uniform and depend on the program element
identified by the name.
• Some information about the name may be kept outside of the symbol table record and/or
some fields of the record may be left vacant for the reason of uniformity. A pointer to this
information may be stored in the record.
• The name may be stored in the symbol table record itself, or it can be stored in a separate
array of characters and a pointer to it in the symbol table.
• The information about runtime storage location, to be used at the time of code generation, is
kept in the symbol table.
• There are various approaches to symbol table organization e.g. Linear List, Search Tree and
Hash Table.
• How do we allocate space for the generated target code and the data objects of our
source programs?
• The places of the data objects that can be determined at compile time will be allocated
statically.
• But the places for some of the data objects will be allocated at run-time.
• The allocation and de-allocation of the data objects is managed by the run-time support
package.
– the run-time support package is loaded together with the generated target code.
– the structure of the run-time support package depends on the semantics of the
programming language (especially the semantics of procedures in that
language).
• We can use a tree (called activation tree) to show the way control enters and leaves
activations.
• In an activation tree:
– Each node represents an activation of a procedure.
– The root represents the activation of the main program.
– The node a is a parent of the node b iff the control flows from a to b.
– The node a is to the left of the node b iff the lifetime of a occurs before the lifetime of
b.
Example: (activation-tree figure lost in extraction; its nodes included activations of procedures p, q and s)
9.1.2 Control Stack
9.1.2 Control Stack
• The flow of the control in a program corresponds to a depth-first traversal of the activation
tree that:
– starts at the root,
– visits a node before its children, and
– recursively visits the children of each node in left-to-right order.
• A stack (called control stack) can be used to keep track of live procedure activations.
– An activation record is pushed onto the control stack as the activation starts.
– That activation record is popped when that activation ends.
• When node n is at the top of the control stack, the stack contains the nodes along the
path from n to the root.
• The same variable name can be used in the different parts of the program.
• The scope rules of the language determine which declaration of a name applies when the
name appears in the program.
• An occurrence of a variable (a name) is:
– local: if that occurrence is in the same procedure in which the name is declared.
– non-local: otherwise (i.e. it is declared outside of that procedure)
Example:
procedure p;
var b: real;
procedure q;
var a: integer;                { a is local to q }
begin a := 1; b := 2; end;     { b is non-local in q }
begin ... end;
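A minimal sketch of a scoped symbol table that resolves exactly this local/non-local distinction. The implementation choice (a list of hash tables, searched innermost-first) is one of the organizations mentioned in the symbol-table section, and all field names are illustrative.

```python
# A stack of per-scope hash tables; lookup searches from the
# innermost scope outwards, so local declarations shadow outer ones.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]              # innermost scope is last

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def enter(self, name, **attrs):     # e.g. type, offset
        self.scopes[-1][name] = attrs

    def lookup(self, name):
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None                     # undeclared

st = SymbolTable()
st.enter("b", type="real")      # declared in the enclosing procedure
st.enter_scope()                # entering the nested procedure
st.enter("a", type="integer")
print(st.lookup("a"))   # local: found in the inner scope
print(st.lookup("b"))   # non-local: found in the enclosing scope
```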
A typical activation record contains the following fields:

optional control link — points to the activation record of the caller.
optional access link — used to refer to non-local data held in other activation records.
saved machine status — holds information about the state of the machine just before
the procedure is called.
local data — holds data that is local to an execution of the procedure.
temporaries — holds temporary values arising in the evaluation of expressions.
Example: (figure lost in extraction: activation records for a non-recursive program in which main calls procedures p and s)
• Who deallocates?
– Callee de-allocates the part allocated by Callee.
– Caller de-allocates the part allocated by Caller.
9.2.3 Displays
S → AbS | e | ε
A → a | cAd

FOLLOW(S)={$}
FOLLOW(A)={b,d}

      a         b      c         d      e       $
S     S → AbS   sync   S → AbS   sync   S → e   S → ε
A     A → a     sync   A → cAd   sync   sync    sync
• Each empty entry in the parsing table is filled with a pointer to a special error routine which will
take care of that error case.
• These error routines may:
– change, insert, or delete input symbols.
– issue appropriate error messages
– pop items from the stack.
• We should be careful when we design these error routines, because we may put the parser
into an infinite loop.
Error Cases:
– No relation holds between the terminal on the top of stack and the next input symbol.
– A handle is found (reduction step), but there is no production with this handle as a right
side
Error Recovery:
– Each empty entry is filled with a pointer to an error routine.
– The routine decides which right-hand side the popped handle “looks like”, and tries to
recover from that situation.
• An LR parser will detect an error when it consults the parsing action table and finds an
error entry. All empty entries in the action table are error entries.
• Scan down the stack until a state s with a goto on a particular nonterminal A is found.
(Get rid of everything from the stack before this state s).
• Discard zero or more input symbols until a symbol a is found that can legitimately follow
A.
– The symbol a is simply in FOLLOW (A), but this may not work for all situations.
• The parser pushes the nonterminal A and the state goto[s,A] onto the stack, and resumes
normal parsing.
• This nonterminal A is normally a basic programming block (there can be more than one
choice for A):
– stmt, expr, block, ...
• Each empty entry in the action table is marked with a specific error routine.
• An error routine reflects the error that the user most likely will make in that case.
• An error routine inserts the symbols into the stack or the input (or it deletes the symbols
from the stack and the input, or it can do both insertion and deletion).
– missing operand
– unbalanced right parenthesis
Example:
a = b * c;
.
.
.
d = b * c + x –y;
We can eliminate the second evaluation of b*c from this code if none of the intervening
statements has changed its value. The code can be rewritten as given below.
T1 = b * c;
a = T1;
.
.
.
d = T1 + x – y;
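The elimination step can be sketched as a small local pass over three-address statements. This is an illustrative simplification (it tracks only syntactically identical expressions, and an assignment to a variable kills every expression that used it), not a complete algorithm.

```python
# Local common-sub-expression elimination over statements of the
# form (dest, op, arg1, arg2).
def eliminate_cse(stmts):
    available = {}   # (op, arg1, arg2) -> variable holding its value
    out = []
    for dest, op, a1, a2 in stmts:
        key = (op, a1, a2)
        if key in available:
            # reuse the earlier value instead of recomputing
            out.append((dest, "=", available[key], None))
        else:
            out.append((dest, op, a1, a2))
        # reassigning dest invalidates expressions that use dest,
        # or whose cached value lived in dest
        available = {k: v for k, v in available.items()
                     if dest not in (k[1], k[2]) and v != dest}
        if key not in available and dest not in (a1, a2):
            available[key] = dest
    return out

code = [("a", "*", "b", "c"),
        ("d", "*", "b", "c")]        # the second b*c is redundant
print(eliminate_cse(code))
# [('a', '*', 'b', 'c'), ('d', '=', 'a', None)]
```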
• We can improve the execution efficiency of a program by shifting execution-time actions
to compile time.
• We can replace an expression by a single value (known as folding).
Example:
A = 2 * (22.0/7.0) * r
The constant sub-expression 2 * (22.0/7.0) can be folded into 6.28571 at compile time.
Example:
x = 12.4
Example:
c = a * b;
x = a;
.
.
.
d = x * b;
• If the value contained in a variable at that point is not used anywhere in the program
subsequently, the variable is said to be dead at that place.
• If an assignment is made to a dead variable, then that assignment is a dead assignment
and it can be safely removed from the program.
• A piece of code is said to be dead if it computes values that are never used anywhere in
the program.
• Dead Code can be eliminated safely.
• Variable propagation often turns an assignment statement into dead code.
Example:
c = a * b;
x = a;
.
.
.
d = x * b + 4;
c = a * b;
x = a;
.
.
.
d = a * b + 4;
• We aim to improve the execution time of the program by reducing the evaluation
frequency of expressions.
• Evaluation of expressions is moved from one part of the program to another in such a
way that it is evaluated lesser frequently.
• Loops are usually executed several times.
• We can bring the loop-invariant statements out of the loop.
Example:
a = 200;
while (a > 0)
{
b = x + y;
if ( a%b == 0)
printf (“%d”, a);
}
The statement b = x + y is executed on every iteration of the loop. But because it is loop-
invariant, we can bring it outside the loop; it will then be executed only once.
a = 200;
b = x + y;
while (a > 0)
{
if ( a%b == 0)
printf (“%d”, a);
}
• An induction variable may be defined as an integer scalar variable which is used in a loop
in assignments of the form i = i + constant.
• Strength reduction means replacing a high-strength operator by a low-strength
operator.
• Strength reduction is used on induction variables to achieve more efficient code.
Example:
i = 1;
while (i < 10)
{
.
.
.
y = i * 4;
.
.
.
i = i + 1;
}
After strength reduction, the multiplication i * 4 is replaced by repeated addition:
i = 1;
t = 4;
while (t < 40)
{
.
.
.
y = t;
.
.
.
t = t + 4;
.
.
.
}
• Certain computations that look different to the compiler and are not identified as common
sub-expressions may actually be the same.
• An expression B op C will usually be treated as being different from C op B.
• However, for certain operations (like addition and multiplication), they will produce the
same result.
• We can achieve further optimization by treating them as common sub-expressions for
such operations.
• A Basic Block is defined as a sequence of consecutive statements with only one entry (at
the beginning) and one exit (at the end).
• When a Basic Block of a program is entered, all of its statements are executed in
sequence without a halt or possibility of branch, except at the end.
• In order to determine all the Basic Blocks in a program, we need to identify the leaders,
the first statement of each Basic Block.
• Any statement that satisfies one of the following conditions is a leader:
o The first statement of the program is a leader.
o Any statement which is the target of a goto (jump) is a leader.
o Any statement that immediately follows a goto (jump) is a leader.
• A flow graph is a directed graph that is used to portray the basic blocks and their successor relationships.
• The nodes of a flow graph are the basic blocks.
• The basic block whose leader is the first statement is known as the initial block.
• There is a directed edge from block B1 to B2 if B2 could immediately follow B1 during
execution.
• To determine whether there should be a directed edge from B1 to B2, the following criteria
are applied:
o There is a jump from the last statement of B1 to the first statement of B2, OR
o B2 immediately follows B1 in program order and B1 does not end in an
unconditional jump.
• B1 is known as the predecessor of B2 and B2 is a successor of B1.
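The leader rules and the edge criteria above can be sketched directly in Python. The statement forms ('goto', target) and ('if', cond, target) are illustrative stand-ins for real three-address jumps, and statements are indexed from 0 rather than 1.

```python
# Leaders, basic blocks and flow-graph edges for a numbered list of
# statements; ('goto', j) and ('if', cond, j) jump to statement j.
def leaders(stmts):
    L = {0}                                    # rule 1: first statement
    for i, s in enumerate(stmts):
        if s[0] in ("goto", "if"):
            L.add(s[-1])                       # rule 2: jump target
            if i + 1 < len(stmts):
                L.add(i + 1)                   # rule 3: after a jump
    return sorted(L)

def basic_blocks(stmts):
    L = leaders(stmts) + [len(stmts)]
    return [(L[i], L[i + 1]) for i in range(len(L) - 1)]   # [start, end)

def flow_edges(stmts):
    blocks = basic_blocks(stmts)
    start_of = {s: b for b, (s, e) in enumerate(blocks)}
    edges = set()
    for b, (s, e) in enumerate(blocks):
        last = stmts[e - 1]
        if last[0] in ("goto", "if"):
            edges.add((b, start_of[last[-1]]))         # jump edge
        if last[0] != "goto" and e < len(stmts):
            edges.add((b, b + 1))                      # fall-through
    return sorted(edges)

prog = [("assign",), ("if", "x", 3), ("goto", 0), ("assign",)]
print(basic_blocks(prog))   # [(0, 2), (2, 3), (3, 4)]
print(flow_edges(prog))     # [(0, 1), (0, 2), (1, 0)]
```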
11.2.3 Loops
• We need to identify all the loops in a flow graph to carry out many optimizations
discussed earlier.
• A loop is a collection of nodes that
o is strongly connected i.e. from any node in the loop to any other, there is a path
of length one or more wholly within the loop, and
o has a unique entry, a node in the loop such that the only way to reach a node in
the loop from a node outside the loop is to first go through the entry.
• We assume there are initially no nodes and NODE ( ) is undefined for all arguments.
• Each 3-address statement has one of three forms:
(i) A = B op C
(ii) A = op B
(iii) A = B
• We shall do the following steps (1) through (3) for each 3-address statement of the basic
block:
(1) If NODE (B) is undefined, create a leaf labeled B, and let NODE (B) be this node. In case
(i), if NODE (C) is undefined, create a leaf labeled C and let that leaf be NODE (C);
(2) In case (i), determine whether there is a node labeled op whose left child is NODE (B) and
whose right child is NODE (C). (This is to catch common sub-expressions.) If not, create
such a node. In case (ii), determine whether there is a node labeled op whose lone child
is NODE (B); if not, create such a node. In case (iii), the node is simply NODE (B).
(3) Append A to the list of attached identifiers for the node found or created in step (2), and
set NODE (A) to that node.
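For case (i) statements A = B op C, the node-sharing step can be sketched as follows; the node representation and all names are illustrative.

```python
# DAG construction for A = B op C; a node is shared when an
# identical (op, left, right) triple already exists.
nodes = []            # each node: (op_or_name, left, right)
NODE = {}             # identifier -> node index

def leaf(name):
    # step (1): create a leaf only if NODE(name) is undefined
    if name not in NODE:
        nodes.append((name, None, None))
        NODE[name] = len(nodes) - 1
    return NODE[name]

def dag_statement(A, B, op, C):
    l, r = leaf(B), leaf(C)
    key = (op, l, r)
    for i, n in enumerate(nodes):      # step (2): reuse if present
        if n == key:
            NODE[A] = i                # step (3): attach A to it
            return i
    nodes.append(key)                  # otherwise create the node
    NODE[A] = len(nodes) - 1
    return NODE[A]

n1 = dag_statement("a", "b", "*", "c")
n2 = dag_statement("d", "b", "*", "c")   # common sub-expression
print(n1 == n2)    # True: both names point at the same * node
```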
• Certain optimizations can be achieved by examining the entire program, not just a
portion of it.
• Use-definition (ud-) chaining is one particular problem of this kind.
• Here we try to find out which definition of a variable is applicable at a statement
using the value of that variable.
SHEET#1
1. What is the difference between top-down and bottom-up parsing? Demonstrate with the
help of an example.
2. What are the necessary properties in a grammar so that it can be parsed in a top-down
manner?
3. Determine whether the following grammar can be parsed by a top-down parser or not. In
case it cannot be top-down parsed, make necessary transformations to that effect.
E → E+T / T
T → T*F / F
F → (E) / id
4. Show all steps in parsing the string w = cad with the given grammar in a top-down
(with backtracking) manner:
S → cAd
A → ab / a
5. Calculate FIRST and FOLLOW for the grammar (after transformation, if any) in question
3 above.
6. Compute the LL(1) parsing table for the grammar (after transformation, if any) in question
3 above. Determine whether this grammar is LL (1) or not.
7. Consider the grammar below and determine whether it is an operator grammar or not:
E → E + E / E * E / id
8. For the grammar in question 7 above, compute the operator precedence table using the
associativity and precedence properties.
9. For the grammar in question 3 compute the operator precedence relation without using
associativity and precedence properties. Determine whether the grammar is operator
precedence or not.
10. Show the steps of parsing the string w = id + id * id using the table from question 8 above.
11. Show the steps of parsing the string w = id + id * id using the table from question 9 above.
12. Consider the following grammar
S → AS / b
A → SA / a
a. List all the LR (0) items for the above grammar.
b. Construct an NFA whose states are LR (0) items.
13. For grammar in question 12 above, determine if the grammar is SLR. If so, construct its
SLR table.
14. For the grammar in question 12 above, list all the LR (1) items and construct an NFA
whose states are these LR (1) items.
1. For the input expression (4*7+1)*2, construct a parse tree with translations.
2. Construct the parse tree and the syntax tree for the expression ((a)+(b)).
3. Translate the arithmetic expression a*-( b+c) into
a) syntax tree
b) postfix notation
c) three-address code
4. Translate the expression -( a+b) * (c+d) +( a+b+c) into
a) quadruples
b) triples
c) Indirect triples
5. Translate the executable statements of the following C program
main()
{
int i ;
int a[10];
i = 1;
while (i<=10) {
a[i] = 0; i = i+1;
}
}
into
a) a syntax tree
b) postfix notation
c) three-address code.
6. A translation model may translate E → id1 < id2 into the pair of statements:
If id1 < id2 goto……
goto…….
We could instead translate it into the single statement
If id1 >= id2 goto_
and fall through the code when E is true. Devise a translation model to generate code of
this nature.
this nature.
10. Using control-flow translation of Boolean expressions obtain the code of the following
expression
a < b or c < d and e < f
1. What are the attributes that shall be stored in the symbol table?
2. Describe the different data structures for symbol table implementation and compare
them.
3. Describe and illustrate the use of symbol table for each phase of compiler construction
with the help of suitable example.
4. Define an activation record. Write down the structure of a typical activation record.
5. Consider the program fragment given below:
program main(input, output);
procedure p(x, y, z);
begin
y:=y+1;
z:=z + x;
end;
begin
a:=2;
b:=3;
p (a+b, a, a);
print a
end
What will be printed by the program assuming Call-by-Value?
6. What will be printed by the program in question 5 above assuming Call-by-Reference?
7. What will be printed by the program in question 5 above assuming Call-by-Name?
8. Consider the following program fragment:
program main
var y: Real;
procedure compute()
var x : Integer;
procedure initialize()
var x: Real;
begin {initialize}
...
end {initialize}
procedure transform()
var z: Real;
begin {transform}
...
end {transform}
begin {compute}
1. Consider the following program which finds the sum and the largest number:
main( )
{
int i, x [10], sum =0, large;
for (i = 0; i < 10; i++)
{
scanf (“%d”, &x [i]);
sum+ = x [i];
}
large = x [0];
for (i = 0; i < 10; i++)
{
if (x [i] > large)
large = x [i];
}
printf (“ %d %d”, sum, large);
}
(1) T1 = 4*I
(2) T2 = address (A) - 4
(3) T3 = T2 [T1]
(4) T4 = address (B) – 4
(5) T5 = T4 [T1]
(6) T6 = T3 * T5
(7) PROD = PROD + T6
(8) I = I + 1
(9) if I < 20 goto (3)
3. What is a DAG? Construct a DAG for the three-address code above.
What are the applications of a DAG? Demonstrate these applications with the
constructed DAG.
2006 - 2007
FIRST SESSIONAL TEST (September 2006) - SOLVED
Please read the following instructions carefully before attempting this Question Paper.
1. This paper contains three questions. Each question carries 10 marks.
2. All questions are compulsory. However, some questions have internal choices.
3. Clearly show all the steps wherever applicable.
4. The references used in the questions are given below.
Grammar G1:
stat → if cond then stat
stat → if cond then stat else stat
stat → other-stat
Grammar G2:
E → E + T / T
T → T * F / F
F → ( E ) / id
Grammar G3:
S → aBa
B → bB / ε
Grammar G4:
S → S ( S )
S → ε
        Action              Goto
state   (     )     $       S
0       r2    r2    r2      1
1       s2          acc
2       r2    r2    r2      3
3       s2    s4
4       r1    r1    r1
c) The LR parse table construction method that can be used for the largest number of
grammars is Canonical LR (CLR).
a) Consider the Regular Expression (RE): (a+b)*abb. Construct an NFA for this RE.
Then convert the NFA into DFA. The final DFA should have minimum possible
number of states.
(1+3+1)
The NFA and its transition table are as follows:

(Figure: NFA with states 1–4; state 1 loops on a and b, then 1 –a→ 2 –b→ 3 –b→ 4.)

State   a       b
1       {1,2}   {1}
2       -       {3}
3       -       {4}
4       -       -

To convert this NFA into a DFA, we construct the subsets of NFA states reachable
on each input:

Subset        a       b
∅             ∅       ∅
{1} (start)   {1,2}   {1}
{2}           ∅       {3}
{3}           ∅       {4}

Removing the subsets that are unreachable from the start state, we obtain the
minimized DFA:

Subset          a       b
{1} (start)     {1,2}   {1}
{1,2}           {1,2}   {1,3}
{1,3}           {1,2}   {1,4}
{1,4} (final)   {1,2}   {1}

(Figure: the DFA; its transitions are exactly those of the table above.)
b) Consider the grammar G1. Show that this grammar is ambiguous. Transform this
grammar into an unambiguous one.
(2+3)
Stat → Bal-Stat
Stat → Unbal-Stat
Bal-Stat → if cond then Bal-Stat else Bal-Stat
Bal-Stat → other-stat
Unbal-Stat → if cond then Stat
Unbal-Stat → if cond then Bal-Stat else Unbal-Stat
c) Consider the grammar G2. Show that this grammar is left recursive. Transform
this grammar into a non-left recursive one.
(1+4)
Consider the first two productions of the given grammar. They are
of the form A → Aα / β, generating the sentential forms
β, βα, βαα, βααα, ...
These productions are therefore left recursive, and left-recursive
productions can cause recursive-descent parsers to loop forever.
The left recursion is removed by the transformation:
A → βA’
A’ → αA’ / ε
Applying this transformation to G2 gives:
E → TE'
E' → +TE' / ε
T → FT'
T' → *FT' / ε
F → ( E ) / id
a) Consider the grammar G3. Construct the LL(1) parsing table for this grammar.
Draw conclusion on whether this grammar is LL(1) or not.
(4+1)
Grammar G3:
S → aBa
B → bB / ε
        a           b           $
S       S → aBa
B       B → ε       B → bB

Since no cell of the table contains more than one production, grammar G3
is LL(1).
b) Consider grammar G2. Construct the operator-precedence table for this grammar
without using rules of precedence and associativity.
(5)
        +      *      (      )      id     $
+       >      <      <      >      <      >
*       >      >      <      >      <      >
(       <      <      <      =      <
)       >      >             >             >
id      >      >             >             >
$       <      <      <             <
c) Give the steps in LR parsing for the grammar G4 and the input ( )( ).
(5)
For the grammar G4:
(1) S → S ( S )
(2) S → ε
Please read the following instructions carefully before attempting this Question Paper.
1. This paper contains three questions.
2. All questions are compulsory. However, Question 3 has internal choice.
3. Clearly show all the steps wherever applicable.
4. The references used in the questions are given below.
Grammar G1:
1. S → CC
2. C → cC
3. C → d
Grammar G2:
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id
Grammar G3:
1. S → E$
2. E → E op E
3. E → ( E )
4. E → I
5. I → I digit
6. I → digit
2. Consider grammar G1. Clearly show the sets of LR (1) items that correspond to the
states of its Canonical LR parsing table.
(5)
First we augment the grammar by adding the production S' → S to it. The
sets of LR(1) items are then:
I0: S' → .S, $    S → .CC, $    C → .cC, c/d    C → .d, c/d
I1: S' → S., $
I2: S → C.C, $    C → .cC, $    C → .d, $
I3: C → c.C, c/d  C → .cC, c/d  C → .d, c/d
I4: C → d., c/d
I5: S → CC., $
I6: C → c.C, $    C → .cC, $    C → .d, $
I7: C → d., $
I8: C → cC., c/d
I9: C → cC., $
a) Consider the grammar G2. Construct an SLR parsing table for this grammar. Clearly
show the sets of LR (0) items, the GOTO function (by means of a Finite Automata)
and the final table.
(2+1+2)
For grammar G2, the sets of LR(0) items are as follows:

I0: E' → .E,  E → .E+T,  E → .T,  T → .T*F,  T → .F,  F → .(E),  F → .id
I1: E' → E.,  E → E.+T
I2: E → T.,  T → T.*F
I3: T → F.
I4: F → (.E),  E → .E+T,  E → .T,  T → .T*F,  T → .F,  F → .(E),  F → .id
I5: F → id.
I6: E → E+.T,  T → .T*F,  T → .F,  F → .(E),  F → .id
I7: T → T*.F,  F → .(E),  F → .id
I8: F → (E.),  E → E.+T
I9: E → E+T.,  T → T.*F
I10: T → T*F.
I11: F → (E).

[Figure: GOTO graph of these sets — GOTO(I0,E)=I1, GOTO(I0,T)=I2,
GOTO(I0,F)=I3, GOTO(I0,()=I4, GOTO(I0,id)=I5, GOTO(I1,+)=I6,
GOTO(I2,*)=I7, GOTO(I4,E)=I8, GOTO(I4,T)=I2, GOTO(I4,F)=I3,
GOTO(I4,()=I4, GOTO(I4,id)=I5, GOTO(I6,T)=I9, GOTO(I6,F)=I3,
GOTO(I6,()=I4, GOTO(I6,id)=I5, GOTO(I7,F)=I10, GOTO(I7,()=I4,
GOTO(I7,id)=I5, GOTO(I8,))=I11, GOTO(I8,+)=I6, GOTO(I9,*)=I7.]

The final SLR parsing table (sN = shift to state N, rN = reduce by
production N of G2):

State   id     +      *      (      )      $      E    T    F
0       s5                   s4                   1    2    3
1              s6                          acc
2              r2     s7            r2     r2
3              r4     r4            r4     r4
4       s5                   s4                   8    2    3
5              r6     r6            r6     r6
6       s5                   s4                        9    3
7       s5                   s4                             10
8              s6                   s11
9              r1     s7            r1     r1
10             r3     r3            r3     r3
11             r5     r5            r5     r5
b) Refer to grammar G1 and its sets of LR (1) items in Question 2 above. Determine
whether this grammar is LALR (1) or not. Give reasons for your conclusion.
(5)
We first need to identify the sets of LR(1) items with common cores. Such
sets are:
I3 and I6
I4 and I7
I8 and I9
The merged sets of LR(1) items now correspond to the states of the LALR
table. We now construct the LALR table for these states.
State    c      d      $      S    C
0        s36    s47           1    2
1                      acc
2        s36    s47                5
36       s36    s47                89
47       r3     r3     r3
5                      r1
89       r2     r2     r2

Since merging the sets with common cores introduces no shift-reduce or
reduce-reduce conflicts, grammar G1 is LALR(1).
c) Refer to grammar G3. Write down a Syntax-directed translation scheme for this
grammar and its implementation for a desk calculator.
(2+3)
Please read the following instructions carefully before attempting this Question Paper.
1. This paper contains five questions. Each question carries 20 marks.
2. All questions are compulsory. However, all questions have internal choices.
3. Clearly show all the steps wherever applicable.
QUESTIONS
a). What modification must be made in a grammar to make it suitable for Predictive
parsing? Examine the following grammar and indicate whether it is suitable for
predictive parsing or not. In case it is not suitable, make the necessary modifications
to make it suitable. Take common assumptions as required.
E → E + E / E * E / id
b). Construct the sets of LR (1) items required for the construction of CLR table for the
following grammar. Make necessary computations to determine whether the
grammar is LALR (1) or not.
S → AS / b
A → SA / a
c). Discuss the operator-precedence parsing algorithm. Consider the following operator
grammar and precedence functions, and explain the parsing of the string:
id + id * id
Grammar: E → E + E / E * E / id
Precedence functions:
        +     *     id    $
f       4     2     4     0
g       3     1     5     0
int i;
i = 1;
while a < 10 do
    if x > y then a = x + y;
    else a = x - y;
b). What is an intermediate code? Discuss the advantages of writing intermediate code.
For the following expression, write down the syntax tree, postfix notation, quadruples,
triples and indirect triples.
-B*(C+D)
c). Consider the following grammar for array references. Give syntax directed translation
scheme to generate three address codes for addressing array elements. Translate
the statement X = A [i,j] where upper bounds are 10 and 20 respectively.
L → Elist ] / id
Elist → Elist, L / id [ L
a). What do you understand by Lexical, Syntactic and Semantic phase errors? Explain
with the help of examples. Also suggest methods for error recovery for each of them.
b). What is the need to maintain a symbol table for compilation? What are the different
data structures suitable for storing symbol table? In the program fragment of
Question 3(a) above, indicate what information would be added to the symbol table
after each of the lexical, syntactic and semantic phases.
c). Explain activation record and display structure. Show the activation records and
display structure just after the procedure called at lines marked X and Y have started
their execution. Be sure to indicate which of the two procedures named A you are
referring to.
Program Test;
    Procedure A;
        Procedure B;
            Procedure A;
                -------
            end A;
        begin
            Y: A;
        end B;
    begin
        B;
    end A;
begin
    X: A;
end Test;
main( )
{
    int i, x[10], sum = 0, large;
    for (i = 0; i < 10; i++)
    {
        printf("%d", x[i]);
        sum += x[i];
    }
    large = x[10];
    for (i = 0; i < 10; i++)
    {
        if (x[i] > large)
            large = x[i];
    }
    printf(" %d %d", sum, large);
}
(1) T1 = 4*I
(2) T2 = address (A) - 4
(3) T3 = T2 [T1]
(4) T4 = address (B) - 4
(5) T5 = T4 [T1]
(6) T6 = T3 * T5
(7) PROD = PROD + T6
(8) I = I + 1
(9) if I < 20 goto (1)
(i) Determine whether this could be a single basic block or not. If not divide it into
basic blocks.
(ii) What are loop-invariant computations? Move loop-invariant computations out
of the basic blocks.
(iii) Find induction variables and eliminate them where possible.
(iv) What is the concept of Reduction in Strength? Indicate this in the above code
if this is happening.
c). What is a DAG? Construct a DAG for the three-address code in Question 5(b) above.
What are the applications of a DAG? Demonstrate these applications with the
constructed DAG.
Read the following instructions carefully before attempting this Question Paper.
1. This paper contains four questions. Each question carries 10 marks. Attempt any three
questions.
2. Clearly show all the steps wherever applicable. Failing to do so would result in loss of
marks.
3. The references used in the questions are given below.
Grammar G1:
stat → if cond then stat
stat → if cond then stat else stat
stat → other-stat
Grammar G2:
S → S ( S )
S → ε
Grammar G3:
S → aBa
B → bB / ε
Question-1
(a) Explain all the six phases of a compiler in 1-2 sentences each.
(6)
Lexical Analyzer:
The Lexical Analyzer reads the source program character by character and returns
the tokens of the source program. A token describes a pattern of characters having
the same meaning in the source program (such as identifiers, operators, keywords,
numbers, delimiters and so on).
Syntax Analyzer:
A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the
given program.
Semantic Analyzer:
A semantic analyzer checks the source program for semantic errors and collects
the type information for the code generation.
Intermediate Code Generator:
It produces an intermediate representation of the source program (such as
three-address code) from the syntactic structure.
Code Optimizer:
The code optimizer improves the code produced by the intermediate code
generator in terms of time and space.
Code Generator:
It produces the target language code for a specific architecture. The target
program is normally a relocatable object file containing the machine codes.
(b) Which phase is completely optional and does not affect the correctness of the
target code, but only its efficiency?
(1)
Code Optimization Phase
(c) What is the least number of passes required to build a compiler with all its phases?
(1)
ONE
(d) What is the advantage of building a compiler with fewer passes?
(1)
It tends to be more efficient because the source program is traversed fewer times.
Question-2
(a) Write a Regular Expression for an identifier of C language.
(1)
(A+B+...+Z+_+a+b+...+z) (A+B+...+Z+_+a+b+...+z+0+1+...+9)*
[Figure: NFA transition diagram — S0 moves to S1 on A/U (a letter or an
underscore).]
Here {S0} is the start state and {S1} is the final state.
State {S0, S1} is unreachable from start state.
States {S0}, {S1}, and { } can be renamed as Q0, Q1 and Q2, respectively. The DFA
can be constructed as follows:
[Figure: DFA transition diagram, where A = letter, U = underscore, D = digit:

State          A/U    D
Q0 (start)     Q1     Q2
Q1 (final)     Q1     Q1
Q2 (dead)      Q2     Q2]
(d) Based on this DFA, write an algorithm or a program in C language that determines
whether the input string is a valid C identifier or not.
(4)
The C program for the above DFA is as follows:
main ( )
{
unsigned short int state = 0, count = 0;
Question-3
(a) Prove that the grammar G1 is ambiguous.
(4)
Consider the string:
if cond then if cond then other-stat else other-stat
There are two left-most derivations possible for this string as shown below:
S → ( S ) S | ε
(c) Using above results, compute the LL(1) parsing table for grammar G3.
(3)
For S → aBa, FIRST(aBa) = {a} ⇒ M(S, a) = S → aBa
        a           b           $
S       S → aBa
B       B → ε       B → bB
Output = S
Step-1
We consider S (on stack) and a (on input buffer) ⇒ S is a non-terminal ⇒
expand it using M(S, a) = S → aBa
Stack = aBa$
Output = S → aBa
Step-2
We consider a (on stack) and a (on input buffer) ⇒ both are the same
terminal ⇒ match them, pop the stack and advance the input
Stack = Ba$
Output = S → aBa
Step-3
We consider B (on stack) and b (on input buffer) ⇒ B is a non-terminal ⇒
expand it using M(B, b) = B → bB
Stack = bBa$
Input Buffer = a$
The terminal on top of the stack and the input symbol are different
terminals ⇒ ERROR
Therefore, this parsing has resulted in ERROR. This means that the given
string is not accepted by the given grammar.
-- END --
S. No.    Objective
03        Implementation of LR Parser
Note: Please refer to the Laboratory Manual of TCS-552 for further information on the above.