Unit 2 - Compiler Design - WWW - Rgpvnotes.in
Subject Name: Compiler Design
Subject Code: CS-603
Semester: 6th
Downloaded from www.rgpvnotes.in
Bottom-up Approach −
Starts from tree leaves
Proceeds upward to the root which is the starting symbol S
Leftmost and Rightmost Derivation of a String
Leftmost derivation − A leftmost derivation is obtained by applying a production to the leftmost variable in each step.
Rightmost derivation − A rightmost derivation is obtained by applying a production to the rightmost variable in each step.
Example
Let a set of production rules in a CFG be
X → X+X | X*X | X | a
over the terminal alphabet {a, +, *}.
The leftmost derivation for the string "a+a*a" may be −
X ⇒ X+X ⇒ a+X ⇒ a+X*X ⇒ a+a*X ⇒ a+a*a
(Figure: stepwise leftmost derivation of the above string.)
The rightmost derivation for the above string "a+a*a" may be −
X ⇒ X*X ⇒ X*a ⇒ X+X*a ⇒ X+a*a ⇒ a+a*a
(Figure: stepwise rightmost derivation of the above string.)
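The two derivations above can be replayed mechanically. The following is a minimal Python sketch, not part of the original notes: the helper name derive and the encoding of each step as the production body to apply are assumptions made for illustration.

```python
# Grammar from the example: X -> X+X | X*X | X | a (X is the only variable).
VARIABLE = "X"

def derive(start, bodies, leftmost=True):
    """Apply each production body to the leftmost (or rightmost) variable."""
    form = start
    steps = [form]
    for body in bodies:
        # Locate the variable that this kind of derivation must expand.
        pos = form.find(VARIABLE) if leftmost else form.rfind(VARIABLE)
        if pos < 0:
            raise ValueError("no variable left to expand")
        form = form[:pos] + body + form[pos + 1:]
        steps.append(form)
    return steps

# Production bodies in the order used by the two derivations in the text.
left = derive("X", ["X+X", "a", "X*X", "a", "a"], leftmost=True)
right = derive("X", ["X*X", "a", "X+X", "a", "a"], leftmost=False)

print(" => ".join(left))
print(" => ".join(right))
```

Both runs end in the same string a+a*a. They differ only in which variable is expanded at each step.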
2. Parsing
Syntax analyzers follow production rules defined by means of context-free grammar. The way the
production rules are implemented (derivation) divides parsing into two types: top-down parsing and
bottom-up parsing.
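Top-down parsing starts from the start symbol and works down to the leaves, whereas bottom-up parsing starts from the leaves and works up. Early parser generators such as YACC create bottom-up parsers, whereas many Java parser generators such as JavaCC create top-down parsers.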
Where S is the goal or start symbol. Figure 6-1 illustrates the working of this brute-force parsing technique
by showing the sequence of syntax trees generated during the parse of the string ‘accd’.
Step 2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second symbol of w,
‘a’, and consider the next leaf, ‘A’. Expand A using its first alternative.
Step 3:
The second symbol ‘a’ of w also matches the second leaf of the tree, so advance the input pointer to the third
symbol of w, ‘d’. But the third leaf of the tree is ‘b’, which does not match the input symbol ‘d’.
Hence discard the chosen production and reset the input pointer to the second position. This is called backtracking.
Step 4:
Now try the second alternative for A. Its leaf ‘a’ matches the second input symbol and the last leaf ‘d’ matches the third.
Now we can halt and announce the successful completion of parsing.
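The backtracking steps above can be sketched in Python. The grammar S → cAd, A → ab | a and the input w = cad are not printed in this part of the notes; they are assumptions reconstructed from the leaves and symbols the steps mention. The function names parse and accepts are likewise illustrative.

```python
# Assumed grammar (reconstructed from the steps): S -> c A d ; A -> a b | a
GRAMMAR = {
    "S": [["c", "A", "d"]],
    "A": [["a", "b"], ["a"]],
}

def parse(symbol, text, pos):
    """Yield every input position reachable after matching `symbol` at `pos`."""
    if symbol not in GRAMMAR:                  # terminal: must equal the input symbol
        if pos < len(text) and text[pos] == symbol:
            yield pos + 1
        return
    for body in GRAMMAR[symbol]:               # try the alternatives in order
        positions = [pos]
        for sym in body:
            positions = [q for p in positions for q in parse(sym, text, p)]
        # An empty list means this alternative failed: fall through to the
        # next alternative with the input pointer reset to `pos` (backtracking).
        yield from positions

def accepts(text):
    """True if the whole input can be derived from S."""
    return any(end == len(text) for end in parse("S", text, 0))

print(accepts("cad"))    # needs the second alternative of A
print(accepts("cabd"))   # succeeds with the first alternative of A
```

For the input cad, the first alternative A → ab consumes a and then fails on b, so the parser backtracks and succeeds with A → a, as in Steps 3 and 4.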
Example for recursive descent parsing:
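(Figure: model of a table-driven predictive parser, showing the input buffer a + b $, a stack holding X Y Z $, the predictive parsing program, the predictive table M, and the output.)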
The construction of a predictive parser is aided by two functions associated with a grammar G:
1. FIRST
2. FOLLOW
Rules for FIRST( ):
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → aα is a production, where a is a terminal, then add a to FIRST(X).
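The rules above can be applied repeatedly until no FIRST set changes. The Python sketch below does this for the non-left-recursive expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id, which is chosen here as an assumption for illustration. The loop also handles productions whose body begins with a non-terminal, a case the three rules above do not spell out.

```python
EPS = "ε"

# Assumed example grammar: non-terminal -> list of production bodies.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_sets(grammar):
    """Compute FIRST for every non-terminal by iterating to a fixed point."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                for sym in body:
                    if sym == EPS:                 # rule 2: X -> ε
                        first[head].add(EPS)
                        break
                    if sym not in grammar:         # rule 3: body starts with a terminal
                        first[head].add(sym)
                        break
                    # Body starts with a non-terminal: copy its FIRST set (minus ε)
                    # and continue only if that non-terminal can derive ε.
                    first[head] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        break
                else:
                    first[head].add(EPS)           # every symbol in the body can vanish
                changed |= len(first[head]) != before
    return first

for nt, symbols in first_sets(GRAMMAR).items():
    print(nt, sorted(symbols))
```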
Example:
Consider the following grammar:
E → E+T | T
T → T*F | F
F → (E) | id
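Eliminating left recursion from this grammar gives E → TE’, E’ → +TE’ | ε, T → FT’, T’ → *FT’ | ε, F → (E) | id, whose predictive parsing table is:
Non-terminal   id          +             *             (           )          $
E              E → TE’                                 E → TE’
E’                         E’ → +TE’                               E’ → ε     E’ → ε
T              T → FT’                                 T → FT’
T’                         T’ → ε        T’ → *FT’                 T’ → ε     T’ → ε
F              F → id                                  F → (E)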
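A table like the one above drives a stack-based parser directly. The following Python sketch assumes the standard predictive table for the grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id, including the T' → *FT' entry under *. The names M and predictive_parse are illustrative and not taken from the notes.

```python
# Predictive table M[non-terminal][lookahead] -> production body ([] means ε).
M = {
    "E":  {"id": ["T", "E'"], "(": ["T", "E'"]},
    "E'": {"+": ["+", "T", "E'"], ")": [], "$": []},
    "T":  {"id": ["F", "T'"], "(": ["F", "T'"]},
    "T'": {"+": [], "*": ["*", "F", "T'"], ")": [], "$": []},
    "F":  {"id": ["id"], "(": ["(", "E", ")"]},
}

def predictive_parse(tokens):
    """Return the productions used, in order, or raise SyntaxError."""
    tokens = tokens + ["$"]          # end marker on the input
    stack = ["$", "E"]               # end marker below the start symbol
    i, output = 0, []
    while stack:
        top, look = stack.pop(), tokens[i]
        if top in M:                                 # non-terminal: consult the table
            body = M[top].get(look)
            if body is None:
                raise SyntaxError(f"no entry M[{top}, {look}]")
            output.append(f"{top} -> {' '.join(body) or 'ε'}")
            stack.extend(reversed(body))             # leftmost symbol ends up on top
        elif top == look:                            # terminal (or $): match and advance
            i += 1
        else:
            raise SyntaxError(f"expected {top}, found {look}")
    return output

for production in predictive_parse(["id", "+", "id", "*", "id"]):
    print(production)
```

The productions are printed in the order of a leftmost derivation, which is what a top-down parser constructs.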
Shift-reduce parsing uses two steps for bottom-up parsing, known as the shift step and the reduce step.
Shift step: The shift step advances the input pointer to the next input symbol, which is called the shifted symbol. This symbol is pushed onto the stack. The shifted symbol is treated as a single node of the parse tree.
Reduce step: When the parser finds the complete right-hand side (RHS) of a grammar rule on the stack and replaces it with the rule's left-hand side (LHS), it is known as a reduce step. This occurs when the top of the stack contains a handle. To reduce, a pop operation removes the handle from the stack and the LHS non-terminal symbol is pushed in its place.
Example:
Consider the grammar:
S → aABe
A → Abc | b
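B → d
The sentence to be recognized is abbcde. It is reduced to the start symbol through the forms abbcde, aAbcde (by A → b), aAde (by A → Abc), aABe (by B → d) and S (by S → aABe).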
Relation Meaning
a <· b a yields precedence to b
a =· b a has the same precedence as b
a ·> b a takes precedence over b
These operator precedence relations make it possible to delimit the handles in the right sentential forms: <· marks the left end, =· appears in the interior of the handle, and ·> marks the right end.
Example: The input string:
id1 + id2 * id3
After inserting precedence relations becomes
$ <· id1 ·> + <· id2 ·> * <· id3 ·> $
Having the precedence relations allows us to identify handles as follows:
Scan the string from the left until the first ·> is seen.
Then scan backwards (from right to left) until a <· is seen.
Everything between the two relations <· and ·> forms the handle.
      id    +     *     $
id          ·>    ·>    ·>
+     <·    ·>    <·    ·>
*     <·    ·>    ·>    ·>
$     <·    <·    <·    ·>
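The handle-finding procedure can be sketched as a small operator-precedence parser in Python. The code below is an illustration under stated assumptions: "<" and ">" stand for <· and ·>, every reduced handle is replaced by a placeholder non-terminal E, and the ($, $) case is treated as the accept test rather than looked up in the table. The names PREC, topmost_terminal and reduce_order are hypothetical.

```python
# Precedence relations between terminals, taken from the table above.
PREC = {
    "id": {"+": ">", "*": ">", "$": ">"},
    "+":  {"id": "<", "+": ">", "*": "<", "$": ">"},
    "*":  {"id": "<", "+": ">", "*": ">", "$": ">"},
    "$":  {"id": "<", "+": "<", "*": "<"},
}

def topmost_terminal(stack):
    """Relations are defined between terminals only, so skip the placeholder E."""
    return next(sym for sym in reversed(stack) if sym != "E")

def reduce_order(tokens):
    """Return the handles in the order in which they are reduced."""
    stack, tokens = ["$"], tokens + ["$"]
    i, handles = 0, []
    while True:
        top, look = topmost_terminal(stack), tokens[i]
        if top == "$" and look == "$":
            return handles                        # accept
        relation = PREC[top].get(look)
        if relation == "<":                       # shift
            stack.append(look)
            i += 1
        elif relation == ">":                     # right end of a handle found
            handle = []
            while True:                           # scan back to the matching <·
                sym = stack.pop()
                handle.insert(0, sym)
                if sym != "E" and PREC[topmost_terminal(stack)].get(sym) == "<":
                    break
            if stack[-1] == "E":                  # left operand belongs to the handle
                handle.insert(0, stack.pop())
            handles.append(" ".join(handle))
            stack.append("E")
        else:
            raise SyntaxError(f"no precedence relation between {top} and {look}")

print(reduce_order(["id", "+", "id", "*", "id"]))
```

For id + id * id the multiplication is reduced before the addition, because + <· * keeps the + on the stack until E * E has been reduced.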
The "L" is for left-to-right scanning of the input and the "R" is for constructing a rightmost derivation in
reverse.
Why LR Parsing:
LR parsers can be constructed to recognize virtually all programming-language constructs for which
context-free grammars can be written.
The LR parsing method is the most general non-backtracking shift-reduce parsing method known, yet it can
be implemented as efficiently as other shift-reduce methods.
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers.
An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the
input.
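(Figure: model of an LR parser, showing the input buffer, the stack, the LR parsing program, its output, and the parsing table with its action and goto parts.)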
A convenient way to implement a shift-reduce parser is to use a stack to hold grammar symbols and
an input buffer to hold the string w to be parsed. The symbol $ is used to mark the bottom of the stack and
also the right-end of the input.
Notationally, the top of the stack is identified through a separator symbol |, and the input string to be parsed appears on the right side of |. The stack content appears on the left of |.
For example, an intermediate stage of parsing can be shown as follows:
$id1 | + id2 * id3$ …. (1)
Here “$id1” is in the stack, while the input yet to be seen is “+ id2 * id3$”.
In a shift-reduce parser, there are two fundamental operations: shift and reduce.
Shift operation: The next input symbol is shifted onto the top of the stack.
After shifting + into the stack, the above state captured in (1) would change into:
$id1 + | id2 * id3$
Reduce operation: Replaces a set of grammar symbols on the top of the stack with the LHS of a
production rule.
After reducing id1 using E → id, the state (1) would change into:
$E | + id2 * id3$
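The two operations can be written as small functions over a configuration (stack, remaining input). This is a minimal Python sketch. The function names shift, reduce and show are assumptions, and id1 is treated as an instance of the token id so that E → id applies to it.

```python
def shift(stack, remaining):
    """Move the next input symbol onto the top of the stack."""
    return stack + [remaining[0]], remaining[1:]

def reduce(stack, remaining, head, body):
    """Replace the handle `body` on top of the stack with the production head."""
    if stack[-len(body):] != body:
        raise ValueError("handle is not on top of the stack")
    return stack[:-len(body)] + [head], remaining

def show(stack, remaining):
    """Render a configuration in the 'stack | input' notation."""
    return " ".join(stack) + " | " + " ".join(remaining)

# State (1) from the text: $id1 | + id2 * id3$
stack, remaining = ["$", "id1"], ["+", "id2", "*", "id3", "$"]

print(show(*shift(stack, remaining)))                  # after shifting +
print(show(*reduce(stack, remaining, "E", ["id1"])))   # after reducing by E -> id
```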
In every example, we introduce a new start symbol (S’), and define a new production from this new start
symbol to the original start symbol of the grammar.
Consider the following grammar (putting an explicit end-marker $ at the end of the first production):
(1) S’ → S$
(2) S → Sa
(3) S → b
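For this example, the NFA for the stack has one state per item, namely S’ → .S$, S’ → S.$, S → .Sa, S → S.a, S → Sa., S → .b and S → b., with transitions labelled S, a and b that move the dot over the corresponding symbol.
(Figure: NFA over the items listed above.)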
The states of the DFA are also called the “Canonical Collection of Items”. Using the above notation, the ACTION-GOTO table can be shown as follows:
State   a       b     $     S
1               s3
2       s4,r1   r1    r1
3       r3      r3    r3
4       r2      r2    r2
Closure (I)
repeat
    for each item [A → α.Bβ, a] in I
        for each production B → γ in G'
            and for each terminal b in First(βa)
                add item [B → .γ, b] to I
until no more items can be added to I
This function is used to find the closure of a set of items for canonical LR parsers.
Example
Consider the following grammar
S' → S
S → CC
C → cC | d
Compute closure (I) where I = {[S' → .S, $]}
S' → .S, $
S → .CC, $
C → .cC, c
C → .cC, d
C → .d, c
C → .d, d
I0 : S' → .S, $
     S → .CC, $
     C → .cC, c/d
     C → .d, c/d
I1 : goto(I0, S)
     S' → S., $
I2 : goto(I0, C)
     S → C.C, $
     C → .cC, $
     C → .d, $
I3 : goto(I0, c)
     C → c.C, c/d
     C → .cC, c/d
     C → .d, c/d
I4 : goto(I0, d)
     C → d., c/d
I5 : goto(I2, C)
     S → CC., $
I6 : goto(I2, c)
     C → c.C, $
     C → .cC, $
     C → .d, $
I7 : goto(I2, d)
     C → d., $
I8 : goto(I3, C)
     C → cC., c/d
I9 : goto(I6, C)
     C → cC., $
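The closure and goto computations that produce these item sets can be sketched in Python. The code below is an illustration, not the notes' own implementation. An item is the tuple (head, body, dot position, lookahead). Because the example grammar has no ε-productions, the helper first_of looks only at the first symbol of βa, which is a simplifying assumption.

```python
GRAMMAR = {"S'": [("S",)], "S": [("C", "C")], "C": [("c", "C"), ("d",)]}

def first_of(symbols):
    """FIRST of a symbol string; valid here because the grammar has no ε-productions."""
    sym = symbols[0]
    if sym not in GRAMMAR:
        return {sym}
    result = set()
    for body in GRAMMAR[sym]:
        if body[0] != sym:                 # skip immediate left recursion
            result |= first_of(body)
    return result

def closure(items):
    items, work = set(items), list(items)
    while work:
        head, body, dot, a = work.pop()
        if dot < len(body) and body[dot] in GRAMMAR:        # item [A -> α.Bβ, a]
            for b in first_of(body[dot + 1:] + (a,)):       # b in FIRST(βa)
                for gamma in GRAMMAR[body[dot]]:            # production B -> γ
                    item = (body[dot], gamma, 0, b)         # add [B -> .γ, b]
                    if item not in items:
                        items.add(item)
                        work.append(item)
    return frozenset(items)

def goto(items, symbol):
    """Move the dot over `symbol` and close the result."""
    moved = {(h, b, d + 1, a) for h, b, d, a in items if d < len(b) and b[d] == symbol}
    return closure(moved)

def canonical_collection():
    start = closure({("S'", ("S",), 0, "$")})
    states, work = [start], [start]
    while work:
        state = work.pop(0)
        for symbol in ("S", "C", "c", "d"):
            target = goto(state, symbol)
            if target and target not in states:
                states.append(target)
                work.append(target)
    return states

states = canonical_collection()
print(len(states), "sets of LR(1) items")
```

With this exploration order the states are discovered in the same order as I0 to I9 above.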
To construct the sets of LR(1) items for the grammar given above, we begin by computing the closure of {[S' → .S, $]}.
To compute the closure we use the function given previously.
In this case A = S', α = ε, B = S, β = ε and a = $, and first(βa) = first($) = {$}. So add the item [S → .CC, $].
For this new item we have A = S, α = ε, B = C, β = C and a = $.
Now first(βa) = first(C$) = first(C) contains c and d.
So we add the items [C → .cC, c], [C → .cC, d], [C → .d, c], [C → .d, d].
Similarly we use this function to construct all the sets of LR(1) items.
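Parse table
state  id    +     *     (     )     $     E     T     F
0      s5    -     -     s4    -     -     1     2     3
1      -     s6    -     -     -     acc   -     -     -
2      -     r2    s7    -     r2    r2    -     -     -
3      -     r4    r4    -     r4    r4    -     -     -
4      s5    -     -     s4    -     -     8     2     3
5      -     r6    r6    -     r6    r6    -     -     -
6      s5    -     -     s4    -     -     -     9     3
7      s5    -     -     s4    -     -     -     -     10
8      -     s6    -     -     s11   -     -     -     -
9      -     r1    s7    -     r1    r1    -     -     -
10     -     r3    r3    -     r3    r3    -     -     -
11     -     r5    r5    -     r5    r5    -     -     -
Table 2.6: Parser Table (a dash marks an empty entry)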
We represent shift j as sj and reduction by rule number j as rj. Note that entries corresponding to [state, terminal] belong to the action table and entries corresponding to [state, non-terminal] belong to the goto table. We have [1, $] as accept because [S' → S., $] ∈ I1.
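A table-driven LR parser only needs the action and goto entries. The Python sketch below encodes the parse table for the expression grammar and runs the standard LR driver loop. The rule numbering (1) E → E+T, (2) E → T, (3) T → T*F, (4) T → F, (5) F → (E), (6) F → id is an assumption, since the notes do not number the productions. The name lr_parse is illustrative.

```python
# rule number -> (head, length of body), under the assumed numbering.
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4: {"id": "s5", "(": "s4"},
    5: {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}
GOTO = {0: {"E": 1, "T": 2, "F": 3}, 4: {"E": 8, "T": 2, "F": 3},
        6: {"T": 9, "F": 3}, 7: {"F": 10}}

def lr_parse(tokens):
    """Return the rule numbers in the order they are reduced, or raise SyntaxError."""
    tokens = tokens + ["$"]
    stack, i, reductions = [0], 0, []          # the stack holds state numbers only
    while True:
        act = ACTION[stack[-1]].get(tokens[i])
        if act is None:
            raise SyntaxError(f"unexpected {tokens[i]} in state {stack[-1]}")
        if act == "acc":
            return reductions
        if act[0] == "s":                      # shift: push the new state, advance
            stack.append(int(act[1:]))
            i += 1
        else:                                  # reduce: pop |body| states, then goto
            rule = int(act[1:])
            head, length = PRODUCTIONS[rule]
            del stack[-length:]
            stack.append(GOTO[stack[-1]][head])
            reductions.append(rule)

print(lr_parse(["id", "*", "id", "+", "id"]))
```

Read backwards, the printed rule sequence is a rightmost derivation of the input, which is the "R" in LR.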
LALR Parse table
Look Ahead LR parsers
Consider a pair of similar-looking states (same kernel and different lookaheads) in the sets of LR(1) items:
I4 : C → d., c/d        I7 : C → d., $
Replace I4 and I7 by a new state I47 consisting of (C → d., c/d/$).
Similarly, I3 & I6 and I8 & I9 form pairs.
Merge LR(1) items having the same core
We combine Ii and Ij to construct a new set Iij if Ii and Ij have the same core and differ only in their lookahead symbols. After merging, the sets of LR(1) items for the previous example are as follows:
I0 : S' → .S, $
     S → .CC, $
     C → .cC, c/d
     C → .d, c/d
I1 : goto(I0, S)
     S' → S., $
I2 : goto(I0, C)
     S → C.C, $
     C → .cC, $
     C → .d, $
I36 : goto(I0, c) and goto(I2, c)
     C → c.C, c/d/$
     C → .cC, c/d/$
     C → .d, c/d/$
I47 : goto(I0, d) and goto(I2, d)
     C → d., c/d/$
I5 : goto(I2, C)
     S → CC., $
I89 : goto(I36, C)
     C → cC., c/d/$
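Merging by core is a grouping operation. The Python sketch below writes each LR(1) item set of the example as a mapping from LR(0) item to lookahead set and merges the sets whose keys coincide. The dictionary encoding and the name merge_by_core are assumptions made for illustration.

```python
from collections import defaultdict

# LR(1) item sets of the example, written as {LR(0) item: lookaheads}.
LR1_STATES = {
    "I0": {"S' -> .S": {"$"}, "S -> .CC": {"$"},
           "C -> .cC": {"c", "d"}, "C -> .d": {"c", "d"}},
    "I1": {"S' -> S.": {"$"}},
    "I2": {"S -> C.C": {"$"}, "C -> .cC": {"$"}, "C -> .d": {"$"}},
    "I3": {"C -> c.C": {"c", "d"}, "C -> .cC": {"c", "d"}, "C -> .d": {"c", "d"}},
    "I4": {"C -> d.": {"c", "d"}},
    "I5": {"S -> CC.": {"$"}},
    "I6": {"C -> c.C": {"$"}, "C -> .cC": {"$"}, "C -> .d": {"$"}},
    "I7": {"C -> d.": {"$"}},
    "I8": {"C -> cC.": {"c", "d"}},
    "I9": {"C -> cC.": {"$"}},
}

def merge_by_core(states):
    """Merge item sets that have the same core (the same set of LR(0) items)."""
    groups = defaultdict(list)
    for name, items in states.items():
        groups[frozenset(items)].append(name)       # the dict keys form the core
    merged = {}
    for core, names in groups.items():
        new_name = "I" + "".join(name[1:] for name in names)
        # The lookahead set of each item is the union over the merged states.
        merged[new_name] = {
            item: set().union(*(states[name][item] for name in names))
            for item in core
        }
    return merged

lalr_states = merge_by_core(LR1_STATES)
print(sorted(lalr_states))
```

The ten LR(1) sets collapse into seven LALR sets: I3 and I6, I4 and I7, and I8 and I9 are merged pairwise.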
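For the grammar S → CC, C → cC | d, with the productions numbered (1) S → CC, (2) C → cC, (3) C → d, the resulting LALR parse table is as follows (a dash marks an empty entry):
state  c      d      $      S     C
0      s36    s47    -      1     2
1      -      -      acc    -     -
2      s36    s47    -      -     5
36     s36    s47    -      -     89
47     r3     r3     r3     -     -
5      -      -      r1     -     -
89     r2     r2     r2     -     -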
The construction rules for LALR parse table are similar to construction of LR(1) parse table.
Notes on LALR parse table
The modified parser behaves like the original, except that on an erroneous input such as ccd it may reduce by C → d where the canonical LR parser would report an error at once. The error will eventually be caught before any more symbols are shifted.
In general, a core is a set of LR(0) items, and an LR(1) grammar may produce more than one set of items with the same core.
4 Parser Generators
Some common parser generators
YACC: Yet Another Compiler Compiler
Bison: GNU Software
ANTLR: ANother Tool for Language Recognition
Yacc/Bison source program specification (accepts LALR grammars)
declaration
%%
translation rules
%%
supporting C routines
(Figure: lex produces lex.yy.c and yacc produces y.tab.c; the C compiler cc compiles them into a.out.)
Types of translation
• L-attributed translation
It performs translation during parsing itself.
No need of explicit tree construction.
L represents 'left to right'.
• S-attributed translation
It is performed in connection with bottom up parsing.
'S' represents synthesized.
Types of attributes
• Inherited attributes
It is defined by a semantic rule associated with the production at the parent of the node.
Its value depends only on attributes of the node's parent, its siblings, and the node itself.
The non-terminal concerned must be in the body of the production.
• Synthesized attributes
It is defined by the semantic rule associated with the production at the node.
A syntax-directed definition in which the edges of the dependency graph between attributes in a production body can go from left to right, but not from right to left, is called an L-attributed definition. Attributes of L-attributed definitions may be either synthesized or inherited.
If an attribute is inherited, it must be computed from:
• Inherited attributes associated with the production head.
• Inherited or synthesized attributes associated with the symbols located to the left of the attribute being computed.
• Inherited or synthesized attributes associated with the symbol under consideration, in such a way that no cycles are formed in the dependency graph.
In production 1, the inherited attribute of T' is computed from the value of F, which is to its left. In production 2, the inherited attribute of T1' is computed from T'.inh, associated with its head, and the value of F, which appears to its left in the production. That is, an inherited attribute must be computed using only information from above or from the left in the SDD.
Syntax tree for a-4+c using the above SDD is shown below.
Figure 2.9: L-Attribute
E → TE”
E” → +T #1 E” | ε
T → F T”
T” → * F #2 T” | ε
F → id #3
Here #1 corresponds to printing the “+” operator, #2 corresponds to printing “*”, and #3 corresponds to printing id.val.
Look at the above SDT: although there are no attributes, it is an L-attributed definition because the semantic actions appear in between grammar symbols. This is a simple example of an L-attributed definition. Let us analyze it and understand how to evaluate attributes with a depth-first, left-to-right traversal. Take the parse tree for the input string “a + b*c” and perform a depth-first, left-to-right traversal, i.e. at each node traverse the left subtree completely, depth-wise, and then the right subtree completely.
Follow the traversal in. During the traversal whenever any dummy non-terminal is seen, carry out the
translation.
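The effect of this traversal can be reproduced with a recursive-descent translator in which each function corresponds to one non-terminal and the actions #1, #2 and #3 are executed at the positions where they appear in the production bodies. This Python sketch is an illustration: the function names, the token list format and the use of a list instead of printing are all assumptions.

```python
def to_postfix(tokens):
    """Translate an infix token list to postfix using the SDT above."""
    tokens = tokens + ["$"]
    pos = 0
    out = []

    def match(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected}, found {tok}")
        pos += 1
        return tok

    def E():                      # E -> T E''
        T()
        E_rest()

    def E_rest():                 # E'' -> + T #1 E'' | ε
        if tokens[pos] == "+":
            match("+")
            T()
            out.append("+")       # action #1
            E_rest()

    def T():                      # T -> F T''
        F()
        T_rest()

    def T_rest():                 # T'' -> * F #2 T'' | ε
        if tokens[pos] == "*":
            match("*")
            F()
            out.append("*")       # action #2
            T_rest()

    def F():                      # F -> id #3
        tok = match()
        if not tok.isalnum():
            raise SyntaxError(f"identifier expected, found {tok}")
        out.append(tok)           # action #3

    E()
    match("$")
    return " ".join(out)

print(to_postfix(["a", "+", "b", "*", "c"]))
```

Because each action runs after the operands to its left have been translated, the operators come out after their operands, which gives the postfix form.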
Converting L-Attributed to S-Attributed Definition
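Now that we understand that an S-attributed definition is simple compared to an L-attributed definition, let us see how to convert an L-attributed definition to an equivalent S-attributed one.
Consider an L-attributed definition with semantic actions in between the grammar symbols. Suppose we have the following:
S → A {} B
It can be converted by introducing a marker non-terminal M that derives ε and carries the action at the end of its production:
S → A M B
M → ε {}
Applying this to the L-attributed SDT
E → TE”
E” → +T #1 E” | ε
T → F T”
T” → *F #2 T” | ε
F → id #3
gives the equivalent S-attributed SDT
E → TE”
E” → +T A E” | ε
A → ε { print(“+”); }
T → F T”
T” → *F B T” | ε
B → ε { print(“*”); }
F → id { print(“id”); }
Table 2.14: Solution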
6. YACC
YACC (Yet Another Compiler Compiler) is a tool that automatically generates LALR parsers.
Using Yacc
Yacc specifications are prepared in a file with the extension “.y”, for example “test.y”. Run Yacc on this file with the command “$yacc test.y”. This translates the Yacc specification into C code in the default file “y.tab.c”, where the parser is a function called yyparse(). Now compile “y.tab.c” with a C compiler and test the program. The steps to be performed are given below:
Commands to execute
$yacc test.y
This gives an output file “y.tab.c”, which is a parser in C under the function name yyparse().
With the -v option ($yacc -v test.y), Yacc also produces the file y.output, which gives complete information about the LALR parser, such as DFA states, conflicts, number of terminals used, etc.
$cc y.tab.c
$./a.out
Preparing the Yacc specification file
Every yacc specification file consists of three sections: the declarations, grammar rules, and supporting
subroutines. The sections are separated by double percent “%%” marks.
declarations
%%
Translation rules
%%
supporting subroutines
The declarations section is optional. If there are no supporting subroutines, the second %% can also be omitted; thus, the smallest legal Yacc specification is
%%
Translation rules
Declarations section
The declarations part contains two types of declarations: Yacc declarations and C declarations. To distinguish between the two, C declarations are enclosed within %{ and %}. Here we can have C declarations such as global variable declarations (int x=1;), header files (#include....), and macro definitions (#define...). These may be used by the subroutines in the last section or by the actions in the grammar rules.
Yacc declarations are nothing but tokens or terminals. We can define tokens with %token in the declarations part. For example, if “num” is a terminal in the grammar, then we define
%token num
in the declarations part. In grammar rules, symbols within single quotes are also taken as terminals.
We can define the precedence and associativity of the tokens in the declarations section. This is done using %left or %right, followed by a list of tokens. The tokens defined on the same line have the same precedence and associativity; the lines are listed in order of increasing precedence. Thus,
%left '+' '-'
%left '*' '/'
are used to define the associativity and precedence of the four basic arithmetic operators ‘+’, ‘-’, ‘/’, ‘*’. Operators ‘*’ and ‘/’ have higher precedence than ‘+’ and ‘-’, and all four are left associative. The keyword %left is used to define left associativity and %right is used to define right associativity.
• Such formalism generates Annotated Parse-Trees where each node of the tree is a record with a field for
each attribute (e.g., X.a indicates the attribute a of the grammar symbol X).
• The value of an attribute of a grammar symbol at a given parse-tree node is defined by a semantic rule
associated with the production used at that node.
2. Inherited Attributes: They are computed from the values of the attributes of both the siblings and the
parent nodes.
Form of Syntax Directed Definitions
Each production, A → α, is associated with a set of semantic rules of the form b := f(c1, c2, . . . , ck), where f is a function and either:
• b is a synthesized attribute of A, and c1, c2, . . . , ck are attributes of the grammar symbols of the production, or
• b is an inherited attribute of a grammar symbol in α, and c1, c2, . . . , ck are attributes of grammar symbols in α or attributes of A.
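A semantic rule of the form b := f(c1, ..., ck) for a synthesized attribute can be evaluated bottom-up over the parse tree. The Python sketch below is a hypothetical example: the productions, the attribute name val, and the tree encoding as (production, children) pairs are assumptions chosen for illustration and are not taken from the notes.

```python
# One semantic rule per production; each computes the synthesized attribute
# `val` of the head from the `val` attributes of the symbols in the body.
RULES = {
    "E -> E + T": lambda e, t: e + t,      # E.val := E1.val + T.val
    "E -> T":     lambda t: t,             # E.val := T.val
    "T -> T * F": lambda t, f: t * f,      # T.val := T1.val * F.val
    "T -> F":     lambda f: f,             # T.val := F.val
    "F -> num":   lambda lexval: lexval,   # F.val := num.lexval
}

def val(node):
    """Evaluate the synthesized attribute `val` at a parse-tree node."""
    production, children = node
    if production == "F -> num":           # leaf: the value comes from the scanner
        return RULES[production](children[0])
    # Interior node: evaluate the children first, then apply the rule f.
    return RULES[production](*[val(child) for child in children])

# Parse tree for 3 + 4 * 5 (operator tokens are omitted from the child lists).
tree = ("E -> E + T", [
    ("E -> T", [("T -> F", [("F -> num", [3])])]),
    ("T -> T * F", [
        ("T -> F", [("F -> num", [4])]),
        ("F -> num", [5]),
    ]),
])

print(val(tree))
```

Since every attribute here is synthesized, a single post-order traversal is enough to annotate the whole tree.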