Unit 2
Parser
CFG - Derivation - CFG vs. R.E. - Types of Parser - Bottom-Up: Shift-Reduce Parsing,
Operator Precedence Parsing, SLR Parser - Top-Down: Recursive Descent Parser,
Non-Recursive Descent Parser.
SYNTAX ANALYSIS
Every programming language has rules that prescribe the syntactic structure of well-formed
programs. In Pascal, for example, a program is made out of blocks, a block out of statements, a
statement out of expressions, an expression out of tokens, and so on. The syntax of programming
language constructs can be described by context-free grammars or BNF (Backus-Naur Form)
notation. Grammars offer significant advantages to both language designers and compiler writers.
Leftmost derivation:
E ⇒lm E + E
  ⇒lm E * E + E
  ⇒lm id * E + E
  ⇒lm id * id + E
  ⇒lm id * id + id
The string w = id * id + id, derived from the grammar, consists entirely of terminal symbols.
Rightmost derivation:
E ⇒rm E + E
  ⇒rm E + E * E
  ⇒rm E + E * id
  ⇒rm E + id * id
  ⇒rm id + id * id
Leftmost and rightmost derivations of -(id + id):
E ⇒lm -(E)                 E ⇒rm -(E)
  ⇒lm -(E + E)               ⇒rm -(E + E)
  ⇒lm -(id + E)              ⇒rm -(E + id)
  ⇒lm -(id + id)             ⇒rm -(id + id)
Strings that appear in a leftmost derivation are called left-sentential forms. Strings that appear
in a rightmost derivation are called right-sentential forms.
Sentential forms:
Given a grammar G with start symbol S, if S ⇒* α, where α may contain non-terminals or
terminals, then α is called a sentential form of G.
Yield or frontier of tree:
Each interior node of a parse tree is a non-terminal. The children of a node can be
terminals or non-terminals. The leaves of the tree, read from left to right, form a sentential
form, which is called the yield or frontier of the tree.
Parse Tree:
Inner nodes of a parse tree are non-terminal symbols.
The leaves of a parse tree are terminal symbols.
A parse tree can be seen as a graphical representation of a derivation.
Ambiguity:
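A grammar that produces more than one parse tree for some sentence is said to be an ambiguous
grammar. For example, with the productions E → E + E | E * E | id, the sentence id + id * id
has two distinct leftmost derivations:
E ⇒lm E + E                E ⇒lm E * E
  ⇒lm id + E                 ⇒lm E + E * E
  ⇒lm id + E * E             ⇒lm id + E * E
  ⇒lm id + id * E            ⇒lm id + id * E
  ⇒lm id + id * id           ⇒lm id + id * id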
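The ambiguity can be checked mechanically by counting parse trees. The following is a minimal sketch, assuming the grammar E → E + E | E * E | id; the function name count_parse_trees is illustrative and not part of these notes. It treats each operator in turn as the root of the tree and multiplies the counts for the left and right operands.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count the parse trees of `tokens` under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        # Try every operator position as the root of the tree.
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["id", "+", "id", "*", "id"]))  # prints 2
```

A count greater than 1 for any sentence shows that the grammar is ambiguous.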
WHY LR PARSING:
LR parsers can be constructed to recognize virtually all programming-language constructs for
which context-free grammars can be written.
The LR parsing method is the most general non-backtracking shift-reduce parsing method known,
yet it can be implemented as efficiently as other shift-reduce methods.
The class of grammars that can be parsed using LR methods is a proper superset of the class of
grammars that can be parsed with predictive parsers.
An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right
scan of the input.
The disadvantage is that it takes too much work to construct an LR parser by hand for a
typical programming-language grammar. However, many LR parser generators are available to
make this task easy.
LL vs LR:
LL | LR
Starts with the root nonterminal on the stack. | Ends with the root nonterminal on the stack.
Ends when the stack is empty. | Starts with an empty stack.
Uses the stack for designating what is still to be expected. | Uses the stack for designating what is already seen.
Builds the parse tree top-down. | Builds the parse tree bottom-up.
Continuously pops a nonterminal off the stack, and pushes the corresponding right hand side. | Tries to recognize a right hand side on the stack, pops it, and pushes the corresponding nonterminal.
Reads the terminals when it pops one off the stack. | Reads the terminals while it pushes them on the stack.
Pre-order traversal of the parse tree. | Post-order traversal of the parse tree.
Notational Conventions
To avoid always having to state that "these are the terminals," "these are the nonterminals," and
so on, we shall employ the following notational conventions:
1. These symbols are terminals:
i) Lowercase letters early in the alphabet such as a, b, c.
ii) Operator symbols such as +, -, etc.
iii) Punctuation symbols such as parentheses, comma, etc.
iv) The digits 0, 1, …, 9.
v) Boldface strings such as id or if.
2. These symbols are nonterminals:
i) Upper-case letters early in the alphabet such as A, B, C.
ii) The letter S, which, when it appears, is usually the start symbol.
iii) Lower-case italic names such as expr or stmt.
3. Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is,
either nonterminals or terminals.
4. Lower-case letters late in the alphabet, chiefly u, v, …, z, represent strings of terminals.
5. Lower-case Greek letters, α, β, γ, for example, represent strings of grammar symbols. Thus, a
generic production could be written as A → α, indicating that there is a single nonterminal A
on the left of the arrow (the left side of the production) and a string of grammar symbols α to
the right of the arrow (the right side of the production).
6. If A → α1, A → α2, …, A → αk are all productions with A on the left (we call them
A-productions), we may write A → α1 | α2 | … | αk. We call α1, α2, …, αk the
alternatives for A.
7. Unless otherwise stated, the left side of the first production is the start symbol.
BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going towards the root
is called bottom-up parsing. A general type of bottom-up parser is a shift-reduce parser.
SHIFT-REDUCE PARSING
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse tree for an
input string beginning at the leaves (the bottom) and working up towards the root (the top).
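Example:
Consider the grammar
S → aAcBe
A → Ab | b
B → d
and the input string abbcde.
Steps of reduction:
abbcde (b and d can be reduced)
aAbcde (the leftmost b is reduced to A; now Ab, b and d qualify for reduction)
aAcde (Ab is reduced to A; d can be reduced)
aAcBe (d is reduced to B; the whole string can be reduced)
S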
Each replacement of the right side of a production by the left side in the above example is called
a reduction. The sequence of reductions traces out a rightmost derivation in reverse.
Handle:
A substring which is the right side of a production such that replacement of that substring
by the production left side leads eventually to a reduction to the start symbol, by the reverse of a
rightmost derivation is called a handle.
Definition: A handle of a right-sentential form γ is a production A → β and a position of γ where
the string β may be found and replaced by A to produce the previous right-sentential form in a
rightmost derivation of γ. That is, if S ⇒*rm αAw ⇒rm αβw, then A → β in the position following α is
a handle of αβw. The string w to the right of the handle contains only terminal symbols.
Example: The actions of a shift-reduce parser in parsing the input string id1+id2*id3, according to
the ambiguous grammar for arithmetic expressions.
While the primary operations of the parser are shift and reduce, there
are actually four possible actions a shift-reduce parser can make: (1)
shift, (2) reduce, (3) accept, and (4) error.
1. In a shift action, the next input symbol is shifted onto the top of the stack.
2. In a reduce action, the parser knows the right end of the handle is at the
top of the stack. It must then locate the left end of the handle within the
stack and decide with what nonterminal to replace the handle.
3. In an accept action, the parser announces successful completion
of parsing.
4. In an error action, the parser discovers that a syntax error has
occurred and calls an error recovery routine.
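The four actions can be illustrated with a naive shift-reduce loop for the input abbcde. This is only a sketch under two assumptions: the grammar S → aAcBe, A → Ab | b, B → d (inferred from the reduction steps of the earlier example), and a greedy rule that reduces whenever some right side is a suffix of the stack, trying longer right sides first. That greedy rule happens to pick the correct handles for this grammar; it is not a general method, which is why practical parsers consult a parsing table instead.

```python
# Grammar inferred from the reduction steps for "abbcde" (an assumption).
PRODUCTIONS = [
    ("S", "aAcBe"),
    ("A", "Ab"),
    ("A", "b"),
    ("B", "d"),
]


def shift_reduce(w):
    """Return the (stack, remaining input, action) steps, or raise SyntaxError."""
    stack, i, steps = "", 0, []
    while True:
        # Reduce: look for the longest right side that ends the stack.
        for head, body in sorted(PRODUCTIONS, key=lambda p: -len(p[1])):
            if stack.endswith(body):
                steps.append((stack, w[i:], f"reduce {head} -> {body}"))
                stack = stack[: len(stack) - len(body)] + head
                break
        else:
            if stack == "S" and i == len(w):
                steps.append((stack, "", "accept"))
                return steps
            if i == len(w):
                # Error action: nothing to shift and nothing to reduce.
                raise SyntaxError(f"cannot reduce {stack!r}")
            steps.append((stack, w[i:], "shift"))
            stack += w[i]
            i += 1


for stack, rest, action in shift_reduce("abbcde"):
    print(f"{stack:<6} {rest:<7} {action}")
```

The reductions appear in the order A → b, A → Ab, B → d, S → aAcBe, which is the rightmost derivation of abbcde read backwards.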
Operator grammars have the property that no production right side is ε (empty) or has two
adjacent non-terminals. This property enables the implementation of efficient
operator-precedence parsers.
Operator-precedence parsing defines three disjoint precedence relations, <·, =· and ·>, between
certain pairs of terminals. These precedence relations guide the selection of handles: they
delimit the handle in a right-sentential form, with <· marking the left end, =· appearing in
the interior of the handle, and ·> marking the right end.
Precedence Functions
Compilers using operator-precedence parsers need not store the table of precedence relations. In
most cases, the table can be encoded by two precedence functions f and g that map terminal
symbols to integers. We attempt to select f and g so that, for symbols a and b,
1. f(a) < g(b) whenever a <· b,
2. f(a) = g(b) whenever a =· b, and
3. f(a) > g(b) whenever a ·> b.
Not every table of precedence relations has precedence functions to encode it, but in practical
cases the functions usually exist.
Example: The precedence table given above has the following pair of precedence functions.
For example, * <· id, and f(*) < g(id). Note that f(id) > g(id) suggests that id ·> id; but, in fact,
no precedence relation holds between id and id. Other error entries are similarly replaced by
one or another precedence relation.
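One common way to obtain f and g is a graph construction: create nodes f_a and g_a for every terminal a, add an edge f_a → g_b whenever a ·> b and an edge g_b → f_a whenever a <· b, and take the length of the longest path leaving each node as its value. The sketch below assumes a small illustrative relation table over id, +, * and $ (it is not the table these notes refer to, so the resulting numbers differ), and it does not handle the =· relation, which would require merging nodes.

```python
# REL[a][b] is the relation between terminal a (on the stack) and b (in the input).
# Illustrative table only: an assumption, not taken from these notes.
REL = {
    "id": {"+": ">", "*": ">", "$": ">"},
    "+": {"id": "<", "+": ">", "*": "<", "$": ">"},
    "*": {"id": "<", "+": ">", "*": ">", "$": ">"},
    "$": {"id": "<", "+": "<", "*": "<"},
}


def precedence_functions(rel):
    """Return (f, g) as dicts, or raise ValueError if no functions exist."""
    terminals = list(rel)
    edges = {(kind, t): [] for kind in "fg" for t in terminals}
    for a, row in rel.items():
        for b, r in row.items():
            if r == ">":
                edges[("f", a)].append(("g", b))   # f(a) must exceed g(b)
            elif r == "<":
                edges[("g", b)].append(("f", a))   # g(b) must exceed f(a)
            else:
                raise ValueError("this sketch does not handle the =· relation")

    longest, on_path = {}, set()

    def depth(node):
        # Length of the longest path starting at `node`.
        if node in longest:
            return longest[node]
        if node in on_path:
            raise ValueError("cycle found: no precedence functions exist")
        on_path.add(node)
        d = max((1 + depth(n) for n in edges[node]), default=0)
        on_path.discard(node)
        longest[node] = d
        return d

    f = {t: depth(("f", t)) for t in terminals}
    g = {t: depth(("g", t)) for t in terminals}
    return f, g


f, g = precedence_functions(REL)
print(f)
print(g)
```

A cycle in the graph means that the table cannot be encoded by any pair of functions, which is the case mentioned above where precedence functions do not exist.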
Problem 1:
Consider the following grammar. Construct the operator precedence parsing table and
check whether the input strings (i) *id=id and (ii) id*id=id are successfully parsed.
S→L=R
S→R
L→*R
L→id
R→L
Solution:
1. Computation of LEADING:
LEADING(S) = { =, *, id }
LEADING(L) = { *, id }
LEADING(R) = { *, id }
2. Computation of TRAILING:
TRAILING(S) = { =, *, id }
TRAILING(L) = { *, id }
TRAILING(R) = { *, id }
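The LEADING and TRAILING sets above can be computed by a simple fixed-point iteration. The sketch below assumes an operator grammar (no ε-productions and no adjacent non-terminals); the function name leading and its reverse flag are illustrative. TRAILING is obtained by applying the same rules to the mirrored right sides.

```python
GRAMMAR = {
    "S": [["L", "=", "R"], ["R"]],
    "L": [["*", "R"], ["id"]],
    "R": [["L"]],
}


def leading(grammar, reverse=False):
    """LEADING sets; with reverse=True the bodies are mirrored, giving TRAILING."""
    sets = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                body = body[::-1] if reverse else body
                new = set()
                first = body[0]
                if first in grammar:
                    # A -> B ... : everything in LEADING(B) is in LEADING(A).
                    new |= sets[first]
                    if len(body) > 1 and body[1] not in grammar:
                        new.add(body[1])   # A -> B a ... contributes a
                else:
                    new.add(first)         # A -> a ... contributes a
                if not new <= sets[head]:
                    sets[head] |= new
                    changed = True
    return sets


print(leading(GRAMMAR))                # LEADING sets
print(leading(GRAMMAR, reverse=True))  # TRAILING sets
```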
3. Precedence Table:
        =     *     id    $
=             <·    <·    ·>
*       ·>    <·    <·    ·>
id      ·>                ·>
$       <·    <·    <·
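The two input strings of Problem 1 can be checked against this table with a skeleton operator-precedence parser. The sketch below keeps only terminals on the stack and consults nothing but the precedence relations, so it does not verify each reduction against a production; that simplification, and the names TABLE and op_precedence_parse, are assumptions made for illustration.

```python
# Precedence relations from the table of Problem 1 (row = stack top, column = input).
TABLE = {
    "=": {"*": "<", "id": "<", "$": ">"},
    "*": {"=": ">", "*": "<", "id": "<", "$": ">"},
    "id": {"=": ">", "$": ">"},
    "$": {"=": "<", "*": "<", "id": "<"},
}


def op_precedence_parse(tokens):
    """Return True if the precedence relations accept `tokens`, else False."""
    stack = ["$"]                      # terminals only; nonterminals are ignored
    tokens = list(tokens) + ["$"]
    i = 0
    while True:
        a, b = stack[-1], tokens[i]
        if a == "$" and b == "$":
            return True                # accept
        rel = TABLE.get(a, {}).get(b)
        if rel == "<":                 # shift (an =· entry would also shift)
            stack.append(b)
            i += 1
        elif rel == ">":               # reduce: pop the handle
            while True:
                popped = stack.pop()
                if TABLE.get(stack[-1], {}).get(popped) == "<":
                    break
        else:
            return False               # blank entry: syntax error


print(op_precedence_parse(["*", "id", "=", "id"]))        # True
print(op_precedence_parse(["id", "*", "id", "=", "id"]))  # False
```

Under these assumptions, string (i) *id=id is accepted, while string (ii) id*id=id fails at once because the table has no relation between id and *.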
Assignment:
Consider the following grammar:
1) S→(L)
S→a
L→L,S
L→S
2) S→a
S→↑
S→(T)
T→T,S
T→S
Example:
Consider the ambiguous grammar E → E + E | E * E | id.
Solution:
1. Computation of LEADING:
LEADING(E) = { +, *, id }
2. Computation of TRAILING:
TRAILING(E) = { +, *, id }
3. Precedence Table:
        +        *        id    $
+       <·/·>    <·/·>    <·    ·>
*       <·/·>    <·/·>    <·    ·>
id      ·>       ·>             ·>
$       <·       <·       <·
All undefined entries are errors. Since the precedence table has entries that hold more than one
relation, the grammar is not an operator precedence grammar.
Top-Down Parsing:
Left factoring:
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive parsing. When it is not clear which of two alternative productions to use to expand a
non-terminal A, we can rewrite the A-productions to defer the decision until we have seen
enough of the input to make the right choice.
If there are productions A → αβ1 | αβ2, they can be rewritten as
A → αA’
A’ → β1 | β2
Consider the grammar G:
S → iEtS | iEtSeS | a
E → b
Left factored, this grammar becomes
S → iEtSS’ | a
S’ → eS | ε
E → b
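The transformation can be sketched in a few lines. The function below performs a single round of left factoring on the alternatives of one non-terminal; the name left_factor and the representation (tuples of symbols, with the empty tuple standing for ε) are assumptions made for illustration. A full implementation would repeat the step until no two alternatives share a prefix.

```python
def left_factor(head, alternatives):
    """One round of left factoring for the alternatives of `head`.

    Each alternative is a tuple of symbols; () stands for ε.
    Returns a dict mapping each nonterminal to its list of alternatives.
    """
    groups = {}
    for alt in alternatives:
        groups.setdefault(alt[:1], []).append(alt)   # group by first symbol

    result = {head: []}
    new_name = head
    for key, group in groups.items():
        if len(group) == 1 or key == ():
            result[head].extend(group)               # nothing to factor
            continue
        # Longest prefix common to every alternative in the group.
        prefix = group[0]
        for alt in group[1:]:
            n = 0
            while n < min(len(prefix), len(alt)) and prefix[n] == alt[n]:
                n += 1
            prefix = prefix[:n]
        new_name += "'"                              # fresh nonterminal A', A'', ...
        result[head].append(prefix + (new_name,))
        result[new_name] = [alt[len(prefix):] for alt in group]
    return result


S_ALTERNATIVES = [("i", "E", "t", "S"), ("i", "E", "t", "S", "e", "S"), ("a",)]
print(left_factor("S", S_ALTERNATIVES))
```

For the grammar G this produces S → iEtSS' | a and S' → ε | eS, the same productions as the left-factored grammar above, apart from the order of the alternatives.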
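Recursive descent parsing with backtracking:
Example: Consider the grammar
S → cAd
A → ab | a
and the input string w = cad.
Step1:
Initially create a tree with a single node labeled S. The input pointer points to ‘c’, the first
symbol of w. Expand S using its production to obtain the leaves c, A and d.
Step2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second
symbol of w, ‘a’, and consider the next leaf ‘A’. Expand A using the first alternative.
Step3:
The second symbol ‘a’ of w also matches the second leaf of the tree, so advance the input pointer
to the third symbol of w, ‘d’. But the third leaf of the tree is ‘b’, which does not match the input
symbol ‘d’. Hence discard the chosen production and reset the pointer to the second position. This is
called backtracking.
Step4:
Now try the second alternative for A. The leaf ‘a’ matches the second symbol of w and the leaf ‘d’
matches the third symbol. Now we can halt and announce the successful completion
of parsing.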
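The steps above can be sketched as a small backtracking recognizer, assuming the grammar S → cAd, A → ab | a, which matches the steps traced above. Generators make the backtracking implicit: an alternative that fails simply yields no positions, and the next alternative is tried from the same input position. The function names are illustrative, and the sketch would loop forever on a left-recursive grammar.

```python
GRAMMAR = {
    "S": [["c", "A", "d"]],
    "A": [["a", "b"], ["a"]],
}


def derive(symbol, w, pos):
    """Yield every input position reachable after matching `symbol` from `pos`."""
    if symbol not in GRAMMAR:                 # terminal: must match the input
        if pos < len(w) and w[pos] == symbol:
            yield pos + 1
        return
    for body in GRAMMAR[symbol]:              # try the alternatives in order
        yield from match_body(body, w, pos)


def match_body(body, w, pos):
    """Yield the positions reachable after matching all symbols of `body`."""
    if not body:
        yield pos
        return
    for mid in derive(body[0], w, pos):
        # If the rest of the body fails from `mid`, nothing is yielded and
        # control returns to the caller's next alternative: the backtrack.
        yield from match_body(body[1:], w, mid)


def parse(w):
    """Return True if the whole string w can be derived from S."""
    return any(end == len(w) for end in derive("S", w, 0))


print(parse("cad"))   # True: A -> ab fails on 'd', then A -> a succeeds
print(parse("cabd"))  # True
print(parse("cd"))    # False
```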
LL(1) grammars:
If every entry of the predictive parsing table holds at most one production, the grammar is
called an LL(1) grammar.
Consider the following grammar:
S → iEtS | iEtSeS | a
E → b
After left factoring, we have
S → iEtSS’ | a
S’ → eS | ε
E → b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = { e, ε }
FIRST(E) = { b }
FOLLOW(S) = { $, e }
FOLLOW(S’) = { $, e }
FOLLOW(E) = { t }
Since the table entry M[S’, e] contains more than one production (S’ → eS and S’ → ε), the
grammar is not an LL(1) grammar.
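The FIRST and FOLLOW sets and the conflicting table entry can be reproduced with a short fixed-point computation. This is a sketch under the usual textbook rules for FIRST, FOLLOW and predictive table construction; the helper names (first_of, first_follow, ll1_table) are illustrative and the left-factored grammar is hard-coded.

```python
EPS, END = "ε", "$"
GRAMMAR = {
    "S": [["i", "E", "t", "S", "S'"], ["a"]],
    "S'": [["e", "S"], []],          # [] is the ε-production
    "E": [["b"]],
}
START = "S"


def first_of(seq, first):
    """FIRST of a string of symbols, given the FIRST sets of the nonterminals."""
    out = set()
    for sym in seq:
        f = first[sym] if sym in GRAMMAR else {sym}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                     # every symbol in seq can derive ε
    return out


def first_follow():
    first = {nt: set() for nt in GRAMMAR}
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add(END)
    changed = True
    while changed:                   # iterate until no set grows
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                new = first_of(body, first)
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
                for k, sym in enumerate(body):
                    if sym not in GRAMMAR:
                        continue
                    tail = first_of(body[k + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:
                        new |= follow[head]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return first, follow


def ll1_table():
    """Map (nonterminal, terminal) to the list of productions placed there."""
    first, follow = first_follow()
    table = {}
    for head, bodies in GRAMMAR.items():
        for body in bodies:
            f = first_of(body, first)
            lookaheads = (f - {EPS}) | (follow[head] if EPS in f else set())
            for a in lookaheads:
                table.setdefault((head, a), []).append(body)
    return table


conflicts = {k: v for k, v in ll1_table().items() if len(v) > 1}
print(conflicts)   # the only multiply-defined entry is M[S', e]
```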
LR PARSERS
An efficient bottom-up syntax analysis technique that can be used to parse a large class of
CFGs is called LR(k) parsing. The ‘L’ is for left-to-right scanning of the input, the ‘R’ for
constructing a rightmost derivation in reverse, and the ‘k’ for the number of input symbols
of lookahead used in making parsing decisions. When ‘k’ is omitted, it is assumed to be 1.
Advantages of LR parsing:
It recognizes virtually all programming language constructs for which a CFG can be
written.
It is an efficient non-backtracking shift-reduce parsing method.
The class of grammars that can be parsed using LR methods is a proper superset of the class
of grammars that can be parsed with predictive parsers.
It detects a syntactic error as soon as possible.
Drawbacks of LR method:
It is too much work to construct an LR parser by hand for a programming language
grammar. A specialized tool, called an LR parser generator, is needed. Example: YACC.
[Figure: model of an LR parser, showing the input buffer a1 … ai … an $ and the stack]
It consists of: an input, an output, a stack, a driver program, and a parsing table that has two
parts (action and goto).
Action: The parsing program determines sm, the state currently on top of the stack, and ai, the
current input symbol. It then consults action[sm, ai] in the action table, which can have one of four
values:
1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept, and
4. error.
Goto: The function goto takes a state and a grammar symbol as arguments and produces a state.
LR Parsing algorithm:
Input: An input string w and an LR parsing table with functions action and goto for grammar G.
Output: If w is in L(G), a bottom-up-parse for w; otherwise, an error indication.
Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input
buffer. The parser then executes the following program :
set ip to point to the first input symbol of w$;
repeat forever begin
    let s be the state on top of the stack and a the symbol pointed to by ip;
    if action[s, a] = shift s’ then begin
        push a then s’ on top of the stack;
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A → β then begin
        pop 2*|β| symbols off the stack;
        let s’ be the state now on top of the stack;
        push A then goto[s’, A] on top of the stack;
        output the production A → β
    end
    else if action[s, a] = accept then
        return
    else error( )
end
LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position of the
right side. For example, production A → XYZ yields the four items :
A → . XYZ
A → X . YZ
A → XY . Z
A → XYZ .
Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from
I by the two rules:
1. Initially, every item in I is added to closure(I).
2. If A → α . Bβ is in closure(I) and B → γ is a production, then add the item B → . γ to
closure(I), if it is not already there. We apply this rule until no more new items can be added
to closure(I).
Goto operation:
Goto(I, X) is defined to be the closure of the set of all items [A→ αX . β] such
that [A→ α . Xβ] is in I.
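The closure and goto operations translate almost directly into code. The sketch below assumes the augmented expression grammar E' → E, E → E + T | T, T → T * F | F, F → (E) | id and represents an item as a pair (production index, dot position); these representation choices and the function names are illustrative. Repeatedly applying goto from closure({E' → . E}) yields the canonical collection of sets of items.

```python
GRAMMAR = [                      # production 0 is the augmenting one
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def closure(items):
    """Closure of a set of items; an item is (production index, dot position)."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        body = GRAMMAR[prod][1]
        if dot < len(body) and body[dot] in NONTERMINALS:
            # The dot stands before nonterminal B: add every item B -> . γ
            for k, (head, _) in enumerate(GRAMMAR):
                if head == body[dot] and (k, 0) not in result:
                    result.add((k, 0))
                    work.append((k, 0))
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(p, d + 1) for p, d in items
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(moved)


def canonical_collection():
    start = closure({(0, 0)})
    states, work = [start], [start]
    while work:
        state = work.pop(0)
        symbols = {GRAMMAR[p][1][d] for p, d in state if d < len(GRAMMAR[p][1])}
        for x in sorted(symbols):
            target = goto(state, x)
            if target not in states:
                states.append(target)
                work.append(target)
    return states


print(len(canonical_collection()))  # 12 sets of items for this grammar
```

The order in which this sketch discovers the sets need not match the usual numbering I0 to I11; only the sets themselves and the transitions between them are determined by the grammar.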
Steps to construct SLR parsing table for grammar G are:
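1. Augment G and produce G’.
2. Construct the canonical collection of sets of items C for G’.
3. Construct the parsing action function action and the goto function using the following
algorithm, which requires FOLLOW(A) for each non-terminal of the grammar.
Algorithm for construction of the SLR parsing table:
1. Construct C = {I0, I1, …, In}, the collection of sets of LR(0) items for G’.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
(a) If [A → α . aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to “shift j”. Here a must
be a terminal.
(b) If [A → α .] is in Ii, then set action[i, a] to “reduce A → α” for all a in FOLLOW(A); here A
may not be S’.
(c) If [S’ → S .] is in Ii, then set action[i, $] to “accept”.
If any conflicting actions are generated by the above rules, we say the grammar is not SLR(1).
3. The goto transitions for state i are constructed for all non-terminals A using the rule: If
goto(Ii, A) = Ij, then goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made “error”.
5. The initial state of the parser is the one constructed from the set of items containing
[S’ → . S].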
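Canonical collection of sets of LR(0) items for the augmented expression grammar:
I0 : E’ → . E
     E → . E + T
     E → . T
     T → . T * F
     T → . F
     F → . ( E )
     F → . id
GOTO ( I0 , E )
I1 : E’ → E .
     E → E . + T
GOTO ( I0 , T )
I2 : E → T .
     T → T . * F
GOTO ( I0 , F )
I3 : T → F .
GOTO ( I0 , ( )
I4 : F → ( . E )
     E → . E + T
     E → . T
     T → . T * F
     T → . F
     F → . ( E )
     F → . id
GOTO ( I0 , id )
I5 : F → id .
GOTO ( I1 , + )
I6 : E → E + . T
     T → . T * F
     T → . F
     F → . ( E )
     F → . id
GOTO ( I2 , * )
I7 : T → T * . F
     F → . ( E )
     F → . id
GOTO ( I4 , E )
I8 : F → ( E . )
     E → E . + T
GOTO ( I4 , T ) = I2    GOTO ( I4 , F ) = I3    GOTO ( I4 , ( ) = I4    GOTO ( I4 , id ) = I5
GOTO ( I6 , T )
I9 : E → E + T .
     T → T . * F
GOTO ( I6 , F ) = I3    GOTO ( I6 , ( ) = I4    GOTO ( I6 , id ) = I5
GOTO ( I7 , F )
I10 : T → T * F .
GOTO ( I7 , ( ) = I4    GOTO ( I7 , id ) = I5
GOTO ( I8 , ) )
I11 : F → ( E ) .
GOTO ( I8 , + ) = I6
GOTO ( I9 , * ) = I7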
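Grammar (productions numbered as used in the reduce entries of the table):
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id
FOLLOW (E) = { $ , ) , + }
FOLLOW (T) = { $ , + , ) , * }
FOLLOW (F) = { * , + , ) , $ }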
SLR parsing table:
              ACTION                            GOTO
State   id    +     *     (     )     $      E    T    F
I0      s5                s4                 1    2    3
I1            s6                      ACC
I2            r2    s7          r2    r2
I3            r4    r4          r4    r4
I4      s5                s4                 8    2    3
I5            r6    r6          r6    r6
I6      s5                s4                      9    3
I7      s5                s4                           10
I8            s6                s11
I9            r1    s7          r1    r1
I10           r3    r3          r3    r3
I11           r5    r5          r5    r5
Stack implementation:
Stack                      Input             Action
0                          id + id * id $    action[0, id] = s5 ; shift
0 id 5                     + id * id $       action[5, +] = r6 ; reduce by F → id
0 F 3                      + id * id $       goto[0, F] = 3 ; action[3, +] = r4 ; reduce by T → F
0 T 2                      + id * id $       goto[0, T] = 2 ; action[2, +] = r2 ; reduce by E → T
0 E 1                      + id * id $       goto[0, E] = 1 ; action[1, +] = s6 ; shift
0 E 1 + 6                  id * id $         action[6, id] = s5 ; shift
0 E 1 + 6 id 5             * id $            action[5, *] = r6 ; reduce by F → id
0 E 1 + 6 F 3              * id $            goto[6, F] = 3 ; action[3, *] = r4 ; reduce by T → F
0 E 1 + 6 T 9              * id $            goto[6, T] = 9 ; action[9, *] = s7 ; shift
0 E 1 + 6 T 9 * 7          id $              action[7, id] = s5 ; shift
0 E 1 + 6 T 9 * 7 id 5     $                 action[5, $] = r6 ; reduce by F → id
0 E 1 + 6 T 9 * 7 F 10     $                 goto[7, F] = 10 ; action[10, $] = r3 ; reduce by T → T * F
0 E 1 + 6 T 9              $                 goto[6, T] = 9 ; action[9, $] = r1 ; reduce by E → E + T
0 E 1                      $                 goto[0, E] = 1 ; action[1, $] = accept
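The trace above can be reproduced with a small table-driven driver. The sketch below hard-codes the SLR table for the expression grammar and follows the LR parsing algorithm, with one simplification that is an assumption of this sketch: the stack holds states only, so a reduction by A → β pops |β| entries instead of 2*|β|. The helper row and the dictionary layout are illustrative choices.

```python
PRODUCTIONS = {            # number: (head, length of body, text)
    1: ("E", 3, "E -> E + T"),
    2: ("E", 1, "E -> T"),
    3: ("T", 3, "T -> T * F"),
    4: ("T", 1, "T -> F"),
    5: ("F", 3, "F -> ( E )"),
    6: ("F", 1, "F -> id"),
}


def row(spec):
    """Turn 'id:s5 (:s4' into {'id': 's5', '(': 's4'}."""
    return dict(entry.split(":") for entry in spec.split())


ACTION = {
    0: row("id:s5 (:s4"),
    1: row("+:s6 $:acc"),
    2: row("+:r2 *:s7 ):r2 $:r2"),
    3: row("+:r4 *:r4 ):r4 $:r4"),
    4: row("id:s5 (:s4"),
    5: row("+:r6 *:r6 ):r6 $:r6"),
    6: row("id:s5 (:s4"),
    7: row("id:s5 (:s4"),
    8: row("+:s6 ):s11"),
    9: row("+:r1 *:s7 ):r1 $:r1"),
    10: row("+:r3 *:r3 ):r3 $:r3"),
    11: row("+:r5 *:r5 ):r5 $:r5"),
}
GOTO = {
    (0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
    (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
    (6, "T"): 9, (6, "F"): 3,
    (7, "F"): 10,
}


def lr_parse(tokens):
    """Return the productions used, in order of reduction, or raise SyntaxError."""
    stack = [0]                        # states only; symbols are implied
    tokens = list(tokens) + ["$"]
    i, output = 0, []
    while True:
        act = ACTION[stack[-1]].get(tokens[i])
        if act is None:                # blank table entry
            raise SyntaxError(f"unexpected {tokens[i]!r} at position {i}")
        if act == "acc":
            return output
        if act[0] == "s":              # shift: push the new state
            stack.append(int(act[1:]))
            i += 1
        else:                          # reduce by production number act[1:]
            head, size, text = PRODUCTIONS[int(act[1:])]
            del stack[-size:]
            stack.append(GOTO[(stack[-1], head)])
            output.append(text)


for production in lr_parse(["id", "+", "id", "*", "id"]):
    print(production)
```

The productions are printed in the order F → id, T → F, E → T, F → id, T → F, F → id, T → T * F, E → E + T, which is the rightmost derivation of id + id * id in reverse.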