Unit 2 Notes
UNIT-II
Syllabus:
Syntax Analysis (Parser): The Role of the Parser, Syntax Error Handling and
Recovery, Top-Down Parsing, Bottom-Up Parsing, Simple LR Parsing, More
Powerful LR Parsing, Using Ambiguous Grammars, Parser Generator YACC.
Objective: Design top-down and bottom-up parsers
Outcome: For a given parser specification, design top-down and bottom-up parsers.
Parsing: determining the syntax, that is the structure, of a program. The parse tree
specifies the order in which the statements are executed.
Structure of the syntax tree depends on the syntactic structure of the programming
language.
Logical Errors: Incorrect reasoning on the part of the programmer, for example the use of
the assignment operator = instead of the comparison operator ==.
1. A symbol can be removed from the grammar if it is not used in producing any
string of the language (useless non-terminal and terminal symbols).
S → 0T | 1T | X | 0 | 1
T → 00
Here X derives no terminal string, so the alternative S → X can be removed.
2. Unit productions (Non-terminal → Non-terminal) can be removed.
A → B, B → a | b => A → a | b
3. If ε is not in the language, ε-productions can be removed.
Non-terminal → ε
S → 0S | 1S | ε => S → 0S | 1S | 0 | 1
Ambiguity
A grammar is ambiguous if there is some sentence that has more than one parse
tree, or equivalently more than one leftmost derivation or more than one rightmost
derivation.
Any parser for an ambiguous grammar has to choose somehow which tree to
return. There are a number of solutions to this: the parser could pick one arbitrarily,
or we can provide some hints about which to choose. Best of all is to rewrite the
grammar so that it is not ambiguous.
There is no general method for removing ambiguity. Ambiguity is acceptable in
spoken languages. Ambiguous programming languages are useless unless the
ambiguity can be resolved.
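Left Recursion:
If there is a non-terminal A such that there is a derivation A ⇒+ Aα for some
string α, then the grammar is left recursive. Immediate left recursion of the form
A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
where no βi begins with A, is removed by replacing these productions with
A → β1 A' | β2 A' | ... | βn A'
A' → α1 A' | α2 A' | ... | αm A' | ε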
Example 1:
Remove the left recursion from the productions:
E→E+T|T
T→T*F|F
Applying the transformation yields:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
Example 2:
Remove the left recursion from the productions:
E→E+T|E–T|T
T → T * F | T/F | F
Applying the transformation yields:
E → T E'
E' → + T E' | - T E' | ε
T → F T'
T' → * F T' | / F T' | ε
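The transformation used in the two examples can be sketched in C as follows. This is only an illustrative sketch under simplifying assumptions that are not part of the notes: every grammar symbol is a single character, the string "eps" stands for ε, and the function name remove_left_recursion and its output format are hypothetical. It handles immediate left recursion only.

```c
#include <stdio.h>
#include <string.h>

/* Removes immediate left recursion from the productions of `head`.
   Assumption: each grammar symbol is one character; "eps" denotes epsilon.
   Writes the new productions to `out` and returns 1, or returns 0 and
   leaves `out` untouched if no alternative is left recursive. */
int remove_left_recursion(char head, const char *alts[], int n,
                          char *out, size_t cap)
{
    char rec[256] = "";     /* alternatives of A': alpha_i A' */
    char base[256] = "";    /* alternatives of A : beta_j A'  */
    char piece[64];

    for (int i = 0; i < n; i++) {
        if (alts[i][0] == head) {           /* A -> A alpha */
            snprintf(piece, sizeof piece, "%s%c' | ", alts[i] + 1, head);
            strncat(rec, piece, sizeof rec - strlen(rec) - 1);
        } else {                            /* A -> beta */
            snprintf(piece, sizeof piece, "%s%s%c'",
                     base[0] ? " | " : "", alts[i], head);
            strncat(base, piece, sizeof base - strlen(base) - 1);
        }
    }
    if (rec[0] == '\0')
        return 0;                           /* nothing to rewrite */
    snprintf(out, cap, "%c -> %s\n%c' -> %seps", head, base, head, rec);
    return 1;
}
```

Called with head 'E' and the alternatives "E+T", "E-T" and "T", the function produces the E and E' productions of Example 2 (written without spaces between symbols).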
Left Factoring:
Left factoring is a grammar transformation that is useful for producing a grammar
suitable for predictive parsing.
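When it is not clear which of two alternative productions to use to expand a
non-terminal A, we may be able to rewrite the productions to defer the decision until
we have seen enough of the input to make the right choice.
Algorithm:
For each non-terminal A, find the longest prefix α that occurs in two or more right-hand sides of A.
If α ≠ ε, replace the productions A → αβ1 | αβ2 | ... | αβn | γ, where γ stands for the
alternatives that do not begin with α, by
A → α A' | γ
A' → β1 | β2 | ... | βn
Repeat until no two alternatives for a non-terminal have a common prefix.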
The functions FIRST and FOLLOW allow us to fill in the entries of a parsing table
for G, whenever possible. Sets of tokens yielded by the FOLLOW function can
also be used as synchronizing tokens during error recovery.
FIRST(α), for a string of grammar symbols α: It is the set of terminals that begin the
strings derived from α. If α can derive ε, then ε is also in FIRST(α).
FOLLOW(A), for a nonterminal A: It is the set of terminals a that can appear
immediately to the right of A in some sentential form, that is, the set of terminals
a such that there exists a derivation of the form S ⇒* αAaβ for some α and β.
• If A can be the rightmost symbol in some sentential form, then $ is in
FOLLOW(A).
Computation of FOLLOW():
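To compute FOLLOW(A) for all nonterminals A, apply the following rules until
nothing can be added to any FOLLOW set.
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right end marker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε,
then everything in FOLLOW(A) is in FOLLOW(B).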
Example:
A -> BC | EFGH | H
B -> b
C -> c | ε
E -> e | ε
F -> CE
G -> g
H -> h | ε
Solution:
follow(A) = {$}
follow(B) = (first(C) – { ε }) U follow(A) = {c, $}
follow(G) = first(H) – { ε } U follow(A)
={h, ε } – { ε } U {$} = {h, $}
follow(H) = follow(A) = {$}
follow(F) = first(GH) = {g}
follow(E) = first(FGH) U follow(F)
= ((first(F) – { ε }) U first(GH)) U follow(F)
= {c, e} U {g} U {g}
= {c, e, g}
follow(C) = follow(A) U (first (E) – { ε }) U follow (F)
= { $ } U { e } U {g}
= {e, g, $}
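The hand computation above can be cross-checked with a small fixed-point program. The sketch below is an assumption-laden illustration, not a prescribed method: nonterminals are the upper-case letters and terminals the lower-case letters of the example grammar, sets are stored as bit masks, and the names Prod, first_of and compute_sets are hypothetical.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#define EPS (UINT32_C(1) << 26)           /* marks epsilon in a FIRST set */
#define END (UINT32_C(1) << 27)           /* marks $ in a FOLLOW set */
#define BIT(c) (UINT32_C(1) << ((c) - 'a'))

typedef struct { char head; const char *body; } Prod;  /* "" is an epsilon body */

/* The example grammar from the notes. */
static const Prod G[] = {
    {'A', "BC"}, {'A', "EFGH"}, {'A', "H"}, {'B', "b"},
    {'C', "c"},  {'C', ""},     {'E', "e"}, {'E', ""},
    {'F', "CE"}, {'G', "g"},    {'H', "h"}, {'H', ""},
};
enum { NPROD = sizeof G / sizeof G[0] };

static uint32_t first[26], follow[26];    /* indexed by nonterminal - 'A' */

/* FIRST of a string of grammar symbols, using the current first[] table. */
static uint32_t first_of(const char *s)
{
    uint32_t set = 0;
    for (; *s; s++) {
        uint32_t f = islower((unsigned char)*s) ? BIT(*s) : first[*s - 'A'];
        set |= f & ~EPS;
        if (!(f & EPS))
            return set;                   /* symbol is not nullable: stop */
    }
    return set | EPS;                     /* all symbols nullable, or s empty */
}

void compute_sets(char start)
{
    int changed = 1;
    memset(first, 0, sizeof first);
    memset(follow, 0, sizeof follow);
    while (changed) {                     /* FIRST sets: iterate to a fixed point */
        changed = 0;
        for (int i = 0; i < NPROD; i++) {
            uint32_t *f = &first[G[i].head - 'A'], old = *f;
            *f |= first_of(G[i].body);
            if (*f != old) changed = 1;
        }
    }
    follow[start - 'A'] = END;            /* $ follows the start symbol */
    changed = 1;
    while (changed) {                     /* FOLLOW sets: iterate to a fixed point */
        changed = 0;
        for (int i = 0; i < NPROD; i++)
            for (const char *p = G[i].body; *p; p++) {
                if (!isupper((unsigned char)*p)) continue;
                uint32_t rest = first_of(p + 1);
                uint32_t *fo = &follow[*p - 'A'], old = *fo;
                *fo |= rest & ~EPS;       /* FIRST of what follows, minus epsilon */
                if (rest & EPS)           /* the rest can vanish: add FOLLOW(head) */
                    *fo |= follow[G[i].head - 'A'];
                if (*fo != old) changed = 1;
            }
    }
}
```

After compute_sets('A'), follow['C' - 'A'] holds the bits for e, g and $, which agrees with the solution worked out by hand.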
TOP-DOWN PARSING:
Top down parsing is the construction of a Parse tree by starting at start symbol and
“guessing” each derivation until we reach a string that matches input. That is,
construct tree from root to leaves.
The advantage of top-down parsing is that a parser can be written directly as a
program. Table-driven top-down parsers are of minor practical relevance.
1) Backtracking
2) Recursive descent parser
3) Predictive LL(1) parser
Top-down parsing can be viewed as an attempt to find a left most derivation for an
input string. Equivalently, it can be viewed as an attempt to construct a parse tree
for the input starting from the root and creating the nodes of the parse tree in
preorder.
The general form of top-down parsing, called recursive descent, may involve
backtracking, that is, making repeated scans of the input. A special case of
recursive-descent parsing, called predictive parsing, requires no backtracking.
Predictive parsing works only on grammars where the first
terminal symbol of each subexpression provides enough information to choose
which production to use.
Backtracking
It will try different production rules to obtain the input string.
Backtracking is powerful, but slower.
Backtracking is not preferred for practical compilers.
Consider the grammar
S → cAd
A → ab | a
and the input string w = cad. To construct a parse tree for this string top-down, we
initially create a tree consisting of a single node labeled S. An input pointer points to
c, the first symbol of w. We then use the first production for S to expand the tree and
obtain the tree of Fig (a).
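[Figure: steps in the top-down parse of w = cad. (a) S with children c, A, d. (b) A expanded to a b. (c) A expanded to a.]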
In Fig (a) the leftmost leaf, labeled c, matches the first symbol of w, so we
advance the input pointer to a, the second symbol of w, and consider the next leaf,
labeled A. We can then expand A using the first alternative for A to obtain the tree
in Fig (b). We now have a match for the second input symbol, so we advance the input
pointer to d, the third input symbol, and compare d against the next leaf, labeled b.
Since b does not match d, we report failure and go back to A to see whether there
is another alternative for A that we have not tried but that might produce a match.
In going back to A, we must reset the input pointer to position 2. We now try the second
alternative for A to obtain the tree of Fig (c). The leaf a matches the second symbol of w and
the leaf d matches the third symbol.
A left-recursive grammar can cause a recursive-descent parser, even one with
backtracking, to go into an infinite loop. That is, when we try to expand A, we may
eventually find ourselves again trying to expand A without having consumed any
input.
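The backtracking steps traced above for w = cad can be sketched as a small recognizer. This is a minimal sketch assuming the grammar S → cAd, A → ab | a that the walkthrough describes; the function names parse_S, parse_A and match_at are hypothetical, and the code only answers yes or no without building the tree.

```c
static const char *input;   /* the string being parsed */

/* Returns the position after t if input[pos] is t, otherwise -1. */
static int match_at(int pos, char t)
{
    return input[pos] == t ? pos + 1 : -1;
}

/* A -> ab | a. `alt` selects the alternative to try (0: ab, 1: a).
   Returns the position after A, or -1 if this alternative fails. */
static int parse_A(int pos, int alt)
{
    int p = match_at(pos, 'a');
    if (p < 0)
        return -1;
    return alt == 0 ? match_at(p, 'b') : p;
}

/* S -> cAd. Returns 1 if w is derived from S, 0 otherwise. */
int parse_S(const char *w)
{
    input = w;
    int p = match_at(0, 'c');
    if (p < 0)
        return 0;
    for (int alt = 0; alt < 2; alt++) {   /* try A -> ab, then A -> a */
        int q = parse_A(p, alt);          /* always restart A at saved position p */
        if (q < 0)
            continue;
        q = match_at(q, 'd');
        if (q >= 0 && input[q] == '\0')
            return 1;
        /* mismatch after A: fall through, which resets the input to p */
    }
    return 0;
}
```

For "cad" the first alternative fails at b against d, the position is reset to the saved p, and the second alternative succeeds, mirroring Fig (b) and Fig (c).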
Predictive Parser: Predictive parser tries to predict the construction of tree using
one or more lookahead symbols from the input string.
• Recursive Descent Parser:
• LL(1) Parser:
procedure match(token t)
{
    if lookahead == t
        lookahead = nexttoken;
    else
        error();
}
procedure error()
{
    print("error");
}
This approach is also called predictive parsing. For each nonterminal and lookahead
token there must be at most one applicable production, so that backtracking is avoided.
If there is no such production then no parse tree exists, and an error is returned.
The crucial property is that the grammar must not be left-recursive. Predictive
parsing works well on those fragments of programming languages in which keywords
occur frequently.
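A complete recursive-descent predictive parser built around match() and error() might look like the sketch below. It assumes the left-recursion-free grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε from Example 1, plus F → ( E ) | digit, which is an assumption borrowed from the calculator grammar. Tokens are single characters and the entry point parse() is a hypothetical name.

```c
#include <ctype.h>

static const char *src;     /* points at the current input character */
static int lookahead;       /* current token; 0 marks end of input */
static int failed;          /* set once error() has been called */

static void error(void) { failed = 1; }

/* Consume the lookahead if it is t, otherwise flag an error. */
static void match(int t)
{
    if (lookahead == t && t != '\0')
        lookahead = (unsigned char)*++src;
    else
        error();
}

static void E(void);

static void F(void)                       /* F -> ( E ) | digit */
{
    if (lookahead == '(') { match('('); E(); match(')'); }
    else if (isdigit(lookahead)) match(lookahead);
    else error();
}

static void Tprime(void)                  /* T' -> * F T' | epsilon */
{
    if (lookahead == '*') { match('*'); F(); Tprime(); }
    /* otherwise choose the epsilon production: consume nothing */
}

static void T(void) { F(); Tprime(); }    /* T -> F T' */

static void Eprime(void)                  /* E' -> + T E' | epsilon */
{
    if (lookahead == '+') { match('+'); T(); Eprime(); }
}

static void E(void) { T(); Eprime(); }    /* E -> T E' */

/* Returns 1 if text is a sentence of the grammar, 0 otherwise. */
int parse(const char *text)
{
    src = text;
    lookahead = (unsigned char)*src;
    failed = 0;
    E();
    return !failed && lookahead == '\0';
}
```

Each procedure picks its production by looking only at lookahead, so the input is never rescanned.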
The model of predictive parser is as follows:
• Stack
• Input
• Parsing Table
• Output
The input buffer consists of the string to be parsed, followed by $, a symbol used as
a right end marker to indicate the end of the input string.
The stack consists of a sequence of grammar symbols with $ on the bottom,
indicating the bottom of the stack. Initially the stack consists of the start symbol of
the grammar on the top of $.
Recursive descent and LL parsers are often called predictive parsers because they
operate by predicting the next step in a derivation.
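The parser is driven by X, the symbol on top of the stack, and a, the current input symbol.
If X = a = $, the parser halts and announces success. If X = a ≠ $, the parser pops X and
advances the input pointer. If X is a nonterminal, the parser consults the parsing table
entry for X and a and replaces X on the stack by the body of that production, or reports
an error if the entry is empty.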
LL(1) Parser:
It is a non-recursive predictive parser and a table driven parser.
The first L stands for “Left-to-right scan of input”. The second L stands for “Left-most
derivation”. The ‘1’ stands for “1 token of look ahead”.
• No LL (1) grammar can be ambiguous or left recursive.
• Input Buffer: it is an array used to place input string with $ at the end of string.
• Stack: It is used to hold the left sentential form. The RHS of a production rule
is pushed onto the stack in reverse order (from right to left).
o Initially stack contains $ on the top of stack.
• Parsing Table: It is a 2D array which contains all nonterminals on rows side and
terminals and $ on column side.
If no cell of the table contains more than one entry, we say that the given
grammar is LL(1).
Parsing an input string using LL(1) parser and table:
Consider the input string ‘abba’ and the grammar S → aABC, A → bb, B → a, C → ε. Parsing steps are:
• Initially stack contains $ on the top.
• Place the input string in input buffer with $ at the end of string.
• Parsing table is used to decide which production rule is used to parse the
string.
• If the stack top is a nonterminal, replace it with its RHS in reverse order.
• If the stack top is a terminal and it is the same as the current input symbol,
then pop that terminal and advance the input.
Stack     Input     Parsing Action
$         abba$     push S on to the stack
$S        abba$     S -> aABC, so replace S with RHS, push RHS in reverse order
$CBAa     abba$     a is on stack top and a is current input symbol, so pop a
$CBA      bba$      A -> bb, so replace A with RHS in reverse order
$CBbb     bba$      pop b
$CBb      ba$       pop b
$CB       a$        B -> a
$Ca       a$        pop a
$C        $         C -> ε, so pop C
$         $         String is accepted
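The trace above can be reproduced with a short table-driven parser. The sketch assumes the grammar read off from the trace (S → aABC, A → bb, B → a, C → ε) and encodes its parsing table directly in a hypothetical helper function table(); ll1_parse() is likewise a hypothetical name. Upper-case letters are nonterminals and the input must end with $.

```c
#include <string.h>

/* Parsing table M[X, a]: returns the body to expand X with on lookahead a,
   or NULL for an error entry. "" encodes an epsilon body. */
static const char *table(char X, char a)
{
    if (X == 'S' && a == 'a') return "aABC";
    if (X == 'A' && a == 'b') return "bb";
    if (X == 'B' && a == 'a') return "a";
    if (X == 'C' && a == '$') return "";
    return NULL;
}

/* Returns 1 if w (terminated by '$') is accepted, 0 otherwise. */
int ll1_parse(const char *w)
{
    char stack[128] = "$S";           /* bottom marker, then the start symbol */
    int top = 1;                      /* index of the top of the stack */

    for (;;) {
        char X = stack[top], a = *w;
        if (X == '$')
            return a == '$';          /* accept only if the input is used up */
        if (X >= 'A' && X <= 'Z') {   /* nonterminal: expand via the table */
            const char *body = table(X, a);
            if (body == NULL)
                return 0;
            size_t n = strlen(body);
            top--;                    /* pop X */
            if ((size_t)top + n >= sizeof stack)
                return 0;             /* stack would overflow */
            for (size_t i = n; i > 0; i--)
                stack[++top] = body[i - 1];   /* push the RHS in reverse */
        } else {                      /* terminal: must match the input */
            if (X != a)
                return 0;
            top--;
            w++;
        }
    }
}
```

Calling ll1_parse("abba$") goes through the same stack contents as the table: $S, $CBAa, $CBA, $CBbb, $CBb, $CB, $Ca, $C, $.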
BOTTOM-UP PARSING:
1) Shift Reduce Parser (SRD)
A shift-reduce parser uses a parse stack which (conceptually) contains grammar
symbols.
During the operation of the parser, symbols from the input are shifted onto
the stack. If the symbols on top of the stack (the handle) match the RHS
of a grammar rule, then the parser reduces the RHS of the rule to its LHS, replacing
the RHS symbols on top of the stack with the nonterminal occurring on the LHS of
the rule. This shift-reduce process continues until the parser terminates,
reporting either success or failure. It terminates with success when the input is legal
and is accepted by the parser. It terminates with failure if an error is detected in the
input.
The basic operations performed in shift-reduce parsing are:
Shift Operation: Shifts the current input symbol onto the stack until a reduction can
be applied.
Reduce Operation: If a specific substring (the handle) matching the body of a
production (RHS) appears on the top of the stack, it is replaced by the
nonterminal at the head of the production (LHS).
Accept: The input string is parsed completely (i.e., the start symbol is on the top of
the stack and the current input symbol is $).
Error: If a handle is not found on top of the stack or input is left in the input buffer, it is
reported as an error. The string is not parsed.
The key decisions during bottom-up parsing are about when to reduce and about
what production to apply.
A reduction is the reverse of a step in a derivation.
The goal of a bottom-up parser is to construct a rightmost derivation in reverse.
Handle: It is string or a substring that matches with any of the RHS part of the
production rule in the given grammar.
Handle pruning: Replacing Handle with LHS of the production rule.
A Handle is a substring that matches the body of a production and whose reduction
represents one step along the reverse of a rightmost derivation.
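Example: E=>T=>T*F=>T*id=>F*id=>id*id
Right Sentential Form    Handle    Reducing Production
id*id                    id        F->id
F*id                     F         T->F
T*id                     id        F->id
T*F                      T*F       T->T*F
T                        T         E->T
E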
Stack     Input     Action
$         id*id$    shift
$id       *id$      reduce by F->id
$F        *id$      reduce by T->F
$T        *id$      shift
$T*       id$       shift
$T*id     $         reduce by F->id
$T*F      $         reduce by T->T*F
$T        $         reduce by E->T
$E        $         accept
LR PARSERS
In LR parser, "L" is for left-to-right scanning of the input and the "R" is for
constructing a rightmost derivation in reverse.
The parser uses a stack to store states, input symbols and grammar symbols. The
combination of the state symbol on top of the stack and the current input symbol
is used to index the parsing table and determine the shift/reduce parsing
decision.
The parsing table consists of two parts: action part and goto part.
The action table is a table with rows indexed by states and columns indexed by
terminal symbols. When the parser is in some state s and the current lookahead
terminal is t, the action taken by the parser depends on the contents of action[s][t],
which can contain four different kinds of entries:
• Shift s': Shift state s' onto the parse stack.
• Reduce r: Reduce by rule r. This is explained in more detail below.
• Accept: Terminate the parse with success, accepting the input.
• Error: Signal a parse error
The goto table is a table with rows indexed by states and columns indexed by
nonterminal symbols. When the parser is in state s immediately after reducing by
a rule whose head is the nonterminal N, the next state to enter is given by goto[s][N].
Augmented Grammar:
If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with
a new start symbol S' and the production S' -> S.
The purpose of this new starting production is to indicate to the parser when it
should stop parsing and announce acceptance of the input, i.e., acceptance occurs
when and only when the parser is about to reduce by S' -> S.
States of an LR parser
States represent sets of items. An LR(0) item of G is a production of G with a dot at
some position of the body:
For A->XYZ we have following LR(0) items
A->.XYZ
A->X.YZ
A->XY.Z
A->XYZ.
Note: It is just an example, complete item sets are specified in the above example.
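As a small illustration of how items are formed, the sketch below writes the item for a given dot position. It assumes single-character grammar symbols, and the names lr0_item and lr0_item_count are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Writes the LR(0) item of head->body whose dot stands before body[dot].
   Assumption: each grammar symbol is a single character and dot <= strlen(body). */
void lr0_item(char head, const char *body, size_t dot, char *out, size_t cap)
{
    snprintf(out, cap, "%c->%.*s.%s", head, (int)dot, body, body + dot);
}

/* A production whose body has n symbols yields n + 1 items. */
size_t lr0_item_count(const char *body)
{
    return strlen(body) + 1;
}
```

Looping dot from 0 to 3 for A->XYZ yields the four items listed above.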
1. Simple LR parser(SLR):
Example:
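Productions: 1. E -> E + T   2. E -> T   3. T -> T * F   4. T -> F   5. F -> ( E )   6. F -> id
(Sn = shift and go to state n, Rn = reduce by production n)
        Action                              Goto
State   id    +     *     (     )     $     E    T    F
0       S5                S4                1    2    3
1             S6                      Acc
2             R2    S7          R2    R2
3             R4    R4          R4    R4
4       S5                S4                8    2    3
5             R6    R6          R6    R6
6       S5                S4                     9    3
7       S5                S4                          10
8             S6                S11
9             R1    S7          R1    R1
10            R3    R3          R3    R3
11            R5    R5          R5    R5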
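The way an LR parser consults this table can be sketched as a small driver. The encoding is an assumption made for illustration: the token id is written as the character i, positive entries shift, negative entries reduce by that production number, 0 is an error, and the productions are numbered 1. E -> E + T, 2. E -> T, 3. T -> T * F, 4. T -> F, 5. F -> ( E ), 6. F -> id. The names lr_parse, go_to, head and len are hypothetical.

```c
#include <string.h>

enum { ACC = 100 };   /* accept marker; >0 shift, <0 reduce, 0 error */

/* Action table. Columns: id + * ( ) $ */
static const int action[12][6] = {
    { 5,  0,  0,  4,  0,   0},
    { 0,  6,  0,  0,  0, ACC},
    { 0, -2,  7,  0, -2,  -2},
    { 0, -4, -4,  0, -4,  -4},
    { 5,  0,  0,  4,  0,   0},
    { 0, -6, -6,  0, -6,  -6},
    { 5,  0,  0,  4,  0,   0},
    { 5,  0,  0,  4,  0,   0},
    { 0,  6,  0,  0, 11,   0},
    { 0, -1,  7,  0, -1,  -1},
    { 0, -3, -3,  0, -3,  -3},
    { 0, -5, -5,  0, -5,  -5},
};
/* Goto table. Columns: E T F; 0 means no entry. */
static const int go_to[12][3] = {
    {1, 2, 3}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {8, 2, 3}, {0, 0, 0},
    {0, 9, 3}, {0, 0, 10}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0},
};
/* For productions 1..6: head column (E=0, T=1, F=2) and body length. */
static const int head[7] = {0, 0, 0, 1, 1, 2, 2};
static const int len[7]  = {0, 3, 1, 3, 1, 3, 1};

/* tokens uses 'i' for id and must end with '$'. Returns 1 on accept. */
int lr_parse(const char *tokens)
{
    static const char cols[] = "i+*()$";
    int stack[128], top = 0;
    stack[0] = 0;                                  /* start in state 0 */
    for (;;) {
        const char *c = *tokens ? strchr(cols, *tokens) : NULL;
        if (c == NULL)
            return 0;                              /* unknown token or missing $ */
        int act = action[stack[top]][c - cols];
        if (act == ACC)
            return 1;
        if (act > 0) {                             /* shift */
            if (top + 1 >= 128)
                return 0;
            stack[++top] = act;
            tokens++;
        } else if (act < 0) {                      /* reduce by production -act */
            top -= len[-act];                      /* pop one state per body symbol */
            stack[top + 1] = go_to[stack[top]][head[-act]];
            top++;
        } else {
            return 0;                              /* empty entry: syntax error */
        }
    }
}
```

For the input i*i$ the driver performs the same shifts and reductions as the shift-reduce trace of id*id, with states on the stack in place of grammar symbols.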
2. CLR Parser
Example:
S -> CC
C -> cC | d
1. Number the grammar productions:
1. S -> CC
2. C -> cC
3. C -> d
2. The augmented grammar is:
S' -> S
S -> CC
C -> cC
C -> d
We match the item [S' ->.S, $] with the term [A ->α.Bβ, a] in the procedure closure,
i.e.,
A = S'
α = ε
B = S
β = ε
a = $
Function closure tells us to add [B->.γ, b] for each production B->γ and terminal b in FIRST(βa).
Now B->γ must be S->CC, and since β is ε and a is $, b may only be $. Thus we add
S->.CC, $
We continue to compute the closure by adding all items [C->.γ, b] for b in FIRST(C$), i.e.,
matching [S->.CC, $] against [A ->α.Bβ, a] we have A = S, α = ε, B = C, β = C and a = $.
Since C does not derive ε, FIRST(C$) = FIRST(C) = {c, d}.
We add items:
C->.cC, c
C->.cC, d
C->.d, c
C->.d, d
None of the new items have a non-terminal immediately to the right of the dot,
so we have completed our first set of LR(1) items. The initial I0 items are:
I0:
S'->.S, $
S->.CC, $
C->.cC, c/d
C->.d, c/d
Now we start computing goto (I0,X) for the various grammar symbols X, i.e.,
Goto (I0,S):
I1 : S'->S., $ -> reduced item.
Goto (I0,C): I2 :
S->C.C, $
C->.cC,$
C->.d,$
Goto (I0,c) : I3 :
C->c.C, c/d
C->.cC, c/d
C->.d, c/d
Goto (I0,d) : I4 :
C->d., c/d -> reduced item.
Goto (I2,C) : I5 :
S->CC., $ -> reduced item.
Goto (I2,c) : I6
C->c.C, $
C->.cC, $
C->.d, $
Goto (I2,d) : I7
C->d., $ -> reduced item.
Goto (I3,C) : I8
C->cC., c/d -> reduced item.
Goto (I3,c) : I3
C->c.C, c/d
C->.cC, c/d
C->.d, c/d
Goto (I3,d) : I4
C->d., c/d -> reduced item.
Goto (I6,C) : I9
C->cC., $ -> reduced item.
Goto (I6,c) : I6
C->c.C, $
C->.cC, $
C->.d, $
Goto (I6,d) : I7
C->d., $ -> reduced item.
All item sets have now been computed, so we construct the canonical LR(1) parsing
table:
        Action            Goto
State   c     d     $     S    C
0       S3    S4          1    2
1                   Acc
2       S6    S7               5
3       S3    S4               8
4       r3    r3
5                   r1
6       S6    S7               9
7                   r3
8       r2    r2
9                   r2
Here there is no need to find the FOLLOW() sets, as we have already attached a
lookahead to each item while constructing the states.
3. LALR Parser
Example:
1. Construct the collection of sets of LR(1) items (I0 to I9 of the CLR example).
2. For each core present among the sets of LR(1) items, find all sets having that core,
and replace these sets by their union (merge them into a single set).
I0 -> same as previous
I1 -> same as previous
I2 -> same as previous
I36 - Clubbing item sets I3 and I6 into one item set I36.
C->c.C, c/d/$
C->.cC, c/d/$
C->.d, c/d/$
I5 -> same as previous
I47 - Clubbing item sets I4 and I7 into one item set I47
C->d., c/d/$
I89 - Clubbing item sets I8 and I9 into one item set I89
C->cC., c/d/$
LALR Parsing table construction:
        Action            Goto
State   c     d     $     S    C
0       S36   S47         1    2
1                   Accept
2       S36   S47              5
36      S36   S47              89
47      r3    r3    r3
5                   r1
89      r2    r2    r2
YACC (Yet Another Compiler Compiler) is one such automatic tool for parser generation. It
is a UNIX-based utility that generates LALR parsers. LEX and YACC work together to
analyze the program syntactically, and YACC can report conflicts or ambiguities in the
grammar in the form of error messages.
YACC Specification
%{ declarations %} required header files are declared
%token required tokens are declared
...
| <body>n { <semantic action>n }
;
In a YACC production, unquoted strings of letters and digits not declared as tokens are
considered as non-terminals.
A quoted single character, e.g., 'c', is considered as terminal symbol c, as well as the integer
code for the token represented by that character (i.e., Lex would return the character code
for 'c' to the parser, as an integer).
The second part of a YACC specification contains semantic actions, which are sequences of C
statements.
In this, the symbol $$ refers to the attribute value associated with the nonterminal of the
head, while $i refers to the value associated with the i th grammar symbol (terminal or
nonterminal) of the body.
The third part of a YACC specification consists of supporting C-routines. A lexical analyzer
by the name yylex() must be provided.
Example:
YACC source program for a simple desk calculator that reads an arithmetic expression,
evaluates it, and then prints its numeric value.
Grammar for arithmetic expressions
E -> E + T | T
T -> T * F | F
F -> ( E ) | digit
The token digit is a single digit between 0 and 9.
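%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);              /* lexical analyzer, defined below */
void yyerror(const char *s);  /* error reporter, defined below */
%}
%token DIGIT
%%
line : expr '\n' { printf("%d\n", $1); }
;
expr : expr '+' term { $$ = $1 + $3; }
| term
;
term : term '*' factor { $$ = $1 * $3; }
| factor
;
factor : '(' expr ')' { $$ = $2; }
| DIGIT
;
%%
int main(void)
{
    return yyparse();
}
void yyerror(const char *s)
{
    fprintf(stderr, "\nerror: %s\n", s);
}
/* Returns DIGIT for a single digit (its value in yylval), otherwise the character itself. */
int yylex(void)
{
    int c;
    c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}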