CD Unit2
Syntax Analysis
• The role of the parser
• Context-Free Grammars
• Derivations
• Parse Trees
• Ambiguity
• Left Recursion
• Left Factoring
• Syntax analysis is the second phase in the compilation of a source program, after lexical analysis.
• The lexical analyzer reads the input source program and produces a sequence of tokens as output.
• The syntax analyzer takes the tokens from the lexical analyzer and groups them into a programming structure called a “syntax tree” or “parse tree”.
• If the syntax is not recognized, a syntax error is generated.
• Example: consider the source program statement (a+b)*c;
The lexical analyzer reads the above statement and breaks it into
a set of tokens such as
a identifier
+ operator
b identifier
* operator
c identifier
• The syntax analyzer then collects these tokens from the lexical analyzer
and arranges them into a structure called a parse tree or syntax tree.
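The tokenizing step in this example can be sketched in a few lines of Python. This is only an illustration: the token class names and the regular expressions are assumptions made for this sketch, and unlike the list above it also reports the parentheses and the semicolon as tokens.

```python
import re

# Hypothetical token classes for this sketch; a real lexer defines many more.
TOKEN_SPEC = [
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator", r"[+\-*/]"),
    ("punctuation", r"[();]"),
    ("skip", r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(source):
    """Return (lexeme, token class) pairs; raise on an unrecognised character."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = PATTERN.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "skip":          # whitespace is dropped, not reported
            tokens.append((m.group(), m.lastgroup))
        pos = m.end()
    return tokens

for lexeme, kind in tokenize("(a+b)*c;"):
    print(lexeme, kind)
```

The resulting list of pairs is what the syntax analyzer would consume next.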
The role of the parser
• The parser, or syntax analyzer, takes tokens (such as
keywords and identifiers) from the scanner, or lexical
analyzer.
• It then verifies whether the input string can be generated
from the grammar of the source language.
• The parser should also report syntax errors in a manner
that is easily understood by the user.
• The error handler recovers from these errors so that parsing can continue.
• Parsers are classified into three types:
1. Universal parsing: these techniques are capable
of parsing any grammar.
2. Top-down parsing
3. Bottom-up parsing
Context-Free Grammars
• A context-free grammar (CFG) is a formal grammar used to generate
all possible strings in a given formal language.
• A context-free grammar G is defined as a 4-tuple:
G = (V, T, P, S)
where
• G is the grammar.
• T is a finite set of terminal symbols (lowercase letters,
operator symbols).
• V is a finite set of non-terminal symbols (uppercase letters).
• P is a set of production rules.
• S is the start symbol.
• In a CFG, the derivation of a string begins from the start symbol.
• A string is derived by repeatedly replacing a non-terminal
with the right-hand side of one of its productions, until all non-terminals
have been replaced by terminal symbols.
Production rules:
S → aSa
S → bSb
S → c
• Check that the string abbcbba can be derived from the given
CFG:
S ⇒ aSa
⇒ abSba
⇒ abbSbba
⇒ abbcbba
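The derivation above can be reproduced mechanically. The sketch below is specific to the grammar S → aSa | bSb | c and is not a general CFG recognizer; the function name derive is made up for this illustration.

```python
def derive(target):
    """Return the derivation steps for target, or None if S cannot derive it.

    Grammar (from the text): S -> aSa | bSb | c
    """
    steps = ["S"]
    left, right = "", ""          # terminals already produced around S
    middle = target               # part of target that S must still derive
    while middle != "c":
        # S -> aSa or S -> bSb applies only if the first and last symbols agree.
        if len(middle) < 3 or middle[0] != middle[-1] or middle[0] not in "ab":
            return None
        left, right = left + middle[0], middle[-1] + right
        middle = middle[1:-1]
        steps.append(left + "S" + right)
    steps.append(left + "c" + right)   # finish with S -> c
    return steps

print(" => ".join(derive("abbcbba")))
```

For a string outside the language, such as abcab, the function returns None because no production can match the outermost symbols.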
Derivations
• A derivation is a sequence of production-rule applications that
produces the input string from the start symbol. During parsing
we have to make two decisions:
• Which non-terminal is to be replaced.
• Which production rule is used to replace that non-terminal.
• There are two options for choosing the non-terminal to
replace: the leftmost one or the rightmost one.
Left-most Derivation
• In a leftmost derivation, the leftmost non-terminal of the sentential
form is replaced at every step, so the string is produced
from left to right.
Example:
• Production rules:
S → S + S
S → S - S
S → a | b | c
Input:
a-b+c
The left-most derivation is:
S ⇒ S+S
⇒ S-S+S
⇒ a-S+S
⇒ a-b+S
⇒ a-b+c
Right-most Derivation
• In a rightmost derivation, the rightmost non-terminal of the sentential
form is replaced at every step, so the string is produced
from right to left.
For the same grammar and input, the right-most derivation is:
S ⇒ S-S
⇒ S-S+S
⇒ S-S+c
⇒ S-b+c
⇒ a-b+c
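Both derivations of a-b+c can be replayed with one small helper. This is a sketch that assumes S is the only non-terminal, as in the example grammar; the right-hand sides to apply are supplied by hand rather than chosen by a parser.

```python
def derive(rhs_sequence, leftmost=True):
    """Apply each right-hand side to the leftmost (or rightmost) S in turn.

    Grammar (from the text): S -> S+S | S-S | a | b | c
    rhs_sequence is the list of right-hand sides chosen at each step.
    """
    form = "S"
    steps = [form]
    for rhs in rhs_sequence:
        i = form.find("S") if leftmost else form.rfind("S")
        if i < 0:
            raise ValueError("no non-terminal left to expand")
        form = form[:i] + rhs + form[i + 1:]   # replace that one S
        steps.append(form)
    return steps

left = derive(["S+S", "S-S", "a", "b", "c"], leftmost=True)
right = derive(["S-S", "S+S", "c", "b", "a"], leftmost=False)
print(" => ".join(left))
print(" => ".join(right))
```

The two runs end in the same string but pass through different sentential forms, which is exactly the difference between the two derivation orders.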
Parse Trees
• A parse tree is a graphical representation of a derivation. Each
node is labelled with a grammar symbol, which can be a terminal or a non-terminal.
• In parsing, the string is derived from the start symbol, so the root
of the parse tree is the start symbol.
• A parse tree reflects the precedence of operators. The deepest
sub-tree is evaluated first, so the operator in a parent node has
lower precedence than the operator in its sub-tree.
A parse tree has these properties:
• All leaf nodes are terminals.
• All interior nodes are non-terminals.
• Reading the leaves from left to right gives the original input string.
Production rules:
S → S + S | S * S
S → a | b | c
Input:
a*b+c
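One parse tree for a*b+c can be written down as nested tuples and checked against the properties listed above. This is a sketch; the tuple representation and the helper names are assumptions of this illustration. The tree shown puts + at the root and * in the sub-tree, so * binds tighter.

```python
# A node is (symbol, children); a terminal leaf has an empty child list.
def leaf(t):
    return (t, [])

def S(*children):
    return ("S", list(children))

# One parse tree for a*b+c: '+' is applied at the root, '*' in the sub-tree.
tree = S(S(S(leaf("a")), leaf("*"), S(leaf("b"))), leaf("+"), S(leaf("c")))

def tree_yield(node):
    """Concatenate the leaves from left to right."""
    symbol, children = node
    if not children:
        return symbol
    return "".join(tree_yield(child) for child in children)

def show(node, depth=0):
    """Print the tree with one level of indentation per depth."""
    symbol, children = node
    print("  " * depth + symbol)
    for child in children:
        show(child, depth + 1)

show(tree)
print(tree_yield(tree))
```

The yield of the tree is the input string, and every interior node is the non-terminal S.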
Ambiguity
• A context-free grammar is said to be ambiguous if some string in the
language of the grammar can be represented by two or more
different parse trees.
• Equivalently, a grammar that has more than one leftmost derivation, or
more than one rightmost derivation, for the same input string is called an ambiguous
grammar.
• For example:
• Consider the grammar E -> E+E | id. Two different parse trees can be built
from this grammar for the string id+id+id: one groups the string as
(id+id)+id and the other as id+(id+id). Each corresponds to a different leftmost derivation.
• Both parse trees are derived from the same grammar rules, yet they
are different. Hence the grammar is ambiguous.
Let us now consider the following grammar:
E -> I
E -> E + E
E -> E * E
E -> (E)
I -> ε | 0 | 1 | … | 9
• From the above grammar the string 3*2+5 can be derived in two ways:
I) First leftmost derivation    II) Second leftmost derivation
E => E*E                        E => E+E
  => I*E                          => E*E+E
  => 3*E                          => I*E+E
  => 3*E+E                        => 3*E+E
  => 3*I+E                        => 3*I+E
  => 3*2+E                        => 3*2+E
  => 3*2+I                        => 3*2+I
  => 3*2+5                        => 3*2+5
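Ambiguity can also be checked by counting parse trees. The sketch below does this for the first example grammar, E -> E+E | id, by trying every + as the top-level split. It is written for that grammar only, and the function name is made up for this illustration.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count the parse trees of tokens under E -> E+E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways E can derive tokens[lo:hi].
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        for k in range(lo + 1, hi - 1):
            if tokens[k] == "+":          # try E -> E + E split at this '+'
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees(["id", "+", "id", "+", "id"]))
```

Any count above 1 means the string has more than one parse tree, so the grammar is ambiguous.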
Left Recursion
• A production of a grammar is said to be left recursive if the leftmost symbol of
its RHS is the same non-terminal as its LHS.
• A grammar that has left-recursive productions is called a left-recursive grammar:
A->Aα| β
• If left recursion is present in the grammar, a top-down parser can enter
an infinite loop, because the procedure for A calls itself before consuming any input:
A()
{
A();
α
}
• Top-down parsing cannot handle a
grammar that contains left-recursive productions, so we have to
eliminate them.
• Left recursion is eliminated by converting the grammar into a right-
recursive grammar: A->Aα| β is replaced by A->βA’ and A’->αA’| ε
• Example: eliminate the left recursion from the following grammar
E->E+T|T
T->T*F|F
F->(E)|id
Sol:
E->TE’
E’->+TE’| ε
T->FT’
T’->*FT’| ε
F->(E)|id
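The transformation A -> Aα | β into A -> βA' and A' -> αA' | ε can be applied mechanically. The sketch below handles immediate left recursion only (not recursion through several non-terminals); the dictionary representation of the grammar is an assumption of this illustration.

```python
EPSILON = "ε"

def eliminate_left_recursion(grammar):
    """Remove immediate left recursion from every non-terminal.

    grammar maps a non-terminal to a list of alternatives, each a list of symbols.
    A -> Aα | β becomes A -> βA' and A' -> αA' | ε.
    """
    result = {}
    for head, alternatives in grammar.items():
        recursive = [alt[1:] for alt in alternatives if alt and alt[0] == head]
        others = [alt for alt in alternatives if not alt or alt[0] != head]
        if not recursive:
            result[head] = alternatives       # nothing to change
            continue
        new_head = head + "'"
        result[head] = [beta + [new_head] for beta in others]
        result[new_head] = [alpha + [new_head] for alpha in recursive] + [[EPSILON]]
    return result

grammar = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["(", "E", ")"], ["id"]],
}
for head, alternatives in eliminate_left_recursion(grammar).items():
    print(head, "->", " | ".join(" ".join(alt) for alt in alternatives))
```

For the expression grammar this produces E -> T E', E' -> + T E' | ε, T -> F T', T' -> * F T' | ε, and leaves F unchanged.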
Left Factoring
• When two or more productions of a non-terminal start with the same sequence of symbols,
a grammar transformation called left factoring is needed.
• With productions A->αβ1| αβ2, a top-down parser cannot tell which production to
use to expand A after seeing only the common prefix α.
• Left factoring rewrites them as A->αA’ and A’->β1| β2. A is first expanded up to α; then, based on the remaining input, A’
is expanded to β1 or β2, so the confusion is eliminated.
Example: E-> T+E|T
Sol:
E->TE’
E’->+E| ε
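The example can be reproduced with a short helper. This sketch is simplified on purpose: it performs a single factoring step and only factors a prefix shared by all alternatives of the non-terminal, whereas the general algorithm groups alternatives by prefix and repeats. The function names are made up for this illustration.

```python
EPSILON = "ε"

def common_prefix(a, b):
    """Longest common prefix of two symbol lists."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]

def left_factor(head, alternatives):
    """Factor the prefix shared by ALL alternatives of head (one step only).

    A -> αβ1 | αβ2 becomes A -> αA' and A' -> β1 | β2.
    """
    prefix = alternatives[0]
    for alt in alternatives[1:]:
        prefix = common_prefix(prefix, alt)
    if not prefix or len(alternatives) < 2:
        return {head: alternatives}        # nothing to factor
    new_head = head + "'"
    # An alternative equal to the prefix leaves an empty tail, written as ε.
    tails = [alt[len(prefix):] or [EPSILON] for alt in alternatives]
    return {head: [prefix + [new_head]], new_head: tails}

print(left_factor("E", [["T", "+", "E"], ["T"]]))
```

For E -> T+E | T the common prefix is T, giving E -> T E' and E' -> + E | ε as in the solution above.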
Top-Down Parsing
• Top-down parsing is the process of constructing the parse tree starting from the root and
proceeding to the children,
i.e., starting from the start symbol of the grammar and reaching the input
string.
• Top-down parsers are constructed from a grammar that is free from
ambiguity and left recursion.
• Top-down parsers use a leftmost derivation to construct the parse tree.
• They do not allow grammars with common prefixes; left factoring removes them.
Classification of parsing techniques:
[Figure: classification tree of parsers]
[Figure: steps 3 to 6 of a top-down parse, showing the parse tree rooted at S as it is built]
Predictive parsing table (only the rows for T and F survive here):
     id        (
T    T->FT’    T->FT’
F    F->id     F->(E)
Rules:
1. For each production A-> α, calculate First(α) and, for every terminal a in
First(α), add A-> α to M[A,a].
2. If ε is in First(α) (for example, when the production is A-> ε), calculate Follow(A) and, for every symbol b in Follow(A), add the production to
M[A,b].
3. The remaining entries of the parsing table are marked as errors.
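The three rules can be turned into a short program. The sketch below computes FIRST and FOLLOW by fixed-point iteration and then fills the table. It uses the left-recursion-free expression grammar from these notes, and the data layout (dictionaries keyed by (non-terminal, terminal)) is an assumption of this illustration.

```python
EPS, END = "ε", "$"

GRAMMAR = {            # the left-recursion-free expression grammar
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
START = "E"

def first_of(symbols, first):
    """FIRST of a symbol string, given the FIRST sets of the non-terminals."""
    out = set()
    for sym in symbols:
        if sym == EPS:
            continue
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)                  # every symbol could vanish
    return out

def compute_first_follow():
    first = {nt: set() for nt in GRAMMAR}
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add(END)
    changed = True
    while changed:                # iterate until no set grows any more
        changed = False
        for head, alts in GRAMMAR.items():
            for alt in alts:
                before = len(first[head])
                first[head] |= first_of(alt, first)
                changed |= len(first[head]) != before
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue
                    rest = first_of(alt[i + 1:], first)
                    before = len(follow[sym])
                    follow[sym] |= rest - {EPS}
                    if EPS in rest:
                        follow[sym] |= follow[head]
                    changed |= len(follow[sym]) != before
    return first, follow

def build_table():
    first, follow = compute_first_follow()
    table = {}
    for head, alts in GRAMMAR.items():
        for alt in alts:
            alt_first = first_of(alt, first)
            lookahead = alt_first - {EPS}          # rule 1
            if EPS in alt_first:
                lookahead |= follow[head]          # rule 2
            for a in lookahead:
                table.setdefault((head, a), []).append(alt)
    return table                                   # missing keys are errors (rule 3)

for (head, a), alts in sorted(build_table().items()):
    print(f"M[{head}, {a}] =", " | ".join(f"{head} -> {' '.join(alt)}" for alt in alts))
```

Each table entry is kept as a list so that a conflict (two productions in one cell) would be visible; for this grammar every list has exactly one element.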
5. To check whether the string is accepted or not:
• We need to parse the input string w=id+id*id
• To do so we track the stack, the input and the output
• The stack is a memory unit
• The initial configuration of the stack is $ followed by E
$ is the bottom of the stack
The start symbol of the grammar is loaded onto the stack
E, the start symbol of the grammar, is the top of the stack
• The input buffer holds the string to be parsed, followed
by $
• If a non-terminal is on top of the stack and the input pointer points to an input symbol, then
we refer to the parsing table
• If a terminal is on top of the stack, it must match the current input
symbol; we then pop the terminal from the stack and remove the
input symbol from the input buffer (i.e., advance the input pointer to the next input
symbol)
• Since E is on top of the stack, it is replaced by the symbols on the
right-hand side of the production E->TE’, pushed in reverse order, so that T ends up on top of the stack
• If the production derives ε, nothing is pushed
• Generate the parse tree from the productions in the output
• The string is accepted if the stack and the input buffer become empty together
Stack       Input        Action
$E          id+id*id$    E->TE’
$E’T        id+id*id$    T->FT’
$E’T’F      id+id*id$    F->id
$E’T’id     id+id*id$    Pop and remove id
$E’T’       +id*id$      T’->ε
$E’         +id*id$      E’->+TE’
$E’T+       +id*id$      Pop and remove +
$E’T        id*id$       T->FT’
$E’T’F      id*id$       F->id
$E’T’id     id*id$       Pop and remove id
$E’T’       *id$         T’->*FT’
$E’T’F*     *id$         Pop and remove *
$E’T’F      id$          F->id
$E’T’id     id$          Pop and remove id
$E’T’       $            T’->ε
$E’         $            E’->ε
$           $            Pop and remove $; accept
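The trace above can be replayed by a small table-driven parser. In this sketch the parsing table for the expression grammar is written out by hand, and the function returns the productions in the order they are applied; the names TABLE and parse are assumptions of this illustration.

```python
EPS, END = "ε", "$"

# Predictive parsing table for the expression grammar, written out by hand.
TABLE = {
    ("E", "id"): ["T", "E'"],  ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [EPS], ("E'", "$"): [EPS],
    ("T", "id"): ["F", "T'"],  ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"], ("T'", "+"): [EPS],
    ("T'", ")"): [EPS], ("T'", "$"): [EPS],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    """Return the list of productions used, or raise SyntaxError."""
    stack = [END, "E"]                 # '$' at the bottom, start symbol on top
    buffer = list(tokens) + [END]      # input followed by '$'
    pos = 0
    output = []
    while stack:
        top, current = stack[-1], buffer[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, current))
            if rhs is None:
                raise SyntaxError(f"no entry M[{top}, {current}]")
            stack.pop()
            # Push the right-hand side in reverse; ε pushes nothing.
            stack.extend(s for s in reversed(rhs) if s != EPS)
            output.append(f"{top} -> {' '.join(rhs)}")
        elif top == current:
            stack.pop()                # match a terminal (or the final '$')
            pos += 1
        else:
            raise SyntaxError(f"expected {top}, found {current}")
    return output

for step in parse(["id", "+", "id", "*", "id"]):
    print(step)
```

The printed productions appear in the same order as the production rows of the trace, which is a leftmost derivation of id+id*id.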
Not LL(1)
• A grammar is said to be LL(1) if its predictive parsing table has no
multiply defined entries
• In the previous table every entry holds at most one production (some entries are
blank), so that grammar is an LL(1) grammar
• If the predictive parsing table has an entry with more than one production, the grammar is
not LL(1)
• So here M[S’,e] has multiple entries, and the grammar is not LL(1)
Error Recovery in Predictive parsing:
• An error is detected during predictive parsing when the terminal on top of the
stack does not match the next input symbol.
• An error is also detected when a non-terminal ‘A’ is on top of the stack, ‘a’ is the next input
symbol, and the parsing table entry M[A,a] is empty.
• Error recovery lets the parser continue after an error so that it can keep checking the
rest of the input.
• An LL(1) parser uses “panic mode” error recovery
Panic mode Error recovery:
• Panic mode is based on the idea of skipping input symbols until a token from a set
of synchronizing tokens appears.
• The synchronizing sets should be chosen so that the parser recovers quickly
from errors
Rules:
1. If the parser looks up an entry M[A,b] and finds it blank, the input
symbol ‘b’ is skipped.
2. If the entry is “synch”, the non-terminal on top of the stack is popped
so that parsing can continue.
3. If a token on top of the stack does not match the input symbol, the
token is popped from the stack.
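The three rules can be sketched on top of a table-driven parser for the expression grammar. In this illustration the synchronizing entries are taken from the FOLLOW sets of each non-terminal, and the table, the function name and the error messages are assumptions. One extra case is added that the rules do not cover: input left over once only $ remains on the stack is skipped.

```python
EPS, END, SYNCH = "ε", "$", "synch"

TABLE = {
    ("E", "id"): ["T", "E'"],  ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [EPS], ("E'", "$"): [EPS],
    ("T", "id"): ["F", "T'"],  ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"], ("T'", "+"): [EPS],
    ("T'", ")"): [EPS], ("T'", "$"): [EPS],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"+", "*", ")", "$"},
}
# Mark FOLLOW symbols as synch wherever the table has no production.
for nt, follow in FOLLOW.items():
    for a in follow:
        TABLE.setdefault((nt, a), SYNCH)

def parse_with_recovery(tokens):
    """Parse tokens and return the list of recovery actions that were needed."""
    stack = [END, "E"]
    buffer = list(tokens) + [END]
    pos = 0
    errors = []
    while stack:
        top, current = stack[-1], buffer[pos]
        if top in FOLLOW:                       # non-terminal on top
            entry = TABLE.get((top, current))
            if entry is None:                   # rule 1: blank entry, skip input
                errors.append(f"skip {current}")
                pos += 1
            elif entry == SYNCH:                # rule 2: pop the non-terminal
                errors.append(f"pop {top}")
                stack.pop()
            else:
                stack.pop()
                stack.extend(s for s in reversed(entry) if s != EPS)
        elif top == current:
            stack.pop()                         # terminals match, advance
            pos += 1
        elif top == END:                        # extra case: leftover input
            errors.append(f"skip {current}")
            pos += 1
        else:                                   # rule 3: pop the unmatched terminal
            errors.append(f"pop {top}")
            stack.pop()
    return errors

print(parse_with_recovery(["id", "*", "+", "id"]))
```

For the input id * + id the parser pops F when it meets +, because + is in Follow(F), and then continues normally with the rest of the input.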
Example: consider the parsing table for the grammar
E->TE’          Follow(E) = Follow(E’) = {$, )}
E’->+TE’| ε
T->FT’          Follow(T) = Follow(T’) = {+, $, )}
T’->*FT’| ε
F->(E)|id       Follow(F) = {+, *, $, )}
Using the FOLLOW symbols of each non-terminal as synchronizing symbols, we have
     id        +     *     (         )        $
E    E->TE’                E->TE’    synch    synch