CD Unit 2
UNIT-II
SYNTAX ANALYSIS
Syntax analysis is the second phase of the compiler. It takes as input the stream of tokens
produced by the lexical analyzer and generates a syntax tree or parse tree.
Advantages of grammar for syntactic specification:
1. A grammar gives a precise and easy-to-understand syntactic specification of a programming
language.
2. An efficient parser can be constructed automatically from a properly designed grammar.
3. A grammar imparts a structure to a source program that is useful for its translation into
object code and for the detection of errors.
4. New constructs can be added to a language more easily when there is a grammatical
description of the language.
CONTEXT-FREE GRAMMARS
A Context-Free Grammar is a quadruple that consists of terminals, non-terminals, start
symbol and productions.
Terminals: These are the basic symbols from which strings are formed.
Non-Terminals: These are the syntactic variables that denote sets of strings. They help to define
the language generated by the grammar.
Start Symbol: One non-terminal in the grammar is denoted as the “Start-symbol” and the set of
strings it denotes is the language defined by the grammar.
Productions: These specify the manner in which terminals and non-terminals can be combined to form
strings. Each production consists of a non-terminal, followed by an arrow, followed by a string of
terminals and non-terminals; it specifies how that non-terminal can be replaced by the string on the right side.
Example of context-free grammar: The following grammar defines simple arithmetic expressions:
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ↑
In this grammar,
id, +, -, *, /, ↑, ( and ) are terminals.
expr , op are non-terminals.
expr is the start symbol.
Each line is a production.
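The four components of this grammar can be written down directly as a data structure. A minimal Python sketch (not from the notes; the names terminals, nonterminals, start and productions are illustrative):

# The CFG for simple arithmetic expressions as a quadruple:
# terminals, non-terminals, start symbol and productions.
terminals = {"id", "+", "-", "*", "/", "↑", "(", ")"}
nonterminals = {"expr", "op"}
start = "expr"
productions = {
    "expr": [["expr", "op", "expr"], ["(", "expr", ")"], ["-", "expr"], ["id"]],
    "op":   [["+"], ["-"], ["*"], ["/"], ["↑"]],
}

# Every symbol on a right-hand side must be a terminal or a non-terminal.
for lhs, bodies in productions.items():
    for body in bodies:
        assert all(s in terminals | nonterminals for s in body)
print("grammar OK:", start in nonterminals)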
Derivations:
Two basic requirements for a grammar are :
1. To generate a valid string.
2. To recognize a valid string.
Derivation is a process that generates a valid string from the grammar by repeatedly replacing a
non-terminal with the string on the right side of one of its productions.
Example : Consider the following grammar for arithmetic expressions :
E → E+E | E*E | ( E ) | - E | id
To generate the valid string - ( id+id ) from the grammar, the steps are:
1. E → - E
2. E → - ( E )
3. E → - ( E+E )
4. E → - ( id+E )
5. E → - ( id+id )
In the above derivation,
E is the start symbol.
- (id+id) is the required sentence (only terminals).
Strings such as E, -E, -(E), . . . are called sentential forms.
Types of derivations:
The two types of derivation are:
1. Left most derivation
2. Right most derivation.
In leftmost derivations, the leftmost non-terminal in each sentential form is always chosen first for
replacement.
In rightmost derivations, the rightmost non-terminal in each sentential form is always chosen first for
replacement.
Example:
Given grammar G : E → E+E | E*E | ( E ) | - E | id
Sentence to be derived : – (id+id)
LEFTMOST DERIVATION          RIGHTMOST DERIVATION
E → - E                      E → - E
E → - ( E )                  E → - ( E )
E → - ( E+E )                E → - ( E+E )
E → - ( id+E )               E → - ( E+id )
E → - ( id+id )              E → - ( id+id )
Strings that appear in a leftmost derivation are called left sentential forms.
Strings that appear in a rightmost derivation are called right sentential forms.
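The two derivations above can be reproduced mechanically. A small Python sketch (not from the notes; the helper replace is an illustrative name) that applies the same productions to the leftmost and to the rightmost non-terminal:

# Replace one occurrence of the non-terminal lhs in a sentential form
# (a list of symbols) by the right side rhs of a production.
def replace(form, lhs, rhs, leftmost=True):
    positions = [i for i, s in enumerate(form) if s == lhs]
    i = positions[0] if leftmost else positions[-1]
    return form[:i] + rhs + form[i + 1:]

# Productions used to derive - ( id + id ) from E -> E+E | E*E | (E) | -E | id
steps = [("E", ["-", "E"]),
         ("E", ["(", "E", ")"]),
         ("E", ["E", "+", "E"]),
         ("E", ["id"]),
         ("E", ["id"])]

for leftmost in (True, False):
    form = ["E"]
    print("leftmost derivation:" if leftmost else "rightmost derivation:")
    for lhs, rhs in steps:
        form = replace(form, lhs, rhs, leftmost)
        print("  ", " ".join(form))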
Sentential forms:
Given a grammar G with start symbol S, if S derives α (in zero or more steps), where α may contain
non-terminals or terminals, then α is called a sentential form of G.
Yield or frontier of tree:
Each interior node of a parse tree is labeled by a non-terminal. Its leaves are labeled by terminals or
non-terminals; read from left to right, they form a sentential form. This sentential form is
called the yield or frontier of the tree.
Ambiguity:
A grammar that produces more than one parse tree (equivalently, more than one leftmost or rightmost
derivation) for some sentence is said to be an ambiguous grammar.
Example : Given grammar G : E → E+E | E*E | ( E ) | - E | id
The sentence id+id*id has the following two distinct leftmost derivations:
E → E + E                E → E * E
E → id + E               E → E + E * E
E → id + E * E           E → id + E * E
E → id + id * E          E → id + id * E
E → id + id * id         E → id + id * id
The corresponding parse trees are:

        E                          E
      / | \                      / | \
     E  +  E                    E  *  E
     |    / | \               / | \    |
    id   E  *  E             E  +  E   id
         |     |             |     |
        id     id           id     id
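A short Python sketch (not from the notes; leftmost_step is an illustrative helper) that applies the two derivations above mechanically and shows that both yield the same sentence:

# Apply a production E -> rhs to the leftmost E in a sentential form.
def leftmost_step(form, rhs):
    i = form.index("E")                # position of the leftmost E
    return form[:i] + rhs + form[i + 1:]

derivation_1 = [["E", "+", "E"], ["id"], ["E", "*", "E"], ["id"], ["id"]]
derivation_2 = [["E", "*", "E"], ["E", "+", "E"], ["id"], ["id"], ["id"]]

for steps in (derivation_1, derivation_2):
    form = ["E"]
    for rhs in steps:
        form = leftmost_step(form, rhs)
    print(" ".join(form))              # both lines print: id + id * id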
WRITING A GRAMMAR
There are four issues to consider when writing a grammar for a programming language:
1. Regular expressions vs. context-free grammars
2. Eliminating ambiguity
3. Eliminating left recursion
4. Left factoring
Each parsing method can handle grammars only of a certain form; hence, the initial grammar may
have to be rewritten to make it parsable.
Regular Expressions vs. Context-Free Grammars:
Every construct that can be described by a regular expression can also be described by a context-free
grammar. Regular expressions are most useful for describing the structure of lexical constructs such as
identifiers, constants and keywords, while context-free grammars are most useful for describing nested
structures such as balanced parentheses and matching if-then-else statements.
Adding Precedence Rules:
Explicitly define the precedence of the operators in the grammar. For example, the expression
a + b * c is ambiguous unless the grammar fixes the relative precedence of + and *; with the usual
convention (multiplication before addition) it is interpreted as a + ( b * c ). Operators with higher
precedence should be closer to the leaf nodes in the parse tree, while lower precedence operators
should be closer to the root. For instance, in the grammar E → E + T | T, T → T * F | F,
F → ( E ) | id, the * operator is generated lower in the parse tree than +, so * takes precedence over +.
THE ROLE OF THE PARSER
1. Obtaining Tokens: The parser receives a string of tokens from the lexical analyzer. These tokens
represent the fundamental building blocks of the source code.
2. Grammar Verification: It verifies whether the token string can be generated by the grammar
defined for the source language. In other words, it ensures that the code adheres to the specified syntax
rules.
3. Syntax Error Detection: If the token sequence violates the grammar rules, the parser reports
syntax errors in the program. These errors could range from misspelled identifiers to unbalanced
parentheses in arithmetic expressions.
4. Error Recovery: The parser also handles commonly occurring errors gracefully. It aims to recover
from these errors so that it can continue processing the input.
Top–down parsing: A parser can start with the start symbol and try to transform it to the input
string. Example: LL Parsers.
Bottom–up parsing: A parser can start with input and attempt to rewrite it into the start symbol.
Example: LR Parsers.
TOP-DOWN PARSING
It can be viewed as an attempt to find a left-most derivation for an input string or an attempt
to construct a parse tree for the input starting from the root to the leaves.
Types of top-down parsing:
1. Recursive descent parsing (brute force / with backtracking)
2. Predictive parsing (LL(1) parser / FIRST and FOLLOW / without backtracking)
An LL parser scans the input from Left to right and produces a Leftmost derivation.
1. RECURSIVE DESCENT PARSING
Recursive descent parsing is one of the top-down parsing techniques that uses a set of
recursive procedures to scan its input.
This parsing method may involve backtracking, that is, making repeated scans of the input.
Example for backtracking:
Consider the grammar G: S → cAd
A → ab | a
and the input string w=cad.
The parse tree can be constructed using the following top-down approach:
Step1:
Initially create a tree with a single node labeled S. An input pointer points to ‘c’, the first
symbol of w. Expand the tree with the production of S.
S
c A d
Step2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the
second symbol of w ‘a’ and consider the next leaf ‘A’. Expand A using the first alternative .
S
c A d
a b
Step3:
The second symbol ‘a’ of w also matches the second leaf of the tree. So advance the input
pointer to the third symbol of w, ‘d’. But the third leaf of the tree is b, which does not match the
input symbol.
Hence discard the chosen production and reset the input pointer to the second position. This is
called backtracking.
Step4: Now try the second alternative for A.
S
c A d
a
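The same backtracking search can be written as a short program. A minimal Python sketch (not from the notes; the function names parse_S and parse_A are illustrative) for the grammar S → cAd, A → ab | a:

# Try the alternatives of A in order; return the new input position on
# success or None on failure, which forces the caller to backtrack.
def parse_A(w, i):
    for alt in (["a", "b"], ["a"]):          # try A -> ab first, then A -> a
        if all(i + k < len(w) and w[i + k] == sym for k, sym in enumerate(alt)):
            return i + len(alt)
    return None

def parse_S(w):
    # S -> c A d
    if not w or w[0] != "c":
        return False
    j = parse_A(w, 1)
    return j is not None and j + 1 == len(w) and w[j] == "d"

print(parse_S(list("cad")))    # True: A -> ab fails on 'd', so A -> a is used
print(parse_S(list("cabd")))   # True: A -> ab succeeds
print(parse_S(list("cbd")))    # False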
Example: Recursive descent parsing procedures (without backtracking) for the grammar
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

Procedure E( )
begin
T( ); EPRIME( );
end

Procedure EPRIME( )
begin
If input_symbol='+' then
begin ADVANCE( ); T( ); EPRIME( ); end
end

Procedure T( )
begin
F( ); TPRIME( );
end

Procedure TPRIME( )
begin
If input_symbol='*' then
begin ADVANCE( ); F( ); TPRIME( ); end
end

Procedure F( )
begin
If input_symbol='id' then ADVANCE( )
else if input_symbol='(' then
begin
ADVANCE( ); E( );
If input_symbol=')' then ADVANCE( ) else ERROR( );
end
else ERROR( );
end
Stack implementation:
To recognize the input id+id*id, the procedures are called in the following order (the second column
shows the input remaining to be read when the procedure is called):
Procedure        Remaining Input
E( )             id+id*id
T( )             id+id*id
F( )             id+id*id
ADVANCE( )       id+id*id      (consumes id)
TPRIME( )        +id*id
EPRIME( )        +id*id
ADVANCE( )       +id*id        (consumes +)
T( )             id*id
F( )             id*id
ADVANCE( )       id*id         (consumes id)
TPRIME( )        *id
ADVANCE( )       *id           (consumes *)
F( )             id
ADVANCE( )       id            (consumes id)
TPRIME( )        (end of input)
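The procedures above can also be written in a real programming language. A minimal Python sketch (not from the notes; tokens, peek and advance are illustrative assumptions):

# Global token stream and position; '$' marks the end of the input.
tokens = ["id", "+", "id", "*", "id", "$"]
pos = 0

def peek():
    return tokens[pos]

def advance():
    global pos
    pos += 1

def E():                      # E  -> T E'
    T()
    EPRIME()

def EPRIME():                 # E' -> + T E' | epsilon
    if peek() == "+":
        advance()
        T()
        EPRIME()

def T():                      # T  -> F T'
    F()
    TPRIME()

def TPRIME():                 # T' -> * F T' | epsilon
    if peek() == "*":
        advance()
        F()
        TPRIME()

def F():                      # F  -> ( E ) | id
    if peek() == "id":
        advance()
    elif peek() == "(":
        advance()
        E()
        if peek() == ")":
            advance()
        else:
            raise SyntaxError("expected )")
    else:
        raise SyntaxError("expected id or (")

E()
print("accepted" if peek() == "$" else "error")   # prints: accepted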
PREDICTIVE PARSING
Predictive parsing is a special case of recursive descent parsing where no backtracking is
required.
The key problem of predictive parsing is to determine the production to be applied for a non-
terminal in case of alternatives.
Non-recursive predictive parser:
(Model of a non-recursive predictive parser: an input buffer holding a + b $, a stack holding
X, Y, …, $, the predictive parsing program, the parsing table M, and the output stream.)
The table-driven predictive parser has an input buffer, stack, a parsing table and an output
stream.
Input buffer:
It consists of the string to be parsed, followed by $ to indicate the end of the input string.
Stack:
It contains a sequence of grammar symbols preceded by $ to indicate the bottom of the stack.
Initially, the stack contains the start symbol on top of $.
Parsing table:
It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a terminal.
UNIT-II COMPILER DESIGN
The parser is controlled by a program that considers X, the symbol on top of stack, and a, the current
input symbol. These two symbols determine the parser action. There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input
symbol.
3. If X is a non-terminal , the program consults entry M[X, a] of the parsing table M. This entry
will either be an X-production of the grammar or an error entry.
If M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U, the leftmost
symbol of the right side, on top). If M[X, a] = error, the parser calls an error recovery routine.
Algorithm for nonrecursive predictive parsing:
Input : A string w and a parsing table M for grammar G.
Output : If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method : Initially, the parser has $S on the stack with S, the start symbol of G on top, and w$ in the
input buffer. The program that utilizes the predictive parsing table M to produce a parse for the input is
as follows:
set ip to point to the first symbol of w$;
repeat
let X be the top stack symbol and a the symbol pointed to by ip;
if X is a terminal or $ then
   if X = a then
      pop X from the stack and advance ip
   else error()
else /* X is a non-terminal */
   if M[X, a] = X → Y1Y2 … Yk then begin
      pop X from the stack;
      push Yk, Yk-1, … , Y1 onto the stack, with Y1 on top;
      output the production X → Y1Y2 … Yk
   end
   else error()
until X = $ /* stack is empty */
FOLLOW of a non-terminal A is computed from the right sides in which A appears: everything in FIRST of
the symbols that follow A, except ε, is placed in FOLLOW(A); if nothing follows A (or what follows it
can derive ε), FOLLOW of the left-side non-terminal is also placed in FOLLOW(A).
Example: For the grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id, the predictive
parsing table is:
NON-TERMINAL   id          +            *             (           )          $
E              E → TE'                                E → TE'
E'                         E' → +TE'                              E' → ε     E' → ε
T              T → FT'                                T → FT'
T'                         T' → ε       T' → *FT'                 T' → ε     T' → ε
F              F → id                                 F → (E)
Stack implementation:
The moves made by the predictive parser on the input id+id*id are:
STACK            INPUT            OUTPUT
$E               id+id*id$
$E'T             id+id*id$        E → TE'
$E'T'F           id+id*id$        T → FT'
$E'T'id          id+id*id$        F → id
$E'T'            +id*id$
$E'              +id*id$          T' → ε
$E'T+            +id*id$          E' → +TE'
$E'T             id*id$
$E'T'F           id*id$           T → FT'
$E'T'id          id*id$           F → id
$E'T'            *id$
$E'T'F*          *id$             T' → *FT'
$E'T'F           id$
$E'T'id          id$              F → id
$E'T'            $
$E'              $                T' → ε
$                $                E' → ε
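The moves above can be reproduced with a short program. A minimal Python sketch (not from the notes; M, nonterminals and parse are illustrative names) that encodes the parsing table and prints the stack, the remaining input and the action taken at each step:

# Parsing table M[(non-terminal, input symbol)] = right-hand side.
# An empty list stands for an epsilon production.
M = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
nonterminals = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    stack = ["$", "E"]                       # start symbol on top of $
    tokens = tokens + ["$"]
    ip = 0
    while True:
        X, a = stack[-1], tokens[ip]
        prefix = f"{' '.join(stack):<20}{' '.join(tokens[ip:]):<18}"
        if X == "$" and a == "$":
            print(prefix + "accept")
            return True
        if X not in nonterminals:            # terminal on top of the stack
            if X != a:
                raise SyntaxError(f"expected {X}, found {a}")
            print(prefix + f"match {a}")
            stack.pop()
            ip += 1
        else:                                # consult the parsing table
            rhs = M.get((X, a))
            if rhs is None:
                raise SyntaxError(f"no entry M[{X}, {a}]")
            print(prefix + f"output {X} -> {' '.join(rhs) if rhs else 'epsilon'}")
            stack.pop()
            stack.extend(reversed(rhs))      # push Yk ... Y1 so that Y1 is on top

parse(["id", "+", "id", "*", "id"])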
LL(1) grammar:
If the parsing table of a grammar has no multiply defined entries, i.e., each location M[A, a] holds at
most one production, then the grammar is called an LL(1) grammar.
Consider the following grammar:
S → iEtS | iEtSeS | a
E→b
After left factoring, the grammar becomes
S → iEtSS’ | a
S’→ eS | ε
E→b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = {e, ε }
FIRST(E) = { b}
FOLLOW(S) = { $ ,e }
FOLLOW(S’) = { $ ,e }
FOLLOW(E) = {t}
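These sets can also be computed programmatically. A minimal Python sketch (not from the notes; grammar, first_of_string and the other names are illustrative) that computes FIRST and FOLLOW for the left-factored grammar above by iterating to a fixed point:

EPS = "eps"                       # stands for the empty string ε

grammar = {                       # the left-factored grammar above
    "S":  [["i", "E", "t", "S", "S'"], ["a"]],
    "S'": [["e", "S"], [EPS]],
    "E":  [["b"]],
}
start = "S"
nonterminals = set(grammar)

def first_of_string(symbols, first):
    """FIRST of a string of grammar symbols X1 X2 ... Xn."""
    result = set()
    for X in symbols:
        f = first[X] if X in nonterminals else {X}
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)               # every Xi can derive the empty string
    return result

first = {A: set() for A in nonterminals}
follow = {A: set() for A in nonterminals}
follow[start].add("$")

changed = True
while changed:                    # repeat until nothing new is added
    changed = False
    for A, bodies in grammar.items():
        for rhs in bodies:
            body = [s for s in rhs if s != EPS]
            f = first_of_string(body, first)
            if not f <= first[A]:
                first[A] |= f
                changed = True
            for i, X in enumerate(body):
                if X not in nonterminals:
                    continue
                trailer = first_of_string(body[i + 1:], first)
                add = trailer - {EPS}
                if EPS in trailer:
                    add |= follow[A]
                if not add <= follow[X]:
                    follow[X] |= add
                    changed = True

for A in sorted(nonterminals):
    print(f"FIRST({A}) = {sorted(first[A])}   FOLLOW({A}) = {sorted(follow[A])}")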
Parsing table:
NON-TERMINAL   a        b        e                    i              t        $
S              S → a                                  S → iEtSS'
S'                               S' → eS , S' → ε                            S' → ε
E                       E → b
Since the entry M[S', e] contains two productions (S' → eS and S' → ε), the parsing table has a
multiply defined entry and the grammar is not LL(1).
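The conflict can be found mechanically. A small Python sketch (not from the notes; names are illustrative) that fills the table from the FIRST and FOLLOW sets listed above and reports multiply defined entries:

EPS = "eps"
productions = [("S", ["i", "E", "t", "S", "S'"]),
               ("S", ["a"]),
               ("S'", ["e", "S"]),
               ("S'", [EPS]),
               ("E", ["b"])]
# FIRST of each grammar symbol (a terminal's FIRST is the terminal itself)
FIRST = {"S": {"i", "a"}, "S'": {"e", EPS}, "E": {"b"},
         "i": {"i"}, "a": {"a"}, "e": {"e"}, "t": {"t"}, "b": {"b"}}
FOLLOW = {"S": {"$", "e"}, "S'": {"$", "e"}, "E": {"t"}}

table = {}
for A, rhs in productions:
    # FIRST of the right side; for this grammar the first symbol decides it.
    first_rhs = {EPS} if rhs == [EPS] else FIRST[rhs[0]]
    columns = first_rhs - {EPS}
    if EPS in first_rhs:
        columns |= FOLLOW[A]          # A -> eps goes under FOLLOW(A)
    for a in columns:
        table.setdefault((A, a), []).append(" ".join(rhs))

for (A, a), entries in sorted(table.items()):
    flag = "   <-- two entries: not LL(1)" if len(entries) > 1 else ""
    print(f"M[{A}, {a}] = {entries}{flag}")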
Actions performed in predictive parsing:
1. Shift / match – the terminal on top of the stack matches the current input symbol, so it is popped
and the input pointer advances.
2. Reduce / expand – the non-terminal on top of the stack is replaced by the right side of the
production found in the parsing table.
3. Accept – the stack and the input both contain only $.
4. Error – the top terminal does not match the input, or the table entry is empty.
Implementation of predictive parser:
1. Eliminate left recursion and ambiguity from the grammar, and perform left factoring.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct predictive parsing table.
4. Parse the given input string using stack and parsing table.