2014-CD Ch-03 SAn
Syntax analysis is the second phase of a compiler. In this chapter, we will learn the basic concepts used in the
construction of a parser. In Chapter 2, we saw that a lexical analyzer can identify tokens with the help of
regular expressions and pattern rules. A syntax analyzer, or parser, takes its input from the lexical analyzer
in the form of a token stream. The parser analyzes the source code (token stream) against the production rules
to detect any errors in the code. The output of this phase is a parse tree.
[Figure: Position of the parser in a compiler. The lexical analyzer reads (and puts back) characters of the
source program and returns a token and its value on each getNextToken request from the parser; the parser
produces a parse tree for the rest of the front end, which emits the intermediate representation. Both phases
consult the symbol table.]
In this way, the parser accomplishes two tasks: parsing the code while looking for errors, and generating a
parse tree as the output of the phase. Parsers are expected to parse the whole code even if some errors exist in
the program; to do so, parsers use error-recovery strategies. A lexical analyzer cannot check the syntax of a
given sentence due to the limitations of regular expressions: regular expressions cannot check balancing of
tokens, such as parentheses. Therefore, the syntax analysis phase uses context-free grammar (CFG), which is
recognized by push-down automata. The syntax of a language is specified by a context-free grammar (CFG).
CFG is a helpful tool for describing the syntax of programming languages. Every regular grammar is also
context-free, but there exist constructs that are beyond the scope of regular grammars. The rules in a CFG are
mostly recursive. A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or
not. If it does, the syntax analyzer creates a parse tree for the given program.
Before we proceed to the details of CFG, let us first see the definition of context-free grammar and introduce
the terminologies used in parsing technology. A context-free grammar has four components:
A set of non-terminals (V): syntactic variables that denote sets of strings and help define the
language generated by the grammar. Non-terminals are represented by capital letters, and they can
be derived further.
A set of tokens, known as terminal symbols (Σ): the basic symbols from which strings are formed.
Terminals are represented by small letters, and they have no further derivation.
A set of productions (P): the productions of a grammar specify the manner in which the terminals and
non-terminals can be combined to form strings. Each production consists of a non-terminal (called left
side of the production), an arrow, and a sequence of tokens with non-terminals (called right side of
the production).
One of the non-terminals is designated as the start symbol (S); from where the production begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol)
by the right side of a production, for that non-terminal.
Example: We take the palindrome language L = {w | w = wR}, which cannot be described by
means of regular expressions because it is not a regular language, but it can be described by means
of CFG, as illustrated below:
G = (V, Σ, P, S), where, taking the alphabet {0, 1} for example: V = {S}, Σ = {0, 1}, S is the start symbol,
and P consists of the productions
S → 0S0 | 1S1 | 0 | 1 | ε
This grammar describes the palindrome language, generating strings such as: 1001, 11100111, 00100, 1010101, 11111, etc.
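Membership in this language can be sketched as a short recursive function that mirrors the productions S → 0S0 | 1S1 | 0 | 1 | ε; the encoding of the grammar as string comparisons is an assumption made for illustration:

```python
# Recursive membership check mirroring the palindrome CFG:
#   S -> 0S0 | 1S1 | 0 | 1 | eps
def derives(w):
    if w in ("", "0", "1"):            # S -> eps | 0 | 1
        return True
    # S -> 0S0 or S -> 1S1: outer symbols must match, inner part must derive
    return w[0] == w[-1] and derives(w[1:-1])

print(derives("1001"), derives("00100"), derives("1010"))
# prints: True True False
```

Each recursive call strips the matching outer symbols, exactly as one derivation step S ⇒ 0S0 or S ⇒ 1S1 adds them.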
3.3.1. Derivation
A derivation is basically a sequence of production-rule applications used to obtain the input string from the
start symbol. During parsing, we take two decisions for some sentential form of the input:
Deciding which non-terminal is to be replaced.
Deciding the production rule by which the non-terminal will be replaced.
To decide which non-terminal to replace, we have two options: left-most derivation and right-most
derivation.
i. Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most derivation.
The sentential form derived by the left-most derivation is called the left-sentential form.
ii. Right-most Derivation
If we scan and replace the input with production rules from right to left, it is known as right-most derivation.
The sentential form derived from the right-most derivation is called the right-sentential form.
E → E + E
E → E * E
E → id
For the input string: id + id * id

The left-most derivation is:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id

The right-most derivation is:
E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id
Notice that in a left-most derivation, the left-most non-terminal is always processed first, whereas in a
right-most derivation, the right-most non-terminal is always processed first.
The parse tree for the left-most derivation above is constructed in the following steps:
Step 1: E → E * E
Step 2: E → E + E * E
Step 3: E → id + E * E
Step 4: E → id + id * E
Step 5: E → id + id * id
In a parse tree, the root is labeled by the start symbol, interior nodes are labeled by non-terminals, and leaf
nodes are labeled by terminals.
A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed first, therefore
the operator in that sub-tree gets precedence over the operators in its parent nodes.
Exercise: for the given input string; a + b * c draw the parse tree by using right-most derivation.
3.3.3. Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (equivalently, more than one left-most
or right-most derivation) for at least one string.
Example:
E → E + E
E → E – E
E → id
For the string id - id + id, the above grammar generates two parse trees:
A language for which every grammar is ambiguous is said to be inherently ambiguous; note that an ambiguous
grammar alone does not make its language inherently ambiguous. Ambiguity in a grammar is not good for
compiler construction. No method can detect and remove ambiguity automatically, but it can be removed
either by re-writing the whole grammar without ambiguity, or by setting and following associativity and
precedence constraints.
a. Associativity
If an operand has operators on both sides, the associativity of those operators decides which side takes the
operand. If the operation is left-associative, the operand is taken by the left operator; if it is right-associative,
the right operator takes the operand.
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the expression
contains:
Compiled by Tseganesh M. CS@ WCU 5
Compiler Design (CoSc3102)
id op id op id
it will be evaluated as: (id op id) op id
For example, (id + id) + id
Operations like Exponentiation are right associative, i.e., the order of evaluation in the same expression will
be:
id op (id op id)
For example, id ^ (id ^ id)
b. Precedence
If two different operators share a common operand, the precedence of operators decides which will take the
operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4 and another
corresponding to 2+(3*4). By setting precedence among operators, this problem can be easily removed. As in
the previous example, mathematically * (multiplication) has precedence over + (addition), so the expression
2+3*4 will always be interpreted as:
2 + (3 * 4)
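The effect of precedence and associativity can be seen in a small precedence-climbing evaluator. This is a sketch, not part of the chapter's material: the token-list encoding and the operator table (with *, / binding tighter than +, - and all four left-associative) are illustrative assumptions:

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}    # larger number = binds tighter

def eval_expr(tokens, min_prec=1):
    """Evaluate a flat token list, consuming it from the front."""
    left = tokens.pop(0)                   # first operand (a number)
    while tokens and PREC.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        # left-associativity: the right operand may only absorb operators
        # of strictly higher precedence than op
        right = eval_expr(tokens, PREC[op] + 1)
        left = {"+": left + right, "-": left - right,
                "*": left * right, "/": left / right}[op]
    return left

print(eval_expr([2, "+", 3, "*", 4]))      # prints 14, i.e. 2 + (3 * 4)
print(eval_expr([2, "-", 3, "-", 4]))      # prints -5, i.e. (2 - 3) - 4
```

Raising the minimum precedence for the right operand is exactly what forces 3 * 4 to group before the + is applied.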
A grammar is left-recursive if a non-terminal can derive a sentential form that begins with the non-terminal
itself. Example:
(1) A => Aα | β; this is an example of immediate left recursion, where A is a non-terminal and α, β
represent strings of terminals and non-terminals (β does not begin with A).
(2) S => Aα | β
A => Sd; this is an example of indirect left recursion.
A top-down parser expanding A will first try to parse A, which in turn yields a string beginning with A itself,
so the parser may go into a loop forever.
Immediate left recursion of the form A => Aα | β can be removed by rewriting the productions as:
A => βA'
A' => αA' | ε
This does not change the strings derived from the grammar, but it removes the immediate left recursion. A
second method is to use an algorithm that systematically eliminates all direct and indirect left recursion.
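The rewriting of A => Aα | β into A => βA', A' => αA' | ε can be sketched as a small function; the dictionary/tuple grammar encoding below is an assumption made for illustration:

```python
def remove_immediate_left_recursion(nt, rhss):
    """Split A -> A alpha | beta into A -> beta A', A' -> alpha A' | eps.
    Each RHS is a tuple of symbols; () stands for epsilon."""
    alphas = [rhs[1:] for rhs in rhss if rhs and rhs[0] == nt]       # the alphas
    betas = [rhs for rhs in rhss if not rhs or rhs[0] != nt]         # the betas
    if not alphas:                         # no left recursion: nothing to do
        return {nt: rhss}
    new_nt = nt + "'"
    return {
        nt: [beta + (new_nt,) for beta in betas],
        new_nt: [alpha + (new_nt,) for alpha in alphas] + [()],      # () is eps
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | eps
print(remove_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
```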
If more than one production rule of a grammar has a common prefix string, then the top-down parser cannot
make a choice as to which of the productions it should take to parse the string in hand. Left factoring
transforms the grammar to make it useful for top-down parsers. In this technique, we make one production
for each common prefix, and the rest of the derivation is added by new productions. Productions of the form
A => αβ | αγ | … are rewritten as:
A => αA'
A' => β | γ | …
Now the parser has only one production per prefix, which makes it easier to take decisions.
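One left-factoring step can be sketched as follows; representing each right-hand side as a string of single-character symbols is an assumption made to keep the example short:

```python
import os.path

def left_factor_once(nt, rhss):
    """Rewrite A -> ab | ac as A -> aA', A' -> b | c.
    Each RHS is a string of single-character symbols."""
    prefix = os.path.commonprefix(rhss)    # longest prefix shared by ALL RHS
    if not prefix or len(rhss) < 2:
        return {nt: rhss}                  # no common prefix: nothing to do
    new_nt = nt + "'"
    suffixes = [rhs[len(prefix):] or "eps" for rhs in rhss]
    return {nt: [prefix + new_nt], new_nt: suffixes}

# A -> ab | ac   becomes   A -> aA',  A' -> b | c
print(left_factor_once("A", ["ab", "ac"]))
```

A full left-factoring pass would group alternatives by their longest shared prefixes and repeat until no two alternatives share a prefix; this sketch performs just one step when all alternatives share a prefix.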
First Set
This set is computed to know which terminal symbols can be derived in the first position by a non-terminal.
For example, for
α → t β
α derives t (a terminal) in the very first position, so t ∈ FIRST(α).
Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-terminal α in production rules. We
do not consider what the non-terminal can generate but instead, we see what would be the next terminal symbol
that follows the productions of a non-terminal.
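FIRST sets can be computed by iterating over the productions until nothing changes (a fixed point). The grammar encoding below is an assumption for illustration: productions map each non-terminal to tuples of symbols, "" stands for ε, and any symbol without productions is treated as a terminal:

```python
def first_sets(productions):
    """Compute FIRST for every non-terminal by fixed-point iteration."""
    first = {nt: set() for nt in productions}
    changed = True
    while changed:
        changed = False
        for nt, rhss in productions.items():
            for rhs in rhss:
                if rhs == ():                      # epsilon production
                    if "" not in first[nt]:
                        first[nt].add("")
                        changed = True
                    continue
                for sym in rhs:
                    if sym not in productions:     # terminal symbol
                        if sym not in first[nt]:
                            first[nt].add(sym)
                            changed = True
                        break
                    # non-terminal: add its FIRST minus epsilon
                    before = len(first[nt])
                    first[nt] |= first[sym] - {""}
                    changed = changed or len(first[nt]) != before
                    if "" not in first[sym]:
                        break                      # sym cannot vanish: stop here
                else:
                    # every symbol of the RHS can derive eps, so nt can too
                    if "" not in first[nt]:
                        first[nt].add("")
                        changed = True
    return first

# Example: E -> T E',  E' -> + T E' | eps,  T -> id
grammar = {
    "E": [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T": [("id",)],
}
print(first_sets(grammar))
```

FOLLOW sets are computed by a similar fixed-point iteration that propagates FIRST of whatever comes after each occurrence of a non-terminal.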
Tasks beyond syntax checking, such as verifying that a name is declared before it is used, are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.
There are four common error-recovery strategies that can be implemented in the parser to deal with errors
in the code.
a. Panic mode
When a parser encounters an error anywhere in the statement, it ignores the rest of the statement by not
processing input from the erroneous token up to a delimiter, such as a semicolon. This is the easiest way of
error recovery, and it also prevents the parser from developing infinite loops.
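Panic mode can be sketched as follows; the toy statement form (id = id ;) and the token spellings are assumptions for illustration, and real parsers synchronize on a set of delimiters rather than a single one:

```python
def parse_statement(tokens, i):
    """Toy statement parser: accepts exactly  id = id ;"""
    if tokens[i:i + 4] != ["id", "=", "id", ";"]:
        raise SyntaxError("malformed statement")
    return "assign", i + 4

def parse_statements(tokens):
    stmts, errors, i = [], 0, 0
    while i < len(tokens):
        try:
            stmt, i = parse_statement(tokens, i)
            stmts.append(stmt)
        except SyntaxError:
            errors += 1
            # panic mode: discard input up to the next ';' delimiter
            while i < len(tokens) and tokens[i] != ";":
                i += 1
            i += 1                         # also skip the delimiter itself
    return stmts, errors

tokens = ["id", "=", "id", ";", "id", "+", ";", "id", "=", "id", ";"]
print(parse_statements(tokens))            # two good statements, one error
```

Skipping always makes progress toward the next delimiter, which is why this strategy cannot loop forever.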
b. Statement mode
When a parser encounters an error, it tries to take corrective measures so that the rest of the statement allows
the parser to parse ahead, for example by inserting a missing semicolon or replacing a comma with a semicolon.
Parser designers have to be careful here, because one wrong correction may lead to an infinite loop.
c. Error productions
Some common errors that may occur in the code are known to the compiler designers. The designers can
create an augmented grammar containing productions that generate the erroneous constructs, to be used when
these errors are encountered.
d. Global correction
The parser considers the program in hand as a whole and tries to figure out what the program is intended to do
and tries to find out a closest match for it, which is error-free. When an erroneous input (statement) X is fed, it
creates a parse tree for some closest error-free statement Y. This may allow the parser to make minimal changes
in the source code, but due to the complexity (time and space) of this strategy, it has not been implemented in
practice yet.
If we look at a parse tree closely, we find that most of the leaf nodes are single children of their parent nodes.
This information can be eliminated before feeding the tree to the next phase. By hiding the extra information,
we obtain an abstract syntax tree (AST).
ASTs are important data structures in a compiler, carrying the least unnecessary information. ASTs are more
compact than a parse tree and can be easily used by a compiler.
When the parser starts constructing the parse tree from the start symbol and then tries to transform the start
symbol into the input, it is called top-down parsing. That is, a top-down parser parses the input while
constructing the parse tree from the root node, gradually moving down to the leaf nodes. Top-down parsing
uses left-most derivation.
Recursive descent is a common form of top-down parsing technique that constructs the parse tree from the top
and the input is read from left to right. It uses procedures for every terminal and non-terminal entity. It is called
recursive as it uses recursive procedures to process the input. This parsing technique recursively parses the
input to make a parse tree, which may or may not require back-tracking. But the grammar associated with it
(if not left factored) cannot avoid back-tracking. A form of recursive-descent parsing that does not require any
back-tracking is known as predictive parsing. This parsing technique is regarded recursive as it uses context-
free grammar which is recursive in nature.
a. Back-tracking
Top-down parsers start from the root node (start symbol) and match the input string against the production
rules, replacing non-terminals when the rules match. Backtracking means that if one derivation of a production
fails, the syntax analyzer restarts the process using a different rule of the same production. This technique may
process the input string more than once to determine the right production.
To understand top-down parsing, take the following example of a CFG, with the input string read:
S → rXd | rZd
X → oa | ea
Z → ai
The parser will start with S from the production rules and will match its yield to the left-most letter of the
input, i.e. 'r'. The first production of S (S → rXd) matches it, so the top-down parser advances to the next
input letter (i.e. 'e'). The parser tries to expand the non-terminal X and checks its production from the left
(X → oa). It does not match the next input symbol, so the top-down parser backtracks to obtain the next
production rule of X, (X → ea). Now the parser matches all the input letters in an ordered manner. The string
is accepted.
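The backtracking behaviour can be sketched as a recursive matcher over this grammar. The list-based grammar encoding is an assumption; a real recursive-descent parser would have one procedure per non-terminal:

```python
GRAMMAR = {
    "S": [["r", "X", "d"], ["r", "Z", "d"]],
    "X": [["o", "a"], ["e", "a"]],
    "Z": [["a", "i"]],
}

def match(symbols, text):
    """Try to derive exactly `text` from the symbol sequence `symbols`,
    backtracking over the alternatives of each non-terminal."""
    if not symbols:
        return text == ""                  # success only if all input consumed
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:                    # non-terminal: try each alternative
        return any(match(alt + rest, text) for alt in GRAMMAR[head])
    # terminal: must match the next input character, otherwise backtrack
    return text.startswith(head) and match(rest, text[len(head):])

print(match(["S"], "read"))   # True: S -> rXd with X -> ea, after backtracking
print(match(["S"], "red"))    # False: no alternative of X or Z matches "e" alone
```

The `any(...)` over alternatives is precisely the backtracking step: when X → oa fails on "ead", the parser retries with X → ea.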
b. Predictive Parser
A predictive parser is a recursive-descent parser that has the capability to predict which production is to be
used to replace the input string. The predictive parser does not suffer from backtracking. To accomplish its
tasks, the predictive parser uses a look-ahead pointer, which points to the next input symbol. To make the
parser backtracking-free, the predictive parser puts some constraints on the grammar and accepts only a class
of grammars known as LL(k) grammars.
Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree. Both the stack
and the input contain an end symbol $ to denote that the stack is empty and the input is consumed. The parser
refers to the parsing table to take any decision on the combination of input symbol and stack element.
In recursive descent parsing, the parser may have more than one production to choose from for a single
instance of input, whereas in predictive parser, each step has at most one production to choose. There might
be instances where there is no production matching the input string, making the parsing procedure to fail.
c. LL Parser
An LL Parser accepts LL grammar. LL grammar is a subset of context-free grammar but with some restrictions
to get the simplified version, in order to achieve easy implementation. LL grammar can be implemented by
means of both algorithms namely, recursive-descent or table-driven.
An LL parser is denoted LL(k). The first L in LL(k) means the input is parsed from left to right, the second L
stands for left-most derivation, and k represents the number of lookahead symbols. Generally k = 1, so LL(k)
is commonly written as LL(1).
LL Parsing Algorithm
We may stick to deterministic LL(1) for parser explanation, as the size of table grows exponentially with the
value of k. Secondly, if a given grammar is not LL(1), then usually, it is not LL(k), for any given k.
Input:
string ω
parsing table M for grammar G
Output:
If ω is in L(G) then left-most derivation of ω,
error otherwise.
repeat
    let X be the top stack symbol and a the symbol pointed to by ip
    if X ∈ Vt or X = $        /* X is a terminal or the end marker */
        if X = a
            POP X and advance ip
        else
            error()
        endif
    else                      /* X is a non-terminal */
        if M[X, a] = X → Y1 Y2 ... Yk
            POP X
            PUSH Yk, Yk-1, ..., Y1    /* Y1 on top */
            Output the production X → Y1 Y2 ... Yk
        else
            error()
        endif
    endif
until X = $                   /* empty stack */
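The algorithm can be sketched in Python for a concrete grammar. The grammar E → T E', E' → + T E' | ε, T → id and the hand-built parsing table below are illustrative assumptions, not part of the chapter's text:

```python
TABLE = {
    ("E", "id"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", "$"): [],                  # E' -> eps, predicted on end of input
    ("T", "id"): ["id"],
}
NONTERMINALS = {"E", "E'", "T"}

def ll1_parse(tokens, start="E"):
    """Table-driven LL(1) parser; returns the productions used, or None."""
    tokens = tokens + ["$"]
    stack = ["$", start]              # start symbol on top of the end marker
    ip = 0
    output = []                       # the leftmost derivation, as productions
    while stack[-1] != "$":
        x, a = stack[-1], tokens[ip]
        if x not in NONTERMINALS:     # terminal on top: must match the input
            if x != a:
                return None
            stack.pop()
            ip += 1
        elif (x, a) in TABLE:         # expand by the predicted production
            stack.pop()
            rhs = TABLE[(x, a)]
            stack.extend(reversed(rhs))    # push RHS, leftmost symbol on top
            output.append((x, rhs))
        else:
            return None               # no table entry: syntax error
    return output if tokens[ip] == "$" else None

print(ll1_parse(["id", "+", "id"]) is not None)   # True: accepted
print(ll1_parse(["id", "+"]) is not None)         # False: rejected
```

Because each (non-terminal, lookahead) pair selects at most one production, the parser never backtracks.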
As the name suggests, bottom-up parsing starts from the leaf nodes of a tree and works in upward direction
till it reaches the root node. Here, we start from a sentence and then apply production rules in reverse manner
in order to reach the start symbol.
S → aABe
A → Abc | b
B → d
a. Shift-Reduce Parsing
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as shift-step and
reduce-step.
Shift step: The shift step refers to the advancement of the input pointer to the next input symbol, which
is called the shifted symbol. This symbol is pushed onto the stack. The shifted symbol is treated as a
single node of the parse tree.
Reduce step: When the parser finds a complete right-hand side (RHS) of a grammar rule on top of the
stack and replaces it with the corresponding left-hand side (LHS), it is known as a reduce step. This
occurs when the top of the stack contains a handle. To reduce, a POP function is performed on the stack,
which pops off the handle and replaces it with the LHS non-terminal symbol.
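The two steps can be illustrated with a brute-force recognizer for the grammar above (S → aABe, A → Abc | b, B → d) that simply searches over shift and reduce moves. This is an illustration only; a real shift-reduce parser consults a parsing table to pick the right move instead of backtracking:

```python
RULES = [("S", "aABe"), ("A", "Abc"), ("A", "b"), ("B", "d")]

def shift_reduce(stack, text):
    """Accept if some sequence of shift/reduce moves turns the whole
    input into the start symbol S with no input left."""
    if stack == "S" and text == "":
        return True
    # reduce: if the top of the stack ends with some RHS (a handle
    # candidate), replace it by the corresponding LHS and continue
    for lhs, rhs in RULES:
        if stack.endswith(rhs) and shift_reduce(stack[:-len(rhs)] + lhs, text):
            return True
    # shift: push the next input symbol onto the stack
    return bool(text) and shift_reduce(stack + text[0], text[1:])

print(shift_reduce("", "abbcde"))   # reductions: b=>A, Abc=>A, d=>B, aABe=>S
print(shift_reduce("", "abbe"))     # no 'd' to reduce to B, so rejected
```

The search also explores wrong reductions (for example reducing the second b to A too early) and backtracks out of them; identifying the correct handle without search is exactly what the LR tables of the next section provide.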
LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It can handle a wide class of context-free
grammars, which makes it one of the most efficient syntax-analysis techniques. LR parsers are also known as
LR(k) parsers, where L stands for left-to-right scanning of the input stream, R stands for the construction of a
right-most derivation in reverse, and k denotes the number of lookahead symbols used to make decisions.
There are three widely used algorithms available for constructing an LR parser: SLR(1) (Simple LR), LALR(1) (Look-Ahead LR), and CLR(1) (Canonical LR).
LR Parsing Algorithm
token = next_token()
repeat forever
    s = top of stack
    if action[s, token] = shift s'
        PUSH token
        PUSH s'
        token = next_token()
    else if action[s, token] = reduce A → β
        POP 2 * |β| symbols
        s' = top of stack
        PUSH A
        PUSH goto[s', A]
    else if action[s, token] = accept
        return          /* parsing completed */
    else
        error()
LL vs. LR

LL: Does a leftmost derivation.
LR: Does a rightmost derivation in reverse.

LL: Starts with the root nonterminal on the stack.
LR: Ends with the root nonterminal on the stack.

LL: Ends when the stack is empty.
LR: Starts with an empty stack.

LL: Uses the stack for designating what is still to be expected.
LR: Uses the stack for designating what is already seen.

LL: Builds the parse tree top-down.
LR: Builds the parse tree bottom-up.

LL: Continuously pops a nonterminal off the stack, and pushes the corresponding right-hand side.
LR: Tries to recognize a right-hand side on the stack, pops it, and pushes the corresponding nonterminal.

LL: Expands the non-terminals.
LR: Reduces the non-terminals.

LL: Reads the terminals when it pops one off the stack.
LR: Reads the terminals while it pushes them on the stack.

LL: Pre-order traversal of the parse tree.
LR: Post-order traversal of the parse tree.
Questions for exercise: work through the following questions for more elaboration.
1. Which of the following derivations does a top-down parser use while parsing an input string? The
input is assumed to be scanned in left to right order.
(a) Leftmost derivation (c) Rightmost derivation
(b) Leftmost derivation traced out in reverse (d) Rightmost derivation traced out in reverse
Answer (a)
LL grammars are often classified by numbers, such as LL(1), LL(0) and so on. The number in the parenthesis
tells the maximum number of terminals we may have to look at a time to choose the right production at any
point in the grammar. The most common (and useful) kind of LL grammar is LL(1) where you can always
choose the right production by looking at only the first terminal on the input at any given time. With LL(2)
you have to look at two symbols, and so on. There exist grammars that are not LL(k) grammars for any fixed
value of k at all, and they are sadly quite common.
Let us see an example of top-down parsing for the following grammar. Let the input string be ax.
S -> Ax
A -> a
A -> b
An LL(1) parser starts with S and asks “which production should I attempt?” Naturally, it predicts the only
alternative of S. From there it tries to match A by calling method A (in a recursive-descent parser). Lookahead
a predicts production
A -> a
The parser matches a, returns to S and matches x. Done. The derivation tree is:
S
/ \
A x
|
a
Answer: (a)
If a grammar has more than one leftmost (or rightmost) derivation for a single sentential form, the grammar is
ambiguous. The leftmost and rightmost derivations for a sentential form may differ, even in an unambiguous
grammar
3. Which of the following grammar rules violate the requirements of an operator grammar? P, Q, R
are nonterminals, and r,s,t are terminals
Answer: (e)
Explanation:
An operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar. For
example, most calculators use operator precedence parsers to convert from the human-readable infix notation
with order of operations format into an internally optimized computer-readable format like Reverse Polish
notation (RPN). An operator precedence grammar is a kind of context-free grammar that can be parsed with
an operator-precedence parser. It has the property that no production has either an empty (ε) right-hand side or
two adjacent non-terminals in its right-hand side. These properties allow the terminals of the grammar to be
described by a precedence relation, and a parser that exploits that relation is considerably simpler than more
general-purpose parsers such as LALR parsers.
4. Consider the grammar with the following translation rules and E as the start symbol.
Answer: (c)
Explanation:
We can calculate the value by constructing the parse tree for the expression 2 # 3 & 5 # 6 & 4. Alternatively,
we can calculate it by considering the following precedence and associativity rules.
Precedence in a grammar is enforced by making sure that a production rule with higher precedence operator
will never produce an expression with operator with lower precedence.
In the given grammar ‘&’ has higher precedence than ‘#’.
Left associativity for operator * in a grammar is enforced by making sure that for a production rule like S ->
S1 * S2 in grammar, S2 should never produce an expression with *. On the other hand, to ensure right
associativity, S1 should never produce an expression with *. In the given grammar, both ‘#’ and & are left-
associative.
So the expression 2 # 3 & 5 # 6 & 4 will become ((2 # (3 & 5)) # (6 & 4)).
Applying the translation rules, we get ((2 * (3 + 5)) * (6 + 4)) = 160.
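The arithmetic can be checked directly, assuming as stated that # translates to * and & to +:

```python
# Grouping from precedence (& binds tighter than #) and left-associativity of #:
value = (2 * (3 + 5)) * (6 + 4)
print(value)   # prints 160
```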
1 Introduction to YACC
A parser generator is a program that takes as input a specification of a syntax, and produces as output a
procedure for recognizing that language. Historically, they are also called compiler-compilers.
YACC (Yet Another Compiler-Compiler) is an LALR(1) (Look-Ahead, Left-to-right scan, Rightmost-derivation-in-reverse,
with 1 lookahead token) parser generator. YACC was originally designed to be complemented
by Lex.
/* definitions */
....
%%
/* rules */
....
%%
/* auxiliary routines */
....
Input File: Definition Part: this part includes information about the tokens used in the syntax definition:
%token NUMBER
%token ID
Yacc also recognizes single characters as tokens. Therefore, assigned token numbers should not overlap
ASCII codes.
The definition part can include C code external to the definition of the parser and variable declarations,
within %{ and %} in the first column.
It can also include the specification of the starting symbol in the grammar:
%start nonterminal
Input File: Rule Part: this part contains grammar definition in a modified BNF form.
Input File:
If yylex() is not defined in the auxiliary routines sections, then it should be included:
#include "lex.yy.c"
If it contains the main() definition, it must be compiled to be executable. Otherwise, the code can be an
external function definition for the function int yyparse()
If called with the –d option in the command line, Yacc produces as output a header file y.tab.h with all its
specific definition (particularly important are token definitions to be included, for example, in a Lex input
file).
If called with the –v option, Yacc produces as output a file y.output containing a textual description of the
LALR(1) parsing table used by the parser. This is useful for tracking down how the parser solves conflicts.
%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double /* double type for yacc stack */
%}
%%
Lines : Lines S '\n' { printf("OK \n"); }
      | S '\n'
      | error '\n' { yyerror("Error: reenter last line:");
                     yyerrok; };
S     : '(' S ')'
      | '[' S ']'
      | /* empty */ ;
%%
#include "lex.yy.c"
void yyerror(char * s)
/* yacc error handler */
{
fprintf (stderr, "%s\n", s);
}
int main(void)
{
return yyparse();
}
The companion Lex specification:
%{
%}
%%
[ \t] { /* skip blanks and tabs */ }
\n|. { return yytext[0]; }
%%