Syntax Analysis
SITE: http://www.info.univ-tours.fr/~mirian/
The parser obtains a string of tokens from the lexical analyser and verifies that
the string can be generated by the grammar for the source language
The parser finds a derivation of a given sentence using the grammar or reports
that none exists.
The parser should
report any syntax errors in an intelligible fashion
recover from commonly occurring errors so that it can continue processing
the remainder of its input
Output of the parser: some representation of a parse tree for the stream of
tokens produced by the lexical analyser
[Figure: the lexical analyser hands a token to the parser on each "get next token" request; the parser builds a parse tree that is passed to the rest of the front end as an intermediate representation; both components consult the symbol table.]
A general algorithm for syntax analysis of a CFG has a high cost in terms of time
complexity: O(n³)
We need grammar classes allowing syntax analysis to be linear
In this context, there are two ways to build parse trees:
1. Top-down: build the parse trees from the top (root) to the bottom (leaves)
We have to decide which rule A → β should be applied to a node
labelled A
Expanding A results in new nodes (children of A) labelled by symbols
in β
2. Bottom-up: start from the leaves and work up to the root
We have to decide when rule A → β should be applied and we should
find neighbour nodes labelled by symbols in β
Reducing rule A → β results in adding a node A to the tree. A’s
children are the nodes labelled by the symbols in β.
In both cases, the input to the parser is scanned from left to right, one symbol at
a time.
Examples
Grammars are capable of describing most, but not all, of the syntax of
programming languages
Certain constraints on the input, such as the requirement that identifiers be
declared before being used, cannot be described by a CFG
Because each parsing method can handle grammars only of a certain form, the
initial grammar may have to be rewritten to make it parsable by the method
chosen
Eliminate ambiguity
Recursion
1. Left recursive grammar: it has a non terminal A such that there is a
derivation A ⇒+ Aα for some string α
2. Right recursive grammar: it has a non terminal A such that there is a
derivation A ⇒+ αA for some string α
Top-down parsing methods cannot handle left-recursive grammars, so a
transformation that eliminates left recursion is needed.
In many cases, by carefully writing a grammar, eliminating left recursion from it, and
left factoring the resulting grammar, we can obtain a grammar that can be parsed by
a recursive-descent parser that needs no backtracking
A → Aα | β is rewritten without left recursion as A → βA′ and A′ → αA′ | ǫ (see the sketch below)
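As an illustration, here is a minimal Python sketch (not from the slides) of the rewrite above for a single non terminal with immediate left recursion; the function name and the production representation (lists of symbols, with [] standing for ǫ) are assumptions of these notes.

def eliminate_immediate_left_recursion(A, productions):
    """Rewrite A -> A alpha | beta as A -> beta A' and A' -> alpha A' | epsilon."""
    A_prime = A + "'"                                                # fresh non terminal, assumed not to clash
    recursive = [p[1:] for p in productions if p and p[0] == A]      # the alpha parts
    non_recursive = [p for p in productions if not p or p[0] != A]   # the beta parts
    if not recursive:
        return {A: productions}                                      # nothing to do
    new_A = [beta + [A_prime] for beta in non_recursive]             # A  -> beta A'
    new_A_prime = [alpha + [A_prime] for alpha in recursive] + [[]]  # A' -> alpha A' | epsilon
    return {A: new_A, A_prime: new_A_prime}

# Example: E -> E + T | T   becomes   E -> T E'  and  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))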
When it is not clear which of the two alternative productions to use to expand a
nonterminal A, we may be able to rewrite the A-productions to defer the decision
until we have seen enough of the input to make the right choice
stmt → if expr then stmt else stmt
     | if expr then stmt
In general, if A → αβ1 | αβ2 are two A-productions, and the input begins with a
non empty string derived from α, we do not know whether to expand A to αβ1 or
to αβ2 .
We may defer the decision by expanding A to αA′ . Then after seeing the input
derived from α, we expand A′ to β1 or to β2
A → αA′
A′ → β1 | β2
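A minimal Python sketch (not from the slides) of one round of left factoring; the helper names and the grammar representation (productions as lists of symbols) are illustrative assumptions.

from collections import defaultdict

def common_prefix(seqs):
    """Longest prefix shared by all the symbol sequences in seqs."""
    prefix = []
    for column in zip(*seqs):
        if len(set(column)) == 1:
            prefix.append(column[0])
        else:
            break
    return prefix

def left_factor(A, productions):
    """For A-productions that start with the same symbol, pull out their longest
    common prefix alpha: A -> alpha A' and A' -> beta1 | beta2 | ..."""
    groups = defaultdict(list)
    for p in productions:
        groups[p[0] if p else None].append(tuple(p))
    result = {A: []}
    fresh = 0
    for key, group in groups.items():
        if key is None or len(group) == 1:
            result[A].extend(list(p) for p in group)            # nothing to factor here
            continue
        fresh += 1
        A_new = A + "'" * fresh                                 # fresh non terminal, assumed not to clash
        alpha = common_prefix(group)
        result[A].append(alpha + [A_new])                       # A  -> alpha A'
        result[A_new] = [list(p[len(alpha):]) for p in group]   # the beta_i (possibly [])
    return result

# Example from the slide:
prods = [["if", "expr", "then", "stmt", "else", "stmt"],
         ["if", "expr", "then", "stmt"]]
print(left_factor("stmt", prods))
# {'stmt': [['if', 'expr', 'then', 'stmt', "stmt'"]],
#  "stmt'": [['else', 'stmt'], []]}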
NOTATION
LL(n) Analysis
LR(n) Analysis
where
L: Left to right. The input string is analysed from left to right
L: Leftmost. Uses the leftmost derivation
R: Rightmost. Uses the rightmost derivation
n : the number of input symbols we need to know in order to perform the analysis
Example: an LL(2) grammar is one that can be analysed reading the input from left to right, using leftmost derivations, with 2 symbols of lookahead
To decide which production rules to use during the syntax analysis, three kinds of
information about the non terminals of the CFG are required.
For a non terminal A we want to know:
whether A derives the empty string ǫ
FIRST(A): the terminals that can begin a string derived from A
FOLLOW(A): the terminals that can appear immediately to the right of A in some derivation
Remark: The algorithms we are going to present apply to grammars that do not
have useless symbols or rules
Input: A grammar G
Output: Non terminals that generate ǫ are marked yes;
otherwise they are marked no
Algorithm 1:
Non terminals that derive the empty string
If G does not have any production of the form A → ǫ (for some non terminal A)
then all non terminals are marked with no
else apply the following steps until no new information can be generated
1. L = the list of all productions of G except those having a terminal in their right-hand side
(productions whose right-hand side contains a terminal cannot derive ǫ)
2. For each non terminal A without productions in L, mark it with no
3. While there is a production A → ǫ in L:
delete from L all productions with A at the left-hand side;
delete every occurrence of A from the right-hand sides of the productions in L;
mark A with yes
Remark: a production B → C becomes B → ǫ when the occurrence of C in its right-hand side is deleted.
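A minimal Python sketch of the same computation, written as a fixed point rather than the deletion-based steps above; the grammar representation (a dict from non terminals to lists of right-hand sides, with [] standing for ǫ) is an assumption of these notes.

def nullable_nonterminals(grammar):
    """Return the set of non terminals marked yes (they derive the empty string);
    every other non terminal is implicitly marked no."""
    nullable = set()
    changed = True
    while changed:                                  # until no new information can be generated
        changed = False
        for A, rhss in grammar.items():
            if A in nullable:
                continue
            for rhs in rhss:
                # A derives epsilon if some right-hand side consists only of symbols
                # already known to derive epsilon (trivially true for rhs == [])
                if all(X in nullable for X in rhs):
                    nullable.add(A)
                    changed = True
                    break
    return nullable

# Example: S -> AB, A -> a | epsilon, B -> b | epsilon
g = {"S": [["A", "B"]], "A": [["a"], []], "B": [["b"], []]}
print(nullable_nonterminals(g))                     # S, A and B are all marked yes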
Computing FIRST(A) for a non terminal A -
Starting terminal symbols
Let G = (V, T, P, S) be a CFG. Formally, we define the set of starting terminal symbols
of a non terminal A ∈ V as:
FIRST(A) = {a | a is a terminal and A ⇒∗ aα, for some string α ∈ (V ∪ T)∗}
Input: A grammar G
Output: The set of starting terminal symbols for all non terminals of G
Algorithm 2:
Computation of FIRST(A) for all non terminal A
Given a grammar G, we also need to introduce the concept of the set of starting terminal
symbols for a string α ∈ (V ∪ T)∗, FIRST(α):
If α = ǫ
then FIRST(α) = FIRST(ǫ) = ∅
If α is a terminal a
then FIRST(α) = FIRST(a) = {a}
If α is a non terminal
then FIRST(α) is computed by Algorithm 2
If α is a string Aβ and A is a non terminal that derives ǫ
then FIRST(α) = FIRST(Aβ) = FIRST(A) ∪ FIRST(β)
If α is a string Aβ and A is a non terminal that does not derive ǫ
then FIRST(α) = FIRST(Aβ) = FIRST(A)
If α is a string aβ where a is a terminal
then FIRST(α) = FIRST(aβ) = {a}
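A minimal Python sketch of both computations (FIRST of every non terminal as a fixed point, and FIRST of a string following the cases above); it reuses the grammar representation and the nullable set from the previous sketch, and the function names are illustrative.

def first_sets(grammar, terminals, nullable):
    """FIRST(A) for every non terminal A, computed as a fixed point."""
    first = {A: set() for A in grammar}
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for X in rhs:
                    new = {X} if X in terminals else first[X]
                    if not new <= first[A]:
                        first[A] |= new
                        changed = True
                    if X in terminals or X not in nullable:
                        break                       # later symbols cannot contribute
    return first

def first_of_string(alpha, terminals, nullable, first):
    """FIRST(alpha) for a string alpha (a list of symbols), following the cases above."""
    result = set()
    for X in alpha:
        if X in terminals:
            result.add(X)
            return result
        result |= first[X]
        if X not in nullable:
            return result
    return result                                   # alpha = epsilon (or only nullable symbols)

# Example: E -> TE', E' -> +TE' | epsilon, T -> a
g = {"E": [["T", "E'"]], "E'": [["+", "T", "E'"], []], "T": [["a"]]}
print(first_sets(g, {"+", "a"}, {"E'"}))            # FIRST(E) = FIRST(T) = {a}, FIRST(E') = {+}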
A non terminal can appear at the end of a string and, in this case, it does not
have a following symbol
Due to this fact and to treat this case as the others, we introduce a new terminal
symbol $ to indicate the end of a string
$ is a right endmarker and corresponds to the end of a file
This symbol is introduced as a production of S
We suppose that S$ ⇒∗ δAψ. Put $ in FOLLOW(S)
Input: A grammar G
Output: The set of following terminal symbols for all non terminals of G
Algorithm 3:
Computation of FOLLOW(A) for all non terminal A
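The steps of Algorithm 3 are not reproduced here; the following is a minimal Python sketch of the standard FOLLOW computation, reusing the grammar representation, the nullable set and the FIRST sets from the previous sketches (the names are illustrative).

def follow_sets(grammar, terminals, nullable, first, start):
    """FOLLOW(A) for every non terminal A, with $ as the right endmarker."""
    follow = {A: set() for A in grammar}
    follow[start].add("$")                          # $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, X in enumerate(rhs):
                    if X in terminals:
                        continue
                    trailer = set()                 # FIRST of what follows X in this rhs
                    all_nullable = True
                    for Y in rhs[i + 1:]:
                        trailer |= {Y} if Y in terminals else first[Y]
                        if Y in terminals or Y not in nullable:
                            all_nullable = False
                            break
                    if all_nullable:                # X can end a string derived from A
                        trailer |= follow[A]
                    if not trailer <= follow[X]:
                        follow[X] |= trailer
                        changed = True
    return follow

# With the small grammar of the previous sketch and start symbol E:
# FOLLOW(E) = {$}, FOLLOW(E') = {$}, FOLLOW(T) = {+, $}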
We can consider the problem of how to choose the production to be used during
the syntax analysis
To this end we use a predictive parsing table for a given grammar G
To build this table we use the following algorithm whose main ideas are:
1. Suppose A → α is a production with a in FIRST(α).
Then, the parser will expand A by α when the current input symbol is a
2. A complication occurs when α = ǫ or α ⇒∗ ǫ. In this case, we should again
expand A by α if the current input symbol is in FOLLOW(A)
Input: Grammar G
Output: Parsing table M
Algorithm: Construction of a predictive parsing table
The algorithm for constructing a predictive parsing table can be applied to any
grammar G to produce a parsing table M
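A minimal Python sketch of this construction, building on the helpers sketched earlier; their names and the table representation (a dict indexed by (non terminal, terminal) whose entries are lists of productions) are assumptions of these notes. Multiply-defined entries simply show up as lists holding more than one production.

def build_predictive_table(grammar, terminals, nullable, first, follow):
    """Parsing table M: entry M[A, a] holds every production A -> rhs chosen for input a."""
    table = {}
    for A, rhss in grammar.items():
        for rhs in rhss:
            # terminals that can begin a string derived from rhs (its FIRST set)
            rhs_first = set()
            rhs_nullable = True
            for X in rhs:
                rhs_first |= {X} if X in terminals else first[X]
                if X in terminals or X not in nullable:
                    rhs_nullable = False
                    break
            for a in rhs_first:                     # idea 1 above
                table.setdefault((A, a), []).append(rhs)
            if rhs_nullable:                        # idea 2 above: rhs derives epsilon
                for a in follow[A]:                 # includes $ when relevant
                    table.setdefault((A, a), []).append(rhs)
    return table

def is_ll1(table):
    """The grammar is LL(1) exactly when no entry is multiply defined."""
    return all(len(entries) == 1 for entries in table.values())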
For some grammars however, M may have some entries that are multiply-defined
If G is left recursive or ambiguous, then M will have at least one multiply-defined
entry
A grammar whose parsing table has no multiply-defined entries is said to be LL(1)
LL(1) grammars have several distinctive properties
No ambiguous or left-recursive grammar can be LL(1)
A grammar G is LL(1) iff whenever A → α | β are two distinct productions
of G the following conditions hold:
1. For no terminal a do both α and β derive strings beginning with a
2. At most one of α and β can derive the empty string
3. If β ⇒∗ ǫ, then α does not derive any string beginning with a terminal
in FOLLOW(A)
E → T E′          E′ → +T E′ | ǫ
T → F T′          T′ → ∗F T′ | ǫ
F → (E) | a
is LL(1)
Constructs a parse tree for an input string beginning at the leaves (the bottom)
and working up towards the root (the top)
Reduction of a string w to the start symbol of a grammar
At each reduction step a particular substring matching the right side of a
production is replaced by the symbol on the left of that production
If the substring is chosen correctly at each step, a rightmost derivation is traced
out in reverse
Handles: A handle of a string is a substring that matches the right side of a
production and whose reduction to the nonterminal on the left side of the
production represents one step along the reverse of a rightmost derivation
Grammar:
S → aABe    A → Abc | b    B → d
The sentence abbcde can be reduced to S as follows:
abbcde
aAbcde
aAde
aABe
S
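To make the idea of choosing reductions concrete, here is a minimal Python sketch (illustration only, not a practical parser and not from the slides) that reduces a string to the start symbol by trying every possible reduction and backtracking from dead ends; for the toy grammar above it reproduces exactly the sequence of sentential forms shown.

GRAMMAR = [("S", ["a", "A", "B", "e"]),             # S -> aABe
           ("A", ["A", "b", "c"]),                  # A -> Abc
           ("A", ["b"]),                            # A -> b
           ("B", ["d"])]                            # B -> d

def reduce_to_start(symbols, start="S", failed=None):
    """Return the list of sentential forms from symbols down to start, or None."""
    if failed is None:
        failed = set()
    if symbols == [start]:
        return [symbols]
    key = tuple(symbols)
    if key in failed:
        return None
    failed.add(key)                                 # remember forms already explored
    for head, body in GRAMMAR:
        n = len(body)
        for i in range(len(symbols) - n + 1):
            if symbols[i:i + n] == body:            # a candidate handle
                reduced = symbols[:i] + [head] + symbols[i + n:]
                rest = reduce_to_start(reduced, start, failed)
                if rest is not None:
                    return [symbols] + rest
    return None

for form in reduce_to_start(list("abbcde")):
    print("".join(form))                            # abbcde, aAbcde, aAde, aABe, S

A real bottom-up parser avoids this search: the LR techniques presented next decide the correct handle with a parsing table and a single left-to-right scan.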
LR Parsers
[Figure: model of an LR parser - the input buffer a1 a2 . . . ai . . . an $, a stack holding states and grammar symbols (sm on top, s0 at the bottom), and a driver program that consults a parsing table with an action part and a goto part.]
The driver program is the same for all LR parsers; only the parsing table
changes from one parser to another.
The program uses a stack to store a string of the form
s0 X1 s1 X2 s2 . . . Xm sm
1. If action[sm , ai ] = shift s, then the parser executes a shift move, entering the
configuration
(s0 X1 s1 X2 s2 . . . Xm sm ai s, ai+1 . . . an $)
The parser shifts both the current input symbol ai and the next state s,
which is given by action[sm , ai ], onto the stack.
ai+1 becomes the current input symbol
2. If action[sm , ai ] = reduce A → β then the parser executes a reduce move,
entering the configuration (s0 X1 s1 X2 s2 . . . Xm−r sm−r A s, ai ai+1 . . . an $)
where s = goto[sm−r , A] and r is the length of β
The parser first popped 2r symbols off the stack (r state symbols and r
grammar symbols), exposing state sm−r
The parser pushed both A (the left side of the production) and s (the entry
goto[sm−r , A]) onto the stack
The current input symbol is not changed in a reduce move
3. If action[sm , ai ] = accept, parsing is completed
4. If action[sm , ai ] = error, the parser has discovered an error
Algorithm: LR parsing
Input: An input string w and an LR parsing table with functions action and goto for
grammar G.
Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in
the input buffer.
The parser executes the following program until an accept or error action is encountered.
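The driver program itself is not reproduced on the slides; the following is a minimal Python sketch of it, following the four cases listed above. The table encoding is an assumption of these notes: action maps (state, terminal) to ("shift", s), ("reduce", (A, body)) or "accept" (a missing entry means error), and goto maps (state, non terminal) to a state.

def lr_parse(tokens, action, goto_table, start_state=0):
    """Generic shift-reduce driver for any LR parsing table."""
    stack = [start_state]                           # holds s0 X1 s1 X2 s2 ... Xm sm
    buffer = list(tokens) + ["$"]
    i = 0
    while True:
        s, a = stack[-1], buffer[i]
        move = action.get((s, a))
        if move is None:                            # case 4: the parser has discovered an error
            raise SyntaxError("no action for state %s on %r" % (s, a))
        if move == "accept":                        # case 3: parsing is completed
            return True
        if move[0] == "shift":                      # case 1: push ai and the next state
            stack += [a, move[1]]
            i += 1                                  # ai+1 becomes the current input symbol
        else:                                       # case 2: reduce by A -> beta
            A, body = move[1]
            if body:
                del stack[-2 * len(body):]          # pop 2r symbols, exposing s_{m-r}
            stack += [A, goto_table[(stack[-1], A)]]   # push A and goto[s_{m-r}, A]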
The value of goto[s, a] for terminal a is found in the action field connected with
the shift action on input a for state s
The goto field gives goto[s, A] for non terminals A
We have not yet seen how the entries for the parsing table were selected
1. SLR : simple LR
The weakest of the three in terms of number of grammars for which it succeeds,
but is the easiest to implement
SLR table: parsing table constructed by this method
SLR parser
SLR grammar
2. Canonical LR : the most powerful and the most expensive
3. LALR (Lookahead LR): Intermediate in power and cost
LALR method works on most programming-language grammars and, with some
effort can be implemented efficiently
Definitions
LR(0) item or item (for short) of a grammar G:
A production of G with a dot at some position at the right side.
Intuitively: an item indicates how much of a production we have seen at a given point in
the parsing process
Example:
A → X.Y Z
indicates that we have just seen on the input a string derivable from X and that we hope
next to see a string derivable from Y Z
If I is a set of items for a grammar G, then closure(I) is the set of items constructed
from I by two rules:
1. Initially, every item in I is added to closure(I)
2. If A → α.Bβ is in closure(I) and B → γ is a production, then add the item B → .γ
to closure(I), if it is not already there; apply this rule until no more new items can be added
Intuitively:
(i) A → α.Bβ in closure(I) indicates that, at this point of the parsing, we think we might
next see a substring derivable from Bβ as input.
(ii) If B → γ is a production we expect we might see a substring derivable from γ at this
point.
E → E + .T
T → .T ∗ F
T → .F
F → .(E)
F → .id
We compute goto(I, +) by examining I for items with + immediately to the right of the
dot
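A minimal Python sketch of closure and goto for LR(0) items; the item representation as (head, body, dot position) tuples and the function names are assumptions of these notes. Run on the expression grammar above, the closure of {E → E + .T} reproduces exactly the item set just shown (with a standing for id).

def closure(items, grammar):
    """Apply rule 2 above until no more items can be added."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in grammar:        # dot just before a non terminal B
                for gamma in grammar[body[dot]]:
                    item = (body[dot], tuple(gamma), 0)         # the item B -> .gamma
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, X, grammar):
    """goto(I, X): advance the dot over X in every item that has X right after the dot,
    then take the closure of the result."""
    moved = {(head, body, dot + 1)
             for head, body, dot in items
             if dot < len(body) and body[dot] == X}
    return closure(moved, grammar)

G = {"E": [["E", "+", "T"], ["T"]],
     "T": [["T", "*", "F"], ["F"]],
     "F": [["(", "E", ")"], ["a"]]}
for item in sorted(closure({("E", ("E", "+", "T"), 2)}, G)):
    print(item)                 # E -> E+.T, F -> .(E), F -> .a, T -> .F, T -> .T*F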
For every grammar G, the goto function of the canonical collection of sets of
items defines a deterministic finite state automaton D
The DFA D is obtained from a NFA N by the subset construction
States of N are the items
There is a transition from A → α.Xβ to A → αX.β labelled X, and there is
a transition from A → α.Bβ to B → .γ labelled ǫ
The closure(I) of a set of items I (states of N) is the ǫ-closure of the
corresponding set of NFA states
goto(I, X) gives the transition from I on symbol X in the DFA constructed from
N by the subset construction
Output: The SLR parsing table functions action and goto for G′
In the SLR method:
State i calls for reduction by A → α on input a if:
the set of items Ii contains the item [A → α.] and
a ∈ FOLLOW(A)
In some situations, when state i appears on the top of the stack, the viable prefix
βα on the stack is such that βA cannot be followed by a in a right-sentential form
Thus, the reduction A → α would be invalid on input a
State 2 (which is the state corresponding to viable prefix L only) should not call
for reduction of that L to R
Remarks
It is possible to carry more information in the state to rule out some invalid
reductions
By splitting states when necessary we can arrange to have each state of an LR
parser indicate exactly which input symbols can follow α such that there is a
possible reduction to A
The extra information is incorporated into the state by redefining items to include
a terminal symbol as a second component
LR(1) item: [A → α.β, a] where A → α.β is a production and a is a terminal or
the right endmarker $
In LR(1), the 1 refers to the length of the second component (lookahead of the
item)
The lookahead has no effect in an item of the form [A → α.β, a], where β is not ǫ
An item of the form [A → α., a] calls for reduction by A → α only if the next input
symbol is a
Thus, we reduce by A → α only on those input symbols a for which [A → α., a]
is in an LR(1) item in the state on the top of the stack
The set of such a’s will always be a subset of FOLLOW(A), but it could be a proper subset
Formally:
LR(1) item [A → α.β, a] is valid for a viable prefix γ if there is a rightmost
derivation:
S ⇒∗ δAw ⇒ δαβw
where:
1. γ = δα and
2. either a is the first symbol of w, or
w is ǫ and a is $
Method for constructing the collection of sets of valid LR(1) items is essentially
the same as the way we built the canonical collection of sets of LR(0) items. We
only need to modify the two procedures closure and goto
S ⇒∗ γBby ⇒ γµby
Output: The sets of LR(1) items that are the set of items valid for one or more viable
prefixes of G′
Method: The procedures closure and goto and the main routine items for constructing the
sets of items are shown in the following
function closure(I);
begin
repeat
for each item [A → α.Bβ, a] in I,
each production B → γ in G′ , and
each terminal b ∈ FIRST(βa) such that [B → .γ, b] ∉ I do
add [B → .γ, b] to I;
until no more items can be added to I;
return (I)
end
procedure items(G’);
begin
C := {closure({[S ′ → .S, $]})}
repeat
for each set of items I in C and each grammar symbol X
such that goto(I, X) is not empty and not in C do
add goto(I, X) to C
until no more sets of items can be added to C
end
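The goto procedure used by items above is referenced but not reproduced; here is a minimal Python sketch of it. The item representation as (head, body, dot, lookahead) tuples is an assumption of these notes, and the LR(1) closure function of the pseudocode above is passed in as a parameter so the sketch stays independent of its concrete implementation.

def goto_lr1(items, X, closure):
    """goto(I, X) for LR(1) items: advance the dot over X in every item of I that
    has X immediately to the right of the dot, then close the resulting set."""
    moved = {(head, body, dot + 1, lookahead)
             for head, body, dot, lookahead in items
             if dot < len(body) and body[dot] == X}
    return closure(moved)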
Output: The canonical LR parsing table functions action and goto for G′
We consider the grammar G = ({S, C}, {c, d}, P, S) where the productions are:
(1) S → CC (2) C → cC (3) C → d
The augmented grammar G′ also has the rule S′ → S
Canonical LR parsing table for the grammar:

State        action                    goto
         c        d        $         S      C
0        s3       s4                 1      2
1                          accept
2        s6       s7                        5
3        s3       s4                        8
4        r3       r3
5                          r1
6        s6       s7                        9
7                          r3
8        r2       r2
9                          r2
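The same table, encoded in the representation assumed by the lr_parse sketch given earlier (that encoding is an assumption of these notes, not part of the slides), together with the trace it produces on the sentence cdd.

action = {
    (0, "c"): ("shift", 3), (0, "d"): ("shift", 4),
    (1, "$"): "accept",
    (2, "c"): ("shift", 6), (2, "d"): ("shift", 7),
    (3, "c"): ("shift", 3), (3, "d"): ("shift", 4),
    (4, "c"): ("reduce", ("C", ["d"])),      (4, "d"): ("reduce", ("C", ["d"])),
    (5, "$"): ("reduce", ("S", ["C", "C"])),
    (6, "c"): ("shift", 6), (6, "d"): ("shift", 7),
    (7, "$"): ("reduce", ("C", ["d"])),
    (8, "c"): ("reduce", ("C", ["c", "C"])), (8, "d"): ("reduce", ("C", ["c", "C"])),
    (9, "$"): ("reduce", ("C", ["c", "C"])),
}
goto_table = {(0, "S"): 1, (0, "C"): 2, (2, "C"): 5, (3, "C"): 8, (6, "C"): 9}

# lr_parse("cdd", action, goto_table) accepts, performing the reductions
# C -> d, C -> cC, C -> d, S -> CC (a rightmost derivation of cdd in reverse).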
Construction LALR Parsing Tables
Lookahead-LR technique
Often used in practice because the tables obtained by it are considerably smaller
than the canonical LR tables
Common syntactic constructs of programming languages can be expressed
conveniently by an LALR grammar
The same is almost true for SLR grammars - but there are a few constructs that
cannot be conveniently handled by SLR techniques
Parser size
SLR and LALR tables for a grammar: always have the same number of
states
For a language like Pascal
SLR and LALR: several hundred states
LR: several thousand states
Example: the sets of items {[C → d., c/d]} and {[C → d., $]} obtained for the grammar above have the same core, C → d.
We can look for sets of LR(1) items having the same core (set of first
components) and we may merge these sets
A core is a set of LR(0) items for the grammar at hand, and an LR(1) grammar
may produce more than two sets of items with the same core
Suppose we have an LR(1) grammar (i.e., one whose sets of LR(1) items
produce no parsing action conflicts)
If we replace all states having the same core with their union, it is possible that
the resulting union will have a conflict
But, as the following argument shows, a shift/reduce conflict cannot arise:
Suppose the union has a shift/reduce conflict on lookahead a:
an item [A → α., a] calls for a reduction by A → α, and
an item [B → β.aγ, b] calls for a shift
Then some set of items from which the union was formed has the item
[A → α., a], and, since the cores of all these states are the same, it must
also contain an item [B → β.aγ, c] for some c.
But then this state has the same shift/reduce conflict on a, contradicting the
assumption that the grammar is LR(1)
Thus:
1. The merging of states with common cores can never produce a
shift/reduce conflict, because shift actions depend only on the core, not on
the lookahead
2. It is possible that a merger will produce a reduce/reduce conflict