Anyone is free to download and print the PDF edition of this book for personal use. Commercial distribution, printing, or reproduction without the author's consent is expressly prohibited. All other rights are reserved. You can find the latest version of the PDF edition, and purchase inexpensive hardcover copies at http://compilerbook.org
Chapter 4 – Parsing
4.1 Overview
4.2 Context Free Grammars
Grammar G2
1. P → E
2. E → E + E
3. E → ident
4. E → int
[Figure: two distinct parse trees for the same sentence of Grammar G2, each rooted at P with E + E expanded differently, showing that the grammar is ambiguous.]
Grammar G3
1. P → E
2. E → E + T
3. E → T
4. T → ident
5. T → int

Grammar G4
1. P → E
2. E → E + T
3. E → T
4. T → T * F
5. T → F
6. F → ident
7. F → int
Grammar G5
1. P → S
2. S → if E then S
3. S → if E then S else S
4. S → other
Do this now:
Write out the two possible parse trees for this sentence:
if E then if E then other else other.
4.3 LL Grammars
LL(1) grammars are a subset of CFGs that are easy to parse with simple
algorithms. A grammar is LL(1) if it can be parsed by considering only
one non-terminal and the next token in the input stream.
To ensure that a grammar is LL(1), we must do the following:
- Remove any ambiguity from the grammar.
- Eliminate any left recursion.
- Eliminate any common left prefixes.
Once we have taken those steps, then we can prove that it is LL(1) by generating the FIRST and FOLLOW sets for the grammar, and using them to create the LL(1) parse table. If the parse table contains no conflicts, then the grammar is clearly LL(1).
A left-recursive rule of the form A → Aα1 | Aα2 | ... | β1 | β2 | ... can be eliminated by substituting:
A → β1 A' | β2 A' | ...
A' → α1 A' | α2 A' | ... | ε
Grammar G6
1. P → E
2. E → T E'
3. E' → + T E'
4. E' → ε
5. T → ident
6. T → int
Rules that share a common left prefix, such as A → αβ1 | αβ2 | ..., can be rewritten by substituting:
A → α A'
A' → β1 | β2 | ...
Grammar G7
1. P → E
2. E → id
3. E → id [ E ]
4. E → id ( E )

Grammar G8
1. P → E
2. E → id E'
3. E' → [ E ]
4. E' → ( E )
5. E' → ε
For terminals:
For each terminal a ∈ Σ: FIRST(a) = {a}

For non-terminals:
Repeat:
    For each rule X → Y1 Y2 ... Yk in the grammar G:
        Add a to FIRST(X)
            if a is in FIRST(Y1)
            or a is in FIRST(Yn) and Y1 ... Yn−1 ⇒ ε.
        If Y1 ... Yk ⇒ ε then add ε to FIRST(X).
until no more changes occur.
Repeat:
    If A → αBβ then:
        add FIRST(β) (excepting ε) to FOLLOW(B).
    If A → αB, or A → αBβ where FIRST(β) contains ε, then:
        add FOLLOW(A) to FOLLOW(B).
until no more changes occur.
Grammar G9
1. P → E
2. E → T E'
3. E' → + T E'
4. E' → ε
5. T → F T'
6. T' → * F T'
7. T' → ε
8. F → ( E )
9. F → int
int parse_P() {
    return parse_E() && expect_token(TOKEN_EOF);
}

int parse_E() {
    return parse_T() && parse_E_prime();
}

int parse_E_prime() {
    token_t t = scan_token();
    if (t == TOKEN_PLUS) {
        return parse_T() && parse_E_prime();
    } else {
        putback_token(t);
        return 1;
    }
}

int parse_T() {
    return parse_F() && parse_T_prime();
}

int parse_T_prime() {
    token_t t = scan_token();
    if (t == TOKEN_MULTIPLY) {
        return parse_F() && parse_T_prime();
    } else {
        putback_token(t);
        return 1;
    }
}

int parse_F() {
    token_t t = scan_token();
    if (t == TOKEN_LPAREN) {
        return parse_E() && expect_token(TOKEN_RPAREN);
    } else if (t == TOKEN_INT) {
        return 1;
    } else {
        printf("parse error: unexpected token %s\n",
               token_string(t));
        return 0;
    }
}
For example, here is the parse table for Grammar G9. Notice that the entries for P, E, T, and F are straightforward: each can only start with int or (, and so these tokens cause the rules to descend toward F and a choice between rule 8 (F → ( E )) and rule 9 (F → int). The entry for E' is a little more complicated: a + token results in applying E' → + T E', while ) or $ indicates E' → ε.
Now we have all the pieces necessary to operate the parser. Informally,
the idea is to keep a stack that tracks the current state of the parser. In each
step, we consider the top element of the stack and the next token on the
input. If they match, then pop the stack, accept the token and continue.
If not, then consult the parse table for the next rule to apply. If we can
continue until the end-of-file symbol is matched, then the parse succeeds.
Create a stack S.
Push $ and P onto S.
Let c be the first token on the input.
While S is not empty:
    Let X be the top element of the stack.
    If X matches c:
        Pop X from the stack and advance c to the next token.
    Else if X is a terminal that does not match c:
        Halt with a parse error.
    Else if the parse table entry T[X, c] is a rule X → α:
        Pop X from the stack and push the symbols of α in reverse order.
    Else:
        Halt with a parse error.
4.4 LR Grammars
While LL(1) grammars and top-down parsing techniques are easy to work with, they are not able to represent all of the structures found in many programming languages. For more general-purpose programming languages, we must use an LR(1) grammar and associated bottom-up parsing techniques.
LR(1) is the set of grammars that can be parsed via shift-reduce techniques with a single character of lookahead. LR(1) is a super-set of LL(1) and can accommodate left recursion and common left prefixes, which are not permitted in LL(1). This enables us to express many programming constructs in a more natural way. (An LR(1) grammar must still be non-ambiguous, and it cannot have shift-reduce or reduce-reduce conflicts, which we will explain below.)
For example, Grammar G10 is an LR(1) grammar:
Grammar G10
1. P → E
2. E → E + T
3. E → T
4. T → id ( E )
5. T → id
[Table: the FIRST and FOLLOW sets of P, E, and T for Grammar G10.]
While this example shows that there exists a derivation for the sentence, it does not explain how each action was chosen at each step. For example, in the second step, we might have chosen to reduce id to T instead of shifting a left parenthesis. This would have been a bad choice, because there is no rule that begins with T (, but that was not immediately obvious without attempting to proceed further. To make these decisions, we must analyze LR(1) grammars in more detail.
Kernel of State 0
P → . E
Then, we compute the closure of the state as follows. For each item
in the state with a non-terminal X immediately to the right of the dot, we
add all rules in the grammar that have X as the left hand side. The newly
added items have a dot at the beginning of the right hand side.
P → . E
E → . E + T
E → . T

Closure of State 0
P → . E
E → . E + T
E → . T
T → . id ( E )
T → . id
You can think of the state this way: It describes the initial state of the
parser as expecting a complete program in the form of an E. However, an
E is known to begin with an E or a T , and a T must begin with an id. All
of those symbols could represent the beginning of the program.
From this state, all of the symbols (terminals and non-terminals both) to the right of the dot are possible outgoing transitions. If the automaton takes that transition, it moves to a new state containing the matching items, with the dot moved one position to the right. The closure of the new state is computed, possibly adding new rules as described above.
For example, from state zero, E, T, and id are the possible transitions, because each appears to the right of the dot in some rule. Here are the states for each of those transitions:

Transition on E:
P → E .
E → E . + T

Transition on T:
E → T .

Transition on id:
T → id . ( E )
T → id .
Figure 4.3 gives the complete LR(0) automaton for Grammar G10. Take a moment now to trace over the table and be sure that you understand how it is constructed.
No, really. Stop now and study the figure carefully before continuing.
[Figure 4.3: The LR(0) automaton for Grammar G10. From the start state (State 0), transitions on E, T, and id lead through States 1 through 8; State 1 (P → E .) is the accepting state, and States 3, 7, and 8 hold the completed items E → E + T ., T → id ( E ) ., and E → T .]
The LR(0) automaton tells us the choices available at any step of bottom-up parsing. When we reach a state containing an item with a dot at the end of the rule, that indicates a possible reduction. A transition on a terminal that moves the dot one position to the right indicates a possible shift. While the LR(0) automaton tells us the available actions at each step, it does not always tell us which action to take.1
Two types of conflicts can appear in an LR grammar:
A shift-reduce conflict indicates a choice between a shift action and a reduce action. For example, state 4 offers a choice between shifting a left parenthesis and reducing by rule five:

Shift-Reduce Conflict:
T → id . ( E )
T → id .

A reduce-reduce conflict indicates a choice between two distinct reductions in the same state; we will see an example of this shortly.
1 The 0 in LR(0) indicates that it uses zero lookahead tokens, which is a way of saying that
it does not consider the next token before making a reduction. While it is possible to write
out a grammar that is strictly LR(0), such a grammar has very limited utility.
Naturally, each entry in the table can be occupied by only one action. If following the procedure results in a table with more than one action in a given entry, then you can conclude that the grammar is not SLR. (It might still be LR(1) – more on that below.)
Here is the SLR parse table for Grammar G10. Note carefully the states 1 and 4, where there is a choice between shifting and reducing. In state 1, a lookahead of + causes a shift, while a lookahead of $ results in a reduction P → E, because $ is the only member of FOLLOW(P).
Now we are ready to parse an input by following the SLR parsing algorithm. The parse requires maintaining a stack of states in the LR(0) automaton, initially containing the start state S0. Then, we examine the top of the stack and the lookahead token, and take the action indicated by the SLR parse table. On a shift, we consume the token and push the indicated state on the stack. On a reduce by A → β, we pop states from the stack corresponding to each of the symbols in β, then take the additional step of moving to state GOTO[t, A]. This process continues until we either succeed by reducing to the start symbol, or fail by encountering an error state.
Let a be the first input token.
Loop:
    Let s be the top of the stack.
    If ACTION[s, a] is accept:
        Parse complete.
    Else if ACTION[s, a] is shift t:
        Push state t on the stack.
        Let a be the next input token.
    Else if ACTION[s, a] is reduce A → β:
        Pop states corresponding to β from the stack.
        Let t be the top of stack.
        Push GOTO[t, A] onto the stack.
    Otherwise:
        Halt with a parse error.
Grammar G11
1. S → V = E
2. S → id
3. V → id
4. V → id [ E ]
5. E → V
We need only build part of the LR(0) automaton to see the problem:
State 0
S → . V = E
S → . id
V → . id
V → . id [ E ]

On id, the automaton moves to a state whose items permit two different reductions:

State 1
S → id .
V → id .
V → id . [ E ]
Here is an example for Grammar G11. The kernel of the start state consists of the start symbol with a lookahead of $:

Kernel of State 0
S → . V = E {$}
S → . id {$}

The closure of the start state is computed by adding the rules for V with a lookahead of =, because = follows V in rule 1:

Closure of State 0
S → . V = E {$}
S → . id {$}
V → . id {=}
V → . id [ E ] {=}
Closure of State 1
S → id . {$}
V → id . {=}
V → id . [ E ] {=}
Now you can see how the lookahead solves the reduce-reduce conflict.
When the next token on the input is $, we can only reduce by S → id.
When the next token is =, we can only reduce by V → id. By tracking
lookaheads in a more fine-grained manner than SLR, we are able to parse
arbitrary LR(1) grammars.
Figure 4.6 gives the complete LR(1) automaton for Grammar G10. Take a moment now to trace over the table and be sure that you understand how it is constructed.
One aspect of state zero is worth clarifying. When constructing the closure of a state, we must consider all rules in the grammar, including the rule corresponding to the item under closure. The item E → . E + T is initially added with a lookahead of {$}. Then, evaluating that item, we add all rules that have E on the left hand side, adding a lookahead of {+}. So, we add E → . E + T again, this time with a lookahead of {+}, resulting in a single item with a lookahead set of {$, +}.
Once again: Stop now and study the figure carefully before continuing.
[Figure 4.6: The LR(1) automaton for Grammar G10. States 0 through 8 mirror the LR(0) automaton and carry the lookahead set {$, +}, while the states reached inside parentheses (States 9 through 15) carry the lookahead set {+, )}, so the LR(1) automaton distinguishes contexts that the LR(0) automaton merges.]
E → . E + T {$, +}        E → . E + T {), +}
E → . T {$, +}            E → . T {), +}

merge into a single LALR state:

E → . E + T {$, ), +}
E → . T {$, ), +}
The resulting LALR automaton has the same number of states as the LR(0) automaton, but has more precise lookahead information available for each item. While this may seem a minor distinction, experience has shown this simple improvement to be highly effective at obtaining the efficiency of SLR parsing while accommodating a large number of practical grammars.
Now that you have some experience working with different kinds of grammars, let's step back and review how they relate to each other.
4.6 The Chomsky Hierarchy
An LL(2) parser would require a row for pairs {aa, ab, ac, ...}.
Context free languages are those described by context free grammars, where each rule has a single non-terminal on the left hand side, and a mix of terminals and non-terminals on the right hand side. We call these “context free” because the meaning of a non-terminal is the same in all places where it appears. As you have learned in this chapter, a CFG requires a pushdown automaton, which is achieved by coupling a finite automaton with a stack. If the grammar is ambiguous, the automaton will be non-deterministic and therefore impractical. In practice, we restrict ourselves to using subsets of CFGs (like LL(1) and LR(1)) that are non-ambiguous and result in a deterministic automaton that completes in bounded time.
Context sensitive languages are those described by context sensitive grammars, where each rule can be of the form αAβ → αγβ. We call these “context sensitive” because the interpretation of a non-terminal is controlled by the context in which it appears. Context sensitive languages require a non-deterministic linear bounded automaton, which is bounded in memory consumption, but not in execution time. Context sensitive languages are not very practical for computer languages.
Recursively enumerable languages are the least restrictive set of languages, described by rules of the form α → β, where α and β can be any combination of terminals and non-terminals. These languages can only be recognized by a full Turing machine, and are the least practical of all.
4.7 Exercises
Grammar G12
1. P → S
2. P → S P
3. S → if E then S
4. S → if E then S else S
5. S → while E S
6. S → begin P end
7. S → print E
8. S → E
9. E → id
10. E → integer
11. E → E + E
(a) Point out all aspects of Grammar G12 which are not LL(1).
(b) Write a new grammar which accepts the same language, but
avoids left recursion and common left prefixes.
(c) Write the FIRST and FOLLOW sets for the new grammar.
(d) Write out the LL(1) parse table for the new grammar.
(e) Is the new grammar an LL(1) grammar? Explain your answer
carefully.
4. Consider the following grammar:
Grammar G13
1. S → id = E
2. E → E + P
3. E → P
4. P → id
5. P → ( E )
6. P → id ( E )
T
T & T | F
( F -> F ) -> T
F
! ( T | F )
( T -> F ) & T
10. Write a hand-coded parser that reads in regular expressions and outputs the corresponding NFA, using the Graphviz [2] DOT language.
11. Write a parser-construction tool that reads in an LL(1) grammar and
produces working code for a table-driven parser as output.
4.8 Further Reading