03 Lexing Parsing
Introduction to Compilers
Fall 2012
Source code → Lexer → Parser → Types → AST/IR
2
Lexing vs. Parsing
• Language grammars usually split into two levels
■ Tokens — the “words” that make up “parts of speech”
- Ex: Identifier [a-zA-Z_]+
- Ex: Number [0-9]+
■ Programs, types, statements, expressions, declarations,
definitions, etc — the “phrases” of the language
- Ex: if (expr) expr;
- Ex: def id(id, ..., id) expr end
• Tokens are identified by the lexer
■ Regular expressions
• Everything else is done by the parser
■ Uses grammar in which tokens are primitives
■ Implementations can look inside tokens where needed
3
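Since the deck's lexers are written in ocamllex, here is a small language-agnostic sketch of the same idea in Python: each token is a named regular expression, and the lexer repeatedly matches the longest alternative at the current position. The token names and the combined-regex trick are illustrative choices, not part of any particular tool.

```python
import re

# Token definitions as (name, regex) pairs, mirroring the slide's examples.
# Earlier rules win ties, loosely imitating ocamllex's rule priority.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT",  r"[a-zA-Z_]+"),
    ("PLUS",   r"\+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"[ \t\r\n]+"),   # whitespace: matched but not reported
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    """Yield (token_name, lexeme) pairs; raise on unmatchable input."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(lex("1 + 2 +\n(3 + 42)")))
```

A production lexer compiles these regexes into a single automaton (as ocamllex does); the behavior on this input is the same.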
Lexing vs. Parsing (cont’d)
• Lexing and parsing often produce abstract syntax
tree as a result
■ For efficiency, some compilers go further, and directly
generate intermediate representations
4
Parsing theory
• Goal of parsing: Discovering a parse tree (or
derivation) from a sentence, or deciding there is no
such parse tree
• There’s an alphabet soup of parsers
■ Cocke-Younger-Kasami (CYK) algorithm; Earley’s Parser
- Can parse any context-free grammar (but inefficient)
■ LL(k)
- top-down, parses input left-to-right (first L), produces a leftmost
derivation (second L), k tokens of lookahead
■ LR(k)
- bottom-up, parses input left-to-right (L), produces a rightmost derivation
(R), k tokens of lookahead
6
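To make the CYK entry concrete: CYK works on any context-free grammar once it is in Chomsky normal form, and the same chart can count parse trees, which makes ambiguity visible. The sketch below is my own illustration (the helper nonterminals R and P come from a hand conversion to CNF of the ambiguous grammar E → E + E | n); it is not from the slides.

```python
# CYK-style chart that counts parse trees for the ambiguous grammar
#   E -> E + E | n
# converted by hand to Chomsky normal form:
#   E -> E R | 'n'      R -> P E      P -> '+'
# (R and P are helper nonterminals introduced only for the CNF conversion.)
from collections import defaultdict

BINARY = {"E": [("E", "R")], "R": [("P", "E")]}
LEX = {"n": ["E"], "+": ["P"]}

def count_parses(s, start="E"):
    n = len(s)
    # chart[(i, j)][A] = number of parse trees for s[i..j] rooted at A
    chart = defaultdict(lambda: defaultdict(int))
    for i, c in enumerate(s):
        for a in LEX.get(c, []):
            chart[(i, i)][a] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                  # split point
                for a, rhss in BINARY.items():
                    for b, c in rhss:
                        chart[(i, j)][a] += chart[(i, k)][b] * chart[(k + 1, j)][c]
    return chart[(0, n - 1)][start]

print(count_parses("n+n+n"))    # ambiguous: 2 distinct trees
print(count_parses("n+n+n+n"))
```

The cubic triple loop over spans and split points is exactly why CYK is general but inefficient compared to LL/LR parsing.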
Example: Arithmetic expressions
• High-level grammar:
■ E → E + E | n | (E)
• What should the tokens be?
■ Typically they are the terminals in the grammar
- {+, (, ), n}
- Notice that n itself represents a set of values
- Lexers use regular expressions to define tokens
■ But what will a typical input actually look like?
1 + 2 + \n ( 3 + 42 ) eof
8
Lexing with ocamllex (.mll)
(* Slightly simplified format *)
{ header }
rule entrypoint = parse
regexp_1 { action_1 }
| …
| regexp_n { action_n }
and …
{ trailer }
9
Example
{
open Ex1_parser
exception Eof
}
rule token = parse
[' ' '\t' '\r'] { token lexbuf } (* skip blanks *)
| ['\n' ] { EOL }
| ['0'-'9']+ as lxm { INT(int_of_string lxm) }
| '+' { PLUS }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { raise Eof }
Generated ex1_lexer.ml:
open Ex1_parser
exception Eof
# 7 "ex1_lexer.ml"
let __ocaml_lex_tables = {...} (* table-driven automaton *)
let rec token lexbuf = ... (* the generated matching fn *)
11
Lexer limitations
• Automata limited to 32767 states
■ Can be a problem for languages with lots of keywords
rule token = parse
"keyword_1" { ... }
| "keyword_2" { ... }
| ...
| "keyword_n" { ... }
| ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_']* as id
{ IDENT id }
■ Solution?
12
Parsing
• Now we can build a parser that works with lexemes
(tokens) from token.mll
■ Recall from 330 that parsers work by consuming one
character at a time off the input while building up a parse tree
■ Now the input stream will be tokens, rather than chars
1 + 2 + \n ( 3 + 42 ) eof
INT(1) PLUS INT(2) PLUS LPAREN INT(3) PLUS INT(42) RPAREN eof
13
Suitability of Grammar
• Problem: our grammar is ambiguous
■ E → E + E | n | (E)
■ Exercise: find an input that shows ambiguity
• There are parsing technologies that can work with
ambiguous grammars
■ But they’ll provide multiple parses for ambiguous strings,
which is probably not what we want
• Solution: remove ambiguity
■ One way to do this from 330:
■ E → T | E + T
■ T → n | (E)
14
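The disambiguated grammar above can be exercised directly. The following sketch (my own, in Python rather than the course's OCaml) evaluates the grammar E → T | E + T, T → n | (E); the left recursion in E is implemented as a loop, which is the standard top-down workaround and yields left-associative +.

```python
# A small evaluator for the disambiguated grammar
#   E -> T | E + T        T -> n | ( E )
# The left recursion in E is implemented as iteration, which yields
# left-associative '+' (harmless for +, essential for '-' or '/').
def evaluate(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def expr():
        nonlocal pos
        value = term()
        while peek() == "+":          # E -> E + T, iteratively
            pos += 1
            value += term()
        return value
    def term():
        nonlocal pos
        tok = peek()
        if tok == "(":                # T -> ( E )
            pos += 1
            value = expr()
            assert peek() == ")", "expected ')'"
            pos += 1
            return value
        pos += 1                      # T -> n
        return int(tok)
    value = expr()
    assert pos == len(tokens), "trailing input"
    return value

print(evaluate(["1", "+", "2", "+", "(", "3", "+", "42", ")"]))  # 48
```

Every input now has exactly one parse, which is the point of the refactoring.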
Parsing with ocamlyacc (.mly)
.mly input:
%{ header %}
declarations
%%
rules
%%
trailer

.mli output:
type token =
  | INT of (int)
  | EOL
  | PLUS
  | LPAREN
  | RPAREN
val main : (Lexing.lexbuf -> token) -> Lexing.lexbuf -> int
15
Parsing with ocamlyacc (.mly)
.mly input:
%{ header %}
declarations
%%
rules
%%
trailer

.ml output:
(* header *)
type token = ...
...
let yytables = ...
(* trailer *)
17
Example
%token <int> INT
%token EOL PLUS LPAREN RPAREN
%start main /* the entry point */
%type <int> main
%%
main:
| expr EOL { $1 } (* 1 *)
expr:
| term { $1 } (* 2 *)
| expr PLUS term { $1 + $3 } (* 3 *)
term:
| INT { $1 } (* 4 *)
| LPAREN expr RPAREN { $2 } (* 5 *)
. 1+2+(3+42)$
term[1] . +2+(3+42)$
expr[1] . +2+(3+42)$
expr[1] + term[2] . +(3+42)$
expr[3] . +(3+42)$
... (several steps omitted) ...
expr[3] + term[45] . $
expr[48] . $
main[48]
■ The “.” indicates where we are in the parse
■ We’ve skipped several intermediate steps here, to focus only on
actions
■ (Details next)
20
Invoking lexer/parser
try
let lexbuf = Lexing.from_channel stdin in
while true do
let result = Ex1_parser.main Ex1_lexer.token lexbuf in
print_int result; print_newline(); flush stdout
done
with Ex1_lexer.Eof ->
exit 0
21
Terminology review
• Derivation
■ A sequence of steps using the productions to go from the start
symbol to a string
• Rightmost (leftmost) derivation
■ A derivation in which the rightmost (leftmost) nonterminal is
rewritten at each step
• Sentential form
■ A sequence of terminals and non-terminals derived from the
start symbol of the grammar in 0 or more steps
■ I.e., some intermediate step on the way from the start symbol to
a string in the language of the grammar
• Right- (left-)sentential form
■ A sentential form from a rightmost (leftmost) derivation
• FIRST(α)
■ Set of initial terminal symbols of strings derived from α
22
Bottom-up parsing
• ocamlyacc builds a bottom-up parser
■ Builds derivation from input back to start symbol
S = γ0 → γ1 → γ2 → … → γn–1 → γn = input
(a bottom-up parser discovers this derivation in reverse, from the input back to S)
• To reduce γi to γi–1
■ Find production A → β where β is in γi, and replace β with A
• In terms of parse tree, working from leaves to root
■ Nodes with no parent in a partial tree form its upper fringe
■ Since each replacement of β with A shrinks upper fringe,
we call it a reduction.
• Note: need not actually build parse tree
■ |parse tree nodes| = |words| + |reductions|
23
Bottom-up parsing, illustrated
LR(1) parsing:
• Scan input left-to-right
• Rightmost derivation
• 1 token lookahead
Reducing by rule B → γ: S ⇒* α B y ⇒ α γ y ⇒* x y
24
Finding reductions
• Consider the following grammar
1. S → a A B e
2. A → A b c
3.   | b
4. B → d
Input: abbcde
Sentential Form | Production | Position
abbcde | 3 | 2
aAbcde | 2 | 4
aAde | 4 | 3
aABe | 1 | 4
S | N/A | N/A
26
Handles
• Goal: Find substring β of tree’s frontier that matches
some production A → β
■ (And that occurs in the rightmost derivation)
■ Informally, we call this substring β a handle
• Formally,
■ A handle of a right-sentential form γ is a pair (A→β,k) where
- A→β is a production and k is the position in γ of β’s rightmost symbol.
- If (A→β,k) is a handle, then replacing β at k with A produces the right
sentential form from which γ is derived in the rightmost derivation.
■ Because γ is a right-sentential form, the substring to the
right of a handle contains only terminal symbols
- the parser doesn’t need to scan past the handle (only lookahead)
27
Example
• Grammar
1. S → E
2. E → E + T
3.   | E - T
4.   | T
5. T → T * F
6.   | T / F
7.   | F
8. F → n
9.   | id
10.  | (E)
Handles for rightmost derivation of id - n * id:
Sentential Form | Production | Handle (prod,k)
S | |
E | 1 | (1,1)
E - T | 3 | (3,3)
E - T * F | 5 | (5,5)
E - T * id | 9 | (9,5)
E - F * id | 7 | (7,3)
E - n * id | 8 | (8,3)
T - n * id | 4 | (4,1)
F - n * id | 7 | (7,1)
id - n * id | 9 | (9,1)
28
Finding reductions
• Theorem: If G is unambiguous, then every right-
sentential form has a unique handle
■ If we can find those handles, we can build a derivation!
• Sketch of Proof:
■ G is unambiguous ⇒ rightmost derivation is unique
■ ⇒ a unique production A → β applied to derive γi from γi–1
■ ⇒ a unique position k at which A → β is applied
■ ⇒ a unique handle (A→β,k)
29
Bottom-up handle pruning
• Handle pruning: discovering handle and reducing it
■ Handle pruning forms the basis for bottom-up parsing
• So, to construct a rightmost derivation, repeatedly find and prune handles:
S = γ0 → γ1 → γ2 → … → γn–1 → γn = input
30
Shift-reduce parsing algorithm
• Maintain a stack of terminals and non-terminals
matched so far
■ Rightmost terminal/non-terminal on top of stack
■ Since we’re building rightmost derivation, will look at top
elements of stack for reductions
push INVALID
token ← next_token( )
repeat until (top of stack = Goal and token = EOF)
  if the top of the stack is a handle A → β
  then // reduce β to A
    pop |β| symbols off the stack
    push A onto the stack
  else if (token ≠ EOF)
  then // shift
    push token
    token ← next_token( )
  else // need to shift, but out of input
    report an error

Potential errors:
• Can’t find handle
• Reach end of file
31
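The loop above needs some way to recognize a handle on top of the stack; LR tables automate that, but for the small expression grammar a hand-written handle test already works. The following Python sketch (my own illustration; the guard conditions are hard-coded for this one grammar, which a real LR parser would derive from its tables) implements shift-reduce evaluation of E → T | E + T, T → n | (E).

```python
# A hand-rolled shift-reduce evaluator for
#   E -> T | E + T        T -> n | ( E )
# Stack entries are raw tokens ('+', '(', ')', digits) or reduced
# nonterminals ('E', value) / ('T', value).
def parse(tokens):
    stack = []
    def reduce_once():
        # T -> n : an integer literal on top of the stack
        if stack and isinstance(stack[-1], str) and stack[-1].isdigit():
            stack.append(("T", int(stack.pop())))
            return True
        # T -> ( E )
        if len(stack) >= 3 and stack[-3] == "(" and \
           isinstance(stack[-2], tuple) and stack[-2][0] == "E" and stack[-1] == ")":
            v = stack[-2][1]
            del stack[-3:]
            stack.append(("T", v))
            return True
        # E -> E + T
        if len(stack) >= 3 and isinstance(stack[-3], tuple) and stack[-3][0] == "E" \
           and stack[-2] == "+" and isinstance(stack[-1], tuple) and stack[-1][0] == "T":
            v = stack[-3][1] + stack[-1][1]
            del stack[-3:]
            stack.append(("E", v))
            return True
        # E -> T : only when this T cannot be part of an E + T handle
        if stack and isinstance(stack[-1], tuple) and stack[-1][0] == "T" \
           and (len(stack) == 1 or stack[-2] == "("):
            stack.append(("E", stack.pop()[1]))
            return True
        return False
    for tok in tokens:
        stack.append(tok)            # shift
        while reduce_once():         # reduce while a handle is on top
            pass
    assert len(stack) == 1 and stack[0][0] == "E", "syntax error"
    return stack[0][1]

print(parse("1 + 2 + ( 3 + 42 )".split()))  # 48
```

The E → T guard is exactly the kind of context sensitivity that ACTION tables encode systematically.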
Example
1. Shift until the top of the stack is the right end of a handle
2. Find the left end of the handle & reduce
(Parse tree for id - n * id: the root E expands to E - T; the left E derives
T → F → id, and the right T derives T * F, with T → F → n and F → id.)
33
Algorithm actions
• Shift-reduce parsers have just four actions
■ Shift — next word is shifted onto the stack
■ Reduce — right end of handle is at top of stack
- Locate left end of handle within the stack
- Pop handle off stack and push appropriate lhs
■ Accept — stop parsing and report success
■ Error — call an error reporting/recovery routine
• Cost of operations
■ Accept is constant time
■ Shift is just a push and a call to the scanner
■ Reduce takes |rhs| pops and 1 push
- If handle-finding requires state, put it in the stack ⇒ 2x work
35
LR(1) parsing
• Can use a set of tables to describe LR(1) parser
source code → Scanner → tokens → Table-driven Parser → output
grammar → Parser Generator → ACTION & GOTO Tables → (used by the parser)
37
Example parser table
• ocamlyacc -v ex1_parser.mly — produce .output file
with parser table
action goto
state . EOL + N ( ) main expr term productions
0 (special)
1 s3 s4 acc 6 7 entry → . main
2 (special)
3 r4 term → INT .
4 s3 s4 8 7 term → ( . expr )
5 (special)
6 s9 s10 main → expr . EOL | expr → expr . + term
7 r2 expr → term .
8 s10 s11 expr → expr . + term | term → ( expr . )
9 r1 main → expr EOL .
10 s3 s4 12 expr → expr + . term
11 r5 term → ( expr ) .
12 r3 expr → expr + term .
39
Example parser table (cont’d)
• Notes
■ Notice derivation is built up (bottom to top)
■ Table only contains kernel of each state
- Apply closure operation to see all the productions in the state
• LR(1) parsing requires start symbol not on any rhs
■ Thus, ocamlyacc actually adds another production
- %entry% → \001 main
- (so the acc in the previous table is a slight fib)
• Values returned from actions stored on the stack
■ Reduce triggers computation of action result
40
Why does this work?
• Stack = upper fringe
■ So all possible handles on top of stack
■ Shift inputs until top elements of stack form a handle
• Build a handle-recognizing DFA
■ Language of handles is regular
■ ACTION and GOTO tables encode the DFA
- Shift = DFA transition
- Reduce = DFA accept
- New state = GOTO[state at top of stack (after pop), lhs]
• If we can build these tables, grammar is LR(1)
41
LR(k) items
• An LR(k) item is a pair [P, δ], where
■ P is a production A→β with a • at some position in the rhs
■ δ is a lookahead string of length ≤ k (words or $)
■ The • in an item indicates the position of the top of the stack
• LR(1):
■ [A→•βγ,a] — input so far consistent with using A →βγ
immediately after symbol on top of stack
■ [A →β•γ,a] — input so far consistent with using A →βγ at
this point in the parse, and parser has already recognized β
■ [A →βγ•,a] — parser has seen βγ, and lookahead of a
consistent with reducing to A
• LR(1) items represent valid configurations of an
LR(1) parser; DFA states are sets of LR(1) items
42
LR(k) items, cont’d
• Ex: A→BCD with lookahead a can yield 4 items
■ [A→•BCD,a], [A→B•CD,a], [A→BC•D,a], [A→BCD•,a]
■ Notice: set of LR(1) items for a grammar is finite
• Carry lookaheads along to choose correct reduction
■ Lookahead has no direct use in [A→β•γ,a]
■ In [A→β•,a], a lookahead of a ⇒ reduction by A → β
■ For { [A→β•,a], [B→γ•δ,b] }
- Lookahead of a ⇒ reduce to A
- Lookahead in FIRST(δ) ⇒ shift
- (else error)
43
LR(1) table construction
• States of LR(1) parser contain sets of LR(1) items
• Initial state s0
■ Assume S’ is the start symbol of the grammar and does not appear in any rhs
- (Extend the grammar if necessary to ensure this)
■ s0 = closure([S’ → •S, $]) ($ = EOF)
• For each sk and each terminal/non-terminal X, compute a
new state goto(sk, X)
■ Use closure( ) to “fill out” the kernel of the new state
■ If the new state is not already in the collection, add it
■ Record all the transitions created by goto( )
- These become the ACTION and GOTO tables
- I.e., the handle-finding DFA
• This process eventually reaches a fixpoint
44
Closure()
• [A→β•Bδ,a] implies [B→•γ,x] for each production
with B on lhs and each x ∈ FIRST(δa)
- (If you’re about to see a B, you may also see a γ)
Closure( s )
while ( s is still changing )
∀ items [A → β •Bδ,a] ∈ s // item with • to left of nonterminal B
∀ productions B → γ ∈ P // all productions for B
∀ b ∈ FIRST(δa) // tokens appearing after B
if [B → • γ,b] ∉ s // form LR(1) item w/ new lookahead
then add [B→ • γ,b] to s // add item to s if new
Example (LR(0) items, lookaheads omitted):
kernel item: [E → T+ • E]
derived items: [E → • T+E], [E → • T], [T → • id]
46
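The Closure( ) pseudocode above can be transcribed almost line for line. The Python sketch below (my own encoding: an item is a (lhs, rhs, dot, lookahead) tuple, and the grammar is the running S → E, E → T+E | T, T → id example) also includes the small FIRST fixpoint that closure needs; since this grammar has no nullable symbols, FIRST of a string is just FIRST of its first symbol.

```python
# Items are (lhs, rhs, dot, lookahead) tuples; grammar from the running example:
#   S -> E        E -> T + E | T        T -> id
GRAMMAR = [
    ("S", ("E",)),
    ("E", ("T", "+", "E")),
    ("E", ("T",)),
    ("T", ("id",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def first_sets():
    """FIRST for each symbol, by fixpoint (no nullable symbols here)."""
    first = {s: {s} for _, rhs in GRAMMAR for s in rhs if s not in NONTERMINALS}
    for nt in NONTERMINALS:
        first.setdefault(nt, set())
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            new = first[rhs[0]] - first[lhs]
            if new:
                first[lhs] |= new
                changed = True
    return first

FIRST = first_sets()

def first_of_string(symbols, lookahead):
    # FIRST(delta a): with no nullable symbols, just the first symbol
    return FIRST[symbols[0]] if symbols else {lookahead}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, la) in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:   # dot before nonterminal B
                for b_la in first_of_string(rhs[dot + 1:], la):
                    for p_lhs, p_rhs in GRAMMAR:
                        if p_lhs == rhs[dot] and (p_lhs, p_rhs, 0, b_la) not in items:
                            items.add((p_lhs, p_rhs, 0, b_la))
                            changed = True
    return frozenset(items)

s0 = closure({("S", ("E",), 0, "$")})
for item in sorted(s0):
    print(item)
```

Running this reproduces the five items of the closure example on the next slide.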
Example — closure with LR(1)
S → E
E → T+E
  | T
T → id

closure([S → • E, $]):
[S → • E, $] (kernel item)
[E → • T+E, $] (derived)
[E → • T, $]
[T → • id, +]
[T → • id, $]

closure([E → T+ • E, $]):
[E → T+ • E, $] (kernel item)
[E → • T+E, $] (derived)
[E → • T, $]
[T → • id, +]
[T → • id, $]
47
Goto
• Goto(s,x) computes the state that the parser would
reach if it recognized an x while in state s
■ Goto( { [A→β•Xδ,a] }, X ) produces [A→βX•δ,a]
■ Should also include closure( [A→βX•δ,a] )
Goto( s, X )
new ←Ø
∀ items [A→β•Xδ,a] ∈ s // for each item with • to left of X
new ← new ∪ [A→βX•δ,a] // add item with • to right of X
return closure(new) // remember to compute closure!
49
Example — goto with LR(1)
S → E
E → T+E
  | T
T → id

s0 = closure([S → • E, $]):
[S → • E, $] (kernel item)
[E → • T+E, $] (derived)
[E → • T, $]
[T → • id, +]
[T → • id, $]

goto(s0, E) = { [S → E •, $] }
goto(s0, T) = { [E → T • +E, $], [E → T •, $] }
goto(s0, id) = { [T → id •, +], [T → id •, $] }
50
Building parser states
cc0 ← closure ( [S’→ •S, $] )
CC ← { cc0 }
while ( new sets are still being added to CC)
for each unmarked set ccj ∈ CC
mark ccj as processed
for each x following a • in an item in ccj
temp ← goto(ccj, x)
if temp ∉ CC
then CC ← CC ∪ { temp }
record transitions from ccj to temp on x
Example (LR(0) items, lookaheads omitted):
s0 = { [S → • E], [E → • T+E], [E → • T], [T → • id] }
goto(s0, E) = { [S → E •] }
goto(s0, T) = { [E → T • +E], [E → T •] }
goto(s0, id) = { [T → id •] }
goto of the T-state on + = { [E → T+ • E], [E → • T+E], [E → • T], [T → • id] }
goto of that state on E = { [E → T+E •] }; on T and id it returns to the states above
52
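The worklist algorithm above is short enough to run end to end. The self-contained Python sketch below (again my own encoding of items as (lhs, rhs, dot, lookahead) tuples for the S → E, E → T+E | T, T → id grammar; not from the slides) builds closure, goto, and the full canonical collection.

```python
# Worklist construction of the canonical collection of LR(1) item sets for
#   S -> E        E -> T + E | T        T -> id
GRAMMAR = [("S", ("E",)), ("E", ("T", "+", "E")), ("E", ("T",)), ("T", ("id",))]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def first(symbol):
    """FIRST of one symbol (no nullable symbols in this grammar)."""
    if symbol not in NONTERMINALS:
        return {symbol}
    out, seen, work = set(), set(), [symbol]
    while work:
        nt = work.pop()
        if nt in seen:
            continue
        seen.add(nt)
        for lhs, rhs in GRAMMAR:
            if lhs == nt:
                if rhs[0] in NONTERMINALS:
                    work.append(rhs[0])
                else:
                    out.add(rhs[0])
    return out

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, la) in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                lookaheads = first(rhs[dot + 1]) if dot + 1 < len(rhs) else {la}
                for p_lhs, p_rhs in GRAMMAR:
                    if p_lhs == rhs[dot]:
                        for b_la in lookaheads:
                            if (p_lhs, p_rhs, 0, b_la) not in items:
                                items.add((p_lhs, p_rhs, 0, b_la))
                                changed = True
    return frozenset(items)

def goto(state, x):
    moved = {(l, r, d + 1, la) for (l, r, d, la) in state if d < len(r) and r[d] == x}
    return closure(moved)

cc0 = closure({("S", ("E",), 0, "$")})
cc, work, transitions = {cc0}, [cc0], {}
while work:
    state = work.pop()
    for x in {r[d] for (_, r, d, _) in state if d < len(r)}:
        nxt = goto(state, x)
        transitions[(state, x)] = nxt
        if nxt not in cc:
            cc.add(nxt)
            work.append(nxt)

print(len(cc))   # 6 states, matching S0-S5 in the example tables that follow
```

Note how goto from the "+" state reaches states already in the collection, which is what makes the process terminate.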
Example LR(1) states
S → E
E → T+E
  | T
T → id

s0 = { [S → • E, $], [E → • T+E, $], [E → • T, $], [T → • id, +], [T → • id, $] }
goto(s0, E) = s1 = { [S → E •, $] }
goto(s0, T) = s2 = { [E → T • +E, $], [E → T •, $] }
goto(s0, id) = s3 = { [T → id •, +], [T → id •, $] }
goto(s2, +) = s4 = { [E → T+ • E, $], [E → • T+E, $], [E → • T, $], [T → • id, +], [T → • id, $] }
goto(s4, T) = s2, goto(s4, id) = s3
goto(s4, E) = s5 = { [E → T+E •, $] }
53
Building ACTION and GOTO tables
∀ set sx ∈ S
  ∀ item i ∈ sx
    if i is [A→β •a γ,b] and goto(sx,a) = sk, a ∈ terminals // • to left of terminal a
    then ACTION[x,a] ← “shift k” // shift if lookahead = a
    else if i is [S’→S •,$] // start production done,
    then ACTION[x,$] ← “accept” // accept if lookahead = $
    else if i is [A→β •,a] // • all the way to the right, production done
    then ACTION[x,a] ← “reduce A→β” // reduce if lookahead = a
  ∀ n ∈ nonterminals
    if goto(sx,n) = sk
    then GOTO[x,n] ← k // store transitions for nonterminals
54
Ex ACTION and GOTO tables
1. S → E
2. E → T+E
3.   | T
4. T → id

         ACTION           GOTO
state    id    +    $     E    T
S0       s3               1    2
S1                  acc
S2             s4   r3
S3             r4   r4
S4       s3               5    2
S5                  r2

(States S0–S5 are the LR(1) item sets from the previous slide: shift entries
come from goto on terminals, the accept entry from [S → E •, $], reduce
entries from completed items, and GOTO entries from goto on nonterminals.)
55
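These tables can drive the standard LR skeleton directly. The Python sketch below is my own transcription of the ACTION/GOTO tables above into dictionaries, together with the usual table-driven loop (a state stack, shift pushes a state, reduce pops |rhs| states and follows GOTO).

```python
# The slide's ACTION/GOTO tables, transcribed by hand, driving the standard
# LR(1) skeleton. Productions: 1: S->E   2: E->T+E   3: E->T   4: T->id
ACTION = {
    (0, "id"): ("shift", 3),
    (1, "$"):  ("accept", None),
    (2, "+"):  ("shift", 4),
    (2, "$"):  ("reduce", 3),
    (3, "+"):  ("reduce", 4),
    (3, "$"):  ("reduce", 4),
    (4, "id"): ("shift", 3),
    (5, "$"):  ("reduce", 2),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (4, "E"): 5, (4, "T"): 2}
RHS_LEN = {2: 3, 3: 1, 4: 1}          # |rhs| of productions 2-4
LHS = {2: "E", 3: "E", 4: "T"}

def parse(tokens):
    tokens = tokens + ["$"]
    states = [0]
    i = 0
    while True:
        act = ACTION.get((states[-1], tokens[i]))
        if act is None:
            return False                           # syntax error
        kind, arg = act
        if kind == "shift":
            states.append(arg)
            i += 1
        elif kind == "reduce":                     # pop |rhs| states, follow GOTO
            del states[len(states) - RHS_LEN[arg]:]
            states.append(GOTO[(states[-1], LHS[arg])])
        else:                                      # accept
            return True

print(parse(["id", "+", "id"]))   # True
print(parse(["id", "+"]))         # False
```

Attaching semantic values to the popped entries (as ocamlyacc does with $1, $3) is a small extension of the same loop.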
What can go wrong?
• What if set s contains [A→β•aγ,b] and [B→β•,a] ?
■ First item generates “shift”, second generates “reduce”
■ Both define ACTION[s,a] — cannot do both actions
■ This is a shift/reduce conflict
• What if set s contains [A→γ•, a] and [B→γ•, a] ?
■ Each generates “reduce”, but with a different production
■ Both define ACTION[s,a] — cannot do both reductions
■ This is called a reduce/reduce conflict
• In either case, the grammar is not LR(1)
60
Shift/reduce conflict
%token <int> INT
%token EOL PLUS LPAREN RPAREN
%start main /* the entry point */
%type <int> main
%%
main:
| expr EOL { $1 }
expr:
| INT { $1 }
| expr PLUS expr { $1 + $3 }
| LPAREN expr RPAREN { $2 }
• Associativity unspecified
■ Ambiguous grammars always have conflicts
■ But, some non-ambiguous grammars also have conflicts
61
Solving conflicts
• Refactor grammar
• Specify operator precedence and associativity
%left PLUS MINUS /* lowest precedence */
%left TIMES DIV /* medium precedence */
%nonassoc UMINUS /* highest precedence */
■ Lots of details here
- See “12.4.2 Declarations” at
- http://caml.inria.fr/pub/docs/manual-ocaml/manual026.html#htoc151
■ When comparing operator on stack with lookahead
- Shift if lookahead has higher prec OR same prec, right assoc
- Reduce if lookahead has lower prec OR same prec, left assoc
■ Can use smaller, simpler (ambiguous) grammars
- Like the one we just saw
62
Left vs. right recursion
• Right recursion
■ Required for termination in top-down parsers
■ Produces right-associative operators: w * ( x * ( y * z ) )
• Left recursion
■ Works fine in bottom-up parsers
■ Limits required stack space
■ Produces left-associative operators: ( ( w * x ) * y ) * z
• Rule of thumb
■ Left recursion for bottom-up parsers
■ Right recursion for top-down parsers
63
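The associativity difference corresponds exactly to left vs right folds, which is easy to see with an operator where grouping matters. A small illustration (the numbers are arbitrary):

```python
from functools import reduce

nums = [16, 8, 4, 2]   # w, x, y, z, with '-' standing in for the operator

# Left recursion <=> left fold: ((w - x) - y) - z
left = reduce(lambda acc, v: acc - v, nums)

# Right recursion <=> right fold: w - (x - (y - z))
right = nums[-1]
for v in reversed(nums[:-1]):
    right = v - right

print(left, right)   # 2 10
```

For associative operators like + the two agree, which is why the grammar choice is purely a parsing concern there.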
Reduce/reduce conflict (1)
%token <int> INT
%token EOL PLUS LPAREN RPAREN
%start main /* the entry point */
%type <int> main
%%
main:
| expr EOL { $1 }
expr:
| INT { $1 }
| term { $1 }
| term PLUS expr { $1 + $3 }
term :
| INT { $1 }
| LPAREN expr RPAREN { $2 }
64
Reduce/reduce conflict (2)
%token <int> INT
%token EOL PLUS LPAREN RPAREN
%start main /* the entry point */
%type <int> main
%%
main:
| expr EOL { $1 }
expr:
| term1 { $1 }
| term1 PLUS PLUS expr { $1 + $4 }
| term2 PLUS expr { $1 + $3 }
term1 :
| INT { $1 }
| LPAREN expr RPAREN { $2 }
term2 :
| INT { $1 }
65
Shrinking the tables
• Combine terminals
■ E.g., number and identifier, or + and -, or * and /
- Directly removes a column, may remove a row
• Combine rows or columns (table compression)
■ Implement identical rows once and remap states
■ Requires extra indirection on each lookup
■ Use separate mapping for ACTION and for GOTO
• Use another construction algorithm
■ LALR(1) used by ocamlyacc
66
LALR(1) parser
• Define the core of a set of LR(1) items as
■ Set of LR(0) items derived by ignoring lookahead symbols
LR(1) state: { [E → a •, b], [A → a •, c] }
Core: { [E → a •], [A → a •] }
67
LALR(1) example
LR(1) states:
{ [E → a •, b], [A → ba •, c] }
{ [E → a •, d], [A → ba •, b] }
Merged state:
{ [E → a •, b], [E → a •, d], [A → ba •, c], [A → ba •, b] }
68
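The merge itself is just "group states by core and union their items." A minimal sketch, using the two states from the example above (items encoded as (production-with-dot, lookahead) pairs, an encoding of my own choosing):

```python
from collections import defaultdict

# The two LR(1) states from the slide's example.
state1 = {("E -> a .", "b"), ("A -> ba .", "c")}
state2 = {("E -> a .", "d"), ("A -> ba .", "b")}

def merge_by_core(states):
    """Group LR(1) states whose LR(0) cores coincide; union their items."""
    merged = defaultdict(set)
    for state in states:
        core = frozenset(prod for prod, _la in state)
        merged[core] |= state
    return list(merged.values())

lalr = merge_by_core([state1, state2])
print(len(lalr))   # 1: both states share the core {E -> a ., A -> ba .}
```

Merging never changes shift behavior (cores determine transitions), but unioning lookaheads is what can introduce the reduce/reduce conflicts discussed on the next slide.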
LALR(1) vs. LR(1)
• Example grammar
S’ → S
S → aAd | bBd | aBe | bAe
A→c
B→c
• LR(0) ?
• LR(1) ?
• LALR(1) ?
69
LR(k) Parsers
• Properties
■ Strictly more powerful than LL(k) parsers
■ Most general non-backtracking shift-reduce parser
■ Detect errors as soon as possible in a left-to-right scan of
the input
- Contents of stack are always viable prefixes
- I.e., it is still possible for the remaining input to lead to a successful parse
70
Error handling (lexing)
• What happens when input not handled by any lexing
rule?
■ An exception gets raised
■ Better to provide more information, e.g.,
rule token = parse
...
71
Error handling (parsing)
• What happens when parsing a string not in the
grammar?
■ Reject the input
■ Do we keep going, parsing more characters?
- May cause a cascade of error messages
- Could be more useful to programmer, if they don’t need to stop at the
first error message (what do you do, in practice?)
72
Error example (1)
...
expr:
| term { $1 }
| expr PLUS term { $1 + $3 }
| error { Printf.printf "invalid expression"; 0 }
term: ...
73
Error example (2)
...
term:
| INT { $1 }
| LPAREN expr RPAREN { $2 }
| LPAREN error RPAREN {Printf.printf "Syntax error!\n"; 0}
74
Error recovery in practice
• A very hard thing to get right!
■ Necessarily involves guessing at what malformed inputs
you may see
76
Additional Parsing Technologies
• For a long time, parsing was a “dead” field
■ Considered solved a long time ago
• Recently, people have come back to it
■ LALR parsing can have unnecessary parsing conflicts
■ LALR parsing tradeoffs more important when computers
were slower and memory was smaller
• Many recent new (or new-old) parsing techniques
■ GLR — generalized LR parsing, for ambiguous grammars
■ LL(*) — ANTLR
■ Packrat parsing — for parsing expression grammars
■ etc...
• The input syntax to many of these looks like yacc/lex
77
Designing language syntax
• Idea 1: Make it look like other, popular languages
■ Java did this (OO with C syntax)
• Idea 2: Make it look like the domain
■ There may be well-established notation in the domain (e.g.,
mathematics)
■ Domain experts already know that notation
• Idea 3: Measure design choices
■ E.g., ask users to perform programming (or related) task
with various choices of syntax, evaluate performance,
survey them on understanding
- This is very hard to do!
• Idea 4: Make your users adapt
■ People are really good at learning...
78