4.parsing
DESIGN
Adapted from slides by Steve Zdancewic, UPenn
creating an abstract representation of program syntax
PARSING
Today: Parsing
Source Code
(Character stream)
if (b == 0) { a = 1; }
Lexical Analysis
Token stream:
if ( b == 0 ) { a = 1 ; }
Parsing
Abstract Syntax Tree:
If
  Eq
    b
    0
  Assn
    a
    1
  None
Analysis & Transformation
Backend
Assembly Code
l1:
cmpq $0, %rax
je l2
jmp l3
l2:
…
Parsing: Structure
Tokens:      Structure:
10 + 5       add 10 and 5
f ( x )      call a function f with argument x
• Strategy:
– Parse the token stream to build a tree showing how the pieces
relate
– Forget the “concrete” syntax, remember the “abstract” syntax
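A minimal sketch of what "remembering the abstract syntax" might look like in C. The node type `Ast`, its kinds, and the constructor names are all hypothetical (they do not come from the slides); the point is only that `10 + 5` and `f ( x )` become small trees with no trace of parentheses or operator tokens.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical AST node kinds; the names are illustrative only. */
typedef enum { AST_NUM, AST_VAR, AST_ADD, AST_CALL } AstKind;

typedef struct Ast {
    AstKind kind;
    int value;            /* AST_NUM: the number */
    const char *name;     /* AST_VAR: variable name; AST_CALL: function name */
    struct Ast *left;     /* AST_ADD: left operand; AST_CALL: the argument */
    struct Ast *right;    /* AST_ADD: right operand */
} Ast;

/* Allocate a node with every field cleared. */
Ast *ast_new(AstKind kind) {
    Ast *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->kind = kind;
    return n;
}

Ast *ast_num(int value) {
    Ast *n = ast_new(AST_NUM);
    n->value = value;
    return n;
}

Ast *ast_var(const char *name) {
    Ast *n = ast_new(AST_VAR);
    n->name = name;
    return n;
}

/* "add l and r" */
Ast *ast_add(Ast *l, Ast *r) {
    Ast *n = ast_new(AST_ADD);
    n->left = l;
    n->right = r;
    return n;
}

/* "call a function fn with argument arg" */
Ast *ast_call(const char *fn, Ast *arg) {
    Ast *n = ast_new(AST_CALL);
    n->name = fn;
    n->left = arg;
    return n;
}

/* Count the nodes in a tree. */
int ast_size(const Ast *n) {
    if (n == NULL) return 0;
    return 1 + ast_size(n->left) + ast_size(n->right);
}
```

Under these assumptions, `ast_add(ast_num(10), ast_num(5))` has three nodes and `ast_call("f", ast_var("x"))` has two, even though `f ( x )` is four tokens.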
Describing Syntax
• How are we going to write these things down?
“An if statement starts with the token IF, then an LPAREN, some
tokens that make up a condition, and an RPAREN, then an
LBRACE, some tokens that make up the body, and an RBRACE.”
Another Example: Sum Grammar
• A grammar that accepts parenthesized sums of numbers:
S ⟼ E + S | E
E ⟼ number | ( S )
e.g.: (1 + 2 + (3 + 4)) + 5
S ⟼ E + S
S ⟼ E
E ⟼ number
E ⟼ ( S )
4 productions
2 nonterminals: S, E
4 terminals: (, ), +, number
Start symbol: S
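One way to see that the grammar really describes parenthesized sums is to turn each nonterminal into a function. The following is a hand-written sketch, not how Yacc works internally; it reads characters directly instead of tokens, and the names `Cursor`, `parse_S`, `parse_E`, and `accepts` are made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Cursor into the input; a real compiler would read tokens from the lexer. */
typedef struct { const char *p; } Cursor;

void skip_spaces(Cursor *c) {
    while (*c->p == ' ') c->p++;
}

int parse_S(Cursor *c);

/* E -> number | ( S ) */
int parse_E(Cursor *c) {
    skip_spaces(c);
    if (isdigit((unsigned char)*c->p)) {
        while (isdigit((unsigned char)*c->p)) c->p++;
        return 1;
    }
    if (*c->p == '(') {
        c->p++;
        if (!parse_S(c)) return 0;
        skip_spaces(c);
        if (*c->p != ')') return 0;
        c->p++;
        return 1;
    }
    return 0;
}

/* S -> E + S | E */
int parse_S(Cursor *c) {
    if (!parse_E(c)) return 0;
    skip_spaces(c);
    if (*c->p == '+') {
        c->p++;
        return parse_S(c);
    }
    return 1;
}

/* Accept only if S matches the whole input. */
int accepts(const char *input) {
    Cursor c;
    assert(input != NULL);
    c.p = input;
    if (!parse_S(&c)) return 0;
    skip_spaces(&c);
    return *c.p == '\0';
}
```

With this sketch, `accepts("(1 + 2 + (3 + 4)) + 5")` returns 1, while `accepts("(1 + 2")` and `accepts("1 +")` return 0.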
Derivations in CFGs
• Example: derive (1 + 2 + (3 + 4)) + 5
  Grammar: S ⟼ E + S | E
           E ⟼ number | ( S )
• S ⟼ E + S
    ⟼ (S) + S
    ⟼ (E + S) + S
    ⟼ (1 + S) + S
    ⟼ (1 + E + S) + S
    ⟼ (1 + 2 + S) + S
    ⟼ (1 + 2 + E) + S
    ⟼ (1 + 2 + (S)) + S
    ⟼ (1 + 2 + (E + S)) + S
    ⟼ (1 + 2 + (3 + S)) + S
    ⟼ (1 + 2 + (3 + E)) + S
    ⟼ (1 + 2 + (3 + 4)) + S
    ⟼ (1 + 2 + (3 + 4)) + E
    ⟼ (1 + 2 + (3 + 4)) + 5
• For arbitrary strings α, β, γ and production rule A ⟼ β,
  a single step of the derivation is:
    αAγ ⟼ αβγ
  (substitute β for an occurrence of A)
• In general, there are many possible derivations for a given string
• Note: Underline indicates symbol being expanded.
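A single derivation step is just string substitution, which can be sketched in C. This assumes a toy encoding in which every grammar symbol is one character (so `S` and `E` are nonterminals and digits stand in for number tokens); the function name `derive_step` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One derivation step: replace the symbol at index pos of the sentential
   form `form` with the right-hand side `rhs`, writing the result to `out`.
   Everything before pos plays the role of alpha, everything after it gamma.
   Returns 0 on success, -1 if the result does not fit in out_size bytes. */
int derive_step(const char *form, size_t pos, const char *rhs,
                char *out, size_t out_size) {
    size_t len = strlen(form);
    assert(pos < len);
    int n = snprintf(out, out_size, "%.*s%s%s",
                     (int)pos, form, rhs, form + pos + 1);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For instance, applying E ⟼ (S) at position 0 of `"E+S"` yields `"(S)+S"`, which matches the second line of the derivation. The caller chooses which occurrence of a nonterminal to expand, which is why many derivations of the same string are possible.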
Loops and Termination
• Some care is needed when defining grammars
• Consider: S ⟼ E
            E ⟼ S
– This grammar has nonterminal definitions that are “nonproductive”
(i.e. they don’t mention any terminal symbols)
– There is no finite derivation starting from S, so the language is
empty.
• Consider: S ⟼ (S)
– This grammar is productive, but again there is no finite derivation
starting from S, so the language is empty
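Both grammars above can be caught mechanically. The sketch below is one possible check, not something the slides prescribe: it repeatedly marks a nonterminal as "generating" once some production for it has a right-hand side made only of terminals and already-marked nonterminals. It assumes a toy encoding where uppercase ASCII letters are nonterminals and every other character is a terminal; `Production` and `has_finite_derivation` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* One production lhs -> rhs; uppercase letters are nonterminals. */
typedef struct { char lhs; const char *rhs; } Production;

/* Returns 1 if `start` can derive some string of terminals in finitely
   many steps, 0 if the language of the grammar is empty. */
int has_finite_derivation(const Production *prods, size_t count, char start) {
    int generating[26] = {0};  /* generating[X - 'A']: X derives a terminal string */
    int changed = 1;
    assert(isupper((unsigned char)start));
    while (changed) {          /* iterate until no new nonterminal is marked */
        changed = 0;
        for (size_t i = 0; i < count; i++) {
            int lhs = prods[i].lhs - 'A';
            if (generating[lhs]) continue;
            int ok = 1;
            for (const char *s = prods[i].rhs; *s != '\0'; s++) {
                if (isupper((unsigned char)*s) && !generating[*s - 'A']) {
                    ok = 0;
                    break;
                }
            }
            if (ok) {
                generating[lhs] = 1;
                changed = 1;
            }
        }
    }
    return generating[start - 'A'];
}
```

On the two problem grammars this returns 0; on the sum grammar (with `n` standing in for number) it returns 1, because E ⟼ number gives the recursion somewhere to stop.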
PARSER GENERATORS
Getting Started with Yacc/Bison
• https://www.gnu.org/software/bison/
• https://sourceforge.net/projects/gnuwin32/files/bison/2.4.1/bison-2.4.1-setup.exe/download?use_mirror=cfhcable
for Windows (but Cygwin or WSL works better)
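Anatomy of a Yacc file
%{
/* prelude: helper functions, written in C */
#include <stdio.h>
int yylex(void);
void yyerror(const char *msg);
%}
/* token definitions, to be used in lexer */
%union {
  int ival;
  char* sval;
}
%token <ival> NUM   /* assumed declarations for the tokens used below */
%token PLUS
%%
/* body: grammar rules and associated actions (again, written in C) */
exp:
    NUM           { printf("number\n"); }
  | exp PLUS exp  { printf("addition\n"); }
  ;
%%
/* end: arbitrary C code, can call the parsing function yyparse */
void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }
int main(void) {
  yyparse();
  return 0;
}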
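The Yacc file declares `yylex` but does not define it; the lexer has to supply it. As a rough sketch of that contract, here is a hand-written `yylex` for the two tokens used in the grammar. Everything around it is a stand-in so the block compiles on its own: the token codes and `YYSTYPE` would normally come from the generated y.tab.h, and `lex_set_input` is a hypothetical helper that makes the lexer read from a string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-ins for the token codes that yacc -d would put in y.tab.h;
   the actual values are chosen by the tool. */
enum { NUM = 258, PLUS = 259 };

/* Stand-in for the YYSTYPE generated from %union, and for yylval. */
typedef union { int ival; char *sval; } YYSTYPE;
YYSTYPE yylval;

/* Hypothetical input source: the lexer reads from this string. */
const char *lex_input = "";

void lex_set_input(const char *text) {
    assert(text != NULL);
    lex_input = text;
}

/* Return the next token code, or 0 at end of input. */
int yylex(void) {
    while (*lex_input == ' ' || *lex_input == '\n') lex_input++;
    if (*lex_input == '\0') return 0;
    if (isdigit((unsigned char)*lex_input)) {
        int v = 0;
        while (isdigit((unsigned char)*lex_input))
            v = v * 10 + (*lex_input++ - '0');
        yylval.ival = v;        /* semantic value travels through yylval */
        return NUM;
    }
    if (*lex_input == '+') {
        lex_input++;
        return PLUS;
    }
    /* Unknown character: hand it to the parser as a literal token,
       which the grammar above will reject as a syntax error. */
    return (unsigned char)*lex_input++;
}
```

After `lex_set_input("10 + 5")`, successive calls return NUM (with `yylval.ival` set to 10), PLUS, NUM (with 5), and then 0.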
Yacc Actions
NUM { printf("number\n"); }
• Printing a message like this is enough if we just want to know
what’s in the program
Running the Lexer
• Running yacc -d <filename>.y generates y.tab.c and y.tab.h
• Otherwise, we can use the lexer and parser as a library, and call
the generated yyparse function in other files (the rest of the
compiler)
Ambiguity
• Consider this grammar: S ⟼ S – S | number
• How do we parse 1 – 2 – 3?
  S ⟼ S – S               S ⟼ S – S
    ⟼ 1 – S                 ⟼ S – 3
    ⟼ 1 – S – S     vs.     ⟼ S – S – 3
    ⟼ 1 – 2 – S             ⟼ 1 – S – 3
    ⟼ 1 – 2 – 3             ⟼ 1 – 2 – 3
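The two derivations matter because they correspond to different trees, and the trees compute different results. The sketch below builds both trees by hand and evaluates them; the `Node` type and the variable names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parse-tree node for S -> S - S | number. */
typedef struct Node {
    int is_number;             /* 1 for a number leaf, 0 for a subtraction */
    int value;                 /* leaf value, unused for subtractions */
    const struct Node *left;   /* left operand of a subtraction */
    const struct Node *right;  /* right operand of a subtraction */
} Node;

/* Evaluate a tree bottom-up. */
int eval(const Node *n) {
    assert(n != NULL);
    if (n->is_number) return n->value;
    return eval(n->left) - eval(n->right);
}

/* Leaves shared by both trees. */
static const Node one   = {1, 1, NULL, NULL};
static const Node two   = {1, 2, NULL, NULL};
static const Node three = {1, 3, NULL, NULL};

/* First derivation: the second S is expanded again, giving 1 - (2 - 3). */
static const Node two_minus_three = {0, 0, &two, &three};
const Node right_nested = {0, 0, &one, &two_minus_three};

/* Second derivation: the first S is expanded again, giving (1 - 2) - 3. */
static const Node one_minus_two = {0, 0, &one, &two};
const Node left_nested = {0, 0, &one_minus_two, &three};
```

Here `eval(&right_nested)` is 2 and `eval(&left_nested)` is -4, so an ambiguous grammar leaves the meaning of the program undetermined.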
Associativity and Precedence
• Consider this grammar: S ⟼ E – S | E
                         E ⟼ number | ( S )
• This grammar makes ‘–’ right-associative
• If we want to generate 1 – 2 – 3:
  S ⟼ E – S               S ⟼ E – S
    ⟼ 1 – S                 ⟼ E – E
    ⟼ 1 – E – S     but     ⟼ E – 3
    ⟼ 1 – 2 – S             ⟼ can’t make 1 – 2 from E
    ⟼ 1 – 2 – E
    ⟼ 1 – 2 – 3
• So 1 – (2 – 3) is the only possible parse
• Note that the grammar is right recursive!
• Exercise: How would you make ‘–’ left-associative?
Eliminating Ambiguity
• We can often eliminate ambiguity by layering the grammar
(precedence) and allowing recursion only on one side
(associativity).
• Higher-precedence operators go farther from the start
symbol.
• Example: S ⟼ S + S | S * S | ( S ) | number
• To disambiguate:
– Decide (following math) to make ‘*’ higher precedence than ‘+’
– Make ‘+’ left associative
– Make ‘*’ right associative
    S0 ⟼ S0 + S1 | S1
    S1 ⟼ S2 * S1 | S2
    S2 ⟼ number | ( S0 )
• Now 1 + 2 + 3 * 4 must mean (1 + 2) + (3 * 4)
• Exercise: add some arithmetic ops to parse1.y, and give them the
right associativity and precedence! For instance: 1 – 2 – 3 * 4
should be -13
Expressions and Statements
• Most languages have at least two kinds of
nonterminals, expressions and statements
– Expressions (arithmetic, array lookup, ...) compute values
– Statements (assignment, loops, …) change state
• Usually expressions can appear inside statements, but
not vice versa, e.g.:
    stmt: ID ASSIGN exp SEMICOLON
• (Note that C breaks this rule, which makes everything
harder!)
• Extending a parser:
1. What do we want the syntax to look like?
2. Which parts have to be specific tokens (keywords, symbols,
identifiers) and which can be more complex structures
(expressions, statements)?
3. Add the new productions to the appropriate nonterminal
4. See if this causes any new conflicts, add
associativity/precedence directives as necessary