4.parsing

CS 473: COMPILER DESIGN
Adapted from slides by Steve Zdancewic, UPenn

PARSING
creating an abstract representation of program syntax
Today: Parsing
Source Code (character stream):
    if (b == 0) { a = 1; }
  ↓ Lexical Analysis
Token stream:
    if ( b == 0 ) { a = 1 ; }
  ↓ Parsing
Abstract Syntax Tree:
    If
      Eq (b, 0)
      Assn (a, 1)
      None
  ↓ Analysis & Transformation
  ↓ Backend
Assembly Code:
    l1:
      cmpq $0, %rax
      je l2
      jmp l3
    l2:
      …
Parsing: Structure
Tokens:                   Structure:
10 + 5                    add 10 and 5
f ( x )                   call a function f with argument x
if ( b ) { a = 0 ; }      if statement with condition b and body that assigns 0 to a

• Figure out what role each token is playing
• Catch most syntax errors (“valid pieces, bad arrangement”)
• Understand what program the user wrote
• Turn it into a representation we can traverse and analyze
Parsing: Overview
• Input: stream of tokens (generated by lexer)
• Output: abstract syntax tree

• Strategy:
– Parse the token stream to build a tree showing how the pieces relate
– Forget the “concrete” syntax, remember the “abstract” syntax

• Will catch lots of malformed programs! Wrong number of operators, missing semicolons, unmatched parens; most “syntax errors” appear here
– But no type errors, initialization, etc.: we still don’t know what anything means!
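To make the “abstract” side concrete, here is a minimal sketch of how the tree for `if (b == 0) { a = 1; }` might be represented. The class names follow the node labels on the pipeline slide (If, Eq, Assn); the field names and the use of Python dataclasses are assumptions made for illustration, not the course's actual representation.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Var:
    name: str


@dataclass
class Num:
    value: int


# An expression is a variable, a number, or an equality test.
Expr = Union[Var, Num, "Eq"]


@dataclass
class Eq:
    left: Expr
    right: Expr


@dataclass
class Assn:
    target: str
    value: Expr


@dataclass
class If:
    cond: Expr
    then: Assn
    orelse: Optional[Assn] = None   # "None" child on the slide: no else branch


# AST for: if (b == 0) { a = 1; }
# Parentheses, braces, and the semicolon (concrete syntax) are gone.
tree = If(cond=Eq(Var("b"), Num(0)), then=Assn("a", Num(1)))
```

Note that nothing in the tree records where the parentheses or braces were; only the structure survives.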
Describing Syntax
• How are we going to write these things down?

• Exercise: Describe the structure of an if statement, in terms of the tokens involved.

“An if statement starts with the token IF, then an LPAREN, some tokens that make up a condition, and an RPAREN, then an LBRACE, some tokens that make up the body, and an RBRACE.”

if_stmt ::= IF LPAREN cond RPAREN LBRACE stmts RBRACE

• Note: regexps aren’t expressive enough for this, because they can’t really do recursion (example: can’t describe matching parentheses)
CONTEXT-FREE GRAMMARS
Context-Free Grammars
• Here is a specification of the language of balanced parens:
    S ⟼ ( S ) S
    S ⟼ ε
• The definition is recursive: S mentions itself.

• Idea: “derive” a string in the language by starting with S and rewriting according to the rules:
– Example: S ⟼ (S)S ⟼ ((S)S)S ⟼ ((ε)S)S ⟼ ((ε)S)ε ⟼ ((ε)ε)ε = (())

• You can replace the “nonterminal” S by its definition anywhere
• A context-free grammar accepts a string when there is a derivation of that string from the start symbol
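The two productions translate almost directly into a recursive function. This is a hand-written sketch, not generated parser code; the names `parse_S` and `balanced` are made up for this example.

```python
def parse_S(s: str, i: int = 0) -> int:
    """Consume one S starting at index i and return the index just after it.

    S -> ( S ) S | ε. Raises ValueError if an opened '(' is never closed.
    """
    if i < len(s) and s[i] == "(":
        i = parse_S(s, i + 1)              # the S inside the parentheses
        if i >= len(s) or s[i] != ")":
            raise ValueError(f"expected ')' at position {i}")
        return parse_S(s, i + 1)           # the S that follows the ')'
    return i                               # S -> ε: consume nothing


def balanced(s: str) -> bool:
    """True when the whole string can be derived from S."""
    try:
        return parse_S(s) == len(s)
    except ValueError:
        return False
```

For example, `balanced("(())")` and `balanced("()()")` are true, while `balanced("(()")` and `balanced(")(")` are false. The recursion in `parse_S` is exactly what a regular expression cannot express.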
Context-Free Grammars: Definition
• A Context-Free Grammar (CFG) consists of
– A set of terminals (e.g., a lexical token or ε)
– A set of nonterminals (e.g., S and other syntactic variables)
– A designated nonterminal called the start symbol
– A set of productions: LHS ⟼ RHS
  • LHS is a nonterminal
  • RHS is a string of terminals and nonterminals

• Example: The balanced parentheses language:
    S ⟼ ( S ) S
    S ⟼ ε
• Exercise: How many terminals? How many nonterminals? Productions?
Another Example: Sum Grammar
• A grammar that accepts parenthesized sums of numbers:
    S ⟼ E + S | E
    E ⟼ number | ( S )
  e.g.: (1 + 2 + (3 + 4)) + 5

• Note the vertical bar ‘|’ is shorthand for multiple productions:

    S ⟼ E + S        4 productions
    S ⟼ E            2 nonterminals: S, E
    E ⟼ number       4 terminals: (, ), +, number
    E ⟼ ( S )        Start symbol: S
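The counts on this slide can be checked mechanically once the grammar is written down as data. The list-of-pairs encoding below is just one convenient assumption; any representation of productions would do.

```python
# Each production is (LHS, RHS), with the RHS as a list of symbols.
PRODUCTIONS = [
    ("S", ["E", "+", "S"]),
    ("S", ["E"]),
    ("E", ["number"]),
    ("E", ["(", "S", ")"]),
]
START = "S"

# Nonterminals are exactly the symbols that appear on some left-hand side.
nonterminals = {lhs for lhs, _ in PRODUCTIONS}

# Every other symbol that appears on a right-hand side is a terminal.
terminals = {sym for _, rhs in PRODUCTIONS for sym in rhs} - nonterminals

print(len(PRODUCTIONS), sorted(nonterminals), sorted(terminals))
```

The same three lines of set arithmetic answer the earlier exercise for the balanced-parentheses grammar.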
Derivations in CFGs
• Grammar:
    S ⟼ E + S | E
    E ⟼ number | ( S )
• Example: derive (1 + 2 + (3 + 4)) + 5
    S ⟼ E + S
      ⟼ (S) + S
      ⟼ (E + S) + S
      ⟼ (1 + S) + S
      ⟼ (1 + E + S) + S
      ⟼ (1 + 2 + S) + S
      ⟼ (1 + 2 + E) + S
      ⟼ (1 + 2 + (S)) + S
      ⟼ (1 + 2 + (E + S)) + S
      ⟼ (1 + 2 + (3 + S)) + S
      ⟼ (1 + 2 + (3 + E)) + S
      ⟼ (1 + 2 + (3 + 4)) + S
      ⟼ (1 + 2 + (3 + 4)) + E
      ⟼ (1 + 2 + (3 + 4)) + 5

• For arbitrary strings α, β, γ and production rule A ⟼ β, a single step of the derivation is:
    αAγ ⟼ αβγ    (substitute β for an occurrence of A)
• In general, there are many possible derivations for a given string
• Note: each step here expands the leftmost nonterminal (underlined on the original slide).
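A single derivation step is just list surgery. The sketch below represents a sentential form as a list of symbols and replays a short derivation of `1 + 2`; the helper name `step` and the choice to write the digits directly in place of the `number` token are assumptions for illustration.

```python
def step(form, lhs, rhs):
    """One derivation step: rewrite the leftmost occurrence of lhs with rhs.

    This is the αAγ ⟼ αβγ rule with A = lhs and β = rhs.
    """
    i = form.index(lhs)                    # raises ValueError if lhs is absent
    return form[:i] + rhs + form[i + 1:]


# S ⟼ E + S ⟼ 1 + S ⟼ 1 + E ⟼ 1 + 2
choices = [
    ("S", ["E", "+", "S"]),
    ("E", ["1"]),
    ("S", ["E"]),
    ("E", ["2"]),
]

form = ["S"]
history = [" ".join(form)]
for lhs, rhs in choices:
    form = step(form, lhs, rhs)
    history.append(" ".join(form))

print(" ⟼ ".join(history))
```

Changing the order of `choices` gives a different derivation of the same string, which is the point of the last bullet above.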
Loops and Termination
• Some care is needed when defining grammars
• Consider:
    S ⟼ E
    E ⟼ S
– This grammar has nonterminal definitions that are “nonproductive” (i.e. they don’t mention any terminal symbols)
– There is no finite derivation starting from S, so the language is empty.
• Consider:
    S ⟼ ( S )
– This grammar is productive, but again there is no finite derivation starting from S, so the language is empty

• When writing a large grammar, it’s easy to accidentally “chain” many nonterminals without a base case
• Upshot: be aware of “vacuously empty” CFGs.
– Every nonterminal should eventually rewrite to an alternative that contains only terminal symbols.
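The “base case” condition can be checked automatically. The sketch below computes, by fixed-point iteration, the set of nonterminals that can eventually rewrite to a string of terminals; a grammar is vacuously empty when its start symbol is not in that set. The function name `generating` and the grammar encoding are assumptions for this example.

```python
def generating(productions):
    """Return the nonterminals that can derive some string of terminals.

    productions is a list of (LHS, RHS) pairs; an empty RHS stands for ε.
    """
    nonterminals = {lhs for lhs, _ in productions}
    done = set()
    changed = True
    while changed:                         # repeat until nothing new is found
        changed = False
        for lhs, rhs in productions:
            # lhs is fine if every RHS symbol is a terminal or already fine.
            if lhs not in done and all(
                sym in done or sym not in nonterminals for sym in rhs
            ):
                done.add(lhs)
                changed = True
    return done


loop = [("S", ["E"]), ("E", ["S"])]
nested = [("S", ["(", "S", ")"])]
parens = [("S", ["(", "S", ")", "S"]), ("S", [])]

print(generating(loop), generating(nested), generating(parens))
```

Both grammars from this slide give the empty set, while the balanced-parentheses grammar gives `{'S'}` because of its ε production.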
PARSER GENERATORS
debugging parser conflicts
disambiguating grammars
Getting Started with Yacc/Bison
• https://www.gnu.org/software/bison/
• https://sourceforge.net/projects/gnuwin32/files/bison/2.4.1/bison-2.4.1-setup.exe/download?use_mirror=cfhcable
  for Windows (but Cygwin or WSL works better)

• Run yacc -d <grammar>.y or bison -yd <grammar>.y to get two files:
– y.tab.c implements the parser
– y.tab.h defines tokens and values for use in lexing; include it in the .lex file (replaces files like tokens.h)

• Not every grammar can be automatically parsed!
– When run, yacc/bison reports the number of shift/reduce and reduce/reduce conflicts
– yacc -dv or bison -ydv also produces y.output, which describes the parser states and conflicts
Anatomy of a Yacc file
%{
/* prelude: helper functions and declarations, written in C */
#include <stdio.h>
int yylex(void);
void yyerror(const char *msg);
%}

/* token definitions, to be used in lexer */
%union {
  int ival;
  char* sval;
}

%token <ival> NUM
%token PLUS

%%
/* body: grammar rules and associated actions (again, written in C) */
exp:
    NUM          { printf("number\n"); }
  | exp PLUS exp { printf("addition\n"); }
  ;
%%

/* end: arbitrary C code, can call the parsing function yyparse */
void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }

int main(){
  return yyparse();
}
Yacc Actions
NUM { printf("number\n"); }
    if we just want to know what’s in the program

NUM { $$ = $1; }
    $$ is the return value for this production; $1 gets the value of the first symbol

exp PLUS exp { $$ = $1 + $3; }
    access values of tokens and nonterminals by their position in the rule

• Later, we’ll build a representation of the program instead of running it right away!
Running the Parser
• Running yacc -d <filename>.y generates y.tab.c and y.tab.h

• y.tab.c defines a function called yyparse, which parses a stream of tokens according to the grammar (using yylex to get the tokens)

• y.tab.h defines the tokens and their values, and should be used in the lexer instead of defining them there

• If the parser has a main function, we can just compile and run y.tab.c together with the lex.yy.c from the lexer

• Otherwise, we can use the lexer and parser as a library, and call the generated yyparse function in other files (the rest of the compiler)

• Adding the -v argument (e.g. yacc -dv <filename>.y) also generates y.output, which can help with debugging (more on this later)
Ambiguity
• Consider this grammar:
    S ⟼ S – S | number
• How do we parse 1 – 2 – 3?

    S ⟼ S – S                       S ⟼ S – S
      ⟼ 1 – S                         ⟼ S – 3
      ⟼ 1 – S – S         vs.         ⟼ S – S – 3
      ⟼ 1 – 2 – S                     ⟼ 1 – S – 3
      ⟼ 1 – 2 – 3                     ⟼ 1 – 2 – 3

    “This is an expression that       “This is an expression that
     does 1 – (2 – 3)”                 does (1 – 2) – 3”
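The ambiguity matters because the two parses compute different answers. The sketch below enumerates every parse tree the grammar S ⟼ S – S | number allows for a list of numbers and evaluates each one; the helper names `trees` and `value` are invented for this example, and a tree is encoded as either a number or a `(left, right)` pair.

```python
def trees(nums):
    """All parse trees for nums[0] - nums[1] - ... under S -> S - S | number."""
    if len(nums) == 1:
        return [nums[0]]                   # S -> number
    result = []
    for k in range(1, len(nums)):          # pick which '-' is the root
        for left in trees(nums[:k]):
            for right in trees(nums[k:]):
                result.append((left, right))
    return result


def value(tree):
    """Evaluate a tree, treating each pair as a subtraction."""
    if isinstance(tree, tuple):
        return value(tree[0]) - value(tree[1])
    return tree


for t in trees([1, 2, 3]):
    print(t, "=", value(t))
```

For `1 – 2 – 3` this prints two trees: `(1, (2, 3))` with value 2 and `((1, 2), 3)` with value -4.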
Associativity and Precedence
• Consider this grammar:
    S ⟼ E – S | E
    E ⟼ number | ( S )
• This grammar makes ‘–’ right-associative
• If we want to generate 1 – 2 – 3:

    S ⟼ E – S                       S ⟼ E – S
      ⟼ 1 – S                         ⟼ E – E
      ⟼ 1 – E – S         but         ⟼ E – 3
      ⟼ 1 – 2 – S                     ⟼ can’t make 1 – 2 from E
      ⟼ 1 – 2 – E
      ⟼ 1 – 2 – 3

• So 1 – (2 – 3) is the only possible parse
• Note that the grammar is right recursive!
• Exercise: How would you make ‘–’ left-associative?
Eliminating Ambiguity
• We can often eliminate ambiguity by layering the grammar (precedence) and allowing recursion only on one side (associativity).
• Higher-precedence operators go farther from the start symbol.
• Example: S ⟼ S + S | S * S | ( S ) | number

• To disambiguate:
– Decide (following math) to make ‘*’ higher precedence than ‘+’
– Make ‘+’ left associative
– Make ‘*’ right associative

    S0 ⟼ S0 + S1 | S1
    S1 ⟼ S2 * S1 | S2
    S2 ⟼ number | ( S0 )

• Now 1 + 2 + 3 * 4 must mean (1 + 2) + (3 * 4)
• Note that operations can only appear in the bottom nonterminal S2 if they’re parenthesized
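The layered grammar can be turned into a small hand-written parser, one function per nonterminal. This is a recursive-descent sketch, not what yacc generates. One assumption to flag: the left-recursive rule S0 ⟼ S0 + S1 cannot be coded as a direct recursive call (it would never terminate), so `s0` uses a loop that builds the same left-leaning tree. Trees are encoded as nested tuples `(op, left, right)`.

```python
import re


def parse(src):
    """Parse src with the S0/S1/S2 grammar and return a nested-tuple tree."""
    toks = re.findall(r"\d+|[+*()]", src)  # other characters are skipped
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        pos += 1
        return tok

    def s0():                              # S0 -> S0 + S1 | S1 (left-assoc)
        left = s1()
        while peek() == "+":
            eat("+")
            left = ("+", left, s1())
        return left

    def s1():                              # S1 -> S2 * S1 | S2 (right-assoc)
        left = s2()
        if peek() == "*":
            eat("*")
            return ("*", left, s1())
        return left

    def s2():                              # S2 -> number | ( S0 )
        if peek() == "(":
            eat("(")
            inner = s0()
            eat(")")
            return inner
        tok = eat()
        if not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return int(tok)

    tree = s0()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("1 + 2 + 3 * 4"))
```

The printed tree is `('+', ('+', 1, 2), ('*', 3, 4))`, i.e. (1 + 2) + (3 * 4), and `parse("2 * 3 * 4")` groups to the right as `('*', 2, ('*', 3, 4))`.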
Precedence and Associativity Declarations
• Parser generators like yacc/bison support precedence and associativity declarations
– Resolve common conflicts without changing the grammar by hand
• Example:
    %left PLUS
    %left TIMES
• Tokens can be declared %left, %right, or %nonassoc
• Tokens further down have higher precedence (bind tighter, get evaluated first)
• Precedence of a rule is based on the precedence of its last terminal:
    E ⟼ E + E                    has the precedence of “+”
    E ⟼ if E then E else E       has the precedence of “else”
• Can’t apply precedence to nonterminals

• Exercise: add some arithmetic ops to parse1.y, and give them the right associativity and precedence! For instance: 1 – 2 – 3 * 4 should be -13
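To see what the declarations are meant to achieve, here is a table-driven evaluator in which later entries bind tighter, mirroring the order of %left lines. This is a precedence-climbing sketch, a different technique from yacc's own (yacc uses the declarations to resolve shift/reduce conflicts in its parse table); the `DECLS` table and the function name `evaluate` are assumptions for illustration, and there is no error handling for malformed input.

```python
import re

# Later entries bind tighter, like later %left/%right lines in a .y file.
DECLS = [("left", ["+", "-"]), ("left", ["*"])]
PREC = {
    op: (level, assoc)
    for level, (assoc, ops) in enumerate(DECLS, 1)
    for op in ops
}


def evaluate(src):
    """Evaluate an expression over integers and the operators in DECLS."""
    toks = re.findall(r"\d+|[-+*]", src)
    pos = 0

    def expr(min_level):
        nonlocal pos
        value = int(toks[pos])
        pos += 1
        # Keep absorbing operators that bind at least as tightly as min_level.
        while pos < len(toks) and PREC[toks[pos]][0] >= min_level:
            op = toks[pos]
            pos += 1
            level, assoc = PREC[op]
            # Left-assoc: the right operand may only contain tighter operators.
            rhs = expr(level + 1 if assoc == "left" else level)
            value = {"+": value + rhs, "-": value - rhs, "*": value * rhs}[op]
        return value

    return expr(1)


print(evaluate("1 - 2 - 3 * 4"))
```

With `-` left-associative and `*` binding tighter, the expression groups as (1 - 2) - (3 * 4) and the result is -13, matching the exercise. The code uses the ASCII hyphen for minus rather than the typographic dash on the slides.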
Expressions and Statements
• Most languages have at least two kinds of nonterminals, expressions and statements
– Expressions (arithmetic, array lookup, ...) compute values
– Statements (assignment, loops, …) change state
• Usually expressions can appear inside statements, but not vice versa
    stmt: ID ASSIGN exp SEMICOLON
• (Note that C breaks this rule, which makes everything harder!)

• Once the grammar is a little more complicated, having the parser act as an interpreter is much harder and less efficient!
– We’ll want to have the parser produce an internal representation of the program structure instead
Program 2: Parser
• Posted on the course website (https://www.cs.uic.edu/~mansky/teaching/cs473/sp21/program2.html)
• Extend a simple parser with support for more features
• Due next Wednesday at the start of class
• Submit via Gradescope

• Extending a parser:
1. What do we want the syntax to look like?
2. Which parts have to be specific tokens (keywords, symbols, identifiers) and which can be more complex structures (expressions, statements)?
3. Add the new productions to the appropriate nonterminal
4. See if this causes any new conflicts, add associativity/precedence directives as necessary