Compiler Rewind
Compilers
Programming languages are notations for describing computation to people
and to machines.
All the software running on computers was written in some programming
language.
Types of Languages:
High level languages
Low level languages
Machine level languages
Compilers
A program must be translated into a form in which it can be executed by a
computer.
Program in source language → Compiler → Program in target language
The compiler also reports errors.
Phases of Compiler
Lexical Analyzer in Perspective
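source program → lexical analyzer → token → parser
The parser sends "get next token" requests back to the lexical analyzer.
Both the lexical analyzer and the parser use the symbol table.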
Important Issue:
What are the responsibilities of each box?
Focus on Lexical Analyzer and Parser
Lexical Analyzer
Definition:
- Lexical analysis is the first phase of a compiler. It takes the modified
source code from language preprocessors, written in the form of
sentences.
- The lexical analyzer breaks this input into a series of tokens, removing
any white space and comments in the source code.
Source Code → Lexical Analyzer (lexer) → Tokens → Syntax Analyzer
The syntax analyzer sends requests for tokens back to the lexical analyzer.
Lexical Analysis
• Basic Concepts & Regular Expressions
• What does a Lexical Analyzer do?
• How does it Work?
• Formalizing Token Definition & Recognition
• PATTERN
• A description of a form that the lexemes of a token may take.
• For keywords, the pattern is just a sequence of characters that form keywords.
• Example: the rule that characterizes the set of strings for an integer token is [0-9]+
• LEXEME
• Actual sequence of characters that matches pattern and is classified by a token. Examples include:
• Identifiers: x, count, name, etc…
Example: E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
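The token stream above can be reproduced with a small regular-expression scanner. This is a minimal sketch, not how any particular lexer generator works: the function name `tokenize`, the list used as a symbol table, and the exact patterns are assumptions made for illustration. A symbol-table "pointer" is modeled as an index into that list.

```python
import re

# Hypothetical token names and patterns. Alternatives are tried in order,
# so the pattern for "**" must come before the pattern for "*".
TOKEN_SPEC = [
    ("num",       r"[0-9]+"),
    ("id",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("exp_op",    r"\*\*"),
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("skip",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return (tokens, symbol_table); tokens are (token_name, attribute) pairs."""
    symbol_table = []  # identifiers in order of first appearance
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue  # white space produces no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))  # operators carry no attribute
    return tokens, symbol_table


if __name__ == "__main__":
    for token in tokenize("E = M * C ** 2")[0]:
        print(token)
```

Here the lexemes E, M and C all match the same pattern and are all classified as the token id; only the attribute differs.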
Finite Automata
We shall now discover how Lex turns its input program into a lexical
analyzer. At the heart of the transition is the formalism known as finite
automata. These are essentially graphs, like transition diagrams, with
a few differences:
1. Finite automata are recognizers; they simply say "yes" or "no" about
each possible input string.
2. Finite automata come in two flavors:
(a) Nondeterministic finite automata (NFA) have no restrictions on the
labels of their edges. A symbol can label several edges out of the
same state, and ε, the empty string, is a possible label.
(b) Deterministic finite automata (DFA) have, for each state, and for
each symbol of its input alphabet exactly one edge with that symbol
leaving that state.
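Both flavors can be simulated in a few lines. The sketch below is illustrative only: the function names, the dictionary encoding of edges, and the use of the empty string as the ε label are assumptions of this example. In the DFA table a missing entry stands for an implicit dead state, so the "exactly one edge" rule still holds conceptually.

```python
EPS = ""  # label used for ε-edges in this sketch


def eps_closure(nfa, states):
    """All states reachable from `states` using only ε-edges."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for target in nfa.get((state, EPS), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def nfa_accepts(nfa, start, accepting, text):
    """Track the set of states the NFA could be in after each symbol."""
    current = eps_closure(nfa, {start})
    for ch in text:
        moved = set()
        for state in current:
            moved.update(nfa.get((state, ch), ()))
        current = eps_closure(nfa, moved)
    return bool(current & accepting)


def dfa_accepts(dfa, start, accepting, text):
    """A DFA is in exactly one state at a time."""
    state = start
    for ch in text:
        if (state, ch) not in dfa:
            return False  # implicit dead state
        state = dfa[(state, ch)]
    return state in accepting


# NFA for (a|b)*abb: the symbol "a" labels two edges out of state 0.
ABB_NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}

# NFA with an ε-edge, accepting a* (state 1 is accepting).
A_STAR_NFA = {(0, EPS): {1}, (1, "a"): {1}}

# DFA for the integer pattern [0-9]+ (state 1 is accepting).
INT_DFA = {(s, d): 1 for s in (0, 1) for d in "0123456789"}

if __name__ == "__main__":
    print(nfa_accepts(ABB_NFA, 0, {3}, "aabb"))
    print(dfa_accepts(INT_DFA, 0, {1}, "2024"))
```

Either function answers only "yes" or "no" for a given input string, which is what makes a finite automaton a recognizer.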
Syntactic Analysis
int* foo(int i, int j))        <- extra parenthesis
{
    for(k=0; i j; )            <- missing expression
        fi( i > j )            <- not a keyword
            return j;
}
Syntactic Analysis
int* foo(int i, int j)         <- return type mismatch
{
    for(k=0; i < j; j++ )      <- undeclared var
        if( i < j-2 )
            sum = sum+i        <- undeclared var
    return sum;                <- return type mismatch
}
Context-Free Grammars
The mathematical model of a grammar is:
G = (S,N,T,P)
• S is the start-symbol
• N is a set of non-terminal symbols
• T is a set of terminal symbols
• P is a set of productions, P: N → (N ∪ T)*
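The four components map directly onto a small data structure. The sketch below is an assumption-laden illustration: the class name `Grammar`, its field names and the `is_well_formed` check are invented for this example. It checks that every production has a single non-terminal on the left and a string over N ∪ T on the right, which is what makes the grammar context-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    start: str               # S
    nonterminals: frozenset  # N
    terminals: frozenset     # T
    productions: tuple       # P, as (lhs, rhs) pairs; rhs is a tuple of symbols

    def is_well_formed(self):
        symbols = self.nonterminals | self.terminals
        return (
            self.start in self.nonterminals
            and not (self.nonterminals & self.terminals)
            and all(
                lhs in self.nonterminals and all(sym in symbols for sym in rhs)
                for lhs, rhs in self.productions
            )
        )


# Example grammar: S -> a S a | b S b | c
PALINDROME = Grammar(
    start="S",
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"a", "b", "c"}),
    productions=(
        ("S", ("a", "S", "a")),
        ("S", ("b", "S", "b")),
        ("S", ("c",)),
    ),
)

if __name__ == "__main__":
    print(PALINDROME.is_well_formed())
```

An empty right-hand side tuple would represent an ε-production, since the empty string is a member of (N ∪ T)*.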
Context-free syntax is specified with a grammar, usually in Backus-Naur form (BNF).
In BNF, productions have the form
S → aSa
S → bSb
S → c
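Reading the three productions as S → aSa, S → bSb and S → c, membership in the language can be tested by undoing one production at a time. This is a minimal sketch for this one grammar only; the function name `derives_from_s` is hypothetical.

```python
def derives_from_s(text):
    """True if `text` can be derived from S, where S -> aSa | bSb | c."""
    if text == "c":
        return True  # S -> c
    if len(text) >= 3 and text[0] == text[-1] and text[0] in ("a", "b"):
        # S -> aSa or S -> bSb: strip the matching outer pair and recurse.
        return derives_from_s(text[1:-1])
    return False


if __name__ == "__main__":
    for candidate in ["c", "aca", "abcba", "abcab", ""]:
        print(repr(candidate), derives_from_s(candidate))
```

The language is the set of palindromes over a and b with a single c in the middle, which no regular expression can describe.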
1. <goal> := <expr>
2. <expr> := <expr> <op> <term>
3. | <term>
4. <term> := number
5. | id
6. <op> := +
7. | -
Deriving Valid Sentences
Given a grammar, valid sentences can be derived by repeated substitution.

Production  Result
-           <goal>
1           <expr>
2           <expr> <op> <term>
5           <expr> <op> y
7           <expr> - y
2           <expr> <op> <term> - y
4           <expr> <op> 2 - y
6           <expr> + 2 - y
3           <term> + 2 - y
5           x + 2 - y

To recognize a valid sentence in some CFG, we reverse this process
and build up a parse.
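The derivation above always rewrites the rightmost non-terminal, so it can be replayed mechanically from the list of production numbers. The sketch below assumes the seven numbered productions of the expression grammar; the names `PRODUCTIONS`, `rightmost_step` and `derive` are hypothetical. It works with token classes (id, number) rather than the lexemes x, 2 and y shown in the table.

```python
# Productions 1-7 of the expression grammar, keyed by rule number.
PRODUCTIONS = {
    1: ("<goal>", ["<expr>"]),
    2: ("<expr>", ["<expr>", "<op>", "<term>"]),
    3: ("<expr>", ["<term>"]),
    4: ("<term>", ["number"]),
    5: ("<term>", ["id"]),
    6: ("<op>", ["+"]),
    7: ("<op>", ["-"]),
}


def rightmost_step(form, rule):
    """Rewrite the rightmost non-terminal of `form` using production `rule`."""
    lhs, rhs = PRODUCTIONS[rule]
    positions = [i for i, sym in enumerate(form) if sym.startswith("<")]
    if not positions:
        raise ValueError("no non-terminal left to rewrite")
    idx = positions[-1]
    if form[idx] != lhs:
        raise ValueError(f"rule {rule} does not apply to {form[idx]}")
    return form[:idx] + rhs + form[idx + 1:]


def derive(rules):
    """Return every sentential form produced by applying `rules` in order."""
    form = ["<goal>"]
    steps = [form]
    for rule in rules:
        form = rightmost_step(form, rule)
        steps.append(form)
    return steps


if __name__ == "__main__":
    for step in derive([1, 2, 5, 7, 2, 4, 6, 3, 5]):
        print(" ".join(step))
```

The last sentential form contains only terminals, which is the point at which the derivation has produced a sentence.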
Syntax Trees/Parse Trees/Generation Trees
Abstract Syntax Trees
A parse tree contains more detail than the later phases need, so compilers often use an abstract syntax tree (AST).
The Frontend
Parser
• Checks the stream of words and their parts of speech (produced by the scanner) for
grammatical correctness
• Determines if the input is syntactically well formed
• Guides checking at deeper levels than syntax
• Builds an intermediate representation (IR) of the code
Parsing
Parsing: The big picture
Bottom-up parser:
- starts at the leaves and fills in
- starts in a state valid for legal first tokens
- as input is consumed, changes state to encode possibilities (recognize valid prefixes)
- uses a stack to store both state and sentential forms
Derivations
• At each step, we choose a non-terminal to replace
• Different choices can lead to different derivations
Derivations and Parse Trees
Leftmost derivation
This evaluates as x – ( 2 * y )
Rightmost derivation
This evaluates as ( x – 2 ) * y
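The two groupings above are different trees, and they generally compute different values. The sketch below encodes each tree as a nested tuple and evaluates it; the tuple encoding, the function name `evaluate` and the sample values for x and y are assumptions chosen for illustration.

```python
import operator

OPS = {"-": operator.sub, "*": operator.mul}


def evaluate(tree, env):
    """Evaluate a tree of the form (op, left, right), a variable name, or a number."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):
        return env[tree]
    return tree


TREE_MUL_FIRST = ("-", "x", ("*", 2, "y"))  # x - (2 * y)
TREE_SUB_FIRST = ("*", ("-", "x", 2), "y")  # (x - 2) * y

if __name__ == "__main__":
    env = {"x": 10, "y": 3}
    print(evaluate(TREE_MUL_FIRST, env))  # 10 - (2 * 3) = 4
    print(evaluate(TREE_SUB_FIRST, env))  # (10 - 2) * 3 = 24
```

With x = 10 and y = 3 the results are 4 and 24, so the choice of tree changes the meaning of the program, not just its shape.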
Ambiguity
This sentential form has two derivations
if Expr1 then if Expr2 then Stmt1 else Stmt2
Removing the ambiguity
• Must rewrite the grammar to avoid generating the problem
• Match each else to innermost unmatched if (common sense rule)
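The innermost-if rule can be seen in a tiny parser. Note that this sketch does not rewrite the grammar as the first bullet describes; it swaps in a different technique and resolves the ambiguity inside the parser, by letting the statement being parsed claim a following else immediately. The function name `parse_stmt`, the token list format and the tuple tree are assumptions for illustration.

```python
def parse_stmt(tokens, pos=0):
    """Parse  stmt -> 'if' EXPR 'then' stmt ['else' stmt] | OTHER.

    Returns (tree, next_position). Conditions and simple statements are
    single tokens in this sketch.
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1  # a simple statement
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError("expected 'then'")
    condition = tokens[pos + 1]
    then_branch, pos = parse_stmt(tokens, pos + 3)
    else_branch = None
    # The innermost unmatched 'if' is the one being parsed right now,
    # so it takes the 'else' before control returns to any outer 'if'.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
    return ("if", condition, then_branch, else_branch), pos


if __name__ == "__main__":
    text = "if Expr1 then if Expr2 then Stmt1 else Stmt2"
    tree, _ = parse_stmt(text.split())
    print(tree)
```

For the sentential form above, the else ends up attached to the inner if (the one testing Expr2), and the outer if has no else branch.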
Role of a Parser
• Not all sequences of tokens are programs.
• The parser must distinguish between valid and invalid
sequences of tokens.
• We need:
  - An expressive way to describe the syntax.
  - An acceptor mechanism that determines if the input
    token stream satisfies the syntax.
Study of Parsing
- Parsing is the process of discovering a derivation for
some sentence.
- It needs a mathematical model of syntax: a grammar G.
- It needs an algorithm for testing membership in L(G).
Two Approaches
• Top-down parsers (LL(1), recursive descent)
• Start at the root of the parse tree and grow toward the leaves
• Pick a production and try to match the input
• A bad “pick” may force the parser to backtrack
Grammars and Parsers
• LL(1) parsers
  • Left-to-right input
  • Leftmost derivation
  • 1 symbol of look-ahead
Grammars that these parsers can handle are called LL(1) grammars.
Parsing Restriction
• In a compiler’s parser, we don’t have long-distance vision. We are usually
limited to just one symbol of lookahead. The lookahead symbol is the next
symbol coming up in the input.
• Another restriction is that we would rather not implement backtracking.
Recursive Descent
• The first technique for implementing a predictive parser is called recursive-descent.
• A recursive-descent parser consists of several small functions, one for each nonterminal in the grammar.
• As we parse a sentence, we call the functions that correspond to the left-side nonterminal of the
productions we are applying. If these productions are recursive, we end up calling the functions recursively.
Example
• Expression grammar
# Production rule
1 expr → expr + term
2 | expr - term
3 | term
4 term → term * factor
5 | term / factor
6 | factor
7 factor → number
8 | identifier
• Input string x – 2 * y
Example
Rule  Sentential form   Input string
-     expr              x - 2 * y
1     expr + term       x - 2 * y
3     term + term       x - 2 * y
6     factor + term     x - 2 * y
8     <id> + term       x - 2 * y
-     <id,x> + term     x - 2 * y
Partial parse tree: expr → expr + term, with the left expr expanded to term → factor → x.
• Problem:
  • Can’t match the next terminal
  • We guessed wrong at the first expansion (expr + term)
Backtracking
Rule  Sentential form   Input string
-     expr              x - 2 * y
1     expr + term       x - 2 * y
3     term + term       x - 2 * y
6     factor + term     x - 2 * y
8     <id> + term       x - 2 * y
?     <id,x> + term     x - 2 * y
Undo all these productions:
• Roll back the productions
• Choose a different production for expr
• Continue
Retrying
Rule  Sentential form    Input string
-     expr               x - 2 * y
2     expr - term        x - 2 * y
3     term - term        x - 2 * y
6     factor - term      x - 2 * y
8     <id> - term        x - 2 * y
-     <id,x> - term      x - 2 * y
6     <id,x> - factor    x - 2 * y
7     <id,x> - <num>     x - 2 * y
Parse tree so far: expr → expr - term, with the left expr expanded to term → factor → x and the right term expanded to factor → 2.
• Problem:
  • More input to read
  • Another cause of backtracking
Successful Parse
Rule  Sentential form              Input string
-     expr                         x - 2 * y
2     expr - term                  x - 2 * y
3     term - term                  x - 2 * y
6     factor - term                x - 2 * y
8     <id> - term                  x - 2 * y
-     <id,x> - term                x - 2 * y
4     <id,x> - term * factor       x - 2 * y
6     <id,x> - factor * factor     x - 2 * y
7     <id,x> - <num> * factor      x - 2 * y
-     <id,x> - <num,2> * factor    x - 2 * y
8     <id,x> - <num,2> * <id>      x - 2 * y
Parse tree: expr → expr - term; the left expr expands to term → factor → x; the right term expands to term * factor, with term → factor → 2 and factor → y.
• All terminals match – we’re finished
Other Possible Parses
Rule  Sentential form                     Input string
-     expr                                x - 2 * y
1     expr + term                         x - 2 * y
1     expr + term + term                  x - 2 * y
1     expr + term + term + term           x - 2 * y
1     expr + term + term + term + term    x - 2 * y
• Problem: termination
• Wrong choice leads to infinite expansion
(More importantly: without consuming any input!)
• May not be as obvious as this
• Our grammar is left recursive
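The termination problem is easy to reproduce. The sketch below writes the function for expr exactly as the left-recursive production expr → expr + term suggests; the names `parse_expr` and `try_parse` are hypothetical. Python stops the runaway recursion with a RecursionError, which the wrapper reports.

```python
def parse_expr(tokens, pos):
    """Naive function for  expr -> expr + term, trying that alternative first."""
    # The first symbol on the right-hand side is expr itself, so the function
    # calls itself at the same position before consuming any input.
    left, pos = parse_expr(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":
        return ("+", left, tokens[pos + 1]), pos + 2
    return left, pos


def try_parse(tokens):
    try:
        parse_expr(tokens, 0)
    except RecursionError:
        return "did not terminate: recursion without consuming input"
    return "parsed"


if __name__ == "__main__":
    print(try_parse(["x", "-", "2", "*", "y"]))
```

The input never matters here: the position argument is the same in every recursive call, which is the "without consuming any input" part of the problem.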
Left Recursion
• Formally,
A grammar is left recursive if there is a non-terminal A such that A →* A α (for some string of symbols α)
What does →* mean? It means "derives, possibly through several steps", so the recursion may be indirect:
A → B x
B → A y
• Bad news:
Top-down parsers cannot handle left recursion
• Good news:
We can systematically eliminate left recursion
Removing Left Recursion
• Two cases of left recursion:
#  Production rule
1  expr → expr + term
2       | expr - term
3       | term
4  term → term * factor
5       | term / factor
6       | factor
• Transform as follows:
#  Production rule
1  expr  → term expr2
2  expr2 → + term expr2
3        | - term expr2
4        | ε
5  term  → factor term2
6  term2 → * factor term2
7        | / factor term2
8        | ε
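The transformation can be written as a short function for the immediate case A → A α | β, which becomes A → β A2 and A2 → α A2 | ε. This is a sketch of the standard textbook rule, not of any specific tool; the function name and the list-of-lists encoding are assumptions, and an empty list stands for ε. Indirect left recursion is not handled.

```python
def eliminate_immediate_left_recursion(nonterminal, alternatives, new_nonterminal):
    """Rewrite  A -> A a1 | ... | b1 | ...  as  A -> b A2,  A2 -> a A2 | ε.

    `alternatives` is a list of right-hand sides, each a list of symbols.
    Returns a dict mapping each non-terminal to its new alternatives.
    """
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}  # nothing to do
    return {
        nonterminal: [alt + [new_nonterminal] for alt in others],
        new_nonterminal: [alt + [new_nonterminal] for alt in recursive] + [[]],
    }


if __name__ == "__main__":
    expr_rules = [["expr", "+", "term"], ["expr", "-", "term"], ["term"]]
    for lhs, alts in eliminate_immediate_left_recursion("expr", expr_rules, "expr2").items():
        for alt in alts:
            print(lhs, "→", " ".join(alt) if alt else "ε")
```

Applied to the expr and term rules, this yields the same expr2 and term2 productions as the transformed grammar.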
Right-Recursive Grammar
#   Production rule
1   expr   → term expr2
2   expr2  → + term expr2
3          | - term expr2
4          | ε
5   term   → factor term2
6   term2  → * factor term2
7          | / factor term2
8          | ε
9   factor → number
10         | identifier
Two productions (1 and 5) offer no choice at all.
All other productions are uniquely identified by a terminal symbol at the start of the RHS.
• We can choose the right production by looking at the next input symbol
  • This is called lookahead
  • BUT, this can be tricky…
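A recursive-descent parser for this grammar needs one function per non-terminal and one symbol of lookahead. The sketch below is a minimal illustration under several assumptions: the class and function names are invented, tokens come from a simple regular expression that silently ignores characters it does not recognize, and each ε-production is chosen whenever the lookahead is not one of the operators that start the other alternatives. The expr2 and term2 functions receive the operand parsed so far, so the resulting tree groups operators to the left.

```python
import re


def tokenize(text):
    # Numbers, identifiers and the four operators; anything else is skipped.
    return re.findall(r"[0-9]+|[A-Za-z_]\w*|[-+*/]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """The lookahead symbol: the next token, or None at end of input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expr(self):            # expr → term expr2
        return self.expr2(self.term())

    def expr2(self, left):     # expr2 → + term expr2 | - term expr2 | ε
        if self.peek() in ("+", "-"):
            op = self.tokens[self.pos]
            self.pos += 1
            right = self.term()
            return self.expr2((op, left, right))
        return left            # ε: lookahead is not + or -

    def term(self):            # term → factor term2
        return self.term2(self.factor())

    def term2(self, left):     # term2 → * factor term2 | / factor term2 | ε
        if self.peek() in ("*", "/"):
            op = self.tokens[self.pos]
            self.pos += 1
            right = self.factor()
            return self.term2((op, left, right))
        return left            # ε: lookahead is not * or /

    def factor(self):          # factor → number | identifier
        tok = self.peek()
        if tok is None or tok in ("+", "-", "*", "/"):
            raise SyntaxError(f"expected number or identifier, got {tok!r}")
        self.pos += 1
        return int(tok) if tok.isdigit() else tok


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


if __name__ == "__main__":
    print(parse("x - 2 * y"))
```

No function ever backtracks: every decision is made by looking only at the next token, which is the property the right-recursive form of the grammar was designed to provide.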
Examples NFA