
Compiler Rewind

Compilers
Programming languages are notations for describing computation to people
and to machines.

All the software running on our computers was written in some programming
language.

Types of Languages:
High level languages
Low level languages
Machine level languages
Compilers
A program must be translated into a form in which it can be executed by a
computer.

The software systems that do this translation are called COMPILERS

Program in source language → Compiler → Program in target language
(the compiler also reports errors)
Phases of Compiler

(diagram of the phases of a compiler)
Lexical Analyzer in Perspective
source program → lexical analyzer → token → parser
(the parser requests the next token from the lexical analyzer; both consult the symbol table)

Important issue:
What are the responsibilities of each box?
Focus on Lexical Analyzer and Parser
Lexical Analyzer
Definition:
- Lexical analysis is the first phase of a compiler. It takes the modified
source code produced by language preprocessors, which is written in the form
of sentences.
- Lexical analysis breaks this text into a series of tokens, removing any
whitespace and comments in the source code.

Source code → Lexical Analyzer → tokens → Syntax Analyzer
(the syntax analyzer issues requests for tokens back to the lexical analyzer)
Lexical Analysis
• Basic Concepts & Regular Expressions
• What does a Lexical Analyzer do?
• How does it Work?
• Formalizing Token Definition & Recognition

• Reviewing Finite Automata Concepts


• Non-Deterministic and Deterministic FA
• Conversion Process
• Regular Expressions to NFA
• NFA to DFA

• Relating NFAs/DFAs/conversion to lexical analysis


Basic Terminology
• What are Major Terms for Lexical Analysis?
• TOKEN
• A pair consisting of a token name and an optional attribute value.
• A particular keyword, or a sequence of input characters denoting an identifier.
• Examples include Identifier, Integer, Float, Assign, LParen, RParen, etc.

• PATTERN
• A description of a form that the lexemes of a token may take.
• For keywords, the pattern is just a sequence of characters that form keywords.
• Example: The rules which characterize the set of strings for a token – integers [0-9]+

• LEXEME
• Actual sequence of characters that matches pattern and is classified by a token. Examples include:
• Identifiers: x, count, name, etc…

• Integers: 345, 20, -12, etc.
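To make the pattern/lexeme/token distinction concrete, here is a minimal sketch (not taken from the slides) that classifies a lexeme string as an identifier or integer token using the patterns above. The token names TOK_ID and TOK_INT and the function classify_lexeme are illustrative assumptions.

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Token names are assumptions for illustration. */
    enum token_name { TOK_ID, TOK_INT, TOK_ERROR };

    /* Classify a lexeme: [0-9]+ => integer, [a-zA-Z_][a-zA-Z0-9_]* => identifier. */
    enum token_name classify_lexeme(const char *lexeme)
    {
        size_t n = strlen(lexeme);
        if (n == 0) return TOK_ERROR;

        if (isdigit((unsigned char)lexeme[0])) {                      /* integer pattern */
            for (size_t i = 0; i < n; i++)
                if (!isdigit((unsigned char)lexeme[i])) return TOK_ERROR;
            return TOK_INT;
        }
        if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_') {  /* identifier pattern */
            for (size_t i = 1; i < n; i++)
                if (!isalnum((unsigned char)lexeme[i]) && lexeme[i] != '_') return TOK_ERROR;
            return TOK_ID;
        }
        return TOK_ERROR;
    }

    int main(void)
    {
        printf("%d %d\n", classify_lexeme("count"), classify_lexeme("345"));  /* prints 0 1 */
        return 0;
    }

Here the pattern is the rule encoded in the loops, the lexeme is the actual character string passed in, and the token is the classification that comes back.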


Attributes for Tokens
Tokens influence parsing decisions;
the attributes influence the translation of tokens.

Example: E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
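One way to realize these <token-name, attribute> pairs in code is a small tagged struct. This is a hedged sketch, not the slides' implementation; the type and field names are assumptions.

    /* A token is a pair: a token name plus an optional attribute value. */
    enum token_name { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUM };

    struct symbol;                 /* symbol-table entry, defined elsewhere */

    struct token {
        enum token_name name;
        union {
            struct symbol *sym;    /* for ID: pointer to symbol-table entry */
            int ival;              /* for NUM: the integer value, e.g. 2    */
        } attr;                    /* operators like ASSIGN_OP carry no attribute */
    };
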
Finite Automata
We shall now discover how Lex turns its input program into a lexical
analyzer. At the heart of the transition is the formalism known as finite
automata. These are essentially graphs, like transition diagrams, with
a few differences:

1. Finite automata are recognizers; they simply say "yes" or "no" about each
possible input string.
Finite Automata
2. Finite automata come in two flavors:
(a) Nondeterministic finite automata (NFA) have no restrictions on the
labels of their edges. A symbol can label several edges out of the
same state, and ε, the empty string, is a possible label.

(b) Deterministic finite automata (DFA) have, for each state, and for
each symbol of its input alphabet exactly one edge with that symbol
leaving that state.

Both deterministic and nondeterministic finite automata are capable of
recognizing the same languages. In fact these languages are exactly the
languages, called the regular languages, that regular expressions can describe.
Non Deterministic Finite Automata
A nondeterministic finite automaton (NFA) consists of:

1. A finite set of states S.
2. A set of input symbols Σ, the input alphabet. We assume that ε, which
stands for the empty string, is never a member of Σ.
3. A transition function that gives, for each state, and for each symbol in
Σ ∪ {ε}, a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state).
5. A set of states F, a subset of S, that is distinguished as the accepting
states (or final states).
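As a concrete illustration (a minimal sketch, not from the slides), an NFA for (a|b)*ab, which also appears in the examples near the end of this deck, can be encoded as a transition table over state sets and simulated by tracking every state the automaton could be in. The state numbering and helper names here are assumptions.

    #include <stdio.h>

    /* NFA for (a|b)*ab, encoded as bit masks over states {0,1,2}.
       State numbering is an assumption; the accepting state is 2.  */
    #define NSTATES 3
    static const unsigned delta[NSTATES][2] = {
        /* state 0: */ { 0x3 /* on a -> {0,1} */, 0x1 /* on b -> {0} */ },
        /* state 1: */ { 0x0,                      0x4 /* on b -> {2} */ },
        /* state 2: */ { 0x0,                      0x0 }
    };

    /* Simulate the NFA: keep the set of states it could currently be in. */
    static int nfa_accepts(const char *input)
    {
        unsigned current = 0x1;                 /* start in state 0           */
        for (const char *p = input; *p; p++) {
            int sym = (*p == 'a') ? 0 : 1;      /* assume input is over {a,b} */
            unsigned next = 0;
            for (int s = 0; s < NSTATES; s++)
                if (current & (1u << s))
                    next |= delta[s][sym];
            current = next;
        }
        return (current & 0x4) != 0;            /* accept if state 2 is reachable */
    }

    int main(void)
    {
        printf("%d %d\n", nfa_accepts("abab"), nfa_accepts("abba"));  /* prints 1 0 */
        return 0;
    }

The nondeterminism shows up in state 0, where the symbol a labels two outgoing edges, so the simulation keeps a set of states rather than a single one.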
Syntactic Analysis
- Natural language analogy: consider the sentence

He (noun)   Wrote (verb)   The (article)   Program (noun)
subject     predicate      object: The Program

Together these form a Sentence.
Syntactic Analysis
int* foo(int i, int j))        // extra parenthesis
{
    for(k=0; i j; )            // missing expression
        fi( i > j )            // "fi" is not a keyword
            return j;
}
Syntactic Analysis
int* foo(int i, int j)         // return type mismatch
{
    for(k=0; i < j; j++ )      // k: undeclared var
        if( i < j-2 )
            sum = sum+i;       // sum: undeclared var
    return sum;                // return type mismatch
}
Context-Free Grammars
The mathematical model of a grammar is:
G = (S,N,T,P)

• S is the start-symbol
• N is a set of non-terminal symbols
• T is a set of terminal symbols
• P is a set of productions, P: N → (N ∪ T)*

A grammar is called RECURSIVE if it permits derivations of the form

A ⇒* w1 A w2

More specifically, it is called LEFT RECURSIVE if

A ⇒* A w2

and RIGHT RECURSIVE if

A ⇒* w1 A
Context-Free Grammars
Context-free syntax is specified with a grammar, usually in Backus-Naur form (BNF).
In BNF, productions have the form

Left side → definition

where the left side is a single non-terminal (left side ∈ N).

Example productions for a grammar:

S → a S a
S → b S b
S → c
Context-Free Grammars
1. <goal>:= <expr>
2. <expr> := <expr> <op> <term>
3. | <term>
4. <term> := number
5. | id
6. <op> := +
7. | -

Deriving Valid Sentences
Given a grammar, valid sentences can be derived by repeated substitution.
To recognize a valid sentence in some CFG, we reverse this process and build up a parse.

Production   Result
-            <goal>
1            <expr>
2            <expr> <op> <term>
5            <expr> <op> y
7            <expr> - y
2            <expr> <op> <term> - y
4            <expr> <op> 2 - y
6            <expr> + 2 - y
3            <term> + 2 - y
5            x + 2 - y
Syntax Trees/Parse Trees/Generation Trees

A parse can be represented by a tree, called a parse tree or syntax tree.

Obviously, this contains a lot of unnecessary information.
Abstract Syntax Trees
So, compilers often use an abstract syntax tree (AST).

ASTs are often used as an IR.

The Frontend
Parser:
- Checks the stream of words and their parts of speech (produced by the scanner) for grammatical correctness
- Determines if the input is syntactically well formed
- Guides checking at deeper levels than syntax
- Builds an IR representation of the code

Think of this as the mathematics of diagramming sentences.
Parsing

- Checks the stream of words and their parts of speech for grammatical correctness
- Determines if the input is syntactically well formed
- Guides context-sensitive ("semantic") analysis (type checking)
- Builds IR for the source program
Parsing: The Big Picture

The goal is a flexible parser generator system.
Parsing Techniques
Top-down parser:
- starts at the root of derivation tree and fills in
- picks a production and tries to match the input
- may require backtracking
- some grammars are backtrack-free (predictive)

Bottom-up parser:
- starts at the leaves and fills in
- starts in a state valid for legal first tokens
- as input is consumed, changes state to encode possibilities (recognize valid prefixes)
- uses a stack to store both state and sentential forms

Derivations
• At each step, we choose a non-terminal to replace
• Different choices can lead to different derivations

Two derivations are of interest:
• Leftmost derivation – replace the leftmost NT at each step
• Rightmost derivation – replace the rightmost NT at each step

These are the two systematic derivations.
(We don't care about randomly-ordered derivations!)

The example on the preceding slide was a leftmost derivation.
• Of course, there is also a rightmost derivation
• Interestingly, it turns out to be different
The Two Derivations for x – 2 * y

(leftmost derivation and rightmost derivation shown side by side)

In both cases, Expr ⇒* id – num * id
• The two derivations produce different parse trees because the grammar itself is ambiguous
• The parse trees imply different evaluation orders!
Derivations and Parse Trees
Leftmost derivation

This evaluates as x – ( 2 * y )

Derivations and Parse Trees
Rightmost derivation

This evaluates as ( x – 2 ) * y

Ambiguity
This sentential form has two derivations:

if Expr1 then if Expr2 then Stmt1 else Stmt2

• one using production 2, then production 1
• one using production 1, then production 2
Ambiguity
Removing the ambiguity
• Must rewrite the grammar to avoid generating the problem
• Match each else to the innermost unmatched if (the common-sense rule)

Intuition: a NoElse always has no else on its last cascaded else if statement.

With this grammar, the example has only one derivation; a reconstruction of the rewritten grammar is sketched below.
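One standard way to write the unambiguous grammar is the usual matched/unmatched rewrite. This is a reconstruction (the slide's grammar figure is not reproduced in the text); the non-terminal names follow the NoElse hint above:

Stmt     → WithElse
         | NoElse
WithElse → if Expr then WithElse else WithElse
         | OtherStmt
NoElse   → if Expr then Stmt
         | if Expr then WithElse else NoElse

Every else is forced to pair with a fully matched (WithElse) statement, so it binds to the innermost unmatched if.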
Ambiguity
if Expr1 then if Expr2 then Stmt1 else Stmt2

This binds the else controlling Stmt2 to the inner if.
Role of a Parser
- Not all sequences of tokens are programs.
- The parser must distinguish between valid and invalid sequences of tokens.

We need:
- An expressive way to describe the syntax.
- An acceptor mechanism that determines if the input token stream satisfies the syntax.
Study of Parsing
- Parsing is the process of discovering a derivation for some sentence.
- Mathematical model of syntax: a grammar G.
- Algorithm for testing membership in L(G).

Two Approaches
• Top-down parsers (LL(1), recursive descent)
  • Start at the root of the parse tree and grow toward the leaves
  • Pick a production & try to match the input
  • Bad "pick" ⇒ may need to backtrack

• Bottom-up parsers (LR(1), operator precedence)
  • Start at the leaves and grow toward the root
  • As input is consumed, encode possible parse trees in an internal state
  • Bottom-up parsers handle a large class of grammars
Grammars and Parsers
• LL(1) parsers
  • Left-to-right input
  • Leftmost derivation
  • 1 symbol of look-ahead
  Grammars that these can handle are called LL(1) grammars

• LR(1) parsers
  • Left-to-right input
  • Rightmost derivation
  • 1 symbol of look-ahead
  Grammars that these can handle are called LR(1) grammars

• Also: LL(k), LR(k), SLR, LALR, …
Parsing Restriction
• In a compiler's parser, we don't have long-distance vision. We are usually
limited to just one symbol of lookahead. The lookahead symbol is the next
symbol coming up in the input.
• Another restriction on parsing would be to implement backtracking.
Recursive Descent
• The first technique for implementing a predictive parser is called recursive descent.
• A recursive-descent parser consists of several small functions, one for each nonterminal in the grammar.
• As we parse a sentence, we call the functions that correspond to the left-side nonterminal of the
productions we are applying. If these productions are recursive, we end up calling the functions recursively.
Example
• Expression grammar
# Production rule
1 expr → expr + term
2 | expr - term
3 | term
4 term → term * factor
5 | term / factor
6 | factor
7 factor → number
8 | identifier

• Input string: x – 2 * y
Example (↑ marks the current position in the input stream)

Rule   Sentential form     Input string
-      expr                ↑ x - 2 * y
1      expr + term         ↑ x - 2 * y
3      term + term         ↑ x - 2 * y
6      factor + term       ↑ x - 2 * y
8      <id> + term         x ↑ - 2 * y
-      <id,x> + term       x ↑ - 2 * y

(Partial parse tree so far: expr ⇒ expr + term, with the left expr expanded through term and factor to x.)

• Problem:
  • Can't match the next terminal
  • We guessed wrong at step 2
Backtracking

Rule   Sentential form     Input string
-      expr                ↑ x - 2 * y
1      expr + term         ↑ x - 2 * y
3      term + term         ↑ x - 2 * y
6      factor + term       ↑ x - 2 * y
8      <id> + term         x ↑ - 2 * y
?      <id,x> + term       x ↑ - 2 * y

(Undo all of these productions.)

• Roll back the productions
• Choose a different production for expr
• Continue
Retrying

Rule   Sentential form     Input string
-      expr                ↑ x - 2 * y
2      expr - term         ↑ x - 2 * y
3      term - term         ↑ x - 2 * y
6      factor - term       ↑ x - 2 * y
8      <id> - term         x ↑ - 2 * y
-      <id,x> - term       x - ↑ 2 * y
6      <id,x> - factor     x - ↑ 2 * y
7      <id,x> - <num>      x - 2 ↑ * y

(Partial parse tree: expr ⇒ expr - term, with the left expr deriving x and the right term expanded through factor to 2.)

• Problem:
  • More input to read
  • Another cause of backtracking
Successful Parse

Rule   Sentential form              Input string
-      expr                         ↑ x - 2 * y
2      expr - term                  ↑ x - 2 * y
3      term - term                  ↑ x - 2 * y
6      factor - term                ↑ x - 2 * y
8      <id> - term                  x ↑ - 2 * y
-      <id,x> - term                x - ↑ 2 * y
4      <id,x> - term * factor       x - ↑ 2 * y
6      <id,x> - factor * factor     x - ↑ 2 * y
7      <id,x> - <num> * factor      x - 2 ↑ * y
-      <id,x> - <num,2> * factor    x - 2 * ↑ y
8      <id,x> - <num,2> * <id>      x - 2 * y ↑

(Parse tree: expr ⇒ expr - term; the left expr derives x, and the right term derives term * factor, yielding 2 and y.)

• All terminals match – we're finished
Other Possible Parses

Rule   Sentential form                      Input string
-      expr                                 ↑ x - 2 * y
1      expr + term                          ↑ x - 2 * y
1      expr + term + term                   ↑ x - 2 * y
1      expr + term + term + term            ↑ x - 2 * y
1      expr + term + term + term + term     ↑ x - 2 * y

• Problem: termination
  • A wrong choice leads to infinite expansion
    (more importantly: without consuming any input!)
  • May not be as obvious as this
  • Our grammar is left recursive
Left Recursion
• Formally,
  A grammar is left recursive if ∃ a non-terminal A such that A →* A α (for some string of symbols α)

  What does →* mean?
  A → B x
  B → A y
  so A →* A y x

• Bad news:
  Top-down parsers cannot handle left recursion

• Good news:
  We can systematically eliminate left recursion
Removing Left Recursion
• Two cases of left recursion:

  #  Production rule             #  Production rule
  1  expr → expr + term          4  term → term * factor
  2  |      expr - term          5  |      term / factor
  3  |      term                 6  |      factor

• Transform as follows:

  #  Production rule             #  Production rule
  1  expr  → term expr2          4  term  → factor term2
  2  expr2 → + term expr2        5  term2 → * factor term2
  3  |       - term expr2        6  |       / factor term2
  4  |       ε                      |       ε
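In general (the standard transformation, stated here for completeness rather than taken from the slide): a left-recursive pair A → A α | β is rewritten as A → β A2 and A2 → α A2 | ε, which derives the same strings but recurses on the right.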
Right-Recursive Grammar
  #   Production rule
  1   expr  → term expr2
  2   expr2 → + term expr2
  3   |       - term expr2
  4   |       ε
  5   term  → factor term2
  6   term2 → * factor term2
  7   |       / factor term2
  8   |       ε
  9   factor → number
  10  |        identifier

• Two productions (expr and term) offer no choice at all
• All other productions are uniquely identified by a terminal symbol at the start of the RHS
• We can choose the right production by looking at the next input symbol
  • This is called lookahead
  • BUT, this can be tricky…
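To connect this grammar back to recursive descent, here is a minimal sketch of a predictive, recursive-descent parser for it. It is an illustration under assumptions, not the slides' code: tokens are single characters, identifiers are single letters, and numbers are single digits, so the lexer is reduced to reading one character of lookahead.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *input;          /* remaining input         */
    static char lookahead;             /* one symbol of lookahead */

    static void next(void)  { lookahead = *input ? *input++ : '\0'; }
    static void error(void) { printf("syntax error\n"); exit(1); }

    static void expr(void);
    static void expr2(void);
    static void term(void);
    static void term2(void);
    static void factor(void);

    /* 1: expr -> term expr2 */
    static void expr(void)  { term(); expr2(); }

    /* 2-4: expr2 -> + term expr2 | - term expr2 | (empty) */
    static void expr2(void)
    {
        if (lookahead == '+' || lookahead == '-') { next(); term(); expr2(); }
        /* otherwise: the empty production, consume nothing */
    }

    /* 5: term -> factor term2 */
    static void term(void)  { factor(); term2(); }

    /* 6-8: term2 -> * factor term2 | / factor term2 | (empty) */
    static void term2(void)
    {
        if (lookahead == '*' || lookahead == '/') { next(); factor(); term2(); }
    }

    /* 9-10: factor -> number | identifier (single characters here, an assumption) */
    static void factor(void)
    {
        if (isdigit((unsigned char)lookahead) || isalpha((unsigned char)lookahead)) next();
        else error();
    }

    int main(void)
    {
        input = "x-2*y";               /* the running example, without spaces */
        next();
        expr();
        if (lookahead == '\0') printf("parsed OK\n");
        else error();
        return 0;
    }

Each function corresponds to one nonterminal, and the single lookahead symbol is enough to pick the production, which is exactly the LL(1) property discussed above; no backtracking is needed.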
Examples NFA

NFA recognizing the language of regular expression (a|b)*ab

Examples NFA

NFA recognizing the language of regular expression aa* | bb*

Simulating a DFA

DFA accepting (a|b)*abb
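As an illustration of table-driven DFA simulation (a minimal sketch, not the slides' figure), the transition table below is the textbook DFA for (a|b)*abb; the state numbering is an assumption.

    #include <stdio.h>

    /* DFA accepting (a|b)*abb: states 0..3, start state 0, accepting state 3. */
    static const int move[4][2] = {
        /* state 0: */ { 1, 0 },   /* on a -> 1, on b -> 0 */
        /* state 1: */ { 1, 2 },
        /* state 2: */ { 1, 3 },
        /* state 3: */ { 1, 0 }
    };

    /* Simulate the DFA on an input string over {a,b}. */
    static int dfa_accepts(const char *s)
    {
        int state = 0;                        /* start state */
        for (; *s; s++)
            state = move[state][*s == 'b'];   /* column 0 = a, column 1 = b */
        return state == 3;                    /* accept iff we end in state 3 */
    }

    int main(void)
    {
        printf("%d %d\n", dfa_accepts("ababb"), dfa_accepts("abab"));  /* prints 1 0 */
        return 0;
    }

Unlike the NFA simulation shown earlier, a DFA needs only a single current state, which is why DFAs are the preferred target for generated lexical analyzers.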
