Lexical and syntax analysis
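[Figure: the source program feeds the lexical analyzer; on each getNextToken call from the parser, the lexical analyzer returns a token; the parser's output goes on to semantic analysis; both the lexical analyzer and the parser use the symbol table.]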
Why separate lexical analysis
and parsing?
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
A token is a pair: a token name and an
optional attribute value
A pattern is a description of the form that the
lexemes of a token may take
A lexeme is a sequence of characters in the
source program that matches the pattern for
a token
Example
Token        Informal description              Sample lexemes
if           Characters i, f                   if
else         Characters e, l, s, e             else
comparison   < or > or <= or >= or == or !=    <=, !=
Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
One or more instances: (r)+
Zero or one instance: r?
Character classes: [abc]
Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ (letter_ | digit)*
Recognition of tokens
The starting point is the language grammar, which
tells us which tokens are needed:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ɛ
expr -> term relop term
      | term
term -> id
      | number
Recognition of tokens (cont.)
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
We also need to handle whitespace:
ws -> (blank | tab | newline)+
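The patterns above can be tried out with an ordinary regular-expression library. The following is a minimal sketch in Python, not part of the original slides: the function name tokenize and the upper-case token names are assumptions chosen for illustration. Python's alternation takes the first alternative that matches rather than the longest one, so keywords are listed before id and the two-character operators before the one-character ones.

```python
import re

# Token patterns, tried in order. Keywords come before ID so that
# "if", "then" and "else" are not reported as identifiers; \b keeps
# a word such as "iffy" from being split into "if" + "fy".
TOKEN_SPEC = [
    ("WS",     r"[ \t\n]+"),
    ("NUMBER", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("IF",     r"if\b"),
    ("THEN",   r"then\b"),
    ("ELSE",   r"else\b"),
    ("ID",     r"[A-Za-z_][A-Za-z_0-9]*"),
    ("RELOP",  r"<=|>=|<>|<|>|="),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token name, lexeme) pairs; whitespace is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":          # ws: no action and no return
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, tokenize("if x1 <= 3.5E+2 then y else z") yields the pairs for if, x1, <=, 3.5E+2, then, y, else and z, in that order.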
Transition diagrams
[Figure: transition diagram for relop]
[Figure: transition diagram for reserved words and identifiers]
[Figure: transition diagram for unsigned numbers]
[Figure: transition diagram for whitespace]
Architecture of a transition-diagram-based lexical analyzer
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0: c = nextchar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail(); /* lexeme is not a relop */
                break;
        case 1: …
        …
        case 8: retract(); /* the last character read is not part of the lexeme */
                retToken.attribute = GT;
                return(retToken);
        }
    }
}
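The same idea can be written out in full as a small runnable sketch. The Python version below is an illustration, not the slide's code: the function name get_relop and its return convention are assumptions, and since the relop diagram itself is not reproduced here, the behaviour of the states between 1 and 7 follows the usual relop diagram (state 1 after "<", state 5 for "=", state 6 after ">"), which is also an assumption. Retracting is modelled by simply not advancing the position.

```python
def get_relop(text, pos=0):
    """Recognize a relop starting at text[pos].

    Returns (attribute, position after the lexeme), or None when the
    input at pos is not a relop (the fail() case of the slide).
    """
    state = 0
    while True:
        c = text[pos] if pos < len(text) else ""   # "" stands for end of input
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                state = 5
            elif c == ">":
                state = 6
            else:
                return None                # lexeme is not a relop
            pos += 1                       # consume the first character
        elif state == 1:                   # seen "<"
            if c == "=":
                return ("LE", pos + 1)
            if c == ">":
                return ("NE", pos + 1)
            return ("LT", pos)             # retract: c is not consumed
        elif state == 5:                   # seen "="
            return ("EQ", pos)
        elif state == 6:                   # seen ">"
            if c == "=":
                return ("GE", pos + 1)
            return ("GT", pos)             # retract, as in case 8 of the slide
```

Calling get_relop("<=") gives ("LE", 2), while get_relop("<a") gives ("LT", 1) because the character a is left for the next token.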
Lexical Analyzer Generator - Lex
Lex source program lex.l -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
Input stream -> a.out -> Sequence of tokens
Structure of Lex programs
declarations
%%
translation rules     (each of the form: Pattern {Action})
%%
auxiliary functions
Example
%{
    /* definitions of manifest constants
    LT, LE, EQ, NE, GT, GE,
    IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}       {/* no action and no return */}
if         {return(IF);}
then       {return(THEN);}
else       {return(ELSE);}
{id}       {yylval = (int) installID(); return(ID);}
{number}   {yylval = (int) installNum(); return(NUMBER);}
…

%%

int installID() {/* function to install the lexeme, whose first
                    character is pointed to by yytext, and whose
                    length is yyleng, into the symbol table and
                    return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
                     constants into a separate table */
}
The role of the parser
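[Figure: source program -> lexical analyzer -> parser -> rest of front end -> intermediate representation; the lexical analyzer hands a token to the parser on each getNextToken call, the parser hands a parse tree to the rest of the front end, and all three use the symbol table.]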
Uses of grammars
E -> E + T | T
T -> T * F | F
F -> (E) | id
E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id
Context-free grammars
A context-free grammar consists of terminals, nonterminals, a start symbol and productions
expression -> expression + term
expression -> expression - term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> (expression)
factor -> id
Derivations
Productions are treated as rewriting rules to
generate a string
Rightmost and leftmost derivations
E -> E + E | E * E | -E | (E) | id
Derivation for -(id+id):
E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
Parse trees
-(id+id)
E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
Ambiguity
For some strings there exists more than one
parse tree
Or more than one leftmost derivation
Or more than one rightmost derivation
Example: id+id*id
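To make the ambiguity concrete, one can count the parse trees a string has under the grammar E -> E + E | E * E | -E | (E) | id from the derivations slide. The sketch below is only an illustration written for that one grammar; the function name count_parse_trees and the token-list input format are assumptions.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E+E | E*E | -E | (E) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct parse trees deriving tokens[i:j] from E.
        if j <= i:
            return 0
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0          # E -> id
        total = 0
        if tokens[i] == "-":                              # E -> -E
            total += count(i + 1, j)
        if tokens[i] == "(" and tokens[j - 1] == ")":     # E -> (E)
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                     # E -> E+E | E*E
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))
```

For id+id*id the count is 2: one tree groups the string as id+(id*id), the other as (id+id)*id. For -(id+id) the count is 1, so that string is not a witness of ambiguity.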
Elimination of ambiguity
Elimination of ambiguity (cont.)
Idea:
A statement appearing between a then and an
else must be matched
Elimination of left recursion
A grammar is left recursive if it has a nonterminal
A such that there is a derivation A =>+ Aα for some string α
Top-down parsing methods cannot handle
left-recursive grammars
A simple rule for direct left recursion elimination:
For a rule like:
A -> A α|β
We may replace it with
A -> β A’
A’ -> α A’ | ɛ
Left recursion elimination (cont.)
There are cases of indirect left recursion, like the following:
S -> Aa | b
A -> Ac | Sd | ɛ
Left recursion elimination algorithm:
Arrange the nonterminals in some order A1, A2, …, An.
for (each i from 1 to n) {
    for (each j from 1 to i-1) {
        replace each production of the form Ai -> Aj γ by the
        productions Ai -> δ1 γ | δ2 γ | … | δk γ, where
        Aj -> δ1 | δ2 | … | δk are all current Aj-productions
    }
    eliminate the immediate left recursion among the Ai-productions
}
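The algorithm can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: a grammar is represented as a dict from nonterminal to a list of bodies, each body is a tuple of symbols, the empty tuple stands for ɛ, and the new nonterminal is named by appending a quote to the old name. The function names are hypothetical. As in the general method, the result is only guaranteed for grammars without cycles.

```python
def eliminate_immediate(a, bodies):
    """Replace A -> A α | β by A -> β A', A' -> α A' | ɛ."""
    alphas = [b[1:] for b in bodies if b[:1] == (a,)]
    betas = [b for b in bodies if b[:1] != (a,)]
    if not alphas:                       # A is not immediately left recursive
        return {a: bodies}
    a2 = a + "'"
    return {
        a: [beta + (a2,) for beta in betas],
        a2: [alpha + (a2,) for alpha in alphas] + [()],
    }


def eliminate_left_recursion(grammar, order):
    """Apply the elimination algorithm with the nonterminals in the given order."""
    g = {a: list(bodies) for a, bodies in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            new_bodies = []
            for body in g[ai]:
                if body[:1] == (aj,):    # Ai -> Aj γ: substitute the Aj-productions
                    new_bodies += [delta + body[1:] for delta in g[aj]]
                else:
                    new_bodies.append(body)
            g[ai] = new_bodies
        g.update(eliminate_immediate(ai, g[ai]))
    return g


# The grammar S -> Aa | b, A -> Ac | Sd | ɛ from the slide.
example = {"S": [("A", "a"), ("b",)], "A": [("A", "c"), ("S", "d"), ()]}
result = eliminate_left_recursion(example, ["S", "A"])
```

With the order S, A the sketch produces S -> Aa | b, A -> bdA' | A', and A' -> cA' | adA' | ɛ.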
Left factoring
Left factoring is a grammar transformation that
is useful for producing a grammar suitable for
predictive or top-down parsing.
Consider the following grammar:
stmt -> if expr then stmt else stmt
      | if expr then stmt
On seeing the input token if, it is not clear to the parser
which production to use
We can easily perform left factoring:
If we have A->αβ1 | αβ2 then we replace it with
A -> αA’
A’ -> β1 | β2
Left factoring (cont.)
Algorithm
For each nonterminal A, find the longest prefix
α common to two or more of its alternatives. If
α ≠ ɛ, then replace all of the A-productions
A -> αβ1 | αβ2 | … | αβn | γ, where γ stands for the
alternatives that do not begin with α, by
A -> αA’ | γ
A’ -> β1 | β2 | … | βn
Example:
S -> i E t S | i E t S e S | a
E -> b
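One left-factoring step for a single nonterminal can be sketched in Python as below. The representation (bodies as tuples of symbols, the empty tuple for ɛ) and the function names are assumptions made for this illustration; a full implementation would repeat the step until no two alternatives share a prefix.

```python
def common_prefix(x, y):
    """Longest common prefix of two tuples of grammar symbols."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return x[:n]


def left_factor_once(a, bodies, new_name):
    """One left-factoring step for nonterminal a; returns the new productions."""
    best = ()
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            p = common_prefix(bodies[i], bodies[j])
            if len(p) > len(best):
                best = p                 # longest prefix shared by two alternatives
    if not best:                         # α = ɛ: nothing to factor
        return {a: bodies}
    with_prefix = [b for b in bodies if b[:len(best)] == best]
    others = [b for b in bodies if b[:len(best)] != best]
    return {
        a: [best + (new_name,)] + others,                   # A -> αA' | γ
        new_name: [b[len(best):] for b in with_prefix],     # A' -> β1 | … | βn
    }


# S -> i E t S | i E t S e S | a from the example above.
s_bodies = [("i", "E", "t", "S"), ("i", "E", "t", "S", "e", "S"), ("a",)]
factored = left_factor_once("S", s_bodies, "S'")
```

For the example this gives S -> iEtSS' | a and S' -> ɛ | eS; the production E -> b has a single alternative and is left unchanged.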
Introduction
A top-down parser tries to create a parse tree from
the root towards the leaves, scanning the input from left to
right
It can also be viewed as finding a leftmost derivation
for an input string
Example: id+id*id
E -> TE’
E’ -> +TE’ | ɛ
T -> FT’
T’ -> *FT’ | ɛ
F -> (E) | id
[Figure: sequence of parse trees built top-down for id+id*id, one per step of the leftmost derivation (each step marked lm)]
First and Follow
First(α) is the set of terminals that begin strings derived
from α
If α =>* ɛ then ɛ is also in First(α)
In predictive parsing when we have A -> α | β, if First(α)
and First(β) are disjoint sets then we can select the
appropriate A-production by looking at the next input symbol
Follow(A), for any nonterminal A, is the set of terminals a
that can appear immediately after A in some
sentential form
If we have S =>* αAaβ for some α and β then a is in
Follow(A)
If A can be the rightmost symbol in some sentential
form, then $ is in Follow(A)
Computing First
To compute First(X) for all grammar symbols X,
apply the following rules until no more terminals or ɛ
can be added to any First set:
1. If X is a terminal then First(X) = {X}.
2. If X is a nonterminal and X -> Y1Y2…Yk is a
production for some k>=1, then place a in First(X)
if for some i a is in First(Yi) and ɛ is in all of
First(Y1),…,First(Yi-1); that is, Y1…Yi-1 =>* ɛ. If ɛ
is in First(Yj) for all j=1,…,k then add ɛ to First(X).
3. If X -> ɛ is a production then add ɛ to First(X)
Example!
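The three rules can be run as a fixed-point iteration. The sketch below is one possible Python rendering for the expression grammar used in these slides; the grammar encoding (lists of symbols, the empty list for the ɛ-production) and the names GRAMMAR and compute_first are assumptions of this illustration.

```python
EPS = "ɛ"

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],        # [] is the ɛ-production
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}


def compute_first(grammar):
    """Return a dict mapping each nonterminal to its First set."""
    first = {a: set() for a in grammar}

    def first_of(symbol):
        # Rule 1: the First set of a terminal is the terminal itself.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:                       # repeat until nothing can be added
        changed = False
        for a, bodies in grammar.items():
            for body in bodies:
                add, all_nullable = set(), True
                for y in body:           # rule 2: scan Y1 Y2 … Yk
                    fy = first_of(y)
                    add |= fy - {EPS}
                    if EPS not in fy:
                        all_nullable = False
                        break
                if all_nullable:         # every Yj nullable, or rule 3 (empty body)
                    add.add(EPS)
                if not add <= first[a]:
                    first[a] |= add
                    changed = True
    return first


FIRST = compute_first(GRAMMAR)
```

For this grammar the result is First(E) = First(T) = First(F) = {(, id}, First(E') = {+, ɛ} and First(T') = {*, ɛ}.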
Computing follow
To compute Follow(A) for all nonterminals A,
apply the following rules until nothing can be
added to any Follow set:
1. Place $ in Follow(S) where S is the start
symbol
2. If there is a production A -> αBβ then
everything in First(β) except ɛ is in Follow(B).
3. If there is a production A -> αB, or a production
A -> αBβ where First(β) contains ɛ,
then everything in Follow(A) is in Follow(B)
Example!
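The Follow rules can be iterated in the same way. In the Python sketch below the First sets of the expression grammar are written in as given data, so the block stands on its own; the names GRAMMAR, FIRST, first_of_string and compute_follow are assumptions of this illustration.

```python
EPS, END = "ɛ", "$"

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],        # [] is the ɛ-production
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
# First sets of the nonterminals, taken as given.
FIRST = {
    "E": {"(", "id"}, "E'": {"+", EPS},
    "T": {"(", "id"}, "T'": {"*", EPS},
    "F": {"(", "id"},
}


def first_of_string(symbols):
    """First set of a string of grammar symbols; {ɛ} for the empty string."""
    result = set()
    for x in symbols:
        fx = FIRST.get(x, {x})           # a terminal's First set is itself
        result |= fx - {EPS}
        if EPS not in fx:
            return result
    result.add(EPS)
    return result


def compute_follow(grammar, start):
    follow = {a: set() for a in grammar}
    follow[start].add(END)               # rule 1
    changed = True
    while changed:
        changed = False
        for a, bodies in grammar.items():
            for body in bodies:
                for i, b in enumerate(body):
                    if b not in grammar:
                        continue         # Follow is defined for nonterminals only
                    fb = first_of_string(body[i + 1:])
                    add = fb - {EPS}     # rule 2
                    if EPS in fb:        # rule 3: β is empty or derives ɛ
                        add |= follow[a]
                    if not add <= follow[b]:
                        follow[b] |= add
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, "E")
```

The result is Follow(E) = Follow(E') = {), $}, Follow(T) = Follow(T') = {+, ), $} and Follow(F) = {+, *, ), $}.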
LL(1) Grammars
Predictive parsers are those recursive descent parsers
needing no backtracking
Grammars for which we can create predictive parsers are
called LL(1)
The first L means scanning input from left to right
The second L means leftmost derivation
And 1 stands for using one input symbol for lookahead
A grammar G is LL(1) if and only if whenever A -> α | β are
two distinct productions of G, the following conditions hold:
For no terminal a do α and β both derive strings beginning
with a
At most one of α and β can derive the empty string
If α =>* ɛ then β does not derive any string beginning with a
terminal in Follow(A); likewise with α and β exchanged
Construction of predictive parsing table
For each production A -> α in the grammar do the
following:
For each terminal a in First(α) add A -> α to
M[A,a]
If ɛ is in First(α), then for each terminal b in
Follow(A) add A -> α to M[A,b]. If ɛ is in
First(α) and $ is in Follow(A), add A -> α to
M[A,$] as well
If after performing the above, there is no
production in M[A,a] then set M[A,a] to error
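The construction can be sketched in Python for the expression grammar. The First and Follow sets are written in as given data, and the names PRODUCTIONS, build_table and first_of_string are assumptions of this illustration. The table is a dict keyed by (nonterminal, terminal); a missing key plays the role of an error entry, and a second production landing in the same cell means the grammar is not LL(1).

```python
EPS, END = "ɛ", "$"

PRODUCTIONS = [
    ("E",  ("T", "E'")),
    ("E'", ("+", "T", "E'")),
    ("E'", ()),                          # E' -> ɛ
    ("T",  ("F", "T'")),
    ("T'", ("*", "F", "T'")),
    ("T'", ()),                          # T' -> ɛ
    ("F",  ("(", "E", ")")),
    ("F",  ("id",)),
]
FIRST = {
    "E": {"(", "id"}, "E'": {"+", EPS},
    "T": {"(", "id"}, "T'": {"*", EPS},
    "F": {"(", "id"},
}
FOLLOW = {
    "E": {")", END}, "E'": {")", END},
    "T": {"+", ")", END}, "T'": {"+", ")", END},
    "F": {"+", "*", ")", END},
}


def first_of_string(symbols):
    """First set of a string of grammar symbols; {ɛ} for the empty string."""
    result = set()
    for x in symbols:
        fx = FIRST.get(x, {x})           # a terminal's First set is itself
        result |= fx - {EPS}
        if EPS not in fx:
            return result
    result.add(EPS)
    return result


def build_table():
    table = {}
    for head, body in PRODUCTIONS:
        fs = first_of_string(body)
        targets = fs - {EPS}             # terminals in First(α)
        if EPS in fs:
            targets |= FOLLOW[head]      # Follow(A), which includes $ where needed
        for a in targets:
            if (head, a) in table:
                raise ValueError(f"not LL(1): conflict at M[{head},{a}]")
            table[(head, a)] = body
    return table


M = build_table()
```

Here M[("E", "id")] is the body of E -> TE', M[("E'", "$")] is the empty body of E' -> ɛ, and ("E", "+") is absent, which corresponds to an error entry.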
Example
E -> TE’
E’ -> +TE’ | ɛ
T -> FT’
T’ -> *FT’ | ɛ
F -> (E) | id

Nonterminal   First     Follow
E             {(,id}    {), $}
E’            {+,ɛ}     {), $}
T             {(,id}    {+, ), $}
T’            {*,ɛ}     {+, ), $}
F             {(,id}    {+, *, ), $}
              Input Symbol
Nonterminal   id          +    *    (           )    $
E             E -> TE’              E -> TE’
              Input Symbol
Nonterminal   a         b         e           i              t    $
S             S -> a                          S -> iEtSS’
S’                                S’ -> ɛ                         S’ -> ɛ
                                  S’ -> eS
E                       E -> b
Error recovery in predictive parsing
Panic mode
Place all symbols in Follow(A) into synchronization set
for nonterminal A: skip tokens until an element of
Follow(A) is seen and pop A from stack.
Add to the synchronization set of lower level construct
the symbols that begin higher level constructs
Add symbols in First(A) to the synchronization set of
nonterminal A
If a nonterminal can generate the empty string then the
production deriving ɛ can be used as a default
If a terminal on top of the stack cannot be matched, pop
the terminal and issue a message saying that the terminal
was inserted
Example
              Input Symbol
Nonterminal   id          +    *    (           )       $
E             E -> TE’              E -> TE’    synch   synch
LR Parsing
The most prevalent type of bottom-up parser
LR(k): we are mostly interested in parsers with k<=1
Why LR parsers?
Table driven
Can be constructed to recognize virtually all programming
language constructs for which context-free grammars can be written
The most general non-backtracking shift-reduce parsing
method
Can detect a syntactic error as soon as it is possible to do
so on a left-to-right scan of the input
The class of grammars for which we can construct LR parsers
is a proper superset of those for which we can construct LL parsers
States of an LR parser
States represent sets of items
An LR(0) item of G is a production of G with
a dot at some position of the body:
For A->XYZ we have the following items
A->.XYZ
A->X.YZ
A->XY.Z
A->XYZ.
Example
E’ -> E
E -> E + T | T
T -> T * F | F
F -> (E) | id
[Figure: LR(0) automaton for this grammar; the item sets of its states are listed below, and I1 accepts on $]
I0 = closure({[E’->.E]}): E’->.E  E->.E+T  E->.T  T->.T*F  T->.F  F->.(E)  F->.id
I1: E’->E.  E->E.+T
I2: E->T.  T->T.*F
I3: T->F.
I4: F->(.E)  E->.E+T  E->.T  T->.T*F  T->.F  F->.(E)  F->.id
I5: F->id.
I6: E->E+.T  T->.T*F  T->.F  F->.(E)  F->.id
I7: T->T*.F  F->.(E)  F->.id
I8: E->E.+T  F->(E.)
I9: E->E+T.  T->T.*F
I10: T->T*F.
I11: F->(E).
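The initial item set of the automaton is I0 = closure({[E’->.E]}). A minimal Python sketch of the closure operation follows; it assumes an item is represented as a triple (head, body, dot position), writes the augmented start symbol as E', and uses the hypothetical names GRAMMAR, closure and show.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E":  [("E", "+", "T"), ("T",)],
    "T":  [("T", "*", "F"), ("F",)],
    "F":  [("(", "E", ")"), ("id",)],
}


def closure(items):
    """Add B -> .γ for every nonterminal B that appears right after a dot."""
    result = list(items)
    for head, body, dot in result:       # result grows while it is being scanned
        if dot < len(body) and body[dot] in GRAMMAR:
            for gamma in GRAMMAR[body[dot]]:
                item = (body[dot], gamma, 0)
                if item not in result:
                    result.append(item)
    return result


def show(item):
    """Render an item in the A->α.β notation of the slides."""
    head, body, dot = item
    return f"{head}->{''.join(body[:dot])}.{''.join(body[dot:])}"


I0 = [show(item) for item in closure([("E'", ("E",), 0)])]
```

I0 then contains the seven items E'->.E, E->.E+T, E->.T, T->.T*F, T->.F, F->.(E) and F->.id.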
Use of LR(0) automaton
Example: id*id
Line   Stack      Symbols   Input    Action
(1)    0          $         id*id$   Shift to 5
(2)    0 5        $id       *id$     Reduce by F->id
(3)    0 3        $F        *id$     Reduce by T->F
(4)    0 2        $T        *id$     Shift to 7
(5)    0 2 7      $T*       id$      Shift to 5
(6)    0 2 7 5    $T*id     $        Reduce by F->id
(7)    0 2 7 10   $T*F      $        Reduce by T->T*F
(8)    0 2        $T        $        Reduce by E->T
(9)    0 1        $E        $        Accept
LR-Parsing model
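[Figure: model of an LR parser: the input a1 … ai … an $, a stack holding states sm, sm-1, …, with $ at the bottom, the LR parsing program that produces the output, and the parsing table with its ACTION and GOTO parts]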
LR parsing algorithm
let a be the first symbol of w$;
while (1) { /* repeat forever */
    let s be the state on top of the stack;
    if (ACTION[s,a] = shift t) {
        push t onto the stack;
        let a be the next input symbol;
    } else if (ACTION[s,a] = reduce A->β) {
        pop |β| symbols off the stack;
        let state t now be on top of the stack;
        push GOTO[t,A] onto the stack;
        output the production A->β;
    } else if (ACTION[s,a] = accept) break; /* parsing is done */
    else call error-recovery routine;
}
Example
(0) E’ -> E
(1) E -> E + T
(2) E -> T
(3) T -> T * F
(4) T -> F
(5) F -> (E)
(6) F -> id

STATE   ACTION                                GOTO
        id    +     *     (     )     $       E    T    F
0       S5                S4                  1    2    3
1             S6                      Acc
2             R2    S7          R2    R2
3             R4    R4          R4    R4
4       S5                S4                  8    2    3
5             R6    R6          R6    R6
6       S5                S4                       9    3
7       S5                S4                            10
8             S6                S11
9             R1    S7          R1    R1
10            R3    R3          R3    R3
11            R5    R5          R5    R5

Input: id*id+id
Line   Stack      Symbols   Input       Action
(1)    0                    id*id+id$   Shift to 5
(2)    0 5        id        *id+id$     Reduce by F->id
(3)    0 3        F         *id+id$     Reduce by T->F
(4)    0 2        T         *id+id$     Shift to 7
(5)    0 2 7      T*        id+id$      Shift to 5
(6)    0 2 7 5    T*id      +id$        Reduce by F->id
(7)    0 2 7 10   T*F       +id$        Reduce by T->T*F
(8)    0 2        T         +id$        Reduce by E->T
(9)    0 1        E         +id$        Shift to 6
(10)   0 1 6      E+        id$         Shift to 5
(11)   0 1 6 5    E+id      $           Reduce by F->id
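The LR parsing algorithm can be run directly on this example. The Python sketch below hard-codes the ACTION and GOTO entries of the expression-grammar table as dictionaries; the string encoding of the actions ("s5", "r2", "acc"), the function name lr_parse, and raising SyntaxError in place of an error-recovery routine are assumptions of this illustration.

```python
# Production number -> (head, length of body), numbered as in the grammar above.
PRODUCTIONS = {
    1: ("E", 3),   # E -> E + T
    2: ("E", 1),   # E -> T
    3: ("T", 3),   # T -> T * F
    4: ("T", 1),   # T -> F
    5: ("F", 3),   # F -> (E)
    6: ("F", 1),   # F -> id
}
ACTION = {
    0:  {"id": "s5", "(": "s4"},
    1:  {"+": "s6", "$": "acc"},
    2:  {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3:  {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4:  {"id": "s5", "(": "s4"},
    5:  {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6:  {"id": "s5", "(": "s4"},
    7:  {"id": "s5", "(": "s4"},
    8:  {"+": "s6", ")": "s11"},
    9:  {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}
GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}


def lr_parse(tokens):
    """Parse a token list; return the production numbers of the reductions."""
    tokens = list(tokens) + ["$"]
    stack, pos, output = [0], 0, []
    while True:
        s, a = stack[-1], tokens[pos]
        act = ACTION[s].get(a)
        if act is None:                          # blank entry: error
            raise SyntaxError(f"unexpected {a!r} in state {s}")
        if act == "acc":
            return output                        # parsing is done
        if act[0] == "s":                        # shift t
            stack.append(int(act[1:]))
            pos += 1
        else:                                    # reduce A -> β
            n = int(act[1:])
            head, length = PRODUCTIONS[n]
            del stack[len(stack) - length:]      # pop |β| states
            stack.append(GOTO[stack[-1]][head])  # push GOTO[t, A]
            output.append(n)
```

For the input id*id+id the reductions come out as productions 6, 4, 6, 3, 2, 6, 4, 1, whose first six entries match the reductions shown in the trace above.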
Constructing SLR parsing table
Method
Construct C={I0,I1, … , In}, the collection of sets of LR(0)
items for G’
State i is constructed from Ii:
If [A->α.aβ] is in Ii and Goto(Ii,a)=Ij, then set ACTION[i,a] to
“shift j” (here a must be a terminal)
If [A->α.] is in Ii, then set ACTION[i,a] to “reduce A->α” for
all a in Follow(A) (here A may not be S’)
If [S’->S.] is in Ii, then set ACTION[i,$] to “accept”
If any conflicting actions result from these rules, we say that the grammar
is not SLR(1).
If GOTO(Ii,A) = Ij then GOTO[i,A]=j
All entries not defined by the above rules are made “error”
The initial state of the parser is the one constructed
from the set of items containing [S’->.S]
Example grammar which is not SLR(1)
S -> L=R | R
L -> *R | id
R -> L
I0: S’->.S  S->.L=R  S->.R  L->.*R  L->.id  R->.L
I1: S’->S.
I2: S->L.=R  R->L.
I3: S->R.
I4: L->*.R  R->.L  L->.*R  L->.id
I5: L->id.
I6: S->L=.R  R->.L  L->.*R  L->.id
I7: L->*R.
I8: R->L.
I9: S->L=R.
Conflict in state 2 on input =: ACTION[2,=] is both “Shift 6” and “Reduce R->L”
More powerful LR parsers
Canonical-LR or just LR method
Use lookahead symbols for items: LR(1) items
Results in a large collection of items
LALR: lookaheads are introduced in LR(0)
items