Lexical and syntax analysis

The role of lexical analyzer

    source program --> [Lexical Analyzer] --token--> [Parser] --> to semantic analysis
                                ^                       |
                                +----- getNextToken ----+

    Both the lexical analyzer and the parser consult the symbol table.
Why separate lexical analysis and parsing?
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
A token is a pair of a token name and an
optional token value
A pattern is a description of the form that the
lexemes of a token may take
A lexeme is a sequence of characters in the
source program that matches the pattern for
a token
Example
Token       Informal description                   Sample lexemes
if          characters i, f                        if
else        characters e, l, s, e                  else
comparison  < or > or <= or >= or == or !=         <=, !=
id          letter followed by letters and digits  pi, score, D2
number      any numeric constant                   3.14159, 0, 6.02e23
literal     anything but " surrounded by "         "core dumped"

printf("total = %d\n", score);
Attributes for tokens
E = M * C ** 2
<id, pointer to symbol table entry for E>
<assign-op>
<id, pointer to symbol table entry for M>
<mult-op>
<id, pointer to symbol table entry for C>
<exp-op>
<number, integer value 2>
Lexical errors
Some errors are beyond the power of the lexical
analyzer to recognize:
fi (a == f(x)) …
However it may be able to recognize errors
like:
d = 2r
Such errors are recognized when no pattern
for tokens matches a character sequence
Error recovery
Panic mode: successive characters are
ignored until we reach a well-formed token
Delete one character from the remaining
input
Insert a missing character into the remaining
input
Replace a character by another character
Transpose two adjacent characters
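The panic-mode skip step above can be sketched as a small routine; the function and predicate names here are illustrative assumptions, not part of any particular lexer.

```python
def panic_mode_skip(text, pos, can_start_token):
    """Skip characters from pos until one that may begin a token.

    `can_start_token` is a predicate supplied by the lexer; here it
    stands in for a check against the token patterns (an assumption
    for illustration).
    """
    while pos < len(text) and not can_start_token(text[pos]):
        pos += 1
    return pos

# Resynchronize after garbage, treating letters/digits as token starts.
print(panic_mode_skip("@#$x = 1", 0, str.isalnum))  # 3, the position of 'x'
```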
Input buffering
Sometimes lexical analyzer needs to look
ahead some symbols to decide about the
token to return
In C: we need to look ahead after -, = or <
to decide what token to return
In Fortran: DO 5 I = 1.25
We need to introduce a two buffer scheme to
handle large look-aheads safely
E = M * C * * 2 eof
Specification of tokens

In the theory of compilation, regular expressions
are used to formalize the specification of tokens
Regular expressions are a means for specifying
regular languages
Example:
letter_ (letter_ | digit)*
Each regular expression is a pattern
specifying the form of strings
Regular expressions
Ɛ is a regular expression, L(Ɛ) = {Ɛ}
If a is a symbol in ∑, then a is a regular
expression, L(a) = {a}
(r) | (s) is a regular expression denoting the
language L(r) ∪ L(s)
(r)(s) is a regular expression denoting the
language L(r)L(s)
(r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn

Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
One or more instances: (r)+
Zero or one instance: (r)?
Character classes: [abc]

Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ (letter_ | digit)*
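The regular definition of id can be checked directly; a minimal sketch using Python's re module, transcribing the pattern above (the pattern string is the only thing taken from the slide).

```python
import re

# letter_ (letter_ | digit)* : an identifier starts with a letter or
# underscore and continues with letters, underscores, or digits.
ID = re.compile(r'[A-Za-z_][A-Za-z_0-9]*\Z')

def is_identifier(s):
    return ID.match(s) is not None

print(is_identifier("score"))  # True
print(is_identifier("2pi"))    # False: starts with a digit
```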
Recognition of tokens
Starting point is the language grammar to
understand the tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt

expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.)
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
We also need to handle whitespace:
ws -> (blank | tab | newline)+
Transition diagrams
Transition diagram for relop
Transition diagrams (cont.)
Transition diagram for reserved words and
identifiers
Transition diagrams (cont.)
Transition diagram for unsigned numbers
Transition diagrams (cont.)
Transition diagram for whitespace
Architecture of a transition-
diagram-based lexical analyzer
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0: c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1: ...
        ...
        case 8: retract();
            retToken.attribute = GT;
            return (retToken);
        }
    }
}
Lexical Analyzer Generator - Lex

    lex.l (Lex source program) --> [Lex compiler] --> lex.yy.c
    lex.yy.c                   --> [C compiler]   --> a.out
    input stream               --> [a.out]        --> sequence of tokens
Structure of Lex programs

declarations
%%
translation rules    (of the form: pattern { action })
%%
auxiliary functions
Example

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}

%%

int installID() {/* function to install the lexeme, whose first
                    character is pointed to by yytext, and whose
                    length is yyleng, into the symbol table and
                    return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
                     constants into a separate table */
}

The role of parser

    source program --> [Lexical Analyzer] --token--> [Parser] --parse tree--> [Rest of Front End] --> intermediate representation
                                ^                       |
                                +----- getNextToken ----+

    Both the lexical analyzer and the parser consult the symbol table.
Uses of grammars
E -> E + T | T
T -> T * F | F
F -> (E) | id

E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id
Context free grammars

A context-free grammar consists of terminals, nonterminals, a
start symbol, and productions:

expression -> expression + term
expression -> expression - term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> (expression)
factor -> id
Derivations
Productions are treated as rewriting rules to
generate a string
Rightmost and leftmost derivations
E -> E + E | E * E | -E | (E) | id
Derivations for -(id+id)
 E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
Parse trees
-(id+id)
 E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
Ambiguity
For some strings there exist more than one
parse tree
Or more than one leftmost derivation
Or more than one rightmost derivation
Example: id+id*id
Elimination of ambiguity
Elimination of ambiguity (cont.)
Idea:
A statement appearing between a then and an
else must be matched
Elimination of left recursion
A grammar is left recursive if it has a nonterminal A such that
there is a derivation A =>+ Aα for some string α
Top-down parsing methods can't handle left-recursive grammars
 A simple rule for direct left recursion elimination:
 For a rule like:
 A -> A α|β
 We may replace it with
 A -> β A’
 A’ -> α A’ | ɛ
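The rule above for direct left recursion can be sketched as a small routine; the function name and the tuple encoding of production bodies are assumptions for illustration.

```python
def eliminate_direct_left_recursion(head, prods):
    """Replace A -> A a | b with A -> b A' and A' -> a A' | eps.

    prods is a list of tuples of symbols; the empty tuple () stands
    for an epsilon-production.
    """
    new = head + "'"
    # Split the alternatives into left-recursive and non-recursive ones.
    recursive = [p[1:] for p in prods if p and p[0] == head]
    rest = [p for p in prods if not p or p[0] != head]
    if not recursive:
        return {head: prods}
    return {head: [b + (new,) for b in rest],
            new: [a + (new,) for a in recursive] + [()]}

# E -> E + T | T  becomes  E -> T E',  E' -> + T E' | eps
print(eliminate_direct_left_recursion('E', [('E', '+', 'T'), ('T',)]))
```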
Left recursion elimination (cont.)
There are cases like following
 S -> Aa | b
 A -> Ac | Sd | ɛ
Left recursion elimination algorithm:
 Arrange the nonterminals in some order A1,A2,
…,An.
 For (each i from 1 to n) {
 For (each j from 1 to i-1) {
 Replace each production of the form Ai-> Aj γ by the
production Ai -> δ1 γ | δ2 γ | … |δk γ where Aj-> δ1
| δ2 | … |δk are all current Aj productions
}
 Eliminate left recursion among the Ai-productions
 }
Left factoring
Left factoring is a grammar transformation that
is useful for producing a grammar suitable for
predictive or top-down parsing.
Consider following grammar:
 Stmt -> if expr then stmt else stmt
 | if expr then stmt
On seeing the input symbol if, it is not clear to the parser
which production to use
We can easily perform left factoring:
 If we have A->αβ1 | αβ2 then we replace it with
 A -> αA’
 A’ -> β1 | β2
Left factoring (cont.)
Algorithm
For each non-terminal A, find the longest prefix α common to two
or more of its alternatives. If α ≠ ɛ, then replace all of the
A-productions A -> αβ1 | αβ2 | … | αβn | γ by
 A -> αA'
 A' -> β1 | β2 | … | βn

Example:
S -> i E t S | i E t S e S | a
E -> b
Left factoring gives:
S -> i E t S S'
S' -> e S | ɛ
E -> b
Introduction
A Top-down parser tries to create a parse tree from
the root towards the leafs scanning input from left to
right
It can be also viewed as finding a leftmost derivation
for an input string
Example: id+id*id

E -> TE'
E' -> +TE' | Ɛ
T -> FT'
T' -> *FT' | Ɛ
F -> (E) | id

The parse tree is grown from the root by leftmost derivation steps:
E =>lm TE' =>lm FT'E' =>lm id T'E' =>lm id E' =>lm id + TE' =>lm … =>lm id+id*id
First and Follow

First(α) is the set of terminals that begin strings derived from α
If α =>* ɛ, then ɛ is also in First(α)
In predictive parsing, when we have A -> α | β, if First(α)
and First(β) are disjoint sets then we can select the
appropriate A-production by looking at the next input symbol
Follow(A), for any nonterminal A, is the set of terminals a
that can appear immediately after A in some sentential form:
 If we have S =>* αAaβ for some α and β, then a is in Follow(A)
 If A can be the rightmost symbol in some sentential form,
 then $ is in Follow(A)
Computing First

To compute First(X) for all grammar symbols X, apply the
following rules until no more terminals or ɛ can be added to
any First set:
1. If X is a terminal, then First(X) = {X}
2. If X is a nonterminal and X -> Y1Y2…Yk is a production for
   some k >= 1, then place a in First(X) if for some i, a is in
   First(Yi) and ɛ is in all of First(Y1), …, First(Yi-1), that
   is, Y1…Yi-1 =>* ɛ. If ɛ is in First(Yj) for all j = 1, …, k,
   then add ɛ to First(X)
3. If X -> ɛ is a production, then add ɛ to First(X)
Example!
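The three rules above can be sketched as a fixed-point computation. The grammar encoding and the 'eps' marker are assumptions for illustration; the grammar is the LL(1) expression grammar used later in these slides.

```python
EPS = 'eps'

def compute_first(grammar, terminals):
    """Iterate the First rules until no First set changes.

    grammar maps each nonterminal to a list of production bodies
    (tuples of symbols); the empty tuple stands for an epsilon-production.
    """
    first = {t: {t} for t in terminals}       # rule 1
    for A in grammar:
        first[A] = set()
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                if not prod:                  # rule 3: X -> eps
                    changed |= EPS not in first[A]
                    first[A].add(EPS)
                    continue
                for Y in prod:                # rule 2
                    before = len(first[A])
                    first[A] |= first[Y] - {EPS}
                    changed |= len(first[A]) != before
                    if EPS not in first[Y]:
                        break
                else:                         # every Yi derives eps
                    changed |= EPS not in first[A]
                    first[A].add(EPS)
    return first

G = {"E": [("T", "E'")],
     "E'": [('+', 'T', "E'"), ()],
     "T": [("F", "T'")],
     "T'": [('*', 'F', "T'"), ()],
     "F": [('(', 'E', ')'), ('id',)]}
first = compute_first(G, {'+', '*', '(', ')', 'id'})
print(first['E'])   # {'(', 'id'}
print(first["T'"])  # {'*', 'eps'}
```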
Computing Follow

To compute Follow(A) for all nonterminals A, apply the
following rules until nothing can be added to any Follow set:
1. Place $ in Follow(S), where S is the start symbol
2. If there is a production A -> αBβ, then everything in
   First(β) except ɛ is in Follow(B)
3. If there is a production A -> αB, or a production A -> αBβ
   where First(β) contains ɛ, then everything in Follow(A) is
   in Follow(B)
Example!
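A sketch of the Follow rules for the same LL(1) expression grammar. To keep the example short, the First sets are taken as given rather than recomputed (an assumption; they follow from the rules on the previous slide).

```python
EPS = 'eps'
# First sets for the LL(1) expression grammar, taken as given here.
FIRST = {'E': {'(', 'id'}, "E'": {'+', EPS}, 'T': {'(', 'id'},
         "T'": {'*', EPS}, 'F': {'(', 'id'},
         '+': {'+'}, '*': {'*'}, '(': {'('}, ')': {')'}, 'id': {'id'}}

def first_of(seq):
    """First of a string of symbols, using the FIRST table."""
    out = set()
    for X in seq:
        out |= FIRST[X] - {EPS}
        if EPS not in FIRST[X]:
            return out
    out.add(EPS)                        # every symbol derives eps
    return out

def compute_follow(grammar, start):
    follow = {A: set() for A in grammar}
    follow[start].add('$')              # rule 1
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                for i, B in enumerate(prod):
                    if B not in grammar:
                        continue        # only nonterminals get Follow
                    f = first_of(prod[i + 1:])      # rule 2
                    add = f - {EPS}
                    if EPS in f:                    # rule 3
                        add |= follow[A]
                    if not add <= follow[B]:
                        follow[B] |= add
                        changed = True
    return follow

G = {"E": [("T", "E'")],
     "E'": [('+', 'T', "E'"), ()],
     "T": [("F", "T'")],
     "T'": [('*', 'F', "T'"), ()],
     "F": [('(', 'E', ')'), ('id',)]}
follow = compute_follow(G, 'E')
print(follow['F'])  # {'+', '*', ')', '$'}
```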
LL(1) Grammars

Predictive parsers are those recursive-descent parsers needing
no backtracking
Grammars for which we can create predictive parsers are called LL(1):
 The first L means scanning the input from left to right
 The second L means a leftmost derivation
 And 1 stands for using one input symbol of lookahead
A grammar G is LL(1) if and only if whenever A -> α | β are
two distinct productions of G, the following conditions hold:
 For no terminal a do α and β both derive strings beginning with a
 At most one of α or β can derive the empty string
 If α =>* ɛ, then β does not derive any string beginning with a
 terminal in Follow(A)
Construction of predictive
parsing table
For each production A -> α in the grammar do the following:
 For each terminal a in First(α), add A -> α to M[A,a]
 If ɛ is in First(α), then for each terminal b in Follow(A),
 add A -> α to M[A,b]. If ɛ is in First(α) and $ is in
 Follow(A), add A -> α to M[A,$] as well
If after performing the above there is no production in
M[A,a], then set M[A,a] to error
Example

      First      Follow
E     {(, id}    {), $}
E'    {+, ɛ}     {), $}
T     {(, id}    {+, ), $}
T'    {*, ɛ}     {+, ), $}
F     {(, id}    {+, *, ), $}

E -> TE'
E' -> +TE' | Ɛ
T -> FT'
T' -> *FT' | Ɛ
F -> (E) | id
                                 Input Symbol
Nonterminal  id         +           *           (          )         $
E            E -> TE'                           E -> TE'
E'                      E' -> +TE'                         E' -> Ɛ   E' -> Ɛ
T            T -> FT'                           T -> FT'
T'                      T' -> Ɛ     T' -> *FT'             T' -> Ɛ   T' -> Ɛ
F            F -> id                            F -> (E)
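The parsing table drives a simple nonrecursive predictive parser; a sketch, where the dictionary encoding of the table is an assumption for illustration (an empty body stands for an ɛ-production).

```python
# TABLE[A][a] is the production body to expand; absent entries are errors.
TABLE = {
    'E':  {'id': ['T', "E'"], '(': ['T', "E'"]},
    "E'": {'+': ['+', 'T', "E'"], ')': [], '$': []},
    'T':  {'id': ['F', "T'"], '(': ['F', "T'"]},
    "T'": {'+': [], '*': ['*', 'F', "T'"], ')': [], '$': []},
    'F':  {'id': ['id'], '(': ['(', 'E', ')']},
}

def parse(tokens):
    tokens = tokens + ['$']
    stack = ['$', 'E']                 # start symbol on top
    i = 0
    while stack[-1] != '$':
        top, a = stack[-1], tokens[i]
        if top == a:                   # match a terminal
            stack.pop()
            i += 1
        elif top in TABLE:             # expand a nonterminal
            body = TABLE[top].get(a)
            if body is None:
                return False           # error entry
            stack.pop()
            stack.extend(reversed(body))
        else:
            return False               # terminal mismatch
    return tokens[i] == '$'

print(parse(['id', '+', 'id', '*', 'id']))  # True
print(parse(['id', '+']))                   # False
```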


Another example

S -> iEtSS' | a
S' -> eS | Ɛ
E -> b

                      Input Symbol
Nonterminal  a       b       e          i             t    $
S            S -> a                     S -> iEtSS'
S'                           S' -> Ɛ                       S' -> Ɛ
                             S' -> eS
E                    E -> b
Error recovery in predictive parsing
Panic mode
 Place all symbols in Follow(A) into the synchronization set
 for nonterminal A: skip tokens until an element of Follow(A)
 is seen and pop A from the stack
 Add to the synchronization set of a lower-level construct the
 symbols that begin higher-level constructs
 Add symbols in First(A) to the synchronization set of
 nonterminal A
 If a nonterminal can generate the empty string, then the
 production deriving ɛ can be used as a default
 If a terminal on top of the stack cannot be matched, pop the
 terminal and issue a message saying that the terminal was
 inserted
Example

                                 Input Symbol
Nonterminal  id         +           *           (          )         $
E            E -> TE'                           E -> TE'   synch     synch
E'                      E' -> +TE'                         E' -> Ɛ   E' -> Ɛ
T            T -> FT'   synch                   T -> FT'   synch     synch
T'                      T' -> Ɛ     T' -> *FT'             T' -> Ɛ   T' -> Ɛ
F            F -> id    synch       synch       F -> (E)   synch     synch

Stack     Input      Action
E$        )id*+id$   Error, skip )
E$        id*+id$    id is in First(E)
TE'$      id*+id$
FT'E'$    id*+id$
idT'E'$   id*+id$
T'E'$     *+id$
*FT'E'$   *+id$
FT'E'$    +id$       Error, M[F,+] = synch
T'E'$     +id$       F has been popped
Introduction

A bottom-up parser constructs a parse tree for an input string
beginning at the leaves (the bottom) and working towards the
root (the top)

Example: id*id

E -> E + T | T
T -> T * F | F
F -> (E) | id

The sequence of reductions:
id * id  =>  F * id  =>  T * id  =>  T * F  =>  T  =>  E
LR Parsing
The most prevalent type of bottom-up parser
LR(k); we are mostly interested in parsers with k <= 1
Why LR parsers?
 Table driven
 Can be constructed to recognize all programming
language constructs
 Most general non-backtracking shift-reduce parsing
method
 Can detect a syntactic error as soon as it is possible to do
so
 The class of grammars for which we can construct LR parsers
is a superset of those for which we can construct LL parsers
States of an LR parser

States represent sets of items
An LR(0) item of G is a production of G with the dot at some
position of the body. For A -> XYZ we have the following items:
 A -> .XYZ
 A -> X.YZ
 A -> XY.Z
 A -> XYZ.
In a state having A -> .XYZ we hope to see a string derivable
from XYZ next on the input. What about A -> X.YZ?
Constructing canonical LR(0)
item sets

Augmented grammar:
 G with the addition of a production: S' -> S
Closure of item sets:
 If I is a set of items, closure(I) is the set of items
 constructed from I by the following rules:
  Add every item in I to closure(I)
  If A -> α.Bβ is in closure(I) and B -> γ is a production,
  then add the item B -> .γ to closure(I)

Example:
E' -> E              I0 = closure({[E'->.E]}):
E -> E + T | T        E'->.E
T -> T * F | F        E->.E+T    E->.T
F -> (E) | id         T->.T*F    T->.F
                      F->.(E)    F->.id
Constructing canonical LR(0)
item sets (cont.)

Goto(I, X), where I is an item set and X is a grammar symbol,
is the closure of the set of all items [A -> αX.β] where
[A -> α.Xβ] is in I

Example, with I0 = closure({[E'->.E]}) as above:
Goto(I0, E) = I1:  E'->E.   E->E.+T
Goto(I0, T) = I2:  E->T.    T->T.*F
Goto(I0, () = I4:  F->(.E)  E->.E+T  E->.T
                   T->.T*F  T->.F
                   F->.(E)  F->.id
Closure algorithm
SetOfItems CLOSURE(I) {
    J = I;
    repeat
        for (each item A -> α.Bβ in J)
            for (each production B -> γ of G)
                if (B -> .γ is not in J)
                    add B -> .γ to J;
    until no more items are added to J on one round;
    return J;
}
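The CLOSURE computation can be sketched in executable form; the (head, body, dot) encoding of items is an assumption for illustration, checked against I0 of the running expression-grammar example.

```python
def closure(items, grammar):
    """LR(0) CLOSURE: an item is (head, body, dot); grammar maps each
    nonterminal to a list of production bodies (tuples of symbols)."""
    J = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(J):
            # A -> alpha . B beta, with B a nonterminal
            if dot < len(body) and body[dot] in grammar:
                for gamma in grammar[body[dot]]:
                    item = (body[dot], gamma, 0)   # B -> . gamma
                    if item not in J:
                        J.add(item)
                        changed = True
    return J

G = {'E': [('E', '+', 'T'), ('T',)],
     'T': [('T', '*', 'F'), ('F',)],
     'F': [('(', 'E', ')'), ('id',)]}
I0 = closure({("E'", ('E',), 0)}, G)   # start from E' -> .E
print(len(I0))  # 7 items, as in I0 of the example
```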
GOTO algorithm
SetOfItems GOTO(I, X) {
    J = empty set;
    for (each item [A -> α.Xβ] in I)
        add CLOSURE({[A -> αX.β]}) to J;
    return J;
}
Canonical LR(0) items
void items(G') {
    C = CLOSURE({[S'->.S]});
    repeat
        for (each set of items I in C)
            for (each grammar symbol X)
                if (GOTO(I,X) is not empty and not in C)
                    add GOTO(I,X) to C;
    until no new sets of items are added to C on a round;
}
Example

E' -> E
E -> E + T | T
T -> T * F | F
F -> (E) | id

The canonical LR(0) collection and transitions:
I0:  E'->.E  E->.E+T  E->.T  T->.T*F  T->.F  F->.(E)  F->.id
I1:  E'->E.  E->E.+T                        Goto(I0, E); on $: accept
I2:  E->T.   T->T.*F                        Goto(I0, T)
I3:  T->F.                                  Goto(I0, F)
I4:  F->(.E)  E->.E+T  E->.T  T->.T*F       Goto(I0, ()
     T->.F  F->.(E)  F->.id
I5:  F->id.                                 Goto(I0, id)
I6:  E->E+.T  T->.T*F  T->.F                Goto(I1, +)
     F->.(E)  F->.id
I7:  T->T*.F  F->.(E)  F->.id               Goto(I2, *)
I8:  F->(E.)  E->E.+T                       Goto(I4, E)
I9:  E->E+T.  T->T.*F                       Goto(I6, T)
I10: T->T*F.                                Goto(I7, F)
I11: F->(E).                                Goto(I8, ))
Use of LR(0) automaton

Example: id*id

Line  Stack   Symbols  Input    Action
(1)   0       $        id*id$   Shift to 5
(2)   05      $id      *id$     Reduce by F->id
(3)   03      $F       *id$     Reduce by T->F
(4)   02      $T       *id$     Shift to 7
(5)   027     $T*      id$      Shift to 5
(6)   0275    $T*id    $        Reduce by F->id
(7)   02710   $T*F     $        Reduce by T->T*F
(8)   02      $T       $        Reduce by E->T
(9)   01      $E       $        Accept
LR-Parsing model

    INPUT:  a1 … ai … an $
    STACK:  sm sm-1 … $   (states, sm on top)
    The LR parsing program reads the input, consults the ACTION
    and GOTO tables, and produces the output.
LR parsing algorithm
let a be the first symbol of w$;
while(1) { /*repeat forever */
let s be the state on top of the stack;
if (ACTION[s,a] = shift t) {
push t onto the stack;
let a be the next input symbol;
} else if (ACTION[s,a] = reduce A->β) {
pop |β| symbols of the stack;
let state t now be on top of the stack;
push GOTO[t,A] onto the stack;
output the production A->β;
} else if (ACTION[s,a]=accept) break; /* parsing is done */
else call error-recovery routine;
}
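The algorithm above can be sketched with the SLR(1) table of the expression-grammar example hard-coded; the dictionary encoding of ACTION/GOTO is an assumption for illustration ('s5' = shift to state 5, 'r6' = reduce by production 6).

```python
# (head, body length) per production; production 0 is E' -> E.
PRODS = [("E'", 1), ('E', 3), ('E', 1), ('T', 3), ('T', 1),
         ('F', 3), ('F', 1)]

ACTION = {
    (0, 'id'): 's5', (0, '('): 's4',
    (1, '+'): 's6', (1, '$'): 'acc',
    (2, '+'): 'r2', (2, '*'): 's7', (2, ')'): 'r2', (2, '$'): 'r2',
    (3, '+'): 'r4', (3, '*'): 'r4', (3, ')'): 'r4', (3, '$'): 'r4',
    (4, 'id'): 's5', (4, '('): 's4',
    (5, '+'): 'r6', (5, '*'): 'r6', (5, ')'): 'r6', (5, '$'): 'r6',
    (6, 'id'): 's5', (6, '('): 's4',
    (7, 'id'): 's5', (7, '('): 's4',
    (8, '+'): 's6', (8, ')'): 's11',
    (9, '+'): 'r1', (9, '*'): 's7', (9, ')'): 'r1', (9, '$'): 'r1',
    (10, '+'): 'r3', (10, '*'): 'r3', (10, ')'): 'r3', (10, '$'): 'r3',
    (11, '+'): 'r5', (11, '*'): 'r5', (11, ')'): 'r5', (11, '$'): 'r5',
}
GOTO = {(0, 'E'): 1, (0, 'T'): 2, (0, 'F'): 3,
        (4, 'E'): 8, (4, 'T'): 2, (4, 'F'): 3,
        (6, 'T'): 9, (6, 'F'): 3, (7, 'F'): 10}

def lr_parse(tokens):
    tokens = tokens + ['$']
    stack, i, output = [0], 0, []      # stack of states only
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return None                # error entry
        if act == 'acc':
            return output              # reductions, bottom-up order
        if act[0] == 's':              # shift: push state, advance
            stack.append(int(act[1:]))
            i += 1
        else:                          # reduce by production number
            num = int(act[1:])
            head, n = PRODS[num]
            del stack[len(stack) - n:] # pop |body| states
            stack.append(GOTO[(stack[-1], head)])
            output.append(num)

print(lr_parse(['id', '*', 'id']))  # [6, 4, 6, 3, 2]
```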
Example

(0) E' -> E     (1) E -> E + T   (2) E -> T
(3) T -> T * F  (4) T -> F
(5) F -> (E)    (6) F -> id

              ACTION                        GOTO
State  id    +     *     (     )     $     E   T   F
0      s5               s4                 1   2   3
1            s6                      acc
2            r2    s7          r2    r2
3            r4    r4          r4    r4
4      s5               s4                 8   2   3
5            r6    r6          r6    r6
6      s5               s4                     9   3
7      s5               s4                         10
8            s6                s11
9            r1    s7          r1    r1
10           r3    r3          r3    r3
11           r5    r5          r5    r5

Parsing id*id+id:

Line  Stack   Symbols  Input      Action
(1)   0                id*id+id$  Shift to 5
(2)   05      id       *id+id$    Reduce by F->id
(3)   03      F        *id+id$    Reduce by T->F
(4)   02      T        *id+id$    Shift to 7
(5)   027     T*       id+id$     Shift to 5
(6)   0275    T*id     +id$       Reduce by F->id
(7)   02710   T*F      +id$       Reduce by T->T*F
(8)   02      T        +id$       Reduce by E->T
(9)   01      E        +id$       Shift
(10)  016     E+       id$        Shift
(11)  0165    E+id     $          Reduce by F->id
Constructing SLR parsing table
Method
 Construct C={I0,I1, … , In}, the collection of LR(0)
items for G’
 State i is constructed from state Ii:
 If [A->α.aβ] is in Ii and Goto(Ii,a)=Ij, then set ACTION[i,a] to
“shift j”
 If [A->α.] is in Ii, then set ACTION[i,a] to “reduce A->α” for
all a in follow(A)
 If [S'->S.] is in Ii, then set ACTION[i,$] to "accept"
 If any conflicts appears then we say that the grammar
is not SLR(1).
 If GOTO(Ii,A) = Ij then GOTO[i,A]=j
 All entries not defined by above rules are made “error”
 The initial state of the parser is the one constructed
from the set of items containing [S’->.S]
Example grammar which is not
SLR(1)

S -> L=R | R
L -> *R | id
R -> L

I0: S'->.S  S->.L=R  S->.R  L->.*R  L->.id  R->.L
I1: S'->S.
I2: S->L.=R  R->L.
I3: S->R.
I4: L->*.R  R->.L  L->.*R  L->.id
I5: L->id.
I6: S->L=.R  R->.L  L->.*R  L->.id
I7: L->*R.
I8: R->L.
I9: S->L=R.

In I2, on input =, the action can be both "shift 6" (from
S->L.=R) and "reduce R->L" (since = is in Follow(R)):
a shift/reduce conflict, so the grammar is not SLR(1)
More powerful LR parsers
Canonical-LR or just LR method
Use lookahead symbols for items: LR(1) items
Results in a large collection of items
LALR: lookaheads are introduced in LR(0)
items
