Ch3 Modified
Uploaded by Maheen Munir

CHAPTER 3

Lexical Analysis and Lexical Analyzer Generators
The Reason Why Lexical Analysis is a Separate Phase

• Convert a stream of characters into a stream of tokens.
• Simplicity: conventions about "words" are often different from conventions about "sentences".
• Efficiency: the word-identification problem has a much more efficient solution than the sentence-identification problem.
• Portability: character set, special characters, device features.
Interaction of the Lexical Analyzer with the Parser

  Source Program → Lexical Analyzer → (Token, tokenval) → Parser
  Parser → GetNextToken → Lexical Analyzer

Both the lexical analyzer and the parser report errors and consult the Symbol Table.
The Role of the Lexical Analyzer

• It reads the input characters of the source program, groups them into lexemes, and produces as output a sequence of tokens, one for each lexeme in the source program.
• It strips out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
• Another task is correlating error messages generated by the compiler with the source program; for example, it associates a line number with each error message.
Attributes of Tokens

  y := 31 + 28*x   →   (lexical analyzer)   →
  <id, "y"> <assign, > <num, 31> <'+', > <num, 28> <'*', > <id, "x">

The lexical analyzer passes each token (used by the parser as lookahead) together with tokenval, the token's attribute, on to the parser.
Tokens, Patterns, and Lexemes

• A token is a classification of lexical units
  • For example: id and num
• Lexemes are the specific character strings that make up a token
  • For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token
  • For example: "letter followed by letters and digits" and "non-empty sequence of digits"
Examples of Tokens

Token       Informal description                    Sample lexemes
if          Characters i, f                         if
else        Characters e, l, s, e                   else
comparison  < or > or <= or >= or == or !=          <=, !=
id          Letter followed by letters and digits   pi, score, D2
number      Any numeric constant                    3.14159, 0, 6.02e23
literal     Anything but " surrounded by "          "core dumped"

  printf("total = %d\n", score);


Lexical Errors

• It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error, e.g.
  fi (x == f(x)) …
• A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
• The parser, in this case, will probably handle the error due to the transposition of the letters.
• The lexical analyzer can detect characters that are not in the alphabet or strings that match no pattern.
General Idea of Input Buffering

• The string of characters between the two pointers is the current lexeme.
• Two pointers to the input are maintained:
  1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
  2. Pointer forward scans ahead until a pattern match is found.

In simple terms, imagine you're reading a book and you use your fingers to track where you are on the page. The lexemeBegin pointer is like putting your finger at the start of a word you're about to read. The forward pointer is like your eyes scanning ahead to find the end of that word.
Terms for Parts of Strings

1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and ε are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and ε are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For instance, banana, nan, and ε are substrings of banana.
Terms for Parts of Strings

4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.
Specification of Patterns for Tokens: Definitions

• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
• |s| denotes the length of string s
• ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ
Specification of Patterns for Tokens: String Operations

• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by

  s^0 = ε
  s^i = s^(i-1)s for i > 0

  note that sε = εs = s
Specification of Patterns for Tokens: Language Operations

• Union
  L ∪ M = { s | s ∈ L or s ∈ M }
• Concatenation
  LM = { xy | x ∈ L and y ∈ M }
• Exponentiation
  L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure
  L* = ∪ i=0,…,∞ L^i
• Positive closure
  L+ = ∪ i=1,…,∞ L^i
Specification of Patterns for Tokens: Regular Expressions

• Basis symbols:
  • ε is a regular expression denoting the language {ε}
  • a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and L(s) respectively, then
  • r|s is a regular expression denoting L(r) ∪ L(s)
  • rs is a regular expression denoting L(r)L(s)
  • r* is a regular expression denoting L(r)*
  • (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set
Algebraic Laws for Regular Expressions
Specification of Patterns for Tokens: Regular Definitions

• If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

  d1 → r1
  d2 → r2
  …
  dn → rn

where:
  • Each di is a new symbol, not in Σ and not the same as any other of the d's, and
  • Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}.
Specification of Patterns for Tokens: Regular Definitions

• Example:

  letter → A | B | … | Z | a | b | … | z
  digit → 0 | 1 | … | 9
  id → letter ( letter | digit )*

• Regular definitions cannot be recursive:

  digits → digit digits | digit    wrong!

Specification of Patterns for Tokens: Notational Shorthand

• The following shorthands are often used:

  r+ = rr*
  r? = r | ε
  [a-z] = a | b | c | … | z
  [abc] = a | b | c

• Examples:
  digit → [0-9]
  num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Regular Definitions and Grammars

Grammar:
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term        (relop: relational operators)
       | term
  term → id
       | num

Regular definitions:
  if → if
  then → then
  else → else
  relop → < | <= | <> | > | >= | =
  id → letter ( letter | digit )*
  num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Regular Definitions and Grammars

Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace. It is the following token that gets returned to the parser.

  blank → b
  tab → ^T
  newline → ^M      (^ denotes a control character)
  delim → blank | tab | newline
  ws → delim+
Tokens, Lexemes & Attribute Values
Coding Regular Definitions in Transition Diagrams

relop → < | <= | <> | > | >= | =

  start state 0 --<--> state 1:
    on =     → state 2:  return(relop, LE)
    on >     → state 3:  return(relop, NE)
    on other → state 4*: return(relop, LT)
  state 0 --=--> state 5: return(relop, EQ)
  state 0 -->--> state 6:
    on =     → state 7:  return(relop, GE)
    on other → state 8*: return(relop, GT)

id → letter ( letter | digit )*

  start state 9 --letter--> state 10 (loop on letter or digit)
    --other--> state 11*: return(gettoken(), install_id())

(* marks states where the forward pointer must be retracted one position)
What Else Does a Lexical Analyzer Do?

All keywords / reserved words are matched as ids.
• After the match, the symbol table or a special keyword table is consulted.
• The keyword table contains string versions of all keywords and their associated token values:
    if     15
    then   16
    begin  17
    ...    ...
• When a match is found, the token is returned, along with its symbolic value, e.g. ("then", 16).
• If a match is not found, then it is assumed that an id has been discovered.
Regular Expression Examples (+ denotes alternation)

• r = ∅, L(r) = {}
• r = ε, L(r) = {ε}
• r = a, L(r) = {a}
• r = a+b, L(r) = {a, b}
• r = a.b, L(r) = {ab}
• r = a+b+c, L(r) = {a, b, c}
• r = (ab+c).b, L(r) = {abb, cb}
• r = a+, L(r) = {a, aa, aaa, …}
• r = a*, L(r) = {ε, a, aa, aaa, …}
• r = (a+ba)(b+a), L(r) = {ab, aa, bab, baa}
• r = (a+ε)(b+∅), L(r) = {ab, b}
• r = (a+b)*, L(r) = all strings over {a, b}, including ε
• r = (a+b)*(a+b), L(r) = L((a+b)+), all non-empty strings over {a, b}
• r = a*.a*, L(r) = L(a*)
• r = (ab)*, L(r) = {ε, ab, abab, ababab, …}
Language examples (over Σ = {a, b})

• Strings of exactly 2 symbols
  • L = {aa, ab, ba, bb}
• Strings of at least 2 symbols
  • (a+b)(a+b)(a+b)*
• Strings of at most 2 symbols
  • (ε+a+b)(ε+a+b)
• Even-length strings
  • ( (a+b)(a+b) )*
• Odd-length strings
  • ( (a+b)(a+b) )*(a+b)
Coding Regular Definitions in Transition Diagrams: Code

token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1:
      …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
      …

int fail()   /* decides the next start state to check */
{ forward = token_beginning;
  switch (start) {
  case 0:  start = 9;  break;
  case 9:  start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover();  break;
  default: /* error */
  }
  return start;
}

The Lex and Flex Scanner Generators

• Lex and its newer cousin flex are scanner generators
• Scanner generators systematically translate regular definitions into C source code for efficient scanning
• Generated code is easy to integrate in C applications
Creating a Lexical Analyzer with Lex and Flex

  lex source program lex.l → lex (or flex) → lex.yy.c
  lex.yy.c → C compiler → a.out
  input stream → a.out → sequence of tokens
LEX Specification

• A Lex specification consists of three parts:

  declarations (regular definitions; C declarations within %{ %})
  %%
  translation rules
  %%
  user-defined auxiliary procedures

• The translation rules are of the form:
  p1 { action1 }
  p2 { action2 }
  …
  pn { actionn }
LEX Specification Detailed

%{
/* C declarations: definitions of all constants
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ... */
%}
......
/* declarations (regular definitions) */
letter [A-Za-z]
digit  [0-9]
id     {letter}({letter}|{digit})*
......
%%
/* rules */
if     { return(IF); }
then   { return(THEN); }
{id}   { yylval = install_id(); return(ID); }
......
%%
/* auxiliary procedures */
install_id()
{ /* procedure to install the lexeme into the symbol table */
Regular Expressions in Lex
Predefined Functions and Variables of Lex

yyin :- the input stream pointer (i.e. it points to the input file which is to be scanned or tokenised); the default input of the default main() is stdin.
yylex() :- the main entry point for Lex; it reads the input stream, generates tokens, and returns zero at the end of the input stream. It is called to invoke the lexer (or scanner), and each time yylex() is called, the scanner continues processing the input from where it last left off.
yytext :- a buffer that holds the input characters that actually match the pattern (i.e. the lexeme); in effect, a pointer to the matched string.
yyleng :- the length of the lexeme.
yylval :- contains the token value (the attribute passed to the parser).
yyval :- a local variable.
yyout :- the output stream pointer (i.e. it points to the file where output is written); the default output of the default main() is stdout.
yywrap() :- called by Lex when the input is exhausted (at EOF); the default yywrap always returns 1.
yymore() :- appends the next matched lexeme to the current contents of yytext instead of replacing it.
yyless(k) :- retains the first k characters of yytext and returns the remaining matched characters to the input.
yyparse() :- the parser entry point generated by yacc; it builds the parse tree by repeatedly calling yylex() for tokens.
Example Lex Specification 1

%{
#include <stdio.h>
%}
%%
[0-9]+ { printf("%s\n", yytext); }   /* yytext contains the matching lexeme */
.|\n   { }
%%
main()
{ yylex();   /* invokes the lexical analyzer */
}

Usage:
  lex spec.l
  gcc lex.yy.c -ll
  ./a.out < spec.l
Example Lex Specification 2

%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim [ \t]+                      /* regular definition */
%%
\n       { ch++; wd++; nl++; }
^{delim} { ch+=yyleng; }
{delim}  { ch+=yyleng; wd++; }
.        { ch++; }
%%
main()
{ yylex();
  printf("%8d%8d%8d\n", nl, wd, ch);
}
Example Lex Specification 3

%{
#include <stdio.h>
%}
digit  [0-9]                      /* regular definitions */
letter [A-Za-z]
id     {letter}({letter}|{digit})*
%%
{digit}+ { printf("number: %s\n", yytext); }
{id}     { printf("ident: %s\n", yytext); }
.        { printf("other: %s\n", yytext); }
%%
main()
{ yylex();
}
Example Lex Specification 4

%{ /* definitions of manifest constants */
#define LT (256)
…
%}
delim  [ \t\n]
ws     {delim}+
letter [A-Za-z]
digit  [0-9]
id     {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}     { }                                        /* no token returned */
if       { return IF; }                             /* return token to parser */
then     { return THEN; }
else     { return ELSE; }
{id}     { yylval = install_id(); return ID; }      /* token attribute */
{number} { yylval = install_num(); return NUMBER; }
"<"      { yylval = LT; return RELOP; }
"<="     { yylval = LE; return RELOP; }
"="      { yylval = EQ; return RELOP; }
"<>"     { yylval = NE; return RELOP; }
">"      { yylval = GT; return RELOP; }
">="     { yylval = GE; return RELOP; }
%%
int install_id()   /* installs yytext as an identifier in the symbol table */
…
Design of a Lexical Analyzer Generator

• Translate regular expressions to an NFA
• Translate the NFA to an efficient DFA (optional)

  regular expressions → NFA → DFA

Either simulate the NFA to recognize tokens, or simulate the DFA to recognize tokens.
Nondeterministic Finite Automata

• An NFA is a 5-tuple (S, Σ, δ, s0, F) where

  S is a finite set of states
  Σ is a finite set of symbols, the alphabet
  δ is a mapping from S × Σ to a set of states
  s0 ∈ S is the start state
  F ⊆ S is the set of accepting (or final) states
Transition Graph

• An NFA can be diagrammatically represented by a labeled directed graph called a transition graph

  start → 0 --a--> 1 --b--> 2 --b--> 3
  (state 0 also loops to itself on a and on b)

  S = {0,1,2,3}, Σ = {a,b}, s0 = 0, F = {3}
Transition Table

• The mapping δ of an NFA can be represented in a transition table

  δ(0,a) = {0,1}        State |   a    |  b
  δ(0,b) = {0}            0   | {0,1}  | {0}
  δ(1,b) = {2}            1   |        | {2}
  δ(2,b) = {3}            2   |        | {3}
The Language Defined by an NFA

• An NFA accepts an input string x if and only if there is some path with edges labeled with the symbols of x in sequence from the start state to some accepting state in the transition graph
• A state transition from one state to another on the path is called a move
• The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example NFA
Design of a Lexical Analyzer Generator: RE to NFA to DFA

Lex specification with regular expressions:
  p1 { action1 }
  p2 { action2 }
  …
  pn { actionn }

NFA: a new start state s0 with ε-transitions to N(p1), N(p2), …, N(pn); the accepting state of each N(pi) is annotated with actioni.

Applying the subset construction then yields a DFA.
From Regular Expression to NFA (Thompson's Construction)

  ε:        start → i --ε--> f
  a:        start → i --a--> f
  r1 | r2:  start → i, with ε-edges from i into N(r1) and into N(r2), and ε-edges from both out to f
  r1 r2:    start → i → N(r1) → N(r2) → f
  r*:       start → i --ε--> N(r) --ε--> f, plus an ε-edge from i directly to f and an ε-edge from the end of N(r) back to its start
Combining the NFAs of a Set of Regular Expressions

  a    { action1 }   NFA: 1 --a--> 2
  abb  { action2 }   NFA: 3 --a--> 4 --b--> 5 --b--> 6
  a*b+ { action3 }   NFA: 7 --b--> 8, with an a-loop on 7 and a b-loop on 8

Combined NFA: a new start state 0 with ε-transitions to states 1, 3, and 7.
Simulating the Combined NFA: Example 1

(accepting states: 2 → action1, 6 → action2, 8 → action3)

Input a a b a:

  {0,1,3,7} --a--> {2,4,7} --a--> {7} --b--> {8} --a--> none

Must find the longest match: continue until no further moves are possible. When the last state set reached contains an accepting state, execute its action (here action3).
Simulating the Combined NFA: Example 2

Input a b b a:

  {0,1,3,7} --a--> {2,4,7} --b--> {5,8} --b--> {6,8} --a--> none

When two or more accepting states are reached (here 6 and 8), the first action given in the Lex specification is executed (here action2).
Deterministic Finite Automata

• A deterministic finite automaton is a special case of an NFA
  • No state has an ε-transition
  • For each state s and input symbol a there is at most one edge labeled a leaving s
• Each entry in the transition table is a single state
  • At most one path exists to accept a string
  • The simulation algorithm is simple
Example DFA

A DFA that accepts (a|b)*abb:

  start → 0 --a--> 1 --b--> 2 --b--> 3 (accepting)
  0 --b--> 0;  1 --a--> 1;  2 --a--> 1;  3 --a--> 1;  3 --b--> 0
Conversion of an NFA into a DFA

• The subset construction algorithm converts an NFA into a DFA using:

  ε-closure(s) = {s} ∪ { t | s →ε … →ε t }
  ε-closure(T) = ∪ s∈T ε-closure(s)
  move(T,a) = { t | s →a t and s ∈ T }

• The algorithm produces:
  Dstates, the set of states of the new DFA, consisting of sets of states of the NFA
  Dtran, the transition table of the new DFA
ε-closure and move Examples

(using the combined NFA for a, abb, and a*b+)

  ε-closure({0}) = {0,1,3,7}
  move({0,1,3,7},a) = {2,4,7}
  ε-closure({2,4,7}) = {2,4,7}
  move({2,4,7},a) = {7}
  ε-closure({7}) = {7}
  move({7},b) = {8}
  ε-closure({8}) = {8}
  move({8},a) = ∅

On input a a b a this gives {0,1,3,7} → {2,4,7} → {7} → {8} → none; ε-closure and move are thus also used to simulate NFAs (!)
Simulating an NFA using ε-closure and move

S := ε-closure({s0})
Sprev := ∅
a := nextchar()
while S ≠ ∅ do
  Sprev := S
  S := ε-closure(move(S,a))
  a := nextchar()
end do
if Sprev ∩ F ≠ ∅ then
  execute action in Sprev
  return "yes"
else return "no"
The Subset Construction Algorithm

Initially, ε-closure(s0) is the only state in Dstates and it is unmarked
while there is an unmarked state T in Dstates do
  mark T
  for each input symbol a ∈ Σ do
    U := ε-closure(move(T,a))
    if U is not in Dstates then
      add U as an unmarked state to Dstates
    end if
    Dtran[T,a] := U
  end do
end do
Subset Construction Example 1

(applied to the Thompson NFA for (a|b)*abb, states 0..10:
start 0 --ε--> 1, 7; 1 --ε--> 2, 4; 2 --a--> 3; 4 --b--> 5; 3 --ε--> 6; 5 --ε--> 6; 6 --ε--> 1, 7; 7 --a--> 8 --b--> 9 --b--> 10)

Dstates:
  A = {0,1,2,4,7}
  B = {1,2,3,4,6,7,8}
  C = {1,2,4,5,6,7}
  D = {1,2,4,5,6,7,9}
  E = {1,2,4,5,6,7,10}

DFA transitions:
  A --a--> B,  A --b--> C
  B --a--> B,  B --b--> D
  C --a--> B,  C --b--> C
  D --a--> B,  D --b--> E
  E --a--> B,  E --b--> C   (E is the accepting state)
Subset Construction Example 2

(applied to the combined NFA for a { action1 }, abb { action2 }, a*b+ { action3 })

Dstates:
  A = {0,1,3,7}
  B = {2,4,7}
  C = {8}
  D = {7}
  E = {5,8}
  F = {6,8}

DFA transitions:
  A --a--> B,  A --b--> C
  B --a--> D,  B --b--> E
  C --b--> C
  D --a--> D,  D --b--> C
  E --b--> F
  F --b--> C

Accepting states and actions: B → action1, C → action3, E → action3, F → action2 (F contains both accepting NFA states 6 and 8; action2 is listed first in the Lex specification).
Minimizing the Number of States of a DFA

In the DFA from Example 1, states A and C are equivalent and can be merged:

  before: start → A --a--> B --b--> D --b--> E, with A --b--> C, C --b--> C, C --a--> B, E --b--> C, and a-edges from B, D, E back to B
  after:  start → AC --a--> B --b--> D --b--> E, with AC --b--> AC, E --b--> AC, and a-edges from B, D, E back to B
From Regular Expression to DFA Directly

• The "important states" of an NFA are those without an ε-transition; that is, s is an important state if move({s},a) ≠ ∅ for some a
• The subset construction algorithm uses only the important states when it determines ε-closure(move(T,a))
From Regular Expression to DFA Directly (Algorithm)

• Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#
• Construct a syntax tree for r#
• Traverse the tree to construct the functions nullable, firstpos, lastpos, and followpos
From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#

  •  (concatenation)
  ├── •
  │   ├── •
  │   │   ├── •
  │   │   │   ├── *  (closure)
  │   │   │   │   └── |  (alternation)
  │   │   │   │       ├── a  (position 1)
  │   │   │   │       └── b  (position 2)
  │   │   │   └── a  (position 3)
  │   │   └── b  (position 4)
  │   └── b  (position 5)
  └── #  (position 6)

(position numbers are attached to the leaves)
From Regular Expression to DFA Directly: Annotating the Tree

• nullable(n): true if the subtree at node n generates a language including the empty string
• firstpos(n): the set of positions that can match the first symbol of a string generated by the subtree at node n
• lastpos(n): the set of positions that can match the last symbol of a string generated by the subtree at node n
• followpos(i): the set of positions that can follow position i in the tree
From Regular Expression to DFA Directly: Annotating the Tree

For each node n of the syntax tree:

• leaf ε: nullable = true; firstpos = ∅; lastpos = ∅
• leaf with position i: nullable = false; firstpos = {i}; lastpos = {i}
• n = c1 | c2: nullable = nullable(c1) or nullable(c2); firstpos = firstpos(c1) ∪ firstpos(c2); lastpos = lastpos(c1) ∪ lastpos(c2)
• n = c1 c2 (concatenation): nullable = nullable(c1) and nullable(c2); firstpos = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1); lastpos = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
• n = c1* (closure): nullable = true; firstpos = firstpos(c1); lastpos = lastpos(c1)
From Regular Expression to DFA Directly: Annotated Syntax Tree of (a|b)*abb#

(each node is annotated firstpos / lastpos)

  root •:        {1,2,3} / {6}
  •:             {1,2,3} / {5},  leaf # (6): {6} / {6}
  •:             {1,2,3} / {4},  leaf b (5): {5} / {5}
  •:             {1,2,3} / {3},  leaf b (4): {4} / {4}
  * (nullable):  {1,2} / {1,2},  leaf a (3): {3} / {3}
  |:             {1,2} / {1,2}
  leaf a (1): {1} / {1},  leaf b (2): {2} / {2}
From Regular Expression to DFA Directly: followpos

for each node n in the tree do
  if n is a cat-node with left child c1 and right child c2 then
    for each i in lastpos(c1) do
      followpos(i) := followpos(i) ∪ firstpos(c2)
    end do
  else if n is a star-node then
    for each i in lastpos(n) do
      followpos(i) := followpos(i) ∪ firstpos(n)
    end do
  end if
end do
From Regular Expression to DFA Directly: Algorithm

s0 := firstpos(root) where root is the root of the syntax tree
Dstates := {s0}, unmarked
while there is an unmarked state T in Dstates do
  mark T
  for each input symbol a ∈ Σ do
    let U be the set of positions that are in followpos(p)
      for some position p in T such that the symbol at position p is a
    if U is not empty and not in Dstates then
      add U as an unmarked state to Dstates
    end if
    Dtran[T,a] := U
  end do
end do
From Regular Expression to DFA Directly: Example

followpos for (a|b)*abb#:
  followpos(1) = {1, 2, 3}
  followpos(2) = {1, 2, 3}
  followpos(3) = {4}
  followpos(4) = {5}
  followpos(5) = {6}
  followpos(6) = -

Resulting DFA (start state firstpos(root) = {1,2,3}):
  {1,2,3}   --a--> {1,2,3,4},  --b--> {1,2,3}
  {1,2,3,4} --a--> {1,2,3,4},  --b--> {1,2,3,5}
  {1,2,3,5} --a--> {1,2,3,4},  --b--> {1,2,3,6}
  {1,2,3,6} --a--> {1,2,3,4},  --b--> {1,2,3}

({1,2,3,6} is accepting: it contains position 6, the position of #)
Time-Space Tradeoffs

Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r| × |x|)
DFA         O(2^|r|)             O(|x|)