Lexical Analysis
By:
Trusha R. Patel
Asst. Prof.
CE Dept, CSPIT, CHARUSAT
Role of Lexical Analyzer
1. The main task is to read the input characters of the
source program, group them into lexemes,
and produce as output a sequence of tokens,
one for each lexeme in the program
2. Interact with the symbol table: when it discovers
a lexeme constituting an identifier, it needs to
enter that lexeme into the symbol table
3. Strip out comments and whitespace
4. Generate error messages (lexical errors)
5. Keep track of the line number so it can
associate a line number with each error
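Interaction of Lexical Analyzer with Syntax Analyzer
[Figure: the source program is read by the lexical analyzer (scanner); the syntax analyzer (parser) calls getNextToken and the lexical analyzer returns the next token; the parser's output goes to the semantic analyzer; both the lexical analyzer and the parser interact with the symbol table]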
Separating Lexical Analysis from Syntax Analysis
1. Simplicity of design
e.g. it is complex for the parser to deal with
comments and whitespace as syntactic units,
so they are removed during lexical analysis
2. Compiler efficiency is improved, as the separation allows
specialized techniques that serve
only the lexical task, not the parsing job
e.g. specialized input buffering techniques for
reading input characters can speed up the
compiler
3. Compiler portability is enhanced
input-device-specific peculiarities can be
restricted to the lexical analyzer
Token, Pattern, Lexeme
Token
A pair consisting of a token name and an attribute
value
The token name is generally written in boldface
A token is referred to by its name
Pattern
A description of the form that the lexemes of a token
may take
Lexeme
A sequence of characters in the source program
that matches the pattern for a token
Token        Informal description                     Sample lexemes
if           Characters i , f                         if
else         Characters e , l , s , e                 else
comparison   < or > or <= or >= or == or !=           <= , !=
id           Letter followed by letters and digits    pi , score , D2
number       Any numeric constant                     3.14159 , 6.02e23
literal      Anything but “ , surrounded by “         “core dumped”
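The table above can be tried out with a small sketch. It assumes Python's re module, and the PATTERNS list and the token_name helper are hypothetical names introduced only for this illustration. Each pattern is an informal rendering of the description in the table, not a complete lexer specification.

```python
import re

# Hypothetical pattern table: (token name, regular expression for its pattern).
# Keywords are listed before id so that "if" is not classified as an identifier.
PATTERNS = [
    ("if", r"if"),
    ("else", r"else"),
    ("comparison", r"<=|>=|==|!=|<|>"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("number", r"[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?"),
    ("literal", r'"[^"]*"'),
]

def token_name(lexeme):
    """Return the name of the first token whose pattern matches the whole lexeme."""
    for name, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return name
    return None  # no pattern matches: not a valid lexeme in this sketch

for lexeme in ["if", "else", "<=", "!=", "pi", "D2", "3.14159", "6.02e23", '"core dumped"']:
    print(f"{lexeme!r:16} -> {token_name(lexeme)}")
```

Here the lexeme is the actual character sequence, the regular expression is the pattern, and the returned name is the token name.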
Attributes for Token
When more than one lexeme can match a
pattern, the lexical analyzer must provide
additional information to the subsequent
compiler phases
So the lexical analyzer returns the token name together with
an attribute value
A token has at most one associated attribute,
although this attribute may have a structure
that combines several pieces of information
Example
Token id
Information about an identifier is kept in the symbol table
The attribute value for an identifier is a pointer to the
symbol-table entry for that identifier
Example
Token names and associated attribute values for
E = M * C ** 2
< id , pointer to symbol-table entry of E >
< assign-op >
< id , pointer to symbol-table entry of M >
< mult-op >
< id , pointer to symbol-table entry of C >
< exp-op >
< number , integer value 2 >
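The token sequence for E = M * C ** 2 can be reproduced with a minimal sketch. The TOKEN_SPEC list and the tokenize function are hypothetical, and a list index stands in for the "pointer to symbol-table entry"; a real compiler would use its own symbol-table structure.

```python
import re

# Hypothetical token specification, tried in order at each position.
TOKEN_SPEC = [
    ("ws", re.compile(r"\s+")),
    ("number", re.compile(r"[0-9]+")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("exp-op", re.compile(r"\*\*")),   # tried before mult-op so ** is one lexeme
    ("mult-op", re.compile(r"\*")),
    ("assign-op", re.compile(r"=")),
]

def tokenize(source):
    symbol_table = []   # identifier lexemes; the index plays the role of the pointer
    tokens = []
    pos = 0
    while pos < len(source):
        for name, rx in TOKEN_SPEC:
            m = rx.match(source, pos)
            if m:
                break
        else:
            raise ValueError(f"lexical error at position {pos}")
        lexeme = m.group()
        pos = m.end()
        if name == "ws":
            continue                                  # whitespace is stripped
        if name == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((name, symbol_table.index(lexeme)))
        elif name == "number":
            tokens.append((name, int(lexeme)))        # attribute: integer value
        else:
            tokens.append((name, None))               # operators carry no attribute
    return tokens, symbol_table

tokens, symbol_table = tokenize("E = M * C ** 2")
for tok in tokens:
    print(tok)
```

Operators need no attribute because the token name alone identifies the lexeme, while id and number need one because many lexemes match their patterns.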
Lexical Error
Suppose fi is encountered for the first time in a C program:
fi ( a == f(x) ) …
The lexical analyzer cannot tell whether fi is
a misspelling of the keyword if or an undeclared
function identifier
Since fi is a valid lexeme for the token id, the lexical
analyzer returns the token id to the parser and
lets the parser handle the error
Sometimes the lexical analyzer is unable to proceed
because none of the patterns for tokens
matches any prefix of the remaining input
The simplest recovery strategy is “panic mode”
recovery:
delete successive characters from the
remaining input until the lexical analyzer can
find a well-formed token
Other possible error-recovery actions are:
Delete one character from the remaining input
Insert a missing character into the remaining input
Replace a character by another character
Transpose two adjacent characters
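Panic-mode recovery can be sketched as follows. The token patterns in TOKEN_RX and the function scan_with_panic_mode are assumptions made for this illustration; the sketch also records the line number of each error, as described for the role of the lexical analyzer.

```python
import re

# Hypothetical token patterns; any other character is a lexical error.
TOKEN_RX = re.compile(
    r"(?P<id>[A-Za-z_][A-Za-z_0-9]*)|(?P<number>[0-9]+)|(?P<op>[=+\-*/;()])"
)

def scan_with_panic_mode(source):
    tokens, errors = [], []
    pos, line = 0, 1
    while pos < len(source):
        ch = source[pos]
        if ch == "\n":
            line += 1
            pos += 1
        elif ch.isspace():
            pos += 1
        else:
            m = TOKEN_RX.match(source, pos)
            if m:
                tokens.append((m.lastgroup, m.group()))
                pos = m.end()
            else:
                # Panic mode: delete successive characters until a token can start.
                start = pos
                while (pos < len(source) and not source[pos].isspace()
                       and not TOKEN_RX.match(source, pos)):
                    pos += 1
                errors.append((line, source[start:pos]))
    return tokens, errors

tokens, errors = scan_with_panic_mode("a = 3 @# b;\nc = $4;")
print(tokens)
print(errors)
```

The deleted characters are reported, and scanning resumes at the first position where some pattern matches again.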
Input Buffering
In C, a single-character operator like > , = , <
can also be the beginning of a two-character
operator >= , == , <=
We often have to look one or more characters
beyond the next lexeme before we can be
sure we have the right lexeme
A two-buffer scheme is introduced to handle
large lookaheads safely
Buffer Pairs
A large amount of time is taken to process
characters
This buffering technique has been developed
to reduce the amount of overhead required to
process a single input character
It involves two buffers that are alternately
reloaded
Each buffer is of the same size N (usually the
size of a disk block)
[Figure: Buffer 1 and Buffer 2, each of size N]
One system read command reads N
characters into a buffer, instead of one character per call
If fewer than N characters remain in the input file,
a special character eof marks the end of the
source file
Two pointers are maintained
lexemeBegin
Marks the beginning of the current lexeme
forward
Scans ahead until a pattern match is found
Example
Input: a b = 2 ; d = a b + 3 ;
Buffer 1 (N characters loaded): a b = 2 ;
Buffer 2 (N characters loaded): d = a b + 3 ;
1. lexemeBegin and forward start at a; forward scans ahead while the input matches id
2. When forward reaches the start of a new token, it is moved back one position to the left
3. The id token is generated for ab; lexemeBegin is then set to the position just after forward
4. Tokens are generated for = , 2 and ; in the same way
5. The forward pointer then reaches the end of the first buffer, so the next buffer is loaded
6. The pointers are set in the new buffer and the same process continues there
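A minimal sketch of the alternating reload is given below. The class BufferPair, the EOF sentinel value, and the reads counter are hypothetical names for this illustration. Only the forward pointer is modeled; lexemeBegin and retraction are left out to keep the sketch short.

```python
import io

EOF = "\0"   # sentinel marking the end of the source (an assumption of this sketch)

class BufferPair:
    """Two buffers of size n that are reloaded alternately from a stream."""

    def __init__(self, stream, n=8):
        self.stream, self.n = stream, n
        self.buffers = ["", ""]
        self.reads = 0          # number of block reads issued so far
        self.current = 0        # which buffer the forward pointer is in
        self.forward = 0        # index of the next character in that buffer
        self._load(0)

    def _load(self, which):
        block = self.stream.read(self.n)   # one read command fetches up to n characters
        self.reads += 1
        if len(block) < self.n:
            block += EOF                   # fewer than n characters left: mark the end
        self.buffers[which] = block

    def next_char(self):
        if self.forward == self.n:         # end of current buffer: reload the other one
            self.current = 1 - self.current
            self._load(self.current)
            self.forward = 0
        ch = self.buffers[self.current][self.forward]
        if ch != EOF:
            self.forward += 1
        return ch

buf = BufferPair(io.StringIO("ab=2; d=ab+3;"), n=8)
chars = []
while (c := buf.next_char()) != EOF:
    chars.append(c)
text = "".join(chars)
print(text, buf.reads)
```

With n = 8 the 13 input characters are delivered using two block reads rather than 13 single-character reads.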
Symbol, Alphabet, String, Language
Symbol
It can be a letter, digit, punctuation mark or operator
E.g.
a-z A-Z 0-9 ; = + , etc.
Alphabet (Σ)
Finite set of symbols
E.g.
{0,1} { a , b , c } etc.
String
A string over an alphabet is a finite sequence of symbols
drawn from that alphabet
E.g.
for alphabet {0,1} 0101 can be a string
The empty string is denoted by ϵ
Language (L)
Any countable set of strings over some fixed
alphabet
E.g.
for alphabet {0,1} the language of all strings of
length 2 is { 00 , 01 , 10 , 11 }
Terms for Parts of String
Prefix
String obtained by removing zero or more symbols
from the end of the string
E.g. all possible prefixes of banana are:
ϵ b ba ban bana banan banana
Suffix
String obtained by removing zero or more symbols
from the beginning of the string
E.g. all possible suffixes of banana are:
banana anana nana ana na a ϵ
Substring
Obtained by deleting any prefix and any suffix
from the string
E.g. some substrings of banana are
nan anan …
Proper prefixes, suffixes and substrings
A prefix, suffix or substring is called proper if
it is not ϵ and not equal to the string itself
Subsequence
Obtained by deleting zero or more not necessarily consecutive
positions from the string
E.g. some subsequences of banana are
baan bn …
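These definitions can be checked with a few short helper functions. The function names are hypothetical and the empty Python string "" plays the role of ϵ.

```python
def prefixes(s):
    """All prefixes of s, from ϵ up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to ϵ."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All substrings: delete any prefix and any suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def is_subsequence(t, s):
    """True if t can be obtained from s by deleting zero or more positions."""
    it = iter(s)
    return all(ch in it for ch in t)   # each ch must be found after the previous one

def is_proper(part, s):
    """A prefix, suffix or substring is proper if it is neither ϵ nor s."""
    return part != "" and part != s

print(prefixes("banana"))
print(suffixes("banana"))
print(is_subsequence("baan", "banana"), "baan" in substrings("banana"))
```

The last line illustrates the difference between the two notions: baan is a subsequence of banana but not a substring.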
Operations on Languages
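OPERATION                  DEFINITION & NOTATION
Union of L and M           L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M   LM = { st | s is in L and t is in M }
Kleene closure of L        L* = L^0 ∪ L^1 ∪ L^2 ∪ …   (zero or more concatenations of L)
Positive closure of L      L+ = L^1 ∪ L^2 ∪ L^3 ∪ …   (one or more concatenations of L)
Here L and M are two languages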
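For finite languages these operations can be sketched with Python sets. The function names are hypothetical, and because the closures are infinite in general, the sketch only builds a finite approximation up to a chosen max_power.

```python
def union(L, M):
    return L | M

def concat(L, M):
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i: i-fold concatenation of L with itself, with L^0 = {ϵ}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_closure(L, max_power):
    """Finite approximation of L*: union of L^0 .. L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result

def positive_closure(L, max_power):
    """Finite approximation of L+: union of L^1 .. L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(kleene_closure(L, 2)))
```

The only difference between the two closures is whether L^0 = {ϵ} is included.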
Regular Expression
Notation   Meaning
ϵ          Empty string
| or /     Choice
*          Kleene closure: 0 or more occurrences
+          Positive closure: 1 or more occurrences
?          0 or 1 occurrence
()         Grouping of a subexpression
{}         Used to define a range of occurrences
ø          Null set
U          Union
[]         Character set
^          Complement of a character set
Algebraic laws for Regular Expression
LAW                                  DESCRIPTION
r|s = s|r                            | is commutative
r|(s|t) = (r|s)|t                    | is associative
r(st) = (rs)t                        Concatenation is associative
r(s|t) = rs|rt ; (s|t)r = sr|tr      Concatenation distributes over |
ϵr = rϵ = r                          ϵ is the identity for concatenation
r* = ( r | ϵ )*                      ϵ is guaranteed in a closure
r** = r*                             * is idempotent
Regular expression
RE for identifier
[ a-z A-Z _ ] [ a-z A-Z _ 0-9 ] *
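The identifier expression can be tried directly, assuming Python's re module. IDENTIFIER and is_identifier are hypothetical names; fullmatch is used so that the whole string must match the pattern.

```python
import re

# The RE for identifiers: a letter or underscore, then letters, underscores or digits.
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

def is_identifier(s):
    return IDENTIFIER.fullmatch(s) is not None

for s in ["score", "_tmp1", "D2", "2pi", "a-b", ""]:
    print(f"{s!r:10} {is_identifier(s)}")
```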
Regular Definition
Allows names to be given to certain regular
expressions, and those names can be used in
subsequent expressions
A regular definition is a sequence of definitions of
the form
d1 → r1
d2 → r2
…
dn → rn
where
1) each di is a new symbol, not in Σ and not
the same as any other of the d's
2) each ri is a regular expression over the
alphabet Σ ∪ { d1 , d2 , … , di-1 }
Regular definition for identifier
letter → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Using shorthands
letter → [ A-Z a-z _ ]
digit → [0-9]
id → letter ( letter | digit )*
Regular definition for unsigned number
digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ϵ
optionalExponent → ( E ( + | - | ϵ ) digits ) | ϵ
number → digits optionalFraction optionalExponent
Using shorthands
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
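A regular definition maps naturally onto named string fragments that are combined into one expression. The sketch below assumes Python's re module; the variable names mirror the definition above, and is_unsigned_number is a hypothetical helper.

```python
import re

# Each name is defined only in terms of earlier names, as in a regular definition.
digit = r"[0-9]"
digits = rf"{digit}+"
optional_fraction = rf"(\.{digits})?"
optional_exponent = rf"(E[+-]?{digits})?"
number = re.compile(digits + optional_fraction + optional_exponent)

def is_unsigned_number(s):
    return number.fullmatch(s) is not None

for s in ["42", "3.14159", "6.02E23", "2E-10", "1.", ".5", "1E"]:
    print(f"{s!r:10} {is_unsigned_number(s)}")
```

Strings such as 1. and .5 are rejected because the definition requires at least one digit on both sides of the decimal point.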
Transition Diagram
An intermediate step in the construction of a lexical
analyzer is the conversion of patterns into stylized
flowcharts called “transition diagrams”
A transition diagram has a collection of nodes
or circles called “states”
Each state represents a condition that could
occur during the process of scanning the input
looking for a lexeme that matches one of
several patterns
“Edges” are directed from one state to another,
labeled by a symbol or set of symbols
If the current state is “s” and the input symbol is “a”,
then look for an edge out of “s” labeled “a”; if
found, advance the forward pointer in the
buffer and enter the state to which
that edge leads
Assume that all transition diagrams are
deterministic, meaning there is never more than
one edge out of a state with a given symbol
among its labels
The “start state” or “initial state” is indicated by
an edge labeled “start”
A transition diagram always begins in the start
state
Certain states are called “accepting states” or
“final states”; they indicate that a lexeme has been
found and are drawn as a double circle; if any
action is to be taken, that action is attached to the
accepting state
If it is necessary to retract the forward pointer
(at that point you cannot decide whether the
token can be generated, so you need to
read one more character), a * is placed near that accepting state
LEXEME       TOKEN NAME   ATTRIBUTE VALUE
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE
if           if           -
then         then         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
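Transition Diagram (Recognition of relational
operators)
start → state 0
state 0 on < → state 1
state 1 on = → state 2 : return (relop , LE)
state 1 on > → state 3 : return (relop , NE)
state 1 on other → state 4 * : return (relop , LT)
state 0 on = → state 5 : return (relop , EQ)
state 0 on > → state 6
state 6 on = → state 7 : return (relop , GE)
state 6 on other → state 8 * : return (relop , GT)
( * means the forward pointer is retracted one position )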
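The relop diagram can be simulated with a state variable and a forward index. The function get_relop and its return convention are assumptions of this sketch; the state numbers in the comments follow the diagram.

```python
def get_relop(text, pos):
    """Simulate the relop transition diagram starting at text[pos].

    Returns ((token name, attribute), new position),
    or (None, pos) if no relational operator starts at pos.
    """
    state = 0
    forward = pos
    while True:
        c = text[forward] if forward < len(text) else ""   # "" stands for end of input
        forward += 1
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), forward        # state 5
            elif c == ">":
                state = 6
            else:
                return None, pos                       # not a relational operator
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), forward        # state 2
            if c == ">":
                return ("relop", "NE"), forward        # state 3
            return ("relop", "LT"), forward - 1        # state 4 *: retract
        elif state == 6:
            if c == "=":
                return ("relop", "GE"), forward        # state 7
            return ("relop", "GT"), forward - 1        # state 8 *: retract

for src in ["<=", "<>", "<a", "=", ">=", ">"]:
    print(f"{src!r:6} {get_relop(src, 0)}")
```

In the two starred states the extra character read as "other" is given back by returning forward - 1.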
Transition Diagram (Recognition of identifiers and
keywords)
start → state 9
state 9 on letter → state 10
state 10 on letter or digit → state 10
state 10 on other → state 11 * : return ( getToken() , installID() )
The problem is to differentiate keywords from
identifiers
Two solutions
1. Install the keywords in the symbol table initially
One field of the symbol table indicates whether an entry is a keyword or an
identifier
When a pattern matches, getToken looks the lexeme up in the symbol table
If it is found, a pointer to its symbol-table entry is returned
If not, installID enters that identifier in the symbol table
and returns a pointer to the new entry
2. Create a separate transition diagram for each
keyword
The edges show the successive letters of the keyword
The last edge tests for “nonletter-or-digit”
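The first solution can be sketched with a list acting as the symbol table. The names install_id and get_token are Python stand-ins for installID and getToken; the entry layout, the install helper, and the chosen keyword set are assumptions of this illustration.

```python
symbol_table = []   # each entry: {"lexeme": ..., "token": ...}

def install(lexeme, token):
    symbol_table.append({"lexeme": lexeme, "token": token})
    return len(symbol_table) - 1       # the index stands in for a pointer

# Keywords are installed before scanning starts (hypothetical keyword set).
for kw in ("if", "then", "else"):
    install(kw, kw)

def install_id(lexeme):
    """Return the entry index for lexeme, adding it as an id if it is missing."""
    for i, entry in enumerate(symbol_table):
        if entry["lexeme"] == lexeme:
            return i
    return install(lexeme, "id")

def get_token(lexeme):
    """Return the token name stored for lexeme: a keyword name or id."""
    return symbol_table[install_id(lexeme)]["token"]

print(get_token("if"), get_token("count"), install_id("count"))
```

Because keywords are already present when scanning begins, a lexeme such as if is never entered as an ordinary identifier.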
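Transition Diagram (Recognition of unsigned numbers)
[Figure: from the start state, digit leads to a state that loops on digit (integer part); from there, . followed by digit leads to a state that loops on digit (fraction); from either of these states, E, then an optional + or -, then digit lead to a state that loops on digit (exponent); each of the three looping states has an edge labeled other to an accepting state marked *]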
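Transition Diagram (Recognition of whitespace)
[Figure: from the start state, delim leads to a state that loops on delim; from it, other leads to an accepting state marked *]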
Lexical Analyzer Generator LEX
Lex source file (.l) → LEX Compiler → lex.yy.c
lex.yy.c → C Compiler → a.out
Input stream → a.out → Sequence of tokens
Structure of Lex Program
declarations
%%
translation rules
%%
auxiliary functions
declarations :- declaration of variables, constants and regular definitions
translation rules :- of the form Pattern { Action } where
Pattern :- regular expression, which may use regular definitions from the declarations section
Action :- fragment of code written in C
auxiliary functions :- hold additional functions used in the actions
Finite Automata
They are recognizers
They simply say “yes” or “no” about each possible
input string
Two flavors
NFA (Nondeterministic Finite Automata)
ϵ is also a possible label
A symbol can label several edges out of the same state
DFA (Deterministic Finite Automata)
For each state and each symbol, exactly one edge with that symbol
leaves that state
NFA
Consists of
S : finite set of states
Σ : set of input symbols, called the input alphabet
Transition function : gives, for each state and each input symbol (or ϵ), a set
of next states
s0 : start state or initial state, from S
F : set of accepting states, a subset of S
DFA
Special case of NFA
No moves on input ϵ
For each state “s” and input symbol “a”, there is exactly one
edge out of “s” labeled “a”
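The difference between the two kinds of automata shows up clearly in how they are simulated. In the sketch below the transition tables nfa_delta and dfa_delta are hand-written examples for (a|b)*ab, and all function names are hypothetical. The NFA maps a state and symbol to a set of states, while the DFA maps them to exactly one state.

```python
EPS = ""   # label used for ϵ-moves in this sketch

def eps_closure(states, delta):
    """All states reachable from `states` using ϵ-moves only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(delta, start, accepting, text):
    current = eps_closure({start}, delta)      # a set of states
    for ch in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, ch), set())
        current = eps_closure(moved, delta)
    return bool(current & accepting)

def dfa_accepts(delta, start, accepting, text):
    state = start                              # a single state
    for ch in text:
        state = delta[(state, ch)]
    return state in accepting

# NFA for (a|b)*ab: state 0 has two edges labeled a.
nfa_delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}

# An equivalent DFA: exactly one edge per state and symbol.
dfa_delta = {("A", "a"): "B", ("A", "b"): "A",
             ("B", "a"): "B", ("B", "b"): "C",
             ("C", "a"): "B", ("C", "b"): "A"}

for s in ["ab", "aab", "abb", "ba"]:
    print(s, nfa_accepts(nfa_delta, 0, {2}, s), dfa_accepts(dfa_delta, "A", {"C"}, s))
```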
RE to NFA (using Thompson Construction)
McNaughton-Yamada-Thompson algorithm
Works recursively
For each subexpression it constructs an NFA with
a single accepting state
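[Figure: Thompson construction building blocks]
For a symbol s: a start state with one edge labeled s to an accepting state
For s|t: a new start state has ϵ-edges to the start states of N(s) and N(t), and the accepting states of N(s) and N(t) have ϵ-edges to a new accepting state
For st: the accepting state of N(s) is merged with the start state of N(t)
For s*: a new start state has ϵ-edges to the start state of N(s) and to a new accepting state; the accepting state of N(s) has ϵ-edges back to the start state of N(s) and to the new accepting state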
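A compact sketch of the recursive construction follows. The tuple encoding of the expression tree, the edge-list representation, and all function names are assumptions made for this illustration. One deliberate simplification: for concatenation the sketch links N(s) to N(t) with an ϵ-edge instead of merging the two states, which accepts the same language.

```python
import itertools

EPS = None                       # label for ϵ-edges in this sketch
_ids = itertools.count()         # generator of fresh state numbers

def thompson(node, edges):
    """Build N(node); return (start, accept). Edges are (from, label, to) triples."""
    kind = node[0]
    if kind == "sym":                                  # basis: a single symbol
        s, f = next(_ids), next(_ids)
        edges.append((s, node[1], f))
        return s, f
    if kind == "cat":                                  # st
        s1, f1 = thompson(node[1], edges)
        s2, f2 = thompson(node[2], edges)
        edges.append((f1, EPS, s2))                    # ϵ-link instead of merging states
        return s1, f2
    if kind == "or":                                   # s|t
        s, f = next(_ids), next(_ids)
        s1, f1 = thompson(node[1], edges)
        s2, f2 = thompson(node[2], edges)
        edges += [(s, EPS, s1), (s, EPS, s2), (f1, EPS, f), (f2, EPS, f)]
        return s, f
    if kind == "star":                                 # s*
        s, f = next(_ids), next(_ids)
        s1, f1 = thompson(node[1], edges)
        edges += [(s, EPS, s1), (s, EPS, f), (f1, EPS, s1), (f1, EPS, f)]
        return s, f
    raise ValueError(kind)

def closure(states, edges):
    """ϵ-closure of a set of states."""
    stack, seen = list(states), set(states)
    while stack:
        u = stack.pop()
        for a, label, b in edges:
            if a == u and label is EPS and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def accepts(regex_tree, text):
    edges = []
    start, accept = thompson(regex_tree, edges)
    current = closure({start}, edges)
    for ch in text:
        step = {b for a, label, b in edges if a in current and label == ch}
        current = closure(step, edges)
    return accept in current

# (a|b)*ab written as a hand-built tree
A, B = ("sym", "a"), ("sym", "b")
TREE = ("cat", ("cat", ("star", ("or", A, B)), A), B)
for s in ["ab", "abab", "bbaab", "a", "aba"]:
    print(s, accepts(TREE, s))
```

Every subexpression yields an NFA with exactly one start state and one accepting state, which is what makes the pieces easy to combine.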
RE to NFA (using Thompson Construction)
Example: (a|b)*ab
[Figure: NFA for the example, built with the Thompson construction]
RE to NFA
More examples
(a|b)*
a * | b *
( a * | b ) * c a *
( a | b ) * c * d
a * b a
a + b * ( c | d )
a + b ( c | d ) a * b
( ( a * b * ) c ) * a *
( a | b )* a b b
a a* | b b*
RE to DFA
Steps
1. Add # at the end of the RE
2. Construct the syntax tree of the RE
3. Number all leaf positions (symbols), starting with 1
4. Find FIRSTPOS for every node in the constructed tree
5. Find FOLLOWPOS for every position in the constructed
tree
6. Take FIRSTPOS of the root as the start state A
7. Find the next state from A for every possible input symbol
8. If a new set appears, give it a new name (i.e. B, C,
D, …)
9. Continue until no new set appears
10. Make the transition table
nullable( ) , firstpos( ) , followpos( )
nullable(n)
If “n” is a leaf node labeled ϵ then nullable(n) is
TRUE
If “n” is a leaf node labeled with a position then nullable(n) is
FALSE
firstpos(n)
Set of positions that can come first
in a string generated by the subexpression rooted at “n”
followpos(p)
Set of positions that can immediately follow
position “p” in a string generated by the whole expression
NODE n                   nullable(n)                     firstpos(n)
A leaf labeled ϵ         true                            Ø
A leaf with position i   false                           {i}
An or-node n = c1|c2     nullable(c1) or nullable(c2)    firstpos(c1) ∪ firstpos(c2)
A cat-node n = c1c2      nullable(c1) and nullable(c2)   if (nullable(c1)) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
A star-node n = c1*      true                            firstpos(c1)
FIRSTPOS rules on the syntax tree
Leaf n1 :
FIRSTPOS (n1) = { n1 }
Star node * with child n1 :
FIRSTPOS ( * ) = FIRSTPOS (n1)
Or node / with children n1 and n2 :
FIRSTPOS ( / ) = FIRSTPOS (n1) U FIRSTPOS (n2)
Cat node . with children n1 and n2 :
FIRSTPOS ( . ) = FIRSTPOS (n1) U FIRSTPOS (n2) if n1 is nullable
FIRSTPOS ( . ) = FIRSTPOS (n1) if n1 is not nullable
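The rules can be applied mechanically by one recursive walk over the syntax tree. In the sketch below the tuple encoding of nodes and the function analyze are hypothetical. The lastpos rules and the two followpos rules (for cat-nodes and star-nodes) are not spelled out on the slides; the sketch assumes the standard ones, with lastpos mirroring firstpos.

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) of node and fill followpos in place.

    Node shapes (hypothetical encoding): ("leaf", position), ("eps",),
    ("or", c1, c2), ("cat", c1, c2), ("star", c1).
    """
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "leaf":
        followpos.setdefault(node[1], set())
        return False, {node[1]}, {node[1]}
    if kind == "or":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                    # positions ending c1 are followed by firstpos(c2)
            followpos[i] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        n1, f1, l1 = analyze(node[1], followpos)
        for i in l1:                    # the starred body may repeat
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(kind)

# (a|b)*ab# with positions a = 1, b = 2, a = 3, b = 4, # = 5
TREE = ("cat", ("cat", ("cat", ("star", ("or", ("leaf", 1), ("leaf", 2))),
                        ("leaf", 3)), ("leaf", 4)), ("leaf", 5))
followpos = {}
nullable, firstpos, lastpos = analyze(TREE, followpos)
print(nullable, firstpos, lastpos)
for p in sorted(followpos):
    print("FOLLOWPOS", p, "=", followpos[p])
```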
RE to DFA (using firstpos, followpos, lastpos)
Example: ( a | b ) * a b #
Positions: a = 1 , b = 2 , a = 3 , b = 4 , # = 5
FIRSTPOS of each node of the syntax tree
leaf a (1) : {1}
leaf b (2) : {2}
or-node a / b : {1,2}
star-node ( a | b ) * : {1,2}
leaf a (3) : {3}
cat-node ( a | b ) * a : {1,2,3}
leaf b (4) : {4}
cat-node ( a | b ) * a b : {1,2,3}
leaf # (5) : {5}
root cat-node ( a | b ) * a b # : {1,2,3}
FOLLOWPOS (1) = { 1 , 2 , 3 }
FOLLOWPOS (2) = { 1 , 2 , 3 }
FOLLOWPOS (3) = { 4 }
FOLLOWPOS (4) = { 5 }
FOLLOWPOS (5) = { } = ø
Start state A = FIRSTPOS (root) = { 1 , 2 , 3 }
From A = { 1 , 2 , 3 }
i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
i/p b = FOLLOWPOS (2) = { 1 , 2 , 3 } = A
i/p # = ---
From B = { 1 , 2 , 3 , 4 }
i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
i/p b = FOLLOWPOS (2) U FOLLOWPOS (4) = { 1 , 2 , 3 , 5 } = C
i/p # = ---
From C = { 1 , 2 , 3 , 5 }
i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
i/p b = FOLLOWPOS (2) = { 1 , 2 , 3 } = A
i/p # = FOLLOWPOS (5) = { } = D
From D = { }
i/p a = ---
i/p b = ---
i/p # = ---
Transition table
STATE   POSITIONS           a   b   #
A       { 1 , 2 , 3 }       B   A   -
B       { 1 , 2 , 3 , 4 }   B   C   -
C       { 1 , 2 , 3 , 5 }   B   A   D
D       { }                 -   -   -
Initial state: A
Final state: C (the state that contains position 5, the position of #)
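The state-by-state steps of the worked example can be automated. The sketch below takes the FOLLOWPOS values and position numbering from the example as given; the function build_dfa and its worklist are hypothetical. States are named A, B, C, … in order of discovery, and a missing table entry corresponds to a - in the transition table.

```python
# Values taken from the worked example: (a|b)*ab# with positions 1..5
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "#"}
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: set()}
FIRSTPOS_ROOT = {1, 2, 3}

def build_dfa(firstpos_root, followpos, symbol_at):
    start = frozenset(firstpos_root)
    names = {start: "A"}                 # set of positions -> state name
    table = {}                           # (state name, symbol) -> state name
    worklist = [start]
    while worklist:
        state = worklist.pop(0)
        for sym in sorted(set(symbol_at.values())):
            positions = [p for p in state if symbol_at[p] == sym]
            if not positions:
                continue                 # no edge on this symbol
            # next state = union of followpos(p) over positions p labeled sym
            target = frozenset().union(*(followpos[p] for p in positions))
            if target not in names:      # a new set gets a new name
                names[target] = chr(ord("A") + len(names))
                worklist.append(target)
            table[(names[state], sym)] = names[target]
    return names, table

names, table = build_dfa(FIRSTPOS_ROOT, FOLLOWPOS, SYMBOL_AT)
for positions, name in names.items():
    print(name, sorted(positions))
for key in sorted(table):
    print(key, "->", table[key])
```

The loop ends when the worklist is empty, which is the point at which no new set appears.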
RE to DFA
More examples
( a* | b* ) c* a
( ( a* b a ) | ( b* a ) ) ( a b )*
( a | b )* a ( a| b )
( ( a* b* ) c )* a*
RE to NFA (using firstpos, followpos, lastpos)
Example: ( a | b ) * a b #
Positions: a = 1 , b = 2 , a = 3 , b = 4 , # = 5
[Figure: syntax tree annotated with FIRSTPOS; FIRSTPOS of the root is {1,2,3}]
FOLLOWPOS (1) = { 1 , 2 , 3 }
FOLLOWPOS (2) = { 1 , 2 , 3 }
FOLLOWPOS (3) = { 4 }
FOLLOWPOS (4) = { 5 }
FOLLOWPOS (5) = { }
Steps
1. Draw one state for each position ( 1 to 5 )
2. Find the follow of all numbers
3. For each position, draw a transition to every position in its FOLLOWPOS
FOLLOWPOS (1) contains 1 , 2 , 3 , so draw transitions from 1 to 1 , 2 and 3
(1 means “a” so label with “a”)
(2 means “b” so label with “b”)
(3 means “a” so label with “a”)
Do the same for all positions
4. firstpos(root) = {1,2,3} , so take 1 , 2 and 3 as initial states
5. Take 5 as the final state
[Figure: the resulting NFA over states 1 to 5]
We can construct a DFA from this NFA using the subset
construction method (covered in …
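The NFA built from the followpos values can be simulated without drawing it. The sketch below takes the FOLLOWPOS values of the example as given; the function nfa_accepts is hypothetical. An arc from position i to position j exists for every j in FOLLOWPOS (i) and is labeled with the symbol at position i.

```python
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "#"}
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: set()}
INITIAL = {1, 2, 3}      # firstpos(root)
FINAL = 5                # position of #

def nfa_accepts(text):
    current = set(INITIAL)
    for ch in text:
        # follow every arc i -> j whose label (the symbol at i) equals ch
        current = {j for i in current if SYMBOL_AT[i] == ch for j in FOLLOWPOS[i]}
    return FINAL in current

for s in ["ab", "aab", "bab", "abb", "ba"]:
    print(s, nfa_accepts(s))
```

The sets of positions that `current` passes through are the same sets that serve as DFA states in the subset construction.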
END