Unit-2 Lexical Analysis
Unit – 2
Lexical Analyzer
Topics to be covered
• Interaction of scanner & parser
• Token, Pattern & Lexemes
• Input buffering
• Specification of tokens
• Regular expression & Regular definition
• Transition diagram
• Finite automata
• Regular expression to NFA using Thompson's rule
• Conversion from NFA to DFA using subset construction method
• DFA optimization
• Conversion from regular expression to DFA using Syntax Tree method
9/13/2023
Interaction of scanner & parser
[Figure: interaction of the lexical analyzer and the parser, with the symbol table]
Upon receiving a “Get next token” command from the parser, the lexical analyzer reads input
characters until it can identify the next token.
The lexical analyzer also strips out comments and white space (blanks, tabs, and newline
characters) from the source program.
Lexeme    Token
=         Operator1
+         Operator2
45        Constant1
Lexemes
Lexemes of identifier: total, sum
Lexemes of operator: =, +
Lexemes of constant: 45
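The split into tokens and lexemes can be sketched in Python. This is only an illustration: the input statement `total = sum + 45` is an assumption reconstructed from the lexemes listed above, and the token names and the `tokenize` helper are hypothetical.

```python
import re

# Hypothetical token patterns: each token name is paired with the
# pattern that its lexemes must match.
TOKEN_SPEC = [
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("constant", r"[0-9]+"),
    ("operator", r"[=+]"),
    ("skip", r"[ \t\n]+"),  # white space is stripped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (token, lexeme) pairs, scanning the input left to right."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for token, lexeme in tokenize("total = sum + 45"):
    print(token, lexeme)
```

Each call to `MASTER.match` plays the role of one “Get next token” request; white space is consumed but never handed to the caller.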
Input buffering
Without buffering, the lexical analyzer would have to access secondary memory each time it reads
a character to identify a token.
This is time-consuming and costly, so the input is first read into a buffer and then scanned by
the lexical analyzer.
The lexical analyzer scans the input string from left to right, one character at a time, to
identify tokens. It uses two pointers to scan tokens:
Begin Pointer (bp): points to the beginning of the lexeme currently being read.
Forward Pointer (fp): moves ahead to search for the end of the lexeme.
Initially, both pointers point to the first character of the input string.
The forward pointer moves ahead to search for the end of the lexeme.
As soon as a blank space is encountered, it indicates the end of the lexeme.
In the slide's example, as soon as fp encounters a blank space, the lexeme “int” is identified.
When fp encounters white space, it ignores it and moves ahead.
Then both the begin pointer (bp) and the forward pointer (fp) are set at the start of the next token.
In this scheme, only one buffer is used to store the input string. The problem is that if a
lexeme is very long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has
to be refilled, which overwrites the first part of the lexeme.
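The two-pointer scan described above can be sketched as follows. This is a simplified illustration, not a full scanner: it assumes lexemes are separated by white space only, and the helper name `scan_lexemes` and the sample input are hypothetical.

```python
def scan_lexemes(text):
    """Split text into lexemes with a begin pointer (bp) and a forward pointer (fp)."""
    lexemes = []
    bp = 0
    n = len(text)
    while bp < n:
        # Skip white space so that bp rests on the first character of a lexeme.
        while bp < n and text[bp].isspace():
            bp += 1
        if bp == n:
            break
        fp = bp
        # fp moves ahead until white space (or the end of input) ends the lexeme.
        while fp < n and not text[fp].isspace():
            fp += 1
        lexemes.append(text[bp:fp])
        bp = fp  # both pointers restart at the next lexeme
    return lexemes


print(scan_lexemes("int a = 10"))
```

A real lexical analyzer decides the end of a lexeme from the token patterns rather than from white space alone; the sketch only shows how bp and fp move.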
There are mainly two techniques for input buffering:
1. Buffer pairs
2. Sentinels
Buffer Pair
The lexical analyzer scans the input string from left to right, one character at a time.
The buffer is divided into two N-character halves, where N is the number of characters on one disk block.
[Figure: buffer pair holding E = Mi * C * * 2 eof, with the lexeme_beginning and forward pointers marked]
Sentinels
With buffer pairs, each time the forward pointer advances we must check that it has not moved off
the end of one half of the buffer. If it has, the other half must be reloaded.
Thus, for each character read, we make two tests: one for the end of the buffer and one to
determine which character was read.
We can combine the buffer-end test with the test for the current character.
We can reduce the two tests to one if we extend each buffer half to hold a sentinel character at
the end.
The sentinel is a special character that cannot be part of the source program, and a natural
choice is the character EOF.
[Figure: buffer pair with an eof sentinel at the end of each half: E = Mi * eof | C * * 2 eof ... eof]
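A minimal sketch of a buffer pair with sentinels is given below. Several things are assumptions made for illustration: the class name `BufferPair`, the use of `"\0"` as the sentinel, the tiny half size `N = 4`, and reading from an in-memory string instead of a disk block. The lexeme_beginning pointer is left out to keep the sketch short.

```python
EOF = "\0"  # assumed sentinel: a character that cannot appear in the source
N = 4       # assumed half size; in practice, the size of one disk block


class BufferPair:
    def __init__(self, source):
        self.source = source
        self.read_pos = 0               # how much of the source has been loaded
        self.buf = [EOF] * (2 * N + 2)  # two halves, each followed by a sentinel slot
        self._load(0)
        self.forward = 0

    def _load(self, start):
        """Fill the half that begins at index start and put the sentinel after the data."""
        chunk = self.source[self.read_pos:self.read_pos + N]
        self.read_pos += len(chunk)
        for k, ch in enumerate(chunk):
            self.buf[start + k] = ch
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Return the next character, or None at the real end of the input."""
        ch = self.buf[self.forward]
        if ch != EOF:                   # the single test made for ordinary characters
            self.forward += 1
            return ch
        if self.forward == N:           # sentinel at the end of the first half
            self._load(N + 1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 1:   # sentinel at the end of the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                     # sentinel inside a half: end of the input


reader = BufferPair("E = M * C ** 2")
chars = []
while (c := reader.next_char()) is not None:
    chars.append(c)
print("".join(chars))
```

For ordinary characters only the `ch != EOF` comparison runs; the checks that decide which half to reload are reached only when a sentinel is actually read.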
Regular expression
A regular expression is a sequence of characters that defines a pattern.
Notational shorthands
4. Alphabets: Σ
Operations on languages
Operation                                  Definition
Union of L and M, written L ∪ M            L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M, written LM       LM = { st | s is in L and t is in M }
Kleene closure of L, written L*            L* denotes “zero or more concatenations of” L
Positive closure of L, written L+          L+ denotes “one or more concatenations of” L
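The four operations can be tried out on small finite languages represented as Python sets of strings. Since L* and L+ are infinite in general, the sketch below only builds a finite slice bounded by a hypothetical `max_reps` parameter; all function names are illustrative.

```python
def union(L, M):
    """Strings that are in L or in M."""
    return L | M


def concat(L, M):
    """Every string st with s in L and t in M."""
    return {s + t for s in L for t in M}


def kleene(L, max_reps):
    """Finite slice of L*: zero up to max_reps concatenations of L."""
    result = {""}   # zero concatenations gives the empty string
    current = {""}
    for _ in range(max_reps):
        current = concat(current, L)
        result |= current
    return result


def positive(L, max_reps):
    """Finite slice of L+: one up to max_reps concatenations of L."""
    return concat(L, kleene(L, max_reps - 1))


L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(kleene(L, 2)))
print(sorted(positive(L, 2)))
```

The only difference between the two closures is the empty string: `kleene` always contains it, while `positive` contains it only if L itself does.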
2. 0 or 11 or 111
Strings: 0, 11, 111    R.E. = 0 | 11 | 111
17. All binary strings with at least 3 characters whose 3rd character is zero
Strings: 000, 100, 1100, 1001…    R.E. = (0 | 1)(0 | 1)0(0 | 1)*
18. Language which consists of exactly two b’s over the set Σ = {a, b}
Strings: bb, bab, aabb, abba…    R.E. = a*ba*ba*
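These examples can be checked mechanically with Python's `re` module, writing each regular expression in `re` syntax and using `re.fullmatch` so that the whole string must match. The accepted strings are the ones listed above; the rejected strings are extra counterexamples chosen for this sketch.

```python
import re

# pattern -> (strings that should match, strings that should not)
EXAMPLES = {
    "0|11|111": (["0", "11", "111"], ["1", "00", "1111"]),
    "(0|1)(0|1)0(0|1)*": (["000", "100", "1100", "1001"], ["01", "001", "1110"]),
    "a*ba*ba*": (["bb", "bab", "aabb", "abba"], ["b", "aaa", "bbb"]),
}


def check(pattern, accepted, rejected):
    """True if every accepted string matches in full and no rejected string does."""
    all_accepted = all(re.fullmatch(pattern, s) for s in accepted)
    none_rejected = not any(re.fullmatch(pattern, s) for s in rejected)
    return all_accepted and none_rejected


for pattern, (accepted, rejected) in EXAMPLES.items():
    print(pattern, check(pattern, accepted, rejected))
```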
Regular definition
A regular definition gives names to certain regular expressions and uses those names in other
regular expressions.
Regular definition is a sequence of definitions of the form:
𝑑1 → 𝑟1
𝑑2 → 𝑟2
……
𝑑𝑛 → 𝑟𝑛
Where 𝑑𝑖 is a distinct name & 𝑟𝑖 is a regular expression.
Example: Regular definition for identifier
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter (letter | digit)*
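The same idea of naming sub-expressions and reusing the names carries over to code. In this sketch the Python variables `letter`, `digit` and `ident` stand in for the names of the regular definition, and `is_identifier` is a hypothetical helper.

```python
import re

# Each name is defined once and then reused in later definitions.
letter = "[A-Za-z]"
digit = "[0-9]"
ident = f"{letter}({letter}|{digit})*"

ID_PATTERN = re.compile(ident)


def is_identifier(text):
    """True if the whole text matches id = letter (letter | digit)*."""
    return ID_PATTERN.fullmatch(text) is not None


for sample in ["sum", "total1", "1total", ""]:
    print(repr(sample), is_identifier(sample))
```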
Transition Diagram
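A stylized flowchart is called a transition diagram. In the usual notation:
• a circle is a state
• a labelled arrow is a transition
• an arrow marked "start" enters the start state
• a double circle is a final state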
Finite Automata
Finite automata are recognizers.
An FA simply says “Yes” or “No” about each possible input string.
A finite automaton is a mathematical model consisting of:
1. A set of states 𝑺
2. A set of input symbols 𝜮
3. A transition function move
4. An initial state 𝑺𝟎
5. A set of final (accepting) states 𝐅
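The five components map directly onto a small simulator. As a sketch, the deterministic automaton below uses the transition table that this unit derives later for (a|b)* abb (states A to E, accepting state E); the function name `accepts` is hypothetical.

```python
STATES = {"A", "B", "C", "D", "E"}   # 1. set of states S
ALPHABET = {"a", "b"}                # 2. set of input symbols
MOVE = {                             # 3. transition function move
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
START = "A"                          # 4. initial state
FINALS = {"E"}                       # 5. accepting states F


def accepts(text):
    """Answer "yes" (True) or "no" (False) for one input string."""
    state = START
    for ch in text:
        if ch not in ALPHABET:
            return False  # a symbol outside the alphabet can never be accepted
        state = MOVE[(state, ch)]
    return state in FINALS


for sample in ["abb", "aabb", "babb", "ab", ""]:
    print(repr(sample), accepts(sample))
```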
R.E to DFA
[Figures: Thompson's construction fragments built from a start state i, a final state f, ε-transitions and the sub-automata N(s) and N(t); the worked examples include the NFA for b*ab with states 1 to 6]
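Thompson's construction can be sketched with one small function per rule. This is an illustration under stated assumptions: the `Frag` class and the function names are hypothetical, state numbers are handed out by a counter and will not necessarily match the numbering in the figures, and the regular expression is assembled by hand instead of being parsed.

```python
from itertools import count

EPS = None        # label used for ε-transitions
_ids = count(1)   # source of fresh state numbers


class Frag:
    """NFA fragment with one start state, one accepting state and labelled edges."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges


def symbol(a):
    i, f = next(_ids), next(_ids)
    return Frag(i, f, [(i, a, f)])


def concat(s, t):
    # The accepting state of N(s) is merged with the start state of N(t).
    def rename(q):
        return s.accept if q == t.start else q
    merged = [(rename(p), a, rename(q)) for p, a, q in t.edges]
    return Frag(s.start, t.accept, s.edges + merged)


def union(s, t):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, s.start), (i, EPS, t.start), (s.accept, EPS, f), (t.accept, EPS, f)]
    return Frag(i, f, s.edges + t.edges + extra)


def star(s):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, s.start), (s.accept, EPS, f), (s.accept, EPS, s.start), (i, EPS, f)]
    return Frag(i, f, s.edges + extra)


def closure(frag, states):
    """All states reachable from the given states on ε-edges alone."""
    seen, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for x, a, q in frag.edges:
            if x == p and a is EPS and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def accepts(frag, text):
    current = closure(frag, {frag.start})
    for ch in text:
        moved = {q for p, a, q in frag.edges if p in current and a == ch}
        current = closure(frag, moved)
    return frag.accept in current


NFA = concat(concat(star(symbol("b")), symbol("a")), symbol("b"))  # b*ab
print(accepts(NFA, "bbab"), accepts(NFA, "abb"))
```

With the merging form of concatenation, the fragment built for b*ab has six states, the same count as in the slide's example.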
OPERATION       DESCRIPTION
ε-closure(s)    Set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T)    Set of NFA states reachable from some NFA state s in T on ε-transitions alone.
Move(T, a)      Set of NFA states to which there is a transition on input symbol a from some
                NFA state s in T.
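The two operations can be sketched over a dictionary-based NFA. The transition table below is an assumption: it is transcribed from the standard Thompson NFA for (a|b)* abb (states 0 to 10), which agrees with the closures worked out in the example that follows. The names `NFA`, `eps_closure` and `move` are illustrative.

```python
EPS = "eps"  # label used for ε-transitions

# (state, label) -> set of next states; assumed NFA for (a|b)* abb
NFA = {
    (0, EPS): {1, 7},
    (1, EPS): {2, 4},
    (2, "a"): {3},
    (4, "b"): {5},
    (3, EPS): {6},
    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},
    (8, "b"): {9},
    (9, "b"): {10},
}


def eps_closure(T):
    """States reachable from some state in T on ε-transitions alone."""
    closure = set(T)
    stack = list(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(T, a):
    """States reached by a transition on input symbol a from some state in T."""
    return frozenset(t for s in T for t in NFA.get((s, a), ()))


A = eps_closure({0})
print(sorted(A))
print(sorted(move(A, "a")))
print(sorted(eps_closure(move(A, "a"))))
```

ε-closure(s) for a single state is just the call `eps_closure({s})`.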
Example: subset construction for (a|b)* abb
[Figure: NFA for (a|b)* abb with states 0 to 10]
ε-closure(0) = {0, 1, 7, 2, 4} = {0,1,2,4,7} ---- A
A = {0, 1, 2, 4, 7}
Move(A,a) = {3,8}
ε-closure(Move(A,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(A,b) = {5}
ε-closure(Move(A,b)) = {5, 6, 7, 1, 2, 4} = {1,2,4,5,6,7} ---- C
B = {1, 2, 3, 4, 6, 7, 8}
Move(B,a) = {3,8}
ε-closure(Move(B,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(B,b) = {5,9}
ε-closure(Move(B,b)) = {5, 6, 7, 1, 2, 4, 9} = {1,2,4,5,6,7,9} ---- D
C = {1, 2, 4, 5, 6, 7}
Move(C,a) = {3,8}
ε-closure(Move(C,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(C,b) = {5}
ε-closure(Move(C,b)) = {5, 6, 7, 1, 2, 4} = {1,2,4,5,6,7} ---- C
D = {1, 2, 4, 5, 6, 7, 9}
Move(D,a) = {3,8}
ε-closure(Move(D,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(D,b) = {5,10}
ε-closure(Move(D,b)) = {5, 6, 7, 1, 2, 4, 10} = {1,2,4,5,6,7,10} ---- E
E = {1, 2, 4, 5, 6, 7, 10}
Move(E,a) = {3,8}
ε-closure(Move(E,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
Move(E,b) = {5}
ε-closure(Move(E,b)) = {5, 6, 7, 1, 2, 4} = {1,2,4,5,6,7} ---- C
Transition Table
States                  a   b
A = {0,1,2,4,7}         B   C
B = {1,2,3,4,6,7,8}     B   D
C = {1,2,4,5,6,7}       B   C
D = {1,2,4,5,6,7,9}     B   E
E = {1,2,4,5,6,7,10}    B   C
[Figure: DFA with states A to E drawn from the transition table]
Note:
• The accepting state of the NFA is 10.
• 10 is an element of E.
• So, E is the accepting state of the DFA.
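The whole worked example can be reproduced with a short subset-construction loop. As before, the NFA transitions are an assumption transcribed from the standard Thompson NFA for (a|b)* abb, and the function and variable names are hypothetical; DFA states are named A, B, C, ... in the order they are discovered.

```python
from collections import deque

EPS = "eps"
NFA = {  # assumed NFA for (a|b)* abb, states 0 to 10
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
NFA_START, NFA_ACCEPT = 0, 10


def eps_closure(T):
    closure, stack = set(T), list(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(T, a):
    return frozenset(t for s in T for t in NFA.get((s, a), ()))


def subset_construction(alphabet):
    start = eps_closure({NFA_START})
    names = {start: "A"}          # set of NFA states -> DFA state name
    table = {}                    # (DFA state, symbol) -> DFA state
    queue = deque([start])        # unmarked DFA states
    while queue:
        T = queue.popleft()
        for a in alphabet:
            U = eps_closure(move(T, a))
            if not U:
                continue          # no transition on a from T
            if U not in names:
                names[U] = chr(ord("A") + len(names))
                queue.append(U)
            table[(names[T], a)] = names[U]
    accepting = {name for T, name in names.items() if NFA_ACCEPT in T}
    return names, table, accepting


names, table, accepting = subset_construction("ab")
for subset, name in sorted(names.items(), key=lambda item: item[1]):
    print(name, sorted(subset), table[(name, "a")], table[(name, "b")])
print("accepting:", sorted(accepting))
```

The queue of unmarked states is processed in first-in, first-out order, which is why the names come out in the same order as in the hand computation.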
Exercise
Convert the following regular expressions to DFAs using the subset construction method:
1. (a+b)* a b b (a+b)* (Nov-2011)
2. a a* (b | c) a* c# (Dec-2012)
3. a+ (c |d) b* f # (May-2015)
4. a (b | c)* a* c# (Nov-2016)
5. (a | b)* a b# (April-2017)
6. (a | b)* a b* a (May-2019, Dec-2021_NEW)
7. a a* a b* c # (Aug-2021)
8. (a/b) c a* c# (Jun-2023)
DFA optimization
1. Construct an initial partition Π of the set of states with two groups: the accepting states F
and the non-accepting states S − F.
2. Apply the repartition procedure given below to Π to construct a new partition Π_new.
3. If Π_new = Π, let Π_final = Π and continue with step (4). Otherwise, repeat step (2) with
Π = Π_new.
Repartition procedure (used in step 2):
for each group G of Π do begin
    partition G into subgroups such that two states s and t
    of G are in the same subgroup if and only if for all
    input symbols a, states s and t have transitions on a
    to states in the same group of Π.
    replace G in Π_new by the set of all subgroups formed.
end
4. Choose one state in each group of the partition Π_final as the representative for that group.
The representatives will be the states of the reduced DFA M′. Let s be a representative state, and
suppose on input a there is a transition of M from s to t. Let r be the representative of t's
group. Then M′ has a transition from s to r on a. Let the start state of M′ be the representative
of the group containing the start state s0 of M, and let the accepting states of M′ be the
representatives that are in F.
5. If M′ has a dead state d, then remove d from M′. Also remove any state not reachable from the
start state.
DFA optimization: example
Transition table before optimization:
States  a  b
A       B  C
B       B  D
C       B  C
D       B  E
E       B  C
Partitioning:
{A, B, C, D, E}
Non-accepting states {A, B, C, D}    Accepting states {E}
{A, B, C, D} splits into {A, B, C} and {D}
{A, B, C} splits into {A, C} and {B}
Now no more splitting is possible.
If we choose A as the representative for group (AC), then we obtain the reduced transition table:
Optimized Transition Table
States  a  b
A       B  A
B       B  D
D       B  E
E       B  A
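The partitioning steps can be sketched in Python and run on the transition table of this example. The sketch assumes a DFA whose transition function is defined for every state and symbol, and it leaves out step 5 (removing dead and unreachable states); `minimize` and `reduce_table` are hypothetical names.

```python
ALPHABET = "ab"
MOVE = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
STATES = {"A", "B", "C", "D", "E"}
FINALS = {"E"}


def minimize(states, alphabet, move, finals):
    """Refine the partition {F, S - F} until no group can be split further."""
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    while True:
        def group_of(s):
            return next(i for i, g in enumerate(partition) if s in g)

        new_partition = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                # States stay together only if every symbol leads to the same group.
                key = tuple(group_of(move[(s, a)]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):
            return new_partition  # nothing was split, so this is the final partition
        partition = new_partition


def reduce_table(partition, alphabet, move):
    """Use the smallest state name in each group as its representative."""
    rep = {s: min(g) for g in partition for s in g}
    reps = sorted({rep[s] for s in rep})
    return {(r, a): rep[move[(r, a)]] for r in reps for a in alphabet}


final_partition = minimize(STATES, ALPHABET, MOVE, FINALS)
print(sorted(sorted(g) for g in final_partition))
print(reduce_table(final_partition, ALPHABET, MOVE))
```

Because each round can only split groups, comparing the number of groups before and after a round is enough to detect that the partition has stopped changing.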
followpos(i)
The set of positions that can follow position 𝑖 in the tree.
2. If n is a star-node and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).
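[Figure: syntax tree with leaves a (position 1), b (position 2), a (position 3) and position 4, annotated with firstpos sets: {1} at leaf 1, {2} at leaf 2, {1,2} at the | node and at the * node, {3} at leaf 3]
Rules for firstpos(n):
Node n                       firstpos(n)
star-node n with child c1    firstpos(c1)
or-node n = c1 | c2          firstpos(c1) ∪ firstpos(c2)
cat-node n = c1 . c2         if (nullable(c1)) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)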
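[Figure: the | node over leaves a (position 1) and b (position 2), with firstpos and lastpos {1} at a, {2} at b and {1,2} at the | node, repeated for each step below]
For each cat-node, every position in firstpos(c2) is added to followpos(i) for each i in lastpos(c1):
i = lastpos(c1) = {5}, firstpos(c2) = {6}: 6 is added to followpos(5)
i = lastpos(c1) = {4}, firstpos(c2) = {5}: 5 is added to followpos(4)
i = lastpos(c1) = {3}, firstpos(c2) = {4}: 4 is added to followpos(3)
i = lastpos(c1) = {1,2}, firstpos(c2) = {3}: 3 is added to followpos(1) and followpos(2)
For the star-node n:
i = lastpos(n) = {1,2}, firstpos(n) = {1,2}: 1 and 2 are added to followpos(1) and followpos(2)
Result:
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {5}
followpos(5) = {6}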
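The followpos computation can be sketched as a recursive walk over the syntax tree. The tree built here is an assumption: it encodes (a|b)* a b b # with positions 1 to 6, which is the expression the followpos values and DFA states of this example are consistent with. Nodes are plain tuples and all helper names are hypothetical.

```python
from collections import defaultdict


def leaf(sym, pos): return ("leaf", sym, pos)
def alt(c1, c2): return ("or", c1, c2)
def cat(c1, c2): return ("cat", c1, c2)
def star(c1): return ("star", c1)


def nullable(n):
    kind = n[0]
    if kind == "leaf":
        return False
    if kind == "or":
        return nullable(n[1]) or nullable(n[2])
    if kind == "cat":
        return nullable(n[1]) and nullable(n[2])
    return True  # a star-node is always nullable


def firstpos(n):
    kind = n[0]
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return firstpos(n[1]) | firstpos(n[2])
    if kind == "cat":
        if nullable(n[1]):
            return firstpos(n[1]) | firstpos(n[2])
        return firstpos(n[1])
    return firstpos(n[1])  # star-node


def lastpos(n):
    kind = n[0]
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return lastpos(n[1]) | lastpos(n[2])
    if kind == "cat":
        if nullable(n[2]):
            return lastpos(n[1]) | lastpos(n[2])
        return lastpos(n[2])
    return lastpos(n[1])  # star-node


def followpos(n, table=None):
    """Apply the cat-node and star-node rules at every node of the tree."""
    if table is None:
        table = defaultdict(set)
    kind = n[0]
    if kind == "cat":
        for i in lastpos(n[1]):
            table[i] |= firstpos(n[2])
    elif kind == "star":
        for i in lastpos(n):
            table[i] |= firstpos(n)
    if kind != "leaf":
        for child in n[1:]:
            followpos(child, table)
    return table


# Assumed expression: (a|b)* a b b #
TREE = cat(cat(cat(cat(star(alt(leaf("a", 1), leaf("b", 2))),
                       leaf("a", 3)), leaf("b", 4)), leaf("b", 5)), leaf("#", 6))

for pos, follow in sorted(followpos(TREE).items()):
    print(pos, sorted(follow))
```

Recomputing firstpos and lastpos at every node keeps the sketch short; a practical implementation would compute them once per node in a single bottom-up pass.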
Initial state = firstpos(root) = {1,2,3} ----- A
State C
δ({1,2,3,5}, a) = followpos(1) ∪ followpos(3)
                = {1,2,3} ∪ {4} = {1,2,3,4} ----- B
δ({1,2,3,5}, b) = followpos(2) ∪ followpos(5)
                = {1,2,3} ∪ {6} = {1,2,3,6} ----- D
States          a   b
A = {1,2,3}     B   A
B = {1,2,3,4}   B   C
C = {1,2,3,5}   B   D
D = {1,2,3,6}   B   A
[Figure: DFA with states A, B, C and D drawn from the table above]
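The table above can be rebuilt from the followpos sets with a short worklist loop. Two things are assumed here for illustration: the symbol at each position (1 = a, 2 = b, 3 = a, 4 = b, 5 = b, 6 = the end marker #), and that a state is accepting when it contains position 6. The names `FOLLOWPOS`, `SYMBOL_AT` and `build_dfa` are hypothetical.

```python
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}  # assumed labelling
END_POSITION = 6  # position of the end marker #


def build_dfa(start_positions, alphabet):
    start = frozenset(start_positions)   # firstpos(root)
    names = {start: "A"}                 # set of positions -> DFA state name
    table = {}
    work = [start]
    while work:
        S = work.pop(0)
        for a in alphabet:
            # Union of followpos(p) over every position p in S labelled with a.
            U = frozenset(q for p in S if SYMBOL_AT[p] == a for q in FOLLOWPOS[p])
            if not U:
                continue
            if U not in names:
                names[U] = chr(ord("A") + len(names))
                work.append(U)
            table[(names[S], a)] = names[U]
    accepting = {name for S, name in names.items() if END_POSITION in S}
    return names, table, accepting


names, table, accepting = build_dfa({1, 2, 3}, "ab")
for S, name in sorted(names.items(), key=lambda item: item[1]):
    print(name, sorted(S), table[(name, "a")], table[(name, "b")])
print("accepting:", sorted(accepting))
```

Compared with subset construction, no NFA is built: the DFA states are sets of tree positions, and followpos plays the role of the transition relation.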