
A Typical Lexical Analyzer Generator Nfa To Dfa DFA Analysis

The lexical analyzer is the first phase of a compiler. It reads the character stream of the source program and groups the characters into meaningful tokens. For each token, the lexical analyzer produces an output consisting of the token name and a pointer to its symbol table entry. This token stream is passed to the subsequent syntax analysis phase. Finite state automata are used to recognize tokens based on regular expressions that specify the patterns for different token types like identifiers, keywords, and operators. The lexical analyzer assists the parser by preprocessing the source code and correlating errors to their locations.

Uploaded by

sibhat mequanint

A typical lexical analyzer generator

NFA to DFA
DFA Analysis
The Structure of Compiler (Phases of Compiler)

[Figure: character stream → Lexical Analyzer → token stream → Syntax Analyzer → syntax tree → Semantic Analyzer → syntax tree → Intermediate Code Generator → intermediate representation → Machine-Independent Code Optimizer → IR → Code Generator → target machine code → Machine-Dependent Code Optimizer → target machine code. The Symbol Table is drawn alongside the phases.]
Dr. Deepak K. Sinha, JIT, JU
Review: Compiler Phases:
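[Figure: Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator. The symbol table manager and the error handler are drawn beside the phases. The early phases are marked Front End and the later ones Backend.]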
Lexical Analysis
• The first phase of a compiler is called lexical analysis or scanning.
• The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
• For each lexeme, the lexical analyzer produces as output a token of the form
• <token-name, attribute-value>
• This token is passed on to the subsequent phase, syntax analysis.
• In the token, the first component, token-name, is an abstract symbol that is used during syntax analysis, and
• the second component, attribute-value, points to an entry in the symbol table for this token.

(Information from the symbol-table entry is needed for semantic analysis and code generation.)
Example:
p=i+r*60

• p is a lexeme that would be mapped into a token
<id, 1>, where id is an abstract symbol standing
for identifier and 1 points to the symbol-table
entry for p.
• The symbol-table entry for an identifier holds information about the identifier, such as its name and type.

symbol table
1   p   ...
2   i   ...
3   r   ...
• The assignment symbol = is a lexeme that is mapped into the token <=>
• i is mapped into <id, 2>
• + is mapped into <+>
• r is mapped into <id, 3>
• * is mapped into <*>
• 60 is mapped into <60>
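
The mapping from lexemes to tokens for p=i+r*60 can be sketched in C as follows. This is only an illustrative sketch: the names Token, TokenKind, install_id and tokenize are assumptions made for this example and do not come from the slides, and the symbol table is reduced to an array of names.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_SYMBOLS 16
#define MAX_NAME    32

/* Hypothetical token kinds for this sketch. */
typedef enum { TOK_ID, TOK_NUM, TOK_OP } TokenKind;

typedef struct {
    TokenKind kind;
    int attr;  /* TOK_ID: 1-based symbol-table index; TOK_NUM: value; TOK_OP: the operator character */
} Token;

char symtab[MAX_SYMBOLS][MAX_NAME];  /* names only; a real table also stores types etc. */
int  symcount = 0;

/* Return the 1-based index of name, installing it the first time it is seen. */
int install_id(const char *name) {
    for (int i = 0; i < symcount; i++)
        if (strcmp(symtab[i], name) == 0) return i + 1;
    assert(symcount < MAX_SYMBOLS);
    strncpy(symtab[symcount], name, MAX_NAME - 1);
    symtab[symcount][MAX_NAME - 1] = '\0';
    return ++symcount;
}

/* Split src into tokens; returns the number of tokens written to out. */
int tokenize(const char *src, Token *out, int max) {
    int n = 0;
    while (*src != '\0' && n < max) {
        if (isspace((unsigned char)*src)) { src++; continue; }
        if (isalpha((unsigned char)*src)) {
            char name[MAX_NAME];
            int len = 0;
            while (isalnum((unsigned char)*src)) {
                if (len < MAX_NAME - 1) name[len++] = *src;
                src++;
            }
            name[len] = '\0';
            out[n++] = (Token){ TOK_ID, install_id(name) };
        } else if (isdigit((unsigned char)*src)) {
            int value = 0;
            while (isdigit((unsigned char)*src)) value = value * 10 + (*src++ - '0');
            out[n++] = (Token){ TOK_NUM, value };
        } else {
            out[n++] = (Token){ TOK_OP, *src++ };  /* every other character is its own token */
        }
    }
    return n;
}
```

Calling tokenize on "p=i+r*60" yields seven tokens, with p, i and r installed as symbol-table entries 1, 2 and 3 in order of first appearance.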



Interaction of Lexical analyzer with parser

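[Figure: source program → lexical analyzer → parser. The parser requests the next token by calling Nexttoken() and the lexical analyzer returns a token. Both the lexical analyzer and the parser access the symbol table.]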
• Two issues in lexical analysis.
– How to specify tokens (patterns)?
– How to recognize the tokens given a token specification (how to implement the nexttoken() routine)?

• How to specify tokens:


– all the basic elements in a language must be
tokens so that they can be recognized.
void main() {
int i, j;
for (i=0; i<50; i++) {
cout<<i;
}
}

• Token types: constant, identifier, reserved word, operator and misc. symbol.

– Tokens are specified by regular expressions.


Overview

[Figure: token description → scanner generator → lexical analysis; program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST]
• Generating a lexical analyzer
– generic methods
– specific tool lex
Token description

• lex: scanner generator for UNIX
– token description → C code

• format of the lex input file:

definitions     regular descriptions
%%
rules           regular expressions + actions
%%
user code       auxiliary C-code
Lex description to recognize integers

• an integer is a non-empty sequence of digits optionally followed by a letter denoting the base class (b for binary and o for octal).
• base → [bo]
  integer → digit+ base?
• rule = expr + action
• {} signal application of a description

%{
#include "lex.h"
%}
base [bo]
digit [0-9]
%%
{digit}+{base}? {return INTEGER;}
%%
automatic generation

[Figure: token description → scanner generator → finite state automaton, which drives lexical analysis; program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST. The automaton shown has states S0, S1, S2 and S3 with transitions on digit and ‘.’]
Finite-state automaton

• Recognize input character by character
• Transfer between states
• FSA
– Initial state S0
– set of accepting states
[Figure: S0 goes to S1 on ‘i’; S1 goes to S2 on ‘f’]
FSA examples

• integral_number → [0-9]+
[Figure: S0 goes to S1 on digit; S1 loops on digit]

• fixed_point_number → [0-9]* ‘.’ [0-9]+
[Figure: S0 loops on digit; S0 goes to S2 on ‘.’; S2 goes to S3 on digit; S3 loops on digit]
Concurrent recognition

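• integral_number → [0-9]+
• fixed_point_number → [0-9]* ‘.’ [0-9]+
• recognize both tokens in one pass
[Figure: the two automata side by side. Integral number: S0 goes to S1 on digit; S1 loops on digit. Fixed point number: S0 loops on digit; S0 goes to S2 on ‘.’; S2 goes to S3 on digit; S3 loops on digit]
Concurrent recognition

• integral_number → [0-9]+
• fixed_point_number → [0-9]* ‘.’ [0-9]+
• naïve approach: merge initial states
[Figure: the two automata with their initial states merged into a single S0, which now has both a digit loop and a digit edge to S1]
Concurrent recognition

• integral_number → [0-9]+
• fixed_point_number → [0-9]* ‘.’ [0-9]+
• correct approach: share common prefix transitions
[Figure: S0 goes to S1 on digit; S1 loops on digit; S0 and S1 both go to S2 on ‘.’; S2 goes to S3 on digit; S3 loops on digit]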
FSA implementation:
transition table

• concurrent recognition of integers and fixed point numbers

        character
state   digit   dot   other   recognized token
S0      S1      S2    -
S1      S1      S2    -       integer
S2      S3      -     -
S3      S3      -     -       fixed point

[Figure: S0 goes to S1 on digit; S1 loops on digit; S0 and S1 both go to S2 on ‘.’; S2 goes to S3 on digit; S3 loops on digit]
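
A table like this maps directly onto a two-dimensional array. The sketch below is one possible C encoding, not code from the slides: the identifiers next_state, recognized, char_class and classify are made up for illustration, and the function classifies a complete string instead of scanning for the longest match as a real scanner would.

```c
#include <assert.h>
#include <stddef.h>

enum { S0, S1, S2, S3, NUM_STATES, ERR = -1 };
enum { DIGIT, DOT, OTHER, NUM_CLASSES };
typedef enum { TOKEN_NONE, TOKEN_INTEGER, TOKEN_FIXED_POINT } TokenType;

/* Rows are states, columns are the character classes digit, dot, other.
   ERR stands for the "-" entries of the table. */
static const int next_state[NUM_STATES][NUM_CLASSES] = {
    /* S0 */ { S1, S2,  ERR },
    /* S1 */ { S1, S2,  ERR },
    /* S2 */ { S3, ERR, ERR },
    /* S3 */ { S3, ERR, ERR },
};

/* Token recognized when the input ends in each state. */
static const TokenType recognized[NUM_STATES] = {
    TOKEN_NONE, TOKEN_INTEGER, TOKEN_NONE, TOKEN_FIXED_POINT
};

int char_class(char c) {
    if (c >= '0' && c <= '9') return DIGIT;
    if (c == '.') return DOT;
    return OTHER;
}

/* Run the automaton over the whole string and report what it recognizes. */
TokenType classify(const char *text) {
    assert(text != NULL);
    int state = S0;
    for (; *text != '\0'; text++) {
        state = next_state[state][char_class(*text)];
        if (state == ERR) return TOKEN_NONE;
    }
    return recognized[state];
}
```

With this encoding, adding a token type means adding rows and columns to the table; the driver loop stays the same.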
The role of the lexical analyzer

First phase of a compiler


1、Main task
– To read the input characters
– To produce a sequence of tokens used by
the parser for syntax analysis
– As an assistant of parser
The role of the lexical analyzer
3、Processes in lexical analyzers
– Scanning
• Pre-processing
– Strip out comments and white space
– Macro functions
– Correlating error messages from compiler with
source program
• A line number can be associated with an
error message
– Lexical analysis
LEXICAL ANALYSIS
The role of the lexical analyzer
4、Terms of the lexical analyzer
– Token
• Types of words in source program
• Keywords, operators, identifiers, constants, literal strings, punctuation symbols (such as commas, semicolons)
– Lexeme
• Actual words in source program
– Pattern
• A rule describing the set of lexemes that can represent a particular token in source program
• Relation {<, <=, >, >=, ==, <>}
The role of the lexical analyzer
5、Attributes for Tokens
– A pointer to the symbol-table entry in which
the information about the token is kept
E.g. E=M*C**2
<id, pointer to symbol-table entry for E>
<assign_op,>
<id, pointer to symbol-table entry for M>
<multi_op,>
<id, pointer to symbol-table entry for C>
<exp_op,>
<num,integer value 2>
The role of the lexical analyzer

6、Lexical Errors
– Deleting an extraneous character
– Inserting a missing character
– Replacing an incorrect character by a
correct character
– Transposing two adjacent characters (such as fi => if)
– Pre-scanning
Specification of Tokens

1、Regular Definition of Tokens


– Defined in regular expression
e.g. id → letter(letter|digit)*
letter → A|B|…|Z|a|b|…|z
digit → 0|1|2|…|9
Notes: Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions will serve as names for sets of strings.
Specification of Tokens
2、Regular Expression & Regular language
– Regular Expression
• A notation that allows us to define a pattern in a
high level language.
– Regular language
• Each regular expression r denotes a language
L(r) (the set of sentences relating to the regular
expression r)
Notes: Each word in a program can be expressed in a
regular expression
Recognition of Tokens

2、Methods for recognizing tokens

– Use a transition diagram
Recognition of Tokens

3、Transition Diagram(Stylized flowchart)


– Depict the actions that take place when a
lexical analyzer is called by the parser to
get the next token
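[Figure: start → state 0 (start state); state 0 goes to state 6 on >; state 6 goes to state 7 on =, an accepting state with return(relop,GE); state 6 goes to state 8 on other, an accepting state with return(relop,GT)]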
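
The fragment of the diagram covering > and >= can be turned into code state by state. The following is a minimal sketch under assumptions: the function name scan_greater and the Relop type are hypothetical, and the "other" edge is modeled by simply not consuming the character that follows >.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { RELOP_NONE, RELOP_GT, RELOP_GE } Relop;

/* Follow states 0, 6, 7 and 8 of the diagram. *length receives the number of
   characters that belong to the lexeme (0 when nothing is recognized). */
Relop scan_greater(const char *input, size_t *length) {
    assert(input != NULL && length != NULL);
    int state = 0;
    size_t pos = 0;
    for (;;) {
        switch (state) {
        case 0:                       /* start state */
            if (input[pos] == '>') { pos++; state = 6; }
            else { *length = 0; return RELOP_NONE; }
            break;
        case 6:                       /* seen '>' */
            if (input[pos] == '=') { pos++; state = 7; }
            else state = 8;           /* "other": leave the character unread */
            break;
        case 7:                       /* accepting: >= */
            *length = pos;
            return RELOP_GE;
        case 8:                       /* accepting: > */
            *length = pos;
            return RELOP_GT;
        }
    }
}
```

Because state 8 is entered without advancing pos, the character after > stays available for the next token.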
Recognition of Tokens

4、Implementing a Transition Diagram


– Each state gets a segment of code
– If there are edges leaving a state, then its
code reads a character and selects an edge
to follow, if possible
– Use nextchar() to read next character from
the input buffer
Recognition of Tokens
4、Implementing a Transition Diagram
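while (1) {
    switch (state) {
    case 0:   /* start state: skip white space, then branch on the first character */
        c = nextchar();
        if (c == blank || c == tab || c == newline) {
            state = 0;
            lexeme_beginning++;   /* white space is not part of any lexeme */
        }
        else if (c == '<') state = 1;
        else if (c == '=') state = 5;
        else if (c == '>') state = 6;
        else state = fail();      /* no edge matches: let fail() choose the next state */
        break;
    case 9:
        c = nextchar();
        if (isletter(c)) state = 10;
        else state = fail();
        break;
    …
    }
}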
Recognition of Tokens
5、A generalized transition diagram
Finite Automaton
– Deterministic or non-deterministic FA
– Non-deterministic means that more than one transition out of a state may be possible on the same input symbol
Recognition of Tokens

e.g. The FA for identifiers is:
[Figure: state 1 goes to state 2 on letter; state 2 loops on letter and on digit; state 2 is accepting]

– which represents the rule:

identifier=letter(letter|digit)*
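
As a small illustration, the two-state automaton for identifiers can be simulated directly. The function name is_identifier is an assumption of this sketch, and letters and digits are tested with the standard isalpha and isdigit, which follow the current C locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulate the FA: state 1 is the start state, state 2 the accepting state. */
bool is_identifier(const char *text) {
    assert(text != NULL);
    int state = 1;
    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;
        if (state == 1 && isalpha(c)) state = 2;                          /* first letter */
        else if (state == 2 && (isalpha(c) || isdigit(c))) state = 2;     /* letter or digit loop */
        else return false;                                                /* no edge to follow */
    }
    return state == 2;
}
```

The empty string is rejected because the automaton is still in state 1 when the input ends.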
Finite automata

1、Usage of FA
– Precisely recognize the regular sets
– A regular set is a set of sentences relating
to the regular expression
2、Sorts of FA
– Deterministic FA
– Non-deterministic FA
Finite automata
3、Deterministic FA (DFA)
Note: 1) In a DFA, no state has an ε-transition;
2) In a DFA, for each state s and input symbol a, there is at most one edge labeled a leaving s;
3) To describe an FA, we use the transition graph or transition table;
4) A DFA accepts an input string x if and only if there is some path in the transition graph from the start state to some accepting state whose edge labels spell out x.
e.g. DFA M=({0,1,2,3},{a,b},move,0,{3})
move: move(0,a)=1 move(0,b)=2 move(1,a)=3 move(1,b)=2
move(2,a)=1 move(2,b)=3 move(3,a)=3 move(3,b)=3
Transition table
state   a   b
0       1   2
1       3   2
2       1   3
3       3   3
Transition graph
[Figure: 0 goes to 1 on a and to 2 on b; 1 goes to 3 on a and to 2 on b; 2 goes to 1 on a and to 3 on b; 3 loops on a and b; state 3 is accepting]
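
The DFA M above can be simulated with a loop over the input that looks up move for each symbol. The code below is a sketch with hypothetical names (dfa_accepts, accepting); only the move table itself is taken from the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_STATES 4

/* move[s][0] is the successor of s on 'a', move[s][1] the successor on 'b'. */
static const int move[NUM_STATES][2] = {
    /* 0 */ { 1, 2 },
    /* 1 */ { 3, 2 },
    /* 2 */ { 1, 3 },
    /* 3 */ { 3, 3 },
};

/* Only state 3 is accepting. */
static const bool accepting[NUM_STATES] = { false, false, false, true };

bool dfa_accepts(const char *input) {
    assert(input != NULL);
    int state = 0;                                          /* start state */
    for (; *input != '\0'; input++) {
        if (*input != 'a' && *input != 'b') return false;   /* not in the alphabet */
        state = move[state][*input - 'a'];
    }
    return accepting[state];
}
```

Tracing a few inputs by hand shows that M reaches state 3 exactly when the input contains aa or bb, for example aa and abb are accepted while abab is not.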
e.g. Construct a DFA M which accepts the strings that begin with a or b, or begin with c and contain at most one a.
[Figure: 0 goes to 1 on a and b, and to 2 on c; 1 loops on a, b and c; 2 loops on b and c and goes to 3 on a; 3 loops on b and c]
So, the DFA is
M=({0,1,2,3},{a,b,c},move,0,{1,2,3})
move: move(0,a)=1 move(0,b)=1
move(0,c)=2 move(1,a)=1
move(1,b)=1 move(1,c)=1
move(2,a)=3 move(2,b)=2
move(2,c)=2 move(3,b)=3
move(3,c)=3
Definition of an Automaton

• An automaton is defined as a system where energy, materials and information are transformed, transmitted and used for performing some function without direct participation of man. Examples are automatic machines, automatic packing machines, and automatic photo printing machines.
• In computer science the term “automaton” means “discrete automaton” and is defined in a more abstract way as follows:
[Figure: an automaton box with inputs I1, I2, I3, I4, outputs O1, O2, O3, O4 and internal states q1, q2, q3, …, q4]
Its characteristics are:
1. Input: (I1, I2, I3, I4, …, In), a finite number of fixed values from the input alphabet ∑.
2. Output: (O1, O2, O3, …, On), a finite number of fixed values from the output set O.
3. State: at any instant of time the automaton can be in one of the states q1, q2, …, qn.
4. State relation: the next state of an automaton at any instant of time is determined by the present state and the present input.
5. Output relation
Description of a finite automata

• Analytically, a finite automaton (FA) can be represented by a 5-tuple (Q, ∑, δ, q0, F) where
– Q is a finite nonempty set of states;
– ∑ is a finite nonempty set of input symbols called the input alphabet;
– δ: Q × ∑ → Q is the transition function;
– q0 ∈ Q is the initial state; and
– F ⊆ Q is the set of final states. There may be more than one final (accepting) state.
• In diagrams, the accepting states will be denoted by double circles
Block Diagram of Finite Automata
• The finite-state automaton connected to a Read/Write
head.
• It has one tape which is divided into a number of cells.

[Figure: a tape divided into cells of finite length, holding a1 a2 a3 b b b, with a reading head connected to the finite-state automaton]
11/13/2017 Dr. Deepak K. Sinha, JIT, JU, Ethiopia
• Each cell stores only one symbol.
• The input and the output from the finite state automaton are effected by the R/W head, which can examine one cell at a time.
• In one move the machine examines the present symbol under the R/W head on the tape and the present state of the automaton to determine:
– a new symbol to be written on the tape in the cell under the R/W head.
– a motion of the R/W head along the tape: the head moves one cell left (L) or right (R).
– The next state of the automaton.
– Whether to halt or not.
Transition Systems
• A Transition Graph or a Transition System is a finite
directed labeled graph in which each vertex (or
node) represents a state and the directed edges
indicate the transition of a state and the edges are
labeled with input/output.
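[Figure: a transition system with states q0 and q1 and edges labeled 0/0, 1/1, 1/0 and 0/0]

A Transition System
[Figure: a state diagram over {0, 1} shown reading the input 0111 one symbol at a time, with 111, 11 and 1 remaining after successive steps]
Read string left to right.

The machine accepts a string if the process ends in a double circle.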
A Deterministic Finite Automaton (DFA)

[Figure: a DFA over {0, 1} with states q0, q1, q2 and q3; annotations point out the states, the start state (q0) and the accept states (F)]

The machine accepts a string if the process ends in a double circle.
Acceptability of String by a FA
• Consider the finite state machine whose transition function δ is given below. Here Q={q0, q1, q2, q3}, ∑={0,1}, F={q0}. Give the entire sequence of states for the input string 110101.

        inputs
state   0    1
q0      q2   q1
q1      q3   q0
q2      q0   q3
q3      q1   q2
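
The state sequence asked for here can be computed mechanically. The sketch below encodes the transition table as an array and records every state visited; the names delta, trace and is_final are chosen for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* delta[q][x]: successor of state q on input symbol x (0 or 1). */
static const int delta[4][2] = {
    /* q0 */ { 2, 1 },
    /* q1 */ { 3, 0 },
    /* q2 */ { 0, 3 },
    /* q3 */ { 1, 2 },
};

/* Store the visited states in path, start state first. Returns the number of
   states stored, or 0 if the input has a bad symbol or path is too small. */
size_t trace(const char *input, int *path, size_t capacity) {
    assert(input != NULL && path != NULL);
    size_t n = 0;
    int state = 0;                     /* q0 is the initial state */
    if (capacity == 0) return 0;
    path[n++] = state;
    for (; *input != '\0'; input++) {
        if ((*input != '0' && *input != '1') || n == capacity) return 0;
        state = delta[state][*input - '0'];
        path[n++] = state;
    }
    return n;
}

/* F = {q0} */
bool is_final(int state) { return state == 0; }
```

For the input 110101 the recorded path is q0, q1, q0, q2, q3, q1, q0. The last state q0 is in F, so the string is accepted.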
Example
[Figure: q0 loops on 0 and goes to q1 on 1; q1 loops on 1 and goes to q2 on 0; q2 loops on 0,1]

alphabet ∑ = {0, 1}
states Q = {q0, q1, q2}
initial state q0
accepting states F = {q0, q1}
transition function δ:

        inputs
state   0    1
q0      q0   q1
q1      q2   q1
q2      q2   q2
Non-deterministic Finite Automata (NFA)

• A Non-deterministic Finite Automaton (NFA)
– is of course “non-deterministic”
• Implying that the machine can exist in more than one state at the same time
• Transitions could be non-deterministic
• Each transition function therefore maps to a set of states
[Figure: qi goes to qj on 1 and also to qk on 1]
If the automaton is in state q0 and the input symbol is 0, what will be the next state?
[Figure: an NFA fragment with states q0 and q1 and edges labeled 0 and 1]

δ(q0, 0100) = {q0, q3, q4}
[Figure: an NFA over {0, 1} with states q0, q1, q2, q3 and q4]
Non-deterministic Finite Automata (NFA)

• A Non-deterministic Finite Automaton (NFA) consists of:
– Q ==> a finite set of states
– ∑ ==> a finite set of input symbols (alphabet)
– q0 ==> a start state
– F ==> set of final states
– δ ==> a transition function, which is a mapping between Q x ∑ ==> subset of Q
• An NFA is also defined by the 5-tuple:
– {Q, ∑ , q0,F, δ }
ND- Finite automata
4、Non-deterministic FA (NFA)
Note: 1) In an NFA, the same character can label two or more transitions out of one state;
2) In an NFA, ε is a legal input symbol.
3) A DFA is a special case of an NFA
4) An NFA accepts an input string x if and only if there is some path in the transition graph from the start state to some accepting state whose edge labels spell out x. A path can be represented by a sequence of state transitions called moves.
5) The language defined by an NFA is the set of input strings it accepts
e.g. An NFA M=({q0,q1},{0,1},move,q0,{q1})

        input
state   0        1
q0      q0       q1
q1      q0, q1   q0

[Figure: transition graph of M with states q0 and q1 and edges labeled 0 and 1]
Regular expression: (0+1)*01(0+1)*

NFA for strings containing 01

Why is this non-deterministic?
What will happen if at state q1 an input of 0 is received?
[Figure: start → q0; q0 loops on 0,1 and goes to q1 on 0; q1 goes to q2 on 1; q2 is the final state and loops on 0,1]

• Q = {q0,q1,q2}
• ∑ = {0,1}
• start state = q0
• F = {q2}
• Transition table

        symbols
state   0         1
q0      {q0,q1}   {q0}
q1      Φ         {q2}
*q2     {q2}      {q2}
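
One way to run an NFA like this one is to track the set of states it could be in after each symbol. The sketch below represents a set of states as a bit mask; StateSet, nfa_delta, step and nfa_accepts are names invented for this example, and the table entries are the ones shown above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A set of states is a bit mask: bit i is set when qi is in the set. */
typedef unsigned StateSet;
#define Q0 (1u << 0)
#define Q1 (1u << 1)
#define Q2 (1u << 2)
#define NFA_STATES 3

/* nfa_delta[q][x]: set of successors of qi on symbol x; 0 is the empty set. */
static const StateSet nfa_delta[NFA_STATES][2] = {
    /* q0 */ { Q0 | Q1, Q0 },
    /* q1 */ { 0,       Q2 },
    /* q2 */ { Q2,      Q2 },
};

/* Union of the successor sets of every state in current. */
StateSet step(StateSet current, int symbol) {
    StateSet next = 0;
    for (int q = 0; q < NFA_STATES; q++)
        if (current & (1u << q)) next |= nfa_delta[q][symbol];
    return next;
}

bool nfa_accepts(const char *input) {
    assert(input != NULL);
    StateSet current = Q0;                                   /* start in {q0} */
    for (; *input != '\0'; input++) {
        if (*input != '0' && *input != '1') return false;
        current = step(current, *input - '0');
    }
    return (current & Q2) != 0;                              /* some reachable state is final */
}
```

When q1 receives a 0 its successor set is empty, so that branch of the computation simply dies while the copy that stayed in q0 carries on.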
But, DFAs and NFAs are equivalent in their power to capture languages!
Differences: DFA vs. NFA
• DFA
1. All transitions are deterministic
– Each transition leads to exactly one state
2. For each state, transition on all possible symbols (alphabet) should be defined
3. Accepts input if the last state is in F
4. Sometimes harder to construct because of the number of states
5. Practical implementation is feasible
• NFA
1. Some transitions could be non-deterministic
– A transition could lead to a subset of states
2. Not all symbol transitions need to be defined explicitly (if undefined will go to a dead state – this is just a design convenience, not to be confused with “non-determinism”)
3. Accepts input if one of the last states is in F
4. Generally easier than a DFA to construct
5. Practical implementation has to be deterministic (convert to DFA) or in the form of parallelism
Equivalence of DFA & NFA

NFA to DFA construction: Example

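• L = {w | w ends in 01}
NFA: [Figure: q0 loops on 0,1 and goes to q1 on 0; q1 goes to q2 on 1; q2 is accepting]
DFA: [Figure: {q0} loops on 1 and goes to {q0,q1} on 0; {q0,q1} loops on 0 and goes to {q0,q2} on 1; {q0,q2} goes to {q0,q1} on 0 and to {q0} on 1; {q0,q2} is accepting]

δN            0         1
q0            {q0,q1}   {q0}
q1            Ø         {q2}
*q2           Ø         Ø

δD (all subsets)   0         1
Ø                  Ø         Ø
{q0}               {q0,q1}   {q0}
{q1}               Ø         {q2}
*{q2}              Ø         Ø
{q0,q1}            {q0,q1}   {q0,q2}
*{q0,q2}           {q0,q1}   {q0}
*{q1,q2}           Ø         {q2}
*{q0,q1,q2}        {q0,q1}   {q0,q2}

δD (reachable states only)   0         1
{q0}                         {q0,q1}   {q0}
{q0,q1}                      {q0,q1}   {q0,q2}
*{q0,q2}                     {q0,q1}   {q0}

0. Enumerate all possible subsets
1. Determine transitions
2. Retain only those states reachable from {q0}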
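
The construction can also be done lazily: instead of enumerating all 2^n subsets and pruning afterwards, start from {q0} and create a subset only when some transition reaches it. The sketch below takes that lazy route, which is a variation on the three steps listed above, not the slide's exact procedure. The names Dfa, move_set, index_of and build_dfa are assumptions of this example, and the NFA table is the one for L = {w | w ends in 01}.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned StateSet;              /* bit i set <=> qi is in the set */
#define NFA_STATES 3
#define MAX_DFA    (1u << NFA_STATES)   /* at most 2^n subsets */

static const StateSet delta_n[NFA_STATES][2] = {
    /* q0 */ { 0x3, 0x1 },   /* {q0,q1}, {q0} */
    /* q1 */ { 0x0, 0x4 },   /* empty,   {q2} */
    /* q2 */ { 0x0, 0x0 },   /* empty,   empty */
};

typedef struct {
    StateSet subset[MAX_DFA];   /* NFA states making up each DFA state */
    int next[MAX_DFA][2];       /* DFA transitions, as indices into subset */
    int count;                  /* number of reachable DFA states */
} Dfa;

/* Union of the NFA successors of every state in s on the given symbol. */
StateSet move_set(StateSet s, int symbol) {
    StateSet out = 0;
    for (int q = 0; q < NFA_STATES; q++)
        if (s & (1u << q)) out |= delta_n[q][symbol];
    return out;
}

/* Find the index of subset s, appending it when it has not been seen yet. */
int index_of(Dfa *d, StateSet s) {
    for (int i = 0; i < d->count; i++)
        if (d->subset[i] == s) return i;
    assert(d->count < (int)MAX_DFA);
    d->subset[d->count] = s;
    return d->count++;
}

/* Lazy subset construction: only subsets reachable from {q0} are created. */
void build_dfa(Dfa *d) {
    assert(d != NULL);
    d->count = 0;
    index_of(d, 0x1);                              /* start state {q0} */
    for (int i = 0; i < d->count; i++)             /* count grows as new subsets appear */
        for (int sym = 0; sym < 2; sym++)
            d->next[i][sym] = index_of(d, move_set(d->subset[i], sym));
}
```

For this NFA the loop discovers {q0}, {q0,q1} and {q0,q2} in that order and then stops, which matches the three reachable rows of the δD table. A DFA state is accepting when its subset contains q2 (bit 0x4).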
Finite automata
6、 Minimizing the number of States of a DFA
a) Basic idea
Find all groups of states that can be distinguished by some input string. At the beginning of the process, we assume two distinguishable groups of states: the group of non-accepting states and the group of accepting states. Then we repeatedly partition the existing groups into smaller groups: two states stay in the same group only if, on every input symbol, their transitions lead into the same group.
• e.g. Minimize the following DFA.
[Figure: a DFA over {a, b} with states 0 to 6; states 3, 4, 5 and 6 are accepting]
• 1. Initialization: ∏0={{0,1,2},{3,4,5,6}}
• 2.1 For non-accepting states in ∏0:
– a: move({0,2},a)={1}; move({1},a)={3}. 1 and 3 are not in the same subgroup of ∏0.
– So, ∏1′={{1},{0,2},{3,4,5,6}}
– b: move({0},b)={2}; move({2},b)={5}. 2 and 5 are not in the same subgroup of ∏1′.
– So, ∏1″={{1},{0},{2},{3,4,5,6}}
2.2 For accepting states in ∏0:
– a: move({3,4,5,6},a)={3,6}, which is a subset of {3,4,5,6} in ∏1″
– b: move({3,4,5,6},b)={4,5}, which is a subset of {3,4,5,6} in ∏1″
– So, ∏1={{1},{0},{2},{3,4,5,6}}.
3. Apply step (2) again to ∏1, and get ∏2.
– ∏2={{1},{0},{2},{3,4,5,6}}=∏1,
– So, ∏final=∏1
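4. Let state 3 represent the state group {3,4,5,6}.
So, the minimized DFA is:
[Figure: 0 goes to 1 on a and to 2 on b; 1 goes to 3 on a and to 2 on b; 2 goes to 1 on a and to 3 on b; 3 loops on a and b]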
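
The partitioning procedure can be sketched as repeated refinement of group numbers. Two loud assumptions apply. First, the transition table below is a reconstruction: the moves out of states 0, 1 and 2 follow the move sets quoted in the steps and the minimized DFA, while the individual moves of states 3 to 6 are assumed, chosen so that move({3,4,5,6},a)={3,6} and move({3,4,5,6},b)={4,5} hold. Second, the code refines on both symbols at once, so its intermediate partitions differ from the ∏1′ and ∏1″ shown above even though the final partition is the same. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define N 7   /* states 0..6 */

/* ASSUMED transition table (see lead-in): { successor on a, successor on b }. */
static const int next_state[N][2] = {
    /* 0 */ { 1, 2 }, /* 1 */ { 3, 2 }, /* 2 */ { 1, 5 },
    /* 3 */ { 3, 4 }, /* 4 */ { 6, 5 }, /* 5 */ { 6, 5 }, /* 6 */ { 3, 4 },
};
static const bool accepting[N] = { false, false, false, true, true, true, true };

/* Fill group[s] with the group number of state s in the final partition and
   return the number of groups. */
int minimize(int *group) {
    assert(group != NULL);
    int count = 2;   /* initial partition: non-accepting (0) and accepting (1) */
    for (int s = 0; s < N; s++) group[s] = accepting[s] ? 1 : 0;

    for (;;) {
        int refined[N], new_count = 0;
        for (int s = 0; s < N; s++) {
            refined[s] = -1;
            /* s joins an earlier state t if both are in the same group and
               their successors fall in the same groups on a and on b. */
            for (int t = 0; t < s; t++) {
                if (group[t] == group[s] &&
                    group[next_state[t][0]] == group[next_state[s][0]] &&
                    group[next_state[t][1]] == group[next_state[s][1]]) {
                    refined[s] = refined[t];
                    break;
                }
            }
            if (refined[s] < 0) refined[s] = new_count++;   /* s starts a new group */
        }
        bool stable = (new_count == count);   /* no group was split in this pass */
        for (int s = 0; s < N; s++) group[s] = refined[s];
        count = new_count;
        if (stable) return count;
    }
}
```

On the assumed table the result has four groups, {0}, {1}, {2} and {3,4,5,6}, in agreement with ∏final.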
Construction of Finite Automata Equivalent to A RE
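• Construct an FA equivalent to the regular expression: (0+1)*(00+11)(0+1)*
[Figure, step 1: a single edge from q0 to qf labeled (0+1)*(00+11)(0+1)*]
[Figure, step 2: the edge is split at the concatenations into three consecutive edges labeled (0+1)*, (00+11) and (0+1)*]
[Figure, step 3: the starred parts become loops labeled (0+1)* attached through Λ-moves, with states q0, q5, q1, q2, q6 and qf; the middle edge is still labeled (00+11)]
[Figure, step 4: the edge (00+11) is replaced by two parallel paths, 0 0 through q7 and 1 1 through q8]
[Figure, step 5: the Λ-moves are eliminated, leaving loops labeled (0+1)* at q5 and qf joined by the paths 0 0 and 1 1]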