
CHAPTER 2

Lexical Analysis
(Scanning)
1. THE ROLE OF LEXICAL ANALYZER
source          +--------------------+   tokens   +-------------------+
program  --->   |  lexical analyzer  | ---------> |  syntax analyzer  |
                |     (scanner)      |            |     (parser)      |
                +--------------------+            +-------------------+
                          \                                /
                         +----------------------------------+
                         |       symbol table manager       |
                         +----------------------------------+
 Main task: to read input characters and group them into
“tokens.”
 Secondary tasks:
 Skip comments and whitespace;
 Correlate error messages with source program (e.g., line number of error).
Different approaches for Implementing Lexical
Analyzers:
 Using a scanner generator, e.g., lex or flex. This automatically
generates a lexical analyzer from a high-level description of the tokens.
(easiest to implement; least efficient)
 Programming it in a language such as C, using the I/O facilities of the
language.
(intermediate in ease, efficiency)
 Writing it in assembly language and explicitly managing the input.
(hardest to implement, but most efficient)
 token: a name for a set of input strings with related
structure.
Example: “identifier,” “integer constant”
 pattern: a rule describing the set of strings
associated with a token.
Example: “a letter followed by zero or more letters, digits, or
underscores.”
 lexeme: the actual input string that matches a
pattern.
Example: count
Examples
Input: count = 123
Tokens:
identifier : Rule: “letter followed by …”
Lexeme: count
assg_op : Rule: =
Lexeme: =
integer_const : Rule: “digit followed by …”
Lexeme: 123
 If more than one lexeme can match the pattern for a
token, the scanner must indicate the actual lexeme
that matched.
 This information is given using an attribute
associated with the token.
Example: The program statement
count = 123
yields the following token-attribute pairs:
<identifier, pointer to the string "count">
<assg_op, no attribute needed>
<integer_const, the integer value 123>
2. Input Buffering Scheme

As described above, the three implementation choices (a scanner generator such as lex or flex, hand-coding in a language like C, and hand-coding in assembly with explicit input management) are listed in order of increasing difficulty for the implementer or compiler writer, and increasing efficiency of the resulting scanner.
 Lexical analyzer performance (speed) is crucial, since:
 This is the only part of the compiler that examines the entire input
program one character at a time.
 Disk input can be slow.
 The scanner can account for a considerable 25-30% of total compile
time.
 The LA has to look ahead to determine when a match has been
found before it can announce a token.
 Scanners (LAs) use an input-buffering technique called
double-buffering to minimize the overhead of identifying tokens quickly.
Input Buffering scheme with Sentinels
 Objective: Optimize the common case by reducing
the number of tests to one per advance of fwd.
 Idea: Extend each buffer half to hold a sentinel at
the end.
 This is a special character that cannot occur in a
program (e.g., EOF).
 It signals the need for some special action (fill
other buffer-half, or terminate processing).
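The sentinel scheme can be sketched in C. This is a minimal simulation, not the slides' code: disk input is simulated from a string, N is kept tiny for illustration, and '\0' stands in for the EOF sentinel (assumed not to occur in the source program):

```c
#include <string.h>

#define N 8                 /* tiny buffer half, for illustration */
#define SENTINEL '\0'       /* stands in for EOF; assumed absent from input */

static char buf[2 * (N + 1)];   /* two halves, each with room for a sentinel */
static char *fwd;               /* the forward pointer */
static const char *input;       /* simulated disk input */

/* Fill one buffer half from the simulated input, then place the sentinel. */
static void load_half(char *half) {
    size_t n = strlen(input) < N ? strlen(input) : N;
    memcpy(half, input, n);
    input += n;
    half[n] = SENTINEL;
}

static void init(const char *s) {
    input = s;
    load_half(buf);
    fwd = buf;
}

/* Return the next character, or -1 at end of input.  In the common case
   there is exactly one test per advance of fwd: the sentinel check. */
static int next_char(void) {
    char c = *fwd++;
    if (c == SENTINEL) {
        if (fwd - 1 == buf + N) {                 /* end of first half */
            load_half(buf + N + 1);               /* fill other half */
            fwd = buf + N + 1;
            c = *fwd++;
            if (c == SENTINEL) return -1;
        } else if (fwd - 1 == buf + 2 * N + 1) {  /* end of second half */
            load_half(buf);
            fwd = buf;
            c = *fwd++;
            if (c == SENTINEL) return -1;
        } else {
            return -1;    /* sentinel mid-half: real end of input */
        }
    }
    return c;
}
```

A sentinel inside a half (rather than at its boundary) signals true end of input; a sentinel at a boundary triggers a refill of the other half, so both special actions from the slide are distinguished by where the sentinel is found.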
3. Specification of Tokens: regular expressions
Terminology:
alphabet : a finite set of symbols
string : a finite sequence of alphabet symbols
language : a (finite or infinite) set of strings.
Regular Expressions
A pattern notation for describing certain kinds
of sets over strings:
Given an alphabet Σ:
 ε is a regular exp. (denotes the language {ε})
 for each a ∈ Σ, a is a regular exp. (denotes the language {a})
 if r and s are regular exps. denoting L(r) and L(s)
respectively, then so are:
 (r) | (s) ( denotes the language L(r) ∪ L(s) )
 (r)(s) ( denotes the language L(r)L(s) )
 (r)* ( denotes the language L(r)* )
4. FINITE AUTOMATA – NFA and DFA
Finite Automata
A finite automaton is a 5-tuple (Q, Σ, T, q0, F), where:
 Σ is a finite alphabet;
 Q is a finite set of states;
 T: Q × Σ → Q is the transition function;
 q0 ∈ Q is the initial state; and
 F ⊆ Q is a set of final states.
NFA with ε-transitions for the RE (a|b)*abb
5. From Regular Expressions to NFA
The following algorithm (Thompson's construction) is used to construct
an NFA for a given RE.
Construct an NFA for the RE (a|b)*abb:
First decompose the given complex RE into simple REs.
The final NFA is obtained after applying the algorithm to each piece
and combining the results.
6. Conversion from NFA to DFA
Construct an NFA with ε-transitions for the RE (a|b)*abb and convert
it into a DFA.
Start from the NFA accepting the strings denoted by the given RE.
First consider the starting state 0 of the NFA, and then
compute ε-closure(0), which is the starting state of the
DFA and is a set of states taken from the NFA.
The DFA after conversion accepts the set of strings
denoted by the RE (a|b)*abb.
Simulating a DFA
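A table-driven simulation of the DFA for (a|b)*abb might look like the following sketch (the state numbering 0-3 with accepting state 3 is one standard encoding, assumed here rather than taken from the slides' diagram):

```c
/* Transition table for the DFA accepting (a|b)*abb.
   Rows are states 0..3; columns: [0] = 'a', [1] = 'b'.
   State 3 is the only accepting state. */
static const int trans[4][2] = {
    {1, 0},   /* state 0: no progress toward "abb" yet */
    {1, 2},   /* state 1: have seen "a"  */
    {1, 3},   /* state 2: have seen "ab" */
    {1, 0},   /* state 3: have seen "abb" (accepting) */
};

/* Simulate the DFA: s = move(s, c) for each input character,
   then check whether the final state is accepting. */
int accepts(const char *w) {
    int s = 0;
    for (; *w; w++) {
        if (*w != 'a' && *w != 'b') return 0;  /* not in the alphabet */
        s = trans[s][*w - 'a'];
    }
    return s == 3;
}
```

The simulation costs one table lookup per input character, which is why DFA-based scanners are fast in practice.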
7. Recognition of tokens
Consider the language generated by the following grammar, whose tokens
are to be recognized by the lexical analyzer. The identifier-recognizing
portion of the transition diagram is simulated by the states below:
State 9:  c = nextchar();
          if LETTER(c) then STATE = 10 else FAIL();
State 10: c = nextchar();
          if LETTER(c) or DIGIT(c) then STATE = 10
          else if OTHER(c) then STATE = 11
          else FAIL();
State 11: return (getToken(), install_ID())   /* retract one character */
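The transition-diagram pseudocode above can be hand-coded in C roughly as follows. This is a sketch: the function name and the convention of returning the identifier's length (0 meaning FAIL) are assumptions for illustration, and install_ID() is left as a comment since the symbol table is not shown here:

```c
#include <ctype.h>

/* Hand-coded transition diagram for identifiers (states 9-11):
   state 9 requires a letter, state 10 loops on letters and digits,
   and state 11 stops on any other character.  Returns the length of
   the identifier starting at s, or 0 on failure (FAIL()). */
int match_identifier(const char *s) {
    const char *p = s;

    /* State 9: first character must be a letter. */
    if (!isalpha((unsigned char)*p)) return 0;   /* FAIL() */
    p++;

    /* State 10: consume letters and digits. */
    while (isalnum((unsigned char)*p)) p++;

    /* State 11: OTHER(c) seen.  The retract is implicit because the
       delimiter was tested but not consumed; install_ID() would run here. */
    return (int)(p - s);
}
```

Note how the "retract" in state 11 disappears in the hand-coded version: the loop simply stops before consuming the delimiting character.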
8. Language for specifying lexical analyzers (Lex, flex)

LEX tool
Consider again the language generated by the grammar above, whose
tokens are to be recognized by the lexical analyzer.
The following is a LEX program that recognizes
tokens of various categories: white space,
identifiers, numbers, relational operators, and the
keywords if, then, else.
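Such a specification typically looks like the following sketch (the token codes and exact patterns are assumptions modeled on the standard textbook example, not taken from these slides):

```lex
%{
/* Illustrative token codes; a real scanner shares these with the parser. */
#define IF 1
#define THEN 2
#define ELSE 3
#define ID 4
#define NUMBER 5
#define RELOP 6
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { /* skip whitespace; no token returned */ }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { return ID; }
{number}  { return NUMBER; }
"<"|"<="|"="|"<>"|">"|">="  { return RELOP; }
%%
```

Because lex resolves conflicts by preferring the longest match and then the earliest rule, the keyword patterns are listed before {id} so that "if" is recognized as a keyword rather than an identifier.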
9. Design of a scanner or lexical analyzer generator
First construct an NFA for each pattern Pi in the LEX program.
Then construct a compound NFA that recognizes all strings
represented by all of the patterns.
In the next step, convert this compound NFA to a DFA for
recognizing the tokens by the LA.

Pattern P1, Pattern P2, Pattern P3 (NFA diagrams omitted)

Converting the above compound NFA to a DFA:
The starting state A of the DFA is composed of the NFA states
ε-closure(0) = {0, 1, 3, 7}.
The remaining DFA states B, C, D, E, F are formed in the same way, by
taking the ε-closure of the NFA states reachable from an existing DFA
state on each input symbol (the full subset-construction table is omitted).
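The ε-closure computation used to form DFA state A can be sketched as follows (the adjacency-matrix encoding of ε-transitions is an assumption for illustration; a real generator would use edge lists, and the test below mirrors a start state 0 with ε-edges to the three pattern NFAs, giving ε-closure(0) = {0, 1, 3, 7}):

```c
#define MAX_STATES 16

/* eps[s][t] != 0 means there is an epsilon-transition s -> t. */
static int eps[MAX_STATES][MAX_STATES];

/* Depth-first computation of epsilon-closure(s): mark in closure[]
   every state reachable from s using only epsilon-transitions. */
static void eps_closure(int s, int closure[MAX_STATES]) {
    if (closure[s]) return;       /* already visited */
    closure[s] = 1;
    for (int t = 0; t < MAX_STATES; t++)
        if (eps[s][t])
            eps_closure(t, closure);
}
```

Each DFA state is such a set of NFA states; the subset construction repeatedly applies move() followed by this closure until no new sets appear.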
