Unit-2.Compiler Design_ Lexical Analysis

Chapter 3 discusses lexical analysis in compiler design, covering the role of lexical analyzers, token specifications, and recognition processes. It explains the importance of separating lexical analysis from parsing for simplicity and efficiency, and introduces concepts such as tokens, patterns, lexemes, and finite automata. The chapter also details error recovery methods, input buffering techniques, and the implementation of lexical analyzers using tools like Lex.

Uploaded by

rashmikant2009

Chapter 3

Lexical Analysis

Compiler Design by Varun Arora 1


Outline
 Role of lexical analyzer
 Specification of tokens
 Recognition of tokens
 Lexical analyzer generator
 Finite automata
 Design of lexical analyzer generator



The role of lexical analyzer
[Figure: Source program -> Lexical Analyzer -> token -> Parser -> to semantic analysis. The parser requests each token by calling getNextToken; both the lexical analyzer and the parser use the symbol table.]


Why separate lexical analysis
and parsing?
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability



Tokens, Patterns and Lexemes
 A token is a pair: a token name and an optional attribute
value
 A pattern is a description of the form that the lexemes
of a token may take
 A lexeme is a sequence of characters in the source
program that matches the pattern for a token



Example
Token       Informal description                     Sample lexemes
if          Characters i, f                          if
else        Characters e, l, s, e                    else
comparison  < or > or <= or >= or == or !=           <=, !=
id          Letter followed by letters and digits    pi, score, D2
number      Any numeric constant                     3.14159, 0, 6.02e23
literal     Anything but ", surrounded by "s         "core dumped"

Example: in the statement below, printf and score are lexemes matching id, and "total = %d\n" is a lexeme matching literal.
printf("total = %d\n", score);


Attributes for tokens
 E = M * C ** 2
 <id, pointer to symbol table entry for E>
 <assign-op>
 <id, pointer to symbol table entry for M>
 <mult-op>
 <id, pointer to symbol table entry for C>
 <exp-op>
 <number, integer value 2>
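The token stream above could be held in memory roughly as follows. This is only a sketch: the names TokenName, SymbolEntry, Token, symtab and example_tokens are illustrative assumptions, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Token names taken from the slide; the enum itself is an assumption. */
typedef enum { TOK_ID, TOK_ASSIGN_OP, TOK_MULT_OP, TOK_EXP_OP, TOK_NUMBER } TokenName;

/* Hypothetical symbol-table entry: it only records the lexeme. */
typedef struct { const char *lexeme; } SymbolEntry;

/* A token is a name plus an optional attribute value. */
typedef struct {
    TokenName name;
    union {
        const SymbolEntry *entry; /* used when name == TOK_ID */
        int value;                /* used when name == TOK_NUMBER */
    } attr;
} Token;

static SymbolEntry symtab[] = { {"E"}, {"M"}, {"C"} };

/* Writes the seven tokens of  E = M * C ** 2  into out[] and returns the count. */
size_t example_tokens(Token out[], size_t cap) {
    assert(out != NULL && cap >= 7);
    out[0] = (Token){ .name = TOK_ID, .attr.entry = &symtab[0] };
    out[1] = (Token){ .name = TOK_ASSIGN_OP };
    out[2] = (Token){ .name = TOK_ID, .attr.entry = &symtab[1] };
    out[3] = (Token){ .name = TOK_MULT_OP };
    out[4] = (Token){ .name = TOK_ID, .attr.entry = &symtab[2] };
    out[5] = (Token){ .name = TOK_EXP_OP };
    out[6] = (Token){ .name = TOK_NUMBER, .attr.value = 2 };
    return 7;
}
```

Operator tokens carry no attribute, so their attr field is simply left zero-initialized.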



Lexical errors
 Some errors are beyond the power of the lexical analyzer to
recognize:
 fi (a == f(x)) …
 (fi is a valid lexeme for id, so only a later phase can detect the misspelled if)
 However, it may be able to recognize errors like:
 d = 2r
 Such errors are recognized when no pattern for tokens
matches a character sequence


Error recovery
 Panic mode: successive characters are ignored until we
reach a well-formed token
 Delete one character from the remaining input
 Insert a missing character into the remaining input
 Replace a character by another character
 Transpose two adjacent characters
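Panic mode is the simplest of these strategies to code. The sketch below assumes, purely for illustration, that a token can only begin with a letter, a digit or one of a few operator characters; can_start_token and panic_mode_skip are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: only these characters can begin a token. */
static int can_start_token(char c) {
    if (c == '\0') return 0;
    return isalnum((unsigned char)c) || strchr("<>=!+-*/();_", c) != NULL;
}

/* Panic mode: ignore characters from position pos onward until one is found
   that can start a token. Returns the new position, which may be the end
   of the input. */
size_t panic_mode_skip(const char *input, size_t pos) {
    assert(input != NULL);
    while (input[pos] != '\0' && !can_start_token(input[pos]))
        pos++;
    return pos;
}
```

The other four strategies (delete, insert, replace, transpose) try to repair the input instead of discarding it, and need more bookkeeping than shown here.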



Input Buffering
 How can reading of the source program be sped up?
 The lexical analyzer often has to look one additional character ahead
 e.g.
 to see the end of an identifier you must see a character
 which is not a letter or a digit
 and so is not a part of the lexeme for id
 in C, the characters -, = and < could each be the start of a longer operator:
 ->, ==, <=
 two-buffer scheme that handles large lookaheads safely
 sentinels: an improvement which saves time checking
buffer ends


Buffer pairs
 Buffer size N
 N = size of a disk block (e.g. 4096 bytes)
 read N characters into a buffer
 one system call
 not one call per character
 if fewer than N characters are read, eof has been encountered
 two pointers to the input are maintained
 lexemeBegin: marks the beginning of the current lexeme
 forward: scans ahead until a pattern match is found


Sentinels
 Each time the forward pointer advances, two tests are needed
 to test if it is at the end of the buffer
 to determine what character is read (multiway branch)
 sentinel
 added at each buffer end
 cannot be part of the source program
 character eof is a natural choice
 retains its role as the marker for the end of the entire input
 when it appears other than at the end of a buffer it means
that the input is at an end


Sentinels
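[Figure: buffer pair holding E = M * C ** 2, with an eof sentinel at the end of each buffer and a further eof marking the end of the input]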

switch (*forward++) {
    case eof:
        if (forward is at end of first buffer) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if (forward is at end of second buffer) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    cases for the other characters;
}
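A compilable sketch of the same idea is shown here under several assumptions: the input comes from an in-memory string rather than a file, the buffer size N is shrunk to 4 so that reloads happen often, and '\0' stands in for eof. The names Scanner, reload, scanner_init and next_char are illustrative, and the lexemeBegin pointer is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 4            /* tiny on purpose; the slides suggest a disk block */
#define SENTINEL '\0'  /* stands in for eof; the source must not contain it */

typedef struct {
    const char *src;        /* unread source text (stands in for the file) */
    char buf[2 * (N + 1)];  /* two halves of N characters plus a sentinel slot each */
    char *forward;          /* next character to read */
} Scanner;

/* Copies up to N characters into one half and ends them with a sentinel. */
static void reload(Scanner *s, int half) {
    char *base = s->buf + half * (N + 1);
    size_t n = strlen(s->src);
    if (n > N) n = N;
    memcpy(base, s->src, n);
    base[n] = SENTINEL;  /* at base[N] for a full half, earlier at end of input */
    s->src += n;
}

void scanner_init(Scanner *s, const char *text) {
    assert(s != NULL && text != NULL);
    s->src = text;
    reload(s, 0);
    s->forward = s->buf;
}

/* Returns the next character, or SENTINEL once the input is exhausted. */
char next_char(Scanner *s) {
    for (;;) {
        char c = *s->forward++;
        if (c != SENTINEL) return c;              /* common case: a single test */
        if (s->forward == s->buf + N + 1) {       /* sentinel ended the first half */
            reload(s, 1);                         /* forward already points at the second half */
        } else if (s->forward == s->buf + 2 * (N + 1)) { /* sentinel ended the second half */
            reload(s, 0);
            s->forward = s->buf;
        } else {                                  /* sentinel inside a half: end of input */
            s->forward--;                         /* stay on the sentinel for later calls */
            return SENTINEL;
        }
    }
}
```

The buffer-end checks run only when a sentinel is actually read, so ordinary characters cost one comparison each.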

Specification of tokens
 In the theory of compilation, regular expressions are used
to formalize the specification of tokens
 Regular expressions are a means of specifying regular
languages
 Example:
 letter_ (letter_ | digit)*
 Each regular expression is a pattern specifying the
form of strings



Regular expressions
 ε is a regular expression, L(ε) = {ε}
 If a is a symbol in Σ then a is a regular expression, L(a)
= {a}
 (r) | (s) is a regular expression denoting the language
L(r) ∪ L(s)
 (r)(s) is a regular expression denoting the language
L(r)L(s)
 (r)* is a regular expression denoting (L(r))*
 (r) is a regular expression denoting L(r)


Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn

 Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*


Extensions
 One or more instances: (r)+
 Zero or one instance: (r)?
 Character classes: [abc]

 Example:
 letter_ -> [A-Za-z_]
 digit -> [0-9]
 id -> letter_ (letter_ | digit)*
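To make the id definition concrete, here is a small hand-written matcher for it. It is only a sketch; the function names are assumptions, and a generated lexer would normally do this work.

```c
#include <assert.h>
#include <stddef.h>

/* letter_ -> [A-Za-z_] */
static int is_letter_(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

/* digit -> [0-9] */
static int is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* Returns the length of the longest prefix of s that matches
   id -> letter_ (letter_ | digit)*, or 0 if no prefix matches. */
size_t match_id(const char *s) {
    assert(s != NULL);
    if (!is_letter_(s[0])) return 0;
    size_t n = 1;
    while (is_letter_(s[n]) || is_digit(s[n])) n++;
    return n;
}
```

For instance, match_id applied to the text score = 1 consumes the five characters of score and stops at the blank.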



Recognition of tokens
 Starting point is the language grammar to understand
the tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt

expr -> term relop term
| term
term -> id
| number



Recognition of tokens (cont.)
 The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
 We also need to handle whitespaces:
ws -> (blank | tab | newline)+



Transition diagrams
 Transition diagram for relop



Transition diagrams (cont.)
 Transition diagram for reserved words and identifiers



Transition diagrams (cont.)
 Transition diagram for unsigned numbers
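The diagram itself is not reproduced in this text. As a rough stand-in, the sketch below codes the number pattern (digits, an optional fraction, an optional exponent) directly, remembering the last accepting position so that it can fall back when an optional part turns out to be incomplete. The function names are assumptions, and only an uppercase E is treated as the exponent marker, following the pattern.

```c
#include <assert.h>
#include <stddef.h>

static int is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* Returns the length of the longest prefix of s that is an unsigned number,
   or 0 if s does not start with a digit. */
size_t match_number(const char *s) {
    assert(s != NULL);
    size_t i = 0;
    if (!is_digit(s[i])) return 0;
    while (is_digit(s[i])) i++;
    size_t accepted = i;                      /* integer part is a number */

    if (s[i] == '.' && is_digit(s[i + 1])) {  /* optional fraction */
        i++;
        while (is_digit(s[i])) i++;
        accepted = i;
    }

    if (s[i] == 'E') {                        /* optional exponent */
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (is_digit(s[j])) {
            while (is_digit(s[j])) j++;
            accepted = j;
        }
        /* otherwise fall back: the E is not part of the number */
    }
    return accepted;
}
```

Falling back to the last accepted position plays the role of retracting the forward pointer in a transition diagram.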



Transition diagrams (cont.)
 Transition diagram for whitespace



Architecture of a transition-diagram-based lexical analyzer
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
            case 0: c = nextchar();
                    if (c == '<') state = 1;
                    else if (c == '=') state = 5;
                    else if (c == '>') state = 6;
                    else fail(); /* lexeme is not a relop */
                    break;
            case 1: …
            …
            case 8: retract();
                    retToken.attribute = GT;
                    return(retToken);
        }
    }
}
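Since the slide elides states 1 to 7, the following self-contained sketch does not try to reproduce them. It recognizes the same relop lexemes (<, >, <=, >=, =, <>) with a plain switch; the Relop enum, the NOT_RELOP value and the function name match_relop are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NOT_RELOP, LT, LE, EQ, NE, GT, GE } Relop;

/* Recognizes the longest relop at the start of s and stores its length in *len.
   Returns NOT_RELOP with *len == 0 when s does not start with a relop. */
Relop match_relop(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1;            /* like retract(): the lookahead is not consumed */
        return LT;
    case '=':
        *len = 1;
        return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1;            /* like retract() in state 8 of the slide */
        return GT;
    default:
        *len = 0;            /* plays the role of fail() */
        return NOT_RELOP;
    }
}
```

Returning a length instead of moving a shared forward pointer keeps the sketch free of global state.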

Lexical Analyzer Generator - Lex
Lex source program (lex.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
Input stream -> a.out -> Sequence of tokens


Structure of Lex programs

declarations
%%
translation rules, each of the form: Pattern {Action}
%%
auxiliary functions


Example
%{
/* definitions of manifest constants
LT, LE, EQ, NE, GT, GE,
IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID); }
{number} {yylval = (int) installNum(); return(NUMBER);}
%%
int installID() {/* function to install the
    lexeme, whose first character is
    pointed to by yytext, and whose
    length is yyleng, into the symbol
    table and return a pointer thereto
    */
}

int installNum() { /* similar to
    installID, but puts numerical
    constants into a separate table */
}


Finite Automata
 Regular expressions = specification
 Finite automata = implementation

 A finite automaton consists of
 An input alphabet Σ
 A set of states S
 A start state n
 A set of accepting states F ⊆ S
 A set of transitions state --input--> state


Finite Automata
 Transition
s1 --a--> s2
 Is read
In state s1 on input “a” go to state s2

 If end of input
 If in accepting state => accept, otherwise => reject
 If no transition possible => reject


Finite Automata State Graphs
[Figure: the graphical symbols used for each of the following]
 A state
 The start state
 An accepting state
 A transition (labeled with its input symbol, here a)


A ASimple Example
finite automaton that accepts only “1”

 A finite automaton accepts a string if we can follow


transitions labeled with the characters in the string
from the start to some accepting state

Compiler Design by Varun Arora 31


Another Simple Example
 A finite automaton accepting any number of 1’s
followed by a single 0
 Alphabet: {0,1}

 Check that “1110” is accepted but “110…” is not
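The check can also be done mechanically. The sketch below hand-codes a two-state automaton for this language; the state names START and DONE and the function name are assumptions, since the slide's state graph is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if s is any number of 1's followed by a single 0, else 0. */
int accepts_ones_then_zero(const char *s) {
    assert(s != NULL);
    enum { START, DONE } state = START;   /* DONE is the only accepting state */
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (state == START && s[i] == '1')
            state = START;                /* loop on 1 */
        else if (state == START && s[i] == '0')
            state = DONE;                 /* the single 0 */
        else
            return 0;                     /* no transition possible => reject */
    }
    return state == DONE;                 /* accept only if input ends in DONE */
}
```

Any input that continues after the 0 is rejected because DONE has no outgoing transitions.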



And Another Example
 Alphabet {0,1}
 What language does this recognize?
[Figure: state graph with transitions labeled 0 and 1]


And Another Example
 Alphabet still { 0, 1 }
[Figure: state graph of a nondeterministic automaton]
 The operation of the automaton is not completely
defined by the input
 On input “11” the automaton could be in either state


Epsilon Moves
 Another kind of transition: ε-moves
A --ε--> B
 Machine can move from state A to state B
without reading input


Deterministic and
Nondeterministic Automata
 Deterministic Finite Automata (DFA)
 One transition per input per state
 No ε-moves
 Nondeterministic Finite Automata (NFA)
 Can have multiple transitions for one input in a given
state
 Can have ε-moves
 Finite automata have finite memory
 Need only to encode the current state


Execution of Finite Automata
 A DFA can take only one path through the state graph
 Completely determined by input

 NFAs can choose
 Whether to make ε-moves
 Which of multiple transitions for a single input to take


Acceptance of NFAs
 An NFA can get into multiple states
[Figure: NFA state graph with transitions labeled 0 and 1]
 Input: 1 0 1
 Rule: NFA accepts if it can get into a final state


NFA vs. DFA (1)
 NFAs and DFAs recognize the same set of languages
(regular languages)

 DFAs are easier to implement
 There are no choices to consider


NFA vs. DFA (2)
 For a given language the NFA can be simpler than the
DFA
[Figure: an NFA and an equivalent DFA, with transitions labeled 0 and 1]
 DFA can be exponentially larger than NFA


Regular Expressions to Finite
Automata
 High-level sketch

Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA


Regular Expressions to NFA (1)
 For each kind of rexp, define an NFA
 Notation: NFA for rexp A
 For ε: start state --ε--> accepting state
 For input a: start state --a--> accepting state


Regular Expressions to NFA (2)
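 For AB: the accepting state of the NFA for A is joined to the start state of the NFA for B by an ε-move
 For A | B: a new start state has ε-moves to the NFAs for A and for B, and each of them has an ε-move to a new accepting state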


Regular Expressions to NFA (3)
 For A*
[Figure: the NFA for A with added ε-moves that allow it to be skipped or repeated]


Example of RegExp -> NFA
conversion
 Consider the regular expression
(1 | 0)*1
 The NFA is
[Figure: NFA with states A to J; C --1--> E, D --0--> F and I --1--> J, all other transitions are ε-moves]


Next
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA


NFA to DFA. The Trick
 Simulate the NFA
 Each state of resulting DFA
= a non-empty subset of states of the NFA
 Start state
= the set of NFA states reachable through ε-moves from
NFA start state
 Add a transition S --a--> S’ to DFA iff
 S’ is the set of NFA states reachable from the states in S
after seeing the input a
 considering ε-moves as well
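The trick can be tried out on the NFA for (1 | 0)*1 with states A to J. The ε-edges listed below are an assumption, read off the usual construction and chosen so that the resulting sets agree with the DFA states ABCDHI, FGABCDHI and EJGABCDHI; sets are stored as bitmasks and all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum { A, B, C, D, E, F, G, H, I, J, NSTATES };
#define BIT(s) (1u << (s))

/* Assumed NFA edges: eps[s] = states reached from s by one ε-move,
   on0[s] / on1[s] = states reached from s by reading 0 / 1. */
static const unsigned eps[NSTATES] = {
    [A] = BIT(B) | BIT(H), [B] = BIT(C) | BIT(D), [E] = BIT(G),
    [F] = BIT(G), [G] = BIT(A) | BIT(H), [H] = BIT(I)
};
static const unsigned on0[NSTATES] = { [D] = BIT(F) };
static const unsigned on1[NSTATES] = { [C] = BIT(E), [I] = BIT(J) };

/* All states reachable from the set through ε-moves (including the set itself). */
unsigned closure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & BIT(s)) set |= eps[s];
    } while (set != prev);
    return set;
}

/* States reachable from the set after seeing symbol (0 or 1), ε-moves included. */
unsigned move(unsigned set, int symbol) {
    unsigned out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & BIT(s)) out |= symbol ? on1[s] : on0[s];
    return closure(out);
}

/* Fills dstates[] with the reachable sets and trans[i][a] with the index of
   the successor of dstates[i] on input a. Returns the number of DFA states. */
size_t subset_construction(unsigned dstates[], size_t trans[][2], size_t cap) {
    size_t n = 0;
    assert(cap >= 1);
    dstates[n++] = closure(BIT(A));           /* DFA start state */
    for (size_t i = 0; i < n; i++) {          /* n grows as new sets appear */
        for (int a = 0; a < 2; a++) {
            unsigned t = move(dstates[i], a);
            size_t k = 0;
            while (k < n && dstates[k] != t) k++;
            if (k == n) { assert(n < cap); dstates[n++] = t; }
            trans[i][a] = k;
        }
    }
    return n;
}
```

For this particular NFA no move ever produces the empty set; a general version would have to decide whether to keep an empty set as an explicit dead state.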
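
NFA -> DFA Example
[Figure: the NFA for (1 | 0)*1 with states A to J; C --1--> E, D --0--> F and I --1--> J, all other transitions are ε-moves]
DFA states and transitions:
ABCDHI --0--> FGABCDHI
ABCDHI --1--> EJGABCDHI
FGABCDHI --0--> FGABCDHI
FGABCDHI --1--> EJGABCDHI
EJGABCDHI --0--> FGABCDHI
EJGABCDHI --1--> EJGABCDHI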


NFA to DFA. Remark
 An NFA may be in many states at any time

 How many different states?

 If there are N states, the NFA must be in some subset
of those N states

 How many non-empty subsets are there?
 2^N - 1 = finitely many, but exponentially many


Implementation
 A DFA can be implemented by a 2D table T
 One dimension is “states”
 Other dimension is “input symbols”
 For every transition Si --a--> Sk define T[i,a] = k
 DFA “execution”
 If in state Si and input a, read T[i,a] = k and skip to state
Sk
 Very efficient


Table Implementation of a DFA
[Figure: DFA with states S, T and U; every state goes to T on input 0 and to U on input 1]

     0    1
S    T    U
T    T    U
U    T    U
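The table translates almost directly into C. In the sketch below the start state S comes from the table's first row, while treating U as the accepting state is an assumption (the figure that would show it is missing); run_dfa and accepts are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

enum { S, T, U, NUM_STATES };

/* table[i][a]: next state from state i on input symbol a (0 or 1). */
static const int table[NUM_STATES][2] = {
    [S] = { T, U },
    [T] = { T, U },
    [U] = { T, U },
};

/* Runs the DFA over input and returns the final state,
   or -1 if the input contains a character other than 0 or 1. */
int run_dfa(const char *input) {
    assert(input != NULL);
    int state = S;
    for (size_t i = 0; input[i] != '\0'; i++) {
        if (input[i] != '0' && input[i] != '1') return -1;
        state = table[state][input[i] - '0'];  /* one table lookup per character */
    }
    return state;
}

/* Assumption: U is the only accepting state. */
int accepts(const char *input) {
    return run_dfa(input) == U;
}
```

Under that assumption the automaton accepts exactly the strings of 0's and 1's that end in 1.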



Implementation (Cont.)
 NFA -> DFA conversion is at the heart of tools such as
flex or jflex

 But, DFAs can be huge

 In practice, flex-like tools trade off speed for space in
the choice of NFA and DFA representations


Readings
 Chapter 3 of the book
