
Chapter Two

Lexical Analysis
Basic Topics in Chapter Two

 Introduction of lexical analysis
 Function of lexical analysis
 Role of the lexical analyzer
 Lexeme, token, and patterns
 Finite automata
o Basic terms: alphabet, empty string, strings, and languages
• Types of finite automata (DFA and NFA)
 Equivalence of DFA and NFA
 Minimizing the number of states of a DFA
 Regular languages
 Regular expressions
 From regular expressions to finite automata
 Specification of tokens
 Recognition of tokens
 Implementation of a lexical analyzer
: Chapter-Two: Lexical Analysis
2.1. Introduction of lexical analysis
 Lexical analysis is the first phase of a compiler. It is also called scanning.
 It takes the modified source code from language preprocessors, written in the
form of sentences.
 The lexical analyzer breaks this source code into a series of tokens, removing any
whitespace and comments in the source code.
 A token describes a pattern of characters having some meaning in the source program,
such as identifiers, operators, keywords, numbers, delimiters, etc.

 If the lexical analyzer finds an invalid token, it generates an error.

 Lexical analysis can be implemented with deterministic finite automata (DFAs).

 The output is a sequence of tokens that is sent to the parser for syntax analysis.
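The pipeline above (read characters, skip whitespace and comments, group the rest into tokens, report illegal input) can be sketched as a small tokenizer. The token classes and their regular expressions here are illustrative choices, not a fixed standard:

```python
import re

# Illustrative token specification: (token name, regular expression).
# Order matters: keywords must be tried before the identifier rule.
TOKEN_SPEC = [
    ("WHITESPACE", r"[ \t\n]+"),                  # skipped, not emitted
    ("COMMENT",    r"/\*.*?\*/"),                 # skipped, not emitted
    ("KEYWORD",    r"\b(?:int|return|if|else)\b"),
    ("NUMBER",     r"\d+"),
    ("ID",         r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"[+\-*/=<>?:]"),
    ("DELIMITER",  r"[(){},;]"),
]
MASTER_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.S)

def tokenize(source):
    """Return a list of (token_name, lexeme) pairs, dropping whitespace/comments."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER_RE.match(source, pos)
        if m is None:
            # No pattern matches: the lexical analyzer reports an error.
            raise SyntaxError(f"illegal character {source[pos]!r} at position {pos}")
        if m.lastgroup not in ("WHITESPACE", "COMMENT"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, `tokenize("int a = 10;")` yields one `(name, lexeme)` pair per token, ready to be handed to a parser one at a time.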



2.2. Function of Lexical Analysis
 The main responsibilities of the lexical analyzer / why lexical analysis is important:
o Scans and reads the source code character by character

o Tokenization, i.e. grouping input characters into valid tokens

o Eliminates comments and whitespace (blanks, newlines, tabs, and perhaps other
characters that are used to separate tokens in the input)
o Reports error messages together with the line number where each error occurs

o Retrieves and updates the symbol table, which stores identifiers, etc.



2.3 The Role of a Lexical Analyzer

[Figure: The lexical analyzer reads the source program character by character (the entire program is first read into memory), putting characters back when it reads ahead too far. On each "get next token" request it passes a token and its attribute value to the parser, and both phases consult the symbol table.]



Lexical Analysis Versus Parsing

 There are a number of reasons why the analysis portion of a compiler is normally

separated into lexical analysis and parsing (syntax analysis) phases.

i. Simplicity of design

ii. Compiler efficiency is improved.

iii. Compiler portability is enhanced.



Tokens, Patterns, and Lexemes
 When discussing lexical analysis, we use three related but distinct terms:

 Those terms are:


i. Tokens

ii. Lexemes

iii. Patterns



i. Tokens
 What’s a token?
 A class or category of the input stream
 Or: a set of strings matching the same pattern
 For example, in English: noun, verb, adjective, …
 In a programming language: identifier, operator, keyword, whitespace, …

 The lexical analyzer scans the source program and produces as output a sequence of tokens, which are
normally passed, one at a time to the parser.

 A token is a pair consisting of a token name and an optional attribute value.


o The token name is an abstract symbol representing a kind of lexical unit,
o The token names are the input symbols that the parser processes.
 In what follows, we shall generally write the name of a token in boldface.
 We will often refer to a token by its token name.



What are Tokens For?

 Classify program substrings according to role

 Output of lexical analysis is a stream of tokens . . .which is input to the parser.

 Parser relies on token distinctions

 An identifier is treated differently than a keyword



Typical Tokens in a PL
 Tokens correspond to sets of strings.
o Symbols/operators: +, -, *, /, =, <, >, ->, …
o Keywords: if, else, while, struct, float, int, begin, …
o Integer and real (floating-point) literals: 123, 123.45, 123E+4, 123.45E-6 (the exponent sign may be + or -)
o Character (string) literals: a, g, f, 7, 8, +, $, 3, *, #
o Identifiers: strings of letters or digits, starting with a letter:
o letter (letter | digit)*
o Comments: e.g. /* statement */
o Whitespace: a non-empty sequence of blanks, newlines, and tabs

 Examples of non-tokens:
o Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.



ii. Lexemes
 Each time the lexical analyzer returns a token to the parser, it has an associated
lexeme
 A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token. i.e. the sequence of characters of a token
iii. Patterns
 Each token has a pattern that describes which sequences of
characters can form the lexemes corresponding to that token.
 A pattern is a description of the form that the lexemes of a token may take. i.e. A rule that
describes a set of strings
 In the case of a keyword as a token, the pattern is just the sequence of characters that form
the keyword.
 The set of words, or strings of characters, that match a given pattern is called a language.
Example #1 (Tokens, Patterns and Lexemes)

 Source program: int MAX(int a, int b)

Lexeme    Token
int       keyword
MAX       identifier
(         operator
int       keyword
a         identifier
,         operator
int       keyword
b         identifier
)         operator
Example #2 (Tokens, Patterns and Lexemes)

 The next example gives some typical tokens, their informally described patterns, and some
sample lexemes.
 To see how these concepts are used in practice, consider this statement in a programming language:

printf("Total = %d\n", score);

 Both printf and score are lexemes matching the pattern for token id, and

"Total = %d\n" is a lexeme matching the pattern for token literal.



Example #3 (Tokens, Patterns and Lexemes)

Token      Sample Lexemes       Pattern

keyword    if                   if

id         abc, n, count, …     letter (letter | digit)*

NUMBER     3.14, 1000           numerical constant

;          ;                    ;



Example #4

 Consider the program:

int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}

All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '=' '10' ';' 'return' '0' ';' '}'



Exercise #1
 Consider the program:

int max (x, y)
int x, y;
{
/* find max of x and y */
return (x>y ? X:y)
}

Q1. How many tokens are in this source code?
Q2. Identify the lexemes, tokens, and patterns for the given source code.

Exercise #2
Q1. Source code:
Printf (“%d has”, $ X);
How many tokens are in this source code?

Q2. Consider the source code:

int max (int a, int b){
if(a>b)
return a;
else
return b;
}

•How many tokens are in this source code?
•Identify the lexemes and tokens.
• Attributes for Tokens
 When more than one lexeme can match a pattern,

 the lexical analyzer must provide the subsequent compiler phases additional
information about the particular lexeme that matched.
 For example, the pattern for token number matches both 0 and 1,

 but it is extremely important for the code generator to know which lexeme was
found in the source program.



 Attributes for Tokens(cont’d)
 LA returns to the parser not only a token name,

 but an attribute value that describes the lexeme represented by the token;

 the token name influences parsing decisions, while the attribute value influences
translation of tokens after the parse.
 Normally, information about an identifier — e.g., its lexeme, its type, and

 the location at which it is first found — is kept in the symbol table.



Example: Attributes for Tokens
E = M * C ** 2

Token      Attribute

id         <id, pointer to symbol-table entry for E>

operator   <assign-op>

id         <id, pointer to symbol-table entry for M>

operator   <mult-op>

id         <id, pointer to symbol-table entry for C>

operator   <exp-op>

NUM        <number, integer value 2>
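The table above can be reproduced by a small sketch that emits <token, attribute> pairs. The operator names and the use of a symbol-table index as the "pointer" are illustrative simplifications, not the only possible design:

```python
def scan_with_attributes(lexemes):
    """Turn a pre-split lexeme list into <token, attribute> pairs.

    Identifiers carry an index ("pointer") into a symbol table; operators
    carry the operator kind; numbers carry their integer value.
    """
    symtab = []                 # symbol table: one entry per distinct identifier
    ops = {"=": "assign-op", "*": "mult-op", "**": "exp-op"}
    pairs = []
    for lx in lexemes:
        if lx in ops:
            pairs.append(("operator", ops[lx]))
        elif lx.isdigit():
            pairs.append(("number", int(lx)))
        else:                   # treat anything else as an identifier
            if lx not in symtab:
                symtab.append(lx)
            pairs.append(("id", symtab.index(lx)))
    return pairs, symtab
```

Running it on the lexemes of `E = M * C ** 2` produces one pair per token and a symbol table holding E, M, and C.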



Error detection and Recovery in Compiler
 In this phase of compilation,

 all possible errors made by the user are detected and reported to the user in the
form of error messages.
 This process of locating errors and reporting them to the user is called the error
handling process.
 Functions of Error handler
 Detection,
 Reporting,
 Recovery



Classification of Errors

Compile time errors are of three types



Lexical phase errors
 These errors are detected during the lexical analysis phase. Typical lexical errors are:
o Exceeding the allowed length of identifiers or numeric constants

o Appearance of illegal characters

o Unmatched strings

Example:
#1 : printf("Geeksforgeeks");$
This is a lexical error since an illegal character $ appears at the end of statement.
#2 : This is a comment */
 This is a lexical error, since the end of a comment is present but the beginning is not.



Example #2
 Some errors are beyond the power of the lexical analyzer to recognize:

 for example: fi (a == f(x)) …


• However, it may be able to recognize errors like: d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence.
 A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
 Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id
to the parser and let some other phase of the compiler,
 probably the parser in this case, handle the error due to the transposition of the letters.


Error recovery Strategies

 Successive characters are ignored until we reach a well-formed token

 Delete one character from the remaining input

 Insert a missing character into the remaining input

 Replace a character by another character

 Transpose (change the order) two adjacent characters
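The first of these strategies, skipping characters until a well-formed token can start (panic mode), can be sketched as follows. The token patterns are illustrative, and real scanners would also track line numbers:

```python
import re

# Patterns for the tokens this toy scanner knows: identifiers, numbers, whitespace.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[ \t\n]+")

def tokenize_panic_mode(source):
    """Panic-mode recovery: any character that cannot start a well-formed
    token is recorded as an error and skipped, and scanning resumes at
    the next character instead of aborting."""
    tokens, errors, pos = [], [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m:
            if not m.group().isspace():   # whitespace separates tokens only
                tokens.append(m.group())
            pos = m.end()
        else:
            errors.append((pos, source[pos]))   # report the illegal character
            pos += 1                            # ...and delete it (recovery)
    return tokens, errors
```

So an input like `count$$total` still yields the two identifiers while reporting the two stray `$` characters.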



Finite automata?
• What is a finite automaton?
• One of the powerful models of computation, which is a restricted model of an actual
computer, is called a finite automaton.
• These machines are very similar to the CPU of a computer.

• They are a restricted model, as they lack memory.

• A finite automaton is called finite


– because the number of possible states and the number of letters in the alphabet are both finite, and
an automaton because the change of state is totally governed by the input.

Q. Why are abstract models required?

A: to simplify things.
Basic terms used in automaton theory
• Everything in mathematics is based on symbols. This is also true for automata theory.

 Alphabet: a finite, non-empty set of symbols.


» Denoted by ∑ (Sigma)

• Examples of alphabets:

• The set of decimal digits: ∑ = {0,1,2,3,4,5,6,7,8,9}

• All lowercase and uppercase letters: ∑ = {a, b, …, z, A, B, …, Z}

• The binary alphabet: ∑ = {0, 1}

• The set of ASCII characters: 2^7 = 128 symbols


 Strings:
– A string over an alphabet is a finite sequence of symbols from that alphabet,
usually written next to one another and not separated by commas.
• Examples of strings:
1. If Σa = {0, 1}, then 001001 is a string over Σa.
2. If Σb = {a, b, …, z}, then axyrpqstcd is a string over Σb.

 Empty String:
– The string of zero length. It is denoted by ε (epsilon).
– The empty string plays the role of 0 in a number system.
– The set of all strings, including the empty string, over an alphabet ∑ is denoted by ∑*; ∑⁺ = ∑* − {ε}
– E.g. {ε} is the set containing only the empty string, whose length is zero;
– {a} is a set containing one string whose length is 1.
 Length of String:
– The number of positions for symbols in the string.
– A string is generally denoted by w.
– The length of string is denoted by |w|.
– Example: |10011| = 5

 Suffix: If w = xv for some x, then v is a suffix of w.


 Prefix: If w = vy for some y, then v is a prefix of w

 Reverse String: If w= w1 w2…… wn where each wi ∈ Σ, the reverse of w is wn wn-1….w1.


 Substring: z is a substring of w if z appears consecutively within w.
As an example, ‘deck’ is a substring of ‘abcdeckabcjkl’



 Kleene Star:
– Another language operation is the “Kleene star” (or Kleene closure) of a language
L, which is denoted by L*.
– For an alphabet ∑, ∑* is the set of all strings obtained by concatenating zero or more symbols from ∑.

– Example:

– {0,1}* = {ε, 0, 1, 00, 01, 10, 11, …}

Note: ∑* = ∑⁰ ∪ ∑¹ ∪ ∑² ∪ …

thus ∑* = ∑⁺ ∪ {ε}




 Language:
– Any set of strings over an alphabet Σ is called a language.
– The set of all strings, including the empty string, over an alphabet Σ is denoted Σ*.
– If ∑ is an alphabet and L ⊆ ∑*, then L is said to be a language over the alphabet ∑.
– For example, the language of all strings consisting of n 0’s followed by n 1’s for
some n ≥ 0: {ε, 01, 0011, 000111, …}
– Languages in set form: {w | some logical property of w}
• e.g. L = {aⁿbⁿ | n ≥ 1}
– Infinite languages L are denoted as:
L = {w ∈ Σ* : w has property P}


 Power of an Alphabet:
– If ∑ is an alphabet, we can express the set of all strings of a certain length from that
alphabet by using exponential notation.
– ∑ᵏ is the set of strings of length k, each of whose symbols is in ∑.

– ∑⁰ = {ε}, regardless of what alphabet ∑ is. That is, ε is the only string of length 0.

– If ∑ = {0,1} then
– ∑¹ = {0,1}
– ∑² = {00,01,10,11}

– ∑³ = {000,001,…,111} and so on
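The exponential notation can be computed directly for small k; a sketch using the standard library:

```python
from itertools import product

def power(sigma, k):
    """Compute Sigma^k: the set of all strings of length k over alphabet sigma."""
    return {"".join(symbols) for symbols in product(sorted(sigma), repeat=k)}
```

For the binary alphabet, `power({"0", "1"}, 2)` gives the four strings of length 2, and k = 0 correctly yields the singleton set containing only the empty string.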
 Concatenation Language:
– Assume a string x of length m and string y of length n, the concatenation of x and y
is written xy, which is the string obtained by appending y to the end of x, as in x1
x2 …x m y1 y2….. yn.
– To concatenate a string with itself many times we use the “superscript” notation: wᵏ
denotes w concatenated with itself k times.



Example #1
a. Given ∑ ={a, b} obtain ∑*.
b. Given examples of a finite language in ∑.
c. Given L ={anbn :n≥0}, check if the string aabb, aaaabbbb, abb are in the language L
• Solution:
a. ∑= {a,b} therefore we have ∑*={∈, a, b, ab, aa, bb, aaa, bbb,…..}
b. {a, aa, aab} is an example of a finite language in ∑
Solution c
– i. aabb → a string in L.(n =2)
– ii. aaaabbbb → a string in L.(n = 4)
– iii. abb → not a string in L (since there is no satisfying this condition)



Example #2
• Given: L = {aⁿbⁿ⁺¹ : n ≥ 0}. Is it true that L = L* for this language L?

• Solution:

– We know L* = L⁰ ∪ L¹ ∪ L² ∪ … and L⁺ = L¹ ∪ L² ∪ L³ ∪ …

– Now for the given L = {aⁿbⁿ⁺¹ : n ≥ 0}, we have

– L¹ = L = {b, ab², a²b³, …}

– But L⁰ = {ε}, and ε ∉ L; likewise b·b = bb ∈ L² but bb ∉ L.
– Hence it is NOT true that L = L*.
Formal language vs Natural language
• Formal language-
– is a set of strings, each string composed of symbols from a finite symbol set called an
alphabet (i.e., a language defined in a mathematical fashion).
– For example, the alphabet for the sheep language is the set ∑ = {a, b, !}. Given a model m (such as a
particular FSA),
– we can use L(m) to mean “the formal language characterized by m”: L(m) = {baa!, baaa!, …}

• Natural language:
– Which is a kind of language that real people speak
– Formal language is different from natural language
– We use a formal language to model part of a natural language such as parts of the phonology,
morphology or syntax
Types of Finite automata



a. Deterministic Finite Automata (DFA)
• Deterministic means that at each point in processing there is always exactly one next
state (there is no choice point).
• What is a DFA?

– It is the simplest model of computation, because each input fully determines its behavior

– It has a very limited memory

– In a DFA, given the current state, the move to the next state (the transition from one
state to another) is uniquely determined by the current configuration.
– If the internal state, input, and contents of the storage are known, it is possible to predict the
future behavior of the automaton.
– If this holds, the automaton is a DFA; otherwise it is an NFA.
Definition of DFA

• A DFA is defined by a 5-tuple M = (Q, ∑, δ, q0, F), where:
– Q is a finite set of internal states
– ∑ is a finite set of symbols called the input alphabet
– δ: Q × ∑ → Q is the transition function
– q0 ∈ Q is the initial state
– F ⊆ Q is the set of final states



DFA description, or how it works
• The input mechanism can move only from left to right and reads exactly one symbol
on each step and reads symbols of the input word in sequence.
• The transition from one internal state to another is governed by the transition
function δ.
• A sequence of states q0,q1,q2,...., qn, where qi∈ Q such that q0 is the start state and
qi = δ(qi-1,ai) for 0 < i ≤ n, is a run of the automaton on an input word w =
a1,a2,...., an ∈ Σ.

• If δ(q0 ,a ) = q1, then if the DFA is in state q0 and the current input symbol is a, the
DFA will go into state q1.
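The run just described can be simulated directly. In this sketch the transition function δ is stored as a dictionary from (state, symbol) to state (which doubles as a transition-table representation), and a missing entry means the DFA rejects; the example DFA over {0,1}, accepting strings that end in 1, is an illustrative choice:

```python
def run_dfa(delta, start, finals, word):
    """Simulate a DFA: feed `word` one symbol at a time through the
    transition function `delta` (a dict: (state, symbol) -> state)."""
    state = start
    for symbol in word:
        if (state, symbol) not in delta:
            return False            # undefined transition: reject
        state = delta[(state, symbol)]
    return state in finals          # accept iff we halt in a final state

# Illustrative DFA over {0,1} accepting exactly the strings that end in 1.
delta_end1 = {("q0", "0"): "q0", ("q0", "1"): "q1",
              ("q1", "0"): "q0", ("q1", "1"): "q1"}
```

A run is then just `run_dfa(delta_end1, "q0", {"q1"}, "0101")`, which follows the unique sequence of states q0, q0, q1, q0, q1.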
Example



DFA Representations

a. Transition diagram
– a diagram consisting of circles to represent states and directed line segments to
represent transitions between the states.
– One or more actions (outputs may be associated with each transition).

– If any state q in Q is the starting state, then it is represented by a circle with an
incoming arrow.

– Nodes corresponding to accepting states are marked by a double circle.



DFA Representations…
b. Transition Table
 A state transition table is a table showing which state a finite state machine will
move to, based on the current state and other inputs.
 Example:



DFA Representations…

c. Transition detailed descriptions


– The description of state transitions by statement (words) is known as transition
detailed descriptions.
•Example:



Example: build a sheep talk recognizer
• The sheep language contains any string from the following infinite set:
baaa!
baaaa!
baaaaa! ………
• FSAs (graph notation)
• Directed graph:
– Finite set of vertices
– A set of directed links between pairs of vertices (aka arcs)



How the machine reads input strings
• The machine starts in the start state (q0), and iterates the following process:
– Check the next letter of the input.

– If it matches the symbol on an arc leaving the current state, then cross that arc, move to
the next state, and also advance one symbol in the input.
– If we are in the accepting state (q4) when we run out of input, the machine has
successfully recognized an instance of sheeptalk.
– If the machine never gets to the final state, either because it runs out of input, or it gets
some input that doesn’t match an arc, or if it just happens to get stuck in some non-final
state, we say the machine REJECTS or fails to accept an input.



Example #1
• Q1. Determine the DFA schematic M = (Q, ∑, δ, q, F) , where Q ={q1, q2,q3}, ∑ ={0,1}, q1 is the
start state, F= {q2} and δ is given by the table below. Also determine the language L recognized by
the DFA.

• Solution:

From the given table for δ, the DFA is drawn, where q2 is the only final state. (It is to be noted that a
DFA can “accept” a string and it can “recognize” a language.)
Example #2
Q2. Design a DFA, the language recognized by the automaton being L = {aⁿb : n ≥ 0}
•Solution:

• For the given language L = {aⁿb : n ≥ 0}, the strings could be {b, ab, a²b, a³b, …}

• Therefore the DFA accepts all strings consisting of an arbitrary number of a’s, followed
by a single b. All other input strings are rejected.
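A direct simulation of this DFA can be sketched as follows; the state names and the explicit trap ("dead") state are implementation choices:

```python
def accepts_an_b(word):
    """DFA for L = {a^n b : n >= 0}: loop on 'a' in the start state,
    a single 'b' moves to the accepting state, and any further input
    falls into a trap state and is rejected."""
    state = "start"
    for ch in word:
        if state == "start" and ch == "a":
            state = "start"         # consume another leading 'a'
        elif state == "start" and ch == "b":
            state = "accept"        # the single trailing 'b'
        else:
            state = "dead"          # trap state: reject everything after
    return state == "accept"
```

So `b`, `ab`, `aab`, … are accepted, while the empty string, `ba`, and `abb` are rejected.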



Example #3
• Q3. Determine the languages produced by the FA shown in figs. (a) and (b)

• Solution:
• For ∑ = {a, b}, languages generated = {a,b}* (∈ will be accepted when
initial state equal final state)
• For ∑ = {a, b}, languages generated = {a,b}+ (∈ is not accepted)
Exercise #1
Q1. Design a DFA M which accepts the language L(M) = {w ∈ {a, b}* : w does not
contain three consecutive b’s}.
•Let M = (Q, ∑, δ, q0, F) where Q = {q0, q1, q2, q3}, ∑ = {a, b}, q0 is the initial state,
F = {q0, q1, q2} are the final states, and δ is defined as follows:



Exercise #2
Q2. Given ∑ = {a, b}, construct a DFA that recognizes the language L = {bᵐabⁿ :
m, n > 0}

Q3. Given ∑ = {a, b}, construct a DFA that recognizes the language L = {aᵐbⁿ :
m, n > 0}

Q4. Determine the FA for the following:

a. The set of strings beginning with an ‘a’

b. The set of strings beginning with an ‘a’ and ending with ‘b’

c. The set of strings having ‘aaa’ as a subword


b. Non-Deterministic Finite Automata (NFA)

• An NFA differs from a deterministic automaton in that, for any input symbol,

• a nondeterministic automaton can transit to more than one state.

• In an NFA, given the current state, there could be multiple next states.

• An NFA may also make epsilon (ε) transitions, which consume no input.

• The question is: how do we choose which state to go to on the next input, since there
are multiple possible states?
– In an NFA, the next state may be chosen at random, or all possibilities may be followed in parallel.



Definition of NFA
• An NFA is defined by a 5-tuple M = (Q, ∑, δ, q0, F), where Q, ∑, δ, q0, F are defined as follows:

– Q = finite set of internal states

– ∑ = finite set of symbols called the input alphabet


– q0 ∈ Q is the initial state
– F ⊆ Q is the set of final states

– δ: Q × (∑ ∪ {ε}) → 2^Q is the transition function, a total function from Q × (∑ ∪ {ε}) to 2^Q

– 2^Q (also written P(Q)) is the power set of Q, the set of all subsets of Q


– δ(q, s) is the set of all states p such that there is a transition labeled s from q to p; δ(q, s) is a
function into P(Q) (but not into Q)
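This definition can be turned into a simulator that tracks the *set* of possible current states, taking ε-closures before and after each symbol (ε is encoded as the empty string here, an implementation choice). The example transition function is an NFA for strings over {0,1} with a 1 in the third position from the end:

```python
def eps_closure(states, delta):
    """All states reachable from `states` via epsilon ('') transitions."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for p in delta.get((q, ""), set()):
            if p not in closure:
                closure.add(p)
                stack.append(p)
    return closure

def run_nfa(delta, start, finals, word):
    """Simulate an NFA by following all possible moves in parallel."""
    current = eps_closure({start}, delta)
    for symbol in word:
        moved = set()
        for q in current:
            moved |= delta.get((q, symbol), set())   # union of all moves
        current = eps_closure(moved, delta)
    return bool(current & finals)   # accept if ANY run reaches a final state

# NFA for strings over {0,1} whose third symbol from the end is 1
# (nondeterministic: in q1 on input 1 it can stay or "guess" the final 1).
nfa_delta_third1 = {("q1", "0"): {"q1"}, ("q1", "1"): {"q1", "q2"},
                    ("q2", "0"): {"q3"}, ("q2", "1"): {"q3"},
                    ("q3", "0"): {"q4"}, ("q3", "1"): {"q4"}}
```

A string is accepted exactly when some sequence of guesses ends in q4, matching the acceptance condition in the definition above.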



Difference between NFA and DFA
• An NFA differs from a DFA in that:
– The range of δ in an NFA is the power set 2^Q.

– A string is accepted by an NFA if there is some sequence of possible moves that will
put the machine in a final state at the end of the string.
– In a DFA, for a given state on a given input we reach a deterministic and unique
state.
– In an NFA, we may be led to more than one state for a given input.

– In an NFA, the empty string can label transitions.

– We need to convert an NFA to a DFA when designing a compiler.


NFA: Example #1:
• Obtain an NFA for a language consisting of all strings over {0,1} containing a 1 in
the third position from the end.
• Solution:

• q1 is the initial state

• q4 is the final state.

• Please note that this is an NFA, since from state q1 on input 1 the automaton can
move to more than one state.


NFA: Example #2
• Sketch the NFA state diagram for Solution:



ε- Transitions to Accepting States

An ε-transition can be made at any time


• For example, there are three sequences on the empty string
– No moves, ending in q0, rejecting
– From q0 to q1, accepting
– From q0 to q2, accepting
• Any state with an ε-transition to an accepting state ends up working like an accepting
state too



ε-transitions For NFA Combining

 ε-transitions are useful for combining smaller automata into larger ones
 This machine combines a machine for {a}* and a machine for {b}*
 It uses an ε-transition at the start to achieve the union of the two languages



Example: Incorrect Union



Example: Correct Union



NFA: Exercise #1

• Design an NFA for the language L = (ab ∪ aba)*

• Draw the state diagram for an NFA accepting the language L = (ab)*(ba)* ∪ aa*

• Design an NFA for the language L = {0101ⁿ ∪ 0100 | n ≥ 0}

• Design an NFA to accept strings over a’s and b’s such that the strings end with ‘aa’



EQUIVALENCE OF NFA AND DFA
• Definition:

– Two finite accepters M1 and M2 are equivalent iff

L(M1) = L (M 2)
– i.e., if both accept the same language.

– Both DFA and NFA recognize the same class of languages.

– It is important to note that every NFA has an equivalent DFA.

• Let us illustrate the conversion of NFA to DFA through an example.
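The conversion can also be sketched in code via subset construction, where each DFA state is a *set* of NFA states. This version assumes an ε-free NFA for brevity, and the example NFA (strings over {a, b} ending in a) is an illustrative choice:

```python
def nfa_to_dfa(delta, start, finals, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states.
    (This sketch assumes an epsilon-free NFA.)"""
    start_set = frozenset([start])
    dfa_delta, seen, worklist = {}, {start_set}, [start_set]
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            # The DFA moves from S on a to the union of all NFA moves.
            T = frozenset(p for q in S for p in delta.get((q, a), set()))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    # A subset is final iff it contains at least one NFA final state.
    dfa_finals = {S for S in seen if S & set(finals)}
    return dfa_delta, start_set, dfa_finals

# NFA over {a, b} accepting strings that end in 'a'.
nfa_delta = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}
```

On this NFA the construction discovers only two subset states, {q0} and {q0, q1}, the latter being final because it contains q1.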



Example #1

• Determine a deterministic Finite State Automaton from the


given Nondeterministic FSA.



Solution



Solution….



Example #2
• Given the NFA as shown in Fig. (a), with δ as shown in Fig. (b)

Determine the equivalent DFA for the above given NFA



Solution
• Conversion of the NFA to a DFA is done through subset construction, as shown in the
state table diagram below.

 The corresponding DFA is shown below. Please note that here any subset containing
q2 is the final state.



Exercise #2
• Find the equivalent DFA for the given NFA



Reducing number of states in Finite Automata
• DFA Minimization

– Can we transform a large automaton into a smaller one (provided a smaller one
exists)?

– If A is a DFA, is there an algorithm for constructing an equivalent minimal


automaton Amin from A?



For Example
• A is equivalent to A‘ i.e., L(A) = L(A‘)
A‘ is smaller than A i.e., |A| > |A‘|
• A can be transformed to A‘:
– States 2 and 3 in A “do the same job”: once A is in state 2 or 3, it accepts the same
suffix string. Such states are called equivalent.
– Thus, we can eliminate state 3 without changing the language of A, by redirecting all
arcs leading to 3 to 2, instead.



DFA Minimization using Equivalence Theorem
• If A and B are two states in a DFA, we can combine these two states into {A, B} if
they are not distinguishable.
• Two states A and B are distinguishable if there is at least one string S such that exactly
one of δ(A, S) and δ(B, S) is an accepting state.
• Hence, a DFA is minimal if and only if all of its states are pairwise distinguishable.



DFA Minimization :Algorithm
• Step 1 All the states Q are divided in two partitions - final states and non-final
states and are denoted by P0. All the states in a partition are 0th equivalent. Take a
counter k and initialize it with 0.
• Step 2 Increment k by 1. For each partition in Pk, divide the states in Pk into two
partitions if they are k-distinguishable.
• Two states X and Y within a partition are k-distinguishable if there is an input symbol S
such that δ(X, S) and δ(Y, S) are (k-1)-distinguishable.
• Step 3 If Pk ≠ Pk-1, repeat Step 2, otherwise go to Step 4.
• Step 4 Combine kth equivalent sets and make them the new states of the reduced
DFA.
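The four steps can be sketched as a naive partition-refinement routine (a straightforward transcription of the algorithm above, not the faster Hopcroft algorithm). It assumes a total transition function delta; the example DFA, in which q1 and q2 are equivalent, is an illustrative choice:

```python
def minimize_dfa(states, alphabet, delta, finals):
    """Step 1: split states into {finals, non-finals}; Step 2: refine each
    block by the block its transitions lead to; Steps 3-4: repeat until
    the partition stops changing, then return the blocks as new states."""
    partition = [b for b in (set(finals), set(states) - set(finals)) if b]
    while True:
        def block_of(q):
            return next(i for i, b in enumerate(partition) if q in b)
        refined = []
        for block in partition:
            groups = {}
            for q in block:
                # States with identical signatures stay k-equivalent.
                signature = tuple(block_of(delta[(q, a)]) for a in alphabet)
                groups.setdefault(signature, set()).add(q)
            refined.extend(groups.values())
        if len(refined) == len(partition):   # P_k == P_{k-1}: done
            return refined
        partition = refined

# Example: q1 and q2 are equivalent final states, so they merge.
example_delta = {("q0", "a"): "q1", ("q0", "b"): "q2",
                 ("q1", "a"): "q1", ("q1", "b"): "q1",
                 ("q2", "a"): "q2", ("q2", "b"): "q2"}
```

Each returned block becomes one state of the reduced DFA.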



DFA Minimization :Example
• Let us consider the following DFA -
• We will use the Partitioning Method



Solution
• Start by separating final and non-final states: {1,2}, {0,3,4,5}
• All of 0,3,4 go to final states by a b transition, 5 does not (5 is a dead state). So we
have 3 groups: {1,2}, {0,3,4}, {5}
• An a- transition takes 3 to 5 and 0,4 to 3. So we split {0,3,4} into two groups,
getting: {1,2}, {0,4}, {3}, {5}
• We cannot make any further sub-partitions, and so the minimal DFA is:



Exercise: #3



2.4 Specification of Tokens
 In the theory of compilation, regular expressions are an important notation for specifying
lexeme patterns.
 Regular expressions are a means for specifying regular languages,
o because tokens are specified with the help of REs.

 Example: suppose we wanted to describe the set of valid C identifiers:

 letter (letter | digit)*

 Each regular expression is a pattern specifying the form of strings.

 Although regular expressions cannot express all possible patterns, they are very
effective in specifying those types of patterns that we actually need for tokens.
 Strings and Languages
 An alphabet is any finite set of symbols such as letters, digits, and punctuation.

– The set {0,1} is the binary alphabet

– If x and y are strings, then the concatenation of x and y is also string, denoted xy,

– For example, if x = dog and y = house, then xy = doghouse.


– A string over an alphabet is a finite sequence of symbols drawn from that alphabet.

 In language theory, the terms "sentence" and "word" are often used as synonyms
for "string."

 |s| represents the length of a string s.

 Ex: banana is a string of length 6.
 Strings and Languages (cont’d)
 The empty string is the string of length zero.

 The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.

 A language is any countable set of strings over some fixed alphabet.

 Def. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn
from Σ.
 Let L = {A, …, Z}; then {“A”, “B”, “C”, “BF”, …, “ABZ”, …} is considered a language
defined over L.
 Abstract languages like ∅, the empty set, or {ε}, the set containing only the empty
string, are languages under this definition.
 Terms for Parts of Strings
 The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the
end of s. For example, the prefixes of banana:
 ε, b, ba, ban, bana, banan, banana
2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s. For example, the suffixes of banana:
 banana, anana, nana, ana, na, a, ε
3. A substring of s is obtained by deleting any prefix and any suffix from s.
 For instance, banana, nan, and ε are substrings of banana.
4. A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s.
 For example, baan is a subsequence of banana.
Operations on Languages

Example:
 Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z } and
 let D be the set of digits {0,1,.. .9}.
 L and D are, respectively, the alphabets of uppercase and lowercase letters and
of digits.
 Other languages can be constructed from L and D, using the operators illustrated
above, as shown on the next slide.
 Operations on Languages (cont.)

1. L ∪ D is the set of letters and digits

2. LD is the set of strings consisting of a letter followed by a digit

Ex: A1, a1, B0, etc.

3. L⁴ is the set of all 4-letter strings (ex: aaba, bcef)

4. L* is the set of all strings of letters, including ε, the empty string.

5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.

6. D⁺ is the set of all strings of one or more digits.
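These operations can be checked on small finite stand-ins for L and D (the star and plus closures are infinite, so only the finite operations are shown; the two-element alphabets are illustrative):

```python
# Finite stand-ins for L (letters) and D (digits); the full alphabets
# behave the same way, just with more elements.
L = {"a", "b"}
D = {"0", "1"}

def concat(L1, L2):
    """The concatenation L1 L2: every string of L1 followed by every string of L2."""
    return {x + y for x in L1 for y in L2}

union_LD = L | D            # L U D: letters and digits
LD = concat(L, D)           # a letter followed by a digit
L2 = concat(L, L)           # all 2-letter strings over L, i.e. L^2
```

Note that |LD| = |L| x |D|, since every pairing of a letter with a digit yields a distinct string.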



 Regular Expressions (RE)

 An RE is a special text string for describing a search pattern.

 REs were designed to represent regular languages with a mathematical tool,

 a tool built from a set of primitives and operations.

 This representation involves a combination of strings of symbols from some alphabet
∑, parentheses,
 and the operators +, ⋅ and * (union, concatenation, and Kleene star).



Building Regular Expressions

 Assume that Σ = {a ,b}

 Zero or more: a* means “zero or more a’s”. To say “zero or more ab’s,”
you need to say (ab)*, which denotes {∈, ab, abab, …}.
 One or more: Since a+ means “one or more a’s”, you can use aa* (or
equivalently a*a) to mean “one or more a’s”.
 Zero or one: a? means “zero or one a”; it can be described as an optional ‘a’ with (a + ∈).

Example #1

 A regular expression over the alphabet {0,1} is built from the symbols 0 and 1, the empty
string ∈, and the empty set ∅, combined with the operations +, ⋅ and *

 Languages defined by Regular Expressions

 There is a very simple correspondence between regular expressions and the languages
they denote:

Example #2
 Let Σ = {a, b}
o The RE a|b denotes the set {a, b}

o The RE (a|b)(a|b) denotes {aa, ab, ba, bb}

o The RE a* denotes the set of all strings {є, a, aa, aaa, …}

o The RE (a|b)* denotes the set of all strings containing є and all strings of a’s and
b’s
o The RE a|a*b denotes the set containing the string a and all strings consisting of
zero or more a’s followed by a b
Precedence and Associativity

 *, concatenation, and | are left associative

 * has the highest precedence

 Concatenation (⋅) has the second highest precedence

 | has the lowest precedence

 Regular Definitions
 How to specify tokens?
 Regular definitions

– Let ri be a regular expression and di be a distinct name

– Regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
…
dn → rn

Where each ri is a regular expression over Σ U {d1, d2, …, di-1}


Ex #1:  C identifiers are strings of letters, digits, and underscores.
 The regular definition for the language of C identifiers:
– letter → A | B | C | … | Z | a | b | … | z | _
– digit → 0 | 1 | 2 | … | 9
– id → letter ( letter | digit )*
Ex #2:
 Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or
1.89E-4. in pascal
 The regular definition:
• digit → 0 | 1 | 2 | … | 9
• digits → digit digit*
• optionalFraction → . digits | ∈
• optionalExponent → ( E ( + | - | ∈ ) digits ) | ∈
• number → digits optionalFraction optionalExponent

Example #3: My fax number
 91-(512)-259-7586
 Σ = digits U {-, (, ) }
 country → digit+
 area → ‘(‘ digit+ ‘)’
 exchange → digit+
 phone → digit+
 number → country ‘-’ area ‘-’ exchange ‘-’ phone

Example #4: My email address
 [email protected]
 Σ = letter U {@, . }
 letter → a | b | … | z | A | B | … | Z
 name → letter+
 address → name ‘@’ name ‘.’ name ‘.’ name
2.5 Recognition of Tokens

 In the previous section we learned how to express


patterns using regular expressions.
 Now, we study how to take the patterns for all the
needed tokens and
 build a piece of code that examines the input
string and finds a prefix that is a lexeme matching
one of the patterns.
 Given the grammar of branching statement:

Recognition of Tokens (cont ’d)
 The patterns for the given tokens:

Recognition of Tokens (cont ’d)
 The terminals of the grammar, which are if, then, else, relop, id, and number, are
the names of tokens as used by the lexical analyzer.
 The lexical analyzer also has the job of stripping out whitespace, by recognizing the
"token" ws defined by:  ws → ( blank | tab | newline )+
 Tokens are specified with the help of regular expressions; they are then recognized with
transition diagrams.
Recognition of Tokens: Transition Diagram

 Tokens are recognized with transition diagram:

a. Recognition of Relational operators Ex :

RELOP : < | <= | = | <> | > | >=

b. Recognition of identifier: ID = letter(letter | digit) *

c. Recognition of numbers (integer | floating points)

d. Recognition of delimiter or whitespace Ex: blank, tab, newline

e. Recognition of keywords such as if ,else, for

a. Recognition of Relational operators: Transition Diagram
Ex: RELOP = < | <= | = | <> | > | >=

  state 0 (start): on ‘<’ → state 1;  on ‘=’ → state 5;  on ‘>’ → state 6
  state 1: on ‘=’ → state 2: return(relop, LE)
           on ‘>’ → state 3: return(relop, NE)
           on other → state 4*: return(relop, LT)
  state 5: return(relop, EQ)
  state 6: on ‘=’ → state 7: return(relop, GE)
           on other → state 8*: return(relop, GT)

* indicates input retraction
Mapping transition diagrams into C code

b. Recognition of Identifiers

Ex2: ID = letter(letter | digit) *


Transition Diagram:
  state 9 (start): on letter → state 10
  state 10: on letter or digit → state 10 (loop); on other → state 11*
  state 11: return(id)

* indicates input retraction

Mapping transition diagrams into C code

switch (state) {
case 9:
    c = nextchar();
    if (isletter(c)) state = 10;
    else state = failure();
    break;
case 10:
    c = nextchar();
    if (isletter(c) || isdigit(c)) state = 10;
    else state = 11;
    break;
case 11:
    retract(1);      /* push the lookahead character back */
    insert(id);      /* install the lexeme in the symbol table */
    return;
}
c. Recognition of numbers (integer | floating point)
Unsigned numbers in Pascal are matched by three transition diagrams:

  Accepting integer, e.g. 12
  Accepting float, e.g. 12.31
  Accepting float with exponent, e.g. 12.31E4
Example #1
 The lexeme for a given token must be the longest possible

 Assume input to be 12.34E56

 Starting in the diagram for integers, an accept state would already be reached after 12

 Therefore, the matching should always start with the first transition diagram

 If failure occurs in one transition diagram then retract the forward pointer to the
start state and activate the next diagram.
 If failure occurs in all diagrams then a lexical error has occurred

d. Recognition of delimiter

 Whitespace characters, represented by delim in that diagram

 typically these characters would be blank, tab, newline, and

 perhaps other characters that are not considered by the language design to be part
of any token.

e. Recognition of keyword
 Install the reserved words in the symbol table initially.

 A field of the symbol-table entry indicates that these strings are never ordinary
identifiers, and tells which token they represent.

 Create separate transition diagrams for each keyword; the transition diagram for the

reserved word: then

Exercise #1

Q. Sketch transition diagram for Recognition of keywords such


as: if, else, for

 Tokens, their patterns, and attribute values

 Token information stored in symbol table

 Token contains two components: token name and attribute value.

 For example

LEX (Lexical Analyzer Generator)
• Lex is a program designed to generate scanners, also known as tokenizers, which recognize
lexical patterns in text.
• Lex is an acronym that stands for "lexical analyzer generator."

• The main purpose of LEX is to facilitate lexical analysis, the processing of character
sequences such as source code to produce symbol sequences called tokens for use as input to
other programs such as parsers.
 The input notation for the Lex tool is referred to as the Lex language and the tool itself is the
Lex compiler.
 The Lex compiler transforms the input patterns into a transition diagram and generates code, in
a file called lex.yy.c, that simulates this transition diagram.
How Lex tool works?

How Lex tool works?...

• Figure 3.22 suggests how Lex is used:

– First, the programmer writes a Lex specification lex.l in the Lex language.

– Then the Lex compiler runs on lex.l and produces a C program lex.yy.c.

– Finally, the C compiler compiles lex.yy.c and produces an object program a.out.

– a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.

Structure of Lex Programs
 A Lex program is separated into three sections by %% delimiters.

 The general format of Lex source is as follows:


%{
declarations
%}
%%
translation rules // form: pattern {action}
%%
auxiliary functions
Declarations sections

 Declarations of ordinary C variables, constants and Libraries.

%{

#include<math.h>

#include<stdio.h>

#include<stdlib.h>

%}

Translation Rules
 It contains regular expressions and code segments.
 Translation rules define statements of the form
pattern1 {action1}
pattern2 {action2}
…
patternN {actionN}
• Where

– pattern describes the regular expression or regular definition

– action describes what action the lexical analyzer should take when pattern
pi matches a lexeme. For example:

Auxiliary functions

 This section holds additional functions which are used in actions.

 These functions are compiled separately and loaded with lexical analyzer.

 Lexical analyzer produced by lex starts its process by reading one character at a time

until a valid match for a pattern is found.

 Once a match is found, the associated action takes place to produce token.

 The token is then given to parser for further processing.

Lex Predefined Variables

 yytext -- a string containing the lexeme

 yyleng -- the length of the lexeme

 yyin -- the input stream pointer

 the default input of default main() is stdin

 yyout -- the output stream pointer

 the default output of default main() is stdout.

Lex Library Routines
 yylex()
– the default main() contains a call of yylex(), which returns the next token
 yymore()
– appends the next matched text to the current yytext
 yyless(n)
– retains the first n characters in yytext and returns the rest to the input
 yywrap()
– is called whenever Lex reaches an end-of-file
– The default yywrap() always returns 1

Example #1: Lex program to identify the identifier
%{
#include<stdio.h>
%}
%%
[a-zA-Z]([a-zA-Z]|[0-9])*  {printf("Identifier\n");}
[0-9]+                     {printf("Not a Identifier\n");}
%%
/* supply the yywrap function */
int yywrap()
{
    return 1;
}
int main()
{
    printf("Enter a string\n");
    yylex();
}