Unit 2 - LEXICAL ANALYSIS

This document discusses lexical analysis in compiler design. It defines the role of the lexical analyzer as the first phase of a compiler, which reads source-code characters and produces a sequence of tokens. It describes how regular expressions are used to specify patterns for valid tokens and how finite automata are used to recognize these patterns in the input stream. The lexical analyzer identifies tokens, eliminates whitespace, tracks line numbers for error reporting, and reports any lexical errors encountered during tokenization.

CS6660 – COMPILER DESIGN

UNIT II LEXICAL ANALYSIS

Need and Role of Lexical Analyzer-Lexical Errors-Expressing Tokens by Regular


Expressions-Converting Regular Expression to DFA- Minimization of DFA-
Language for Specifying Lexical Analyzers-LEX-Design of Lexical Analyzer for a
sample Language

OVERVIEW OF LEXICAL ANALYSIS

 To identify the tokens, we need some method of describing the possible
tokens that can appear in the input stream. For this purpose we introduce the
regular expression, a notation that can be used to describe essentially all the
tokens of a programming language.
 Secondly, having decided what the tokens are, we need some mechanism to
recognize them in the input stream. This is done by the token recognizers,
which are designed using transition diagrams and finite automata.
THE ROLE OF THE LEXICAL ANALYZER:
 The lexical analyzer is the first phase of a compiler.
 Its main task is to read the input characters of the source program, and
produce as output a sequence of tokens that the parser uses for syntax
analysis.
 The lexical analyzer is implemented as a sub-routine or co-routine of the
parser.
 Upon receiving a “get next token” command from the parser, the lexical
analyzer reads input characters until it can identify the next token.
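This pull-style interaction can be sketched in Python (an illustrative sketch, not the book's code; the token names and patterns here are assumptions):

```python
# Minimal sketch of the parser/lexer interaction: the parser repeatedly
# asks the lexical analyzer for the next token until input is exhausted.
import re

# Illustrative token patterns, tried in order (keywords before identifiers).
TOKEN_SPEC = [
    ("IF", r"if\b"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUM", r"\d+"),
    ("ASSIGN", r"="),
    ("WS", r"\s+"),          # skipped, never returned to the parser
]

def get_next_token(text, pos):
    """Answer one 'get next token' command: return (name, lexeme, new_pos)."""
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name == "WS":          # secondary task: eliminate whitespace
                    pos += m.end()
                    break
                return name, m.group(), pos + m.end()
        else:
            raise SyntaxError(f"lexical error at position {pos}")
    return "EOF", "", pos

def tokenize(text):
    """Drive the lexer the way a parser would, collecting all tokens."""
    tokens, pos = [], 0
    while True:
        name, lexeme, pos = get_next_token(text, pos)
        if name == "EOF":
            return tokens
        tokens.append((name, lexeme))
```

A real parser would call get_next_token on demand rather than collecting the whole list up front.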

Interaction of Lexical Analyzer with a Parser
Functions of Lexical analyzer:
 The Lexical analyzer performs certain secondary tasks at the user interface.
 It eliminates comments and white space.
 Another task is correlating error messages from the compiler with the source
program. Example: the lexical analyzer keeps track of the number of
newline characters seen, so that the line number can be associated with the
error messages.
 It reports the error encountered while generating tokens.
Two Phases of Lexical analyzer:
a) Scanning:
It consists of the simple processes that do not require tokenization of the
input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
b) Lexical analysis:
It is the more complex portion, where the scanner produces the sequence of
tokens as output.
 The scanner is responsible for doing simple tasks.
 The lexical analyzer does the more complex operations.
ISSUES IN LEXICAL ANALYSIS:
Lexical Analysis versus Parsing:
There are several issues in separating the analysis phase of compiling into
lexical analysis and parsing.
Simpler Design:
1. The separation of lexical analysis from syntax analysis allows us to simplify
one or the other of these phases. If we are designing a new language,
separating the lexical and syntactic analysis can lead to a cleaner overall
language design.
Improving compiler Efficiency:
Compiler efficiency is improved. A separate lexical analyzer allows us to
construct a specialized and potentially more efficient processor for the task.
Enhancing Compiler Portability:
2. Compiler portability is enhanced. Input-device-specific peculiarities can be
restricted to the lexical analyzer.
Lexical analysis:
 A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators, etc.
 The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens. From there, the "parser" proper turns those whole tokens into sentences of the grammar.

Parsing:
 A parser converts this list of tokens into a tree-like object that represents how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence).
 A parser does not give the nodes any meaning beyond structural cohesion; the next step is to extract meaning from this structure (sometimes called contextual analysis).
TOKENS, PATTERNS, LEXEMES
Token:
 It describes the class or category of the input string.
 It consists of a token name and an optional attribute value.
 Token is a sequence of characters that can be treated as a single logical
entity.
 Typical tokens are,
1) Identifiers 2) keywords 3) operators 4) special symbols 5)constants
Pattern
 A pattern is a description of the form that the lexemes of a token may
take.
 It is a set of rules that describe the token.
 Example: for a keyword as a token, the pattern is a sequence of
characters that form the keyword.
Lexeme:
 A Lexeme is a sequence of characters in the source program that are
matched with the patterns for a token and is identified by the lexical
analyzer as an instance of that token.
 There is a set of strings in the input for which the same token is
produced as output. This set of strings is described by a rule called a
pattern associated with the token.
Example: int i=10;
where i is the lexeme for the token “identifier”.

Token        Sample Lexemes       Informal Description of Pattern
const        const                the characters c, o, n, s, t
if           if                   the characters i, f
comparison   <, <=, >, >=, ==     < or <= or > or >= or ==
id           i, a, pi, d2         a letter followed by letters and digits
number       3.1416, 0, 6.456     any numeric constant
Examples of Tokens
 Tokens are treated as terminal symbols in the grammar for the source language
using boldface names to represent tokens.
 The lexemes matched by the pattern for the token represent strings of characters in
the source program that can be treated together as a lexical unit.
 In most of the programming languages, keywords, operators, identifiers, constants,
literal strings and punctuation symbols are treated as tokens.
 A Pattern is a rule describing the set of lexemes that can represent a particular token
in source programs. Example: The pattern for the keyword const is the single string
const that spells the keyword.
 In many languages, certain strings are reserved; i.e., their meanings are predefined
and cannot be changed by the user.
 If the keywords are not reserved, then the lexical analyzer must distinguish between
a keyword and a user-defined identifier.
Attributes for Tokens:
 When more than one pattern matches a lexeme, the lexical analyzer must
provide additional information about the particular lexeme that matched to
the subsequent phases of the compiler.

 For example, the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was found
in the source program.
 The lexical analyzer collects information about tokens into their associated
attributes.
 The token name influences parsing decisions; the attributes influence the
translation of tokens.
 Usually, a token has only a single attribute – a pointer to the symbol-table entry in
which the information about the token is kept. The pointer becomes the
attribute for the token.
Example:
The tokens and associated attribute-values for the FORTRAN statement
E = M * C **2
are written below as a sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<num, integer value 2>
The token num has been given an integer-valued attribute. The compiler stores the
character string that forms a number in a symbol table and let the attribute of token
num be a pointer to the table entry.
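The token/attribute pairs above can be reproduced with a small sketch (Python stands in for the compiler's internals; symbol-table entry indices play the role of pointers, and the lexeme list is assumed to be already split):

```python
# Sketch: producing <token, attribute> pairs for E = M * C ** 2.
# The symbol table maps each identifier lexeme to an entry index,
# and that index serves as the "pointer" attribute of token id.
symtab = []                      # list of symbol-table entries
symtab_index = {}                # lexeme -> entry index

def install_id(lexeme):
    """Install the lexeme if absent; return its entry index ("pointer")."""
    if lexeme not in symtab_index:
        symtab_index[lexeme] = len(symtab)
        symtab.append({"lexeme": lexeme})
    return symtab_index[lexeme]

def token_pairs(lexemes):
    """Map each lexeme to its <token, attribute> pair."""
    pairs = []
    ops = {"=": "assign_op", "*": "mult_op", "**": "exp_op"}
    for lx in lexemes:
        if lx in ops:
            pairs.append((ops[lx], None))      # operators carry no attribute
        elif lx.isdigit():
            pairs.append(("num", int(lx)))     # integer-valued attribute
        else:
            pairs.append(("id", install_id(lx)))
    return pairs
```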

LEXICAL ERRORS:

 Lexical errors are the errors thrown by the lexer when it is unable to continue;
that is, there is no way to recognize a lexeme as a valid token for the lexer.
 Few errors are discernible during the lexical analysis phase alone.
 If the string fi is encountered in a C program for the first time in the context
fi (a== f(x)) …
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or
an undeclared function identifier.
 Since fi is a valid identifier, the lexical analyzer returns the token for the
identifier and this error is handled by some other phase of the compiler.
Lexical Errors are:
1. Spelling errors.
2. Exceeding length of identifier or numeric constants.
3. Appearance of illegal characters.
Error-Recovery Actions (or) Strategies:
Panic Mode:
Sometimes, the lexical analyzer is unable to proceed because none of the patterns
for tokens matches a prefix of the remaining input. The lexical analyzer deletes
successive characters from the remaining input until it can find a
well-formed token. This recovery technique is called panic mode.
Other possible error-recovery actions are:
a. Deleting an extraneous character
b. Inserting a missing character
c. Replacing an incorrect character by a correct character
d. Transposing two adjacent characters

o Error transformations like these may be tried in an attempt to repair the input. The
simplest strategy is to see whether a prefix of the remaining input can be
transformed into a valid lexeme by just a single error transformation.
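Panic-mode deletion can be sketched as follows (the token patterns are hypothetical; only the recovery loop itself matters here):

```python
# Sketch of panic-mode recovery: delete successive characters from the
# remaining input until some prefix can again start a well-formed token.
import re

# Illustrative patterns that can begin a token in a toy language.
PATTERNS = [r"[A-Za-z_][A-Za-z0-9_]*", r"\d+", r"[=+\-*/;]"]

def panic_mode_skip(text, pos):
    """Advance pos past characters that cannot start any token.

    Returns (new_pos, deleted) where deleted holds the characters
    removed during recovery.
    """
    deleted = []
    while pos < len(text) and not any(re.match(p, text[pos:]) for p in PATTERNS):
        deleted.append(text[pos])   # character deleted during recovery
        pos += 1
    return pos, "".join(deleted)
```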

INPUT BUFFERING
Buffer Pair Method:
 During lexical analysis, to identify a lexeme it is often necessary to look ahead at
least one additional character.
 Specialized buffering techniques have been developed to reduce the amount
of overhead required to process a single input character.
 An important scheme involves two buffers that are alternately reloaded.

Using a pair of Input Buffers


 Each buffer is of the same size N and N is usually the size of a disk block.
(Example: 4096 bytes)
 Using one system read command we can read N characters into the buffer.
 If fewer than N characters remain in the input file, then a special character
represented by eof marks the end of the source file.
Two pointers to the input are maintained.

a) Pointer lexemeBegin marks the beginning of the current lexeme, whose
extent we are attempting to determine.
b) Pointer forward scans ahead until a pattern match is found.
 Once the next lexeme is determined, forward is set to the character at its
right end
 After the lexeme is recorded as an attribute value of a token returned to the
parser, lexemebegin is set to the character immediately after the lexeme just
found.
 Example: In the figure given above, forward has passed the end of the next
lexeme, **. It must be retracted one position to its left.
Sentinel Method:
o A sentinel is a special character that cannot be part of the source program.
o Example for the special character is eof.

Sentinels at end of each buffer half

o A sentinel eof is added at the end of each buffer half. If the input string is
shorter than the buffer size, an eof is added into the buffer where the input
string terminates.
o By adding a sentinel value, we can easily find the end of the buffer.
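The sentinel scheme can be sketched like this (a toy buffer half of N = 4 characters, with NUL as the assumed sentinel; real lexers use a disk-block-sized N):

```python
# Sketch of the sentinel scheme: each buffer half ends with an EOF
# sentinel, so the hot loop needs only one test per ordinary character.
EOF = "\x00"        # assumed sentinel character, not part of the source
N = 4               # tiny buffer half for illustration (normally a disk block)

def load_buffer(source, start):
    """Read up to N characters and terminate the half with the sentinel."""
    chunk = source[start:start + N]
    return chunk + EOF, start + len(chunk)

def scan(source):
    """Count source characters using sentinel-terminated buffer halves."""
    count, start = 0, 0
    buf, start = load_buffer(source, start)
    i = 0
    while True:
        ch = buf[i]
        i += 1
        if ch != EOF:
            count += 1              # ordinary character: a single test
        elif start < len(source):   # sentinel: end of this buffer half
            buf, start = load_buffer(source, start)
            i = 0
        else:                       # sentinel with no input left: real EOF
            return count
```

The point of the sentinel is visible in the loop: end-of-buffer and end-of-input are both detected by the one `ch != EOF` test already needed per character.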

EXPRESSING TOKENS BY REGULAR EXPRESSION:


SPECIFICATION OF TOKENS:

 Regular expressions are an important notation for specifying lexeme
patterns.

Strings and Languages:


 An alphabet is any finite set of symbols.
Examples: Letters, digits and punctuation.
 The set {0,1} is the binary alphabet.
 A string over an alphabet is a finite sequence of symbols drawn from that
alphabet.
 The length of the string s, represented as | s |, is the number of occurrences
of symbols in s.
o For example, banana is a string of length six.
o The empty string, denoted as ε, is the string of length 0.
 A language is any countable set of strings over some fixed alphabet.
Example: Abstract languages like φ, the empty set, or { ε }.
 If x and y are strings, then the concatenation of x and y, denoted xy, is the
string formed by appending y to x.
Example: if x = cse, y=department, then xy = csedepartment.
Operations on Languages:
 In lexical analysis, the most important operations on languages are union,
concatenation and closure.
 The concatenation of languages is all strings formed by taking a string
from the first language and a string from the second language in all possible
ways and concatenating them.
 The (kleene) closure of a language L, denoted by L* is the set of strings we
get by concatenating L zero or more times.

 The positive closure, denoted by L+ is the set of strings we get by
concatenating L one or more times.
S.No.  Operation                   Definition & Notation
1.     Union of L and M            L ∪ M = { s | s is in L or s is in M }
2.     Concatenation of L and M    LM = { st | s is in L and t is in M }
3.     Kleene closure of L         L* = ∪ (i = 0 to ∞) L^i, zero or more concatenations of L
4.     Positive closure of L       L+ = ∪ (i = 1 to ∞) L^i, one or more concatenations of L
Definitions of operations on languages
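These operations can be checked on small finite languages (a Python sketch; the closures are truncated to a length bound, since L* is infinite for any non-empty L):

```python
# Sketch: union, concatenation, and the closures of finite languages,
# modelled with Python sets of strings.
def concat(L, M):
    """LM = { st | s in L and t in M }"""
    return {s + t for s in L for t in M}

def closure(L, max_len):
    """Kleene closure L*, truncated to strings no longer than max_len."""
    result, current = {""}, {""}
    while True:
        current = {s for s in concat(current, L) if len(s) <= max_len}
        if current <= result:          # no new strings within the bound
            return result
        result |= current

L, M = {"a"}, {"b", "c"}
union = L | M                          # L U M = { s | s in L or s in M }
positive = concat(L, closure(L, 2))    # L+ = L L*, truncated
```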

Regular Expressions:
 Example: In the regular expression notation, if letter_ is established to
stand for any letter or the underscore, and digit is established to stand for
any digit, then the language of C identifiers can be described as:
letter_ ( letter_ | digit )*
 The vertical bar | above means union, the parentheses ( ) are used to group
subexpressions, and the * means "zero or more occurrences of".
 The regular expressions are built recursively out of smaller regular
expressions, using the rules described below.
BASIS:
 There are two rules that form the basis:
1. ε is a regular expression and L(ε) is {ε}, that is, the language whose
sole member is the empty string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a},
that is, the language with one string, of length one, with a in its one
position.
Induction:
There are four parts to the induction whereby larger regular expressions are
built from smaller ones. Suppose r and s are regular expressions denoting
languages L(r) and L(s), respectively.
1. (r) | (s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r) L(s) .
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of parentheses around
expressions without changing the language they denote.
Regular expressions often contain unnecessary pairs of parentheses. Certain
conventions are followed to remove pair of parentheses:
a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.

Example: Consider the expression:


(a) | ((b) * (c))
This can be rewritten as
a| b*c
Both expressions denote the set of strings that are either a single a or are zero or
more b's followed by one c.
Example :
Let ∑ = {a, b}.
1. The regular expression a | b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the
alphabet ∑. Another regular expression for the same language is
aa|ab|ba|bb.

3. a* denotes the language consisting of all strings of zero or more a's, that
is, {ε, a, aa, aaa, . . . }.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that
is, all strings of a's and b's: { ε , a, b, aa, ab, ba, bb, aaa, . . .}.
Another regular expression for the same language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, . . .}, that is, the string a
and all strings consisting of zero or more a's and ending in b.
 A language that can be defined by a regular expression is called a regular set.
 If two regular expressions r and s denote the same regular set, we say they are
equivalent and write r = s.
 For instance, (a|b) = (b|a)
 There are a number of algebraic laws for regular expressions; each law asserts that
expressions of two different forms are equivalent.

Algebraic Laws for Regular Expressions

Regular Definitions
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form:

Where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ ∪ {d1, d2, . . . , di-1}.

By restricting ri to ∑ and the previously defined d's, we avoid recursive definitions,
and we can construct a regular expression over ∑ alone for each ri. We do so by
first replacing uses of d1 in r2 by r1, then replacing uses of d1 and d2 in r3 by r1 and r2, and
so on.

Example: C identifiers are strings of letters, digits, and underscores.


The regular definitions for the language of C identifiers are given below:
letter_ -> A | B | ... | Z | a | b | ... | z | _
digit -> 0 | 1 | ... | 9
id -> letter_ ( letter_ | digit )*
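As a quick check, this regular definition corresponds to the following pattern (a sketch using Python's re syntax in place of the abstract notation):

```python
# Sketch: the regular definition for C identifiers as a Python regex.
# letter_ -> [A-Za-z_], digit -> [0-9], id -> letter_ (letter_ | digit)*
import re

ID = re.compile(r"[A-Za-z_][A-Za-z_0-9]*\Z")   # \Z: match the whole string

def is_c_identifier(s):
    return ID.match(s) is not None
```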

Extensions (or) Notations of Regular Expressions


1. One or more instances.
 The unary, postfix operator + represents the positive closure of a regular
expression and its language.
 That is, if r is a regular expression, then (r)+ denotes the language
(L(r) )+.

 The operator + has the same precedence and associativity as the operator
*. Two useful algebraic laws, r* = r+ | ε and r+ = rr* = r*r, relate the
Kleene closure and positive closure.
2. Zero or one instance.
 The unary postfix operator ? means "zero or one occurrence."
 That is, r? is equivalent to r|ε, or put another way,
 L(r?) =L(r) U {ε}.
 The ? operator has the same precedence and associativity as * and +.
3. Character classes.
 A regular expression a1|a2|...|an, where the ai's are each symbols of the
alphabet, can be replaced by the shorthand [a1a2...an].
 When a1, a2, . . . , an form a logical sequence, e.g., consecutive
uppercase letters, lowercase letters, or digits, we can replace them by
a1-an, that is, just the first and last separated by a hyphen. Thus, [abc]
is shorthand for a|b|c, and [a-z] is shorthand for a|b|. . . |z.

Example: Using these shorthands, we can rewrite the regular definition for the C
identifiers as follows:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ ( letter_ | digit )*
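The shorthand equivalences r+ = rr*, r? = r|ε, and [ab] = a|b can be verified by brute-force enumeration of short strings (a Python sketch; Python's re engine stands in for the abstract notation):

```python
# Sketch: check shorthand equivalences by comparing the sets of short
# strings over {a, b} that each pattern matches.
import re
from itertools import product

def language(pattern, max_len=3, alphabet="ab"):
    """All strings over the alphabet, up to max_len, matching the pattern."""
    strings = (s for n in range(max_len + 1)
               for s in map("".join, product(alphabet, repeat=n)))
    return {s for s in strings if re.fullmatch(pattern, s)}
```

Here `r"a|"` encodes a|ε, since the empty alternative matches the empty string.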

RECOGNITION OF TOKENS:
Consider the example grammar for branching statements given below:

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term
      | term
term -> id
      | number
A grammar for branching statements
The above grammar fragment describes a simple form of branching statements and
conditional expressions.
 For relop, we use the comparison operators of languages like Pascal or SQL,
where = is "equals" and <> is "not equals".
 The terminals of the grammar, which are if, then, else, relop, id, and number,
are the names of tokens as far as the lexical analyzer is concerned.

The patterns for these tokens are described using regular definitions as below:

Patterns for tokens

For this language, the lexical analyzer will recognize the keywords if, then, and
else, as well as lexemes that match the patterns for relop, id, and number.
In addition, we assign the lexical analyzer the job of stripping out whitespace,
by recognizing the "token" ws defined by:
ws -> ( blank | tab | newline ) +
Here, blank, tab, and newline are abstract symbols that we use to express the
ASCII characters of the same names.
Tokens, their patterns, and attribute values are summarized below:

Transition Diagrams
 As an intermediate step in the construction of a lexical analyzer, we first convert
patterns into stylized flowcharts, called "transition diagrams."
 Transition diagrams have a collection of nodes or circles, called states.
o Each state represents a condition that could occur during the
process of scanning the input looking for a lexeme that matches
one of several patterns.
 Edges are directed from one state of the transition diagram to another. Each
edge is labeled by a symbol or set of symbols. If we are in some state s, and
the next input symbol is a, we look for an edge out of state s labeled by a.
 If we find such an edge, we advance the forward pointer and enter the state of
the transition diagram to which that edge leads.
 We shall assume that all our transition diagrams are deterministic, meaning that
there is never more than one edge out of a given state with a given symbol
among its labels.

Some important conventions about transition diagrams are:


1. Certain states are said to be accepting, or final. These states indicate that a
lexeme has been found, although the actual lexeme may not consist of all
positions between the lexemeBegin and forward pointers.
 We indicate an accepting state by a double circle, and if there is an action to
be taken - typically returning a token and an attribute value to the parser - we
shall attach that action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position then
we shall additionally place a * near that accepting state. In our example, it is
never necessary to retract forward by more than one position, but if it were,
we could attach any number of *'s to the accepting state.
3. One state is designated the start state, or initial state; it is indicated by an
edge, labeled "start ," entering from nowhere. The transition diagram always
begins in the start state before any input symbols have been read.
The Figure below is a transition diagram that recognizes the lexemes matching the
token relop.
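The relop transition diagram can be rendered as a small recognizer (a Python sketch; the attribute names LT, LE, etc. are illustrative, and accepting states that read one character too far simply do not consume it, mirroring the * retraction):

```python
# Sketch of the relop transition diagram: states correspond to the
# branches below; accepting states marked * in the diagram return
# without consuming the lookahead character (retraction).

def relop(text):
    """Return (lexeme, (token, attribute)) for a relop prefix, or None."""
    if not text:
        return None
    c = text[0]
    if c == "<":
        if len(text) > 1 and text[1] == "=":
            return "<=", ("relop", "LE")
        if len(text) > 1 and text[1] == ">":
            return "<>", ("relop", "NE")
        return "<", ("relop", "LT")      # accepting state with retraction
    if c == "=":
        return "=", ("relop", "EQ")
    if c == ">":
        if len(text) > 1 and text[1] == "=":
            return ">=", ("relop", "GE")
        return ">", ("relop", "GT")      # accepting state with retraction
    return None
```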

Recognition of Reserved Words and Identifiers
 Usually, keywords like if or then are reserved, so they are not identifiers
even though they look like identifiers.
The transition diagram for identifiers and keywords is given below:

A transition diagram for id's and keywords

There are two ways that we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially. A field of the
symbol-table entry indicates that these strings are never ordinary identifiers,
and tells which token they represent. When we find an identifier, a call to

installID places it in the symbol table if it is not already there and returns a
pointer to the symbol-table entry for the lexeme found.
 The function getToken examines the symbol table entry for the lexeme
found, and returns whatever token name the symbol table says this
lexeme represents - either id or one of the keyword tokens that was
initially installed in the table.
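Approach 1 can be sketched as follows (Python stands in for the C helpers; the entry layout is an assumption, but the function names installID and getToken follow the text):

```python
# Sketch of approach 1: the symbol table is preloaded with reserved
# words, so installID/getToken can distinguish keywords from identifiers.
symbol_table = []
index_of = {}

def installID(lexeme):
    """Install lexeme if absent; return its symbol-table index ("pointer")."""
    if lexeme not in index_of:
        index_of[lexeme] = len(symbol_table)
        symbol_table.append({"lexeme": lexeme, "token": "id"})
    return index_of[lexeme]

def getToken(entry_index):
    """Return the token name recorded for a symbol-table entry."""
    return symbol_table[entry_index]["token"]

# Preload reserved words, each with its own token name instead of "id".
for kw in ("if", "then", "else"):
    installID(kw)
    symbol_table[index_of[kw]]["token"] = kw
```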

2. Create separate transition diagrams for each keyword; an example for the
keyword then is shown below.

Hypothetical transition diagram for the keyword then

 The transition diagram consists of states representing the situation


after each successive letter of the keyword is seen, followed by a test
for a "nonletter-or-digit". The tokens are prioritized so that the
reserved-word tokens are recognized in preference to id, when the
lexeme matches both patterns.

The transition diagram for token number is shown below:

A transition diagram for unsigned numbers

 If a dot is seen, we have an “optional fraction”.
 State 14 is entered and we look for one or more additional digits.
 State 15 is used for this purpose.
 If we see an E, we have an “optional exponent”. States 16 through 19 are
used to recognize the exponent value.
 In state 15, if we see anything other than E or a digit, then state 21, the
accepting state, is entered.

The final transition diagram for whitespace is shown below:

In that diagram, we look for one or more "whitespace" characters, represented by
delim; typically these characters would be blank, tab, or newline.

THE LEXICAL-ANALYZER GENERATOR LEX


Lex is a tool that is used to specify a lexical analyzer by specifying regular
expressions to describe patterns for tokens.
The input notation for the Lex tool is referred to as the Lex Language and the tool
itself is the Lex compiler.
The Lex compiler transforms the input patterns into a transition diagram and
generates the code in the file called lex.yy.c.
Use of Lex:

An input file lex.l is written in the Lex language and describes the lexical analyzer
to be generated.
The Lex compiler transforms lex.l to a C program in a file that is always named
lex.yy.c.

lex.l (Lex source program) → Lex compiler → lex.yy.c

The file lex.yy.c is compiled by a C compiler into an output file called a.out:

lex.yy.c → C compiler → a.out

The output file a.out is a working lexical analyzer that can take a stream of input
characters and produce a stream of tokens:

input stream → a.out → sequence of tokens


Structure of Lex Program
A Lex program has the following form:

declarations
%%
translation rules
%%
auxiliary functions

Declarations section
 The declarations section includes declarations of variables, manifest
constants (identifiers declared to stand for a constant, e.g., the name of a
token), and regular definitions.
Translation rules
 The translation rules each have the form
o Pattern { Action }
 Each pattern is a regular expression, which may use the regular definitions
of the declaration section.
 The actions are fragments of code, typically written in C.
Auxiliary functions
 The third section holds whatever additional functions are used in the actions.
 Alternatively, these functions can be compiled separately and loaded with
the lexical analyzer.
The lexical analyzer created by Lex behaves as follows.
 When called by the parser, the lexical analyzer begins reading its remaining
input, one character at a time, until it finds the longest prefix of the input that
matches one of the patterns Pi.
 It then executes the associated action Ai. Typically, Ai will return to the parser,
but if it does not, then the lexical analyzer proceeds to find additional lexemes,
until one of the corresponding actions causes a return to the parser.
 The lexical analyzer returns a single value, the token name, to the parser, but
uses the shared, integer variable yylval to pass additional information about the
lexeme found, if needed.
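The longest-prefix rule, with rule order breaking ties, can be sketched as follows (the patterns here are illustrative, not the book's Lex program):

```python
# Sketch of the Lex matching discipline: take the longest prefix that
# matches any pattern; on a tie in length, the earlier rule wins.
import re

RULES = [                      # order encodes priority, as in a Lex file
    ("IF", r"if"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("NUMBER", r"[0-9]+"),
    ("WS", r"[ \t\n]+"),
]

def longest_match(text, pos):
    """Return (token_name, lexeme) for the best match at pos, or None."""
    best = None
    for i, (name, pattern) in enumerate(RULES):
        m = re.match(pattern, text[pos:])
        if m and m.end() > 0:
            key = (-m.end(), i)         # longest first, then earliest rule
            if best is None or key < best[0]:
                best = (key, name, m.group())
    return (best[1], best[2]) if best else None
```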

Lex Program for the Tokens

 In the declarations section we see a pair of special brackets, %{ and %}.
Anything within these brackets is copied directly to the file lex.yy.c, and
is not treated as a regular definition. The manifest constants are placed inside
it.
 Also in the declarations section is a sequence of regular definitions.
 Regular definitions that are used in later definitions or in the patterns of the
translation rules are surrounded by curly braces. Thus, for instance, delim is
defined to be shorthand for the character class consisting of the blank, the
tab, and the newline; the latter two are represented, as in all UNIX
commands, by backslash followed by t or n, respectively.
 In the definition of id and number, parentheses are used as grouping
metasymbols and do not stand for themselves. In contrast, E in the definition
of number stands for itself.
 In the auxiliary-function section, we see two functions, installID() and
installNum(). Everything in the auxiliary section is copied directly to file
lex.yy.c, but may be used in the actions.
 First, ws, an identifier declared in the first section, has an associated empty
action. If we find whitespace, we do not return to the parser, but look for
another lexeme.
 The second token has the simple regular expression pattern if. If the lexeme
matched is the two characters i and f, the lexical analyzer returns the
token name IF. Keywords then and else are treated similarly.
 The action taken when id is matched is given as follows:
a. Function installID() is called to place the lexeme found in the symbol
table.
b. This function returns a pointer to the symbol-table entry, which is placed in
the global variable yylval. Two variables are automatically set by the
lexical analyzer:
i. yytext is a pointer to the beginning of the lexeme.
ii. yyleng is the length of the lexeme found.
c. The token name ID is returned to the parser.

FINITE AUTOMATA:

 At the heart of the transition diagram is the finite automaton.
 An automaton is a system in which information is transmitted and used to
perform some function without the direct participation of man.
 These are essentially graphs, like transition diagrams, with a few
differences:
1. Finite automata are recognizers; they simply say "yes" or "no" about each
possible input string.
2. Finite automata come in two flavors:
(a) Nondeterministic finite automata (NFA) have no restrictions on
the labels of their edges. An NFA may have several transitions for a
single input symbol from one state.
(b) Deterministic finite automata (DFA) have, for each state and for
each symbol of the input alphabet, exactly one edge with that
symbol leaving that state; i.e., a DFA has a single transition for an
input symbol from one state.
Both deterministic and nondeterministic finite automata are capable of recognizing
the same languages. In fact, these are exactly the languages, called the regular
languages, that regular expressions can describe.
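This equivalence can be illustrated with a pair of recognizers for the same language, strings over {a, b} ending in ab (an illustrative sketch, not from the text):

```python
# Sketch: an NFA and a DFA recognizing the same language (strings over
# {a, b} ending in "ab"), simulated directly.
def nfa_accepts(s):
    """NFA simulation: track the set of states reachable so far."""
    states = {0}
    for ch in s:
        nxt = set()
        for q in states:
            if q == 0:
                nxt.add(0)                 # state 0 loops on any symbol
                if ch == "a":
                    nxt.add(1)             # nondeterministic guess: "ab" starts
            elif q == 1 and ch == "b":
                nxt.add(2)                 # finished reading "ab"
        states = nxt
    return 2 in states

def dfa_accepts(s):
    """DFA simulation: exactly one transition per state and symbol."""
    delta = {0: {"a": 1, "b": 0}, 1: {"a": 1, "b": 2}, 2: {"a": 1, "b": 0}}
    q = 0
    for ch in s:
        q = delta[q][ch]
    return q == 2
```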

OPTIMIZATION OF DFA – BASED PATTERN MATCHER:

CONVERTING A REGULAR EXPRESSION DIRECTLY TO A DFA

A finite automaton is called a DFA if there is only one path for a specific
input from the current state to the next state.

From state S0, for input a, there is only one path, going to S2. Similarly, from
S0 there is only one path for the other input, going to S1.

Algorithm:

• The "important states" of an NFA are those with a non-ε out-transition; that is, if
move({s}, a) ≠ ∅ for some a, then s is an important state.
• The subset construction algorithm uses only the important states when it
determines ε-closure(move(T, a)).
• Augment the regular expression r with a special end symbol # to make the
accepting state important: the new expression is r#.
• Construct a syntax tree for r#.
• Traverse the tree to construct the functions nullable, firstpos, lastpos, and
followpos.

Example: From Regular Expression to DFA Directly:

Syntax Tree of (a|b)*abb#

• nullable(n): true if the subtree at node n generates a language including the
empty string.

• firstpos(n): the set of positions that can match the first symbol of a string
generated by the subtree at node n.

• lastpos(n): the set of positions that can match the last symbol of a string
generated by the subtree at node n.

• followpos(i): the set of positions that can follow position i in the tree

Annotating the Tree (Rules for computing nullable, firstpos and lastpos):

Node n                 nullable(n)                     firstpos(n)                        lastpos(n)

Leaf labelled ε        true                            ∅                                  ∅

Leaf with position i   false                           {i}                                {i}

or-node n = c1 | c2    nullable(c1) or nullable(c2)    firstpos(c1) ∪ firstpos(c2)        lastpos(c1) ∪ lastpos(c2)

cat-node n = c1 c2     nullable(c1) and nullable(c2)   if nullable(c1) then               if nullable(c2) then
                                                       firstpos(c1) ∪ firstpos(c2)        lastpos(c1) ∪ lastpos(c2)
                                                       else firstpos(c1)                  else lastpos(c2)

star-node n = c1*      true                            firstpos(c1)                       lastpos(c1)
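The rules in the table can be implemented by a direct recursion over the syntax tree, together with the standard followpos rules for cat-nodes and star-nodes. The sketch below hard-codes the syntax tree of (a|b)*abb# with leaf positions 1-6; the names Node, fill_followpos, etc. are our own, not from the notes:

```python
# Sketch: nullable, firstpos, lastpos and followpos for the syntax tree
# of (a|b)*abb#, with leaf positions a=1, b=2, a=3, b=4, b=5, #=6.

class Node:
    def __init__(self, op, left=None, right=None, pos=None):
        self.op, self.left, self.right, self.pos = op, left, right, pos

def leaf(sym, pos): return Node(sym, pos=pos)
def cat(l, r):      return Node(".", l, r)
def alt(l, r):      return Node("|", l, r)
def star(c):        return Node("*", c)

followpos = {i: set() for i in range(1, 7)}

def nullable(n):
    if n.pos is not None: return False          # position leaf
    if n.op == "|": return nullable(n.left) or nullable(n.right)
    if n.op == ".": return nullable(n.left) and nullable(n.right)
    return True                                  # star-node

def firstpos(n):
    if n.pos is not None: return {n.pos}
    if n.op == "|": return firstpos(n.left) | firstpos(n.right)
    if n.op == ".":
        return firstpos(n.left) | firstpos(n.right) if nullable(n.left) else firstpos(n.left)
    return firstpos(n.left)                      # star-node

def lastpos(n):
    if n.pos is not None: return {n.pos}
    if n.op == "|": return lastpos(n.left) | lastpos(n.right)
    if n.op == ".":
        return lastpos(n.left) | lastpos(n.right) if nullable(n.right) else lastpos(n.right)
    return lastpos(n.left)                       # star-node

def fill_followpos(n):
    if n is None or n.pos is not None: return
    if n.op == ".":                              # cat-node rule
        for i in lastpos(n.left):
            followpos[i] |= firstpos(n.right)
    if n.op == "*":                              # star-node rule
        for i in lastpos(n):
            followpos[i] |= firstpos(n)
    fill_followpos(n.left); fill_followpos(n.right)

# build (a|b)*abb#
root = cat(cat(cat(cat(star(alt(leaf("a", 1), leaf("b", 2))),
                       leaf("a", 3)), leaf("b", 4)), leaf("b", 5)),
           leaf("#", 6))
fill_followpos(root)
print(firstpos(root))   # {1, 2, 3}
print(followpos[1])     # {1, 2, 3}
print(followpos[3])     # {4}
```

These values (followpos(1) = followpos(2) = {1, 2, 3}, followpos(3) = {4}, followpos(4) = {5}, followpos(5) = {6}) match the standard annotated-tree figure for this running example.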

Syntax Tree of (a|b)*abb#

From Regular Expression to DFA Directly: followpos

From Regular Expression to DFA Directly: Algorithm

From Regular Expression to DFA Directly: Example
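The construction itself can be sketched concretely: the start state is firstpos of the root, each DFA state is a set of positions, and the transition from state S on symbol a is the union of followpos(p) over the positions p in S labelled a. The followpos values below are the standard ones for (a|b)*abb# (positions 1-6, with # at position 6):

```python
# Sketch of the direct RE-to-DFA construction for (a|b)*abb#.
symbol_at = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
start = frozenset({1, 2, 3})          # firstpos of the root

dstates, unmarked, dtran = [start], [start], {}
while unmarked:
    S = unmarked.pop()
    for a in "ab":
        # union of followpos(p) over positions p in S labelled a
        U = frozenset(q for p in S if symbol_at[p] == a for q in followpos[p])
        if U and U not in dstates:
            dstates.append(U)
            unmarked.append(U)
        if U:
            dtran[(S, a)] = U

# a state is accepting iff it contains the position of '#'
accepting = [S for S in dstates if 6 in S]
print(len(dstates), len(accepting))   # 4 states, 1 accepting
```

The direct construction thus produces a four-state DFA for (a|b)*abb, one state fewer than the subset construction gives before minimization.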

MINIMIZATION OF DFA:

A DFA is a 5-tuple (Q, Σ, δ, q0, F):

 Q – (finite) set of states
 Σ – alphabet – (finite) set of input symbols
 δ – transition function
 q0 – start state
 F – set of final / accepting states

It is often represented as a transition diagram:
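The 5-tuple maps directly onto code. A minimal sketch (the machine below is our own illustration, accepting strings over {a, b} that end in "ab"; it is not the one in the figure):

```python
# A DFA as the 5-tuple (Q, Sigma, delta, q0, F).
Q = {"s0", "s1", "s2"}
Sigma = {"a", "b"}
delta = {
    ("s0", "a"): "s1", ("s0", "b"): "s0",
    ("s1", "a"): "s1", ("s1", "b"): "s2",
    ("s2", "a"): "s1", ("s2", "b"): "s0",
}
q0 = "s0"
F = {"s2"}

def accepts(w: str) -> bool:
    """Run the DFA on w; accept iff the run ends in a final state."""
    state = q0
    for c in w:
        if c not in Sigma:
            return False           # symbol outside the alphabet
        state = delta[(state, c)]  # exactly one move per (state, symbol)
    return state in F

print(accepts("aab"))   # True: ends in "ab"
print(accepts("abb"))   # False
```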

Some states can be redundant:

The following DFA accepts (a|b)+

State s1 is not necessary

This is a state-minimized (or just minimized) DFA

 Every remaining state is necessary

The task of DFA minimization, then, is to automatically transform a given DFA
into a state-minimized DFA.

DFA Minimization Algorithm

Recall that a DFA M=(Q, Σ, δ, q0, F)

 Two states p and q are distinct if

 p in F and q not in F, or vice versa, or
 for some α in Σ, δ(p, α) and δ(q, α) are distinct

Using this inductive definition, we can calculate which states are distinct

CONVERT NFA TO DFA (SUBSET CONSTRUCTION METHOD)

DFA transition table:
STATE a b
A B C
B C D
C B C
D B E
*E B C
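The table above is the classic result of running the subset construction on Thompson's NFA for (a|b)*abb. A sketch of the construction, assuming that NFA's standard numbering (states 0-10, accepting state 10) since the NFA figure is not shown here:

```python
# Subset construction on Thompson's NFA for (a|b)*abb.
EPS = "eps"
nfa = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}

def eps_closure(states):
    """All NFA states reachable from `states` on eps-edges alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, a):
    return {t for s in states for t in nfa.get((s, a), ())}

def subset_construction(start, alphabet):
    d_start = eps_closure({start})
    dstates, dtran, unmarked = [d_start], {}, [d_start]
    while unmarked:
        T = unmarked.pop()
        for a in alphabet:
            U = eps_closure(move(T, a))
            if U and U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            if U:
                dtran[(T, a)] = U
    return dstates, dtran

dstates, dtran = subset_construction(0, "ab")
print(len(dstates))               # 5 DFA states: A..E of the table above
print(sorted(eps_closure({0})))   # [0, 1, 2, 4, 7]  = state A
```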

Minimum state DFA:

[ABCDE]
[ABCD][E]
[ABC][D][E]
[AC][B][D][E]

Since A and C are not distinct, eliminate C and replace C by A wherever it occurs.


Minimum state DFA Table

STATE a b
A B A
B B D
D B E
*E B A

DFA minimization is a fairly understandable process, and is useful in several
areas:

 Regular expression matching implementations

 A very similar algorithm is used in compiler optimization to eliminate
duplicate computations

DFA Minimization using Myhill-Nerode Theorem

Algorithm

Input: DFA
Output: Minimized DFA
Step 1 − Draw a table for all pairs of states (Qi, Qj), not necessarily connected directly.
[All are unmarked initially.]
Step 2 − Consider every state pair (Qi, Qj) in the DFA where Qi ∈ F and Qj ∉ F, or vice versa,
and mark them. [Here F is the set of final states.]
Step 3 − Repeat this step until we cannot mark any more states:

if there is an unmarked pair (Qi, Qj), mark it if the pair {δ(Qi, A), δ(Qj, A)} is
marked for some input symbol A.
Step 4 − Combine all the unmarked pairs (Qi, Qj) and make each group a single state in the
reduced DFA.
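The marking algorithm can be sketched directly. The transition table and final states below are those of the six-state example DFA used later in this unit (F = {c, d, e}); treat this as an illustrative encoding, not library code:

```python
# Table-filling (Myhill-Nerode) minimization on the six-state example DFA.
from itertools import combinations

delta = {
    ("a", "0"): "b", ("a", "1"): "c",
    ("b", "0"): "a", ("b", "1"): "d",
    ("c", "0"): "e", ("c", "1"): "f",
    ("d", "0"): "e", ("d", "1"): "f",
    ("e", "0"): "e", ("e", "1"): "f",
    ("f", "0"): "f", ("f", "1"): "f",
}
Q, Sigma, F = "abcdef", "01", {"c", "d", "e"}

# Steps 1-2: mark every pair with exactly one final state.
marked = {frozenset(p) for p in combinations(Q, 2)
          if (p[0] in F) != (p[1] in F)}

# Step 3: keep marking (Qi, Qj) whenever some input takes the pair
# to an already-marked pair, until nothing changes.
changed = True
while changed:
    changed = False
    for p, q in combinations(Q, 2):
        if frozenset((p, q)) in marked:
            continue
        if any(frozenset((delta[(p, a)], delta[(q, a)])) in marked
               for a in Sigma):
            marked.add(frozenset((p, q)))
            changed = True

# Step 4: the pairs still unmarked are equivalent.
equivalent = [sorted(pr) for pr in combinations(Q, 2)
              if frozenset(pr) not in marked]
print(equivalent)   # [['a', 'b'], ['c', 'd'], ['c', 'e'], ['d', 'e']]
```

The unmarked pairs {a, b}, {c, d}, {c, e}, {d, e} match the worked example below, giving the merged states {a, b} and {c, d, e}.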

Example

Let us use above algorithm to minimize the DFA shown below.

Step 1 − We draw a table for all pairs of states.

a b c d e f
a
b
c
d
e
f

Step 2 − We mark the state pairs –

      a    b    c    d    e    f
a
b
c     ✔    ✔
d     ✔    ✔
e     ✔    ✔
f               ✔    ✔    ✔

Step 3 − We will try to mark the state pairs, with green colored check marks, transitively. If we
input 1 to states 'a' and 'f', they go to states 'c' and 'f' respectively. (c, f) is already marked,
hence we mark the pair (a, f). Now, we input 1 to states 'b' and 'f'; they go to states 'd' and 'f'
respectively. (d, f) is already marked, hence we mark the pair (b, f).

      a    b    c    d    e    f
a
b
c     ✔    ✔
d     ✔    ✔
e     ✔    ✔
f     ✔    ✔    ✔    ✔    ✔

After step 3, we have got state combinations {a, b} {c, d} {c, e} {d, e} that are unmarked.

We can recombine {c, d} {c, e} {d, e} into {c, d, e}

Hence we got two combined states as − {a, b} and {c, d, e}

So the final minimized DFA will contain three states {f}, {a, b} and {c, d, e}

DFA Minimization using Equivalence Theorem
If X and Y are two states in a DFA, we can combine them into a single state {X, Y} if they are not
distinguishable. Two states are distinguishable if there is at least one string S such that one of
δ(X, S) and δ(Y, S) is accepting and the other is not. Hence, a DFA is minimal if and
only if all of its states are distinguishable.

Algorithm

Step 1 − All the states Q are divided into two partitions, final states and non-final states,
denoted P0. All the states within a partition are 0th-equivalent. Take a counter k and
initialize it to 0.
Step 2 − Increment k by 1. For each partition in Pk, divide the states in Pk into two partitions if
they are k-distinguishable. Two states X and Y within a partition are
k-distinguishable if there is an input S such that δ(X, S) and δ(Y, S) are
(k-1)-distinguishable.
Step 3 − If Pk ≠ Pk-1, repeat Step 2; otherwise go to Step 4.
Step 4 − Combine the kth-equivalent sets and make them the new states of the reduced DFA.
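A sketch of this partition-refinement procedure, run on the six-state DFA of the example that follows so its output can be checked against P0, P1, P2 below (the encoding and helper name refine are ours):

```python
# Partition refinement (equivalence theorem) on the six-state example DFA.
delta = {
    ("a", "0"): "b", ("a", "1"): "c",
    ("b", "0"): "a", ("b", "1"): "d",
    ("c", "0"): "e", ("c", "1"): "f",
    ("d", "0"): "e", ("d", "1"): "f",
    ("e", "0"): "e", ("e", "1"): "f",
    ("f", "0"): "f", ("f", "1"): "f",
}
Q, Sigma, F = set("abcdef"), "01", {"c", "d", "e"}

def refine(partition):
    """One round of Step 2: split blocks whose states disagree on
    which block they move to under some input symbol."""
    block_of = {q: i for i, block in enumerate(partition) for q in block}
    groups = {}
    for q in Q:
        sig = (block_of[q],
               tuple(block_of[delta[(q, a)]] for a in Sigma))
        groups.setdefault(sig, set()).add(q)
    return sorted(groups.values(), key=lambda b: sorted(b))

partition = sorted([F, Q - F], key=lambda b: sorted(b))  # P0
while True:
    nxt = refine(partition)
    if nxt == partition:        # Pk == Pk-1: fixed point reached
        break
    partition = nxt

print([sorted(b) for b in partition])  # [['a', 'b'], ['c', 'd', 'e'], ['f']]
```

The loop stops as soon as a refinement round changes nothing, exactly the Pk = Pk-1 test of Step 3, and the surviving blocks {a, b}, {c, d, e}, {f} are the states of the reduced DFA.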

Example

Let us consider the following DFA −

q δ(q,0) δ(q,1)
a b c
b a d
c e f
d e f
e e f
f f f

Let us apply the above algorithm to this DFA −

 P0 = {(c,d,e), (a,b,f)}
 P1 = {(c,d,e), (a,b),(f)}
 P2 = {(c,d,e), (a,b),(f)}

Hence, P1 = P2.

There are three states in the reduced DFA. The reduced DFA is as follows −

The State table of DFA is as follows −

Q δ(q,0) δ(q,1)
(a, b) (a, b) (c,d,e)
(c,d,e) (c,d,e) (f)
(f) (f) (f)

Its graphical representation would be as follows −

The Structure of the Generated Analyzer

The figure overviews the architecture of a lexical analyzer generated by Lex. The
program that serves as the lexical analyzer includes a fixed program that simulates
an automaton; at this point we leave open whether that automaton is deterministic
or nondeterministic. The rest of the lexical analyzer consists of components that
are created from the Lex program by Lex itself.

A Lex program is turned into a transition table and actions, which are used by a
finite-automaton simulator.

These components are:


1. A transition table for the automaton.
2. Those functions that are passed directly through Lex to the output.
3. The actions from the input program, which appear as fragments of code to be
invoked at the appropriate time by the automaton simulator.

To construct the automaton, we begin by taking each regular-expression pattern in
the Lex program and converting it to an NFA. We need a single automaton that
will recognize lexemes matching any of the patterns in the program, so we
combine all the NFA's into one by introducing a new start state with
ε-transitions to each of the start states of the NFA's Ni for pattern pi. This
construction is shown in the figure.

An NFA constructed from a Lex program

Figure shows three NFA's that recognize the three patterns.

Combined NFA

Sequence of sets of states entered when processing input aaba

We look backwards in the sequence of sets of states until we find a set that
includes one or more accepting states. If there are several accepting states in that
set, pick the one associated with the earliest pattern pi in the list from the Lex
program. Move the forward pointer back to the end of the lexeme, and perform the
action Ai associated with pattern pi.
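This longest-match, earliest-pattern rule can be sketched without building the NFA explicitly, here using Python's re module as a stand-in for the automaton simulator. The patterns a, abb, a*b+ and the input aaba follow the running example; next_token and the labels p1-p3 are our own names:

```python
# Sketch of Lex's disambiguation rule: prefer the longest match, and
# among equally long matches, the pattern listed first.
import re

patterns = [("p1", "a"), ("p2", "abb"), ("p3", "a*b+")]  # listed order = priority

def next_token(text, pos=0):
    """Return (pattern name, lexeme) for the longest match at pos,
    breaking ties in favour of the earlier-listed pattern."""
    best = None  # (-length, listed index, name, lexeme); min() order does the work
    for i, (name, pat) in enumerate(patterns):
        m = re.match(pat, text[pos:])
        if m and m.group(0):              # ignore empty matches
            cand = (-len(m.group(0)), i, name, m.group(0))
            if best is None or cand < best:
                best = cand
    return (best[2], best[3]) if best else None

print(next_token("aaba"))      # ('p3', 'aab'): a*b+ wins with the longest lexeme
print(next_token("aaba", 3))   # ('p1', 'a'): only pattern a matches at the last position
```

On aaba the simulator's backwards scan and this sketch agree: the last set of states containing an accepting state corresponds to the three-character lexeme aab of a*b+.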
