Unit 2-LEXICAL ANALYSIS
CS-6660 COMPILER DESIGN VI SEM CSE
Interaction of Lexical Analyzer with a Parser
Functions of the Lexical Analyzer:
The lexical analyzer performs certain secondary tasks at the user interface:
It eliminates comments and whitespace.
It correlates error messages from the compiler with the source program. For example, the lexical analyzer keeps track of the number of newline characters seen, so that a line number can be associated with each error message.
It reports errors encountered while generating tokens.
Two Phases of Lexical analyzer:
a) Scanning:
It consists of the simple processes that do not require tokenization of the
input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
b) Lexical analysis:
It is the more complex portion, where the scanner produces the sequence of
tokens as output.
Token        Sample Lexemes        Informal Description of Pattern
const        const                 the characters c, o, n, s, t
if           if                    the characters i, f
comparison   <, <=, >, >=, ==      < or <= or > or >= or ==
id           i, a, pi, d2          a letter followed by letters and digits
number       3.1416, 0, 6.456     any numeric constant
Examples of Tokens
Tokens are treated as terminal symbols in the grammar for the source language
using boldface names to represent tokens.
The lexemes matched by the pattern for the token represent strings of characters in
the source program that can be treated together as a lexical unit.
In most programming languages, keywords, operators, identifiers, constants,
literal strings, and punctuation symbols are treated as tokens.
A Pattern is a rule describing the set of lexemes that can represent a particular token
in source programs. Example: The pattern for the keyword const is the single string
const that spells the keyword.
In many languages, certain strings are reserved; i.e., their meanings are predefined
and cannot be changed by the user.
If the keywords are not reserved, then the lexical analyzer must distinguish between
a keyword and a user-defined identifier.
Attributes for Tokens:
When more than one pattern matches a lexeme, the lexical analyzer must
provide additional information about the particular lexeme that matched to
the subsequent phases of the compiler.
For example, the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was found
in the source program.
The lexical analyzer collects information about tokens into their associated
attributes.
The token name influences parsing decisions; the attributes influence the
translation of tokens.
In practice, a token usually has a single attribute – a pointer to the symbol-table
entry in which the information about the token is kept. The pointer becomes the
attribute for the token.
Example:
The tokens and associated attribute-values for the FORTRAN statement
E = M * C **2
are written below as a sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<num, integer value 2>
The token num has been given an integer-valued attribute. In general, the compiler
stores the character string that forms a number in a symbol table and lets the
attribute of token num be a pointer to the table entry.
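The pairing above can be sketched in code. The following is a minimal illustrative tokenizer for that one statement; the token names (id, assign_op, mult_op, exp_op, num) come from the text, but the regex-driven scanner and the dict used as a symbol table are assumptions of this sketch, not the compiler's actual data structures:

```python
import re

# Illustrative token patterns; exp_op (**) must be tried before mult_op (*).
TOKEN_SPEC = [
    ("id",        r"[A-Za-z][A-Za-z0-9]*"),
    ("num",       r"\d+"),
    ("exp_op",    r"\*\*"),
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("ws",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source, symtab):
    """Return a list of <token, attribute> pairs for `source`.
    For id, the attribute is a 'pointer' (index) into symtab;
    for num, it is the integer value; other tokens carry no attribute."""
    pairs = []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                                     # whitespace yields no token
        if kind == "id":
            entry = symtab.setdefault(lexeme, len(symtab))
            pairs.append((kind, entry))
        elif kind == "num":
            pairs.append((kind, int(lexeme)))            # integer-valued attribute
        else:
            pairs.append((kind, None))
    return pairs
```

Running `tokenize("E = M * C ** 2", {})` yields the sequence of pairs listed above, with E, M, C installed as symbol-table entries 0, 1, 2.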
LEXICAL ERRORS:
Lexical errors are the errors raised by the lexer when it is unable to continue,
that is, when there is no way to recognize a lexeme as a valid token for the
lexer.
Few errors are discernible during the lexical analysis phase.
If the string fi is encountered in a C program for the first time in the context
fi (a== f(x)) …
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or
an undeclared function identifier.
Since fi is a valid identifier, the lexical analyzer returns the token for the
identifier and this error is handled by some other phase of the compiler.
Lexical Errors are:
1. Spelling errors.
2. Exceeding length of identifier or numeric constants.
3. Appearance of illegal characters.
Error-Recovery Actions (or) Strategies:
Panic Mode:
Sometimes, lexical analyzer is unable to proceed because patterns for tokens
do not match a prefix of the remaining input. The lexical analyzer deletes
successive characters from the remaining input until the lexical analyzer can find a
well-formed token. This recovering technique is called panic mode.
Other possible error-recovery actions are:
a. Deleting an extraneous character
b. Inserting a missing character
c. Replacing an incorrect character by a correct character
d. Transposing two adjacent characters
Error transformations like these may be tried in an attempt to repair the input. The
simplest strategy is to see whether a prefix of the remaining input can be
transformed into a valid lexeme by a single error transformation.
INPUT BUFFERING
Buffer Pair Method:
During lexical analysis, to identify a lexeme it is often necessary to look
ahead at least one additional character beyond it.
Specialized buffering techniques have been developed to reduce the amount
of overhead required to process a single input character.
An important scheme involves two buffers that are alternately reloaded.
a) Pointer lexemeBegin marks the beginning of the current lexeme, whose
extent we are attempting to determine.
b) Pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its
right end
After the lexeme is recorded as an attribute value of a token returned to the
parser, lexemeBegin is set to the character immediately after the lexeme just
found.
Example: In the figure given above, forward has passed the end of the next
lexeme, **. It must be retracted one position to its left.
Sentinel Method:
o A sentinel is a special character that cannot be part of the source program.
o Example for the special character is eof.
o A sentinel eof is added at the end of the buffer. If the input string is shorter
than the buffer, an eof is also added at the point where the input string
terminates.
o By adding a sentinel value, we can easily find the end of the buffer.
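The buffer-pair-with-sentinel idea can be sketched as follows. This is an illustrative model, not the real implementation: the sentinel character, the tiny buffer size, and the generator interface are all assumptions of the sketch. The point is that the inner loop tests only for the sentinel, and the slower "end of buffer or end of input?" logic runs only when the sentinel is actually seen:

```python
EOF = "\0"          # assumed sentinel; legitimate input never contains it
BUF = 8             # buffer-half size (deliberately tiny, for illustration)

def read_all(source):
    """Yield the characters of `source` through a two-buffer sentinel scheme."""
    # Each buffer half ends in the sentinel character.
    halves = [source[i:i + BUF] + EOF for i in range(0, len(source), BUF)]
    if not halves:
        halves = [EOF]
    which, buf, forward = 0, halves[0], 0
    while True:
        ch = buf[forward]
        forward += 1
        if ch == EOF:                        # sentinel reached: decide which case
            if which + 1 < len(halves):      # end of this half: reload the other
                which += 1
                buf, forward = halves[which], 0
            else:
                return                       # true end of input
        else:
            yield ch
```

Every character is delivered in order even though the input spans several reloads of the buffer halves.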
Regular expressions are an important notation for specifying lexeme
patterns.
The positive closure, denoted by L+, is the set of strings we get by
concatenating L one or more times.
S.No.  Operation                  Definition & Notation
1.     Union of L and M           L ∪ M = { s | s is in L or s is in M }
2.     Concatenation of L and M   LM = { st | s is in L and t is in M }
3.     Kleene closure of L        L* = L^0 ∪ L^1 ∪ L^2 ∪ . . . (zero or more concatenations of L, where L^0 = {ε})
4.     Positive closure of L      L+ = L^1 ∪ L^2 ∪ . . . (one or more concatenations of L)
Definitions of operations on languages
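These operations can be sketched directly on Python sets of strings. Since L* and L+ are infinite in general, the sketch truncates them to at most k concatenations; the function names and the truncation parameter are assumptions of this illustration:

```python
def union(L, M):
    return L | M                                  # L ∪ M

def concat(L, M):
    return {s + t for s in L for t in M}          # LM = { st | s in L, t in M }

def star(L, k=3):
    """Kleene closure truncated to at most k concatenations of L."""
    result, power = {""}, {""}                    # L^0 = {ε}
    for _ in range(k):
        power = concat(power, L)                  # L^i
        result |= power
    return result

def plus(L, k=3):
    """Positive closure: L+ = L L* (truncated)."""
    return concat(L, star(L, k - 1))
```

For instance, star({"a"}, 2) gives {"", "a", "aa"}, while plus({"a"}, 2) omits the empty string.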
Regular Expressions:
Example: In regular-expression notation, if letter_ is established to
stand for any letter or the underscore, and digit is established to stand for
any digit, then the language of C identifiers can be described as:
letter_ ( letter_ | digit )*
The vertical bar | above means union, the parentheses ( ) are used to group
subexpressions, and the * means "zero or more occurrences of".
The regular expressions are built recursively out of smaller regular
expressions, using the rules described below.
BASIS:
There are two rules that form the basis:
1. ε is a regular expression and L(ε) is {ε}, that is, the language whose
sole member is the empty string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a},
that is, the language with one string, of length one, with a in its one
position.
Induction:
There are four parts to the induction whereby larger regular expressions are
built from smaller ones. Suppose r and s are regular expressions denoting
languages L(r) and L(s), respectively.
1. (r) | (s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r) L(s) .
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of parentheses around
expressions without changing the language they denote.
Regular expressions often contain unnecessary pairs of parentheses. Certain
conventions are followed to remove pair of parentheses:
a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.
3. a* denotes the language consisting of all strings of zero or more a's, that
is, {ε, a, aa, aaa, . . . }.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that
is, all strings of a's and b's: { ε , a, b, aa, ab, ba, bb, aaa, . . .}.
Another regular expression for the same language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, . . .}, that is, the string a
and all strings consisting of zero or more a's and ending in b.
A language that can be defined by a regular expression is called a regular set.
If two regular expressions r and s denote the same regular set, we say they are
equivalent and write r = s.
For instance, (a|b) = (b|a)
There are a number of algebraic laws for regular expressions; each law asserts that
expressions of two different forms are equivalent.
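Equivalences such as (a|b)* = (a*b*)* can be checked empirically by comparing the two expressions on every short string over the alphabet. The sketch below uses Python's re module as the matcher; note that agreement on bounded samples is evidence of equivalence, not a proof:

```python
import re
from itertools import product

def strings_over(alphabet, max_len):
    """Generate every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for tup in product(alphabet, repeat=n):
            yield "".join(tup)

def equivalent_on_samples(r, s, alphabet="ab", max_len=5):
    """True iff r and s accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(r, w)) == bool(re.fullmatch(s, w))
        for w in strings_over(alphabet, max_len)
    )
```

On all strings of a's and b's up to length 5, (a|b)* and (a*b*)* agree, as do a|b and b|a, while a* and aa* disagree on the empty string.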
Regular Definitions
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form:
d1 -> r1
d2 -> r2
. . .
dn -> rn
Where:
1. Each di is a new symbol, not in Σ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, . . . , di-1}.
1. One or more instances.
The unary postfix operator + has the same precedence and associativity as the
operator *. Two useful algebraic laws, r* = r+|ε and r+ = rr* = r*r, relate
the Kleene closure and positive closure.
2. Zero or one instance.
The unary postfix operator ? means "zero or one occurrence."
That is, r? is equivalent to r|ε, or put another way,
L(r?) =L(r) U {ε}.
The ? operator has the same precedence and associativity as * and +.
3. Character classes.
A regular expression a1|a2|. . .|an, where the ai's are each symbols of the
alphabet, can be replaced by the shorthand [a1a2 . . . an].
When a1, a2, . . . , an form a logical sequence, e.g., consecutive
uppercase letters, lowercase letters, or digits, we can replace them by
a1-an, that is, just the first and last separated by a hyphen. Thus, [abc]
is shorthand for a|b|c, and [a-z] is shorthand for a|b|. . .|z.
Example: Using these shorthands, we can rewrite the regular definition for the C
identifiers as follows:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ ( letter_ | digit )*
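The same definition translates directly into a Python regular expression (a sketch; Python's character-class syntax happens to match the shorthand used above):

```python
import re

# letter_ ( letter_ | digit )*  as a Python regular expression
ID = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def is_c_identifier(s):
    """True iff the whole string s is a C identifier."""
    return bool(ID.fullmatch(s))
```

So pi, _tmp1 and d2 are identifiers, while 2x (starts with a digit) and the empty string are not.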
RECOGNITION OF TOKENS:
Consider the example grammar for branching statement given below:
The patterns for these tokens are described using regular definitions as below:
For this language, the lexical analyzer will recognize the keywords if, then, and
else, as well as lexemes that match the patterns for relop, id, and number.
In addition, we assign the lexical analyzer the job of stripping out whitespace,
by recognizing the "token" ws defined by:
ws -> ( blank | tab | newline )+
Here, blank, tab, and newline are abstract symbols that we use to express the
ASCII characters of the same names.
Tokens, their patterns, and attribute values are summarized below:
Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert
patterns into stylized flowcharts, called "transition diagrams."
Transition diagrams have a collection of nodes or circles, called states.
o Each state represents a condition that could occur during the
process of scanning the input looking for a lexeme that matches
one of several patterns.
Edges are directed from one state of the transition diagram to another. Each
edge is labeled by a symbol or set of symbols. If we are in some state s, and
the next input symbol is a, we look for an edge out of state s labeled by a.
If we find such an edge, we advance the forward pointer and enter the state of
the transition diagram to which that edge leads.
We shall assume that all our transition diagrams are deterministic, meaning that
there is never more than one edge out of a given state with a given symbol
among its labels.
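A transition diagram can be simulated directly in code. The sketch below follows the classic diagram for relational operators (tokens LT, LE, NE, EQ, GT, GE, with = treated as the equality operator, as in the textbook example); the function shape, returning the token and the position just past the lexeme, is an assumption of this sketch. The "other" edges correspond to retracting the forward pointer:

```python
def relop(s, pos=0):
    """Simulate the relop transition diagram starting at s[pos].
    Return (token, new_pos), or None if no relational operator starts there."""
    n = len(s)
    c = s[pos] if pos < n else ""
    if c == "=":
        return ("EQ", pos + 1)
    if c == "<":
        nxt = s[pos + 1] if pos + 1 < n else ""
        if nxt == "=":
            return ("LE", pos + 2)
        if nxt == ">":
            return ("NE", pos + 2)
        return ("LT", pos + 1)      # "other" edge: retract, accept <
    if c == ">":
        nxt = s[pos + 1] if pos + 1 < n else ""
        if nxt == "=":
            return ("GE", pos + 2)
        return ("GT", pos + 1)      # "other" edge: retract, accept >
    return None
```

For input "<=", the simulation follows the < edge, then the = edge, and accepts LE; for "<a" it takes the "other" edge, retracts, and accepts LT.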
Recognition of Reserved Words and Identifiers
Usually, keywords like if or then are reserved, so they are not identifiers
even though they look like identifiers.
The transition diagram for ids and keywords is given below:
There are two ways that we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially. A field of the
symbol-table entry indicates that these strings are never ordinary identifiers,
and tells which token they represent. When we find an identifier, a call to
installID places it in the symbol table if it is not already there and returns a
pointer to the symbol-table entry for the lexeme found.
The function getToken examines the symbol table entry for the lexeme
found, and returns whatever token name the symbol table says this
lexeme represents - either id or one of the keyword tokens that was
initially installed in the table.
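Strategy 1 can be sketched as follows. The dict-based symbol table and the entry layout are assumptions of this sketch; only the behavior of installID and getToken follows the description above:

```python
# Keywords are pre-installed; the "token" field marks them as never
# being ordinary identifiers.
symtab = {}
for kw in ("if", "then", "else"):
    symtab[kw] = {"token": kw.upper()}

def installID(lexeme):
    """Place the lexeme in the symbol table if not already there,
    and return (a pointer to) its symbol-table entry."""
    return symtab.setdefault(lexeme, {"token": "ID"})

def getToken(entry):
    """Return whatever token name the symbol table records for this entry:
    either ID or one of the pre-installed keyword tokens."""
    return entry["token"]
```

Looking up "if" yields the keyword token IF, while any other lexeme is installed once and thereafter reported as an ordinary ID.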
2. Create separate transition diagrams for each keyword; an example for the
keyword then is shown below.
If a dot is seen, we have an "optional fraction". State 14 is entered, and we
look for one or more additional digits; state 15 is used for this purpose.
If we see an E, we have an "optional exponent". States 16 through 19 are
used to recognize the exponent value.
In state 15, if we see anything other than E or a digit, we enter state 21, the
accepting state.
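The number pattern digit+ (. digit+)? (E (+|-)? digit+)? can be simulated as an explicit scanner with retraction. This sketch does not reproduce the diagram's state numbers; it returns the length of the longest prefix of the input that is a number (0 if there is none), which is exactly what retracting to the last accepting state achieves:

```python
def number_prefix(s):
    """Length of the longest prefix of s matching
    digit+ (. digit+)? (E (+|-)? digit+)?, or 0 if none."""
    n = len(s)

    def digits(j):
        k = j
        while k < n and s[k].isdigit():
            k += 1
        return k

    i = digits(0)
    if i == 0:
        return 0                     # no leading digits: not a number
    if i < n and s[i] == ".":
        j = digits(i + 1)
        if j > i + 1:
            i = j                    # optional fraction matched
    if i < n and s[i] == "E":
        j = i + 1
        if j < n and s[j] in "+-":
            j += 1
        k = digits(j)
        if k > j:
            i = k                    # optional exponent matched
    return i
```

On "12." the dot is not followed by a digit, so the scanner retracts and accepts just "12"; on "6.456E-2" the full fraction and exponent are consumed.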
An input file lex.l is written in the Lex language and describes the lexical analyzer
to be generated.
The Lex compiler transforms lex.l to a C program in a file that is always named
lex.yy.c.
The file lex.yy.c is compiled using a C compiler into an output file called a.out.
The output file a.out is a working lexical analyzer that can take a stream of input
characters and produce a stream of tokens.
declarations
%%
translation rules
%%
auxiliary functions
Declarations section
The declarations section includes declarations of variables, manifest
constants (identifiers declared to stand for a constant, e.g., the name of a
token), and regular definitions.
Translation rules
The translation rules each have the form
o Pattern { Action }
Each pattern is a regular expression, which may use the regular definitions
of the declaration section.
The actions are fragments of code, typically written in C.
Auxiliary functions
The third section holds whatever additional functions are used in the actions.
Alternatively, these functions can be compiled separately and loaded with
the lexical analyzer.
The lexical analyzer created by Lex behaves as follows.
When called by the parser, the lexical analyzer begins reading its remaining
input, one character at a time, until it finds the longest prefix of the input that
matches one of the patterns Pi.
It then executes the associated action Ai. Typically, Ai will return to the parser,
but if it does not, then the lexical analyzer proceeds to find additional lexemes,
until one of the corresponding actions causes a return to the parser.
The lexical analyzer returns a single value, the token name, to the parser, but
uses the shared, integer variable yylval to pass additional information about the
lexeme found, if needed.
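The matching discipline described above, take the longest matching prefix, breaking ties in favor of the pattern listed first, can be sketched in Python. The rule table and token names here are illustrative, not Lex's actual table format:

```python
import re

# Patterns listed in priority order: on equal-length matches, earlier wins,
# so the keyword "if" beats the identifier pattern for the exact input "if".
RULES = [
    ("IF",     r"if"),
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),
    ("NUMBER", r"\d+"),
    ("WS",     r"[ \t\n]+"),
]

def lex(source):
    tokens, pos = [], 0
    while pos < len(source):
        best = None                            # (length, -rule_index, name, lexeme)
        for i, (name, pat) in enumerate(RULES):
            m = re.match(pat, source[pos:])
            if m:
                cand = (len(m.group()), -i, name, m.group())
                if best is None or cand > best:  # longer wins; earlier rule breaks ties
                    best = cand
        if best is None:
            raise ValueError(f"lexical error at position {pos}")
        length, _, name, lexeme = best
        pos += length
        if name != "WS":                       # whitespace: empty action, no token
            tokens.append((name, lexeme))
    return tokens
```

Note that "ifx" is a single identifier, not the keyword if followed by x, because the longest match wins.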
In the declarations section we see a pair of special brackets, %{ and %}.
Anything within these brackets is copied directly to the file lex . yy . c, and
is not treated as a regular definition. The manifest constants are placed inside
it.
Also in the declarations section is a sequence of regular definitions.
Regular definitions that are used in later definitions or in the patterns of the
translation rules are surrounded by curly braces. Thus, for instance, delim is
defined to be shorthand for the character class consisting of the blank, the
tab, and the newline; the latter two are represented, as in all UNIX
commands, by backslash followed by t or n, respectively.
In the definition of id and number, parentheses are used as grouping
metasymbols and do not stand for themselves. In contrast, E in the definition
of number stands for itself.
In the auxiliary-function section, we see two functions, installID() and
installNum(). Everything in the auxiliary section is copied directly to file
lex.yy.c, but may be used in the actions.
First, ws, an identifier declared in the first section, has an associated empty
action. If we find whitespace, we do not return to the parser, but look for
another lexeme.
The second token has the simple regular-expression pattern if. If the lexeme
consists of exactly the two characters i and f, the lexical analyzer returns the
token name IF. The keywords then and else are treated similarly.
The action taken when id is matched is as follows:
a. Function installID() is called to place the lexeme found in the symbol
table.
b. This function returns a pointer to the symbol-table entry, which is placed
in the global variable yylval. installID() has access to two variables
that are set automatically by the lexical analyzer:
i. yytext is a pointer to the beginning of the lexeme.
ii. yyleng is the length of the lexeme found.
c. The token name ID is returned to the parser.
FINITE AUTOMATA:
A finite automaton is called a DFA if there is exactly one path for a specific
input from the current state to the next state.
From state S0, for input 'a' there is only one path, going to S2. Similarly,
from S0 there is only one path for input 'b', going to S1.
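The determinism above is exactly what makes a DFA trivial to simulate: the transition function is a lookup, never a search. The sketch below assumes a small example DFA over {a, b} with states S0, S1, S2 and accepting state S2 (an illustrative machine, not the one in the missing figure):

```python
# Transition table: every (state, symbol) pair has exactly one successor.
DELTA = {
    ("S0", "a"): "S2", ("S0", "b"): "S1",
    ("S1", "a"): "S2", ("S1", "b"): "S1",
    ("S2", "a"): "S2", ("S2", "b"): "S1",
}
ACCEPT = {"S2"}

def run(word, start="S0"):
    """Simulate the DFA: one table lookup per input character."""
    state = start
    for ch in word:
        state = DELTA[(state, ch)]   # exactly one path per input symbol
    return state in ACCEPT
```

This machine accepts exactly the strings ending in a.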
Algorithm:
Example: From Regular Expression to DFA Directly:
• nullable(n): the sub tree at node n generates languages including the empty
string.
• firstpos(n): set of positions that can match the first symbol of a string
generated by the sub tree at node n.
• lastpos(n): the set of positions that can match the last symbol of a string
generated be the sub tree at node n.
• followpos(i): the set of positions that can follow position i in the tree
Annotating the Tree (OR) Rules for computing nullable, firstpos and lastpos:
For a leaf labeled ε: nullable = true; firstpos = lastpos = ∅.
For a leaf with position i: nullable = false; firstpos = lastpos = {i}.
For an or-node n = c1 | c2: nullable(n) = nullable(c1) or nullable(c2);
firstpos(n) = firstpos(c1) ∪ firstpos(c2); lastpos(n) = lastpos(c1) ∪ lastpos(c2).
For a cat-node n = c1c2: nullable(n) = nullable(c1) and nullable(c2);
firstpos(n) = firstpos(c1) ∪ firstpos(c2) if nullable(c1), else firstpos(c1);
lastpos(n) = lastpos(c1) ∪ lastpos(c2) if nullable(c2), else lastpos(c2).
For a star-node n = c1*: nullable(n) = true; firstpos(n) = firstpos(c1);
lastpos(n) = lastpos(c1).
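These rules, together with the two followpos rules (for a cat-node, every position in lastpos(c1) is followed by every position in firstpos(c2); for a star-node, every position in its lastpos is followed by every position in its firstpos), can be sketched over the syntax tree of (a|b)*abb#. The tuple encoding of tree nodes is an assumption of this sketch:

```python
# Nodes: ("leaf", position), ("or", l, r), ("cat", l, r), ("star", child).
# Positions for (a|b)*abb#: a=1, b=2 (inside the star), a=3, b=4, b=5, #=6.
followpos = {i: set() for i in range(1, 7)}

def analyze(node):
    """Return (nullable, firstpos, lastpos) of node, filling in followpos."""
    kind = node[0]
    if kind == "leaf":
        return False, {node[1]}, {node[1]}
    if kind == "or":
        n1, f1, l1 = analyze(node[1]); n2, f2, l2 = analyze(node[2])
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1]); n2, f2, l2 = analyze(node[2])
        for i in l1:                       # followpos rule for cat-nodes
            followpos[i] |= f2
        return (n1 and n2,
                f1 | f2 if n1 else f1,
                l1 | l2 if n2 else l2)
    if kind == "star":
        n1, f1, l1 = analyze(node[1])
        for i in l1:                       # followpos rule for star-nodes
            followpos[i] |= f1
        return True, f1, l1

tree = ("cat", ("cat", ("cat", ("cat",
        ("star", ("or", ("leaf", 1), ("leaf", 2))),
        ("leaf", 3)), ("leaf", 4)), ("leaf", 5)), ("leaf", 6))
nullable, firstpos, lastpos = analyze(tree)
```

For the root: nullable is false, firstpos = {1, 2, 3}, lastpos = {6}, and followpos comes out as 1, 2 → {1, 2, 3}; 3 → {4}; 4 → {5}; 5 → {6}, which is exactly the information needed to build the DFA states directly.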
Syntax Tree of (a|b)*abb#
From Regular Expression to DFA Directly: Algorithm
MINIMIZATION OF DFA:
(Q, Σ, δ, q0, F)
Some states can be redundant:
This is a state-minimized (or just minimized) DFA
Two states p and q are distinguishable if one is accepting and the other is not,
or if, for some a in Σ, δ(p, a) and δ(q, a) are distinguishable.
Using this inductive definition, we can calculate which states are distinct.
DFA transition table:
STATE a b
A B C
B B D
C B C
D B E
*E B C
Minimum state DFA:
[ABCDE]
[ABCD][E]
[ABC][D][E]
[AC][B][D][E]
STATE a b
A B A
B B D
D B E
*E B A
DFA Minimization is a fairly understandable process, and is useful in several
areas.
Algorithm
Input: DFA
Output: Minimized DFA
Step 1 Draw a table for all pairs of states (Qi, Qj) not necessarily connected directly [All
are unmarked initially]
Step 2 Consider every state pair (Qi, Qj) in the DFA where Qi ∈ F and Qj ∉ F or vice versa
and mark them. [Here F is the set of final states].
Step 3 Repeat this step until we cannot mark any more states −
If there is an unmarked pair (Qi, Qj), mark it if the pair {δ(Qi, A), δ(Qj, A)} is
marked for some input symbol A.
Step 4 Combine all the unmarked pairs (Qi, Qj) and make them a single state in the reduced
DFA.
Example
Step 1 − We draw a table for all pair of states.
a b c d e f
a
b
c
d
e
f
Step 2 − We mark the state pairs (Qi, Qj) where exactly one of Qi and Qj is a final state:
a b c d e f
a
b
c ✔ ✔
d ✔ ✔
e ✔ ✔
f ✔ ✔ ✔
Step 3 − We now try to mark further state pairs, transitively. If we give input 1 to states
'a' and 'f', they go to states 'c' and 'f' respectively. (c, f) is already marked, hence we
mark the pair (a, f). Next, giving input 1 to states 'b' and 'f' takes them to 'd' and 'f'
respectively; (d, f) is already marked, hence we mark the pair (b, f).
a b c d e f
a
b
c ✔ ✔
d ✔ ✔
e ✔ ✔
f ✔ ✔ ✔ ✔ ✔
After step 3, we have got state combinations {a, b} {c, d} {c, e} {d, e} that are unmarked.
So the final minimized DFA will contain three states {f}, {a, b} and {c, d, e}
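Steps 1–4 above can be sketched as a table-filling algorithm and checked against this example. The DFA (states a–f, alphabet {0, 1}, accepting states {c, d, e}) is taken from the transition table used in the equivalence-theorem example below; the dict and frozenset representations are assumptions of the sketch:

```python
from itertools import combinations

delta = {
    ("a", "0"): "b", ("a", "1"): "c",
    ("b", "0"): "a", ("b", "1"): "d",
    ("c", "0"): "e", ("c", "1"): "f",
    ("d", "0"): "e", ("d", "1"): "f",
    ("e", "0"): "e", ("e", "1"): "f",
    ("f", "0"): "f", ("f", "1"): "f",
}
states, alphabet, final = "abcdef", "01", set("cde")

# Step 2: mark every pair with exactly one final state.
marked = {frozenset(p) for p in combinations(states, 2)
          if (p[0] in final) != (p[1] in final)}

# Step 3: mark (p, q) if some input leads to an already-marked pair.
changed = True
while changed:
    changed = False
    for p, q in combinations(states, 2):
        pair = frozenset((p, q))
        if pair in marked:
            continue
        if any(frozenset((delta[p, x], delta[q, x])) in marked
               for x in alphabet):
            marked.add(pair)
            changed = True

# Step 4: unmarked pairs are equivalent; merge them into blocks.
blocks = {s: {s} for s in states}
for p, q in combinations(states, 2):
    if frozenset((p, q)) not in marked:
        merged = blocks[p] | blocks[q]
        for s in merged:
            blocks[s] = merged
partition = {frozenset(b) for b in blocks.values()}
```

The computation leaves (a, b), (c, d), (c, e) and (d, e) unmarked, so the reduced DFA has the three states {a, b}, {c, d, e} and {f}, matching the result above.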
DFA Minimization using Equivalence Theorem
If X and Y are two states in a DFA, we can combine these two states into {X, Y} if they are not
distinguishable. Two states are distinguishable, if there is at least one string S, such that one of δ
(X, S) and δ (Y, S) is accepting and another is not accepting. Hence, a DFA is minimal if and
only if all the states are distinguishable.
Algorithm
Step 1 All the states Q are divided into two partitions − final states and non-final states −
denoted by P0. All the states in a partition are 0-equivalent. Take a counter k and
initialize it with 0.
Step 2 Increment k by 1. For each partition in Pk, divide the states in Pk into two partitions
if they are k-distinguishable. Two states X and Y within a partition are
k-distinguishable if there is an input S such that δ(X, S) and δ(Y, S) are
(k−1)-distinguishable.
Step 3 If Pk ≠ Pk−1, repeat Step 2; otherwise go to Step 4.
Step 4 Combine the kth-equivalent sets and make them the new states of the reduced DFA.
Example
q δ(q,0) δ(q,1)
a b c
b a d
c e f
d e f
e e f
f f f
Let us apply above algorithm to the above DFA −
P0 = {(c,d,e), (a,b,f)}
P1 = {(c,d,e), (a,b),(f)}
P2 = {(c,d,e), (a,b),(f)}
Hence, P1 = P2.
There are three states in the reduced DFA. The reduced DFA is as follows −
Q δ(q,0) δ(q,1)
(a, b) (a, b) (c,d,e)
(c,d,e) (c,d,e) (f)
(f) (f) (f)
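The partition-refinement steps above can be sketched on the same DFA. One round of refine splits each block according to which blocks the successors of its states fall into; iteration stops when no block splits, i.e., when Pk = Pk−1. The signature-based splitting is an assumption of this sketch, equivalent to testing k-distinguishability:

```python
delta = {
    ("a", "0"): "b", ("a", "1"): "c",
    ("b", "0"): "a", ("b", "1"): "d",
    ("c", "0"): "e", ("c", "1"): "f",
    ("d", "0"): "e", ("d", "1"): "f",
    ("e", "0"): "e", ("e", "1"): "f",
    ("f", "0"): "f", ("f", "1"): "f",
}
states, alphabet, final = set("abcdef"), "01", set("cde")

def refine(partition):
    """One round: split each block by the blocks its successors land in."""
    index = {s: i for i, block in enumerate(partition) for s in block}
    new = {}
    for s in states:
        # Signature: current block plus successor block for each symbol.
        sig = (index[s],) + tuple(index[delta[s, x]] for x in alphabet)
        new.setdefault(sig, set()).add(s)
    return sorted(new.values(), key=lambda b: sorted(b))

P = [final, states - final]          # P0 = {final, non-final}
while True:
    Q = refine(P)
    if len(Q) == len(P):             # refinement only splits, so equal
        break                        # block counts mean Pk == Pk-1
    P = Q
```

Starting from P0 = {(c, d, e), (a, b, f)}, one round separates f from {a, b}, and the next round changes nothing, reproducing P1 = P2 = {(a, b), (c, d, e), (f)} as computed above.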
The Structure of the Generated Analyzer
The figure overviews the architecture of a lexical analyzer generated by Lex. The
program that serves as the lexical analyzer includes a fixed program that simulates
an automaton; at this point we leave open whether that automaton is deterministic
or nondeterministic. The rest of the lexical analyzer consists of components that
are created from the Lex program by Lex itself.
A Lex program is turned into a transition table and actions, which are used by a
finite-automaton simulator.
An NFA constructed from a Lex program
Combined NFA
We look backwards in the sequence of sets of states, until we find a set that
includes one or more accepting states. If there are several accepting states in that
set, pick the one associated with the earliest pattern pi in the list from the Lex
program. Move the forward pointer back to the end of the lexeme, and perform the
action Ai associated with pattern pi.