Lexical Analysis
Role of Lexical Analyzer:
It is the first phase of a compiler.
The main task of the lexical analyzer is to read the input characters of the source program, group
them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source
program, and send them to the parser for syntax analysis.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table. This information stored in the symbol table regarding the kind of
identifier may be used by the lexical analyzer in determining the proper token it must pass to the
parser.
The interaction is implemented by having the parser call the lexical analyzer. The call, suggested
by the getNextToken command, causes the lexical analyzer to read characters from its input until
it can identify the next lexeme and produce for it the next token, which it returns to the parser.
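The getNextToken handshake described above can be sketched in Python. This is a minimal illustration, not a real compiler front end: the token classes, patterns, and the input string are illustrative assumptions.

```python
# Minimal sketch of the parser/lexer interaction: the parser repeatedly
# calls get_next_token, which reads characters until it can identify the
# next lexeme and return its token. Token names are illustrative.
import re
from collections import namedtuple

Token = namedtuple("Token", ["name", "lexeme"])

TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("op", r"[+\-*/=]"),
    ("ws", r"\s+"),   # recognized but never returned to the parser
]

def get_next_token(source, pos):
    """Return (token, new_pos); skips whitespace; (None, pos) at end."""
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name == "ws":          # strip whitespace, keep scanning
                    pos += m.end()
                    break
                return Token(name, m.group()), pos + m.end()
        else:
            raise ValueError(f"illegal character at position {pos}")
    return None, pos

tokens, pos = [], 0
while True:
    tok, pos = get_next_token("position = initial + rate * 60", pos)
    if tok is None:
        break
    tokens.append(tok)
```

A parser would call get_next_token on demand instead of collecting all tokens up front; the loop here just shows the full token stream for one statement.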
1.Stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that
are used to separate tokens in the input).
2. Another task is correlating error messages generated by the compiler with the source program.
The lexical analyzer keeps track of the number of newline characters seen, so it can associate a
line number with each error message. In some compilers, the lexical analyzer makes a copy of the
source program with the error messages inserted at the appropriate positions.
3.The expansion of macros may also be performed by the lexical analyzer.
a) Scanning consists of the simple processes that do not require tokenization of the input, such as
deletion of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence
of tokens as output.
Lexical Analysis Versus Parsing
Examples of tokens
Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must
provide the subsequent compiler phases additional information about the particular
lexeme that matched.
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-
code error.
The simplest recovery strategy is "panic mode" recovery. We delete successive characters from
the remaining input, until the lexical analyzer can find a well-formed token at the beginning of
what input is left.
We often have to look one or more characters beyond the next lexeme before we can be sure we
have the right lexeme.
We need to look at least one additional character ahead. For instance, we cannot be sure we've
seen the end of an identifier until we see a character that is not a letter or digit, and therefore is
not part of the lexeme for id.
In C, single-character operators like -, =, or < could also be the beginning of a two-character
operator like ->, ==, or <=.
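The one-character lookahead for such operators can be sketched as follows; the operator table is illustrative, not a complete C operator set.

```python
# Sketch of one-character lookahead: before committing to a single-character
# operator like < or -, peek one character ahead to see whether it begins a
# two-character operator like <= or ->.
def scan_operator(src, i):
    """Return (lexeme, next_index) for an operator starting at src[i]."""
    two_char = {"->", "==", "<=", ">=", "!="}
    if src[i:i+2] in two_char:      # lookahead: check two characters first
        return src[i:i+2], i + 2
    return src[i], i + 1

assert scan_operator("a<=b", 1) == ("<=", 3)
assert scan_operator("a<b", 1) == ("<", 2)
```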
Thus, we shall introduce a two-buffer scheme that handles large lookaheads safely.
Example input: position = initial + rate * 60
Buffer Pairs
Specialized buffering techniques have been developed to reduce the amount of overhead
required to process a single input character.
An important scheme involves two buffers that are alternately reloaded
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
Using one system read command we can read N characters into a buffer, rather than using one
system call per character.
If fewer than N characters remain in the input file, then a special character, represented by eof
marks the end of the source file and is different from any possible character of the source
program.
Sentinels
We must check, each time we advance forward, that we have not moved off one of the buffers; if
we do, then we must also reload the other buffer.
Thus, for each character read, we make two tests: one for the end of the buffer, and one to
determine what character is read (the latter may be a multiway branch).
We can combine the buffer-end test with the test for the current character if we extend each buffer
to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural choice
is the character eof.
Any eof that appears other than at the end of a buffer means that the input is at an end.
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to
determine.
2. Pointer forward scans ahead until a pattern match is found; the exact strategy whereby this
determination is made is discussed later.
Once the next lexeme is determined, forward is set to the character at its right end.
Then, after the lexeme is recorded as an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the lexeme just found.
In Fig, we see forward has passed the end of the next lexeme, ** (the Fortran exponentiation
operator), and must be retracted one position to its left.
Advancing forward requires that we first test whether we have reached the end of one of the
buffers, and if so, we must reload the other buffer from the input, and move forward to the
beginning of the newly loaded buffer.
As long as we never need to look so far ahead of the actual lexeme that the sum of the lexeme's
length plus the distance we look ahead is greater than N, we shall never overwrite the lexeme in its
buffer before determining it.
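The sentinel scheme above can be sketched in Python. The buffer size and reload logic are simplified for illustration: a real lexer reloads buffers from the input file with system reads, whereas here the input is split up front.

```python
# Sketch of the two-buffer scheme with sentinels: each buffer is terminated
# by an eof sentinel, so the common case of advancing 'forward' needs only
# one test per character. N and the sentinel character are illustrative.
EOF = "\0"          # stand-in for the eof sentinel
N = 8               # toy buffer size (a real lexer uses e.g. 4096)

def make_buffers(text):
    """Split text into N-sized buffers, each terminated by the sentinel."""
    chunks = [text[i:i+N] for i in range(0, len(text), N)] or [""]
    return [c + EOF for c in chunks]

def read_all(buffers):
    """Advance through the buffers, using the sentinel to detect boundaries."""
    out, b, i = [], 0, 0
    while True:
        ch = buffers[b][i]
        i += 1
        if ch == EOF:
            if i == len(buffers[b]) and b + 1 < len(buffers):
                b, i = b + 1, 0          # end of this buffer: switch buffers
            else:
                break                    # eof not at a buffer end: real end
        else:
            out.append(ch)
    return "".join(out)
```

The key point the sketch mirrors: the frequent case tests only for the sentinel, and only on hitting it do we distinguish "end of buffer" (reload/switch) from "end of input" (stop).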
Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns.
While they cannot express all possible patterns, they are very effective in specifying those types
of patterns that we actually need for tokens.
Ex: (a|b)*bb matches strings such as abb, bbb, aabb, abbb.
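The example pattern can be checked directly with Python's re module; fullmatch tests whether the whole string matches the pattern.

```python
# Checking the example pattern (a|b)*bb: strings of a's and b's ending in bb.
import re

pattern = re.compile(r"(a|b)*bb")

assert all(pattern.fullmatch(s) for s in ["abb", "bbb", "aabb", "abbb", "bb"])
assert not pattern.fullmatch("ab")   # does not end in bb
assert not pattern.fullmatch("ba")
```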
Automata are used to recognize the tokens.
They are abstract mathematical models of machines that perform computations on an input by
moving through a series of states or configurations.
If the computation of an automaton reaches an accepting configuration, it accepts that input.
At each stage of computation a transition function determines the next configuration on the basis
of a finite portion of the present configuration.
It is a self-operating machine.
Language
A Language is a system of signs used to communicate information to others.
The language of computation is a combination of both English and mathematics.
The languages we consider for our discussion are abstractions of natural languages.
That is, our focus here is on formal languages that need precise and formal definitions.
Programming languages belong to this category.
Symbols
A symbol is an object, an abstract entity that has no meaning by itself.
It can be a character, letter, digit, or any special character.
Example: a, =, 1, *, #, $
Alphabet
An alphabet is a finite, nonempty set of symbols.
The alphabet of a language is normally denoted by ∑ .
When more than one alphabet is considered for discussion, subscripts may be
used (e.g. ∑1, ∑2) or sometimes another symbol like G may also be introduced.
Example :
0110, 11, 001, 10 are four strings over the binary alphabet { 0, 1 }.
aab, abcb, b, cc, abbc are five strings over the alphabet { a, b, c }.
x = 1001 is a string over the alphabet { 0, 1 }.
A string over some alphabet need not contain all the symbols from the alphabet.
For example, the string cc over the alphabet { a, b, c } does not contain the symbols a
and b. Hence a string over an alphabet is also a string over any
superset of that alphabet.
Length of a string:
The length of a string x, denoted |x|, is the number of symbols in x.
Example:
| 0101 | = 4
|11| = 2
|b|=1
x=aabbcc
|x|=6
A null string is a string with no symbols; its length is zero, and it is denoted by λ
(lambda) or ε (epsilon).
String concatenation:
For any string x and integer n ≥ 0, we use x^n to denote the string formed by sequentially
concatenating n copies of x. We can also give an inductive definition of x^n as follows:
x^n = ε if n = 0;
x^n = x x^(n-1) otherwise.
Example:
If x = 011, then x^3 = 011011011,
x^1 = 011 and
x^0 = ε
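The inductive definition above corresponds directly to repeated string concatenation; a small recursive Python version for illustration:

```python
# x^n by the inductive definition: x^0 = empty string, x^n = x x^(n-1).
def power(x, n):
    """Return x concatenated with itself n times."""
    return "" if n == 0 else x + power(x, n - 1)

assert power("011", 3) == "011011011"
assert power("011", 1) == "011"
assert power("011", 0) == ""
```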
Powers of Alphabets :
We write ∑k(for some integer k) to denote the set of strings of length k with symbols from
∑.
In other words, ∑k= { w | w is a string over ∑ and | w | = k}.
Hence, for any alphabet, ∑0denotes the set of all strings of length zero. That is, ∑ 0 = {ε }.
For the binary alphabet { 0, 1 } we have the following
∑0 = {ε }
∑1={0,1}
∑2={00,01,10,11}
∑3={000,001,010,011,100,101,110,111}
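The sets ∑^k can be generated mechanically; a short Python sketch using itertools.product (the function name sigma_k is an illustrative choice):

```python
# Generate ∑^k, the set of all strings of length k over a given alphabet.
from itertools import product

def sigma_k(alphabet, k):
    """Return the set of all length-k strings over the alphabet."""
    return {"".join(t) for t in product(alphabet, repeat=k)}

assert sigma_k("01", 0) == {""}                      # ∑^0 = {ε}
assert sigma_k("01", 2) == {"00", "01", "10", "11"}  # ∑^2
assert len(sigma_k("01", 3)) == 8                    # |∑^3| = 2^3
```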
The set of all strings over an alphabet ∑ is denoted by ∑*. That is,
∑* = ∑^0 ∪ ∑^1 ∪ ∑^2 ∪ ∑^3 ∪ … = ∪k ∑^k
The set ∑* contains all the strings that can be generated by iteratively concatenating symbols
from ∑ any number of times.
Example : If ∑ = { a, b }, then
∑* = { ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, …}
∑+ = { a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, …}
Language
A language is a subset of ∑* for some alphabet ∑. It can be finite or infinite.
A language is set of strings over some alphabet.
Example
If the language L consists of all possible strings of length 2 over ∑ = {a, b}, then
L = { aa, ab, ba, bb }
If the language L consists of all possible strings of length 3 that end with 0 over ∑ = {0, 1},
then L = { 000, 010, 100, 110 }
Set operations on Languages
Union
If L and M are two languages, then L ∪ M = { s | s is in L or s is in M }
{ 0, 11, 01, 011 } ∪ { 11, 01, 110 } = { 0, 11, 01, 011, 110 }
Concatenation
If L and M are two languages, then LM = { st | s is in L and t is in M }
{ 0, 11, 01, 011 } . { 1, 01 } = { 01, 001, 111, 1101, 011, 0101, 0111, 01101 }
Reversal
The reversal of a language L, denoted L^R, is defined as:
L^R = { w^R | w belongs to L }
Let L = { 0, 11, 01, 011 }. Then L^R = { 0, 11, 10, 110 }
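Modeling finite languages as Python sets of strings, the three operations above reduce to set union, pairwise concatenation, and string reversal; the sets mirror the worked examples:

```python
# Union, concatenation, and reversal on finite languages as sets of strings.
L = {"0", "11", "01", "011"}
M = {"11", "01", "110"}

union = L | M                                   # L ∪ M
concat = {s + t for s in L for t in {"1", "01"}}  # LM with M = {1, 01}
reversal = {w[::-1] for w in L}                 # L^R

assert union == {"0", "11", "01", "011", "110"}
assert concat == {"01", "001", "111", "1101", "011", "0101", "0111", "01101"}
assert reversal == {"0", "11", "10", "110"}
```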
The Kleene closure (star*)
The Kleene closure is a unary operator that gives the infinite set of all possible strings of all
possible lengths over ∑, including λ or ε (the empty string).
It is denoted by ∑* or L*.
• Representation: ∑* = ∑^0 ∪ ∑^1 ∪ ∑^2 ∪ …, where ∑^p is the set of all possible strings of
length p.
• Example: If ∑ = {a, b}, then ∑* or L* = {ε, a, b, aa, ab, ba, bb, aaa, aab, …}
Let L be the set of letters and D the set of digits.
1. L ∪ D is the set of letters and digits - strictly speaking, the language with 62 strings of length one,
each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit:
{A0, A1, A2, …, A9, B1, B2, …}
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter:
{a0, abc1, a12, df, k7ph, …}
The notation of regular expressions has come into common use for describing all the languages that can be built
from these operators applied to the symbols of some alphabet.
If
letter_ is to stand for any letter or the underscore, and
digit is to stand for any digit,
then we could describe the language of C identifiers by:
letter_ ( letter_ | digit )*
The vertical bar above means union, the parentheses are used to group subexpressions, the star
means "zero or more occurrences of," and the juxtaposition of letter_ with the remainder of the
expression signifies concatenation.
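The identifier pattern letter_ ( letter_ | digit )* translates directly into Python's re syntax, with the character class [A-Za-z_] playing the role of letter_ and [0-9] the role of digit:

```python
# C-style identifiers: a letter or underscore, followed by zero or more
# letters, underscores, or digits.
import re

ident = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

assert ident.fullmatch("Ab1")
assert ident.fullmatch("_one")
assert not ident.fullmatch("1abc")   # may not begin with a digit
```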
The regular expressions are built recursively out of smaller regular expressions, using the rules
described below.
Each regular expression r denotes a language L(r), which is also defined recursively from the
languages denoted by r's subexpressions.
BASIS: There are two rules that form the basis:
1. ε is a regular expression, and L (ε) is {ε} , that is, the language whose sole member is the empty
string.
2. If a is a symbol in ∑, then a is a regular expression, and L(a) = {a},
i.e, the language with one string, of length one, with a in its one position.
INDUCTION: There are four parts to the induction whereby larger regular expressions are built from
smaller ones.
Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r)| (s) is a regular expression denoting the language L(r) U L(s).
2. (r) (s) is a regular expression denoting the language L(r) L(s) .
3. (r) * is a regular expression denoting (L (r)) * .
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of parentheses around expressions without
changing the language they denote.
a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.
Example: (a)|((b)*(c)) or a|b*c. Both expressions denote the set of strings that are either
a single a or are zero or more b's followed by one c.
{a, c, bc, bbc, …}
A language that can be defined by a regular expression is called a regular set or regular language.
If two regular expressions r and s denote the same regular set, we say they are equivalent and
write r = s. For instance, (a|b) = (b|a).
Algebraic laws for regular expressions
Example: Let ∑ = {a, b}.
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet ∑.
Another regular expression for the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, . . . }.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all strings
of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, …}. Another regular expression for the same language is
(a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, …}, that is, the string a and all strings consisting
of zero or more a's and ending in b.
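The claimed equivalence of (a|b)* and (a*b*)* can be spot-checked by testing every string over {a, b} up to a small length with Python's re module:

```python
# Spot-check that (a|b)* and (a*b*)* accept the same strings, by exhaustive
# enumeration of all strings over {a, b} up to length 4.
import re
from itertools import product

r1 = re.compile(r"(a|b)*")
r2 = re.compile(r"(a*b*)*")

for n in range(5):
    for tup in product("ab", repeat=n):
        s = "".join(tup)
        assert r1.fullmatch(s) and r2.fullmatch(s)
```

Since both expressions denote all strings of a's and b's, every enumerated string matches both; a genuine inequivalence would show up as one pattern rejecting a string the other accepts.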
Regular Definitions
we may wish to give names to certain regular expressions and use those names in subsequent
expressions,
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the
form:
d1 → r1
d2 → r2
…
dn → rn
where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ ∪ {d1, d2, …, di-1}.
We can construct a regular expression over ∑ alone for each ri.
We do so by first replacing uses of d1 in r2 by r1, then replacing uses of d1 and d2 in r3 by r1 and r2, and so
on.
Finally, in rn we replace each di, for i = 1, 2, …, n-1, by the substituted version of ri, each of
which has only symbols of ∑.
Example: C identifiers are strings of letters, digits, and underscores.
Ab1, _one, t3es, _01a
(A|B|C|…|a|b|…|_)(A|B|C|…|a|b|…|_|0|1|…|9)*
letter → a|b|…|z
digit → 0|1|…|9
gmail → letter (letter|digit|_|.)* @gmail.com
Email → letter (letter|digit|_|.)* @ (letter)+ . ((letter)3 | (letter)2) (ε | . (letter)2)
(a|b)*b
a(a|b)*a
aa, aaa, aaaba
RE for an even number of a's:
b*(ab*ab*)*
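The even-number-of-a's pattern can be checked with the re module; strings with an odd count of a's must be rejected:

```python
# b*(ab*ab*)* accepts exactly the strings over {a, b} with an even number
# of a's: the starred group consumes a's two at a time.
import re

even_a = re.compile(r"b*(ab*ab*)*")

assert even_a.fullmatch("")       # zero a's is even
assert even_a.fullmatch("baab")
assert even_a.fullmatch("aba")    # two a's
assert not even_a.fullmatch("a")
assert not even_a.fullmatch("bab")
```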
All strings of a's and b's that do not contain the subsequence abb.
All strings of a's and b's with an odd number of b's.
All strings of a's and b's that contain at least two b's.
Extensions of Regular Expressions
1. One or more instances. The unary, postfix operator + represents the positive closure of a
regular expression and its language.
That is, if r is a regular expression, then (r)+ denotes the language (L(r))+.
The operator has the same precedence and associativity as the operator *.
Algebraic laws:
r* = r+| ε and r+ = r r* = r* r
2. Zero or one instance. The unary postfix operator ? means "zero or one occurrence."
That is, r? is equivalent to r | ε,
or
L(r?) = L(r) ∪ {ε}.
The ? operator has the same precedence and associativity as * and +.
3. Character classes. A regular expression a1|a2|…|an, where the ai's are each symbols of the
alphabet, can be replaced by the shorthand [a1a2…an]. More importantly, when a1, a2, …, an
form a logical sequence,
e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by a
hyphenated range.
Thus, [abc] is shorthand for a|b|c, and
[a-z] is shorthand for a|b|…|z.
Example : Using these shorthands, we can rewrite the regular definition of identifier as:
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ ( letter_ | digit )*
and the regular definition of number as:
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
Recognition of Tokens
Using the patterns that recognize all the needed tokens we build a piece of code that examines the
input string and finds a prefix that is a lexeme matching one of the patterns.
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of
tokens as far as the lexical analyzer is concerned.
The patterns for these tokens are described using regular definitions.
The patterns for id and number are as given in the regular definitions above.
For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as
lexemes that match the patterns for relop, id, and number. In addition, we assign the lexical
analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:
ws → ( blank | tab | newline )+
Token ws is different from the other tokens in that, when we recognize it, we do not return it to the
parser, but rather restart the lexical analysis from the character that follows the whitespace.
It is the following token that gets returned to the parser.
The goal for the lexical analyzer is summarized in Fig.
Example input: if(a<=b)
Recognition of Reserved Words and Identifiers
Keywords like if or then are reserved, so they are not identifiers even though they look like
identifiers.
Example: empid = 1;
There are two ways that we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially.
• When we find an identifier, a call to installID places it in the symbol table if it is not already there
and returns a pointer to the symbol-table entry for the lexeme found.
• The function getToken examines the symbol table entry for the lexeme found, and returns
whatever token name the symbol table says this lexeme represents - either id or one of the
keyword tokens that was initially installed in the table.
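Approach 1 can be sketched with a dictionary standing in for the symbol table; the function names follow the text, but the implementation details are illustrative, not a real symbol-table design.

```python
# Sketch of approach 1: reserved words are installed in the symbol table
# up front, so getToken can tell a keyword from an ordinary identifier.
symbol_table = {"if": "if", "then": "then", "else": "else"}

def installID(lexeme):
    """Enter lexeme into the symbol table (as token 'id') if absent."""
    if lexeme not in symbol_table:
        symbol_table[lexeme] = "id"
    return lexeme                       # stands in for a table pointer

def getToken(entry):
    """Return the token name the symbol table records for this entry."""
    return symbol_table[entry]

assert getToken(installID("if")) == "if"       # reserved word wins
assert getToken(installID("empid")) == "id"    # ordinary identifier
```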
2. Create separate transition diagrams for each keyword
• If we adopt this approach, then we must prioritize the tokens so that the reserved-word tokens
are recognized in preference to id, when the lexeme matches both patterns.
The transition diagram for token number is shown in Fig.