
Phases of a compiler

Lexical Analysis
Role of Lexical Analyzer:
 It is the first phase of a compiler.
 The main task of the lexical analyzer is to read the input characters of the source program, group
them into lexemes, and produce as output a sequence of tokens for each lexeme in the source
program and send them to the parser for syntax analysis.
 When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table. This information stored in the symbol table regarding the kind of
identifier may be used by the lexical analyzer in determining the proper token it must pass to the
parser.
 The interaction is implemented by having the parser call the lexical analyzer. The call, suggested
by the getNextToken command, causes the lexical analyzer to read characters from its input until
it can identify the next lexeme and produce for it the next token, which it returns to the parser.

Interactions between the lexical analyzer and the parser
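The interaction above can be sketched in Python. This is a toy illustration, not the book's implementation; the token patterns, the Lexer class, and get_next_token are names invented for this sketch.

```python
import re

# Invented token patterns for the sketch; WS is stripped, never returned.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+*/=-]"),
    ("WS",     r"\s+"),
]

class Lexer:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def get_next_token(self):
        """Read characters until the next lexeme is identified and return
        its token; the parser calls this repeatedly (getNextToken)."""
        while self.pos < len(self.text):
            for name, pattern in TOKEN_SPEC:
                m = re.match(pattern, self.text[self.pos:])
                if m:
                    self.pos += m.end()
                    if name == "WS":
                        break            # skip whitespace, keep scanning
                    return (name, m.group())
            else:
                raise SyntaxError("illegal character")
        return ("EOF", "")

# The parser would drive this loop, pulling one token at a time.
lexer = Lexer("position = initial + rate * 60")
tokens = []
while True:
    tok = lexer.get_next_token()
    tokens.append(tok)
    if tok[0] == "EOF":
        break
```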


 The lexical analyzer may perform certain other tasks besides identification of lexemes.

1. Stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
2. Correlating error messages generated by the compiler with the source program. The lexical analyzer may keep track of the number of newline characters seen, so it can associate a line number with each error message. In some compilers, the lexical analyzer makes a copy of the source program with the error messages inserted at the appropriate positions.
3. The expansion of macros may also be performed by the lexical analyzer.

 Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as
deletion of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence
of tokens as output
Lexical Analysis Versus Parsing

There are several reasons why the analysis portion of a compiler is separated into lexical analysis and parsing:

1. Simplicity of design is the most important consideration. Separating lexical analysis from parsing allows us to simplify each of the tasks.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques, such as input buffering, that serve only the lexical task.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes

 A token is a pair consisting of a token name and an optional attribute value.


 A pattern is a description of the form that the lexemes of a token may take.
 A lexeme is a sequence of characters in the source program that matches the pattern
for a token and is identified by the lexical analyzer as an instance of that token.
printf("Total = %d\n", score);
 printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token literal.
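A rough way to see the lexemes of this statement mechanically. The three patterns below (identifiers, string literals, punctuation) are simplifications invented for this sketch, not a full C lexer.

```python
import re

# The statement from the text, as a raw string.
stmt = r'printf("Total = %d\n", score);'

# Hypothetical patterns: identifier | string literal | punctuation.
lexeme_pattern = r'[A-Za-z_]\w*|"[^"]*"|[(),;]'
lexemes = re.findall(lexeme_pattern, stmt)
# printf and score match the pattern for token id;
# "Total = %d\n" matches the pattern for a string literal.
```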

Examples of tokens
Attributes for Tokens

When more than one lexeme can match a pattern, the lexical analyzer must
provide the subsequent compiler phases additional information about the particular
lexeme that matched.
Lexical Errors

 It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-
code error.

 The simplest recovery strategy is "panic mode" recovery. We delete successive characters from
the remaining input, until the lexical analyzer can find a well-formed token at the beginning of
what input is left.

Other possible error-recovery actions are:

1. Delete one character from the remaining input.


2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Input Buffering

 We often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
 We need to look at least one additional character ahead. For instance, we cannot be sure we've seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id.
 In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=.
 Thus, we shall introduce a two-buffer scheme that handles large lookaheads safely.

 position = initial + rate * 60
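The identifier case can be sketched as follows. This toy scan_id helper is invented for the illustration and omits the check that an identifier must not start with a digit; the point is that the end of the lexeme is known only when a character that is NOT part of it is seen, and that character is not consumed.

```python
def scan_id(text, pos):
    """One-character lookahead: scan letters/digits/underscores until a
    character outside that set is seen; return (lexeme, stop position)."""
    start = pos
    while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
        pos += 1
    # text[pos], if any, belongs to the NEXT lexeme and stays unread.
    return text[start:pos], pos

lexeme, forward = scan_id("rate*60", 0)
```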
Buffer Pairs
 Specialized buffering techniques have been developed to reduce the amount of overhead
required to process a single input character.
 An important scheme involves two buffers that are alternately reloaded

 Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
 Using one system read command we can read N characters into a buffer, rather than using one
system call per character.
 If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.
Sentinels
 We must check, each time we advance forward, that we have not moved off one of the buffers; if
we do, then we must also reload the other buffer.
 Thus, for each character read, we make two tests: one for the end of the buffer, and one to
determine what character is read (the latter may be a multiway branch).
 We can combine the buffer-end test with the test for the current character if we extend each buffer
to hold a sentinel character at the end.
 The sentinel is a special character that cannot be part of the source program, and a natural choice
is the character eof.
 Any eof that appears other than at the end of a buffer means that the input is at an end.
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is made is covered later.

 Once the next lexeme is determined, forward is set to the character at its right end.
 Then, after the lexeme is recorded as an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the lexeme just found.
 In the figure, we see forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left.
 Advancing forward requires that we first test whether we have reached the end of one of the
buffers, and if so, we must reload the other buffer from the input, and move forward to the
beginning of the newly loaded buffer.
 As long as we never need to look so far ahead of the actual lexeme that the sum of the lexeme's
length plus the distance we look ahead is greater than N, we shall never overwrite the lexeme in its
buffer before determining it.
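A small Python simulation of the two-buffer scheme with sentinels. The buffer size and sentinel character are toy choices for the sketch (a real lexer works on raw bytes with disk-block-sized buffers), and here end-of-input is detected by running out of blocks rather than by an eof in mid-buffer.

```python
SENTINEL = "\0"   # stands in for eof; assumed absent from the source text
N = 4             # toy buffer size; real buffers are disk-block sized

def read_all(source):
    """Scan every character using sentinel-terminated buffers, so the
    per-character loop makes only ONE test (against the sentinel)."""
    # Split the input into blocks of N and append the sentinel to each.
    blocks = [source[i:i + N] + SENTINEL for i in range(0, len(source), N)]
    blocks.append(SENTINEL)          # empty final block: true end of input
    out, b, i = [], 0, 0
    while True:
        ch = blocks[b][i]
        i += 1
        if ch == SENTINEL:
            if i == len(blocks[b]) and b + 1 < len(blocks):
                b, i = b + 1, 0      # sentinel at buffer end: reload buffer
                continue
            break                    # no more blocks: input is at an end
        out.append(ch)
    return "".join(out)
```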
Specification of Tokens
 Regular expressions are an important notation for specifying lexeme patterns.
 While they cannot express all possible patterns, they are very effective in specifying those types
of patterns that we actually need for tokens.
 Ex: (a|b)*bb denotes strings ending in bb, such as abb, bbb, aabb, abbb.
 Automata are used to recognize the tokens.
 They are abstract mathematical models of machines that perform computations on an input by
moving through a series of states or configurations.
 If the computation of an automaton reaches an accepting configuration, it accepts that input.
 At each stage of computation a transition function determines the next configuration on the basis
of a finite portion of the present configuration.
 It is a self-operating machine.
Language
 A language is a system of signs used to communicate information to others.
 The language of computation is a combination of both English and mathematics.
 The languages we consider in our discussion are abstractions of natural languages.
 That is, our focus here is on formal languages that need precise and formal definitions.
 Programming languages belong to this category.
Symbols
 A symbol is an object, an abstract entity that has no meaning by itself.
 It can be a character, letter, digit, or any special character.
Example: a, =, 1, *, #, $

Alphabet
 An alphabet is a finite, nonempty set of symbols.
 The alphabet of a language is normally denoted by ∑.
 When more than one alphabet is considered in a discussion, subscripts may be used (e.g. ∑1, ∑2), or sometimes another symbol such as G may be introduced.

• ∑ = {a, b, c} is an alphabet of 3 symbols.

• ∑ = {0, 1} is the binary alphabet, consisting of symbols 0 and 1.

• ∑ = {a, b, c, …, z} is the lowercase English alphabet.


Strings or Words over Alphabet
 A string is a finite sequence of symbols chosen from some alphabet.
 A string or word over an alphabet ∑ is a finite sequence of concatenated symbols of ∑.
 It is usually denoted by u, v, w, x, y, z.

Example:
 0110, 11, 001, and 10 are four strings over the binary alphabet {0, 1}.
 aab, abcb, b, cc, and abbc are five strings over the alphabet {a, b, c}.
 x = 1001 is a string over the alphabet {0, 1}.

 A string over some alphabet need not contain all the symbols of that alphabet.
 For example, the string cc over the alphabet {a, b, c} does not contain the symbols a and b. Hence, a string over an alphabet is also a string over any superset of that alphabet.
Length of a string :

 The number of symbols in a string w is called its length, denoted by |w|.

Example :

| 0101 | = 4

|11| = 2

|b|=1
x=aabbcc
|x|=6

 A null string is a string with no symbols; its length is zero, and it is denoted by λ (lambda) or ε (epsilon).
String concatenation:

 If x and y are strings, then the (.) concatenation of x and y is denoted by xy


Example: x = dog, y = house, xy = x.y = doghouse
 Concatenation is associative: (xy)z = x(yz), and |xy| = |x| + |y|.
 The empty string is the identity under concatenation, i.e., εs = sε = s.
Powers of Strings :

 For any string x and integer n ≥ 0, we use x^n to denote the string formed by sequentially concatenating n copies of x. We can also give an inductive definition of x^n as follows:
x^n = ε if n = 0;
x^n = x x^(n-1) otherwise
Example:
If x = 011, then x^3 = 011011011,
x^1 = 011, and
x^0 = ε
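The inductive definition translates directly into code; power here is a hypothetical helper name for the sketch.

```python
def power(x, n):
    """x^n: n copies of x concatenated; x^0 is the empty string."""
    return "" if n == 0 else x + power(x, n - 1)
```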
Powers of Alphabets :
 We write ∑^k (for some integer k) to denote the set of strings of length k with symbols from ∑.
 In other words, ∑^k = { w | w is a string over ∑ and |w| = k }.
 Hence, for any alphabet, ∑^0 denotes the set of all strings of length zero. That is, ∑^0 = {ε}.
For the binary alphabet {0, 1} we have the following:
∑^0 = {ε}
∑^1 = {0, 1}
∑^2 = {00, 01, 10, 11}
∑^3 = {000, 001, 010, 011, 100, 101, 110, 111}
 The set of all strings over an alphabet ∑ is denoted by ∑*. That is,
∑* = ∑^0 U ∑^1 U ∑^2 U ∑^3 U …
 The set ∑* contains all the strings that can be generated by iteratively concatenating symbols from ∑ any number of times.
 Example : If ∑ = { a, b }, then
∑* = { ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, …}
∑+ = { a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, …}
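These sets can be enumerated for small lengths; the function names below are invented for the sketch, and since ∑* itself is infinite we only build a finite prefix of it.

```python
from itertools import product

def sigma_k(alphabet, k):
    """∑^k: every string of length k over the alphabet."""
    return {"".join(p) for p in product(sorted(alphabet), repeat=k)}

def sigma_star_upto(alphabet, max_len):
    """Finite prefix of ∑*: the union ∑^0 U ∑^1 U ... U ∑^max_len."""
    out = set()
    for k in range(max_len + 1):
        out |= sigma_k(alphabet, k)
    return out
```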
Language
 A language is a subset of ∑* for some alphabet ∑. It can be finite or infinite.
 A language is set of strings over some alphabet.
Example
If the language L consists of all strings of length 2 over ∑ = {a, b}, then
L = { aa, ab, ba, bb }
If the language L consists of all strings of length 3 that end with 0 over ∑ = {0, 1},
then L = { 000, 010, 100, 110 }
Set operations on Languages
 Union
If L and M are two languages then L U M = { s | s is in L or s is in M }
{ 0, 11, 01, 011 } U { 11, 01, 110 } = { 0, 11, 01, 011, 110 }
 Concatenation
If L and M are two languages then LM = { st | s is in L and t is in M }
{ 0, 11, 01, 011 } . { 1, 01 } = { 01, 001, 111, 1101, 011, 0101, 0111, 01101 }
 Reversal
The reversal of a language L, denoted L^R, is defined as:
L^R = { w^R | w belongs to L }
Let L = { 0, 11, 01, 011 }. Then L^R = { 0, 11, 10, 110 }
 The Kleene closure (star*)
The Kleene closure is a unary operator that gives the infinite set of all possible strings of all possible lengths over ∑, including the empty string ε.
It is denoted by ∑* or L*.
• Representation: ∑* = ∑^0 U ∑^1 U ∑^2 U … where ∑^p is the set of all possible strings of length p.
• Example: If ∑ = {a, b}, then ∑* or L* = { ε, a, b, aa, ab, ba, bb, aaa, aab, … }

 Positive Closure( Plus+)


The set ∑+ or L+ is the infinite set of all possible strings of all possible lengths over ∑, excluding ε.
• Representation: ∑+ = ∑^1 U ∑^2 U ∑^3 U …
• ∑+ = ∑* − {ε}
• Example: If ∑ = {a, b}, then ∑+ = { a, b, aa, ab, ba, bb, aaa, aab, … }
Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D be the set of digits {0,1,.. .9}.

1. L U D is the set of letters and digits: strictly speaking, the language with 62 strings of length one, each of which is either one letter or one digit.

2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit: {A0, A1, …, A9, B0, B1, …}

3. L^4 is the set of all 4-letter strings.

4. L* is the set of all strings of letters, including ε, the empty string.

5. L(L U D)* is the set of all strings of letters and digits beginning with a letter: {a0, abc1, a12, df, k7ph, …}

6. D+ is the set of all strings of one or more digits.


A grammar generates a language; an automaton recognizes it.
Regular Expressions

 Regular expressions have come into common use for describing all the languages that can be built from these operators applied to the symbols of some alphabet.
 If
letter_ is to stand for any letter or the underscore, and
digit is to stand for any digit,
then we could describe the language of C identifiers by:

letter_ ( letter_ | digit )*

 The vertical bar above means union, the parentheses are used to group subexpressions, the star
means "zero or more occurrences of," and the juxtaposition of letter_ with the remainder of the
expression signifies concatenation.
 The regular expressions are built recursively out of smaller regular expressions, using the rules
described below.
 Each regular expression r denotes a language L(r), which is also defined recursively from the
languages denoted by r's subexpressions.
BASIS: There are two rules that form the basis:
1. ε is a regular expression, and L (ε) is {ε} , that is, the language whose sole member is the empty
string.
2. If a is a symbol in ∑, then a is a regular expression, and L(a) = {a},
i.e, the language with one string, of length one, with a in its one position.

INDUCTION: There are four parts to the induction whereby larger regular expressions are built from
smaller ones.
Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r)| (s) is a regular expression denoting the language L(r) U L(s).
2. (r) (s) is a regular expression denoting the language L(r) L(s) .
3. (r) * is a regular expression denoting (L (r)) * .
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of parentheses around expressions without
changing the language they denote.
a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.

Example: (a) | ((b)*(c)) may therefore be written a|b*c. Both expressions denote the set of strings that are either a single a or zero or more b's followed by one c:
{ a, c, bc, bbc, … }
 A language that can be defined by a regular expression is called a regular set or regular language.
 If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s. For instance, (a|b) = (b|a).
Algebraic laws for regular expressions
Example: Let ∑ = {a, b}.
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet ∑. Another regular expression for the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, …}.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, …}. Another regular expression for the same language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, …}, that is, the string a and all strings consisting of zero or more a's and ending in b.
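Python's re module uses a compatible syntax for these operators, so the examples can be checked mechanically; fullmatch tests whole-string membership in the denoted language.

```python
import re

def denotes(pattern, s):
    """True iff the regular expression matches the WHOLE string s."""
    return re.fullmatch(pattern, s) is not None

# Example 5 above: a|a*b is the string a, plus strings of a's ending in b.
```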
Regular Definitions
 we may wish to give names to certain regular expressions and use those names in subsequent
expressions,
 If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the
form:

where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ U {d1, d2, …, di-1}.
 We can construct a regular expression over ∑ alone for each ri.
 We do so by first replacing uses of d1 in r2, then replacing uses of d1 and d2 in r3 by r1 and r2, and so on.
 Finally, in rn we replace each di, for i = 1, 2, …, n - 1, by the substituted version of ri, each of which has only symbols of ∑.
Example: C identifiers are strings of letters, digits, and underscores, such as Ab1, _one, t3es, _01a:

letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*

Example: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition:

digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
Example: Write a regular definition for your roll number and e-mail address.

Regular definition of an e-mail address (a simplified sketch; real address syntax is more complex than this):

letter → a | b | … | z
digit → 0 | 1 | … | 9
email → letter ( letter | digit | _ | . )* @ letter+ . letter+

RE for all strings over {a, b} that end with b:

(a|b)*b

RE for strings that start with a and end with a:

a(a|b)*a

e.g. aa, aaa, aaaba

RE for an even number of a's:

b*(ab*ab*)*
All strings of a's and b's that do not contain the subsequence abb.
All strings of a's and b's with an odd number of b's.
All strings of a's and b's that contain at least two b's.
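The worked answers above can be verified with Python's re module (a convenient stand-in for a regular-set membership test); the three exercises are left for the reader.

```python
import re

def in_lang(pattern, s):
    """Whole-string membership test for the language of the pattern."""
    return re.fullmatch(pattern, s) is not None

ends_with_b = r"(a|b)*b"        # all strings ending with b
starts_ends_a = r"a(a|b)*a"     # starts with a and ends with a
even_as = r"b*(ab*ab*)*"        # even number of a's (including zero)
```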
Extensions of Regular Expressions

1. One or more instances. The unary, postfix operator + represents the positive closure of a
regular expression and its language.
That is, if r is a regular expression, then (r)+ denotes the language (L(r))+.
The operator has the same precedence and associativity as the operator *.
Algebraic laws:
r* = r+| ε and r+ = r r* = r* r
2. Zero or one instance. The unary postfix operator ? means "zero or one occurrence."
That is, r? is equivalent to r | ε, or L(r?) = L(r) U {ε}.
The ? operator has the same precedence and associativity as * and +.
3. Character classes. A regular expression a1|a2|…|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2…an]. More importantly, when a1, a2, …, an form a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by a1-an, using a hyphen.
Thus, [abc] is shorthand for a|b|c, and
[a-z] is shorthand for a|b|…|z.
Example: Using these shorthands, we can rewrite the regular definition of identifier as:

letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ ( letter_ | digit )*

The regular definition of numbers can also be simplified:

digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
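The simplified definition of number maps directly onto one Python regular expression, since the shorthand syntax happens to coincide:

```python
import re

# digits ( . digits )? ( E [+-]? digits )?  written as one Python regex.
number = r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?"

def is_number(s):
    """Whole-string test against the number pattern."""
    return re.fullmatch(number, s) is not None
```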
Recognition of Tokens

 Using the patterns that recognize all the needed tokens we build a piece of code that examines the
input string and finds a prefix that is a lexeme matching one of the patterns.

Example. A grammar for branching statements

 The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of
tokens as far as the lexical analyzer is concerned.
 The patterns for these tokens are described using regular definitions
 The patterns for id and number are:

digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
letter → [A-Za-z]
id → letter ( letter | digit )*

 For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as lexemes that match the patterns for relop, id, and number. In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:

ws → ( blank | tab | newline )+

 Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace.
 It is the following token that gets returned to the parser.
The goal for the lexical analyzer is summarized as

Tokens, their patterns, and attribute values


Transition Diagrams

 Transition diagrams have a collection of nodes or circles, called states.


 Each state represents a condition that could occur during the process of scanning the input
looking for a lexeme that matches one of several patterns.
 We may think of a state as summarizing all we need to know about what characters we have
seen between the lexemeBegin pointer and the forward pointer
 Edges are directed from one state of the transition diagram to another.
 Each edge is labeled by a symbol or set of symbols
 If we are in some state s, and the next input symbol is ‘a’, we look for an edge out of state s
labeled by a (and perhaps by other symbols, as well).
 If we find such an edge, we advance the forward pointer and enter the state of the transition
diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found.
We always indicate an accepting state by a double circle, and if there is an action to be taken, typically returning a token and an attribute value to the parser, we shall attach that action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position then we shall additionally
place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere.
The transition diagram always begins in the start state before any input symbols have been read.
Figure below is a transition diagram that recognizes the lexemes matching the token relop

Example input: if (a <= b)
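A minimal sketch of the relop transition diagram in Python. The attribute names (LT, LE, and so on) are illustrative; returning a consumed-character count of 1 corresponds to the retraction of the forward pointer marked * in the diagram.

```python
def relop(text):
    """Simulate the relop diagram on a prefix of text.
    Returns (token name, attribute, characters consumed)."""
    c = text[0]
    nxt = text[1] if len(text) > 1 else ""
    if c == "<":
        if nxt == "=": return ("relop", "LE", 2)
        if nxt == ">": return ("relop", "NE", 2)
        return ("relop", "LT", 1)   # retract: nxt belongs to the next lexeme
    if c == "=":
        return ("relop", "EQ", 1)
    if c == ">":
        if nxt == "=": return ("relop", "GE", 2)
        return ("relop", "GT", 1)   # retract
    raise ValueError("no relop at this position")
```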
Recognition of Reserved Words and Identifiers

 keywords like if or then are reserved so they are not identifiers even though they look like
identifiers.

A transition diagram for id's and keywords

 empid=1;
There are two ways that we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially.
• When we find an identifier, a call to installID places it in the symbol table if it is not already there
and returns a pointer to the symbol-table entry for the lexeme found .
• The function getToken examines the symbol table entry for the lexeme found, and returns
whatever token name the symbol table says this lexeme represents - either id or one of the
keyword tokens that was initially installed in the table.
2. Create separate transition diagrams for each keyword
• If we adopt this approach, then we must prioritize the tokens so that the reserved-word tokens
are recognized in preference to id, when the lexeme matches both patterns.
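The first approach can be sketched as follows. install_id and get_token mirror the installID and getToken names used in the text, but the dictionary-based symbol table and the token-name strings are assumptions of the sketch.

```python
# Reserved words are installed in the symbol table before lexing begins.
symbol_table = {"if": "IF", "then": "THEN", "else": "ELSE"}

def install_id(lexeme):
    """installID: place the lexeme in the table if absent; return its key."""
    symbol_table.setdefault(lexeme, "ID")
    return lexeme

def get_token(lexeme):
    """getToken: a pre-installed reserved word yields its keyword token;
    any other identifier-shaped lexeme yields id."""
    return symbol_table[install_id(lexeme)]
```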
The transition diagram for token number is shown in Fig.

The transition diagram, shown in Fig. is for whitespace.


Architecture of a Transition-Diagram-Based Lexical Analyzer

 Each state is represented by a piece of code.


 A switch based on the value of state takes us to code for each of the possible states, where
we find the action of that state.
 Often, the code for a state is itself a switch statement or multiway branch that determines the
next state by reading and examining the next input character.
Example: a sketch of getRelop(), a C++ function
 The job of getRelop() is to simulate the transition diagram and return an object of type
TOKEN, that is, a pair consisting of the token name (which must be relop in this case) and an
attribute value (the code for one of the six comparison operators in this case).
 getRelop() first creates a new object retToken and initializes its first component to RELOP,
the symbolic code for token relop.
Sketch of implementation of relop transition diagram
