Chapter 3 Lexical Analysis
The Role of the Lexical Analyzer (Scanner): The lexical analyzer is the first phase of a compiler. Its main task is to read the
input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
There are several reasons for separating the analysis phase of a compiler into lexical analysis and parsing.
1. Separating lexical analysis from syntax analysis often allows us to simplify one or the other of these
phases.
2. Compiler efficiency is improved.
3. Compiler portability is enhanced.
Tokens, Patterns, Lexemes: A token is a syntactic category in a sentence of a language. For each token there is a set of
input strings for which that token is produced as output; this set is described by a rule called a pattern associated with
the token. A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Attributes for Tokens: The lexical analyzer collects information about tokens into their associated attributes. As a
practical matter, a token usually has only a single attribute: a pointer to the symbol-table entry in which the information
about the token is kept. For diagnostic purposes, we may be interested in both the lexeme for an identifier and the line
number on which it was first seen. Both these items of information can be stored in the symbol-table entry for the
identifier.
Lexical Errors: If the string fi is encountered in a C program for the first time in the context fi(a==f(x))…, a lexical
analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
In an interactive computing environment, possible error-recovery actions are:
Deleting an extraneous character
Inserting a missing character
Replacing an incorrect character with a correct one
Transposing two adjacent characters.
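These repairs amount to checking whether an erroneous lexeme is within one edit of a valid token (minimum-distance error correction). A minimal C sketch of that check; the function name one_edit_away is our own illustration, not part of any compiler API:

```c
#include <string.h>

/* Return 1 if cand can be obtained from word by exactly one of the four
   repairs listed above: replacing one character, transposing two adjacent
   characters, or deleting/inserting one character. Otherwise return 0. */
int one_edit_away(const char *word, const char *cand) {
    size_t n = strlen(word), m = strlen(cand);
    if (n == m) {
        size_t diff = 0, first = 0;
        for (size_t i = 0; i < n; i++)
            if (word[i] != cand[i]) { if (diff++ == 0) first = i; }
        if (diff == 1) return 1;                      /* one replacement */
        if (diff == 2 && first + 1 < n &&
            word[first] == cand[first + 1] &&
            word[first + 1] == cand[first])
            return 1;                                 /* one transposition */
        return 0;
    }
    if ((n > m ? n - m : m - n) != 1) return 0;       /* lengths differ by 1 */
    const char *longer  = n > m ? word : cand;
    const char *shorter = n > m ? cand : word;
    size_t i = 0, j = 0, skipped = 0;
    while (shorter[j]) {
        if (longer[i] != shorter[j]) {
            if (skipped++) return 0;                  /* more than one edit */
            i++;                                      /* skip the extra char */
        } else { i++; j++; }
    }
    return 1;                                         /* one insert/delete */
}
```

For example, one_edit_away("fi", "if") is 1 because fi is a transposition of if.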
Input buffering: A buffer contains data that is stored for a short period of time, typically in the computer’s memory. The
purpose of a buffer is to hold data right before it is used.
There are three general approaches to the implementation of a lexical analyzer.
1. Use a lexical analyzer generator, such as Flex (Fast Lexical Analyzer).
2. Write the lexical analyzer in a system programming language, using the I/O facilities of that language to read the
input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.
1. The lexical analyzer scans the input from left to right, one character at a time. It uses two pointers, begin
ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned.
2. Initially both pointers point to the first character of the input string.
The forward ptr moves ahead to search for the end of the lexeme. As soon as a blank space is encountered, it indicates the end
of the lexeme. For example, if the input begins with the word int, as soon as the forward ptr (fp) encounters the following
blank space the lexeme "int" is identified. Then both the begin ptr (bp) and forward ptr (fp) are set at the next token.
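The two-pointer scan above can be sketched in C as follows; the function name next_lexeme and the use of blanks as the only lexeme delimiter are simplifying assumptions for illustration (a real scanner recognizes lexeme boundaries from the token patterns, not blanks alone):

```c
#include <string.h>

/* Extract one lexeme: bp marks its beginning, fp moves forward until a
   blank (or end of input) marks its end. The lexeme is copied into out,
   and *pos is advanced past any trailing blanks, so repeated calls set
   both pointers at the next token and walk the whole input. */
int next_lexeme(const char *input, int *pos, char *out) {
    int bp = *pos;                      /* begin ptr: start of the lexeme  */
    int fp = bp;                        /* forward ptr: searches for end   */
    while (input[fp] != '\0' && input[fp] != ' ')
        fp++;
    int len = fp - bp;
    memcpy(out, input + bp, len);
    out[len] = '\0';
    while (input[fp] == ' ')            /* skip blanks to the next token   */
        fp++;
    *pos = fp;
    return len;
}
```

Calling next_lexeme repeatedly on the input "int a = 5" yields the lexemes int, a, =, and 5 in turn.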
Two-Buffer Input Scheme (Buffer Pairs): A buffer can be divided into two halves. When the lookahead pointer moves
past the end of the first half, the second half is filled with new characters to be read. When the lookahead pointer
reaches the right end of the second half, the first half is filled with new characters, and so on.
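A rough C sketch of this refill logic, with an artificially small half size so the crossings are easy to trace. The names init_input, fill_half, and next_char are illustrative; a real implementation would read from a file and typically place sentinel characters at the half boundaries to cheapen the end-of-half tests:

```c
#define HALF 8                     /* size of each buffer half (tiny, for illustration) */

static char buffer[2 * HALF];      /* the two halves live side by side   */
static const char *src;            /* stands in for the input stream     */
static int src_pos;                /* next unread character of src       */
static int fwd;                    /* the lookahead (forward) pointer    */

/* Copy the next HALF input characters into the given half (0 or 1),
   padding with '\0' once the input is exhausted (end-of-input marker). */
static void fill_half(int half) {
    for (int i = 0; i < HALF; i++)
        buffer[half * HALF + i] = src[src_pos] ? src[src_pos++] : '\0';
}

/* Point the scanner at a new input string and load the first half. */
void init_input(const char *s) {
    src = s;
    src_pos = 0;
    fwd = 0;
    fill_half(0);
}

/* Return the next character; refill the half the pointer just left. */
char next_char(void) {
    char c = buffer[fwd++];
    if (fwd == HALF) {
        fill_half(1);              /* crossed into the second half */
    } else if (fwd == 2 * HALF) {
        fill_half(0);              /* reached the right end: reload first half */
        fwd = 0;                   /* and wrap the pointer around */
    }
    return c;
}
```

Reading characters with next_char() until '\0' reproduces the input stream regardless of how many refills occur.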
Specification of Tokens: Regular languages are the most popular for specifying tokens because:
· they are based on a simple and useful theory,
· they are easy to understand, and
· efficient implementations exist for generating lexical analyzers based on such languages.
Languages: Let ∑ be a set of characters. ∑ is called the alphabet. A language over ∑ is a set of strings of characters drawn
from ∑. Here are some examples of languages:
Alphabet=English characters : Language=English sentences
Alphabet = ASCII : Language = C++, Java, C# program
Regular Expression: A regular expression (RE) is defined as:
a an ordinary character from ∑
ε the empty string
R|S either R or S
RS R followed by S (concatenation)
R* concatenation of R zero or more times (R* = ε |R|RR|RRR...)
Regular expression extensions are used as convenient notation for complex REs:
R? ε | R (zero or one R)
R+ RR* (one or more R)
(R) R (grouping)
[abc] a|b|c (any of listed)
[a-z] a|b|....|z (range)
[^ab] c|d|... (anything but ‘a’‘b’)
Regular Definitions: For notational convenience, we may wish to give names to regular expressions and to define
regular expressions using these names as if they were symbols.
d1 → r1, where d1 is a distinct name and r1 is a regular expression.
Example: letter → A|B|……|Z|a|…….|z
digit → 0|1|………|9
id → letter (letter|digit)*
Example: numbers such as 39.37, 6.336E4, 1.894E-4
digit → 0|1|………|9
digits → digit digit* // shorthand: digits → digit+
optional_fraction → . digits | ε // shorthand: optional_fraction → (. digits)?
optional_exponent → (E (+|-|ε) digits) | ε // shorthand: optional_exponent → (E (+|-)? digits)?
num → digits optional_fraction optional_exponent
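The regular definition for num can be turned directly into a hand-coded recognizer, one helper per definition. A sketch in C; the function names is_num and digits are our own, and the code follows the text's definition in allowing only an uppercase E in the exponent:

```c
#include <ctype.h>
#include <stddef.h>

/* digits -> digit digit* : consume one or more digits starting at s.
   Returns a pointer just past the digits, or NULL if none were found. */
static const char *digits(const char *s) {
    if (!isdigit((unsigned char)*s))
        return NULL;
    while (isdigit((unsigned char)*s))
        s++;
    return s;
}

/* num -> digits optional_fraction optional_exponent
   Returns 1 if the whole string s matches num, else 0. */
int is_num(const char *s) {
    s = digits(s);                      /* mandatory leading digits        */
    if (!s) return 0;
    if (*s == '.') {                    /* optional_fraction -> . digits|ε */
        s = digits(s + 1);
        if (!s) return 0;
    }
    if (*s == 'E') {                    /* optional_exponent -> E(+|-|ε)digits|ε */
        s++;
        if (*s == '+' || *s == '-')
            s++;
        s = digits(s);
        if (!s) return 0;
    }
    return *s == '\0';                  /* the entire string must match    */
}
```

The three example lexemes 39.37, 6.336E4, and 1.894E-4 are all accepted, while malformed strings such as 39. or 3E are rejected.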
Character Classes: The notation [abc] denotes the regular expression a|b|c. An abbreviated character class such as [a-z]
denotes the regular expression a|b…|z.
We can use character classes to describe identifiers as [a-zA-Z_][a-zA-Z0-9_]*
Recognition of Tokens: Transition Diagrams. Example: Transition diagram of relational operator
A * is used to mark states at which input retraction must take place (the forward pointer is moved back one character before the token is returned).
Example: Transition diagram for identifiers.
A simple technique for separating keywords from identifiers is to initialize appropriately the symbol table in which
information about identifiers is saved. gettoken() and install_id() are used to obtain the token and attribute-value,
respectively, to be returned. The procedure install_id() has access to input buffer, where the identifier lexeme has been
located. The symbol table is examined and if the lexeme is found there marked as a keyword, install_id() return 0. If the
lexeme is found as identifier, a pointer to the symbol table entry is returned.
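A sketch of this keyword-separation technique in C: the symbol table is pre-initialized with the keywords, so install_id() needs only one lookup to decide. Returning 0 for keywords follows the text; the fixed table size, the particular keyword list, returning an index instead of a pointer, and storing the lexeme pointer rather than copying it out of the input buffer are all illustrative simplifications:

```c
#include <string.h>

#define MAXSYM 64
/* Symbol table pre-initialized with the keywords (entries 0..4). */
static const char *symtab[MAXSYM] = { "if", "else", "while", "int", "return" };
static int nkeywords = 5;          /* entries below this index are keywords */
static int nsym = 5;               /* current number of entries             */

/* Returns 0 if lexeme is a keyword; otherwise returns the index of its
   symbol-table entry, inserting the lexeme the first time it is seen. */
int install_id(const char *lexeme) {
    for (int i = 0; i < nsym; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return (i < nkeywords) ? 0 : i;   /* keyword -> 0, id -> index */
    symtab[nsym] = lexeme;         /* first occurrence: new identifier */
    return nsym++;                 /* (overflow unchecked in this sketch) */
}
```

For example, install_id("if") yields 0 (keyword), while the first call with a new identifier yields its fresh table index, and later calls with the same identifier return that same index.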
Example input to the lexical analyzer:
S=a+b+c;
Avg=S/3;
Flex: Flex takes as input a specification file that contains, amongst other things, a set of regular expressions. Flex will generate
C code defining a function yylex(). This is the lexical analyzer itself and it is subsequently linked with the remainder of the
code for the compiler. Each time the function yylex() is called, it finds the next token from the input, the tokens being
defined by the regular expressions originally used as the specification file for flex. The automated generation of the
recognizing code makes the construction of a lexical analyzer much easier.
Structure of a Flex program
Definition Section
%%
Rules Section
%%
User Code
Definition Section
This section contains global declarations, a literal block, and header files. The literal block is C code delimited by %{ and %}. It
also contains variable declarations and function prototypes.
Rules Section
This section contains patterns and their corresponding actions. The pattern part is a regular expression for the lexical
analyzer, and the action part is C code that is executed when the pattern matches the input.
User Defined Section (Auxiliary Section)
This section contains any valid C code. It may contain the main() function and other user-defined functions.
The function yylex() is automatically generated by flex when it is given a .l file, and the parser is expected to call
yylex() to retrieve tokens from the token stream.
Note: The function yylex() is the main flex function that runs the Rules Section, and .l is the file extension used
to save flex programs.
Program example 1. Create your file in Notepad and save it as filename.l
%{
#include<stdio.h>
%}
letter [a-zA-Z]
digit [0-9]
del [ \t]
%%
{del}+ ; /* skip whitespace */
"if"|"int"|"else"|"main" {printf("Keyword: %s\n", yytext);}
%%
int yywrap(void){ return 1; }
int main() {
    yylex();
    return 0;
}
To build the scanner, run flex filename.l to generate lex.yy.c, then compile it with a C compiler, e.g. cc lex.yy.c -o scanner.