L4 - Lexical Analysis (Introduction)
Introduction
A lexeme is a sequence of characters in the source program that matches the pattern of a token.
It is an instance of that token.
What's a token?
A token is a sequence of characters that is treated as a single unit of information in the source
program, such as a keyword, an identifier, or an operator.
What is Pattern?
A pattern is a description of the form that the lexemes of a token may take. For a keyword used as a
token, the pattern is simply the sequence of characters that spells the keyword.
Issues Of Lexical Analyzer
Consider the following code that is fed to the Lexical Analyzer:

    #include <stdio.h>
    #define NUMS 8
    int maximum(int x, int y) {
        // This will compare 2 numbers
        if (x > y)
            return x;
        else {
            return y;
        }
    }

Examples of Tokens created

Lexeme      Token
int         Keyword
maximum     Identifier
(           Operator
int         Keyword
x           Identifier
,           Operator
int         Keyword
y           Identifier
)           Operator
{           Operator
}           Operator
>           Operator
if          Keyword
return      Keyword
else        Keyword
;           Operator

Examples of Nontokens

Type                       Examples
Comment                    // This will compare 2 numbers
Pre-processor directive    #include <stdio.h>
Pre-processor directive    #define NUMS 8
Macro                      NUMS
Whitespace                 \n \b \t
The input is scanned with two pointers, the begin pointer (bp) and the forward pointer (fp).
Initially both pointers point to the first character of the input string.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme. In the example above, as soon as fp encounters a
blank space, the lexeme “int” is identified.
When fp encounters white space, it ignores it and moves ahead. Then both bp and fp are set to the
first character of the next token.
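The two-pointer scan described above can be sketched in C as follows. This is only an illustration under a simplifying assumption taken from the description: a lexeme ends at white space or at the end of the input. The function name next_lexeme and its parameters are hypothetical and do not come from the lecture.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copies the next white-space-delimited lexeme from *input into out
 * (capacity out_size) and advances *input past it.
 * Returns 1 if a lexeme was found, 0 at the end of the input. */
int next_lexeme(const char **input, char *out, size_t out_size)
{
    const char *bp;   /* begin pointer: first character of the lexeme */
    const char *fp;   /* forward pointer: scans ahead for its end     */
    size_t len;

    assert(input != NULL && *input != NULL && out != NULL && out_size > 0);

    /* Skip white space so that bp rests on the start of a lexeme. */
    bp = *input;
    while (*bp != '\0' && isspace((unsigned char)*bp))
        bp++;
    if (*bp == '\0') {
        *input = bp;
        return 0;
    }

    /* Both pointers start on the same character; fp then moves ahead
     * until it meets white space or the end of the input. */
    fp = bp;
    while (*fp != '\0' && !isspace((unsigned char)*fp))
        fp++;

    /* The lexeme is the text in [bp, fp); truncate it if out is too small. */
    len = (size_t)(fp - bp);
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, bp, len);
    out[len] = '\0';

    *input = fp;   /* the next call resumes here and skips the white space */
    return 1;
}
```

Calling next_lexeme repeatedly on "int maximum" yields "int" and then "maximum". A real scanner would also stop fp at delimiters such as "(" or ";", which this sketch deliberately leaves out.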
8/23/2020 Compiled by Mithun Roy, Department of CSE, SIT, Siliguri 5
Specifications of Tokens
Let us see how language theory defines the following terms:
Alphabets
An alphabet is any finite set of symbols. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.
Strings
Any finite sequence of symbols drawn from an alphabet is called a string. The length of a string is the total number of
symbol occurrences in it, e.g., the length of the string “Mithun Roy” is 10 (counting the space) and is denoted by |Mithun Roy| = 10.
A string having no symbols, i.e. a string of zero length, is known as the empty string and is denoted by ε (epsilon).
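As a small illustration of these definitions, the following C sketch checks whether a string is built only from the symbols of a given alphabet. The helper name is_string_over is hypothetical; the alphabet is passed as a C string that lists its symbols.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if every character of s belongs to alphabet, otherwise 0.
 * The empty string is a string over every alphabet. */
int is_string_over(const char *s, const char *alphabet)
{
    assert(s != NULL && alphabet != NULL);
    /* strspn gives the length of the longest prefix of s made only of
     * characters from alphabet; s qualifies if that prefix is all of s. */
    return strspn(s, alphabet) == strlen(s);
}
```

With the binary alphabet "01", the string "0110" qualifies and "0120" does not. The length |s| corresponds to strlen(s), so strlen("Mithun Roy") is 10 and the empty string "" has length 0.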
Special Symbols
A typical high-level language contains the following symbols:-
Consider the following grammar fragment:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the following
regular definitions:
if    → if
then  → then
else  → else
relop → < | <= | = | <> | > | >=
id    → letter (letter | digit)*
num   → digit+ (. digit+)? (E (+ | -)? digit+)?
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as
the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved;
that is, they cannot be used as identifiers.
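The regular definitions for id and num, together with the reserved-keyword rule, can be sketched as hand-written C matchers. This is one possible rendering and not the lecture's own code: the function names are hypothetical, and letter and digit are assumed to mean what isalpha and isdigit accept in the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix of s matching  id -> letter (letter | digit)*,
 * or 0 if s does not start with a letter. */
size_t match_id(const char *s)
{
    size_t i = 0;
    assert(s != NULL);
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[i]))
        i++;
    return i;
}

/* Length of the longest prefix of s matching
 *   num -> digit+ (. digit+)? (E (+ | -)? digit+)?
 * or 0 if s does not start with a digit. */
size_t match_num(const char *s)
{
    size_t i = 0, j;
    assert(s != NULL);
    while (isdigit((unsigned char)s[i]))
        i++;
    if (i == 0)
        return 0;
    /* Optional fraction: taken only if a digit follows the '.'. */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i]))
            i++;
    }
    /* Optional exponent: taken only if a digit follows E and the sign. */
    if (s[i] == 'E') {
        j = i + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j]))
                j++;
            i = j;
        }
    }
    return i;
}

/* Returns 1 if the first len characters of lexeme spell a reserved keyword. */
int is_keyword(const char *lexeme, size_t len)
{
    static const char *const keywords[] = { "if", "then", "else" };
    size_t k;
    assert(lexeme != NULL);
    for (k = 0; k < sizeof keywords / sizeof keywords[0]; k++)
        if (strlen(keywords[k]) == len && strncmp(lexeme, keywords[k], len) == 0)
            return 1;
    return 0;
}
```

Because keywords are reserved, a scanner built this way would first call match_id and then ask is_keyword whether the matched lexeme is if, then, or else. Only if it is not would the lexeme be reported as an id.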
Transition diagrams
A transition diagram is a diagrammatic representation of the actions that take place when the lexical
analyzer is called by the parser to get the next token. It keeps track of information about the
characters that are seen as the forward pointer scans the input.
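A transition diagram maps naturally onto a loop over a state variable. The C sketch below simulates one possible diagram for the relop definition (<, <=, =, <>, >, >=). The state numbers, the enum names, and the function name scan_relop are assumptions made for illustration. Wherever the diagram has to read one character too many to decide, the forward pointer is moved back by one.

```c
#include <assert.h>
#include <stddef.h>

enum relop { RELOP_NONE, RELOP_LT, RELOP_LE, RELOP_EQ,
             RELOP_NE, RELOP_GT, RELOP_GE };

/* Simulates a transition diagram for relop on the input starting at s.
 * On success stores the lexeme length in *len and returns the operator;
 * otherwise stores 0 in *len and returns RELOP_NONE. */
enum relop scan_relop(const char *s, size_t *len)
{
    const char *fp = s;              /* forward pointer */
    int state = 0;                   /* start state     */
    enum relop result = RELOP_NONE;
    int done = 0;

    assert(s != NULL && len != NULL);
    while (!done) {
        char c = *fp++;              /* read a character and advance */
        switch (state) {
        case 0:                      /* nothing seen yet */
            if (c == '<')      state = 1;
            else if (c == '>') state = 2;
            else if (c == '=') { result = RELOP_EQ; done = 1; }
            else               { fp--; done = 1; }   /* not a relop */
            break;
        case 1:                      /* seen '<' */
            if (c == '=')      result = RELOP_LE;
            else if (c == '>') result = RELOP_NE;
            else             { result = RELOP_LT; fp--; }   /* retract */
            done = 1;
            break;
        case 2:                      /* seen '>' */
            if (c == '=')      result = RELOP_GE;
            else             { result = RELOP_GT; fp--; }   /* retract */
            done = 1;
            break;
        }
    }
    *len = (size_t)(fp - s);
    return result;
}
```

For the input "<= y" the function returns RELOP_LE with a length of 2. For "< y" it reads the blank, retracts, and returns RELOP_LT with a length of 1.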
Lexical Errors
• Lexical errors are not very common, but they should be handled by the scanner.
• Misspellings of identifiers, operators, and keywords are considered lexical errors.
• Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a
token.
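The illegal-character case can be illustrated with a short C sketch that reports the position of the first character that cannot begin any token. The set of legal characters used here is an assumption for a C-like language, the function name first_illegal_char is hypothetical, and the sketch makes no attempt to skip string literals or comments, where such characters would be allowed.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns the index of the first character of src that is not a letter,
 * a digit, white space, or one of the assumed punctuation characters,
 * or -1 if every character is legal. */
long first_illegal_char(const char *src)
{
    /* Assumed punctuation of a C-like language (illustrative only). */
    static const char punct[] = "+-*/%=<>!&|^~?:;,.(){}[]#\"'\\_";
    long i;

    assert(src != NULL);
    for (i = 0; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];
        if (isalnum(c) || isspace(c))
            continue;                 /* can start or continue a token */
        if (strchr(punct, c) != NULL)
            continue;                 /* known operator or separator   */
        return i;                     /* illegal character found       */
    }
    return -1;
}
```

For example, first_illegal_char("int x = 5 @ 3;") reports index 10, the position of "@", while a line containing only legal characters gives -1.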
Error Recovery in Lexical Analyzer
Here are a few of the most common error recovery techniques:
Advantages of Lexical Analyzer
• Lexical analysis is used by web browsers to format and display a web page with the help of parsed
data from JavaScript, HTML, and CSS.
Disadvantages of Lexical Analyzer
• More effort is needed to develop and debug the lexer and its token descriptions.
• Additional runtime overhead is required to generate the lexer tables and construct the tokens.
Find all the TOKENs and NONTOKENs in the following “C” program.
# include<stdio.h>
# define N 10
int main(){
    int sum=0,i;
    // this is the code where we add the first N natural numbers.
    for (i=1;i<=N;i++){
        sum=sum + i;
    }
    return(0);
}