Chapter 2 - Lexical Analysis
Basic Topics in Chapter Two
o The output is a sequence of tokens that are sent to the parser for syntax analysis.
o Eliminates comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
o Reports error messages, together with the line number at which each error occurs.
[Figure: interaction between the lexical analyzer and the parser. The lexical analyzer reads characters from the source program (the entire program may be read into memory, and characters can be put back), passes a token and its attribute value to the parser on each "get next" request, and both phases consult the symbol table.]
There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing, the simplest being simplicity of design.
Three related terms are used when discussing lexical analysis:
i. Tokens
ii. Lexemes
iii. Patterns
The lexical analyzer scans the source program and produces as output a sequence of tokens, which are normally passed, one at a time, to the parser.
Examples of Non-Tokens:
o Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.
For the source program line

int MAX(int a, int b)

the lexemes and tokens are:

Lexeme   Token
int      keyword
MAX      identifier
(        operator
int      keyword
a        identifier
,        operator
int      keyword
b        identifier
)        operator
Example #2 (Tokens, Patterns and Lexemes)
To see how these concepts are used in practice in a programming language, the table below gives some typical tokens, their informally described patterns, and some sample lexemes.

Token      Pattern     Sample lexemes
keyword    if          if
;          ;           ;
When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.
For example, the pattern for token number matches both 0 and 1, but it is extremely important for the code generator to know which lexeme was found in the source program.
Thus, the lexical analyzer returns to the parser not only a token name but also an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse.
Normally, information about an identifier — e.g., its lexeme, its type, and the location at which it is first found — is kept in the symbol table, so the appropriate attribute value for an identifier is a pointer to its symbol-table entry. For other tokens, the attribute may simply identify the particular lexeme, for example:

Token      Attribute
operator   <mult-op>
operator   <exp-op>
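To make the (token name, attribute value) pairing concrete, here is a minimal C sketch; the type and field names are illustrative and not taken from the notes.

#include <stdio.h>

/* A token is handed to the parser as a pair <token name, attribute value>. */
enum token_name { TOK_ID, TOK_NUMBER, TOK_RELOP };

struct token {
    enum token_name name;      /* influences parsing decisions                  */
    union {
        int  symtab_index;     /* identifiers: index of the symbol-table entry  */
        long int_value;        /* numbers: which lexeme (0, 1, 42, ...) matched */
        int  relop_kind;       /* relational operators: LT, LE, EQ, ...         */
    } attribute;               /* influences translation after the parse        */
};

int main(void) {
    struct token t;
    t.name = TOK_NUMBER;           /* the pattern "number" matched ...     */
    t.attribute.int_value = 42;    /* ... and the particular lexeme was 42 */
    printf("token <%d, %ld>\n", t.name, t.attribute.int_value);
    return 0;
}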
In this phase, all possible errors made by the user are detected and reported to the user in the form of error messages.
This process of locating errors and reporting them to the user is called the Error Handling process.
Functions of the Error Handler:
o Detection
o Reporting
o Recovery
o Unmatched string
Example:
#1 : printf("Geeksforgeeks");$
This is a lexical error, since the illegal character $ appears at the end of the statement.
#2 : This is a comment */
This is a lexical error, since the end-of-comment marker is present but the beginning is not.
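As a small sketch of the detection and reporting functions described above (the function name and message format are illustrative, not part of the notes):

#include <stdio.h>

static int line_no = 1;   /* maintained by the scanner as newlines are read */

/* Reporting: print the kind of lexical error together with the line number. */
void lexical_error(int line, char offending, const char *msg) {
    fprintf(stderr, "line %d: lexical error: %s ('%c')\n", line, msg, offending);
}

int main(void) {
    /* e.g. the illegal character $ from example #1 above; recovery could then
       simply skip the character and continue scanning (panic-mode recovery). */
    lexical_error(line_no, '$', "illegal character");
    return 0;
}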
• Examples of alphabets: the binary alphabet {0, 1} and the set of ASCII characters.
Empty String:
– The string of zero length. This is denoted by ∈ (epsilon).
– The empty string plays the role of 0 in a number system.
– The set of all strings, including the empty string, over an alphabet ∑ is denoted by ∑*; the set of non-empty strings is ∑+ = ∑* − {є}.
– E.g. {є} is a set containing only the empty string, whose length is zero.
– {a} is a set containing one string whose length is 1.
Length of String:
– The number of positions for symbols in the string.
– A string is generally denoted by w.
– The length of string is denoted by |w|.
– Example: |10011| = 5
– Example:
– {0,1}*={є,0,1,00,01,10,11….}
– ∑^0 = {є}, regardless of what alphabet ∑ is; that is, є is the only string of length 0.
– If ∑ = {0,1} then
– ∑^1 = {0,1}
– ∑^2 = {00,01,10,11}
– ∑^3 = {000,001,…,111} and so on
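A tiny C sketch (illustrative, not from the notes) that prints ∑^2 for ∑ = {0, 1}:

#include <stdio.h>

int main(void) {
    const char sigma[] = { '0', '1' };          /* the alphabet ∑ = {0,1}          */
    /* all strings of length 2 over ∑, i.e. ∑^2 = {00, 01, 10, 11} */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            printf("%c%c\n", sigma[i], sigma[j]);
    return 0;
}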
Concatenation Language:
– Assume a string x of length m and a string y of length n; the concatenation of x and y is written xy, which is the string obtained by appending y to the end of x, as in x1 x2 … xm y1 y2 … yn.
– To concatenate a string with itself many times we use the “superscript” notation: w^n denotes the string obtained by concatenating n copies of w, with w^0 = є.
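– A small worked example (strings chosen only for illustration): if x = ab and y = ba, then xy = abba, |xy| = |x| + |y| = 4, x^2 = abab, and x^0 = є.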
• Natural language:
– Which is a kind of language that real people speak
– Formal language is different from natural language
– We use a formal language to model part of a natural language such as parts of the phonology,
morphology or syntax
Types of Finite automata
– In a DFA, given the current state and the current input, the next move (the transition from one state to another) is uniquely determined by the current configuration.
– If the internal state, the input, and the contents of the storage are known, it is possible to predict the future behavior of the automaton.
– If this holds, the automaton is said to be deterministic (a DFA); otherwise it is nondeterministic (an NFA).
Definition of DFA
• A deterministic finite automaton is a 5-tuple M = (Q, ∑, δ, q0, F), where Q is a finite set of states, ∑ is the input alphabet, δ : Q × ∑ → Q is the transition function, q0 ∈ Q is the initial state, and F ⊆ Q is the set of final (accepting) states.
• If δ(q0, a) = q1, then if the DFA is in state q0 and the current input symbol is a, the DFA will go into state q1.
Example
a. Transition diagram
– a diagram consisting of circles to represent states and directed line segments to
represent transitions between the states.
– One or more actions (outputs) may be associated with each transition.
– If the current input symbol matches the symbol on an arc leaving the current state, then cross that arc, move to the next state, and also advance one symbol in the input.
– If we are in the accepting state (q4) when we run out of input, the machine has
successfully recognized an instance of sheeptalk.
– If the machine never gets to the final state, either because it runs out of input, or it gets
some input that doesn’t match an arc, or if it just happens to get stuck in some non-final
state, we say the machine REJECTS or fails to accept an input.
• Solution:
From the given table for δ, the DFA is drawn, where q2 is the only final state. (It is to be noted that a DFA can “accept” a string and “recognize” a language.)
Example #2
Q2. Design a DFA, the language recognized by the automata being L = {anb :n≥0}
•Solution:
• For the given language L = {a^n b : n ≥ 0}, the strings could be {b, ab, a^2 b, a^3 b, …}
• Therefore the DFA accepts all strings consisting of an arbitrary number of a’s followed by a single b; all other input strings are rejected.
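A C sketch of a table-driven DFA for this language (the state numbering and the explicit dead state are illustrative choices, not from the notes):

#include <stdio.h>
#include <string.h>

/* States: Q0 = start (reading a's), Q1 = accepting (the single b was read),
   DEAD = trap state for any input that can no longer lead to acceptance. */
enum { Q0, Q1, DEAD };

static int delta(int q, char c) {
    switch (q) {
        case Q0: return (c == 'a') ? Q0 : (c == 'b') ? Q1 : DEAD;
        case Q1: return DEAD;          /* nothing may follow the single b */
        default: return DEAD;
    }
}

static int accepts(const char *w) {
    int q = Q0;
    for (size_t i = 0, n = strlen(w); i < n; i++)
        q = delta(q, w[i]);
    return q == Q1;                    /* Q1 is the only final state */
}

int main(void) {
    const char *tests[] = { "b", "ab", "aaab", "", "ba", "abb" };
    for (int i = 0; i < 6; i++)
        printf("%-4s -> %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}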
• Solution:
• For ∑ = {a, b}, the language generated = {a,b}* (∈ is accepted when the initial state is also a final state)
• For ∑ = {a, b}, the language generated = {a,b}+ (∈ is not accepted)
Exercise #1
Q1. Design a DFA, M, which accepts the language L(M) = {w ∈ {a, b}* : w does not contain three consecutive b’s}.
• Let M = (Q, ∑, δ, q0, F), where Q = {q0, q1, q2, q3}, ∑ = {a, b}, q0 is the initial state, F = {q0, q1, q2} is the set of final states, and δ can be defined as follows:
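One δ consistent with this specification (a sketch; here state qi records that the last i input symbols read were b’s, and q3 is a trap state):

δ      a      b
q0     q0     q1
q1     q0     q2
q2     q0     q3
q3     q3     q3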
Q3. Given ∑ = {a, b}, construct a DFA that shall recognize the language L = {a^m b^n : m, n > 0}.
• An NFA can differ from a deterministic one in that, for any input symbol, there may be more than one possible next state.
• In an NFA, given the current state, there could be multiple next states.
• The question is: how do we choose which state to go to on the next input, since there are multiple possibilities?
– In an NFA, the next state may be chosen at random, or all possibilities may be followed in parallel.
– A string is accepted by an NFA if there is some sequence of possible moves that will
put the machine in the final state at the end of the string.
– In a DFA, for a given state and a given input we reach a deterministic and unique state.
– In an NFA (or NDFA) we may be led to more than one state for a given input.
ε-transitions are useful for combining smaller automata into larger ones
This machine combines a machine for {a}* and a machine for {b}*
It uses an ε-transition at the start to achieve the union of the two languages
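One way to write this machine down (a sketch; the state names are illustrative): states {q0, q1, q2}, start state q0, final states {q1, q2}; ε-transitions q0 → q1 and q0 → q2; a loop on a at q1 and a loop on b at q2. Following the first ε-arc accepts {a}*, following the second accepts {b}*, so the whole machine accepts {a}* ∪ {b}*.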
• Draw the state diagram for an NFA accepting the language L = (ab)*(ba)* ∪ aa*
• Design an NFA to accept strings of a’s and b’s that end with ‘aa’
Two finite automata M1 and M2 are equivalent if L(M1) = L(M2),
– i.e., if both accept the same language.
[The corresponding DFA, constructed from the NFA by the subset construction; any subset of states containing q2 is a final state.]
– Can we transform a large automaton into a smaller one (provided a smaller one
exists)?
letter (letter | digit)*
Although regular expressions cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
Strings and Languages
An alphabet is any finite set of symbols such as letters, digits, and punctuation.
– If x and y are strings, then the concatenation of x and y, denoted xy, is also a string.
In language theory, the terms "sentence" and "word" are often used as synonyms
for "string."
The empty string is the identity under concatenation; that is, for any string s, ∈s = s∈ = s.
Def. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.
Let L = {A, . . . , Z}; then {“A”, “B”, “C”, “BF”, …, “ABZ”, …} is considered a language defined over L.
Abstract languages like ∅, the empty set, or {∈}, the set containing only the empty string, are languages under this definition.
Terms for Parts of Strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, the prefixes of banana are:
∈, b, ba, ban, bana, banan, banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, the suffixes of banana are:
banana, anana, nana, ana, na, a, ∈.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
For instance, banana, nan, and ∈ are substrings of banana.
4. A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s.
For example: baan is a subsequence of banana.
Operations on Languages
Example:
Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z } and
let D be the set of digits {0, 1, …, 9}.
L and D are, respectively, the alphabets of uppercase and lowercase letters and
of digits.
Other languages can be constructed from L and D, using the operators illustrated above, as shown on the next slide.
Operations on Languages (cont.)
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
Zero or more: a* means “zero or more a’s”.
One or more: a+ means “one or more a’s”; you can also use aa* (or equivalently a*a) to mean “one or more a’s”.
Zero or one: a? denotes an optional ‘a’; it can be described as (a | ∈).
A regular expression over {0,1} is built from the symbols 0 and 1, the empty string ∈, and the empty set ∅, using the operations + (union), ⋅ (concatenation) and * (closure).
There is a very simple correspondence between regular expressions and the languages
they denote:
o The RE (a|b)* denotes the set of all strings of a’s and b’s, including the empty string є
o The RE a|a*b denotes the set containing the string a and all strings consisting of
zero or more a’s followed by a b
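To experiment with the second RE, the sketch below uses the POSIX regex library (an assumption of this example, not something the notes rely on); the pattern is anchored with ^ and $ so that the whole input string must match a|a*b:

#include <regex.h>
#include <stdio.h>

/* Return 1 if the whole string s matches the extended regular expression pattern. */
static int matches(const char *pattern, const char *s) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                              /* treat a bad pattern as "no match" */
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    const char *tests[] = { "a", "b", "aaab", "ab", "ba", "aa" };
    for (int i = 0; i < 6; i++)
        printf("%-4s %s\n", tests[i],
               matches("^(a|a*b)$", tests[i]) ? "in L(a|a*b)" : "not in L(a|a*b)");
    return 0;
}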
Precedence and Associativity
• The unary operator * has the highest precedence, concatenation has the second highest, and | (union) has the lowest precedence; all three are left associative.
A regular definition gives names to regular expressions, as a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
For example:

Area     → ‘(‘ digit+ ‘)’
Exchange → digit+
Phone    → digit+
Address  → name ‘@’ name ‘.’ name ‘.’ name
Tokens are specified with the help of regular expressions; they are then recognized using transition diagrams.
Recognition of Tokens: Transition Diagram
[Transition diagram for relational operators: from start state 0, the input < leads to state 1; from state 1, = returns (relop, LE), > returns (relop, NE), and any other character leads to a state marked * (retract one character) that returns (relop, LT). From state 0, = returns (relop, EQ), and > leads to state 6, from which = returns (relop, GE).]

[Transition diagram for identifiers: from start state 9, a letter leads to state 10; state 10 loops on letters and digits; any other character leads to state 11, marked * (retract one character), which returns (id).]
switch (state) {
    case 9:
        c = nextchar();
        if (isletter(c)) state = 10;
        else state = failure();
        break;
    case 10:
        c = nextchar();
        if (isletter(c) || isdigit(c)) state = 10;
        else state = 11;
        break;
    case 11:
        retract(1);      /* put the lookahead character back on the input */
        insert(id);      /* record the lexeme in the symbol table         */
        return (id);     /* return the token id to the parser             */
}
c. Recognition of numbers (integer | floating point)
Unsigned numbers in Pascal:
[Three transition diagrams: one accepting floats with an exponent (e.g. 12.31E4), one accepting floats (e.g. 12.31), and one accepting integers (e.g. 12).]
If we started with the integer diagram, its accept state would already be reached after reading 12, even when the input is actually 12.31 or 12.31E4.
Therefore, the matching should always start with the first (most specific) transition diagram.
If failure occurs in one transition diagram, then retract the forward pointer to the start state and activate the next diagram.
If failure occurs in all diagrams, then a lexical error has occurred.
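For comparison, the same three kinds of numbers can be described by one regular definition in the notation used earlier (a sketch; the names are illustrative):

digit    → 0 | 1 | … | 9
digits   → digit digit*
fraction → ( . digits ) | ∈
exponent → ( E ( + | - | ∈ ) digits ) | ∈
number   → digits fraction exponent

Here 12 matches digits alone, 12.31 adds a fraction, and 12.31E4 adds an exponent.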
The lexical analyzer also strips out whitespace (blanks, tabs, newlines) and perhaps other characters that are not considered by the language design to be part of any token.
Two approaches are commonly used to handle keywords (reserved words):
o Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent.
o Create separate transition diagrams for each keyword; the transition diagram for a keyword simply spells it out letter by letter and then checks that the next character is not a letter or digit. For example, see the sketch below.
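A sketch of such a diagram for the keyword if (the state names are illustrative; * marks retraction of the lookahead character, as in the earlier diagrams):

start → s0 --i--> s1 --f--> s2 --nonletter/nondigit--> s3*   return(if)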
• The main purpose of LEX is to facilitate lexical analysis, the processing of character
sequences such as source code to produce symbol sequences called tokens for use as input to
other programs such as parsers.
The input notation for the Lex tool is referred to as the Lex language and the tool itself is the
Lex compiler.
The Lex compiler transforms the input patterns into a transition diagram and generates code, in
a file called lex.yy.c, that simulates this transition diagram.
How does the Lex tool work?
– First, the lexical analyzer is specified in the Lex language, in a file conventionally called lex.l.
– The Lex compiler then runs on the lex.l program and produces a C program lex.yy.c.
– Finally, the C compiler compiles lex.yy.c and produces an object program a.out.
– a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
%{
#include<math.h>
#include<stdio.h>
#include<stdlib.h>
%}
– an action describes what action the lexical analyzer should take when pattern pi matches a lexeme. For example:
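For instance, rules of the following form could appear in the rules section (a sketch; the token codes NUMBER and ID and the definitions {digit} and {id} are illustrative assumptions, and a complete specification is sketched at the end of this section):

{digit}+     { return NUMBER; }   /* an unsigned integer lexeme was matched     */
{id}         { return ID; }       /* the pattern letter(letter|digit)* matched  */
[ \t\n]+     { /* whitespace: no token is returned to the parser */ }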
These functions are compiled separately and loaded with the lexical analyzer.
The lexical analyzer produced by Lex starts its processing by reading its input one character at a time.
Once a match is found, the associated action is executed to produce the token.
yywrap()
– is called whenever Lex reaches an end-of-file
– The default yywrap() always returns 1
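Putting the pieces together, a minimal, self-contained Lex specification might look as follows (a sketch only; the token codes, patterns, and printed output are illustrative assumptions, not taken from the notes):

%{
#include <stdio.h>
/* Illustrative token codes; in practice these usually come from the parser. */
#define NUMBER 256
#define ID     257
%}

digit    [0-9]
letter   [A-Za-z]
id       {letter}({letter}|{digit})*

%%
{digit}+     { printf("NUMBER(%s)\n", yytext); return NUMBER; }
{id}         { printf("ID(%s)\n", yytext);     return ID;     }
[ \t\n]+     { /* skip whitespace: no token is produced */ }
.            { fprintf(stderr, "lexical error: illegal character %s\n", yytext); }
%%

int yywrap(void) { return 1; }    /* 1 means: no more input files to scan */

int main(void) {
    while (yylex() != 0)          /* repeatedly request the next token    */
        ;
    return 0;
}

Running the Lex compiler on this file produces lex.yy.c, which is then compiled with a C compiler to obtain the lexical analyzer, exactly as described above.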