Chapter-2
by…
Mr. Birku L.
B.Sc. CS, M.Sc. SE
The Academic year of 2015 E.C
Chapter 2 Outline
Lexical Analysis?
Role of lexical analysis
Issues in lexical analysis
Token, Pattern, Lexeme
Attributes of a token
Input Buffering
Buffer pairs
Sentinels (Guards)
Specification of tokens using regular expressions
Regular expressions
Recognition of tokens using transition diagrams
Transition diagrams
Lexical Analysis
Error Messages
Role of Lexical Analysis
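1. To read the input characters of the source program
2. To group them into lexemes
3. To produce a sequence of tokens, used by the parser as its input
for syntax analysis
4. To interact with the symbol table
◦ To insert identifiers into the symbol table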
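[Figure: Source program → Lexical Analyzer → token → Parser → to semantic analysis. The parser requests each token with getNextToken; the symbol table sits alongside the lexical analyzer and the parser.]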
Role of Lexical Analysis
5. To strip out the parts of the source program that are not tokens:
• Comments: such as // or /* ------- */
• Whitespace: such as blank, newline, tab, …
Token               Sample lexemes              Informal description of pattern
Keyword (if)        if                          the characters i, f
Id                  X                           a letter followed by a sequence of letters and digits: l(l+d)*
Relation operator   <, <=, =, <>, >=, >         < or <= or = or <> or >= or >
Special symbols     ; : ( ) { } ' " . ,         ; or : or ( or ) or { or } or ' or " or . or ,
Number              3.14                        any numeric constant
Literal/Constant    "core"                      anything but ", surrounded by "
Cont…
A token is a class of lexical units, such as identifiers,
keywords, operators, special symbols, and constants. These
are typical tokens. E.g. int a=20; (keyword, identifier,
operator, constant, symbol)
Non-tokens: comments, tabs, blanks, newlines, etc.
A lexeme is a sequence of characters in the source program
that is matched by the pattern for a token, e.g. int, a, =, 20, ;
Lexemes are the words of the source program.
Pattern: a rule describing all the lexemes that can
represent a particular token in the source language.
It is the description of a class of tokens.
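The three terms can be seen together in a small sketch. It assumes a hypothetical set of token classes and patterns (the names in TOKEN_PATTERNS are made up for this example) and splits int a=20; into (token, lexeme) pairs, dropping whitespace as a non-token.

```python
import re

# Hypothetical token classes and their patterns, chosen only for this example.
# Order matters: keywords are tried before identifiers.
TOKEN_PATTERNS = [
    ("keyword",    r"\b(?:int|if|else|while|void)\b"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("constant",   r"[0-9]+"),
    ("operator",   r"="),
    ("symbol",     r";"),
    ("skip",       r"[ \t\n]+"),   # non-tokens: blanks, tabs, newlines
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def tokenize(source):
    """Return a list of (token, lexeme) pairs for the given source text."""
    pairs = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if m.lastgroup != "skip":          # whitespace is matched but not reported
            pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs

for token, lexeme in tokenize("int a=20;"):
    print(token, lexeme)
```

Each printed pair shows the token (the class) next to the lexeme (the matched characters); the regular expression attached to each class plays the role of the pattern.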
Attributes of a token
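When more than one lexeme matches the pattern, then
the lexical analyzer must provide additional information with the
token.
Information about the lexeme is placed in a symbol-table entry, and the
lexical analyzer represents these lexemes in the form of tokens
as: <token-name, attribute-value>
token-name: an abstract symbol that is used during syntax analysis,
attribute-value: points to an entry in the symbol table for this
token.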
E.g. E = M * C ** 2 has 7 tokens and associated attribute-values:
E   :  <id, 1>         (reference to symbol-table entry for E)
=   :  <assign_op, >
M   :  <id, 2>         (reference to symbol-table entry for M)
*   :  <mult_op, >
C   :  <id, 3>         (reference to symbol-table entry for C)
**  :  <exp_op, >
2   :  <num, 2>        (integer value 2)
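The pairs above can be reproduced with a short sketch. It assumes a list used as the symbol table (entries numbered from 1) and the token names from the example; the function name scan and the regular expression are illustrative, not part of any particular compiler.

```python
import re

# exp_op is listed before mult_op so that ** is not read as two * tokens.
TOKEN_RE = re.compile(
    r"(?P<id>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<num>[0-9]+)"
    r"|(?P<exp_op>\*\*)"
    r"|(?P<mult_op>\*)"
    r"|(?P<assign_op>=)"
    r"|(?P<ws>\s+)"
)

def scan(source):
    """Return (tokens, symbol_table) where each token is <token-name, attribute-value>."""
    symbol_table = []      # entry i (1-based) holds the lexeme of the i-th distinct identifier
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme) + 1))
        elif kind == "num":
            tokens.append(("num", int(lexeme)))
        else:
            tokens.append((kind, None))   # operators carry no attribute value
    return tokens, symbol_table

tokens, table = scan("E = M * C ** 2")
for t in tokens:
    print(t)
```

Only identifiers and numbers carry an attribute here; for the operators the token name alone tells the parser everything it needs.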
Attributes of a token
E.g. a statement such as y = 31 + 28 * x reaches the parser as the token stream:
<id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">
Each token is a pair <token-name, token attribute>.
Some examples of tokens
1. #include <iostream.h>                Handled by the preprocessor
2. #include <conio.h>                   Handled by the preprocessor
3. void main()                          void main ( )      (tokens 1, 2, 3, 4)
4. // this is a hello world program     Comment, not a valid token
5. {                                    {                  (token 5)
So the total number of tokens is 11.
Tokens of a Hello World program
void main ( )
{ cout < <
"Hello world!\n" ; }
Token classes: Keyword, Identifier, Number, Literal/Constant, Special symbol, Operator
1. #include <iostream.h>                            Handled by the preprocessor
2. #include <conio.h>                               Handled by the preprocessor
9. cout << "The value of c = " << c << endl;        cout < < "The…" < < c < < endl ;
10. }                                               }
So the total number of tokens is 46.
Tokens of Adding two Numbers
Keyword   Identifier   Number   Literal/Constant   Special symbol   Operator
void      a            10       "Enter..."         (                <<
int       b                     "Sub...."          )                >>
cout      c                                        {                +
cin       main                                     }                *
                                                   ;                =
                                                   ,
4         4            1        2                  6                5
Source line                                               Tokens
1. #include <iostream.h>                                  Handled by the preprocessor
2. #include <conio.h>                                     Handled by the preprocessor
3. void main()                                            void main ( )
4. {                                                      {
5. int n, m, r;                                           int n , m , r ;
6. cout << "Enter two No to get G,C,F\n";                 cout < < "Enter two ….." ;
7. cin >> n >> m;                                         cin > > n > > m ;
8. while (n!=0)                                           while ( n ! = 0 )
9. {                                                      {
10. r = m % n;                                            r = m % n ;
11. m=n;                                                  m = n ;
12. n=r;                                                  n = r ;
13. }                                                     }
14. cout << "G.C.F of entered numbers = " << m << endl;   cout < < "G.C.F…" < < m < < endl ;
15. }                                                     }
So the total number of tokens is 60.
E.g. Tokens of a G.C.F program
Keyword   Identifier   Number   Literal/Constant   Special symbol   Operator
void      n            0        "Enter..."         (                <<
int       m                     "G.C.F...."        )                >>
cout      r                                        {                %
cin       main                                     }                !=
while                                              ;                =
                                                   ,
5         4            1        2                  6                5
% != = <<
>>
2 1 0 0 1 2
Input Buffer
A buffer contains data that is stored for a short amount of time,
typically in the computer's memory (RAM), for the purpose of holding
the data before it is used.
The input buffer is also commonly known as the input area or input block.
When referring to computer memory, the input buffer
is a location that holds all incoming information
before it continues to the CPU for processing.
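• The lexical analyzer is the only phase of the compiler that reads the entire
source program one character at a time.
• There are several general approaches to the implementation of a
lexical analyzer. One is to use a lexical-analyzer generator,
such as the Lex or Flex compiler, to
produce the lexical analyzer from a regular-expression specification.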
Buffer Pair
The input buffer has two halves with N
characters in each half.
N might be the size of a block, like 1024 or 4096.
Mark the end of the input stream with a special character
eof.
Maintain two pointers into the buffer marking
the beginning and end of the current lexeme:
lexeme_beginning – marks the beginning of the current
lexeme
forward – scans ahead until a pattern match is
found
Simple Input Buffering algorithm
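Initially both pointers point to the first character of the
next lexeme to be found.
The forward pointer is advanced until a match for a
pattern is found. Once the lexeme is processed, set both
pointers to the character following the lexeme.
Code to advance forward pointer:
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;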
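The advance step above can be tried out with a minimal sketch. It assumes the input comes from a file-like object, uses a deliberately tiny half size N = 4 so that reloads happen quickly, and marks unused slots with None instead of a real eof character. The class name BufferPair and the reloads counter are made up for this example, and the lexeme_beginning pointer is left out to keep it short.

```python
import io

class BufferPair:
    """Two halves of n characters each; a half is reloaded when forward leaves the other."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        self.buf = [None] * (2 * n)   # None marks a slot with no input in it
        self.reloads = 0              # counts how many times a half was filled
        self._reload(0)               # fill the first half before scanning starts
        self.forward = 0

    def _reload(self, start):
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else None
        self.reloads += 1

    def advance(self):
        """Return the character under forward and move forward; None at end of input."""
        ch = self.buf[self.forward]
        if ch is None:
            return None
        if self.forward == self.n - 1:          # test 1: at end of first half
            self._reload(self.n)
            self.forward += 1
        elif self.forward == 2 * self.n - 1:    # test 2: at end of second half
            self._reload(0)
            self.forward = 0                    # move forward to beginning of first half
        else:
            self.forward += 1
        return ch

bp = BufferPair(io.StringIO("abcdefghij"), n=4)
chars = []
while (c := bp.advance()) is not None:
    chars.append(c)
print("".join(chars), bp.reloads)
```

Note the two position tests made on every call; a lexeme longer than one half would be overwritten by a reload, which this sketch does not guard against.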
Sentinels (Guards)
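Without sentinels, each advance of the forward pointer needs two tests:
• To test if it is at the end of a buffer half
• To determine what character is read (multiway branch)
Sentinel: an extra character added at the end of each buffer half
Improvement: optimize the common case by reducing the
number of tests to one per advance of fwd.
Method: extend each buffer half to hold a sentinel at the end.
• This is a special character that cannot occur in a
source program (eof)
• It signals the need for some special action (fill the
other buffer half, or terminate processing).
• When eof appears other than at the end of a buffer half, it means that the
input is at an end.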
Simple Sentinels algorithm
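• For almost every character advance we perform three tests:
1. Is the character eof?
2. Is the pointer at the end of the first half?
3. Is the pointer at the end of the second half?
• This can be reduced to one test by using sentinels.
• Add an eof character past the end of each buffer half.
• Use the following code to advance the forward pointer.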
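forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer half signifies the end of the input */
        terminate lexical analysis
end;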
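A minimal sketch of the sentinel scheme, under stated assumptions: the sentinel is the NUL character "\0" (assumed never to occur in the source text), the half size is a tiny N = 4, and the buffer layout is [half 1][eof][half 2][eof]. The class name SentinelBuffer is hypothetical.

```python
import io

EOF = "\0"   # assumed sentinel: a character that cannot occur in the source text

class SentinelBuffer:
    """Buffer pair where each half is followed by one sentinel slot."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        self.buf = [EOF] * (2 * n + 2)   # sentinel slots are at index n and 2n + 1
        self._reload(0)
        self.forward = -1                # the first advance lands on index 0

    def _reload(self, start):
        # Fill one half; slots past the end of the input are set to EOF.
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def advance(self):
        """Move forward one position and return the character there; None at end of input."""
        self.forward += 1
        ch = self.buf[self.forward]
        if ch != EOF:                            # the only test in the common case
            return ch
        if self.forward == self.n:               # sentinel at end of first half
            self._reload(self.n + 1)
            self.forward = self.n + 1
        elif self.forward == 2 * self.n + 1:     # sentinel at end of second half
            self._reload(0)
            self.forward = 0
        else:                                    # eof inside a half: end of input
            return None
        ch = self.buf[self.forward]
        return None if ch == EOF else ch

sb = SentinelBuffer(io.StringIO("abcdefghij"), n=4)
chars = []
while (c := sb.advance()) is not None:
    chars.append(c)
print("".join(chars))
```

For ordinary characters only the comparison with EOF is made; the position tests run only when a sentinel is actually seen.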
Specification of Tokens
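There are two dual notions for the specification and
recognition of tokens:
• Regular expressions (specification)
• Transition diagrams (automaton)
Any finite sequence of symbols is called a string.
The length of the string is the number of occurrences of
symbols in it.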
Specification of Tokens
The empty string is denoted by the Greek letter ε (epsilon).
The length of a string s is denoted |s|,
e.g. 0010, |0010| = 4; oo7, |oo7| = 3; ε, |ε| = 0
Law                          Description
r | s = s | r                | is commutative
r | (s | t) = (r | s) | t    | is associative
(r s) t = r (s t)            concatenation is associative
r (s | t) = r s | r t        concatenation distributes over |
(s | t) r = s r | t r
ε r = r                      ε is the identity element for
r ε = r                      concatenation
r* = (r | ε)*                relation between * and ε
r** = r*                     * is idempotent
Note: A language may be represented by several different but equivalent
regular expressions. An idempotent operation is one that has no
additional effect if it is called more than once with the same input.
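The laws can be spot-checked (not proved) with a short sketch. It assumes r = ab, s = a, t = b, writes ε as an empty group, and compares the sets of strings of length at most 4 over {a, b} that each side matches. The helper names language and g are made up for this example.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """All strings over `alphabet`, up to `max_len` long, that the whole pattern matches."""
    rx = re.compile(pattern)
    words = ("".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n))
    return {w for w in words if rx.fullmatch(w)}

def g(x):
    """Wrap a sub-expression in a non-capturing group so precedence is explicit."""
    return f"(?:{x})"

EPS = "(?:)"                 # the empty string, written as an empty group
r, s, t = "ab", "a", "b"     # sample sub-expressions, chosen arbitrarily

laws = {
    "r|s = s|r":      (f"{g(r)}|{g(s)}", f"{g(s)}|{g(r)}"),
    "r(s|t) = rs|rt": (g(r) + g(f"{g(s)}|{g(t)}"), f"{g(r)}{g(s)}|{g(r)}{g(t)}"),
    "(s|t)r = sr|tr": (g(f"{g(s)}|{g(t)}") + g(r), f"{g(s)}{g(r)}|{g(t)}{g(r)}"),
    "er = r":         (EPS + g(r), g(r)),
    "re = r":         (g(r) + EPS, g(r)),
    "r* = (r|e)*":    (g(r) + "*", g(f"{g(r)}|{EPS}") + "*"),
    "r** = r*":       (g(g(r) + "*") + "*", g(r) + "*"),
}

results = {name: language(lhs) == language(rhs) for name, (lhs, rhs) in laws.items()}
for name, ok in results.items():
    print(name, ok)
```

Agreement on short strings only illustrates the laws for one choice of r, s, t; it is a sanity check, not a proof.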
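Notational shorthands
Certain constructs occur so frequently in
regular expressions that it is convenient to introduce
notational shorthands for them.
One or more instances: The unary postfix operator +
means "one or more instances of". If r is a
regular expression that denotes the language L(r), then (r)+ is a
regular expression that denotes the language (L(r))+.
E.g. (r)+ denotes (L(r))+; with r = digit, (r)+ is digit+.
The + operator has the same precedence and associativity as
the * operator.
r* is shorthand for ε | r+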
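If x is a regular expression, then:
• x* means zero or more occurrences of x,
i.e., it can generate {ε, x, xx, xxx, …}
• x+ means one or more occurrences of x,
i.e., it can generate {x, xx, xxx, …} or x.x*
• x? means at most one occurrence of x,
i.e., it can generate either {x} or {ε}
[a-z] is all lower-case letters of the English language.
[A-Z] is all upper-case letters of the English language.
[0-9] is all natural digits used in mathematics.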
Notational shorthands
letter = [a-z] or [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = + | - or [+-]
digits = (digit)+
letters = (letter)+
Identifier = (letter)(letter | digit)*
Decimal = (sign)?(digit)+
Real number = ((digit)+ "." (digit)*) | ("." (digit)+)
keyword -> keyword
Relation operator -> < | > | <= | >= | = | <>
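The definitions above translate almost directly into a sketch with Python's re module. The variable names mirror the slide, while the test strings are made up for this example; fullmatch is used so that the whole lexeme must fit the pattern.

```python
import re

# Building blocks, written as character classes.
letter = "[A-Za-z]"
digit = "[0-9]"
sign = "[+-]"

# Identifier = (letter)(letter | digit)*
identifier = re.compile(f"{letter}(?:{letter}|{digit})*")
# Decimal = (sign)?(digit)+
decimal = re.compile(f"(?:{sign})?(?:{digit})+")
# Real number = ((digit)+ "." (digit)*) | ("." (digit)+)
real = re.compile(rf"(?:{digit})+\.(?:{digit})*|\.(?:{digit})+")

def matches(pattern, lexeme):
    """True if the whole lexeme is described by the pattern."""
    return pattern.fullmatch(lexeme) is not None

for lexeme in ["x1", "1x", "-42", "3.14", ".5", "3"]:
    print(lexeme,
          "identifier" if matches(identifier, lexeme) else "",
          "decimal" if matches(decimal, lexeme) else "",
          "real" if matches(real, lexeme) else "")
```

Under these definitions 3 is a Decimal but not a Real number, because the Real number pattern requires a "." in the lexeme.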
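The only problem left with the lexical analyzer is how to verify
the validity of a regular expression used in the
specification of patterns of keywords of a language. A well-accepted solution is
to use finite automata for verification.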
Finite automata
Finite automata are recognizers (machines) used to
recognize patterns.
A finite automaton accepts or rejects inputs based on an already defined set of
strings, known as the language of the automaton. It also has a finite
number of states (accepting or rejecting).
A finite automaton answers "Yes" or "No" about each possible
string.
An FA is a mathematical model that consists of:
Set of states S
Initial state s0
Set of input symbols ∑
Set of final (accepting) states F
Transition function move (δ)
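The five components can be written down directly. The sketch below is one possible encoding, assuming an automaton for the Identifier pattern (letter)(letter | digit)*; the state names, the symbol_class helper and the explicit "dead" state are choices made for this example.

```python
# S: set of states; "dead" is a rejecting trap state.
STATES = {"start", "in_id", "dead"}
# s0: initial state.
START = "start"
# F: set of accepting states.
ACCEPTING = {"in_id"}

def symbol_class(ch):
    """Map an input character to one of the input-symbol classes of the automaton."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# move: transition function, one entry per (state, symbol class).
MOVE = {
    ("start", "letter"): "in_id",
    ("start", "digit"):  "dead",
    ("start", "other"):  "dead",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"):  "in_id",
    ("in_id", "other"):  "dead",
    ("dead",  "letter"): "dead",
    ("dead",  "digit"):  "dead",
    ("dead",  "other"):  "dead",
}

def accepts(text):
    """Run the automaton over text and answer yes (True) or no (False)."""
    state = START
    for ch in text:
        state = MOVE[(state, symbol_class(ch))]
    return state in ACCEPTING

for word in ["count1", "1count", "", "a_b"]:
    print(repr(word), accepts(word))
```

Because MOVE has exactly one entry for every state and symbol class, the machine never has a choice to make and always ends in a definite yes or no.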
Types of Finite automata
There are two types of finite automata.
Deterministic finite automata (DFA): each state has exactly
one edge leaving it for each input symbol.