
Chapter-2

The document outlines the principles of compiler design, specifically focusing on lexical analysis, which is the first phase of a compiler that processes source code into tokens. It discusses the roles of lexical analysis, input buffering techniques, token specification using regular expressions, and the recognition of tokens through transition diagrams. Additionally, it provides examples of tokens and their attributes, as well as methods for optimizing input reading and token specification.

Uploaded by

Tesfalegn Yakob
Copyright
© © All Rights Reserved

Compiler design

Department of Computer Science


FACULTY OF TECHNOLOGY
Debre Markos University Burie campus
Year IV Semester I

by…
Mr. Birku L.
B.Sc. CS, M.Sc. SE
The Academic year of 2015 E.C
Chapter 2 Outline
• Lexical Analysis
  ◦ Role of lexical analysis
  ◦ Issues in lexical analysis
  ◦ Token, Pattern, Lexeme
  ◦ Attributes of a token
• Input Buffering
  ◦ Buffer pairs
  ◦ Sentinels (Guards)
• Specification of tokens using regular expressions
  ◦ Regular expressions
• Recognition of tokens using transition diagrams
  ◦ Transition diagrams
Lexical Analysis

• Lexical analysis is the first phase of a compiler. This phase reads and analyzes the program text: the text is read and divided into tokens, each of which corresponds to a symbol of the program.
• So the source code is translated into a list of tokens by removing whitespace and comments.

Source code → Lexical Analysis → List of tokens
                    ↓
              Error messages
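Role of Lexical Analysis

1. To read the input characters of the source program
2. To group them into lexemes
3. To produce a sequence of tokens used by the parser for syntax analysis (the tokens are the input of the parser)
4. To interact with the symbol table
  ◦ To insert identifiers into the symbol table

Source program → Lexical analyzer → (token) → Parser → to semantic analysis
The parser requests each token with getNextToken; both the lexical analyzer and the parser access the symbol table.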
Role of Lexical Analysis

5. To strip out:
• Comments: such as // or /* ... */
• Whitespace: such as blank, newline, tab, …

• To correlate error messages generated by the compiler with the source program
• To keep track of the number of newline characters seen
• To associate a line number with each error message
Samples of lexical errors are:
• A mistake in a lexeme, for example typing far instead of for
• An unclosed comment, for example /* read a value….
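The stripping and line-tracking tasks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name scan and the output format are hypothetical, comments use the // and /* */ forms shown on the slide, and string literals that contain comment markers are not handled.

```python
def scan(source):
    """Drop comments and whitespace, keeping a line number for every chunk.

    Returns (chunks, errors): chunks are (line, text) pairs of the
    remaining non-blank text, errors are (line, message) pairs.
    """
    chunks, errors = [], []
    line, i, n = 1, 0, len(source)
    while i < n:
        ch = source[i]
        if ch == "\n":                       # count newlines seen so far
            line += 1
            i += 1
        elif ch in " \t\r":                  # skip blanks and tabs
            i += 1
        elif source.startswith("//", i):     # line comment: skip to end of line
            while i < n and source[i] != "\n":
                i += 1
        elif source.startswith("/*", i):     # block comment
            end = source.find("*/", i + 2)
            if end == -1:                    # lexical error: unclosed comment
                errors.append((line, "unclosed comment"))
                break
            line += source.count("\n", i, end)
            i = end + 2
        else:                                # a run of ordinary characters
            j = i
            while (j < n and source[j] not in " \t\r\n"
                   and not source.startswith("//", j)
                   and not source.startswith("/*", j)):
                j += 1
            chunks.append((line, source[i:j]))
            i = j
    return chunks, errors
```

Calling scan on a program that ends with /* read a value reports an "unclosed comment" error together with the line on which the comment started.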
Role of Lexical Analysis
• Sometimes lexical analyzers are divided into a cascade of two phases:
  ◦ Lexical analysis: to produce a sequence of tokens as output (the main task)
  ◦ Scanning: the scanner is responsible for simple tasks, such as removing comments and whitespace

• There are several reasons for separating the analysis phase into lexical and syntax analysis:
  ◦ Simplicity of design: this is perhaps the most important consideration
  ◦ Compiler efficiency: liberty to apply specialized techniques that serve only lexical tasks
  ◦ Compiler portability: input device peculiarities are restricted to the lexical analyzer
Token, Lexeme, Pattern:
Token: A token is a sequence of characters that can be treated as a single logical unit of a program.
• Tokens fall into two categories:
1. Valid tokens
2. Invalid tokens

Valid tokens        Invalid tokens
Keyword             White space (blank)
Identifiers         New line (\n)
Operators           Tab space (\t)
Special symbols     Comment: // or /* ... */
Number              Pre-processor directive: #include <header.h>
Literal/Constant    Pre-processor directive: #define avg(a+b/2)


Token, Lexeme, Pattern:
Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. A pattern uses a regular expression to identify tokens.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Example: Description of tokens

Token               Sample lexemes          Informal description of pattern
Keyword (if)        if                      characters i, f
Id                  X                       letter followed by a sequence of letters and digits: l(l+d)*
Relation operator   <, <=, =, <>, >=, >     < or <= or = or <> or >= or >
Special symbols     ; : ( ) { } ' " . ,     ; or : or ( or ) or { or } or ' or " or . or ,
Number              3.14                    any numeric constant
Literal/Constant    "core"                  anything but ", surrounded by "
Cont…
• A token is a sequence of characters that belongs to a class such as identifiers, keywords, operators, special symbols, or constants. These are typical tokens. E.g. int a=20; (keyword, identifier, operator, constant, symbol)
• Non-tokens: comments, tabs, blanks, newlines, etc.
• A lexeme is a sequence of characters in the source program that is matched by the pattern for a token: int, a, =, 20, ;
• Lexemes are the words in the source program.
• Pattern: a rule describing all the lexemes that can represent a particular token in the source language.
• It is the description of a class of tokens.
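To make the token / lexeme / pattern distinction concrete, here is a minimal sketch of a pattern-driven tokenizer using Python's re module. The token class names, the small keyword list, and the function name tokenize are assumptions made for this illustration; they are not taken from the slides.

```python
import re

# Each pair is (token name, pattern). Order matters: keywords are tried
# before identifiers so that "int" is not classified as an identifier.
TOKEN_SPEC = [
    ("comment",    r"//[^\n]*|/\*.*?\*/"),
    ("whitespace", r"\s+"),
    ("keyword",    r"\b(?:int|void|while|if)\b"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("constant",   r"\d+(?:\.\d+)?"),
    ("operator",   r"<=|>=|<>|[=<>+\-*/%^]"),
    ("symbol",     r"[;:,(){}]"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)

def tokenize(text):
    """Return (token, lexeme) pairs; comments and whitespace are non-tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"lexical error at position {pos}: {text[pos]!r}")
        if match.lastgroup not in ("comment", "whitespace"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For the slide's example, tokenize("int a=20;") yields the pairs (keyword, int), (identifier, a), (operator, =), (constant, 20), (symbol, ;): the first component is the token, the second is the lexeme, and the regular expression that matched is the pattern.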
Attributes of a token
• When lexemes match the pattern, the lexical analyzer must provide additional information (attributes) with the token.
• This information is placed in a symbol-table entry, and the lexical analyzer represents these lexemes in the form of tokens as: <token-name, attribute-value>
  ◦ Token-name: an abstract symbol that is used during syntax analysis
  ◦ Attribute-value: points to an entry in the symbol table for this token
E.g. E = M * C ** 2 has 7 tokens and associated attribute-values:
E  :- <id, 1> (reference to symbol-table entry for E)
=  :- <assign_op, >
M  :- <id, 2> (reference to symbol-table entry for M)
*  :- <mult_op, >
C  :- <id, 3> (reference to symbol-table entry for C)
** :- <exp_op, >
2  :- <num, integer value 2>
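The token list above can be reproduced with a small sketch in which the attribute of an id token is an index into a symbol table. The function name tokens_with_attributes is hypothetical, and the operator names add_op, sub_op and div_op are assumptions added only so the lookup table is complete; assign_op, mult_op and exp_op come from the slide.

```python
import re

OPERATOR_NAMES = {
    "=": "assign_op", "*": "mult_op", "**": "exp_op",
    "+": "add_op", "-": "sub_op", "/": "div_op",   # assumed names
}

def tokens_with_attributes(text):
    """Return (tokens, symbol_table) for a simple arithmetic statement.

    Identifiers become ("id", n) where n is the 1-based position of the
    lexeme in the symbol table; numbers carry their integer value;
    operators carry no attribute (None).
    """
    symbol_table = []
    tokens = []
    # "**" is listed before the single-character operators so it wins.
    for lexeme in re.findall(r"[A-Za-z][A-Za-z0-9]*|\d+|\*\*|[=*+\-/]", text):
        if lexeme[0].isalpha():
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme) + 1))
        elif lexeme.isdigit():
            tokens.append(("num", int(lexeme)))
        else:
            tokens.append((OPERATOR_NAMES[lexeme], None))
    return tokens, symbol_table
```

A repeated identifier reuses its existing entry, so the attribute always points at one symbol-table record per name.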
Attributes of a token

y = 31 + 28*x  →  Lexical analyzer  →  <id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">  →  Parser

In each pair the first component is the token-name and the second is the token attribute.
Some examples of tokens

Line  Source code                          Tokens (numbered)
1.    #include <iostream.h>                handled by the preprocessor
2.    #include <conio.h>                   handled by the preprocessor
3.    void main()                          1: void   2: main   3: (   4: )
4.    // this is a hello world program     comment, invalid token (not counted)
5.    {                                    5: {
6.    cout << "Hello World!\n";            6: cout   7: <   8: <   9: "Hello World!\n"   10: ;
7.    }                                    11: }

• So the total number of tokens is 11.
Tokens of a Hello World program
void main ( ) { cout < < "Hello World!\n" ; }

Keyword   Identifier   Numbers   Literal/Constant    Special symbols   Operators
void      main         -         "Hello World!\n"    (                 <<
cout      -            -         -                   )                 -
-         -            -         -                   {                 -
-         -            -         -                   }                 -
-         -            -         -                   ;                 -
2         1            0         1                   5                 1
Some examples of tokens

Line  Source code                                   Tokens
1.    #include <iostream.h>                         handled by the preprocessor
2.    #include <conio.h>                            handled by the preprocessor
3.    void main()                                   void main ( )
4.    {                                             {
5.    int a, b, c;                                  int a , b , c ;
6.    cout << "Enter two numbers to add\n";         cout < < "Enter two .." ;
7.    cin >> a >> b;                                cin > > a > > b ;
8.    c = a * b+10^2;                               c = a * b + 10 ^ 2 ;
9.    cout << "The value of c = " << c << endl;     cout < < "The…" < < c < < endl ;
10.   }                                             }

• So the total number of tokens is 46.
Tokens of Adding two Numbers

Keyword   Identifier   Numbers   Literal/Constant   Special symbols   Operators
void      a            10        "Enter..."         (                 <<
int       b                      "The..."           )                 >>
cout      c                                         {                 +
cin       main                                      }                 *
                                                    ;                 =
                                                    ,
4         4            1         2                  6                 5

A.O   R.O   L.O   Unary.O   Assi.O   B.O
+                           =        <<
*                                    >>
                                     ^
2     0     0     0         1        3
Some examples of tokens

Line  Source code                                              Tokens
1.    #include <iostream.h>                                    handled by the preprocessor
2.    #include <conio.h>                                       handled by the preprocessor
3.    void main()                                              void main ( )
4.    {                                                        {
5.    int n,m, r;                                              int n , m , r ;
6.    cout << "Enter two No to get G,C,F\n";                   cout < < "Enter two ….." ;
7.    cin >> n >> m;                                           cin > > n > > m ;
8.    while (n!=0)                                             while ( n ! = 0 )
9.    {                                                        {
10.   r = n % m;                                               r = n % m ;
11.   m=n;                                                     m = n ;
12.   n=r;                                                     n = r ;
13.   }                                                        }
14.   cout << "G.C.F of entered numbers = " << m << endl;      cout < < "G.C.F…" < < m < < endl ;
15.   }                                                        }

• So the total number of tokens is 59.
E.g. Tokens of a G.C.F program

Keyword   Identifier   Numbers   Literal/Constant   Special symbols   Operators
void      n            0         "Enter..."         (                 <<
int       m                      "G.C.F..."         )                 >>
cout      r                                         {                 %
cin       main                                      }                 !=
while                                               ;                 =
                                                    ,
5         4            1         2                  6                 5

A.O   R.O   L.O   Unary.O   Assi.O   B.O
%     !=                    =        <<
                                     >>
1     1     0     0         1        2
Input Buffer
• A buffer contains data that is stored for a short amount of time, typically in the computer's memory (RAM), for the purpose of holding the data before it is used.
• The input buffer is also commonly known as the input area or input block.
• When referring to computer memory, the input buffer is a location that holds all incoming information before it continues to the CPU for processing.
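• Reading the input is the task of the lexical analyzer, which reads the entire source program one character at a time.
• There are three general approaches to the implementation of a lexical analyzer:
1. Use a lexical-analyzer generator, such as the Lex or Flex compiler, to produce the lexical analyzer from a regular-expression-based specification.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.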
Input Buffer
• To speed up reading the source program, the input is buffered using different techniques:
  ◦ One-buffer scheme: the last lexeme under process will be overwritten when we reload the buffer.
  ◦ Two-buffer scheme (buffer pairs): handles large lookahead safely.
  ◦ Sentinels: an improvement which saves time checking for the buffer end.
• Two pointers to the input are maintained:
  ◦ lexemeBegin: marks the beginning of the current lexeme
  ◦ forward: scans ahead until a pattern match is found
Buffer Pair
• The input buffer has two halves with N characters in each half.
• N might be the size of a disk block, like 1024 or 4096.
• Mark the end of the input stream with a special character eof.
• Maintain two pointers into the buffer marking the beginning and end of the current lexeme:
  ◦ lexemeBegin: marks the beginning of the current lexeme
  ◦ forward: scans ahead until a pattern match is found
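Simple Input Buffering algorithm
• Initially both pointers point to the first character of the next lexeme to be found.
• The forward pointer is advanced until a match for a pattern is found. Once the lexeme is processed, set both pointers to the character following the lexeme.
• Code to advance the forward pointer:
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else forward := forward + 1;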
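The pseudocode for advancing the forward pointer can be tried out with the following sketch. It is an illustration only: the class name BufferPair, the helper read_all, the tiny default half size n=4, and the use of None to mark slots past the end of the input are all assumptions, and the lexemeBegin pointer is left out to keep the example short.

```python
import io

class BufferPair:
    """Two buffer halves of n characters each, reloaded alternately."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        self.buf = [None] * (2 * n)
        self._reload(0)              # fill the first half before scanning
        self.forward = 0

    def _reload(self, start):
        data = self.stream.read(self.n)
        for k in range(self.n):
            # None marks positions beyond the end of the input
            self.buf[start + k] = data[k] if k < len(data) else None

    def current(self):
        return self.buf[self.forward]

    def advance(self):
        if self.forward == self.n - 1:          # forward at end of first half
            self._reload(self.n)                # reload second half
            self.forward += 1
        elif self.forward == 2 * self.n - 1:    # forward at end of second half
            self._reload(0)                     # reload first half
            self.forward = 0                    # move to beginning of first half
        else:
            self.forward += 1

def read_all(text, n=4):
    """Read the whole text back through the buffer pair."""
    pair = BufferPair(io.StringIO(text), n)
    out = []
    while pair.current() is not None:
        out.append(pair.current())
        pair.advance()
    return "".join(out)
```

Note that every call to advance performs the two position tests in addition to the end-of-input check made by the caller; reducing that cost is the motivation for sentinels.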
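Sentinels (Guards)

• Without sentinels, each advance of the forward pointer needs two tests:
  ◦ To test if it is at the end of the buffer
  ◦ To determine what character is read (a multiway branch)
• Sentinel: a special character added at each buffer end.
• Goal: Optimize the common case by reducing the number of tests to one per advance of fwd.
• Technique: Extend each buffer half to hold a sentinel at the end.
  ◦ This is a special character that cannot occur in a source program (eof).
  ◦ It signals the need for some special action (fill the other buffer half, or terminate processing).
  ◦ When eof appears other than at the end of a buffer half, it means that the input is at an end.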
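Simple Sentinels algorithm
• For almost every character read we perform three tests:
1. Is the character eof?
2. Is the forward pointer at the end of the first half?
3. Is the forward pointer at the end of the second half?
• This can be reduced to one test by using sentinels.
• Add an eof character past the end of each half.
• Use the following code to advance the forward pointer:
    forward := forward + 1;
    if forward↑ = eof then begin
        if forward at end of first half then begin
            reload second half;
            forward := forward + 1
        end
        else if forward at end of second half then begin
            reload first half;
            move forward to beginning of first half
        end
        else terminate lexical analysis  /* eof within a half marks the end of input */
    end;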
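A possible Python rendering of the sentinel scheme is sketched below. The names SentinelBuffer and read_all, the half size n=4, and the choice of the NUL character as the eof sentinel are assumptions for this example; the sketch also assumes the sentinel character never occurs in the source text.

```python
import io

EOF = "\0"   # assumed sentinel: must not occur in the source program

class SentinelBuffer:
    """Buffer pair with one extra sentinel slot after each half."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # slots n and 2n+1 are never overwritten and always hold the sentinel
        self.buf = [EOF] * (2 * n + 2)
        self._reload(0)
        self.forward = -1            # advance() steps onto the first character

    def _reload(self, start):
        data = self.stream.read(self.n)
        for k in range(self.n):
            # a short read leaves EOF inside the half: the end of the input
            self.buf[start + k] = data[k] if k < len(data) else EOF

    def advance(self):
        """Return the next character, or None at the end of the input."""
        self.forward += 1
        if self.buf[self.forward] == EOF:            # the only common-case test
            if self.forward == self.n:               # sentinel after first half
                self._reload(self.n + 1)
                self.forward += 1
            elif self.forward == 2 * self.n + 1:     # sentinel after second half
                self._reload(0)
                self.forward = 0
            else:                                    # eof inside a half
                return None
            if self.buf[self.forward] == EOF:        # the reloaded half is empty
                return None
        return self.buf[self.forward]

def read_all(text, n=4):
    """Read the whole text back through the sentinel buffer."""
    buffer = SentinelBuffer(io.StringIO(text), n)
    out = []
    while (ch := buffer.advance()) is not None:
        out.append(ch)
    return "".join(out)
```

For an ordinary character only the comparison with EOF is executed; the position tests run only when a sentinel is actually hit.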
Specification of Tokens
• There are two dual notions for the specification and recognition of tokens:
  ◦ Generative approach (grammar or regular expression)
  ◦ Recognition approach (automaton)
  ◦ Many theorems transform one approach automatically into the other
Specification of Tokens
Definition of tokens: tokens are defined by regular expressions.
• The regular expression is an important notation for specifying patterns.
• Each pattern matches a set of strings, so regular expressions serve as names for a set of strings.
• Programming language tokens can be described by regular languages.
E.g. letter → A|B|…|Z|a|b|…|z
     digit → 0|1|2|…|9
     Id → letter(letter|digit)*
     Id → {x, y, count, d12, h45r5}
• The specification of regular expressions is an example of a recursive definition.
• Regular languages are easy to understand and have efficient implementations.
Specification of Tokens
• Let us understand how language theory defines the following terms:

Symbol:
• An abstract entity that we shall not define formally. Letters, digits, and punctuation characters are examples of symbols.

Alphabet:
• A finite set of symbols out of which we build larger structures.
• An alphabet is typically denoted using the Greek letter ∑, e.g.
  {0,1} is the set of binary alphabets,
  {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of hexadecimal alphabets,
  {a-z, A-Z} is the set of English language alphabets.

String:
• Any finite sequence of alphabets is called a string.
• The length of a string is the total number of occurrences of alphabets in it.
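Specification of Tokens
• The empty string is denoted by the Greek letter ε.
• The length of a string s is denoted |s|.
• e.g. |0010| = 4, |oo7| = 3, |ε| = 0

Language:
• A language is considered as a finite set of strings over some finite set of alphabets.
• Computer languages are considered as finite sets, and mathematical set operations can be performed on them.
• Finite languages can be described by means of regular expressions.
• The empty set is a language.
• The set containing only the empty string, {ε}, is a language.
• The set of all strings over an alphabet is a language.
e.g. for the regular expression r = ab, L(r) = L(ab) = {ab}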
Specification of Tokens

• There are important operations that can be applied to languages.
• Let L be the set of letters {a, b, c} and M be the set of digits {1, 2}.
Union: L ∪ M = { x | x is in L or in M }
    E.g. L ∪ M = {a, b, c, 1, 2}
Concatenation: LM = { st | s is in L and t is in M }
    E.g. LM = {a1, a2, b1, b2, c1, c2}
Kleene closure: L* = ∪_{i=0}^{∞} L^i, which means zero or more concatenations of L.
    E.g. L* is the set of all strings of the letters a, b, c, including the empty string: {ε, a, b, c, aa, ab, …}
Positive closure: L+ = ∪_{i=1}^{∞} L^i, which means one or more concatenations of L.
    E.g. L+ is the set of all strings of one or more letters, without the empty string: {a, b, c, aa, ab, …}
Power: L^n, where n is any number of concatenations of L with itself.
    E.g. L^0 = {ε}; L^2 = {2-letter strings} = {aa, ab, ac, ba, bb, bc, ca, cb, cc}
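These operations map directly onto Python sets of strings, which gives a quick way to check the examples. In this sketch the empty string "" plays the role of ε, the function names are hypothetical, and because L* and L+ are infinite the closures are only enumerated up to a chosen power max_n.

```python
def concat(L, M):
    """Concatenation LM = { st | s in L and t in M }."""
    return {s + t for s in L for t in M}

def power(L, n):
    """L^n: n-fold concatenation of L with itself; L^0 = {""}."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def closure_up_to(L, max_n, positive=False):
    """Finite part of L* (or of L+ when positive=True): union of L^i, i <= max_n."""
    start = 1 if positive else 0
    out = set()
    for i in range(start, max_n + 1):
        out |= power(L, i)
    return out

L = {"a", "b", "c"}
M = {"1", "2"}
union = L | M                    # {a, b, c, 1, 2}
pairs = concat(L, M)             # {a1, a2, b1, b2, c1, c2}
two_letter = power(L, 2)         # the nine 2-letter strings
```

The only difference between the two closures is whether the zeroth power, and with it the empty string, is included.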
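Regular Expression
• Regular expressions are a notation for constructing languages (sets of strings) from an alphabet.
• Regular expressions are an algebraic way to describe languages.
• Regular expressions are used in the lexical analysis phase to describe patterns of tokens.
• The language defined by a regular grammar is known as a regular language.
• The grammar defined by regular expressions is known as a regular grammar.
Regular expression: A notation (rule) that allows us to define a pattern in a high-level language.
Regular language: Each regular expression r denotes a language L(r) (the set of sentences relating to the regular expression r).
Note: Each word in a program can be expressed in a regular expression, and each regular expression denotes a language.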
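Regular Expression

The rules of regular expressions can be categorized into basis and induction.

Basis:
• ε is a regular expression, and L(ε) = {ε} is its language.
• If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}.

Induction: Suppose r and s are regular expressions denoting the languages L(r) and L(s) respectively; then
• (r)|(s) is a regular expression denoting L(r) ∪ L(s)
• (r)(s) is a regular expression denoting L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Note: A language defined by a regular expression is called a regular set.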
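Regular Expression

• The precedence rules of the regular expression operators are:
  ◦ Closure (*), concatenation (.), and union (|) are left associative.
  ◦ * has the highest precedence.
  ◦ Concatenation has the second highest precedence.
  ◦ | has the lowest precedence of all.
Note: All are left-associative. Parentheses are dropped as allowed by the precedence rules.

Example: Let r denote {1, 2}, s denote {a, b}, and t denote {7, f}.
The expression (s | (t (r*))) can be written without parentheses as s|tr*: r* is applied first, then the concatenation tr*, then the union s|tr*.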
Algebraic Properties of RE

Law                          Description
r|s = s|r                    | is commutative
r|(s|t) = (r|s)|t            | is associative
(rs)t = r(st)                concatenation is associative
r(s|t) = rs|rt               concatenation distributes over |
(s|t)r = sr|tr
εr = r                       ε is the identity element for concatenation
rε = r
r* = (r|ε)*                  relation between * and ε
r** = r*                     * is idempotent
Note: A language may be represented by one or more equivalent regular expressions. An idempotent operation is one that has no additional effect if it is called more than once with the same input.
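Notational shorthands
• Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.
One or more instances: The unary postfix operator + means "one or more instances of". If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.
    (r)+ denotes (L(r))+, e.g. digit+
Zero or more instances: the * operator.
    r* is shorthand for ε|r+
Zero or one instance: the ? operator.
    r? is shorthand for r|ε, e.g. (E(+|-)?digits)?
Character classes:
    [a-z] denotes a|b|c|…|z, e.g. [A-Za-z], [A-Za-z0-9]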

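Notational shorthands

If x is a regular expression, then:
• x* means zero or more occurrences of x,
  i.e., it can generate {ε, x, xx, xxx, xxxx, …}

• x+ means one or more occurrences of x,
  i.e., it can generate {x, xx, xxx, xxxx, …}, or x.x*

• x? means at most one occurrence of x,
  i.e., it can generate either {x} or {ε}

[a-z] is all lower-case alphabets of the English language.
[A-Z] is all upper-case alphabets of the English language.
[0-9] is all natural digits used in mathematics.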
Notational shorthands
• letter = [a-z] or [A-Z]
• digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
• sign = + | -

• Digits = (digit)+
• Letters = (letter)+
• Identifier = (letter)(letter | digit)*
• Decimal = (sign)?(digit)+
• Real number = ((digit)+ "." (digit)*) | ("." (digit)+)
• Keyword = the characters of the keyword itself
• Relation operator = < | > | <= | >= | = | <>
The only problem left with the lexical analyzer is how to verify the validity of a regular expression used in specifying the patterns of keywords of a language. A well-accepted solution is to use finite automata for verification.
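The definitions above translate almost one-to-one into Python's re syntax. The following is a minimal sketch; the dictionary PATTERNS and the helper classify are names invented for this example, and fullmatch is used so that a pattern must cover the entire lexeme.

```python
import re

LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"
SIGN = r"[+-]"

PATTERNS = {
    # Identifier = (letter)(letter | digit)*
    "identifier": re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT})*"),
    # Decimal = (sign)?(digit)+
    "decimal": re.compile(rf"{SIGN}?{DIGIT}+"),
    # Real number = ((digit)+ "." (digit)*) | ("." (digit)+)
    "real": re.compile(rf"{DIGIT}+\.{DIGIT}*|\.{DIGIT}+"),
    # Relation operator = < | > | <= | >= | = | <>
    "relop": re.compile(r"<=|>=|<>|<|>|="),
}

def classify(lexeme):
    """Return the names of all patterns that match the whole lexeme."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.fullmatch(lexeme)]
```

A lexeme such as 12ab matches none of the patterns, which is how a lexical error would show up in this sketch.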
Finite automata
• A finite automaton is a recognizer (machine) that is used to recognize a pattern.
• It accepts or rejects inputs based on an already defined set of strings, known as the language of the automaton. It also has a finite number of states (accepting or rejecting).
• A finite automaton answers "yes" or "no" for each possible string.
• An FA is a mathematical model which consists of:
  ◦ A set of states S
  ◦ A set of input symbols ∑
  ◦ A transition function move (δ)
  ◦ An initial state S0
  ◦ A set of final (accepting) states F
Types of Finite automata
• There are two types of finite automata:
  ◦ Deterministic finite automata (DFA): each state has exactly one edge leaving it for each symbol.
  ◦ Non-deterministic finite automata (NFA): there are no restrictions on the edges leaving a state. There can be several edges with the same symbol as label, and some edges can be labeled with epsilon (ε).
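As an illustration of a DFA used as a recognizer, the sketch below simulates a two-state automaton for the identifier pattern letter(letter|digit)*. The state names s0 and s1, the grouping of characters into the classes letter, digit and other, and the function names are assumptions for this example. Missing entries in the transition table stand for an implicit dead (rejecting) state, so the table stays small while the automaton remains deterministic.

```python
import string

def symbol_class(ch):
    """Map an input character to the symbol class used on the DFA edges."""
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    return "other"

START, IN_ID = "s0", "s1"

# Transition function: (state, symbol class) -> next state.
DELTA = {
    (START, "letter"): IN_ID,
    (IN_ID, "letter"): IN_ID,
    (IN_ID, "digit"): IN_ID,
}
ACCEPTING = {IN_ID}

def accepts(text):
    """Answer yes (True) or no (False): is text an identifier?"""
    state = START
    for ch in text:
        state = DELTA.get((state, symbol_class(ch)))
        if state is None:        # no edge: fall into the dead state
            return False
    return state in ACCEPTING
```

For example, accepts("d12") answers yes, while accepts("12d") answers no because the start state has no edge labeled digit.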
Error and recovery
 Error recovery is a crucial aspect of compiler
design and plays a vital role in ensuring robustness
and usability of programming languages.
 Error recovery mechanisms are employed to handle
syntax and semantic errors encountered during the
compilation process, allowing the compiler to
provide meaningful diagnostic messages and
continue processing the source code wherever
possible.
Types of Error
Errors are either syntactic or semantic:
 Syntax Errors: These occur when the source code violates
the grammar rules of the programming language.
 Examples include missing semicolons, mismatched
parentheses, and incorrect keyword usage.
 Semantic Errors: These occur when the source code is
syntactically correct but violates the semantic rules of the
programming language.
 Examples include type mismatches, undefined variables,
and invalid function calls.
 Typing error, compilation error, and runtime
