CT3

The document discusses compilation techniques, focusing on the lexical analyzer, which breaks source input into tokens while discarding irrelevant characters. It explains regular definitions, transition diagrams, and the implementation of the lexical analyzer, including methods for token generation and handling various types of tokens. Additionally, it covers the complexities of greedy and non-greedy regular expressions and the structure of tokens in programming languages.

Uploaded by Istin Codruta

Compilation Techniques (3)
The Lexical Analyzer

Regular Definitions
Tokens
Transition Diagrams
Implementation
The Lexical Analyzer
Splits the source/input characters (from files, strings, …) into tokens.
From the point of view of the next compiler phases, a lexical unit is an indivisible unit of information (much as atoms were once considered indivisible).
Characters and symbols which are not relevant in the later phases (spaces, comments) can be discarded.
Examples: numbers, identifiers, strings, …
Regular definitions
A regular expression with an assigned name.
 Most of the time a lexical unit definition is given as a regular definition (RD).
 Each lexical unit has its own RD.
 Some RDs can be fragments which help form other lexical units, without being lexical units themselves.
 Inside an RD, letters can name other RDs, so a distinction must be made between plain characters and RD names. The names can be written in bold or underlined, and the characters can be put in single or double quotes.
 Spaces are not significant inside an RD, so it can be formatted for better readability.
Regular definitions - example
fragment LETTER: [a-zA-Z_] ;
fragment DIGIT: [0-9] ;
ID: LETTER ( LETTER | DIGIT )* ;
INT: DIGIT+ ;
REAL: INT '.' INT ;
WHILE: 'while' ;
LINECOMMENT: '//' [^\r\n\0]* ;
 Many compiler tools use the convention of starting lexical definitions with uppercase letters, in order to differentiate them from the syntactic definitions.
 For a single character, the quotes are equivalent to a character class: '.' == [.]
 LETTER and DIGIT are only fragments and do not form tokens. INT is both a fragment of REAL and a token in its own right.
 LINECOMMENT is not significant for the next phases and will be discarded.
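As a sketch of how such a regular definition maps to code, the ID rule above can be matched by hand. The function names here are illustrative, not part of the course material:

```c
#include <ctype.h>
#include <stddef.h>

/* LETTER: [a-zA-Z_] */
static int isLetter(char ch) {
    return isalpha((unsigned char)ch) || ch == '_';
}

/* ID: LETTER ( LETTER | DIGIT )*
   Returns the length of the identifier starting at s, or 0 if none. */
size_t matchId(const char *s) {
    if (!isLetter(s[0])) return 0;
    size_t n = 1;
    while (isLetter(s[n]) || isdigit((unsigned char)s[n])) n++;
    return n;
}
```

Note how the fragment LETTER becomes a helper function, used by the rule that actually forms a token.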
Transition Diagrams (TD)
 A graphical method of representing regular definitions (RD).
 A directed graph where each transition consumes at most one character.
 For all RDs there is a single initial state (state 0).
 From each state, only one transition can be made with a specific character (DFA). RDs with a common prefix start on a common path for that prefix and split into distinct definitions afterwards.
 Each state can have at most one "else" transition, which does not consume any character and is always considered last.
 Each final state corresponds to a fully recognized RD. These states are represented with a double circle and with the RD name next to them. No transitions can originate from final states.
 Error states can be placed inside a TD wherever only specific characters are allowed.
Regular definitions to transition diagrams
[Figure: transition diagram fragments for characters and character classes, concatenation e1e2, alternation e1|e2, and the repetitions e*, e+ and e?]
 RDs which do not form meaningful tokens for the next phases of the compiler (spaces, comments) have their final state coincide with the initial state. In this way they can be consumed without generating tokens.
 If the TD becomes too complicated because many RDs must be added, these can be represented separately, by considering the initial state (0) as the common point. In this case the above properties must still be preserved. Lexical fragments and the identifiers which appear inside an RD must be expanded to their own constituents (they are not put as names/references in the TD).
Transition Diagrams - example
[Figure: a transition diagram with the common initial state 0. A '{' transition starts COMMENT, looping on [^}\0] and returning to state 0 on '}'; '\0' inside it leads to an error state ("input characters end before the end of the comment"). A digit transition starts INT/REAL: digits loop in a final state for INT; '.' continues toward REAL, where a digit must follow ("after the decimal point a digit must follow", otherwise "invalid character"), then digits loop in a final state for REAL.]

fragment DIGIT: [0-9] ;
INT: DIGIT+ ;
REAL: INT [.] INT ;
COMMENT: '{' [^}]* '}' ;
Non-greedy definitions
 By default a regular expression (RE) is greedy: it tries to consume as many characters as possible. Sometimes this behavior is not desirable and we want the RE to consume a minimal number of characters.
 Example: create an RE for C multi-line comments (/*…*/)
 Trial 1: '/*' .* '*/'
If there are multiple comments, because of its greediness this RE will match all the characters from the first '/*' to the last '*/', including all content between comments.
 Trial 2: '/*' [^*]* '*/'
This RE is no longer greedy, but it does not accept comments with '*' inside them.
 Trial 3: '/*' ( [^*] | '*' [^/] )* '*/'
This RE does not recognize comments which end with an even number of '*' (because for each inner '*' another character must follow before '*/').
 Trial 4: '/*' ( [^*] | '*'+ [^*/] )* '*/'
This RE recognizes any sequence of inner '*', but it does not recognize the end of the comment in the case of multiple ending '*'.
 Solution: '/*' ( [^*] | '*'+ [^*/] )* '*'+ '/'
Token
A pair consisting of a token name and an optional attribute.
 The token name (code, type) is an abstract symbol representing a kind of lexical unit (an identifier, a keyword, an integer, …). In most cases a token is referred to by its name.
 The attribute is used to differentiate between tokens with the same name but possibly different lexemes (ex: a particular integer). The attribute can be the lexeme itself or the result of processing it.
 Lexeme - the source characters which were matched by the token definition (ex: lightSpeed for an ID, "hi everybody" for a STRING)
Other token components
Source position - information regarding the token position in the source: line and column number, file name, …
Linking fields (ex: to create a list of tokens)
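Putting the pieces together, a token structure along these lines could be used. This is a sketch: the field names and the union layout are ours, not prescribed by the course:

```c
#include <stddef.h>

/* An illustrative token structure: name (code), position,
   optional attribute and a linking field. */
typedef struct Token {
    int code;               /* token name: ID, INT, KFOR, ... */
    int line, column;       /* position in the source file */
    union {                 /* optional attribute, depending on code */
        char  *text;        /* for ID, STRING: the (processed) lexeme */
        long   i;           /* for INT: the converted value */
        double r;           /* for REAL: the converted value */
    };
    struct Token *next;     /* linking field: list of tokens */
} Token;

/* Convenience constructor for the fixed fields. */
Token makeToken(int code, int line, int column) {
    Token t = {0};
    t.code = code;
    t.line = line;
    t.column = column;
    t.next = NULL;
    return t;
}
```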
Numbers primary processing
 For tokens which represent numbers, the lexeme can be directly converted to a number. Let c1…cn be the lexeme characters forming an integer K in a base B (decimal, binary, hexadecimal, …). It can be written:
K = c1*B^(n-1) + c2*B^(n-2) + … + cn*B^0
 An iterative algorithm converts a sequence of digits in base B from a vector v into a number K.
 asciiToInt is a function which converts an ASCII character to a number: '0'->0, '1'->1, … 'a'->10, 'F'->15
 For real numbers, the decimal digits can be multiplied by negative powers of B and added to the number.
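The original listing of the iterative algorithm is not reproduced above; it can be sketched as Horner's scheme, which computes the same sum K = c1*B^(n-1) + … + cn*B^0 one digit at a time (function names are ours):

```c
/* asciiToInt: '0'->0 ... '9'->9, 'a'/'A'->10 ... 'f'/'F'->15 */
int asciiToInt(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/* Converts the n digit characters in v, written in base B, to a number
   using Horner's scheme: K = (...((c1*B + c2)*B + c3)...)*B + cn */
long digitsToInt(const char *v, int n, int B) {
    long K = 0;
    for (int i = 0; i < n; i++)
        K = K * B + asciiToInt(v[i]);
    return K;
}
```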
Strings primary processing
 Many languages accept, inside strings ("…") or character constants ('.'), control characters (ESCAPE sequences) which have a different meaning. For example, in C '\t' means a TAB and not the two characters '\' and 't'.
 Other characters have different meanings in Windows and in Linux when using the standard input/output functions in text mode: '\n' in Windows is translated as the ASCII codes 13,10 (\r\n) and in Linux only as the ASCII code 10 (\n). Inside strings in Windows '\n' is kept as a single character.
 These sequences must be converted according to their meaning, and a final string (or character) must be formed.
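The conversion can be done in place, since the processed string is never longer than the lexeme. A sketch handling only a few C escapes (the function name is ours):

```c
/* Converts ESCAPE sequences in-place: the 4 source characters
   a \ t b become the 3 characters a TAB b. */
void convertEscapes(char *s) {
    char *dst = s;
    while (*s) {
        if (s[0] == '\\' && s[1]) {
            switch (s[1]) {
                case 'n': *dst++ = '\n'; break;
                case 't': *dst++ = '\t'; break;
                case 'r': *dst++ = '\r'; break;
                default:  *dst++ = s[1]; break; /* \\ , \" , \' ... */
            }
            s += 2;                 /* consumed both source characters */
        } else {
            *dst++ = *s++;          /* ordinary character: copy as is */
        }
    }
    *dst = '\0';
}
```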
Token structure - example
[Figure: the fields of an example token structure]
Source to tokens example
[Figure: a source fragment and the tokens produced from it; from the token list below the source is likely: int i,c=0; for(i=0;i<n;i++) if(v[i]==10) c++;]

KINT, ID:i, COMMA, ID:c, ASSIGN, INT:0, SEMICOLON, KFOR,
LPAR, ID:i, ASSIGN, INT:0, SEMICOLON, ID:i, LESS, ID:n,
SEMICOLON, ID:i, INC, RPAR, KIF, LPAR, ID:v, LBRACKET, ID:i,
RBRACKET, EQUAL, INT:10, RPAR, ID:c, INC, SEMICOLON

 All spaces and comments were discarded.
 Tokens which can have multiple lexemes (ID, INT) must keep their specific lexeme, possibly in a processed form (an attribute).
 The keyword int has the code KINT and the integer constants have the code INT.
The implementation of the lexical analyzer
 For simple compilers and programming languages, the lexical analyzer can be implemented as a subroutine getNextToken() which on every call returns the code of the next token. The associated token structure is stored in a variable currentToken, if some of its fields are needed.
 This subroutine can be called directly from the syntactic analyzer. The advantage is that less memory is needed and it is faster, because no token list is constructed.
 For complex languages, the direct use of getNextToken() in the syntactic analyzer has significant drawbacks, because sometimes it is necessary to backtrack to a former token. In this case the information of (possibly multiple) currentToken values must be saved and restored (or recomputed). This direct use also increases the coupling between the lexical and syntactic analyzers and makes their debugging and development harder.
 For the above reasons, it is preferable to call getNextToken() in a loop and build a list with all the tokens. This list will be used in the next phases. In this case getNextToken() can put into the token list, at each call, a dynamically allocated token structure.
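The list-building driver described above can be sketched as follows. The getNextToken() here is deliberately simplified (one token per non-space character) just to show the pattern; all names besides getNextToken() are ours:

```c
#include <stdlib.h>

enum { END = 0, CHARTOK = 1 };           /* simplified token codes */

typedef struct Token { int code; char ch; struct Token *next; } Token;

static const char *pInput;               /* current position in the input */
static Token *tokens = NULL, *lastToken = NULL;

/* Simplified getNextToken(): one CHARTOK per non-space character.
   Appends a dynamically allocated token to the list and returns its code. */
int getNextToken(void) {
    while (*pInput == ' ') pInput++;     /* discard spaces */
    Token *tk = calloc(1, sizeof(Token));
    if (*pInput) { tk->code = CHARTOK; tk->ch = *pInput++; }
    else tk->code = END;
    if (lastToken) lastToken->next = tk; else tokens = tk;
    lastToken = tk;
    return tk->code;
}

/* Builds the whole token list, as preferred for complex languages. */
void tokenize(const char *input) {
    pInput = input;
    while (getNextToken() != END) {}
}
```

The syntactic analyzer can then walk (and re-walk) the tokens list freely, without any save/restore of currentToken.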
getNextToken() with explicit states
 Let state be a variable which holds the current state. The initial state is 0.
 Inside an infinite loop, at each iteration:
◦ For the current state, check the input character against all possible transitions
◦ If a transition for that character is found, advance to the next character and set the new state
◦ If no suitable transition is found but the current state has an "else" transition, set the new state without advancing to the next character
◦ If the current state is a final state, create a new token, set its fields and return its code
 This algorithm is straightforward to implement when a TD is available. It is quite verbose, and for future development the original TD is required in order to see the original states.
getNextToken() with explicit states
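The original code listing for this slide is not reproduced here. A sketch of an explicit-state getNextToken() for the INT/REAL fragment of the earlier diagram might look like this (token codes and helper names are ours; for brevity it returns only the code, without building a token structure):

```c
#include <ctype.h>

enum { END = 0, INT = 1, REAL = 2, ERROR = -1 };

static const char *pCh;                  /* current input character */

void setInput(const char *s) { pCh = s; }

/* Explicit-state recognizer for INT: [0-9]+ and REAL: [0-9]+ '.' [0-9]+ */
int getNextToken(void) {
    int state = 0;
    for (;;) {
        char ch = *pCh;
        switch (state) {
            case 0:
                if (ch == ' ') pCh++;                      /* stay in state 0 */
                else if (isdigit((unsigned char)ch)) { pCh++; state = 1; }
                else if (ch == '\0') return END;
                else return ERROR;
                break;
            case 1:                                        /* [0-9]+ */
                if (isdigit((unsigned char)ch)) pCh++;
                else if (ch == '.') { pCh++; state = 2; }
                else return INT;                           /* "else" -> final */
                break;
            case 2:                                        /* just after '.' */
                if (isdigit((unsigned char)ch)) { pCh++; state = 3; }
                else return ERROR;   /* a digit must follow the decimal point */
                break;
            case 3:
                if (isdigit((unsigned char)ch)) pCh++;
                else return REAL;                          /* "else" -> final */
                break;
        }
    }
}
```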
getNextToken() with implicit states
 This method does not need an explicit TD, but for cases with many common prefixes it is good to have one.
 Inside an infinite loop, at each iteration:
◦ Read a character. Depending on this character, advance inside a regular definition (RD) or on a common prefix path.
◦ While still in an RD, read the next characters and advance further inside that RD (or inside a common prefix)
◦ If the end of an RD is reached and this RD is not a prefix of another RD, create a token and return its code. If this RD is a prefix of other RDs, first try to advance on those RDs.
 This algorithm is more complex and requires more coding effort. The code is shorter and easier to maintain. In many cases it is also easier to read.
getNextToken() with implicit states
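The original code listing for this slide is not reproduced here. A sketch of an implicit-state getNextToken() recognizing INT and REAL (names are ours): the position in the code itself plays the role of the state variable, and INT is handled as a prefix of REAL:

```c
#include <ctype.h>

enum { END = 0, INT = 1, REAL = 2, ERROR = -1 };

static const char *pCh;                  /* current input character */

void setInput(const char *s) { pCh = s; }

/* Implicit-state recognizer: each loop below *is* a state of the TD. */
int getNextToken(void) {
    while (*pCh == ' ') pCh++;                        /* discard spaces */
    if (*pCh == '\0') return END;
    if (!isdigit((unsigned char)*pCh)) return ERROR;
    while (isdigit((unsigned char)*pCh)) pCh++;       /* INT: [0-9]+ */
    if (*pCh != '.') return INT;          /* INT did not continue to REAL */
    pCh++;                                /* INT is a prefix: try REAL */
    if (!isdigit((unsigned char)*pCh)) return ERROR;  /* digit must follow '.' */
    while (isdigit((unsigned char)*pCh)) pCh++;
    return REAL;
}
```

Compared with the explicit-state version, there is no switch over a state variable: the control flow encodes the diagram directly, which keeps the code shorter.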
Bibliography reading
 Compilers. Principles, Techniques and Tools
3.1, 3.4, 3.10
