CT3
(3)
The Lexical Analyzer
Regular Definitions
Tokens
Transition Diagrams
Implementation
The Lexical Analyzer
Splits the source/input characters (from
files, strings, …) into tokens
A lexical unit (token), from the point of view of the
next compiler phases, is an indivisible
unit of information (analogous to how atoms
were once considered indivisible)
Some characters and symbols that are
not relevant to later phases (spaces,
comments) can be discarded
Examples of lexical units: numbers, identifiers, strings, …
Regular definitions
A regular expression with an assigned name
Most of the time, the definition of a lexical unit is given as a
regular definition (RD)
Each lexical unit has its own RD
Some RDs are fragments that help form other
lexical units without being lexical units themselves
Inside an RD, letters can name other RDs, so a
distinction must be made between plain characters
and RD names. The names can be bold or underlined,
and the characters can be put in single or double
quotes
Spaces are not significant inside an RD, so it can be
formatted for better readability
Regular definitions - example
fragment LETTER: [a-zA-Z_] ;
fragment DIGIT: [0-9] ;
ID: LETTER ( LETTER | DIGIT )* ;
INT: DIGIT+ ;
REAL: INT '.' INT ;
WHILE: 'while' ;
LINECOMMENT: '//' [^\r\n\0]* ;
Many compiler tools use the convention of starting the names of lexical
definitions with an uppercase letter, to differentiate them
from syntactic definitions
For a single character, quotes are equivalent to a character
class: '.' == [.]
LETTER and DIGIT are only fragments and do not form
tokens. INT is both a fragment of REAL and a token in its own right
LINECOMMENT is not significant for the next phases and
is discarded
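The regular definitions above can be tried out with Python's `re` module. This is only a sketch: the fragments become ordinary strings that are interpolated into the larger patterns, and the helper `matches` is a hypothetical name introduced here for illustration.

```python
import re

# Fragments: building blocks that never become tokens on their own.
LETTER = r"[a-zA-Z_]"
DIGIT = r"[0-9]"

# Lexical units, composed from the fragments by plain string interpolation.
ID = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
INT = rf"{DIGIT}+"
REAL = rf"{INT}\.{INT}"          # '.' must be escaped (or written as [.])
WHILE = r"while"
LINECOMMENT = r"//[^\r\n\x00]*"  # \x00 plays the role of \0 in the definition


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text is described by the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(ID, "_count1"))        # True
print(matches(ID, "1abc"))           # False: an ID must start with LETTER
print(matches(REAL, "3.14"))         # True
print(matches(REAL, "3."))           # False: an INT is required after '.'
print(matches(LINECOMMENT, "// x"))  # True
```

Because LETTER and DIGIT are only ever used inside other patterns, they play the same role as the `fragment` definitions: they shape tokens but are never matched as tokens themselves.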
Transition Diagrams (TD)
A graphical method of representing regular definitions (RD)
A directed graph in which each transition consumes at most one
character
All RDs share a single initial state (state 0)
From each state, at most one transition can be made on a given
character (DFA). RDs with a common prefix start on a
common path for that prefix and then split into
distinct paths.
Each state can have at most one “else” transition, which does
not consume any character and is always considered last
Each final state corresponds to a fully recognized RD. Final
states are drawn with a double circle, with the RD name
next to them. No transitions can originate from final states.
Error states can be placed in the TD where only
specific characters are allowed
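The properties above can be made concrete by writing a small TD down as data. The sketch below is an assumption made for illustration, not a prescribed encoding: it stores the consuming transitions, the "else" transitions and the final states of a diagram for INT and REAL in three dictionaries. All names (`TRANSITIONS`, `ELSE`, `FINAL`, `step`, `run`) and the state numbers are hypothetical.

```python
DIGITS = frozenset("0123456789")

# state -> list of (set of characters, target state); these consume a character
TRANSITIONS = {
    0: [(DIGITS, 1)],
    1: [(DIGITS, 1), (frozenset("."), 2)],  # common prefix of INT and REAL
    2: [(DIGITS, 3)],                       # only a digit is allowed here
    3: [(DIGITS, 3)],
}
ELSE = {1: 5, 3: 4}            # "else" transitions: tried last, consume nothing
FINAL = {4: "REAL", 5: "INT"}  # final states, labelled with the RD name


def is_deterministic(transitions) -> bool:
    """Check that no state has two transitions on the same character."""
    for edges in transitions.values():
        seen = set()
        for chars, _target in edges:
            if seen & chars:
                return False
            seen |= chars
    return True


def finals_have_no_exits() -> bool:
    """Check that no transition originates from a final state."""
    return all(s not in TRANSITIONS and s not in ELSE for s in FINAL)


def step(state, ch):
    """Return (next state, consumed?) or None when the TD has no way forward."""
    for chars, target in TRANSITIONS.get(state, []):
        if ch in chars:
            return target, True
    if state in ELSE:
        return ELSE[state], False
    return None


def run(text):
    """Walk the TD from state 0; return (RD name, lexeme) or None on error."""
    text += "\0"  # end-of-input marker, never consumed by any transition
    state, i = 0, 0
    while state not in FINAL:
        result = step(state, text[i])
        if result is None:
            return None
        state, consumed = result
        if consumed:
            i += 1
    return FINAL[state], text[:i]


print(run("12"))     # ('INT', '12')
print(run("3.14x"))  # ('REAL', '3.14')
print(run("3."))     # None
```

State 2 has no "else" transition, so it behaves as one of the places where only specific characters are allowed: anything other than a digit after the '.' is an error.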
Regular definitions to transition diagrams
[Figure: transition diagram patterns for characters and character classes, e*, e+, e1e2, e1|e2, e?]
RDs that do not form meaningful tokens for the next phases of the
compiler (spaces, comments) have their final state in the initial state. In
this way they can be consumed without generating tokens
If the TD becomes too complicated because many RDs must be added, they
can be drawn separately, with the initial state (0) as the
common point. In this case the above properties must still be
preserved. The lexical fragments and identifiers that appear
inside an RD must be expanded into their constituents (they are
not put on the TD as names/references).
Transition Diagrams – example
[Figure: transition diagram with states 0, 1, 2, 3, 4 (final, REAL) and 6. Transition labels: {, }, [^}\0], \0 (annotated: input characters end before the end of the comment), 0-9, '.']
◦ For the current state check the input character against all
possible transitions
◦ If a transition for that character is found, advance to the next
character and set the new state
◦ If no suitable transition is found but the current state has an
“else” transition, set the new state without advancing to the
next character
◦ If the current state is a final state, create a new token, set its
fields and return its code
This algorithm is straightforward to implement when a TD is
available. However, the code is quite verbose, and later changes
require the original TD in order to see what each state means.
getNextToken() with explicit states
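The steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's reference implementation: the `Token` and `Lexer` classes, the snake_case method name `get_next_token`, the token codes and the state numbers are all hypothetical. It recognizes INT and REAL, consumes spaces and `{ ... }` comments without producing tokens, and uses `\0` as the end-of-input marker.

```python
from dataclasses import dataclass

DIGITS = "0123456789"


@dataclass
class Token:
    code: str       # hypothetical token codes: "INT", "REAL", "END"
    text: str = ""


class Lexer:
    def __init__(self, source: str):
        self.src = source + "\0"  # '\0' marks the end of the input
        self.pos = 0

    def get_next_token(self) -> Token:
        state = 0
        start = self.pos
        while True:
            ch = self.src[self.pos]
            if state == 0:
                if ch in DIGITS:
                    state = 1
                    self.pos += 1
                elif ch in " \t\r\n":
                    self.pos += 1          # spaces end in state 0: no token
                    start = self.pos
                elif ch == "{":
                    state = 6
                    self.pos += 1
                elif ch == "\0":
                    return Token("END")
                else:
                    raise SyntaxError(f"unexpected character {ch!r}")
            elif state == 1:
                if ch in DIGITS:
                    self.pos += 1
                elif ch == ".":
                    state = 2
                    self.pos += 1
                else:
                    state = 5              # "else": nothing is consumed
            elif state == 2:
                if ch in DIGITS:
                    state = 3
                    self.pos += 1
                else:
                    raise SyntaxError("digit expected after '.'")
            elif state == 3:
                if ch in DIGITS:
                    self.pos += 1
                else:
                    state = 4              # "else": nothing is consumed
            elif state == 4:               # final state
                return Token("REAL", self.src[start:self.pos])
            elif state == 5:               # final state
                return Token("INT", self.src[start:self.pos])
            elif state == 6:               # inside a { ... } comment
                if ch == "}":
                    state = 0              # comments also end in state 0
                    self.pos += 1
                    start = self.pos
                elif ch == "\0":
                    raise SyntaxError("input ends before the end of the comment")
                else:
                    self.pos += 1


lexer = Lexer("12 {note} 3.5")
print(lexer.get_next_token())  # Token(code='INT', text='12')
print(lexer.get_next_token())  # Token(code='REAL', text='3.5')
print(lexer.get_next_token())  # Token(code='END', text='')
```

Each `elif state == n` branch corresponds to one node of the TD, which is why the code is easy to derive from a diagram but hard to follow without it.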
getNextToken() with implicit states
This method does not need an explicit TD, but when there are
many common prefixes it is useful to have one.
Inside an infinite loop, at each iteration:
◦ Read a character. Depending on this character advance
inside a regular definition (RD) or on a common prefix
path.
◦ While still in an RD, read the next characters and advance
further inside that RD (or inside a common prefix)
◦ If the end of an RD is reached and this RD is not a prefix
of another RD, create a token and return its code. If this
RD is a prefix of another RD, first try to advance along that
longer RD.
This algorithm is more complex and takes more effort to write,
but the resulting code is shorter and easier to maintain. In many
cases it is also easier to read.
getNextToken() with implicit states
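A possible shape for the implicit-state version is sketched below, using the ID, INT, REAL, WHILE and LINECOMMENT definitions from the earlier example. The position in the code takes the place of the state variable. The `Token` and `Lexer` classes, the method names and the token codes are hypothetical, and recognizing WHILE by comparing a finished identifier with the keyword is one assumed way of handling the common prefix between WHILE and ID.

```python
import string
from dataclasses import dataclass

LETTERS = string.ascii_letters + "_"
DIGITS = string.digits


@dataclass
class Token:
    code: str       # hypothetical codes: "ID", "WHILE", "INT", "REAL", "END"
    text: str = ""


class Lexer:
    def __init__(self, source: str):
        self.src = source + "\0"  # '\0' marks the end of the input
        self.pos = 0

    def _skip_digits(self) -> None:
        while self.src[self.pos] in DIGITS:
            self.pos += 1

    def get_next_token(self) -> Token:
        src = self.src
        while True:
            start = self.pos
            ch = src[self.pos]
            if ch in " \t\r\n":
                self.pos += 1                      # discarded, loop again
            elif ch == "\0":
                return Token("END")
            elif src.startswith("//", self.pos):   # LINECOMMENT, discarded
                while src[self.pos] not in "\r\n\0":
                    self.pos += 1
            elif ch in LETTERS:                    # ID, or the keyword WHILE
                while src[self.pos] in LETTERS or src[self.pos] in DIGITS:
                    self.pos += 1
                text = src[start:self.pos]
                return Token("WHILE" if text == "while" else "ID", text)
            elif ch in DIGITS:
                self._skip_digits()
                # INT is a prefix of REAL: first try to advance on REAL
                if src[self.pos] == "." and src[self.pos + 1] in DIGITS:
                    self.pos += 1
                    self._skip_digits()
                    return Token("REAL", src[start:self.pos])
                return Token("INT", src[start:self.pos])
            else:
                raise SyntaxError(f"unexpected character {ch!r}")


lexer = Lexer("while x1 12 3.5 // rest")
print(lexer.get_next_token())  # Token(code='WHILE', text='while')
print(lexer.get_next_token())  # Token(code='ID', text='x1')
print(lexer.get_next_token())  # Token(code='INT', text='12')
print(lexer.get_next_token())  # Token(code='REAL', text='3.5')
print(lexer.get_next_token())  # Token(code='END', text='')
```

Each branch handles one RD from start to finish, so adding a new RD usually means adding one branch instead of renumbering states.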
Bibliography reading
Compilers. Principles, Techniques and Tools
3.1, 3.4, 3.10