CT3

The document discusses compilation techniques, focusing on the lexical analyzer, which breaks source input into tokens while discarding irrelevant characters. It explains regular definitions, transition diagrams, and the implementation of the lexical analyzer, including methods for token generation and handling various types of tokens. Additionally, it covers the complexities of greedy and non-greedy regular expressions and the structure of tokens in programming languages.

Uploaded by Istin Codruta

Compilation Techniques (3)
The Lexical Analyzer

Regular Definitions
Tokens
Transition Diagrams
Implementation
The Lexical Analyzer
Splits the source/input characters (from files, strings, …) into tokens.
From the point of view of the next compiler phases, a lexical unit is an indivisible unit of information (much as atoms were once considered indivisible).
Characters and symbols which are not relevant in the later phases (spaces, comments) can be discarded.
Examples: numbers, identifiers, strings, …
Regular definitions
A regular expression with an assigned name.
 Most of the time a lexical unit definition is given as a regular definition (RD).
 Each lexical unit has its own RD.
 Some RDs can be fragments which help form other lexical units, without being lexical units themselves.
 Inside an RD, letters can name other RDs, so a distinction must be made between plain characters and RD names. The names can be written in bold or underlined, and the characters can be put in single or double quotes.
 Spaces are not significant inside an RD, so it can be formatted for better readability.
Regular definitions - example
fragment LETTER: [a-zA-Z_] ;
fragment DIGIT: [0-9] ;
ID: LETTER ( LETTER | DIGIT )* ;
INT: DIGIT+ ;
REAL: INT '.' INT ;
WHILE: 'while' ;
LINECOMMENT: '//' [^\r\n\0]* ;
 Many compiler tools use the convention of starting lexical definitions with uppercase letters, in order to differentiate them from the syntactic definitions.
 For a single character, the quotes are equivalent to a character class: '.' == [.]
 LETTER and DIGIT are only fragments and do not form tokens. INT is both a fragment of REAL and a token in its own right.
 LINECOMMENT is not significant for the next phases and will be discarded.
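As a sketch of how such a regular definition maps to code, the ID rule above can be matched by hand. The function names here are illustrative, not part of the course material:

```c
#include <ctype.h>
#include <stddef.h>

/* LETTER: [a-zA-Z_] */
static int isLetter(char ch) {
    return isalpha((unsigned char)ch) || ch == '_';
}

/* ID: LETTER ( LETTER | DIGIT )*
   Returns the length of the identifier starting at s, or 0 if none. */
size_t matchId(const char *s) {
    if (!isLetter(s[0])) return 0;
    size_t n = 1;
    while (isLetter(s[n]) || isdigit((unsigned char)s[n])) n++;
    return n;
}
```

Note how the fragment LETTER becomes a helper function, used by the rule that actually forms a token.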
Transition Diagrams (TD)
 A graphical method of representing regular definitions (RD).
 A directed graph where each transition consumes at most one character.
 For all RDs there is a single initial state (state 0).
 From each state, only one transition can be made with a specific character (DFA). RDs with a common prefix start on a common path for that prefix and split into distinct definitions afterwards.
 Each state can have at most one "else" transition, which does not consume any character and is always considered last.
 Each final state corresponds to a fully recognized RD. These states are represented with a double circle and with the RD name next to them. No transitions can originate from final states.
 Error states can be placed inside a TD wherever only specific characters are allowed.
Regular definitions to transition diagrams
[Figure: transition diagram fragments for characters and character classes, concatenation e1e2, alternation e1|e2, and the repetitions e*, e+ and e?]
 RDs which do not form meaningful tokens for the next phases of the compiler (spaces, comments) have their final state coincide with the initial state. In this way they can be consumed without generating tokens.
 If the TD becomes too complicated because many RDs must be added, these can be represented separately, by considering the initial state (0) as the common point. In this case the above properties must still be preserved. Lexical fragments and the identifiers which appear inside an RD must be expanded to their own constituents (they are not put as names/references in the TD).
Transition Diagrams - example
[Figure: a transition diagram with the common initial state 0. A '{' transition starts COMMENT, looping on [^}\0] and returning to state 0 on '}'; '\0' inside it leads to an error state ("input characters end before the end of the comment"). A digit transition starts INT/REAL: digits loop in a final state for INT; '.' continues toward REAL, where a digit must follow ("after the decimal point a digit must follow", otherwise "invalid character"), then digits loop in a final state for REAL.]

fragment DIGIT: [0-9] ;
INT: DIGIT+ ;
REAL: INT [.] INT ;
COMMENT: '{' [^}]* '}' ;
Non-greedy definitions
 By default a regular expression (RE) is greedy: it tries to consume as many characters as possible. Sometimes this behavior is not desirable and we want the RE to consume a minimal number of characters.
 Example: create an RE for C multi-line comments (/*…*/)
 Trial 1: '/*' .* '*/'
If there are multiple comments, because of its greediness this RE will match all the characters from the first '/*' to the last '*/', including all content between comments.
 Trial 2: '/*' [^*]* '*/'
This RE is no longer greedy, but it does not accept comments with '*' inside them.
 Trial 3: '/*' ( [^*] | '*' [^/] )* '*/'
This RE does not recognize comments which end with an even number of '*' (because for each inner '*' another character must follow before '*/').
 Trial 4: '/*' ( [^*] | '*'+ [^*/] )* '*/'
This RE recognizes any sequence of inner '*', but it does not recognize the end of the comment in the case of multiple ending '*'.
 Solution: '/*' ( [^*] | '*'+ [^*/] )* '*'+ '/'
Token
A pair consisting of a token name and an optional attribute.
 The token name (code, type) is an abstract symbol representing a kind of lexical unit (an identifier, a keyword, an integer, …). In most cases a token is referred to by its name.
 The attribute is used to differentiate between tokens with the same name but possibly different lexemes (ex: a particular integer). The attribute can be the lexeme itself or the result of processing it.
 Lexeme - the source characters which were matched by the token definition (ex: lightSpeed for an ID, "hi everybody" for a STRING)
Other token components
Source position - information regarding the token position in the source: line and column number, file name, …
Linking fields (ex: to create a list of tokens)
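Putting the pieces together, a token structure along these lines could be used. This is a sketch: the field names and the union layout are ours, not prescribed by the course:

```c
#include <stddef.h>

/* An illustrative token structure: name (code), position,
   optional attribute and a linking field. */
typedef struct Token {
    int code;               /* token name: ID, INT, KFOR, ... */
    int line, column;       /* position in the source file */
    union {                 /* optional attribute, depending on code */
        char  *text;        /* for ID, STRING: the (processed) lexeme */
        long   i;           /* for INT: the converted value */
        double r;           /* for REAL: the converted value */
    };
    struct Token *next;     /* linking field: list of tokens */
} Token;

/* Convenience constructor for the fixed fields. */
Token makeToken(int code, int line, int column) {
    Token t = {0};
    t.code = code;
    t.line = line;
    t.column = column;
    t.next = NULL;
    return t;
}
```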
Numbers primary processing
 For tokens which represent numbers, the lexeme can be directly converted to a number. Let c1…cn be the lexeme characters forming an integer K in a base B (decimal, binary, hexadecimal, …). It can be written:
K = c1*B^(n-1) + c2*B^(n-2) + … + cn*B^0
 An iterative algorithm converts a sequence of digits in base B from a vector v into a number K.
 asciiToInt is a function which converts an ASCII character to a number: '0'->0, '1'->1, … 'a'->10, 'F'->15
 For real numbers, the decimal digits can be multiplied by negative powers of B and added to the number.
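The original listing of the iterative algorithm is not reproduced above; it can be sketched as Horner's scheme, which computes the same sum K = c1*B^(n-1) + … + cn*B^0 one digit at a time (function names are ours):

```c
/* asciiToInt: '0'->0 ... '9'->9, 'a'/'A'->10 ... 'f'/'F'->15 */
int asciiToInt(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/* Converts the n digit characters in v, written in base B, to a number
   using Horner's scheme: K = (...((c1*B + c2)*B + c3)...)*B + cn */
long digitsToInt(const char *v, int n, int B) {
    long K = 0;
    for (int i = 0; i < n; i++)
        K = K * B + asciiToInt(v[i]);
    return K;
}
```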
Strings primary processing
 Many languages accept, inside strings ("…") or character constants ('.'), control characters (ESCAPE sequences) which have a different meaning. For example, in C '\t' means a TAB and not the two characters '\' and 't'.
 Other characters have different meanings in Windows and in Linux when using the standard input/output functions in text mode: '\n' in Windows is translated as the ASCII codes 13,10 (\r\n) and in Linux only as the ASCII code 10 (\n). Inside strings in Windows '\n' is kept as a single character.
 These sequences must be converted according to their meaning, and a final string (or character) must be formed.
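The conversion can be done in place, since the processed string is never longer than the lexeme. A sketch handling only a few C escapes (the function name is ours):

```c
/* Converts ESCAPE sequences in-place: the 4 source characters
   a \ t b become the 3 characters a TAB b. */
void convertEscapes(char *s) {
    char *dst = s;
    while (*s) {
        if (s[0] == '\\' && s[1]) {
            switch (s[1]) {
                case 'n': *dst++ = '\n'; break;
                case 't': *dst++ = '\t'; break;
                case 'r': *dst++ = '\r'; break;
                default:  *dst++ = s[1]; break; /* \\ , \" , \' ... */
            }
            s += 2;                 /* consumed both source characters */
        } else {
            *dst++ = *s++;          /* ordinary character: copy as is */
        }
    }
    *dst = '\0';
}
```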
Token structure - example
[Figure: the fields of an example token structure]
Source to tokens example
[Figure: a source fragment and the tokens produced from it; from the token list below the source is likely: int i,c=0; for(i=0;i<n;i++) if(v[i]==10) c++;]

KINT, ID:i, COMMA, ID:c, ASSIGN, INT:0, SEMICOLON, KFOR,
LPAR, ID:i, ASSIGN, INT:0, SEMICOLON, ID:i, LESS, ID:n,
SEMICOLON, ID:i, INC, RPAR, KIF, LPAR, ID:v, LBRACKET, ID:i,
RBRACKET, EQUAL, INT:10, RPAR, ID:c, INC, SEMICOLON

 All spaces and comments were discarded.
 Tokens which can have multiple lexemes (ID, INT) must keep their specific lexeme, possibly in a processed form (an attribute).
 The keyword int has the code KINT and the integer constants have the code INT.
The implementation of the lexical analyzer
 For simple compilers and programming languages, the lexical analyzer can be implemented as a subroutine getNextToken() which on every call returns the code of the next token. The associated token structure is stored in a variable currentToken, if some of its fields are needed.
 This subroutine can be called directly from the syntactic analyzer. The advantage is that less memory is needed and it is faster, because no token list is constructed.
 For complex languages, the direct use of getNextToken() in the syntactic analyzer has significant drawbacks, because sometimes it is necessary to backtrack to a former token. In this case the information of (possibly multiple) currentToken values must be saved and restored (or recomputed). This direct use also increases the coupling between the lexical and syntactic analyzers and makes their debugging and development harder.
 For the above reasons, it is preferable to call getNextToken() in a loop and build a list with all the tokens. This list will be used in the next phases. In this case getNextToken() can put into the token list, at each call, a dynamically allocated token structure.
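The list-building driver described above can be sketched as follows. The getNextToken() here is deliberately simplified (one token per non-space character) just to show the pattern; all names besides getNextToken() are ours:

```c
#include <stdlib.h>

enum { END = 0, CHARTOK = 1 };           /* simplified token codes */

typedef struct Token { int code; char ch; struct Token *next; } Token;

static const char *pInput;               /* current position in the input */
static Token *tokens = NULL, *lastToken = NULL;

/* Simplified getNextToken(): one CHARTOK per non-space character.
   Appends a dynamically allocated token to the list and returns its code. */
int getNextToken(void) {
    while (*pInput == ' ') pInput++;     /* discard spaces */
    Token *tk = calloc(1, sizeof(Token));
    if (*pInput) { tk->code = CHARTOK; tk->ch = *pInput++; }
    else tk->code = END;
    if (lastToken) lastToken->next = tk; else tokens = tk;
    lastToken = tk;
    return tk->code;
}

/* Builds the whole token list, as preferred for complex languages. */
void tokenize(const char *input) {
    pInput = input;
    while (getNextToken() != END) {}
}
```

The syntactic analyzer can then walk (and re-walk) the tokens list freely, without any save/restore of currentToken.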
getNextToken() with explicit states
 Let state be a variable which holds the current state. The initial state is 0.
 Inside an infinite loop, at each iteration:
◦ For the current state, check the input character against all possible transitions
◦ If a transition for that character is found, advance to the next character and set the new state
◦ If no suitable transition is found but the current state has an "else" transition, set the new state without advancing to the next character
◦ If the current state is a final state, create a new token, set its fields and return its code
 This algorithm is straightforward to implement when a TD is available. It is quite verbose, and for future development the original TD is required in order to see the original states.
getNextToken() with explicit states
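The original code listing for this slide is not reproduced here. A sketch of an explicit-state getNextToken() for the INT/REAL fragment of the earlier diagram might look like this (token codes and helper names are ours; for brevity it returns only the code, without building a token structure):

```c
#include <ctype.h>

enum { END = 0, INT = 1, REAL = 2, ERROR = -1 };

static const char *pCh;                  /* current input character */

void setInput(const char *s) { pCh = s; }

/* Explicit-state recognizer for INT: [0-9]+ and REAL: [0-9]+ '.' [0-9]+ */
int getNextToken(void) {
    int state = 0;
    for (;;) {
        char ch = *pCh;
        switch (state) {
            case 0:
                if (ch == ' ') pCh++;                      /* stay in state 0 */
                else if (isdigit((unsigned char)ch)) { pCh++; state = 1; }
                else if (ch == '\0') return END;
                else return ERROR;
                break;
            case 1:                                        /* [0-9]+ */
                if (isdigit((unsigned char)ch)) pCh++;
                else if (ch == '.') { pCh++; state = 2; }
                else return INT;                           /* "else" -> final */
                break;
            case 2:                                        /* just after '.' */
                if (isdigit((unsigned char)ch)) { pCh++; state = 3; }
                else return ERROR;   /* a digit must follow the decimal point */
                break;
            case 3:
                if (isdigit((unsigned char)ch)) pCh++;
                else return REAL;                          /* "else" -> final */
                break;
        }
    }
}
```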
getNextToken() with implicit states
 This method does not need an explicit TD, but for cases with many common prefixes it is good to have one.
 Inside an infinite loop, at each iteration:
◦ Read a character. Depending on this character, advance inside a regular definition (RD) or on a common prefix path.
◦ While still in an RD, read the next characters and advance further inside that RD (or inside a common prefix)
◦ If the end of an RD is reached and this RD is not a prefix of another RD, create a token and return its code. If this RD is a prefix of other RDs, first try to advance on those RDs.
 This algorithm is more complex and requires more coding effort. The code is shorter and easier to maintain. In many cases it is also easier to read.
getNextToken() with implicit states
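The original code listing for this slide is not reproduced here. A sketch of an implicit-state getNextToken() recognizing INT and REAL (names are ours): the position in the code itself plays the role of the state variable, and INT is handled as a prefix of REAL:

```c
#include <ctype.h>

enum { END = 0, INT = 1, REAL = 2, ERROR = -1 };

static const char *pCh;                  /* current input character */

void setInput(const char *s) { pCh = s; }

/* Implicit-state recognizer: each loop below *is* a state of the TD. */
int getNextToken(void) {
    while (*pCh == ' ') pCh++;                        /* discard spaces */
    if (*pCh == '\0') return END;
    if (!isdigit((unsigned char)*pCh)) return ERROR;
    while (isdigit((unsigned char)*pCh)) pCh++;       /* INT: [0-9]+ */
    if (*pCh != '.') return INT;          /* INT did not continue to REAL */
    pCh++;                                /* INT is a prefix: try REAL */
    if (!isdigit((unsigned char)*pCh)) return ERROR;  /* digit must follow '.' */
    while (isdigit((unsigned char)*pCh)) pCh++;
    return REAL;
}
```

Compared with the explicit-state version, there is no switch over a state variable: the control flow encodes the diagram directly, which keeps the code shorter.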
Bibliography reading
 Compilers. Principles, Techniques and Tools
3.1, 3.4, 3.10
