Lecture 3- Lexical Analysis (1)
Compiler construction
The role of Lexical analyser
• Lexical analysis is the first phase of a compiler.
• It takes the modified source code produced by language
preprocessors, which is written in the form of
sentences.
• The lexical analyzer breaks this text into a
series of tokens, discarding any whitespace and
comments in the source code.
• If the lexical analyzer finds a token invalid, it
generates an error.
• The lexical analyzer works closely with the syntax
analyzer.
• It reads character streams from the source code,
checks for legal tokens, and passes the data to the
syntax analyzer on demand.
• Identify tokens within a string.
• Return the lexeme of a found token, as well as
Lexical Analysis
• Lexical analysis consists of two stages of processing,
which are as follows:
• Scanning - Scanning is the process of reading
the input source code and converting it into a
sequence of tokens. It is often the first phase of a
compiler or interpreter.
• The goal is to break down the source code into
meaningful symbols that represent the smallest
elements of the language (like keywords,
identifiers, literals, operators, etc.).
• Tokenization - Tokenization is the specific step
within scanning where the raw input is divided
into tokens. Each token is a categorized block of
text that represents a specific type of element in
the programming language.
Lexical analysis or scanning
A lexical analyser, often referred to as a lexer or
tokenizer, is a component of a compiler or
interpreter that breaks down source code into its
fundamental building blocks, known as tokens.
Functions of a Lexical Analyser
• Tokenization: The primary function is to read the
input source code and convert it into a sequence
of tokens. Each token is a string with an assigned
meaning.
• Classification: Tokens are classified into
categories such as keywords, identifiers, literals,
operators, and punctuation.
• Removing Whitespace and Comments: The lexer
typically ignores whitespace and comments, as
they do not affect the program's execution.
Lexemes, Tokens, Patterns
Lexical analysis
• Token Structure: A token typically consists of:
• Type: The category of the token (e.g., keyword, identifier,
operator).
• Value: The actual text or value represented by the token.
• Example 1 - Given the input code:
if (x > 10) {
    return x;
}
The lexer might produce the following tokens:
if (keyword)
( (punctuation)
x (identifier)
> (operator)
10 (literal)
) (punctuation)
{ (punctuation)
return (keyword)
x (identifier)
; (punctuation)
} (punctuation)
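The token list for Example 1 can be reproduced with a small regular-expression based lexer. The following is only a minimal sketch: the names TOKEN_SPEC and tokenize, and the exact patterns, are assumptions made for illustration and are not part of the lecture.

```python
import re

# Hypothetical token categories and patterns, chosen to cover Example 1.
# Order matters: keywords are tried before identifiers.
TOKEN_SPEC = [
    ("keyword",     r"\b(?:if|return)\b"),
    ("identifier",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("literal",     r"\d+"),
    ("operator",    r"[<>]=?|==|!=|[+\-*/=]"),
    ("punctuation", r"[(){};]"),
    ("skip",        r"\s+"),   # whitespace is discarded
    ("error",       r"."),     # anything else is an invalid token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (type, value) pairs for the given source text."""
    tokens = []
    for m in MASTER.finditer(source):
        kind = m.lastgroup
        if kind == "skip":
            continue
        if kind == "error":
            raise SyntaxError(f"invalid character {m.group()!r} at offset {m.start()}")
        tokens.append((kind, m.group()))
    return tokens
```

Calling tokenize on the three lines of Example 1 yields the eleven (type, value) pairs listed above, in the same order; an unexpected character such as @ raises an error instead of producing a token.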
Lexical analysis
• Example 2: Consider the following simple piece of
code:
int x = 10 + 20;
Tokens Generated:
int (Keyword)
x (Identifier)
= (Assignment Operator)
10 (Integer Literal)
+ (Addition Operator)
20 (Integer Literal)
; (Semicolon)
Lexemes Generated:
int, x, =, 10, +, 20, ;
Symbol table
• An essential function of a compiler is to build the
symbol table, where the identifiers used in the
program are recorded along with various properties:
• the storage allocated for the ID; its type; its scope (where in the
program it is valid); the number and types of its
arguments (in case the ID is a procedure name);
etc.
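One possible way to hold these properties is a dictionary keyed by scope and name. This is a sketch under stated assumptions: the class names SymbolInfo and SymbolTable, the field names, and the two-level scope lookup are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class SymbolInfo:
    name: str
    type: str
    scope: str
    offset: int                                       # storage location (assumed: a byte offset)
    param_types: list = field(default_factory=list)   # used only for procedure names

class SymbolTable:
    def __init__(self):
        self._entries = {}                            # (scope, name) -> SymbolInfo

    def insert(self, info):
        key = (info.scope, info.name)
        if key in self._entries:
            raise KeyError(f"{info.name!r} already declared in scope {info.scope!r}")
        self._entries[key] = info

    def lookup(self, name, scope, enclosing="global"):
        # Try the current scope first, then fall back to the enclosing scope.
        return self._entries.get((scope, name)) or self._entries.get((enclosing, name))
```

A real compiler would usually support arbitrarily nested scopes; the single fallback to a global scope here is a simplification.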
Interaction of the Lexical Analyzer with
the Parser
• The lexical analyzer passes the parser a stream of <token, tokenval> pairs, where tokenval is the token attribute:
<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
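The "on demand" interaction can be sketched with a Python generator: the parser asks for the next pair, and the lexer scans only as far as needed to produce it. The input string y := 31 + 28*x, the function name lexer, and the patterns are assumptions chosen so that the output matches the token stream shown above.

```python
import re

# Hypothetical patterns; token names follow the <token, tokenval> pairs above.
PATTERN = re.compile(
    r"\s*(?:(?P<id>[A-Za-z]\w*)|(?P<num>\d+)|(?P<assign>:=)|(?P<op>[+*]))"
)

def lexer(source):
    """Yield (token, tokenval) pairs one at a time, as the parser requests them."""
    pos = 0
    end = len(source.rstrip())
    while pos < end:
        m = PATTERN.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.lastgroup == "id":
            yield ("id", m.group("id"))
        elif m.lastgroup == "num":
            yield ("num", int(m.group("num")))
        elif m.lastgroup == "assign":
            yield ("assign", None)          # no attribute needed
        else:
            yield (m.group("op"), None)     # the operator is its own token

tokens = lexer("y := 31 + 28*x")
first = next(tokens)                        # the parser demands one token
rest = list(tokens)                         # and later the remaining ones
```

Here first is the pair for the identifier y, and rest holds the remaining six pairs; nothing beyond the requested token is scanned until the parser asks again.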
Issues in Lexical Analysis
Lexical analysis has the following issues:
• Lookahead
• Lookahead refers to examining upcoming input
(characters for the lexer, tokens for the parser) before
making a decision. This is particularly useful in contexts where
what comes next can influence the interpretation of
what has just been read.
• Lookahead is required to decide when one token will
end and the next token will begin.
• Ambiguities
• Ambiguities occur when the same input can be
interpreted in multiple ways. This can lead to
confusion during parsing, as the parser may not be
able to determine the correct structure or meaning of
the code.
• Lexical analysis programs written with lex accept
ambiguous specifications and choose the longest
match possible at each input point. If several rules match
the same longest string, the rule listed first is chosen.
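The longest-match rule can be illustrated with a short scanner that tries every rule at the current position and keeps the longest lexeme, preferring the earlier rule on a tie (the tie-breaking convention used by lex). This is a sketch, not lex itself: the rule list RULES and the function names are hypothetical.

```python
import re

# Hypothetical rule list; order is used only to break ties.
# Inside relop, longer alternatives come first because Python's | is ordered.
RULES = [
    ("if",    re.compile(r"if")),
    ("id",    re.compile(r"[a-z][a-z0-9]*")),
    ("num",   re.compile(r"[0-9]+")),
    ("relop", re.compile(r"<=|<>|<|>=|>|=")),
]

def longest_match(text, pos):
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer lexeme replaces the current best,
        # so on a tie the earlier rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def scan(text):
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        token = longest_match(text, pos)
        if token is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        tokens.append(token)
        pos += len(token[1])
    return tokens
```

For the input if ifx <= 10, the scanner has to look past the f to see that ifx is an identifier rather than the keyword if, and past the < to see that the operator is <= rather than <.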
Finite Automaton and Lexical Analysis
• The lexical analyzer uses a finite automaton
(deterministic or non-deterministic) to recognize
lexemes based on predefined patterns.
Tokens Specification
Let us see how formal language theory defines
the following terms:
• Alphabets
• An alphabet is any finite set of symbols: {0,1} is the binary
alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet, and {a-z, A-Z} is the
alphabet of English letters.
• Strings
• Any finite sequence of symbols from an alphabet is called a string.
The length of a string is the total number of symbol occurrences
in it, e.g., the length of the string Compiler
is 8, denoted by |Compiler| = 8. A string having
no symbols, i.e. a string of zero length, is known as
the empty string and is denoted by ε (epsilon).
• Language
• A language is a set of strings (finite or infinite)
over some finite alphabet.
Regular Expressions
• The lexical analyzer needs to scan and identify only the
valid strings/tokens/lexemes that belong to the
language in hand.
• It searches for the patterns defined by the language rules.
• Regular expressions can express these
languages by defining a pattern for finite strings of
symbols.
• The grammar defined by regular expressions is known as
a regular grammar.
• The language defined by a regular grammar is known as
a regular language.
• Regular expressions are an important notation for specifying
patterns.
• Each pattern matches a set of strings, so regular
expressions serve as names for sets of strings.
• Programming language tokens can be described by
regular languages.
Finite automata (FA)
• A finite automaton (FA) is the simplest machine
for recognizing patterns.
δ:
        0    1
  q0    q1   q0
  q1    q0   q1
Example #2:
Q = {q0, q1, q2}
Σ = {a, b, c}
Start state is q0
F = {q2}
δ:
        a    b    c
  q0    q0   q0   q1
  q1    q1   q1   q2
  q2    q2   q2   q2
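A deterministic automaton given as Q, Σ, start state, F and δ can be simulated directly from its transition table. The sketch below encodes the table of this example as a nested dictionary; the names DELTA and dfa_accepts are assumptions for illustration. Reading the table, the machine moves one state forward on each c and stays put otherwise, so it appears to accept exactly the strings containing at least two c's.

```python
# Transition table of the example: DELTA[state][symbol] -> next state.
DELTA = {
    "q0": {"a": "q0", "b": "q0", "c": "q1"},
    "q1": {"a": "q1", "b": "q1", "c": "q2"},
    "q2": {"a": "q2", "b": "q2", "c": "q2"},
}
START, ACCEPTING = "q0", {"q2"}

def dfa_accepts(word):
    """Run the DFA on word and report whether it ends in an accepting state."""
    state = START
    for symbol in word:
        if symbol not in DELTA[state]:
            return False            # symbol outside the alphabet {a, b, c}
        state = DELTA[state][symbol]
    return state in ACCEPTING
```

For instance, dfa_accepts("abcbc") is true, while dfa_accepts("abc") is false because only one c has been read.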
Example #2: pair of 0’s or pair of 1’s as substring
Q = {q0, q1, q2, q3, q4}
Σ = {0, 1}
Start state is q0
F = {q2, q4}
δ:
        0          1
  q0    {q0, q3}   {q0, q1}
  q1    {}         {q2}
  q2    {q2}       {q2}
  q3    {q4}       {}
  q4    {q4}       {q4}
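Because δ returns a set of states here, the automaton is non-deterministic. One common way to run such a machine is to track the set of all states it could be in after each symbol. The following sketch encodes the table above; the names DELTA and nfa_accepts are assumptions for illustration, and missing dictionary entries stand for the empty set {}.

```python
# Transition table of the NFA: (state, symbol) -> set of next states.
DELTA = {
    ("q0", "0"): {"q0", "q3"}, ("q0", "1"): {"q0", "q1"},
    ("q1", "1"): {"q2"},
    ("q2", "0"): {"q2"},       ("q2", "1"): {"q2"},
    ("q3", "0"): {"q4"},
    ("q4", "0"): {"q4"},       ("q4", "1"): {"q4"},
}
START, ACCEPTING = "q0", {"q2", "q4"}

def nfa_accepts(word):
    """Track every state the NFA could be in; accept if any of them is final."""
    current = {START}
    for symbol in word:
        current = set().union(*(DELTA.get((q, symbol), set()) for q in current))
    return bool(current & ACCEPTING)
```

The string 0101 is rejected because it contains neither 00 nor 11, while 1011 is accepted through the path that guesses the pair of 1's at the end.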
Example #3
Finite automata (fa)
• Even number of a’s:
• The regular expression for an even
number of a’s is (b + ab*ab*)*.
• We can construct a finite automaton
as shown in Figure 1.
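A regular expression like this can be sanity-checked by brute force: enumerate every string over {a, b} up to some length and compare the expression against a direct count of the a's. The sketch below does this with Python's re module, where | plays the role of +; the helper names and the length bound are assumptions, and a check up to a fixed length is evidence rather than a proof.

```python
import re
from itertools import product

# (b + ab*ab*)* written in Python syntax.
EVEN_AS = re.compile(r"(?:b|ab*ab*)*")

def all_strings(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def mismatches(max_len=8):
    """Strings on which the regex and the direct parity test disagree."""
    return [w for w in all_strings("ab", max_len)
            if bool(EVEN_AS.fullmatch(w)) != (w.count("a") % 2 == 0)]
```

An empty result from mismatches() means the expression and the parity test agree on every string up to the chosen length.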
Finite automata (fa)
• Strings with ‘ab’ as a substring:
• The regular expression for strings
with ‘ab’ as a substring is (a +
b)*ab(a + b)*.
• We can construct a finite automaton as
shown in Figure 2.
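The correspondence between the expression and an automaton can be checked mechanically. The sketch below builds one possible three-state DFA for this language and compares it with the regular expression on all short strings. The state names s0, s1, s2 and the table are assumptions and need not match the figure referenced above.

```python
import re
from itertools import product

# (a + b)*ab(a + b)* written in Python syntax.
CONTAINS_AB = re.compile(r"(?:a|b)*ab(?:a|b)*")

# Hypothetical states: s0 = no progress, s1 = last symbol was 'a', s2 = 'ab' seen.
DELTA = {
    "s0": {"a": "s1", "b": "s0"},
    "s1": {"a": "s1", "b": "s2"},
    "s2": {"a": "s2", "b": "s2"},
}

def dfa_accepts(word):
    state = "s0"
    for symbol in word:
        state = DELTA[state][symbol]
    return state == "s2"

def agree(max_len=8):
    """True if the DFA and the regex give the same answer on all short strings."""
    words = ("".join(p) for n in range(max_len + 1) for p in product("ab", repeat=n))
    return all(dfa_accepts(w) == bool(CONTAINS_AB.fullmatch(w)) for w in words)
```

Once s2 is reached the automaton never leaves it, which mirrors the trailing (a + b)* in the expression.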
Finite automata (fa)
• Strings with a count of ‘a’ divisible by 3:
• The language of strings with a
count of a’s divisible by 3 is {a^(3n) | n >= 0},
which is denoted by the regular expression (aaa)*.
• We can construct an automaton as shown in
Figure 3.
Specification of Patterns for Tokens:
Definitions
• An alphabet Σ is a finite set of
symbols (characters)
• A string s is a finite sequence of
symbols from Σ
• |s| denotes the length of string s
• ε denotes the empty string, thus
|ε| = 0
• A language is a specific set of
strings over some fixed alphabet Σ
• The exponentiation of a string s is defined by
s^0 = ε
s^i = s^(i-1)s for i > 0
note that sε = εs = s
Specification of Patterns for Tokens:
Language Operations
• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure
L* = ∪ i=0,…,∞ L^i
• Positive closure
L+ = ∪ i=1,…,∞ L^i
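For finite languages these operations map directly onto Python sets of strings, with the empty string "" standing for ε. The sketch below is illustrative only: the function names are assumptions, and because the closures are infinite in general, the two closure functions build only the union up to a chosen power max_power.

```python
def union(L, M):
    return L | M

def concat(L, M):
    return {x + y for x in L for y in M}

def power(L, i):
    result = {""}                  # L^0 = {ε}; "" stands for ε
    for _ in range(i):
        result = concat(result, L) # L^i = L^(i-1) L
    return result

def kleene_closure(L, max_power):
    # Finite approximation: union of L^0 .. L^max_power.
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result

def positive_closure(L, max_power):
    # Finite approximation: union of L^1 .. L^max_power.
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result
```

With L = {"a", "b"}, power(L, 2) gives the four strings aa, ab, ba and bb; the only difference between the two closures is whether L^0, and hence ε, is included.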
Specification of Patterns for Tokens:
Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a is a regular expression denoting {a}
• If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
– r|s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression
is called a regular set
Specification of Patterns for Tokens:
Regular Definitions
• Regular definitions introduce a naming
convention:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression
over
Σ ∪ {d1, d2, …, di-1}
• Any dj in ri can be textually substituted
in ri to obtain an equivalent set of
definitions
• Example:
letter → A|B|…|Z|a|b|…|z
digit → 0|1|…|9
id → letter ( letter|digit )*
• Shorthand notation:
r+ = rr*
r? = r|ε
[a-z] = a|b|c|…|z
• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
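Textual substitution of earlier definitions into later ones can be mimicked with Python string formatting, which gives a quick way to test the definitions of id and num. This is a sketch using Python's re syntax, not the notation of any particular lexer generator; the variable and function names are assumptions.

```python
import re

# Each definition is built from the ones before it by textual substitution.
letter = r"[A-Za-z]"
digit  = r"[0-9]"
ident  = rf"{letter}(?:{letter}|{digit})*"                  # id
num    = rf"{digit}+(?:\.{digit}+)?(?:E[+-]?{digit}+)?"     # num

ID_RE, NUM_RE = re.compile(ident), re.compile(num)

def is_id(text):
    return ID_RE.fullmatch(text) is not None

def is_num(text):
    return NUM_RE.fullmatch(text) is not None
```

Under these definitions 6.02E+23 is a num, but 3. and .5 are not, because a decimal point must have at least one digit on each side.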
Regular Definitions and Grammars
Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num
Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Exercises
1. Complete the table below for the
automaton shown below.
Regular expression: (0+1)*01(0+1)*
Transition table: