Lecture 3- Lexical Analysis (1)


Lexical Analysis

Compiler construction
The role of Lexical analyser
• Lexical analysis is the first phase of a compiler.
• It takes the modified source code produced by language
preprocessors, written in the form of sentences.
• The lexical analyzer breaks this source text into a
series of tokens, removing any whitespace and
comments along the way.
• If the lexical analyzer finds a token invalid, it
generates an error.
• The lexical analyzer works closely with the syntax
analyzer.
• It reads character streams from the source code,
checks for legal tokens, and passes the data to the
syntax analyzer on demand.
• Identify tokens within a string.
• Return the lexeme of a found token, as well as
The role of Lexical analyser

• Lexical analysis is performed by the lexical analyzer, which is
also called the scanner. A scanner reads a stream of
characters and groups them into
meaningful (with respect to the source language)
units called tokens.
• It produces a stream of tokens for the next phase
of the compiler.

Lexical Analysis
• Lexical analysis consists of two stages of processing,
as follows:
• Scanning - Scanning is the process of reading
the input source code and converting it into a
sequence of tokens. It is often the first phase of a
compiler or interpreter.
• The goal is to break down the source code into
meaningful symbols that represent the smallest
elements of the language (like keywords,
identifiers, literals, operators, etc.).
• Tokenization - Tokenization is the specific step
within scanning where the raw input is divided
into tokens. Each token is a categorized block of
text that represents a specific type of element in
the programming language.
Lexical analysis or scanning
A lexical analyser, often referred to as a lexer or
tokenizer, is a component of a compiler or
interpreter that breaks down source code into its
fundamental building blocks, known as tokens.
Functions of a Lexical Analyser
• Tokenization: The primary function is to read the
input source code and convert it into a sequence
of tokens. Each token is a string with an assigned
meaning.
• Classification: Tokens are classified into
categories such as keywords, identifiers, literals,
operators, and punctuation.
• Removing Whitespace and Comments: The lexer
typically ignores whitespace and comments, as
they do not affect the program's execution.
Lexemes , Tokens, Patterns

• Token: A token is a sequence of
characters that can be treated
as a single logical entity. Typical
tokens are identifiers, keywords,
operators, special symbols, and
constants.
• Lexeme: A lexeme is a
sequence of characters in the
source program that is matched
by the pattern for a token.
• Pattern: A set of strings in the
input for which the same token
is produced as output. This set
of strings is described by a rule
called a pattern associated with
the token.
• Example statement: sum = 3 + 2;

Lexical analysis
• Token Structure: A token typically consists of:
• Type: The category of the token (e.g., keyword, identifier,
operator).
• Value: The actual text or value represented by the token.
• Example 1 - Given the input code:
if (x > 10) {
return x;
}
The lexer might produce the following tokens:
if (keyword)
( (punctuation)
x (identifier)
> (operator)
10 (literal)
) (punctuation)
{ (punctuation)
return (keyword)
x (identifier)
; (punctuation)
} (punctuation)
Lexical analysis
• Example 2: Consider the following simple piece of
code:
int x = 10 + 20;
Tokens Generated:
int (Keyword)
x (Identifier)
= (Assignment Operator)
10 (Integer Literal)
+ (Addition Operator)
20 (Integer Literal)
; (Semicolon)

Lexemes Generated:
int , x, =, 10, +, 20, ;
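
The two examples above can be reproduced with a small regular-expression tokenizer. This is only a sketch: the category names, the keyword list, and the patterns in TOKEN_SPEC are assumptions chosen to cover these two inputs, not a full language definition.

```python
import re

# Token categories and their patterns. Names and patterns are illustrative
# assumptions that cover only the two example inputs.
TOKEN_SPEC = [
    ("KEYWORD",     r"\b(?:if|return|int)\b"),
    ("IDENTIFIER",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("LITERAL",     r"\d+"),
    ("OPERATOR",    r"[=+>]"),
    ("PUNCTUATION", r"[(){};]"),
    ("SKIP",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (type, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            # No pattern applies here: report a lexical error.
            raise SyntaxError(f"invalid character {source[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

for kind, lexeme in tokenize("int x = 10 + 20;"):
    print(lexeme, f"({kind})")
```

The second element of each pair is the lexeme; the first is the token type the lexeme was classified under.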

Symbol table
• An essential function of a compiler is to build the
Symbol Table where the identifiers used in the
program are recorded along with various properties:
• storage allocated for the ID; its type; its scope (where in the
program it is valid); number and types of its
arguments (in case the ID is a procedure name);
etc.
• When an identifier is detected, an ID token is
generated, the
corresponding lexeme is entered in the Symbol Table,
and a pointer to the position in the Symbol Table is
associated with the ID token.
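
A minimal sketch of this bookkeeping, assuming a list-backed table in which the "pointer" is simply an index. The class name SymbolTable, the method enter, and the entry fields are hypothetical names chosen for illustration.

```python
class SymbolTable:
    """Maps each identifier lexeme to one entry; entries are found by index."""

    def __init__(self):
        self.entries = []    # one dict of properties per identifier
        self.index_of = {}   # lexeme -> position in self.entries

    def enter(self, lexeme):
        """Return the position of `lexeme`, inserting it on first sight."""
        if lexeme not in self.index_of:
            self.index_of[lexeme] = len(self.entries)
            # Type and scope are unknown while scanning; later phases fill them in.
            self.entries.append({"lexeme": lexeme, "type": None, "scope": None})
        return self.index_of[lexeme]

table = SymbolTable()
# Each ID token carries the index of its symbol-table entry as its attribute.
id_tokens = [("ID", table.enter(name)) for name in ["sum", "x", "sum"]]
print(id_tokens)
```

Both occurrences of sum share one entry, so later phases can attach type and scope information in a single place.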

Interaction of the Lexical Analyzer with
the Parser
• The interaction between the Lexical Analyzer and the
Parser is a cooperative process in the compilation
pipeline:
• The lexical analyzer generates tokens from the source
code, removing unnecessary elements like whitespace
and comments.
• The parser uses these tokens to check the syntactic
structure of the program and build a parse tree.
• This interaction is usually demand-driven, where the
Separating lexical analysis from
parsing
• Separating lexical analysis from parsing is a
common practice in compiler design and
programming language implementation for several
reasons:
• Modularity- By separating these two stages, each can
be developed and maintained independently. This
modularity makes it easier to manage complex
systems.
• Simplified Logic - Lexical analysis focuses only on
breaking input text into tokens, while parsing deals
with the structure of those tokens. This separation
simplifies the logic of each component.
• Efficiency- Lexical analysis can be optimized for token
recognition and can use finite state machines for
quick processing.
• Error Handling - Lexical analyzers can detect and
report errors related to invalid tokens, while parsers
Interaction of the Lexical Analyzer with
the Parser
• A parser is a program or component that performs syntax
analysis. It takes a sequence of tokens (produced by a
lexer) and checks their arrangement according to the
grammar rules of the language. The interaction between
the lexical analyzer and the parser also facilitates error
handling.
• Lexical errors: If the lexical analyzer encounters an
unrecognized character or malformed token, it can signal
an error to the parser or try to recover by skipping the
problematic characters and continuing.
• Syntactic errors: If the parser receives an unexpected
Attributes of Tokens

y := 31 + 28*x
-> Lexical analyzer ->
<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
-> Parser
Each pair has the form <token, tokenval>, where tokenval is the token attribute.
Issues in Lexical Analysis
Lexical analysis has the following issues:
• Look ahead
• Lookahead refers to the technique used in parsing to
examine upcoming tokens before making parsing
decisions. This is particularly useful in contexts where
the next token can influence the interpretation of the
current token.
• Look ahead is required to decide when one token will
end and the next token will begin.
• Ambiguities
• Ambiguities occur when a sequence of tokens can be
interpreted in multiple ways. This can lead to
confusion during parsing, as the parser may not be
able to determine the correct structure or meaning of
the code.
• The lexical analysis programs written with lex accept
ambiguous specifications and choose the longest
match possible at each input point.
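
The longest-match rule can be sketched as follows. The rule names and patterns are assumptions made up for this illustration. When two rules match the same length, this sketch keeps the rule listed first, which is also how lex breaks ties.

```python
import re

# Rule order matters only for ties: the earlier rule wins.
RULES = [
    ("IF",     re.compile(r"if")),
    ("ID",     re.compile(r"[a-z][a-z0-9]*")),
    ("LE",     re.compile(r"<=")),
    ("LT",     re.compile(r"<")),
    ("ASSIGN", re.compile(r"=")),
    ("NUM",    re.compile(r"[0-9]+")),
]

def longest_match(text, pos):
    """Try every rule at `pos` and keep the longest lexeme found."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # A strictly longer match replaces the current best; equal length keeps it.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = longest_match(text, pos)
        if match is None:
            raise SyntaxError(f"invalid character {text[pos]!r} at position {pos}")
        tokens.append(match)
        pos += len(match[1])
    return tokens

print(scan("ifx <= 10"))
print(scan("if x<1"))
```

In the first input, ifx becomes one identifier rather than the keyword if followed by x, and <= becomes one operator rather than < followed by =. Deciding this requires looking ahead past the first character of each lexeme.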
Finite Automaton and Lexical Analysis
• The lexical analyzer uses a finite automaton
(deterministic or non-deterministic) to recognize
lexemes based on predefined patterns.
• Regular expressions are commonly used to describe
the patterns of tokens, and finite automata are used
to process those regular expressions.
• Deterministic Finite Automaton (DFA): A type of
state machine used to scan input and determine
the appropriate token by transitioning between
states based on character input.

Tokens Specification
Let us see how language theory defines
the following terms:
• Alphabets
• Any finite set of symbols is an alphabet: {0,1} is the binary
alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet, and {a-z, A-Z} is the
English-language alphabet.
• Strings
• Any finite sequence of symbols from an alphabet is called a string.
The length of a string is the total number of symbol occurrences
in it, e.g., the length of the string Compiler
is 8, denoted |Compiler| = 8. A string with
no symbols, i.e. a string of zero length, is known as
the empty string and is denoted by ε (epsilon).
• Language
• A language is a set of strings (finite or infinite)
over some finite alphabet.

Regular Expressions
• The lexical analyzer needs to scan and identify only a
finite set of valid strings/tokens/lexemes that belong to the
language in hand.
• It searches for the pattern defined by the language rules.
• Regular expressions have the capability to express finite
languages by defining a pattern for finite strings of
symbols.
• The grammar defined by regular expressions is known as
regular grammar.
• The language defined by regular grammar is known as
regular language.
• A regular expression is an important notation for specifying
patterns.
• Each pattern matches a set of strings, so regular
expressions serve as names for sets of strings.
• Programming language tokens can be described by
regular languages.

Finite automata (fa)
• A finite automaton (FA) is the simplest machine
to recognize patterns.
• A finite automaton (FA) is a theoretical
model of computation used to represent and
manipulate a set of strings or sequences.
• To process a string, the finite automaton
starts at the initial state and reads the input
symbols one by one, following the
transitions defined by the automaton.
• If it ends in an accepting state after
processing all input symbols, the string is
accepted; otherwise, it is rejected.

Finite automata (fa)
• A finite automaton consists of the following:
• Q : Finite set of states.
• ∑ : Set of input symbols.
• q : Initial state.
• F : Set of final states.
• δ : Transition function.
• The formal specification of the machine is
{ Q, ∑, q, F, δ }.
• Example: an FA with ∑ = {0, 1} that accepts all
strings ending with 0.
• (0+1)*0
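
As a quick check of this pattern, the expression can be tried in Python's re module. In the textbook notation used here, + means alternation, which Python writes as |. The helper name ends_with_zero is an assumption for this sketch.

```python
import re

# (0+1)*0 in the slide's notation; '+' there is alternation, '|' in Python.
ENDS_WITH_ZERO = re.compile(r"(0|1)*0")

def ends_with_zero(s):
    """True if the whole string s is matched by (0|1)*0."""
    return ENDS_WITH_ZERO.fullmatch(s) is not None

for s in ["0", "10", "1100", "", "01"]:
    print(repr(s), ends_with_zero(s))
```

fullmatch is used so that the entire input must fit the pattern, which corresponds to the automaton consuming all input symbols before it accepts.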
Types of Finite automata (FA)

Deterministic Finite Automata (DFA):
• For each state and input symbol, there is exactly
one transition to a next state.
• Easier to implement and analyze due to its
deterministic nature.

Nondeterministic Finite Automata (NFA):
• May have multiple transitions for a given state and
input symbol, including transitions without input (ε-transitions).
• Recognizes exactly the same class of languages as a DFA (the
regular languages) and is often more compact, but is less
straightforward to implement.

Formal Definition of a DFA
A DFA is a five-tuple:
M = (Q, Σ, δ, q0, F)
Q A finite set of states
Σ A finite input alphabet
q0 The initial/starting state, q0 is in Q
F A set of final/accepting states, which is a
subset of Q
δ A transition function, which is a total function
from Q x Σ to Q

δ: (Q x Σ) –> Q
δ is defined for any q in Q and s in Σ, and
δ(q,s) = q’ is equal to some state q’ in Q,
could be q’=q
Example #1:
Q = {q0, q1}
Σ = {0, 1}
Start state is q0
F = {q0}
[State diagram: q0 loops on 1 and moves to q1 on 0; q1 loops on 1 and moves to q0 on 0]

δ:
      0    1
q0    q1   q0
q1    q0   q1
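
A table-driven simulation of Example #1, as a sketch. It reads the table as δ(q0,0) = q1, δ(q0,1) = q0, δ(q1,0) = q0, δ(q1,1) = q1; the names DELTA, run_dfa, and accepts are assumptions for this illustration.

```python
# Transition table of Example #1: DELTA[state][symbol] -> next state.
DELTA = {
    "q0": {"0": "q1", "1": "q0"},
    "q1": {"0": "q0", "1": "q1"},
}
START = "q0"
FINAL = {"q0"}

def run_dfa(word):
    """Return the list of states visited while reading `word`."""
    state = START
    path = [state]
    for symbol in word:
        state = DELTA[state][symbol]  # total function: exactly one next state
        path.append(state)
    return path

def accepts(word):
    """A word is accepted if the run ends in a final state."""
    return run_dfa(word)[-1] in FINAL

print(run_dfa("0110"), accepts("0110"))
print(run_dfa("0"), accepts("0"))
```

Because each 0 switches between q0 and q1 and each 1 leaves the state unchanged, this machine ends in q0 exactly when the input contains an even number of 0's.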

Example #2:
Q = {q0, q1, q2}
Σ = {a, b, c}
Start state is q0
F = {q2}
[State diagram: q0 loops on a and b and moves to q1 on c; q1 loops on a and b and moves to q2 on c; q2 loops on a/b/c]

δ:
      a    b    c
q0    q0   q0   q1
q1    q1   q1   q2
q2    q2   q2   q2

• Since δ is a function, at each step M has exactly one
option.
Nondeterministic Finite State
Automata (NFA)
An NFA is a five-tuple:
M = (Q, Σ, δ, q0, F)

Q A finite set of states
Σ A finite input alphabet
q0 The initial/starting state, q0 is in Q
F A set of final/accepting states, which is a subset
of Q
δ A transition function, which is a total function
from Q x Σ to 2^Q
δ: (Q x Σ) –> 2^Q : 2^Q is the power set of Q, the
set of all subsets of Q
δ(q,s) : The set of all
states p such that there is a transition
labeled s from q to p
δ(q,s) is a function from Q x Σ to 2^Q (but not
only to Q)

Example #1: one or more 0’s followed by one or more
1’s
Q = {q0, q1, q2}
Σ = {0, 1}
Start state is q0
F = {q2}
[State diagram: q0 loops on 0 and moves to q1 on 0; q1 loops on 1 and moves to q2 on 1; q2 loops on 0/1]

δ:
      0           1
q0    {q0, q1}    {}
q1    {}          {q1, q2}
q2    {q2}        {q2}
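
An NFA can be simulated by tracking the set of all states it could currently be in. The sketch below uses the transition table of NFA Example #1, read as δ(q0,0) = {q0, q1}, δ(q1,1) = {q1, q2}, δ(q2,0) = δ(q2,1) = {q2}, with every other entry empty. The names DELTA and nfa_accepts are assumptions.

```python
# Transition table of NFA Example #1; missing keys mean the empty set.
DELTA = {
    ("q0", "0"): {"q0", "q1"},
    ("q1", "1"): {"q1", "q2"},
    ("q2", "0"): {"q2"},
    ("q2", "1"): {"q2"},
}
START = "q0"
FINAL = {"q2"}

def nfa_accepts(word):
    """Follow every possible transition at once, keeping a set of states."""
    current = {START}
    for symbol in word:
        following = set()
        for state in current:
            following |= DELTA.get((state, symbol), set())
        current = following
    # Accept if at least one possible run ended in a final state.
    return bool(current & FINAL)

for w in ["01", "0011", "0", "10", ""]:
    print(repr(w), nfa_accepts(w))
```

If the set of current states becomes empty, every possible run has died and the word is rejected no matter what follows.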
Example #2: pair of 0’s or pair of 1’s as substring
Q = {q0, q1, q2, q3, q4}
Σ = {0, 1}
Start state is q0
F = {q2, q4}
[State diagram: q0 loops on 0/1; q0 moves to q3 on 0 and q3 moves to q4 on 0; q0 moves to q1 on 1 and q1 moves to q2 on 1; q2 and q4 loop on 0/1]

δ:
      0           1
q0    {q0, q3}    {q0, q1}
q1    {}          {q2}
q2    {q2}        {q2}
q3    {q4}        {}
q4    {q4}        {q4}

Example #3
The set of states = {0,1,2,3}
Input symbols = {a,b}
Start state is S0, accepting state is S3

The transition function can be implemented as a transition
table.

Finite automata (fa)
• Even number of a’s:
• The regular expression for an even
number of a’s is (b + ab*ab*)*.
• We can construct a finite automaton
as shown in Figure 1.
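
One way to gain confidence in such an expression is to compare it against the property it is meant to capture on every short string. The sketch below does this for all strings over {a, b} up to length 8; the length bound and the helper name all_strings are assumptions. The slide's + (alternation) is written as | in Python.

```python
import re
from itertools import product

# (b + ab*ab*)* in the slide's notation; '+' is alternation, '|' in Python.
EVEN_A = re.compile(r"(b|ab*ab*)*")

def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# Collect strings where the regex and the counting property disagree.
mismatches = [
    s for s in all_strings("ab", 8)
    if (EVEN_A.fullmatch(s) is not None) != (s.count("a") % 2 == 0)
]
print(mismatches)
```

An empty list means the expression and the "even number of a's" property agree on every string tested. This is evidence rather than a proof, since only finitely many strings are checked.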

Finite automata (fa)
• String with ‘ab’ as substring:
• The regular expression for strings
with ‘ab’ as a substring is (a +
b)*ab(a + b)*.
• We can construct a finite automaton as
shown in Figure 2.

Finite automata (fa)
• String with count of ‘a’ divisible by 3:
• The language of strings whose
count of a is divisible by 3 is {a^(3n) | n >=
0}, described by the regular expression (aaa)*.
• We can construct an automaton as shown in
Figure 3.

Specification of Patterns for Tokens:
Definitions
• An alphabet Σ is a finite set of
symbols (characters)
• A string s is a finite sequence of
symbols from Σ
• |s| denotes the length of string s
• ε denotes the empty string, thus
|ε| = 0
• A language is a specific set of
strings over some fixed alphabet Σ

Specification of Patterns for Tokens:
String Operations
• The concatenation of two strings x
and y is denoted by xy
• The exponentiation of a string s is
defined by

s^0 = ε
s^i = s^(i-1)s for i > 0

note that sε = εs = s

Specification of Patterns for Tokens:
Language Operations

• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure
L* = ∪ i=0,…,∞ L^i
• Positive closure
L+ = ∪ i=1,…,∞ L^i
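
For finite languages these operations map directly onto Python sets of strings, with the empty string "" standing in for ε. The closures are infinite in general, so this sketch only builds them up to a chosen exponent n; all function names are assumptions.

```python
def union(L, M):
    return L | M

def concat(L, M):
    """LM: every x from L followed by every y from M."""
    return {x + y for x in L for y in M}

def power(L, i):
    """L^0 = {ε}; L^i = L^(i-1) L."""
    result = {""}  # the empty string plays the role of ε
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_upto(L, n):
    """Finite approximation of L*: the union of L^0 .. L^n."""
    result = set()
    for i in range(n + 1):
        result |= power(L, i)
    return result

def positive_upto(L, n):
    """Finite approximation of L+: the union of L^1 .. L^n."""
    result = set()
    for i in range(1, n + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(power(L, 2)))
print(sorted(kleene_upto(L, 2)))
```

The only difference between the two closures is whether L^0 = {ε} is included, so kleene_upto(L, 2) has one more element than positive_upto(L, 2) here.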

Specification of Patterns for Tokens:
Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
– r|s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression
is called a regular set

Specification of Patterns for Tokens:
Regular Definitions
• Regular definitions introduce a naming
convention:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression
over
Σ ∪ {d1, d2, …, di-1 }
• Any dj in ri can be textually substituted
in ri to obtain an equivalent set of
Specification of Patterns for Tokens:
Regular Definitions
• Example:

letter → A|B|…|Z|a|b|…|z
digit → 0|1|…|9
id → letter ( letter|digit )*

• Regular definitions are not
recursive:

digits → digit digits|digit    wrong!
Specification of Patterns for Tokens:
Notational Shorthand

• The following shorthands are often used:

r+ = rr*
r? = r|ε
[a-z] = a|b|c|…|z

• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
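
The num definition translates almost symbol for symbol into Python's re syntax. This sketch reads it as digit+, an optional fraction, and an optional exponent introduced by E with an optional sign. The literal dot must be escaped as \. and the helper name is_num is an assumption.

```python
import re

# digit -> [0-9]
DIGIT = r"[0-9]"
# num -> digit+ (. digit+)? ( E (+|-)? digit+ )?
NUM = re.compile(rf"{DIGIT}+(\.{DIGIT}+)?(E[+-]?{DIGIT}+)?")

def is_num(s):
    """True if the whole string s matches the num definition."""
    return NUM.fullmatch(s) is not None

for s in ["31", "3.14", "6.02E+23", "1E5", ".5", "1.", "E5"]:
    print(repr(s), is_num(s))
```

Building NUM from DIGIT by string substitution mirrors how a regular definition is expanded: each defined name is textually replaced by its expression.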
Regular Definitions and Grammars
Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num

Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?

Exercises
1. Complete the table below for the
automaton shown below.

Exercises
Regular expression: (0+1)*01(0+1)*

DFA of R.E.
Q = {q0, q1, q2}
Σ = {0,1}
start state = q0
F = {q2}

Transition table: