Lexical Analysis

The document discusses lexical analysis which is the first phase of compilation. It breaks down source code into lexical tokens. A lexical analyzer identifies substrings called lexemes which are associated with token categories. It strips out comments and whitespace. Reasons for separating lexical and syntax analysis include simplicity, efficiency, and portability. Formal languages can be specified using finite state machines or regular expressions. Input buffering techniques like double buffering and sentinels are used to improve lexical analyzer performance.

Uploaded by

habteyesus Tadesse

Lexical Analysis

Chapter 2
Introduction
[Figure: Source Program → Lexical Analyzer (Scanner) → Tokens → Syntax Analyzer (Parser), with a Symbol Table Manager shown alongside the scanner and the parser]

01/20/2024 CD 2020 2
Introduction
• Lexical analysis is the first phase of the compiler
• A lexical analyzer is a pattern matcher for character strings
• It is a “front-end” for the parser
• Identifies substrings of the source program that belong together -
lexemes
• Lexemes match a character pattern, which is associated with a
lexical category called a token
• sum is a lexeme; its token may be IDENT
• The lexical analyzer also strips comments and whitespace (blank,
tab, and newline characters) from the source program
• It also correlates error messages from the compiler with the source
program
Introduction
• The lexical analyzer is usually a function that is called by
the parser when it needs the next token
• Three approaches to build a lexical analyzer:
• Write a formal description of the tokens and use a software
tool that constructs table-driven lexical analyzers given such a
description
• E.g. lex or flex
• Easiest to implement but least efficient
• Design a state diagram that describes the tokens and write a
program that implements the state diagram
• Intermediate in both ease and efficiency
• Design a state diagram that describes the tokens and hand-construct
a table-driven implementation of the state diagram
• Hardest to implement, but most efficient
Reasons to separate lexical and
syntax analysis
• Simplicity - less complex approaches can be used
for lexical analysis; separating them simplifies the
parser
• Efficiency - separation allows optimization of the
lexical analyzer
• Portability - parts of the lexical analyzer may not be
portable, but the parser is always portable
Formal Languages (Revision)
• A formal language is one that can be specified precisely and
is amenable to use with computers
• E.g. the syntax of the Java programming language
• A natural language, by contrast, is one that is normally spoken
by people
• E.g. Kiswahili, Arabic
• A formal language is a set of strings over a given alphabet
• E.g. 1. Alphabet: {0,1}
• {0, 10, 1011} is a finite language; {} is the empty language (no strings), which is also finite
• {ε, 0, 00, 000, 0000, …} is an infinite language (ε denotes the null string)
• The set of all strings of zeros and ones having an even number of ones
• E.g. 2. Alphabet: characters on a computer keyboard
• {0, 10, 1011}, {ε}, Java syntax, English syntax
Formal Languages (Revision)
cont’d
A formal language can be specified using a Finite State Machine or
Regular Expressions
Finite State Machine (FSM)
• It is a theoretical machine consisting of
1. A finite set of states: one starting state and zero or more accepting
states
2. A state transition function:
• Takes two arguments: a state and an input symbol
• Returns a state as its result
• How the FSM works:
• The input is a string of symbols from the input alphabet
• The machine is initially in the starting state
• For each symbol read from the input string, the machine proceeds
to a new state as indicated by the transition function
• When the input is exhausted, the machine is either in an accepting state
(input is accepted) or in a non-accepting state (input is rejected)
• The set of all input strings accepted by the machine forms
a language
Formal Languages (Revision)
cont’d

• An FSM can be represented by a state diagram, a table, or other methods
• State Diagram Representation
[Figure: annotated state diagram]
• A double circle represents an accepting (final) state
• The starting state is marked by an arc with no state at its source
• Arcs represent the transition function
• Labels on the arcs represent inputs
Formal Languages (Revision)
cont’d

• Table Representation

        0    1
  *A    A    B
   B    B    A

• Columns are labeled with the input symbols
• An accepting state is marked by an asterisk
• Each entry gives the next state
• The starting state is listed in the first row
Formal Languages (Revision)
cont’d
• Example: strings containing an odd number of zeros
from input alphabet {0,1}

        0    1
   A    B    A
  *B    A    B

[Figure: state diagram representation of the same machine]
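The odd-number-of-zeros machine can be traced in code. The following is a minimal sketch, not part of the original slides: the class name `OddZeros` and the encoding of states A and B as 0 and 1 are assumptions made for illustration.

```java
// Sketch: table-driven FSM for "odd number of zeros" over the alphabet {0,1}.
// The class name and the integer encoding of states are illustrative assumptions.
class OddZeros {
    // Rows are states (0 = A, 1 = B); columns are the input symbols 0 and 1.
    static final int[][] NEXT = {
        {1, 0},   // A: on 0 go to B, on 1 stay in A
        {0, 1}    // B: on 0 go to A, on 1 stay in B
    };
    static final boolean[] ACCEPT = {false, true}; // only B is accepting

    static boolean accepts(String input) {
        int state = 0; // A is the starting state
        for (char c : input.toCharArray()) {
            if (c != '0' && c != '1') {
                return false; // symbol outside the input alphabet
            }
            state = NEXT[state][c - '0'];
        }
        return ACCEPT[state];
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "0", "1011", "1001", "000"}) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

Each row of `NEXT` is one row of the table representation, so changing the language means changing only the two arrays.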
Formal Languages (Revision)
cont’d

Regular Expression
• These are formulas or expressions consisting of
three possible operations on languages:
1. Union
2. Concatenation, and
3. Kleene star
1. Union: since a language is a set, this operation is
the union operation as defined in set theory
• E.g. {abc, ab, ba} + {ba, bb} = {abc, ab, ba, bb}
• Note: L + {} = L

Formal Languages (Revision)
cont’d
2. Concatenation: the concatenation of two languages is the language
formed by concatenating each string in one language with each string
in the other language
• E.g. {ab, a, c} · {b, ε} = {ab · b, ab · ε, a · b, a · ε, c · b, c · ε} = {abb, ab, a, cb, c}
• In general, L1 · L2 ≠ L2 · L1
• L · {ε} = L
• L · {} = {}
3. Kleene star (closure): if L is a language, we define its powers as follows
• L^0 = {ε}
• L^1 = L
• L^2 = L · L^1
…
• L^n = L · L^(n-1)
• L* = L^0 + L^1 + L^2 + L^3 + …
• Note: {}* = {ε}
Formal Languages (Revision)
cont’d

• E.g. A regular expression specifying the language containing
all strings of zeros and ones: (0+1)*
• To understand what strings are in this language, let L = {0,1}.
We need to find L*:
• L^0 = {ε}
• L^1 = {0, 1}
• L^2 = L · L^1 = {00, 01, 10, 11}
• L^3 = L · L^2 = {000, 001, 010, 011, 100, 101, 110, 111}
...
• L* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101,
110, 111, 0000, ...}
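The three operations can be checked mechanically on finite languages. The sketch below is an illustration only: the class `LanguageOps` and its method names are hypothetical, the empty Java string stands for the null string, and `starUpTo` computes only a finite prefix of L* because L* itself is usually infinite. It assumes Java 9 or later for `Set.of`.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch: union, concatenation, and powers of finite languages.
// All names here are illustrative assumptions; "" plays the role of the null string.
class LanguageOps {
    // Union: ordinary set union.
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // Concatenation: every string of a followed by every string of b.
    static Set<String> concat(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>();
        for (String x : a) {
            for (String y : b) {
                result.add(x + y);
            }
        }
        return result;
    }

    // L^n, defined by L^0 = {null string} and L^n = L · L^(n-1).
    static Set<String> power(Set<String> l, int n) {
        Set<String> result = new TreeSet<>(Set.of(""));
        for (int i = 0; i < n; i++) {
            result = concat(l, result);
        }
        return result;
    }

    // Finite prefix of the Kleene star: L^0 + L^1 + ... + L^maxN.
    static Set<String> starUpTo(Set<String> l, int maxN) {
        Set<String> result = new TreeSet<>();
        for (int n = 0; n <= maxN; n++) {
            result.addAll(power(l, n));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(union(Set.of("abc", "ab", "ba"), Set.of("ba", "bb")));
        System.out.println(concat(Set.of("ab", "a", "c"), Set.of("b", "")));
        System.out.println(power(Set.of("0", "1"), 2));
        System.out.println(starUpTo(Set.of("0", "1"), 3));
    }
}
```

Because sets discard duplicates, the concatenation example yields five strings rather than six: `ab` arises both as ab · ε and as a · b.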
Input Buffering
• Scanner performance is crucial:
• This is the only part of the compiler that examines the
entire input program one character at a time
• Disk input can be slow
• The scanner accounts for ~25-30% of total compile time
• We need lookahead to determine when a match
has been found
• Scanners use double-buffering to minimize the
overheads associated with this

Buffer Pairs

• Use two N-byte buffers (N = size of a disk block; typically, N = 1024 or 4096).
• Read N bytes into one half of the buffer each time. If the input
has fewer than N bytes, put a special EOF marker in the
buffer.
• When one buffer has been processed, read N bytes into the
other buffer (“circular buffers”).
Buffer Pairs cont’d

Code:
    if (fwd at end of first half) {
        reload second half;
        set fwd to point to beginning of second half;
    } else if (fwd at end of second half) {
        reload first half;
        set fwd to point to beginning of first half;
    } else
        fwd++;

• It takes two tests for each advance of the fwd pointer
Buffer Pairs With Sentinels

• Objective: Optimize the common case by reducing the
number of tests to one per advance of fwd
• Idea: Extend each buffer half to hold a sentinel at the
end
• This is a special character that cannot occur in a
program (e.g., EOF)
• It signals the need for some special action (fill the other
buffer half, or terminate processing)
Buffer Pairs With Sentinels cont’d

Code:
    fwd++;
    if (*fwd == EOF) {   /* special processing needed */
        if (fwd at end of first half)
            . . .
        else if (fwd at end of second half)
            . . .
        else   /* end of input */
            terminate processing;
    }
• The common case now needs just a single test per character
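One possible way to realize buffer pairs with sentinels in Java is sketched below. It is an illustration under stated assumptions, not the slides' implementation: the class name `SentinelBuffer`, the choice of `'\0'` as the sentinel (assumed never to occur in source text), and the helper `readAll` are all hypothetical. Each half holds n characters followed by one sentinel slot.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Sketch: a buffer pair with sentinels. Layout: [half 1][S][half 2][S].
class SentinelBuffer {
    static final char SENTINEL = '\0'; // assumption: never appears in the source text
    private final int n;               // capacity of each half
    private final char[] buf;
    private final Reader in;
    private int fwd = 0;               // index of the next character to hand out

    SentinelBuffer(Reader in, int n) throws IOException {
        this.in = in;
        this.n = n;
        this.buf = new char[2 * (n + 1)];
        fill(0);
    }

    // Load up to n characters starting at 'start', then place a sentinel after them.
    private void fill(int start) throws IOException {
        int got = 0;
        while (got < n) {
            int r = in.read(buf, start + got, n - got);
            if (r < 0) break;          // no more input
            got += r;
        }
        buf[start + got] = SENTINEL;
    }

    // Returns the next character, or -1 at end of input.
    int next() throws IOException {
        char c = buf[fwd++];
        if (c != SENTINEL) return c;   // common case: a single test
        int pos = fwd - 1;
        if (pos == n) {                // sentinel at end of first half
            fill(n + 1);
            fwd = n + 1;
            return next();
        } else if (pos == 2 * n + 1) { // sentinel at end of second half
            fill(0);
            fwd = 0;
            return next();
        }
        fwd = pos;                     // sentinel inside a half: true end of input
        return -1;
    }

    // Convenience helper: run a whole string through the buffer.
    static String readAll(String text, int n) {
        try {
            SentinelBuffer b = new SentinelBuffer(new StringReader(text), n);
            StringBuilder sb = new StringBuilder();
            for (int c = b.next(); c != -1; c = b.next()) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll("int fee = 12;", 4));
    }
}
```

A sentinel found at the very end of a half means "reload the other half", while a sentinel found anywhere else can only have been written after a short read, so it marks the end of input. A real scanner would also keep a lexeme-start pointer, which this sketch omits.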
Lexical Tokens
• The lexical analyzer scans the input string (source code) and attempts to
isolate the words (lexemes)
• Words/lexemes are treated as units, called tokens, and passed to the next phase of
compilation
• Some of the words include:
1. Keywords/Reserved words: while, if, else, for, int, …
2. Identifiers: constructed by programmers
• May be used to identify variables, classes, functions, constants, etc.
3. Operators: symbols used for arithmetic, logical, or character operations:
+, -, <=, =, …
4. Numeric constants: such as the integer 43 or the float 4.25
5. Character constants: single characters or strings of characters enclosed
in quotes
6. Special characters: -, (, ), ,, ;, …
7. Comments
8. White space
9. Newline
Lexical Tokens cont’d
• Example:

int fee = 12; //example comment

• int: class 1 (keyword)
• fee: class 2 (identifier)
• =: class 3 (operator)
• 12: class 4 (numeric constant)
• ;: class 6 (special character)
• //example comment: class 7 (comment)

• Each token consists of two parts:
• Class: indicating which kind of token
• Value: indicating which member of the class
Lexical Tokens cont’d
• Example:

Class    Value
1        [code for int]
2        [Pointer to symbol table entry for fee]
3        [code for =]
4        [Pointer to constant table entry for 12]
6        [code for ;]

• Note that the lexical analysis phase does not check for proper syntax
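The class/value pairs above can be produced by a very small hand-written scanner. The sketch below is only an illustration: the class `TinyScanner`, its keyword set, and its operator set are assumptions, and for simplicity the value is the lexeme text itself rather than a code or a pointer into a table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch: a toy scanner that emits (class, value) pairs.
// Class numbers follow the numbered list of token kinds in these slides.
class TinyScanner {
    static final int KEYWORD = 1, IDENT = 2, OPERATOR = 3, NUMBER = 4, SPECIAL = 6;
    static final Set<String> KEYWORDS = Set.of("while", "if", "else", "for", "int");

    static final class Token {
        final int cls;       // which kind of token
        final String value;  // here: the lexeme text (a real scanner might store a table index)
        Token(int cls, String value) { this.cls = cls; this.value = value; }
        @Override public String toString() { return "<" + cls + ", " + value + ">"; }
    }

    static List<Token> scan(String src) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }   // strip whitespace
            if (src.startsWith("//", i)) {                       // strip a line comment
                while (i < src.length() && src.charAt(i) != '\n') i++;
                continue;
            }
            int start = i;
            if (Character.isLetter(c)) {
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                String word = src.substring(start, i);
                out.add(new Token(KEYWORDS.contains(word) ? KEYWORD : IDENT, word));
            } else if (Character.isDigit(c)) {
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                out.add(new Token(NUMBER, src.substring(start, i)));
            } else if ("+-=<".indexOf(c) >= 0) {
                i++;
                if (c == '<' && i < src.length() && src.charAt(i) == '=') i++; // two-character <=
                out.add(new Token(OPERATOR, src.substring(start, i)));
            } else {
                i++;
                out.add(new Token(SPECIAL, String.valueOf(c)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("int fee = 12; //example comment"));
    }
}
```

Note that the scanner happily tokenizes `12 = fee int;` as well: ordering errors are left to the parser.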
Implementation with Finite State Machines

• Finite state machines can be used to simplify lexical analysis
• A finite state machine can be implemented very
simply by an array in which there is a row for each
state of the machine and a column for each
possible input symbol
• It may be necessary or desirable to code the states
and/or input symbols as integers, depending on the
implementation programming language
Implementation with Finite State
Machines cont’d
import java.io.IOException;

// Fragment: STATES and INPUTS are constants assumed to be defined elsewhere.
boolean[] accept = new boolean[STATES];  // accept[s] is true if s is an accepting state
int[][] fsm = new int[STATES][INPUTS];   // state table: fsm[state][input] = next state
// initialize table here...
int inp = 0;    // input symbol (0..INPUTS-1)
int state = 0;  // starting state
try {
    inp = System.in.read() - '0';        // character input, converted to int
    while (inp >= 0 && inp < INPUTS) {
        state = fsm[state][inp];         // next state
        inp = System.in.read() - '0';    // get next input
    }
} catch (IOException ioe) {
    System.out.println("IO error " + ioe);
}
if (accept[state])
    System.out.println("Accepted");
else
    System.out.println("Rejected");
Examples of Finite State
Machines for Lexical Analysis
• Example 1: An example of a finite state machine which accepts any
identifier beginning with a letter and followed by any number of
letters and digits

• The letter ‘L’ represents any letter (a-z), and the letter ‘D’
represents any numeric digit (0-9)
• This implies that a preprocessor would be needed to convert input
characters to tokens suitable for input to the finite state machine
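The preprocessing step can be as small as a function that maps each character to its input class. The sketch below is illustrative only: since the slide's diagram is not reproduced here, the state names, the explicit dead state, and the class `IdentifierFsm` are assumptions.

```java
// Sketch: identifier FSM whose inputs are character classes, not raw characters.
class IdentifierFsm {
    // Input classes produced by the "preprocessor" classify().
    static final int L = 0, D = 1, OTHER = 2;
    // States: START, IN_ID (accepting), and DEAD (an assumed trap state).
    static final int START = 0, IN_ID = 1, DEAD = 2;

    static final int[][] NEXT = {
        //  L      D      OTHER
        {IN_ID, DEAD,  DEAD},  // START: the first character must be a letter
        {IN_ID, IN_ID, DEAD},  // IN_ID: letters and digits may follow
        {DEAD,  DEAD,  DEAD}   // DEAD: no way back
    };

    // Convert a raw character into the symbol the FSM understands.
    static int classify(char c) {
        if (c >= 'a' && c <= 'z') return L;
        if (c >= '0' && c <= '9') return D;
        return OTHER;
    }

    static boolean isIdentifier(String s) {
        int state = START;
        for (int i = 0; i < s.length(); i++) {
            state = NEXT[state][classify(s.charAt(i))];
        }
        return state == IN_ID;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"sum", "x1", "1x", ""}) {
            System.out.println("\"" + s + "\" -> " + isIdentifier(s));
        }
    }
}
```

Classifying characters first keeps the table at three columns instead of one column per character.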
Examples of Finite State
Machines for Lexical Analysis
cont’d
• Example 2: A finite state machine which accepts
numeric constants.
• These constants must begin with a digit, and
numbers such as .099 are not acceptable

Examples of Finite State
Machines for Lexical Analysis
cont’d
• Example 3: FSM that accepts the keywords if, int,
import, for, float.

Actions for Finite State Machines
• Lexical analysis involves more than simply
recognizing words
• It may involve:
• Building a symbol table
• Converting numeric constants to the appropriate data
type and
• Putting out tokens.
• For this reason, we wish to associate an action, or
function to be invoked, with each state transition in
the finite state machine

Actions for Finite State Machines
cont’d

• Design a finite state machine, with actions, to read
numeric strings and convert them to an appropriate
internal format, such as floating point
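One possible design is sketched below; it is an assumption-laden illustration rather than the slides' solution. The states and the class `NumberFsm` are hypothetical, the machine accepts only digits with an optional fractional part, and it assumes that a decimal point must be followed by at least one digit. The action on every digit transition updates a running mantissa, and a final action scales it.

```java
// Sketch: FSM with actions that converts a numeric string to a double.
class NumberFsm {
    static final int START = 0, INT = 1, DOT = 2, FRAC = 3, DEAD = 4;

    // Returns the numeric value, or NaN if the string is rejected.
    static double parse(String s) {
        int state = START;
        double mantissa = 0.0; // all digits seen so far, read as one integer
        double divisor = 1.0;  // 10^(number of fractional digits)
        for (int i = 0; i < s.length() && state != DEAD; i++) {
            char c = s.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case START:
                case INT:
                    if (digit) {
                        mantissa = mantissa * 10 + (c - '0'); // action: shift in a digit
                        state = INT;
                    } else if (c == '.' && state == INT) {
                        state = DOT;   // a point is allowed only after a digit
                    } else {
                        state = DEAD;
                    }
                    break;
                case DOT:
                case FRAC:
                    if (digit) {
                        mantissa = mantissa * 10 + (c - '0'); // action: shift in a digit
                        divisor *= 10;                        // action: count a fractional place
                        state = FRAC;
                    } else {
                        state = DEAD;
                    }
                    break;
                default:
                    state = DEAD;
            }
        }
        // Final action: scale the mantissa. INT and FRAC are the accepting states.
        return (state == INT || state == FRAC) ? mantissa / divisor : Double.NaN;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"43", "4.25", ".099", "4."}) {
            System.out.println("\"" + s + "\" -> " + parse(s));
        }
    }
}
```

Accumulating an integer mantissa and dividing once at the end avoids adding up many small inexact fractions.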
Lexical Tables
• One of the most important functions of the lexical
analysis phase is the creation of tables which are used
later in the compiler. These include:
• Symbol table for identifiers
• Table of numeric constants
• Table of string constants
• Table of statement labels
• Table of line numbers
• These can be implemented using:
• Sequential search
• Binary search tree
• Hash table
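Of the three implementation options, the hash table is the simplest to sketch with the Java standard library. The class `SymbolTable` and its method names below are hypothetical; the entry number returned by `intern` plays the role of the "pointer to symbol table entry" stored in a token's value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a symbol table for identifiers backed by a hash table.
class SymbolTable {
    private final Map<String, Integer> index = new HashMap<>(); // name -> entry number
    private final List<String> names = new ArrayList<>();       // entry number -> name

    // Returns the entry number for name, creating the entry on first sight.
    int intern(String name) {
        Integer existing = index.get(name);
        if (existing != null) {
            return existing;
        }
        int id = names.size();
        names.add(name);
        index.put(name, id);
        return id;
    }

    String nameOf(int id) {
        return names.get(id);
    }

    int size() {
        return names.size();
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        System.out.println(table.intern("fee"));  // first identifier gets entry 0
        System.out.println(table.intern("sum"));  // second identifier gets entry 1
        System.out.println(table.intern("fee"));  // repeated identifier reuses entry 0
    }
}
```

A sequential search or a binary search tree could sit behind the same `intern` interface; the choice mainly affects lookup cost as the number of identifiers grows.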
