Lexical Analysis

The document discusses lexical analysis which is the first phase of compilation. It breaks down source code into lexical tokens. A lexical analyzer identifies substrings called lexemes which are associated with token categories. It strips out comments and whitespace. Reasons for separating lexical and syntax analysis include simplicity, efficiency, and portability. Formal languages can be specified using finite state machines or regular expressions. Input buffering techniques like double buffering and sentinels are used to improve lexical analyzer performance.
Lexical Analysis

Chapter 2
Introduction
[Figure: Source Program → Lexical Analyzer (Scanner) → Tokens → Syntax Analyzer (Parser), alongside a Symbol Table Manager]

01/20/2024 CD 2020 2
Introduction
• Lexical analysis is the first phase of the compiler
• A lexical analyzer is a pattern matcher for character strings
• It is a “front-end” for the parser
• Identifies substrings of the source program that belong together -
lexemes
• Lexemes match a character pattern, which is associated with a
lexical category called a token
• sum is a lexeme; its token may be IDENT
• The lexical analyzer also strips out comments and white space (blank,
tab, and newline characters) from the source program
• It also correlates error messages from the compiler with the source
program
Introduction
• The lexical analyzer is usually a function that is called by
the parser when it needs the next token
• Three approaches to build a lexical analyzer:
• Write a formal description of the tokens and use a software
tool that constructs table-driven lexical analyzers given such a
description
• E.g. lex or flex
• Easiest to implement but least efficient
• Design a state diagram that describes the tokens and write a
program that implements the state diagram
• Intermediate in both ease and efficiency
• Design a state diagram that describes the tokens and hand-
construct a table-driven implementation of the state diagram
• Hardest to implement, but most efficient

Reasons to separate lexical and
syntax analysis
• Simplicity - less complex approaches can be used
for lexical analysis; separating them simplifies the
parser
• Efficiency - separation allows optimization of the
lexical analyzer
• Portability - parts of the lexical analyzer may not be
portable, but the parser is always portable
Formal Languages (Revision)
• A formal language is one that can be specified precisely and
is amenable for use with computers
• E.g. the syntax of the Java programming language
• Whereas a natural language is one which is normally spoken
by people
• E.g. Kiswahili, Arabic
• A formal language is a set of strings over a given alphabet
• E.g. 1. Alphabet: {0,1}
• {0, 10, 1011} (a finite language) and {} (the empty language, also finite)
• {ε, 0, 00, 000, 0000, …}
• The set of all strings of zeros and ones having an even number of ones
• E.g. 2. Alphabet: characters on a computer keyboard
• {0, 10, 1011}, {ε}, Java syntax, English syntax
Formal Languages (Revision)
cont’d
Formal language can be specified using Finite State Machine or
Regular Expressions
Finite State Machine (FSM )
• It is a theoretical machine consisting of:
1. A finite set of states: one starting state and zero or more accepting
states
2. A state transition function:
• Takes two arguments: a state and an input symbol
• Returns a state as its result
• How the FSM works:
• The input is a string of symbols from the input alphabet
• The machine is initially in the starting state
• For each symbol read from the input string, the machine proceeds to a
new state as indicated by the transition function
• When the input is exhausted, the machine is either in an accepting state
(input is accepted) or a non-accepting state (input is rejected)
• The set of all input strings accepted by the machine forms a language
Formal Languages (Revision)
cont’d

• An FSM can be represented by a state diagram, a table, or other methods
• State Diagram Representation
[Figure: annotated state diagram]
• Double circle: an accepting state (final state)
• Starting state: marked by an arc with no state at its source
• Arcs: the transition function
• Labels on the arcs: inputs
Formal Languages (Revision)
cont’d

• Table Representation

        0    1
  *A    A    B
   B    B    A

• The column headings are the input symbols
• Each entry is the next state
• The accepting state is marked by an asterisk
• The starting state is listed in the first row
Formal Languages (Revision)
cont’d
• Example: strings containing an odd number of zeros
from the input alphabet {0,1}

Table representation:

        0    1
   A    B    A
  *B    A    B

[Figure: state diagram representation of the same machine]
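The odd-number-of-zeros machine can be run directly from its table. The following is a minimal Java sketch, not part of the original slides: the class name `OddZerosFsm`, the encoding of states A and B as 0 and 1, and the choice to throw on symbols outside {0,1} are assumptions made for illustration.

```java
class OddZerosFsm {
    // Transition table from the example: rows are states A (0) and B (1),
    // columns are the input symbols 0 and 1.
    static final int[][] NEXT = {
        {1, 0},  // A: on 0 go to B, on 1 stay in A
        {0, 1},  // B: on 0 go to A, on 1 stay in B
    };
    static final boolean[] ACCEPT = {false, true}; // only B is accepting

    // Returns true if the string of 0s and 1s contains an odd number of zeros.
    static boolean accepts(String input) {
        int state = 0; // the starting state is A
        for (char c : input.toCharArray()) {
            if (c != '0' && c != '1') {
                throw new IllegalArgumentException("symbol not in alphabet: " + c);
            }
            state = NEXT[state][c - '0'];
        }
        return ACCEPT[state];
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "1001", "10001", ""}) {
            System.out.println("'" + s + "' " + (accepts(s) ? "accepted" : "rejected"));
        }
    }
}
```

Each input symbol costs one array lookup, which is why the table form is attractive for scanners.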
Formal Languages (Revision)
cont’d

Regular Expression
• These are formulas or expressions consisting of
three possible operations on languages:
1. Union
2. Concatenation, and
3. Kleene star
1. Union: since a language is a set, this operation is
the union operation as defined in set theory
• E.g. {abc, ab, ba} + {ba, bb} = {abc, ab, ba, bb}
• Note: L + {} = L

Formal Languages (Revision)
cont’d
2. Concatenation: the concatenation of two languages is the language
formed by concatenating each string in one language with each string
in the other language
• E.g. {ab, a, c} · {b, ε} = {ab·b, ab·ε, a·b, a·ε, c·b, c·ε} = {abb, ab, a, cb, c}
• In general, L1 · L2 ≠ L2 · L1
• L · {ε} = L
• L · {} = {}
3. Kleene star (closure): if L is a language, L* is defined as follows
• L^0 = {ε}
• L^1 = L
• L^2 = L · L^1
• …
• L^n = L · L^(n-1)
• L* = L^0 + L^1 + L^2 + L^3 + …
• Note: {}* = {ε}
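Union and concatenation of finite languages can be checked mechanically. Below is a small Java sketch, assuming languages are represented as sets of strings with the empty string "" standing for the null string; the class `LanguageOps` and its helper `of` are hypothetical names, not from the slides.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class LanguageOps {
    // Convenience helper: build a language from its strings.
    static Set<String> of(String... strings) {
        return new TreeSet<>(Arrays.asList(strings));
    }

    // Union of two languages is ordinary set union.
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    // Concatenation: each string of a followed by each string of b.
    static Set<String> concat(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>();
        for (String x : a) {
            for (String y : b) {
                out.add(x + y);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(union(of("abc", "ab", "ba"), of("ba", "bb")));
        System.out.println(concat(of("ab", "a", "c"), of("b", "")));
    }
}
```

Because the result is a set, the duplicate string ab (produced both as ab followed by the null string and as a followed by b) appears only once, which matches the five-element result in the concatenation example.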
Formal Languages (Revision)
cont’d

• E.g. a regular expression specifying the language containing all
strings of zeros and ones: (0+1)*
• To understand what strings are in this language, let L = {0,1}.
We need to find L*:
• L^0 = {ε}
• L^1 = {0, 1}
• L^2 = L · L^1 = {00, 01, 10, 11}
• L^3 = L · L^2 = {000, 001, 010, 011, 100, 101, 110, 111}
• ...
• L* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101,
110, 111, 0000, ...}
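L* is infinite whenever L contains a non-empty string, so a program can only enumerate a finite prefix of it. The Java sketch below computes the union of the powers of L from 0 up to a chosen bound; the class `KleeneStar`, the method `starUpTo`, and the bound parameter are illustrative assumptions, and the empty string "" stands for the null string.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class KleeneStar {
    // Convenience helper: build a language from its strings.
    static Set<String> of(String... strings) {
        return new TreeSet<>(Arrays.asList(strings));
    }

    // Concatenation: each string of a followed by each string of b.
    static Set<String> concat(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>();
        for (String x : a) {
            for (String y : b) {
                out.add(x + y);
            }
        }
        return out;
    }

    // Union of the powers 0..maxPower of lang: a finite prefix of its Kleene star.
    static Set<String> starUpTo(Set<String> lang, int maxPower) {
        Set<String> result = new TreeSet<>();
        Set<String> power = of("");          // power 0 holds only the null string
        for (int n = 0; n <= maxPower; n++) {
            result.addAll(power);
            power = concat(lang, power);     // next power = L concatenated with this one
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(starUpTo(of("0", "1"), 3));
    }
}
```

With L = {0, 1} the bound 2 yields seven strings and the bound 3 yields fifteen, in line with the powers listed above. For the empty language every bound yields just the null string.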
Input Buffering
• Scanner performance is crucial:
• This is the only part of the compiler that examines the
entire input program one character at a time
• Disk input can be slow
• The scanner accounts for ~25-30% of total compile time
• We need lookahead to determine when a match
has been found
• Scanners use double-buffering to minimize the
overheads associated with this

Buffer Pairs

• Use two N-byte buffers (N = size of a disk block; typically, N = 1024
or 4096).
• Read N bytes into one half of the buffer each time. If fewer than N
bytes of input remain, put a special EOF marker in the buffer.
• When one buffer has been processed, read N bytes into the
other buffer (“circular buffers”).
Buffer Pairs cont’d

Code:
if (fwd at end of first half) {
    reload second half;
    set fwd to point to beginning of second half;
} else if (fwd at end of second half) {
    reload first half;
    set fwd to point to beginning of first half;
} else {
    fwd++;
}

• It takes two tests for each advance of the fwd pointer
Buffer Pairs With Sentinels

• Objective: optimize the common case by reducing the number of tests
to one per advance of fwd
• Idea: extend each buffer half to hold a sentinel at the end
• This is a special character that cannot occur in a
program (e.g., EOF)
• It signals the need for some special action (fill other
buffer-half, or terminate processing)
Buffer Pairs With Sentinels cont’d

Code:
fwd++;
if (*fwd == EOF) { /* special processing needed */
    if (fwd at end of first half)
        . . .
    else if (fwd at end of second half)
        . . .
    else /* end of input */
        terminate processing.
}
• Common case now needs just a single test per character
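The sentinel scheme can be made concrete with a small runnable sketch. The Java class below is an illustration under stated assumptions, not the slides' implementation: the name `SentinelBuffer`, the tiny half size N = 4 (chosen so that reloads actually happen in a short demo), and the use of the character '\0' as the sentinel are all hypothetical choices. It also omits the lexeme-begin pointer that a real scanner would keep.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SentinelBuffer {
    static final int N = 4;              // half size; a real scanner would use a disk block
    static final char SENTINEL = '\0';   // assumed never to occur in the source text

    private final Reader in;
    private final char[] buf = new char[2 * (N + 1)]; // two halves, each with a sentinel slot
    private int fwd = -1;

    SentinelBuffer(Reader in) throws IOException {
        this.in = in;
        load(0);
    }

    // Fill the half that starts at 'start' and put a sentinel right after the data.
    private void load(int start) throws IOException {
        int n = 0;
        while (n < N) {
            int r = in.read(buf, start + n, N - n);
            if (r < 0) break;            // no more input
            n += r;
        }
        buf[start + n] = SENTINEL;
    }

    // Returns the next character, or -1 at end of input.
    int next() throws IOException {
        fwd++;
        while (buf[fwd] == SENTINEL) {           // the single test in the common case
            if (fwd == N) {                      // end of first half: reload second half
                load(N + 1);
                fwd = N + 1;
            } else if (fwd == 2 * N + 1) {       // end of second half: reload first half
                load(0);
                fwd = 0;
            } else {                             // sentinel inside a half: end of input
                fwd--;
                return -1;
            }
        }
        return buf[fwd];
    }

    // Demo helper: push a string through the buffer and collect what comes out.
    static String readAll(String text) {
        try {
            SentinelBuffer sb = new SentinelBuffer(new StringReader(text));
            StringBuilder out = new StringBuilder();
            for (int c = sb.next(); c != -1; c = sb.next()) out.append((char) c);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll("int fee = 12;"));
    }
}
```

The position tests (which half ended, or end of input) run only after the sentinel test succeeds, so an ordinary character costs one comparison.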
Lexical Tokens
• The lexical analyzer scans the input string (the source code) and
attempts to isolate the words (lexemes)
• Lexemes are treated as units (tokens) and passed to the next phase of
compilation
• Some of the words include:
1. Keywords/reserved words: while, if, else, for, int, …
2. Identifiers: constructed by programmers
• May be used to identify variables, classes, functions, constants, etc.
3. Operators: symbols used for arithmetic, logical, or character operations:
+, -, <=, =, …
4. Numeric constants: e.g. integer 43, float 4.25
5. Character constants: single characters or strings of characters enclosed
by quotes
6. Special characters: -, (, ), ,, ;, …
7. Comments
8. White space
9. Newline
Lexical Tokens cont’d
• Example:

int fee = 12; //example comment

• int: keyword (1)
• fee: identifier (2)
• =: operator (3)
• 12: constant (4)
• ;: special character (6)
• //example comment: comment (7)

• Each token consists of two parts:
• Class: indicating which kind of token
• Value: indicating which member of the class
Lexical Tokens cont’d
• Example:
Class   Value
1       [code for int]
2       [pointer to symbol table entry for fee]
3       [code for =]
4       [pointer to constant table entry for 12]
6       [code for ;]

• Note that the lexical analysis phase does not check for proper syntax
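The class/value pairs for the example statement can be produced by a very small hand-written scanner. The Java sketch below is illustrative only: the class `TinyScanner`, its reduced sets of keywords, operators, and special characters, and the decision to store the lexeme text as the value (rather than a code or a table pointer as in the example above) are assumptions. Class numbers follow the numbered list of word kinds given earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TinyScanner {
    // Token classes, numbered as in the list of word kinds.
    static final int KEYWORD = 1, IDENTIFIER = 2, OPERATOR = 3, NUMBER = 4, SPECIAL = 6;
    static final Set<String> KEYWORDS =
        new HashSet<>(Arrays.asList("while", "if", "else", "for", "int"));

    static final class Token {
        final int cls;
        final String value;
        Token(int cls, String value) { this.cls = cls; this.value = value; }
        @Override public String toString() { return cls + ":" + value; }
    }

    static List<Token> scan(String src) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                     // white space is stripped
            } else if (src.startsWith("//", i)) {
                while (i < src.length() && src.charAt(i) != '\n') i++; // comment is stripped
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                String word = src.substring(start, i);
                out.add(new Token(KEYWORDS.contains(word) ? KEYWORD : IDENTIFIER, word));
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                out.add(new Token(NUMBER, src.substring(start, i)));
            } else if (c == '=' || c == '+' || c == '-') {
                out.add(new Token(OPERATOR, String.valueOf(c)));
                i++;
            } else if (c == ';' || c == '(' || c == ')' || c == ',') {
                out.add(new Token(SPECIAL, String.valueOf(c)));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("int fee = 12; //example comment"));
        // prints [1:int, 2:fee, 3:=, 4:12, 6:;]
    }
}
```

The scanner happily tokenizes a malformed statement such as `= int ; 12`, which illustrates the note that lexical analysis does not check syntax.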
Implementation with Finite State Machines

• Finite state machines can be used to simplify lexical analysis
• A finite state machine can be implemented very simply by an array in
which there is a row for each state of the machine and a column for
each possible input symbol
• It may be necessary or desirable to code the states and/or input
symbols as integers, depending on the implementation programming
language
Implementation with Finite State
Machines cont’d
import java.io.IOException;

class Fsm {
    static final int STATES = 2; // sizes assumed here for illustration
    static final int INPUTS = 2;

    public static void main(String[] args) {
        boolean[] accept = new boolean[STATES];
        int[][] fsm = new int[STATES][INPUTS]; // state table
        // initialize table here...
        int inp = 0;   // input symbol (0..INPUTS-1)
        int state = 0; // starting state
        try {
            inp = System.in.read() - '0'; // character input, converted to int
            while (inp >= 0 && inp < INPUTS) {
                state = fsm[state][inp];      // next state
                inp = System.in.read() - '0'; // get next input
            }
        } catch (IOException ioe) {
            System.out.println("IO error " + ioe);
        }
        if (accept[state])
            System.out.println("Accepted");
        else
            System.out.println("Rejected");
    }
}
Examples of Finite State
Machines for Lexical Analysis
• Example 1: a finite state machine which accepts any identifier
beginning with a letter and followed by any number of letters and
digits

• The letter ‘L’ represents any letter (a-z), and the letter ‘D’
represents any numeric digit (0-9)
• This implies that a preprocessor would be needed to convert input
characters to tokens suitable for input to the finite state machine
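The preprocessing step mentioned above can be a function that maps each character to a class (L, D, or anything else) before the table lookup. The Java sketch below is one possible reading of Example 1, since the state diagram itself is not reproduced here: the class `IdentifierFsm`, the state numbering, and the explicit dead state for invalid input are assumptions.

```java
class IdentifierFsm {
    // Input classes produced by the preprocessing step.
    static final int LETTER = 0, DIGIT = 1, OTHER = 2;

    // Rows are states, columns are input classes (LETTER, DIGIT, OTHER).
    static final int[][] NEXT = {
        {1, 2, 2},  // state 0: start, only a letter may begin an identifier
        {1, 1, 2},  // state 1: inside an identifier, letters and digits continue it
        {2, 2, 2},  // state 2: dead state, nothing leads out of it
    };
    static final boolean[] ACCEPT = {false, true, false};

    // Preprocessor: convert a character to the class the machine expects.
    static int classify(char c) {
        if (c >= 'a' && c <= 'z') return LETTER;
        if (c >= '0' && c <= '9') return DIGIT;
        return OTHER;
    }

    static boolean isIdentifier(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            state = NEXT[state][classify(c)];
        }
        return ACCEPT[state];
    }

    public static void main(String[] args) {
        for (String s : new String[] {"sum", "x2y", "2x", ""}) {
            System.out.println("'" + s + "' " + (isIdentifier(s) ? "accepted" : "rejected"));
        }
    }
}
```

Classifying characters first keeps the table at three columns instead of one column per keyboard character.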
Examples of Finite State
Machines for Lexical Analysis
cont’d
• Example 2: A finite state machine which accepts
numeric constants.
• These constants must begin with a digit, and
numbers such as .099 are not acceptable

Examples of Finite State
Machines for Lexical Analysis
cont’d
• Example 3: FSM that accepts the keywords if, int,
import, for, float.

Actions for Finite State Machines
• Lexical analysis involves more than simply
recognizing words
• It may involve:
• Building a symbol table
• Converting numeric constants to the appropriate data
type and
• Putting out tokens.
• For this reason, we wish to associate an action, or
function to be invoked, with each state transition in
the finite state machine

Actions for Finite State Machines
cont’d

• Design a finite state machine, with actions, to read numeric strings
and convert them to an appropriate internal format, such as floating
point
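One way to approach this design exercise is sketched below in Java. It is an assumption-laden illustration rather than the intended solution: the class `NumberFsm`, the five states, and the rule that a decimal point must be followed by at least one digit are choices made here. The action attached to every digit transition accumulates the digits in an integer and counts those after the point, and the final division produces the floating-point value. Very long digit strings would overflow the accumulator; that case is not handled.

```java
class NumberFsm {
    static final int START = 0, INT = 1, DOT = 2, FRAC = 3, DEAD = 4;
    static final int DIGIT = 0, POINT = 1, OTHER = 2;

    // Rows are states, columns are input classes (DIGIT, POINT, OTHER).
    static final int[][] NEXT = {
        {INT,  DEAD, DEAD},  // START: a constant must begin with a digit
        {INT,  DOT,  DEAD},  // INT: more digits, or a decimal point
        {FRAC, DEAD, DEAD},  // DOT: a digit must follow the point
        {FRAC, DEAD, DEAD},  // FRAC: more fraction digits
        {DEAD, DEAD, DEAD},  // DEAD: invalid input stays invalid
    };

    static double parse(String s) {
        int state = START;
        long digits = 0;     // all digits seen so far, read as one integer
        int fracDigits = 0;  // how many of them came after the point
        for (char c : s.toCharArray()) {
            int cls = (c >= '0' && c <= '9') ? DIGIT : (c == '.') ? POINT : OTHER;
            int next = NEXT[state][cls];
            // Action for digit transitions: extend the accumulated value.
            if (cls == DIGIT && next != DEAD) {
                digits = digits * 10 + (c - '0');
                if (next == FRAC) fracDigits++;
            }
            state = next;
        }
        if (state != INT && state != FRAC) {
            throw new IllegalArgumentException("not a numeric constant: " + s);
        }
        return digits / Math.pow(10, fracDigits);
    }

    public static void main(String[] args) {
        System.out.println(parse("43"));
        System.out.println(parse("4.25"));
    }
}
```

Keeping the digits in an integer until the end avoids accumulating rounding error one fraction digit at a time.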
Lexical Tables
• One of the most important functions of the lexical
analysis phase is the creation of tables which are used
later in the compiler. These include:
• Symbol table for identifiers
• Table of numeric constants
• Table of string constants
• Table of statement labels
• Table of line numbers
• These can be implemented using:
• Sequential search
• Binary search tree
• Hash table
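Of the three implementation options, the hash table is the simplest to sketch with a standard library. The Java example below is a minimal illustration, assuming that a symbol table entry is just the identifier's name and that the scanner wants back a stable index for it; the class `SymbolTable` and the method `install` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SymbolTable {
    private final Map<String, Integer> index = new HashMap<>(); // name -> entry number
    private final List<String> names = new ArrayList<>();       // entry number -> name

    // Returns the entry number for name, creating the entry on first sight.
    int install(String name) {
        Integer i = index.get(name);
        if (i == null) {
            i = names.size();
            names.add(name);
            index.put(name, i);
        }
        return i;
    }

    String nameAt(int entry) {
        return names.get(entry);
    }

    int size() {
        return names.size();
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        System.out.println(table.install("fee"));
        System.out.println(table.install("sum"));
        System.out.println(table.install("fee")); // same entry as the first call
    }
}
```

A token's value field can then hold the entry number, so later phases find the identifier without comparing strings again.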