
Lecture 03

Lexical analysis partitions a program string into tokens. It identifies tokens by regular expressions that describe the strings belonging to each token class. Regular expressions use operators like concatenation, union, and Kleene star to build up complex expressions from atomic character classes. Lookahead may be required to disambiguate tokens. Languages like FORTRAN and PL/I introduced ambiguities that made lexical analysis more complex.


Lexical Analysis

CS143
Lecture 3

Instructor: Fredrik Kjolstad


Slide design by Prof. Alex Aiken, with modifications
Outline

• Informal sketch of lexical analysis
  – Identifies tokens in input string

• Issues in lexical analysis
  – Lookahead
  – Ambiguities

• Specifying lexers (aka scanners)
  – By regular expressions (aka regex)
  – Examples of regular expressions

2
Lexical Analysis

• What do we want to do? Example:

  if (i == j)
    z = 0;
  else
    z = 1;

• The input is just a string of characters:

  \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Goal: Partition the input string into substrings
  – The substrings are called tokens

3
What’s a Token?

• A syntactic category
– In English:
noun, verb, adjective, …

– In a programming language:
Identifier, Integer, Keyword, Whitespace, …

4
Tokens

• A token class corresponds to a set of strings, often an infinite set

• Examples
  – Identifier: strings of letters or digits, starting with a letter
    (an infinite set: var1, i, ports, foo, Person, …)
  – Integer: a non-empty string of digits
  – Keyword: “else” or “if” or “begin” or …
  – Whitespace: a non-empty sequence of blanks, newlines, and tabs

5
What are Tokens For?

• Classify program substrings according to role

• Lexical analysis produces a stream of tokens…

• … which is the input to the parser

• The parser relies on token distinctions
  – An identifier is treated differently than a keyword

6
Designing a Lexical Analyzer: Step 1

• Define a finite set of tokens
  – Tokens describe all items of interest
    • Identifiers, integers, keywords
  – The choice of tokens depends on
    • the language
    • the design of the parser

7
Example

• Recall
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Useful tokens for this expression:

  Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;

• N.B.: (, ), =, ; above are tokens, not characters

8
Designing a Lexical Analyzer: Step 2

• Describe which strings belong to each token

• Recall:
– Identifier: strings of letters or digits, starting with a letter
– Integer: a non-empty string of digits
– Keyword: “else” or “if” or “begin” or …
– Whitespace: a non-empty sequence of blanks,
newlines, and tabs

9
Lexical Analyzer: Implementation

• An implementation must do two things:

  1. Classify each substring as a token
  2. Return the lexeme (value) of the token
     – The lexeme is the actual substring
     – From the set of substrings that make up the token class

• The lexer thus returns token-lexeme pairs
  – And potentially also line numbers, file names, etc., to improve later error messages

10
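The two duties above can be sketched as a loop that repeatedly matches token patterns at the current position. A minimal sketch in Python; the token classes and patterns here are illustrative assumptions, not the course's reference implementation:

```python
import re

# Hypothetical token classes, tried in order: keywords before
# identifiers, and '==' before '=' so the longer lexeme wins.
TOKEN_SPECS = [
    ("Whitespace", r"[ \t\n]+"),
    ("Keyword",    r"(if|else|begin)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Relation",   r"=="),
    ("Assign",     r"="),
    ("Punct",      r"[();]"),
]

def lex(s):
    """Partition s into (token, lexeme) pairs, scanning left to right."""
    pairs, pos = [], 0
    while pos < len(s):
        for token, pattern in TOKEN_SPECS:
            m = re.match(pattern, s[pos:])
            if m:
                pairs.append((token, m.group(0)))
                pos += len(m.group(0))
                break
        else:
            raise SyntaxError(f"illegal character at position {pos}")
    return pairs
```

For example, lex("i==j") yields [("Identifier", "i"), ("Relation", "=="), ("Identifier", "j")].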
Example

• Recall:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

11
Lexical Analyzer: Implementation

• The lexer usually discards “uninteresting” tokens that don’t contribute to parsing

• Examples: Whitespace, Comments

12
True Crimes of Lexical Analysis

• Is it as easy as it sounds?

• Sort of… if you do not make it hard!

• Look at some history

13
Lexical Analysis in FORTRAN

• FORTRAN rule: Whitespace is insignificant

• E.g., VAR1 is the same as VA R1

• A terrible design!

• Historical footnote: the FORTRAN whitespace rule was motivated by the inaccuracy of punch card operators

14
FORTRAN Example

• Consider
  – DO 5 I = 1,25   (a DO loop over I from 1 to 25)
  – DO 5 I = 1.25   (an assignment: with whitespace insignificant, this is DO5I = 1.25)

• The lexer cannot tell them apart until it reaches the , or the . far to the right

15
Lexical Analysis in FORTRAN (Cont.)

• Two important points:

  1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
  2. “Lookahead” may be required to decide where one token ends and the next token begins

16
Lookahead

• Even our simple example has lookahead issues
  – i vs. if
  – = vs. ==

17
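Both of these cases need only a single character of lookahead. A sketch in Python of the = vs. == decision, assuming a simple cursor-based scanner (the function name is hypothetical):

```python
def scan_equals(s, pos):
    """At s[pos] == '=', peek one character ahead to choose the token."""
    assert s[pos] == "="
    if pos + 1 < len(s) and s[pos + 1] == "=":
        return ("Relation", "==", pos + 2)  # consume both characters
    return ("Assign", "=", pos + 1)         # consume just the one
```

The i vs. if case is handled the same way: after reading i, peek at the next character to see whether the identifier continues.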
Lexical Analysis in PL/I

• PL/I keywords are not reserved
  – So this is a legal PL/I statement:

  IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN

18
Lexical Analysis in PL/I (Cont.)

• PL/I declarations:

  DECLARE (ARG1, . . ., ARGN)

• Cannot tell whether DECLARE is a keyword or an array reference until after the )
  – Requires arbitrary lookahead!

19
Lexical Analysis in C++

• Unfortunately, the problems continue today

• C++ template syntax:
  Foo<Bar>

• C++ stream syntax:
  cin >> var;

• But there is a conflict with nested templates:
  Foo<Bar<Bazz>>
  – Here the trailing >> closes two templates; it is not the stream operator

20


Review

• The goal of lexical analysis is to
  – Partition the input string into lexemes
  – Identify the token of each lexeme

• Left-to-right scan => lookahead is sometimes required
21
Next

• We still need
  – A way to describe the lexemes of each token
  – A way to resolve ambiguities
    • Is if two variables i and f?
    • Is == two equal signs = =?

22
Regular Languages

• There are several formalisms for specifying tokens

• Regular languages are the most popular
  – Simple and useful theory
  – Easy to understand
  – Efficient implementations

23
Languages

Def. Let alphabet Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.

24
Examples of Languages

• Alphabet = English characters
  Language = English sentences
  – Not every string of English characters is an English sentence

• Alphabet = ASCII characters
  Language = C programs
  – Note: the ASCII character set is different from the English character set

25
Notation

• Languages are sets of strings

• We need some notation for specifying which sets we want

• The standard notation for regular languages is regular expressions

26
Atomic Regular Expressions

• Single character
  'c' = {"c"}

• Epsilon
  ε = {""}
  – Not the empty set, but the set containing a single, empty, string

27
Compound Regular Expressions

• Union
  A + B = {s | s ∈ A or s ∈ B}

• Concatenation
  AB = {ab | a ∈ A and b ∈ B}

• Iteration
  A* = ∪i≥0 A^i, where A^i = AA . . . A (i times)
28
Regular Expressions

• Def. The regular expressions over Σ are the smallest set of expressions including

  ε
  'c'     where c ∈ Σ
  A + B   where A, B are rexps over Σ
  AB      where A, B are rexps over Σ
  A*      where A is a rexp over Σ
29
Syntax vs. Semantics

• The notation so far was imprecise

  AB = {ab | a ∈ A and b ∈ B}

  – On the left-hand side, A and B are pieces of syntax; on the right-hand side, A and B are sets (the semantics of the syntax)

30
Syntax vs. Semantics

• A regular expression is syntax (the label on a box); its language is semantics (the contents of the box):

  Syntax: 'a' + 'b'   Semantics: L('a' + 'b') = {"a", "b"}
  Syntax: 'a'*        Semantics: L('a'*) = {ε, "a", "aa", "aaa", ...}
31
Syntax vs. Semantics

• To be careful, we distinguish syntax and semantics: L maps a regular expression (syntax) to the language it denotes (semantics)

  L(ε) = {""}
  L('c') = {"c"}
  L(A + B) = L(A) ∪ L(B)
  L(AB) = {ab | a ∈ L(A) and b ∈ L(B)}
  L(A*) = ∪i≥0 L(A)^i

32
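The equations above can be executed directly on a small regular-expression AST. A sketch in Python, assuming a hypothetical tuple encoding of rexps; since L(A*) is infinite, the star is truncated to a fixed number of iterations:

```python
def L(rexp, star_depth=3):
    """Denotation function following the slide's equations.
    rexp encodings (assumed): ("eps",), ("chr", c),
    ("+", A, B), ("cat", A, B), ("*", A)."""
    op = rexp[0]
    if op == "eps":
        return {""}
    if op == "chr":
        return {rexp[1]}
    if op == "+":                       # L(A + B) = L(A) ∪ L(B)
        return L(rexp[1], star_depth) | L(rexp[2], star_depth)
    if op == "cat":                     # L(AB) = {ab | a ∈ L(A), b ∈ L(B)}
        return {a + b for a in L(rexp[1], star_depth)
                      for b in L(rexp[2], star_depth)}
    if op == "*":                       # L(A*) = ∪ L(A)^i, truncated
        base, result, power = L(rexp[1], star_depth), {""}, {""}
        for _ in range(star_depth):
            power = {a + b for a in power for b in base}
            result |= power
        return result
    raise ValueError(f"unknown operator {op!r}")
```

For instance, L(("+", ("chr", "a"), ("chr", "b"))) is {"a", "b"}, and L(("*", ("chr", "a")), 2) is {"", "a", "aa"}.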
Segue

• Regular expressions are simple, almost trivial
  – But they are useful!

• We will describe tokens using regular expressions

33
Example: Keyword

Keyword: “else” or “if” or “begin” or …

‘else’ + ‘if’ + ‘begin’ + . . .

Abbreviation: ‘else’ = ‘e’ ‘l’ ‘s’ ‘e’

34
Example: Integers

Integer: a non-empty string of digits

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
integer = digit digit*

Abbreviation: A+ = AA*
Abbreviation: [0-2] = '0' + '1' + '2'

35
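In the regex syntax used by most tools, digit becomes a character class and digit digit* becomes +. A quick check with Python's re module (a sketch in tool notation, not the slide's formal notation):

```python
import re

# integer = digit digit*, i.e. digit+, a non-empty string of digits
integer = re.compile(r"[0-9][0-9]*")

assert integer.fullmatch("2024")
assert not integer.fullmatch("")     # must be non-empty
assert not integer.fullmatch("12a")  # digits only
```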
Example: Identifier

Identifier: strings of letters or digits, starting with a letter

  letter = ‘A’ + . . . + ‘Z’ + ‘a’ + . . . + ‘z’
  identifier = letter (letter + digit)*

Is (letter* + digit*) the same as (letter + digit)*?

36
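The closing question can be checked mechanically; a sketch with Python's re suggests the answer is no, because letter* + digit* offers all-letters or all-digits but never a mix:

```python
import re

left  = re.compile(r"[a-zA-Z]*|[0-9]*")   # letter* + digit*
right = re.compile(r"([a-zA-Z]|[0-9])*")  # (letter + digit)*

assert right.fullmatch("var1")            # letters and digits mixed
assert not left.fullmatch("var1")         # left side cannot mix them
assert left.fullmatch("var") and left.fullmatch("1")
```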
Example: Whitespace

Whitespace: a non-empty sequence of blanks, newlines, and tabs

  (' ' + '\n' + '\t')+

37
Example: Phone Numbers

• Regular expressions are all around you!

• Consider (650)-723-3232

  Σ = digits ∪ {-, (, )}
  exchange = digit^3
  phone = digit^4
  area = digit^3
  phone_number = '(' area ')-' exchange '-' phone
38
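The same definitions translate directly into tool-style regex syntax, where digit^3 becomes {3}; a sketch with Python's re:

```python
import re

# phone_number = '(' area ')-' exchange '-' phone
phone_number = re.compile(r"\([0-9]{3}\)-[0-9]{3}-[0-9]{4}")

assert phone_number.fullmatch("(650)-723-3232")
assert not phone_number.fullmatch("650-723-3232")  # missing parentheses
```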
Example: Email Addresses

• Consider [email protected]

Σ = letters ∪ {., @}
name = letter+
address = name '@' name '.' name '.' name

39
Example: Unsigned Pascal Numbers

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
digits = digit+
opt_fraction = ('.' digits) + ε
opt_exponent = ('E' ('+' + '-' + ε) digits) + ε
num = digits opt_fraction opt_exponent

40
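The definitions above compose the same way in Python's re syntax; note how '+ ε' becomes the ? operator making a group optional (a sketch, not the official Pascal grammar):

```python
import re

digits       = r"[0-9]+"                 # digit+
opt_fraction = rf"(\.{digits})?"         # ('.' digits) + ε
opt_exponent = rf"(E[+-]?{digits})?"     # ('E' ('+' + '-' + ε) digits) + ε
num = re.compile(digits + opt_fraction + opt_exponent)

assert num.fullmatch("42")
assert num.fullmatch("3.14")
assert num.fullmatch("6.02E+23")
assert not num.fullmatch(".5")           # fraction requires leading digits
```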
Other Examples

• File names
• Grep tool family

41
Summary

• Regular expressions describe many useful languages
  – We will look at non-regular languages next week

• Regular languages are a language specification
  – We still need an implementation

• Next time: Given a string s and a rexp R, is s ∈ L(R)?
42
