
Lecture 03

Lexical analysis partitions a program string into tokens. It identifies tokens by regular expressions that describe the strings belonging to each token class. Regular expressions use operators like concatenation, union, and Kleene star to build up complex expressions from atomic character classes. Lookahead may be required to disambiguate tokens. Languages like FORTRAN and PL/I introduced ambiguities that made lexical analysis more complex.


Lexical Analysis

CS143
Lecture 3

Instructor: Fredrik Kjolstad


Slide design by Prof. Alex Aiken, with modifications
Outline

• Informal sketch of lexical analysis
  – Identifies tokens in input string

• Issues in lexical analysis
  – Lookahead
  – Ambiguities

• Specifying lexers (aka scanners)
  – By regular expressions (aka regex)
  – Examples of regular expressions

2
Lexical Analysis

• What do we want to do? Example:

  if (i == j)
    z = 0;
  else
    z = 1;

• The input is just a string of characters:

  \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Goal: Partition the input string into substrings
  – The substrings are called tokens

3
What’s a Token?

• A syntactic category
– In English:
noun, verb, adjective, …

– In a programming language:
Identifier, Integer, Keyword, Whitespace, …

4
Tokens

• A token class corresponds to a set of strings, often an infinite set

• Examples
  – Identifier: strings of letters or digits, starting with a letter
    (an infinite set: var1, i, ports, foo, Person, …)
  – Integer: a non-empty string of digits
  – Keyword: “else” or “if” or “begin” or …
  – Whitespace: a non-empty sequence of blanks, newlines, and tabs

5
What are Tokens For?

• Classify program substrings according to role

• Lexical analysis produces a stream of tokens…

• … which is the input to the parser

• The parser relies on token distinctions
  – An identifier is treated differently than a keyword

6
Designing a Lexical Analyzer: Step 1

• Define a finite set of tokens
  – Tokens describe all items of interest
    • Identifiers, integers, keywords
  – The choice of tokens depends on
    • the language
    • the design of the parser

7
Example

• Recall
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Useful tokens for this expression:

  Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;

• N.B.: (, ), =, ; above are tokens, not characters

8
Designing a Lexical Analyzer: Step 2

• Describe which strings belong to each token

• Recall:
– Identifier: strings of letters or digits, starting with a letter
– Integer: a non-empty string of digits
– Keyword: “else” or “if” or “begin” or …
– Whitespace: a non-empty sequence of blanks,
newlines, and tabs

9
Lexical Analyzer: Implementation

• An implementation must do two things:

  1. Classify each substring as a token
  2. Return the lexeme (value) of the token
     – The lexeme is the actual substring
     – From the set of substrings that make up the token class

• The lexer thus returns token-lexeme pairs
  – And potentially also line numbers, file names, etc., to improve later error messages

10
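The two duties above can be sketched as a loop that repeatedly matches token patterns at the current position. A minimal sketch in Python; the token classes and patterns here are illustrative assumptions, not the course's reference implementation:

```python
import re

# Hypothetical token classes, tried in order: keywords before
# identifiers, and '==' before '=' so the longer lexeme wins.
TOKEN_SPECS = [
    ("Whitespace", r"[ \t\n]+"),
    ("Keyword",    r"(if|else|begin)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Relation",   r"=="),
    ("Assign",     r"="),
    ("Punct",      r"[();]"),
]

def lex(s):
    """Partition s into (token, lexeme) pairs, scanning left to right."""
    pairs, pos = [], 0
    while pos < len(s):
        for token, pattern in TOKEN_SPECS:
            m = re.match(pattern, s[pos:])
            if m:
                pairs.append((token, m.group(0)))
                pos += len(m.group(0))
                break
        else:
            raise SyntaxError(f"illegal character at position {pos}")
    return pairs
```

For example, lex("i==j") yields [("Identifier", "i"), ("Relation", "=="), ("Identifier", "j")].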
Example

• Recall:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

11
Lexical Analyzer: Implementation

• The lexer usually discards “uninteresting” tokens that don’t contribute to parsing

• Examples: Whitespace, Comments

12
True Crimes of Lexical Analysis

• Is it as easy as it sounds?

• Sort of… if you do not make it hard!

• Look at some history

13
Lexical Analysis in FORTRAN

• FORTRAN rule: Whitespace is insignificant

• E.g., VAR1 is the same as VA R1

• A terrible design!

• Historical footnote: the FORTRAN whitespace rule was motivated by the inaccuracy of punch card operators

14
FORTRAN Example

• Consider
  – DO 5 I = 1,25   (a DO loop over I from 1 to 25)
  – DO 5 I = 1.25   (an assignment: with whitespace insignificant, this is DO5I = 1.25)

• The lexer cannot tell them apart until it reaches the , or the . far to the right

15
Lexical Analysis in FORTRAN (Cont.)

• Two important points:

  1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
  2. “Lookahead” may be required to decide where one token ends and the next token begins

16
Lookahead

• Even our simple example has lookahead issues
  – i vs. if
  – = vs. ==

17
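Both of these cases need only a single character of lookahead. A sketch in Python of the = vs. == decision, assuming a simple cursor-based scanner (the function name is hypothetical):

```python
def scan_equals(s, pos):
    """At s[pos] == '=', peek one character ahead to choose the token."""
    assert s[pos] == "="
    if pos + 1 < len(s) and s[pos + 1] == "=":
        return ("Relation", "==", pos + 2)  # consume both characters
    return ("Assign", "=", pos + 1)         # consume just the one
```

The i vs. if case is handled the same way: after reading i, peek at the next character to see whether the identifier continues.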
Lexical Analysis in PL/I

• PL/I keywords are not reserved
  – So this is a legal PL/I statement:

  IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN

18
Lexical Analysis in PL/I (Cont.)

• PL/I declarations:

  DECLARE (ARG1, . . ., ARGN)

• Cannot tell whether DECLARE is a keyword or an array reference until after the )
  – Requires arbitrary lookahead!

19
Lexical Analysis in C++

• Unfortunately, the problems continue today

• C++ template syntax:
  Foo<Bar>

• C++ stream syntax:
  cin >> var;

• But there is a conflict with nested templates:
  Foo<Bar<Bazz>>
  – Here the trailing >> closes two templates; it is not the stream operator

20


Review

• The goal of lexical analysis is to
  – Partition the input string into lexemes
  – Identify the token of each lexeme

• Left-to-right scan => lookahead is sometimes required
21
Next

• We still need
  – A way to describe the lexemes of each token
  – A way to resolve ambiguities
    • Is if two variables i and f?
    • Is == two equal signs = =?

22
Regular Languages

• There are several formalisms for specifying tokens

• Regular languages are the most popular
  – Simple and useful theory
  – Easy to understand
  – Efficient implementations

23
Languages

Def. Let alphabet Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.

24
Examples of Languages

• Alphabet = English characters
  Language = English sentences
  – Not every string of English characters is an English sentence

• Alphabet = ASCII characters
  Language = C programs
  – Note: the ASCII character set is different from the English character set

25
Notation

• Languages are sets of strings

• We need some notation for specifying which sets we want

• The standard notation for regular languages is regular expressions

26
Atomic Regular Expressions

• Single character
  'c' = {"c"}

• Epsilon
  ε = {""}
  – Not the empty set, but the set containing a single, empty, string

27
Compound Regular Expressions

• Union
  A + B = {s | s ∈ A or s ∈ B}

• Concatenation
  AB = {ab | a ∈ A and b ∈ B}

• Iteration
  A* = ∪i≥0 A^i, where A^i = AA . . . A (i times)
28
Regular Expressions

• Def. The regular expressions over Σ are the smallest set of expressions including

  ε
  'c'     where c ∈ Σ
  A + B   where A, B are rexps over Σ
  AB      where A, B are rexps over Σ
  A*      where A is a rexp over Σ
29
Syntax vs. Semantics

• The notation so far was imprecise

  AB = {ab | a ∈ A and b ∈ B}

  – On the left-hand side, A and B are pieces of syntax; on the right-hand side, A and B are sets (the semantics of the syntax)

30
Syntax vs. Semantics

• A regular expression is syntax (the label on a box); its language is semantics (the contents of the box):

  Syntax: 'a' + 'b'   Semantics: L('a' + 'b') = {"a", "b"}
  Syntax: 'a'*        Semantics: L('a'*) = {ε, "a", "aa", "aaa", ...}
31
Syntax vs. Semantics

• To be careful, we distinguish syntax and semantics: L maps a regular expression (syntax) to the language it denotes (semantics)

  L(ε) = {""}
  L('c') = {"c"}
  L(A + B) = L(A) ∪ L(B)
  L(AB) = {ab | a ∈ L(A) and b ∈ L(B)}
  L(A*) = ∪i≥0 L(A)^i

32
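The equations above can be executed directly on a small regular-expression AST. A sketch in Python, assuming a hypothetical tuple encoding of rexps; since L(A*) is infinite, the star is truncated to a fixed number of iterations:

```python
def L(rexp, star_depth=3):
    """Denotation function following the slide's equations.
    rexp encodings (assumed): ("eps",), ("chr", c),
    ("+", A, B), ("cat", A, B), ("*", A)."""
    op = rexp[0]
    if op == "eps":
        return {""}
    if op == "chr":
        return {rexp[1]}
    if op == "+":                       # L(A + B) = L(A) ∪ L(B)
        return L(rexp[1], star_depth) | L(rexp[2], star_depth)
    if op == "cat":                     # L(AB) = {ab | a ∈ L(A), b ∈ L(B)}
        return {a + b for a in L(rexp[1], star_depth)
                      for b in L(rexp[2], star_depth)}
    if op == "*":                       # L(A*) = ∪ L(A)^i, truncated
        base, result, power = L(rexp[1], star_depth), {""}, {""}
        for _ in range(star_depth):
            power = {a + b for a in power for b in base}
            result |= power
        return result
    raise ValueError(f"unknown operator {op!r}")
```

For instance, L(("+", ("chr", "a"), ("chr", "b"))) is {"a", "b"}, and L(("*", ("chr", "a")), 2) is {"", "a", "aa"}.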
Segue

• Regular expressions are simple, almost trivial
  – But they are useful!

• We will describe tokens using regular expressions

33
Example: Keyword

Keyword: “else” or “if” or “begin” or …

‘else’ + ‘if’ + ‘begin’ + . . .

Abbreviation: ‘else’ = ‘e’ ‘l’ ‘s’ ‘e’

34
Example: Integers

Integer: a non-empty string of digits

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
integer = digit digit*

Abbreviation: A+ = AA*
Abbreviation: [0-2] = '0' + '1' + '2'

35
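In the regex syntax used by most tools, digit becomes a character class and digit digit* becomes +. A quick check with Python's re module (a sketch in tool notation, not the slide's formal notation):

```python
import re

# integer = digit digit*, i.e. digit+, a non-empty string of digits
integer = re.compile(r"[0-9][0-9]*")

assert integer.fullmatch("2024")
assert not integer.fullmatch("")     # must be non-empty
assert not integer.fullmatch("12a")  # digits only
```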
Example: Identifier

Identifier: strings of letters or digits, starting with a letter

  letter = ‘A’ + . . . + ‘Z’ + ‘a’ + . . . + ‘z’
  identifier = letter (letter + digit)*

Is (letter* + digit*) the same as (letter + digit)*?

36
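The closing question can be checked mechanically; a sketch with Python's re suggests the answer is no, because letter* + digit* offers all-letters or all-digits but never a mix:

```python
import re

left  = re.compile(r"[a-zA-Z]*|[0-9]*")   # letter* + digit*
right = re.compile(r"([a-zA-Z]|[0-9])*")  # (letter + digit)*

assert right.fullmatch("var1")            # letters and digits mixed
assert not left.fullmatch("var1")         # left side cannot mix them
assert left.fullmatch("var") and left.fullmatch("1")
```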
Example: Whitespace

Whitespace: a non-empty sequence of blanks, newlines, and tabs

  (' ' + '\n' + '\t')+

37
Example: Phone Numbers

• Regular expressions are all around you!

• Consider (650)-723-3232

  Σ = digits ∪ {-, (, )}
  exchange = digit^3
  phone = digit^4
  area = digit^3
  phone_number = '(' area ')-' exchange '-' phone
38
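The same definitions translate directly into tool-style regex syntax, where digit^3 becomes {3}; a sketch with Python's re:

```python
import re

# phone_number = '(' area ')-' exchange '-' phone
phone_number = re.compile(r"\([0-9]{3}\)-[0-9]{3}-[0-9]{4}")

assert phone_number.fullmatch("(650)-723-3232")
assert not phone_number.fullmatch("650-723-3232")  # missing parentheses
```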
Example: Email Addresses

• Consider [email protected]

Σ = letters ∪ {., @}
name = letter+
address = name '@' name '.' name '.' name

39
Example: Unsigned Pascal Numbers

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
digits = digit+
opt_fraction = ('.' digits) + ε
opt_exponent = ('E' ('+' + '-' + ε) digits) + ε
num = digits opt_fraction opt_exponent

40
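The definitions above compose the same way in Python's re syntax; note how '+ ε' becomes the ? operator making a group optional (a sketch, not the official Pascal grammar):

```python
import re

digits       = r"[0-9]+"                 # digit+
opt_fraction = rf"(\.{digits})?"         # ('.' digits) + ε
opt_exponent = rf"(E[+-]?{digits})?"     # ('E' ('+' + '-' + ε) digits) + ε
num = re.compile(digits + opt_fraction + opt_exponent)

assert num.fullmatch("42")
assert num.fullmatch("3.14")
assert num.fullmatch("6.02E+23")
assert not num.fullmatch(".5")           # fraction requires leading digits
```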
Other Examples

• File names
• Grep tool family

41
Summary

• Regular expressions describe many useful languages
  – We will look at non-regular languages next week

• Regular languages are a language specification
  – We still need an implementation

• Next time: Given a string s and a rexp R, is s ∈ L(R)?
42
