
Introduction to Lexical Analysis

[Figure: Source code → Front-End (Lexical Analysis) → IR → Back-End → Object code]

Lexical Analysis:
• reads characters and produces sequences of tokens.
Today’s lecture:
Towards automated Lexical Analysis.

2 Dec 2024 1
The Big Picture
First step in any translation: determine whether the text to be
translated is well constructed (hence formal languages, rather than
natural languages) in terms of the input language. Syntax is
specified with parts of speech - syntax checking matches parts of
speech against a grammar.

What does lexical analysis do?


Recognises the language’s parts of speech.

Language
• A language, in general, consists of three components: an alphabet (letters),
words and sentences.
• Examples:
English                    Computer languages
Letters of the alphabet    ASCII characters
Words in a dictionary      Keywords, user-defined identifiers, etc.
Sentences                  Statements (e.g. assignment statements,
                           conditional statements, etc.)
• A language L has an alphabet, denoted by ∑ = {a, b, …, Z}.
• A language L has a set of words (strings), denoted by Γ (gamma).
• A language L has sentences.
In a language, ∑ (must be) and Γ (may be) are both finite sets,
but the set of sentences can be infinite. Ex: i) You ate an apple.
ii) You ate two apples.
iii) You ate two thousand and five…
Hence, we need a finite way to define languages that may contain an
infinite number of words and sentences.
In a formal language:
• Alphabet is a finite set of symbols.
• A string is any finite sequence of symbols from alphabet.
• A grammar is a finite way of describing a language.
• A (e.g. context-free) grammar, G, is a 4-tuple, G=(S,V,T,P), where:
S: starting symbol
V: set of non-terminal symbols, or variables
T: set of terminal symbols
P: set of production rules
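• A language is the set of all strings of terminal symbols that can be
derived from S using the productions of G.
• Example: to identify numbers, we may write the grammar as
G = (S, V, T, P) in which V: {N, D, F};
T: {., 0, 1, 2, …, 9}
P: { N → DN | NFN | D
     D → 0 | 1 | 2 | … | 9
     F → .   (assumed; this rule is not listed, but F is the only way to produce ".") }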

Tokens, Lexemes, Patterns
• Tokens
A token is a sequence of characters treated as a single unit and
classified by kind. Tokens may be a) identifiers, b) keywords,
c) operators, d) special symbols, e) constants, and so on.
• Lexemes
A lexeme is a sequence of characters in the source that is matched
by the pattern (i.e. RE) for a token.
• Patterns
A pattern is a rule describing the lexemes of a token. Patterns are
specified using regular expressions.
Ex: letter(letter | digit)*
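
As a rough illustration, the pattern above can be tried out with Python's standard re module. The names IDENTIFIER and is_identifier are made up for this sketch, and letter and digit are assumed to mean ASCII letters and digits.

```python
import re

# letter(letter | digit)* with letter = [A-Za-z] and digit = [0-9] (an assumption)
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(lexeme: str) -> bool:
    """Return True if the whole lexeme matches the identifier pattern."""
    return IDENTIFIER.fullmatch(lexeme) is not None


print(is_identifier("div"))  # True
print(is_identifier("x1"))   # True
print(is_identifier("1x"))   # False: must start with a letter
```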
Tokenization
• The process of forming tokens from an input
stream is called tokenization.
Ex: div = 6/2;

What is an attribute for a token?
• The lexical analyser provides additional
information about a particular lexeme.
Ex: y = 4*x + 5;
The token stream (leaving out the terminating ;) should be:
<id, y> <op, => <num, 4> <op, *> <id, x> <op, +> <num, 5>
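
A minimal sketch of how such <class, lexeme> pairs could be produced, using Python's re module. The class names id, num and op follow the slide; the sep and skip classes, the operator set, and the names TOKEN_SPEC, MASTER and tokenize are assumptions made for this example.

```python
import re

# Token classes and their patterns, tried in this order.
TOKEN_SPEC = [
    ("num", r"[0-9]+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("op", r"[=+\-*/]"),
    ("sep", r";"),      # hypothetical class for the statement terminator
    ("skip", r"\s+"),   # white space produces no token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token class, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("y = 4*x + 5;"))
```

The lexeme stored next to each class is the token's attribute; a real scanner might store a symbol-table index or a numeric value instead.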

2 Dec 2024 COMP36512 Lecture 3 8


Why all this?
• Why study lexical analysis?
– To avoid writing lexical analysers (scanners) by hand.
– To simplify specification and implementation.
– To understand the underlying techniques and technologies.
• We want to specify lexical patterns (to derive tokens):
– Some parts are easy:
• WhiteSpace → blank | tab | WhiteSpace blank | WhiteSpace tab
• Keywords and operators (if, then, =, +)
• Comments (/* followed by */ in C, // in C++, % in LaTeX, ...)
– Some parts are more complex:
• Identifiers (letter followed by - up to n - alphanumerics…)
• Numbers
We need a notation that could lead to an implementation!
Building a Lexical Analyser by hand
Based on the specifications of tokens through regular expressions we
can write a lexical analyser. One approach is to check case by case
and split into smaller problems that can be solved ad hoc. Example:
void get_next_token() {
    c = input_char();
    if (is_eof(c)) { token ← (EOF, "eof"); return; }
    if (is_letter(c)) { recognise_id(); }
    else if (is_digit(c)) { recognise_number(); }
    else if (is_operator(c) || is_separator(c))
        { token ← (c, c); }      // single char assumed
    else { token ← (ERROR, c); }
    return;
}
...
do {
    get_next_token();
    print(token.class, token.attribute);
} while (token.class != EOF);

Can be efficient; but requires a lot of work and may be difficult to modify!
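
The case-by-case approach above might look roughly like this in Python. This is only a sketch: the Scanner class, the scan helper, the operator and separator sets, and the white-space skipping step are assumptions added for the example and do not come from the slide.

```python
EOF, ID, NUM, ERROR = "EOF", "ID", "NUM", "ERROR"
OPERATORS = set("=+-*/")     # assumed operator characters
SEPARATORS = set(";(),")     # assumed separator characters


class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _run(self, predicate):
        """Consume characters while predicate holds and return the lexeme."""
        start = self.pos
        while self.pos < len(self.text) and predicate(self.text[self.pos]):
            self.pos += 1
        return self.text[start:self.pos]

    def get_next_token(self):
        self._run(str.isspace)                   # skip white space (added step)
        if self.pos >= len(self.text):
            return (EOF, "eof")
        c = self.text[self.pos]
        if c.isalpha():                          # plays the role of recognise_id
            return (ID, self._run(str.isalnum))
        if c.isdigit():                          # plays the role of recognise_number
            return (NUM, self._run(str.isdigit))
        self.pos += 1
        if c in OPERATORS or c in SEPARATORS:    # single char assumed
            return (c, c)
        return (ERROR, c)


def scan(text):
    """Collect tokens up to and including EOF, like the do-while loop above."""
    scanner = Scanner(text)
    tokens = []
    while True:
        token = scanner.get_next_token()
        tokens.append(token)
        if token[0] == EOF:
            return tokens


print(scan("div = 6/2;"))
```

Every new token class needs another branch and another helper, which is the maintenance cost the slide warns about.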
Building a lexical analyser by hand can be
efficient, but it requires a lot of work and may
be difficult to modify!
Hence the more suitable approach is:
Building Lexical Analysers “automatically”



Regular Expressions
Patterns form a regular language. A regular expression is a way
of specifying a regular language. It is a formula that describes
a possibly infinite set of strings.
(Have you ever tried ls [x-z]* ?)
Regular Expression (RE) (over a vocabulary V):
• ε is a RE denoting the set {ε}, which contains only the empty string.
• If a ∈ V then a is a RE denoting {a}.
• If r1, r2 are REs then:
– r1* denotes zero or more occurrences of r1;
– r1r2 denotes concatenation;
– r1 | r2 denotes either r1 or r2;
• Shorthands: [a-d] for a | b | c | d; r+ for rr*; r? for r | ε
Describe the languages denoted by the following REs
a; a | b; a*; (a | b)*; (a | b)(a | b); (a*b*)*; (a | b)*baa;
(What about ls [x-z]* above? Hmm… not a good example?)
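
One way to explore the exercise is to test candidate strings against each RE with Python's re module, whose syntax covers the operators used here. The helper name accepts is an assumption for this sketch; re.fullmatch is used so that the whole string has to belong to the language.

```python
import re


def accepts(pattern: str, s: str) -> bool:
    """Return True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None


print(accepts(r"(a|b)(a|b)", "ab"))    # True: exactly two symbols
print(accepts(r"(a|b)(a|b)", "aba"))   # False: three symbols
print(accepts(r"(a*b*)*", "abba"))     # True: same language as (a|b)*
print(accepts(r"(a|b)*baa", "abbaa"))  # True: ends in baa
print(accepts(r"(a|b)*baa", "baab"))   # False: does not end in baa
```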
Examples
• Decimal → Integer.(0 | 1 | 2 | … | 9)*
• Identifier → [a-zA-Z] [a-zA-Z0-9]*

(Not all languages can be described by regular expressions.
But we don’t care for now.)

We study REs to automate scanner construction!
Consider the problem of recognising register names starting with r and
requiring at least one digit:
Register → r (0|1|2|…|9) (0|1|2|…|9)* (or, Register → r Digit Digit*)
The RE corresponds to a transition diagram:

start → S0 --r--> S1 --digit--> S2, with a self-loop on S2 labelled digit

Depicts the actions that take place in the scanner.


• A circle represents a state; S0: start state; S2: final state (double circle)
• An arrow represents a transition; the label specifies the cause of the transition.
A string is accepted if, following the transitions for its characters in order,
the scanner ends in a final state (for example, r345, r0, r29 are accepted, as
opposed to a, r, rab).
Towards Automation (finally!)
An easy (computerised) implementation of a transition diagram
is a transition table: a column for each input symbol and a
row for each state. An entry is a set of states that can be
reached from a state on some input symbol. E.g.:
state        ‘r’    digit
S0           S1     -
S1           -      S2
S2 (final)   -      S2
If we know the transition table and the final state(s) we can
build directly a recogniser that detects acceptance:
char = input_char();
state = 0;    // starting state
word = "";    // characters consumed so far
while (char != EOF) {
    state ← table(state, char);
    if (state == '-') return failure;
    word = word + char;
    char = input_char();
}
if (state == FINAL) return acceptance; else return failure;
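
The recogniser above could be written in Python along these lines, using the transition table for Register → r Digit Digit*. The names TABLE, FINAL_STATES, symbol_class and recognise are assumptions for this sketch, and a missing dictionary entry stands for the '-' entries of the table.

```python
# One row per state; a missing column means there is no transition ('-').
TABLE = {
    0: {"r": 1},
    1: {"digit": 2},
    2: {"digit": 2},
}
FINAL_STATES = {2}


def symbol_class(ch):
    """Map an input character to its column in the table."""
    return "digit" if ch in "0123456789" else ch


def recognise(word):
    """Return True if the table-driven automaton accepts the whole word."""
    state = 0                       # starting state
    for ch in word:
        state = TABLE[state].get(symbol_class(ch))
        if state is None:           # no transition: failure
            return False
    return state in FINAL_STATES


for w in ["r345", "r0", "r29", "a", "r", "rab"]:
    print(w, recognise(w))
```

Only the table changes from one pattern to the next; the driver loop stays the same, which is what makes automatic generation practical.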

An Example (recognise r0 through r31)
Register → r ((0|1|2) (Digit|ε) | (4|5|6|7|8|9) | (3|30|31))

S0 --r--> S1
S1 --0|1|2--> S2 --digit--> S3
S1 --3--> S5 --0|1--> S6
S1 --4|5|6|7|8|9--> S4

State        ‘r’   0,1   2    3    4,5,…,9
0            1     -     -    -    -
1            -     2     2    5    4
2 (final)    -     3     3    3    3
3 (final)    -     -     -    -    -
4 (final)    -     -     -    -    -
5 (final)    -     6     -    -    -
6 (final)    -     -     -    -    -
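
As a sanity check, the table above can be transcribed into Python and run over r0 to r99. This is a sketch only: the names column, TABLE, FINAL and is_register are made up for the example, and missing entries stand for '-'.

```python
def column(ch):
    """Map a character to its column in the transition table (None if it has none)."""
    if ch == "r":
        return "r"
    if ch in "01":
        return "0,1"
    if ch == "2":
        return "2"
    if ch == "3":
        return "3"
    if ch in "456789":
        return "4-9"
    return None


TABLE = {
    0: {"r": 1},
    1: {"0,1": 2, "2": 2, "3": 5, "4-9": 4},
    2: {"0,1": 3, "2": 3, "3": 3, "4-9": 3},
    3: {},
    4: {},
    5: {"0,1": 6},
    6: {},
}
FINAL = {2, 3, 4, 5, 6}


def is_register(word):
    """Run the DFA over word and report whether it stops in a final state."""
    state = 0
    for ch in word:
        state = TABLE[state].get(column(ch))
        if state is None:
            return False
    return state in FINAL


accepted = [n for n in range(100) if is_register(f"r{n}")]
print(accepted == list(range(32)))  # True
```

Note that the RE, and therefore the table, also admits two-digit forms with a leading zero such as r05.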

The Full Story!

The generalised transition diagram is a finite automaton.


It can be:
• Deterministic, DFA; as in the example
• Non-Deterministic, NFA; more than 1 transition out of a state may
be possible on the same input symbol: think about: (a | b)* abb

Every regular expression can be converted to a DFA!
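
To see the non-determinism in (a | b)* abb, one possible NFA can be simulated by tracking the set of states it could be in. The state numbering and the names NFA and nfa_accepts are assumptions for this sketch: state 0 loops on a and b, and on a it may also guess that the final abb has started.

```python
# (state, symbol) -> set of possible next states
NFA = {
    (0, "a"): {0, 1},   # two choices on 'a': stay in 0 or start matching abb
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, FINAL = 0, 3


def nfa_accepts(word):
    """Simulate the NFA by following every possible transition at once."""
    current = {START}
    for ch in word:
        current = set().union(*(NFA.get((s, ch), set()) for s in current))
    return FINAL in current


for w in ["abb", "aabb", "babb", "ab", "abba"]:
    print(w, nfa_accepts(w))
```

Treating each reachable set of NFA states as a single state is the idea behind the subset construction, which turns an NFA into an equivalent DFA.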

Assignments
1. Write a C program that reads the following string:
“ Md. Tareq Zaman, Part-3, 2011”
a) Count the number of words, letters, digits and other characters.
b) Separate the letters, digits and other characters.
2. Write a program that reads the following string:
“ Munmun is the student of Computer Science & Engineering”.
a) Count how many vowels and consonants there are.
b) Find out which vowels and consonants occur in the above string.
c) Divide the given string into two separate strings, where one string
contains only the words starting with a vowel and the other contains the words
starting with a consonant.
3. Write a program that expands the following course code:
CSE-3141 as Computer Science & Engineering, 3rd year, 1st semester,
Compiler Design, Theory.
