Lecture Week 03
Compiler Construction
CS 322
Mr. Atif Ali
Lecture 6
How to Describe Tokens?
Regular languages are the most popular way to specify tokens
because:
• They are based on a simple and useful theory.
• They are easy to understand.
• Efficient implementations exist for generating lexical analyzers
based on such languages.
Languages
Let Σ be a set of characters. Σ is called the
alphabet.
A language over Σ is a set of strings of characters
drawn from Σ.
Example of Languages
Alphabet = English characters
Language = English sentences
Alphabet = ASCII
Language = C++ programs (or Java programs, or C# programs)
Notation
Languages are sets of strings (finite sequences of
characters).
We need some notation for specifying which sets we want.
For lexical analysis we care about regular
languages.
Regular languages can be described using regular
expressions.
Regular Languages
Each regular expression is a notation for a regular
language (a set of words).
If A is a regular expression, we write L(A) to refer
to the language denoted by A.
A regular expression (RE) is defined inductively:
a = an ordinary character from Σ
ε = the empty string
R|S = either R or S
RS = R followed by S (concatenation)
R* = concatenation of R zero or more
times
(R* = ε|R|RR|RRR|...)
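The inductive definition can be turned into a small program. The sketch below is only an illustration: the class names (Char, Eps, Alt, Cat, Star) and the function lang are made up for this example. It lists the strings of L(R) up to a given length, one case per rule of the definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Char:          # a: an ordinary character from the alphabet
    c: str

@dataclass(frozen=True)
class Eps:           # the empty string
    pass

@dataclass(frozen=True)
class Alt:           # R|S
    r: object
    s: object

@dataclass(frozen=True)
class Cat:           # RS
    r: object
    s: object

@dataclass(frozen=True)
class Star:          # R*
    r: object

def lang(e, n):
    """Return the strings of L(e) whose length is at most n."""
    if isinstance(e, Char):
        return {e.c} if n >= 1 else set()
    if isinstance(e, Eps):
        return {""}
    if isinstance(e, Alt):
        return lang(e.r, n) | lang(e.s, n)
    if isinstance(e, Cat):
        return {x + y for x in lang(e.r, n) for y in lang(e.s, n)
                if len(x + y) <= n}
    if isinstance(e, Star):
        base = lang(e.r, n)
        result = {""}            # zero repetitions
        frontier = {""}
        while frontier:          # add one more repetition per round
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise TypeError(e)

print(sorted(lang(Star(Cat(Char("a"), Char("b"))), 4)))
```

The length bound n is needed because L(R*) is infinite in general; the loop stops once no new string of length at most n appears.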
RE Extensions
Regular expression extensions are used as
convenient notation for complex REs:
R? = ε | R (zero or one R)
R+ = RR* (one or more R)
(R) = R (grouping)
[abc] = a|b|c (any of listed)
[a-z] = a|b|...|z (range)
[^ab] = c|d|... (anything but ‘a’ or ‘b’)
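These extensions exist with the same meaning in Python's re module, so the equivalences can be spot-checked. The helper same_language below is a hypothetical name for this sketch; it compares two patterns on a handful of sample strings only, so it is a sanity check and not a proof of equivalence.

```python
import re

def same_language(p, q, samples):
    """True if patterns p and q agree on every sample string."""
    return all(bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s))
               for s in samples)

samples = ["", "a", "b", "c", "aa", "ab", "abc", "z"]

print(same_language(r"a?", r"(|a)", samples))       # R? versus (empty | R)
print(same_language(r"a+", r"aa*", samples))        # R+ versus RR*
print(same_language(r"[abc]", r"a|b|c", samples))   # character class
print(same_language(r"[a-c]", r"a|b|c", samples))   # range
print(bool(re.fullmatch(r"[^ab]", "c")))            # anything but a or b
```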
Regular Expression
RE          Strings in L(R)
a           “a”
ab          “ab”
a|b         “a”, “b”
(ab)*       “”, “ab”, “abab”, ...
(a|ε)b      “ab”, “b”
Here are examples of common tokens found in
programming languages.
integer: a non-empty string of digits
digit = ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’
integer = digit digit*
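The definition digit digit* can be checked by hand without any regex library. A minimal sketch, where is_integer and DIGITS are names chosen for this example:

```python
DIGITS = set("0123456789")

def is_integer(s):
    """integer = digit digit*: at least one character, all of them digits."""
    return len(s) >= 1 and all(ch in DIGITS for ch in s)

print(is_integer("2024"), is_integer(""), is_integer("12a"))
```

The explicit digit set is used instead of str.isdigit because isdigit also accepts non-ASCII digit characters, which the definition above does not include.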
Example: identifiers
identifier:
a string of letters or digits starting with a letter
C identifier: [a-zA-Z_][a-zA-Z0-9_]*
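The token definitions for integers and C identifiers can be combined into a tiny scanner. This is a sketch only: the token names IDENTIFIER, INTEGER and SKIP, and the function tokenize, are assumptions made for the example, and whitespace is simply skipped.

```python
import re

TOKEN_RE = re.compile(
    r"(?P<IDENTIFIER>[a-zA-Z_][a-zA-Z0-9_]*)"
    r"|(?P<INTEGER>[0-9]+)"
    r"|(?P<SKIP>\s+)"
)

def tokenize(text):
    """Split text into (token kind, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":        # drop whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("count 42 _tmp9"))
```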
Finite Automata (FA)
Specification: Regular Expressions
Implementation: Finite Automata
[Figure: notation for an accepting state, and for a transition labelled a]
Finite Automata
A finite automaton accepts a string if we can
follow transitions labelled with characters in the
string from start state to some accepting state.
FA Example
An FA that accepts only “1”
[Figure: a start state with one transition, labelled 1, to an accepting state]
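The acceptance rule can be written as a short loop: start in the start state, follow one transition per input character, and accept only if the final state is accepting. In this sketch the state names "start" and "done" and the function accepts are illustrative choices.

```python
def accepts(transitions, start, accepting, text):
    """Follow one transition per character; reject if none exists."""
    state = start
    for ch in text:
        if (state, ch) not in transitions:
            return False
        state = transitions[(state, ch)]
    return state in accepting

# FA that accepts only the string "1"
ONLY_ONE = {("start", "1"): "done"}

for s in ["1", "", "11", "0"]:
    print(repr(s), accepts(ONLY_ONE, "start", {"done"}, s))
```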
FA Example
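An FA that accepts any number of 1’s followed by
a single 0
[Figure: a start state with a self-loop labelled 1 and a transition labelled 0 to an accepting state]
Transition table (rows are states, columns are input characters):
state   a     b
0       1     err
1       2     1
2       err   err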
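A transition table like the one above can drive a recognizer directly. The sketch below assumes that rows are states 0 to 2, columns are the inputs a and b, and err means the input is rejected. The slide does not say which state is accepting; state 2 is assumed here purely for illustration.

```python
ERR = None                      # marks a missing transition
COLUMN = {"a": 0, "b": 1}       # input character -> table column
TABLE = [
    [1,   ERR],                 # state 0
    [2,   1],                   # state 1
    [ERR, ERR],                 # state 2
]
ACCEPTING = {2}                 # assumption: not stated on the slide

def run(text):
    """Table-driven DFA: one table lookup per input character."""
    state = 0
    for ch in text:
        col = COLUMN.get(ch)
        if col is None:
            return False        # character outside the alphabet
        state = TABLE[state][col]
        if state is ERR:
            return False
    return state in ACCEPTING

for s in ["aa", "aba", "abbba", "a", "aab"]:
    print(s, run(s))
```

Under that assumption the table accepts an a, then any number of b’s, then a final a.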
RE → Finite Automata
Can we build a finite automaton for every regular
expression?
Yes: build the FA inductively, following the definition
of regular expressions.
NFA
Nondeterministic Finite Automaton (NFA)
Can have multiple transitions for one input in a given state
Can have ε-moves
Epsilon Moves
ε-moves:
the machine can move from state A
to state B without consuming
input
A --ε--> B
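Because ε-moves consume no input, an implementation usually needs the set of all states reachable through ε-moves alone (the ε-closure). A minimal sketch, where the dictionary eps maps each state to its ε-successors and all names are chosen for this example:

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` using only epsilon-moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

eps = {"A": {"B"}, "B": {"C"}}       # A --eps--> B --eps--> C
print(sorted(epsilon_closure({"A"}, eps)))
```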
NFA
The operation of the automaton is not completely defined by the input.
[Figure: NFA with states A, B, C and transitions labelled 0 and 1]
On input “11”, the automaton could be in either state.
Execution of FA
An NFA can choose:
Whether to make ε-moves.
Which of multiple transitions to take for a single
input.
Acceptance of NFA
An NFA can get into multiple states.
Rule: an NFA accepts if it can get into a final state.
[Figure: NFA with states A, B, C and transitions labelled 0 and 1]
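One common way to apply this rule is to track the set of all states the NFA could be in after each character, and accept if that set contains a final state at the end. The NFA below is a hypothetical one (not necessarily the one in the figure): it accepts strings of 0s and 1s that end in 01. ε-moves are stored under the symbol None.

```python
def closure(states, delta):
    """Add every state reachable through epsilon-moves (symbol None)."""
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, None), ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result

def nfa_accepts(delta, start, accepting, text):
    """Simulate the NFA by tracking all states it could be in."""
    current = closure({start}, delta)
    for ch in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, ch), set())
        current = closure(moved, delta)
    return bool(current & accepting)

# Hypothetical NFA: in state A, input 0 has two possible transitions.
DELTA = {
    ("A", "0"): {"A", "B"},
    ("A", "1"): {"A"},
    ("B", "1"): {"C"},
}

for s in ["01", "1101", "010", ""]:
    print(repr(s), nfa_accepts(DELTA, "A", {"C"}, s))
```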
DFA and NFA
Deterministic Finite Automata (DFA)
One transition per input per state.
No ε-moves
Execution of FA
A DFA can take only one path through the state graph.
The path is completely determined by the input.
NFA vs DFA
NFAs and DFAs recognize the same set of languages (the regular languages).
DFAs are easier to implement: they are table driven.
For a given language, the NFA can be simpler than the DFA.
The DFA can be exponentially larger than the NFA.
NFAs are the key to automating RE → DFA construction.
RE → NFA Construction
Thompson’s construction (CACM 1968)
Build an NFA for each RE term.
Combine NFAs with ε-moves.
Subset construction
NFA → DFA
Build the simulation.
Minimize number of states in DFA (Hopcroft’s
algorithm)
Key idea:
NFA pattern for each symbol and each operator.
Join them with ε-moves in precedence order.
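The subset construction step (NFA → DFA) can be sketched as a worklist algorithm: each DFA state is the set of NFA states the simulation could be in. The code below is an illustrative sketch, not a full scanner generator: function and variable names are made up, ε-moves are stored under the symbol None, and minimization is not included. The example NFA is a small one for ab with an ε-move joining the two symbol automata.

```python
from collections import deque

def eps_closure(states, delta):
    """States reachable through epsilon-moves, as a hashable frozenset."""
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, None), ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return frozenset(result)

def subset_construction(delta, start, accepting, alphabet):
    """Build a DFA whose states are sets of NFA states."""
    d_start = eps_closure({start}, delta)
    d_delta = {}
    seen = {d_start}
    work = deque([d_start])
    while work:
        S = work.popleft()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= delta.get((s, a), set())
            if not moved:
                continue                 # no transition: implicit error
            T = eps_closure(moved, delta)
            d_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    d_accepting = {S for S in seen if S & accepting}
    return d_start, d_delta, d_accepting

# NFA for ab: s0 --a--> s1 --eps--> s3 --b--> s4
NFA = {("s0", "a"): {"s1"}, ("s1", None): {"s3"}, ("s3", "b"): {"s4"}}
start, trans, final = subset_construction(NFA, "s0", {"s4"}, "ab")
for (S, a), T in trans.items():
    print(sorted(S), a, sorted(T))
```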
RE → NFA Construction
NFA for a:   s0 --a--> s1
NFA for b:   s3 --b--> s4
NFA for ab:  s0 --a--> s1 --ε--> s3 --b--> s4
RE → NFA Construction
s0 --ε--> s1 --a--> s2 --ε--> s5
s0 --ε--> s3 --b--> s4 --ε--> s5
NFA for a | b
RE → NFA Construction
s0 --ε--> s1 --a--> s2 --ε--> s4
s2 --ε--> s1 (repeat a)
s0 --ε--> s4 (zero occurrences of a)
NFA for a*
Example RE → NFA
NFA for a ( b|c )*
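s0 --a--> s1 --ε--> s2 --ε--> s3
s3 --ε--> s4 --b--> s5 --ε--> s8
s3 --ε--> s6 --c--> s7 --ε--> s8
s8 --ε--> s3 (repeat b|c)
s8 --ε--> s9
s2 --ε--> s9 (zero occurrences of b|c)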
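The construction for a ( b|c )* can be sketched in code: one NFA fragment per symbol and per operator, joined with ε-moves. This is a minimal sketch under stated assumptions. The class Thompson, the tuple encoding of REs ("char", "cat", "alt", "star") and the function matches are all invented for the example, and the state numbers it generates will differ from the labels in the figure.

```python
import itertools

class Thompson:
    def __init__(self):
        self.counter = itertools.count()
        self.edges = {}          # (state, symbol or None) -> set of states

    def new_state(self):
        return next(self.counter)

    def add(self, src, sym, dst):
        self.edges.setdefault((src, sym), set()).add(dst)

    def build(self, e):
        """Return (start, accept) of an NFA fragment for expression e."""
        kind = e[0]
        if kind == "char":                       # single symbol
            s, t = self.new_state(), self.new_state()
            self.add(s, e[1], t)
            return s, t
        if kind == "cat":                        # RS
            s1, t1 = self.build(e[1])
            s2, t2 = self.build(e[2])
            self.add(t1, None, s2)
            return s1, t2
        if kind == "alt":                        # R|S
            s, t = self.new_state(), self.new_state()
            for sub in e[1:]:
                si, ti = self.build(sub)
                self.add(s, None, si)
                self.add(ti, None, t)
            return s, t
        if kind == "star":                       # R*
            s, t = self.new_state(), self.new_state()
            si, ti = self.build(e[1])
            self.add(s, None, si)
            self.add(ti, None, t)
            self.add(ti, None, si)               # repeat
            self.add(s, None, t)                 # zero occurrences
            return s, t
        raise ValueError(kind)

def closure(states, edges):
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in edges.get((s, None), ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result

def matches(e, text):
    """Build the NFA for e, then simulate it on text."""
    nfa = Thompson()
    start, accept = nfa.build(e)
    current = closure({start}, nfa.edges)
    for ch in text:
        moved = set()
        for s in current:
            moved |= nfa.edges.get((s, ch), set())
        current = closure(moved, nfa.edges)
    return accept in current

# a ( b|c )*
RE = ("cat", ("char", "a"),
      ("star", ("alt", ("char", "b"), ("char", "c"))))

for s in ["a", "abcb", "", "b", "abca"]:
    print(repr(s), matches(RE, s))
```

For this expression the sketch creates ten states, the same count as the states s0 to s9 in the example.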
Thank You!