Chapter Two
Automata
Introduction
/woodchucks?/ - matches “woodchuck” or “woodchucks”
– Examples:
• /^The/ - matches the word “The” only at the start of the line.
• Three uses of “^”:
1. /^xyz/ - Matches the start of the line
2. [^xyz] – Negation
3. /\^/ - Just to mean a literal caret (escaped with a backslash)
• /⌴$/ - “⌴” stands for the space character; matches a space at the end
of the line.
• /^The dog\.$/ - matches a line that contains only the phrase “The
dog.” (including the final period).
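The caret and anchor examples above can be tried out with a short sketch in Python's re module. The sample lines are made up for illustration, and an ordinary space stands in for ⌴.

```python
import re

# Hypothetical sample lines, chosen only to exercise the patterns above.
lines = ["The dog.", "See The dog.", "The dog. ", "x^y"]

# /^The/ : "The" only at the start of the line.
starts_with_the = [s for s in lines if re.search(r"^The", s)]

# /⌴$/ : a space at the end of the line.
ends_with_space = [s for s in lines if re.search(r" $", s)]

# /^The dog\.$/ : a line that contains only "The dog."
only_the_dog = [s for s in lines if re.search(r"^The dog\.$", s)]

# [^xyz] : negation; every character that is not x, y, or z.
not_xyz = re.findall(r"[^xyz]", "x^y")

# An escaped caret matches a literal "^".
literal_caret = [s for s in lines if re.search(r"\^", s)]
```

Note that only the exact line “The dog.” survives the last anchor pattern; the variant with a trailing space does not.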
Anchors
• /\b/ - matches a word boundary
• /\B/ - matches a non-boundary
• /\bthe\b/ - matches the word “the” but not the word “other”.
• A word is defined as any sequence of digits, underscores, or
letters.
• /\b99/ - will match the string 99 in
“There are 99 bottles of beer on the wall” but NOT
“There are 299 bottles of beer on the wall”
and it will match the string
“$99” since 99 follows a “$” which is not a digit, underscore, or
a letter.
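A small check of the word-boundary behaviour described above, again using Python's re module. The third sentence is an invented example containing “$99”.

```python
import re

boundary_99 = re.compile(r"\b99")

sentences = [
    "There are 99 bottles of beer on the wall",
    "There are 299 bottles of beer on the wall",
    "It costs $99",  # made-up sentence for the "$99" case
]
has_match = [bool(boundary_99.search(s)) for s in sentences]

# /\bthe\b/ finds the word "the"; /\Bthe\B/ finds "the" only inside a word.
text = "the other one"
word_starts = [m.start() for m in re.finditer(r"\bthe\b", text)]
inside_starts = [m.start() for m in re.finditer(r"\Bthe\B", text)]
```

Here word_starts points at the standalone “the”, while inside_starts points at the “the” buried in “other”.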
Disjunction, Grouping and Precedence.
• Suppose we need to search for texts about pets;
specifically, we may be interested in cats and dogs. If
we want to search for either the string “cat” or the string
“dog”, we cannot use any of the constructs we have
introduced so far (why not “[]”?).
• A new operator defines disjunction: “|”, also called the
pipe symbol.
• /cat|dog/ - matches either the string cat or the string dog.
Grouping
• In many instances it is necessary to be able to group
the sequence of characters to be treated as one set.
• Example: Search for guppy and guppies.
– /gupp(y|ies)/
• Useful in conjunction with the “*” operator.
– By default “*” applies to a single character and not to a whole sequence.
• Example: Match “Column 1 Column 2 Column 3 …”
– /Column⌴[0-9]+⌴*/ - will match “Column # …“
– /(Column⌴[0-9]+⌴*)*/ - will match “Column 1 Column 2
Column 3 …”
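The disjunction and grouping examples above can be sketched in Python as follows. The test strings are invented for illustration, and a plain space stands in for ⌴.

```python
import re

# /cat|dog/ : either string.
either_pet = re.findall(r"cat|dog", "my cat chased a dog")

# Square brackets match single characters, so they cannot express "cat or dog":
# every letter of "cod" is in the set, although "cod" is neither word.
bracket_chars = re.findall(r"[catdog]", "cod")

# /gupp(y|ies)/ : parentheses limit the scope of the disjunction.
guppy_forms = [w for w in ["guppy", "guppies", "guppys"]
               if re.fullmatch(r"gupp(y|ies)", w)]

# Without parentheses the pattern means "guppy" or "ies".
ungrouped = [w for w in ["guppy", "guppies", "ies"]
             if re.fullmatch(r"guppy|ies", w)]

# The star applies to one column entry unless the entry is grouped.
header = "Column 1 Column 2 Column 3"
one_column = re.match(r"Column [0-9]+ *", header).group()
all_columns = re.fullmatch(r"(Column [0-9]+ *)*", header) is not None
```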
Operator Precedence Hierarchy (highest to lowest)
Parenthesis ()
Counters * + ? {}
Sequences and anchors the ^my end$
Disjunction |
Simple Example
• Problem Statement: Want to write RE to find cases of the English article
“the”.
• /\b((Windows)+⌴*(XP|Vista|7|8)?)\b/
• /\b(Mac|Macintosh|Apple|OS⌴X)\b/
Advanced Operators
RE    Expansion      Match                     Example Patterns
\s    [⌴\r\t\n\f]    Whitespace (space, tab)
\S    [^\s]          Non-whitespace            “in Concord”
.*    Matches any string of characters up to, but not including, a newline character
Literal Matching of Special Characters & “\” Characters
RE Match Example Patterns
\n A newline
\t A tab
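A brief sketch of the whitespace classes, the .* pattern, and the escaped characters \n and \t, using Python's re module. The sample strings are invented for illustration.

```python
import re

text = "in Concord\tMA\n"

# \s matches any whitespace character; \S matches any non-whitespace character.
whitespace = re.findall(r"\s", text)
chunks = re.findall(r"\S+", text)

# .* stops at the newline, because "." does not match a newline by default.
first_line = re.match(r".*", "line one\nline two").group()

# \t and \n match a literal tab and a literal newline.
tab_split = re.split(r"\t", "a\tb")
newline_count = len(re.findall(r"\n", "x\ny\n"))
```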
[Figure: the relationship between regular expressions, finite automata, and regular languages]
Finite-State Automaton for Regular
Expressions
• Using FSA to Recognize Sheeptalk with RE:
/baa+!/
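[Figure: FSA for sheeptalk. States q0 (start state), q1, q2, q3, q4 (final state); transitions q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus a self-loop on a at q3.]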
FSA Use
• The FSA can be used for recognizing (we also say accepting)
strings in the following way. First, think of the input as being
written on a long tape broken up into cells, with one symbol
written in each cell of the tape, as in the figure below:
[Figure: an input tape with one symbol per cell (a b a ! b); the machine is in state q0.]
Recognition Process
• The machine starts in the start state (q0) and iterates the following
process:
1. Check the next letter of the input.
a. If it matches the symbol on an arc leaving the current state, then
i. cross that arc,
ii. move to the next state, and
iii. advance one symbol in the input.
b. If we are in the accepting state (q4) when we run out of input, the machine
has successfully recognized an instance of sheeptalk.
2. If the machine never gets to the final state,
a. either because it runs out of input, or
b. because it gets some input that doesn’t match an arc (as in the figure on the previous slide), or
c. because it just happens to get stuck in some non-final state,
we say the machine rejects or fails to accept the input.
State Transition Table
        Input
State   b   a   !
0       1   Ø   Ø
1       Ø   2   Ø
2       Ø   3   Ø
3       Ø   3   4
4:      Ø   Ø   Ø
We’ve marked state 4 with a colon to indicate that it’s a final state (you
can have as many final states as you want), and the Ø indicates an
illegal or missing transition. We can read the first row as “if we’re in
state 0 and we see the input b we must go to state 1. If we’re in state 0
and we see the input a or !, we fail”.
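One possible way to hold the state transition table in a program is a list of rows, one row per state. This is only a sketch: the names SYMBOLS, TABLE, FINAL_STATES, and next_state are invented here, and None plays the role of Ø.

```python
SYMBOLS = ["b", "a", "!"]          # column order of the table

# Row i is state i; None marks an illegal or missing transition (Ø).
TABLE = [
    [1,    None, None],            # state 0
    [None, 2,    None],            # state 1
    [None, 3,    None],            # state 2
    [None, 3,    4],               # state 3
    [None, None, None],            # state 4 (final)
]
FINAL_STATES = {4}


def next_state(state, symbol):
    """Look up one table cell; symbols outside the alphabet also give None."""
    if symbol not in SYMBOLS:
        return None
    return TABLE[state][SYMBOLS.index(symbol)]
```

Reading the first row with this helper, next_state(0, "b") gives state 1, while next_state(0, "a") and next_state(0, "!") both fail.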
Formal Definition of Automaton
• Q = {q0, q1, q2, q3, q4},
• Σ = {a, b, !},
• F = {q4}, and
• δ(q, i), the transition function given by the state transition table.
[Figure: the sheeptalk FSA, with start state q0, final state q4, and a self-loop on a at q3.]
Deterministic Algorithm for Recognizing a String
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
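A minimal Python sketch of D-RECOGNIZE, for readers who want to run it. The representation is an assumption made here: the transition table is a dictionary keyed by (state, symbol), and a missing key stands for an empty cell.

```python
def d_recognize(tape, transitions, start, accept_states):
    """Return True (accept) or False (reject) for the string `tape`."""
    index = 0
    current_state = start
    while True:
        if index == len(tape):                      # end of input reached
            return current_state in accept_states
        key = (current_state, tape[index])
        if key not in transitions:                  # empty table cell
            return False
        current_state = transitions[key]
        index += 1


# The sheeptalk machine from the state transition table.
SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

accepted = [s for s in ["baa!", "baaa!", "ba!", "abc", "baaa", ""]
            if d_recognize(s, SHEEP, 0, {4})]
```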
Tracing Execution for Some Sheep Talk
[Figure: the sheeptalk FSA and its state transition table, with a trace of the input b a a a !; the machine passes through states q0, q1, q2, q3, q3, q4.]
Tracing Execution for Some Sheep Talk (cont.)
Before examining the beginning of the tape, the machine is in
state q0. Finding a b on the input tape, it changes to state q1 as
indicated by the contents of transition-table[q0,b].
It then finds an a and switches to state q2, another a puts it in
state q3, a third a leaves it in state q3, where it reads the “!”,
and switches to state q4. Since there is no more input, the
End of input condition at the beginning of the loop is
satisfied for the first time and the machine halts in q4.
State q4 is an accepting state, and so the machine has accepted
the string baaa! as a sentence in the sheep language.
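The walk-through above can be reproduced with a few lines of Python that record every state the machine visits. The function name trace and the dictionary encoding of the table are assumptions made for this sketch.

```python
def trace(tape, transitions, start):
    """Return the list of states visited while reading `tape`.

    The list stops early if the machine hits a missing transition.
    """
    states = [start]
    for symbol in tape:
        target = transitions.get((states[-1], symbol))
        if target is None:
            break
        states.append(target)
    return states


SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

visited = trace("baaa!", SHEEP, 0)
```

For baaa! the visited list is q0, q1, q2, q3, q3, q4, matching the description in the text.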
Fail State
• The algorithm will fail whenever there is no legal
transition for a given combination of state and input. The
input abc will fail to be recognized since there is no legal
transition out of state q0 on the input a, (i.e., this entry of
the transition table has a Ø).
• Even if the automaton had allowed an initial a it would
have certainly failed on c, since c isn’t even in the
sheeptalk alphabet! We can think of these “empty”
elements in the table as if they all pointed at one “empty”
state, which we might call the fail state or sink state.
• In a sense then, we could view any machine
with empty transitions as if we had augmented it with a fail
state and drawn in all the extra arcs, so that we always had
somewhere to go from any state on any possible input. Just
for completeness, the next figure shows the FSA from the previous
figure with the fail state qF filled in.
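As a rough illustration of “drawing in all the extra arcs”, the sketch below fills every empty cell of a transition table with a fail state. The function name add_fail_state and the dictionary encoding are assumptions. Symbols outside the alphabet, such as c, are still not covered, since they have no column in the table.

```python
def add_fail_state(transitions, states, alphabet, fail="qF"):
    """Return a copy of the table in which every missing cell points to `fail`."""
    total = dict(transitions)
    for state in list(states) + [fail]:
        for symbol in alphabet:
            # Keep existing arcs; only empty cells are redirected to the fail state.
            total.setdefault((state, symbol), fail)
    return total


SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

total = add_fail_state(SHEEP, range(5), "ab!")
```

After this step every state, including qF itself, has a target for each of the three symbols.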
Adding a Fail State to FSA
[Figure: the sheeptalk FSA (q0 to q4) augmented with a fail state qF; every transition missing from the original machine now leads to qF.]
Formal Languages
• Key Concept #1. Formal Language:
– A model which can both generate and recognize all and only the
strings of a formal language acts as a definition of the formal
language.
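To illustrate the “generate” half of this idea, the sketch below uses the same sheeptalk table both to recognize strings and to enumerate them. The function name generate and the length bound are assumptions introduced here, since the language itself is infinite.

```python
from collections import deque


def generate(transitions, start, accept_states, max_length):
    """Yield every accepted string of at most `max_length` symbols, shortest first."""
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in accept_states:
            yield prefix
        if len(prefix) < max_length:
            for (source, symbol), target in transitions.items():
                if source == state:
                    queue.append((target, prefix + symbol))


SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

short_sheeptalk = list(generate(SHEEP, 0, {4}, max_length=6))
```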
[Figure: FSA for the English number words from 1 to 99, with states q0 (start state), q1, and q2 (final state). Arc labels: the words one through nineteen; the tens words twenty through ninety (q0 to q1); and one through nine after a tens word (q1 to q2).]
FSA for the simple Dollars and Cents
[Figure: FSA for simple dollars and cents, with states q0 through q7. The number-word automaton appears twice, once for the dollar amount (q0, q1, q2) and once for the cent amount (q4, q5, q6), with arcs labeled dollars and cents.]
Non-Deterministic FSAs
[Figure: deterministic FSA for sheeptalk: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with a self-loop on a at q3.]
[Figure: non-deterministic FSA for sheeptalk: the same states and arcs, but with the self-loop on a at q2.]
Deterministic vs Non-deterministic FSA
• A deterministic FSA is one whose behavior during
recognition is fully determined by the state it is in and
the symbol it is looking at.
• In the non-deterministic FSA on the previous slide, when the
machine is in state q2 and the input symbol is a, we do not know
whether to remain in state q2 (the self-loop transition) or to
move to state q3 (the other transition).
• Clearly the decision depends on the next input
symbols.
Another NFSA for the “sheep” language: ε-transition
[Figure: NFSA for sheeptalk with an ε-transition: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus an ε-arc from q3 back to q2.]
• We will focus here on the backup approach and defer discussion of the
look-ahead and parallelism approaches to later chapters.
Back-up Approach for NFSA Recognizer
• The backup approach suggests that we should make
choices that might lead to dead-ends, knowing that
we can always return to unexplored alternative
choices.
• There are two key points to this approach:
1. Must know ALL alternatives for each choice point.
2. Store sufficient information about each alternative so that
we can return to it when necessary.
Back-up Approach for NFSA Recognizer (cont.)
• When a backup algorithm reaches a point in its processing where
no progress can be made:
– Runs out of input, or
– Has no legal transitions,
It returns to a previous choice point and selects one of the unexplored
alternatives and continues from there.
State   b   a     !   ε
0       1   Ø     Ø   Ø
1       Ø   2     Ø   Ø
2       Ø   2,3   Ø   Ø
3       Ø   Ø     4   2
4:      Ø   Ø     Ø   Ø
[Figure: the NFSA for sheeptalk described by this table, with start state q0 and final state q4.]
An Algorithm for NFSA Recognition
function ND-RECOGNIZE(tape, machine) returns accept or reject
  agenda ← {(Initial state of machine, beginning of tape)}
  current-search-state ← NEXT(agenda)
  loop
    if ACCEPT-STATE?(current-search-state) returns true then
      return accept
    else
      agenda ← agenda ∪ GENERATE-NEW-STATES(current-search-state)
    if agenda is empty then
      return reject
    else
      current-search-state ← NEXT(agenda)
  end
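An Algorithm for NFSA Recognition (cont.)
function GENERATE-NEW-STATES(current-state) returns a set of search-states
  current-node ← the node the current search-state is in
  index ← the point on the tape the current search-state is looking at
  return a list of search-states from the transition table as follows:
    (transition-table[current-node, ε], index)
    ∪
    (transition-table[current-node, tape[index]], index + 1)
function ACCEPT-STATE?(search-state) returns true or false
  current-node ← the node search-state is in
  index ← the point on the tape search-state is looking at
  if index is at the end of the tape and current-node is an accept state of machine then
    return true
  else
    return false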
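The following Python sketch follows the agenda-based recognizer under a few stated assumptions: a search-state is a (node, index) pair, the empty string stands for ε, the agenda is a plain list used as a stack, and a seen set (not part of the pseudocode) guards against revisiting a search-state through ε-arcs. The transition table is the NFSA table shown earlier.

```python
EPS = ""  # assumed marker for an ε-transition

NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
    (3, EPS): {2},
}


def generate_new_states(search_state, tape, transitions):
    """Search-states reachable in one step: ε-moves keep the index, others advance it."""
    node, index = search_state
    new_states = [(t, index) for t in sorted(transitions.get((node, EPS), ()))]
    if index < len(tape):
        targets = transitions.get((node, tape[index]), ())
        new_states += [(t, index + 1) for t in sorted(targets)]
    return new_states


def is_accept_state(search_state, tape, accept_states):
    node, index = search_state
    return index == len(tape) and node in accept_states


def nd_recognize(tape, transitions, start, accept_states):
    agenda = [(start, 0)]
    seen = {(start, 0)}
    while agenda:
        current = agenda.pop()  # NEXT: here the most recently added search-state
        if is_accept_state(current, tape, accept_states):
            return True
        for state in generate_new_states(current, tape, transitions):
            if state not in seen:
                seen.add(state)
                agenda.append(state)
    return False  # agenda is empty: reject
```

Because every unexplored alternative stays on the agenda, a dead end simply causes the loop to pick up the next stored search-state, which is the backup behaviour described above.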
Possible execution of ND-RECOGNIZE
[Figure: a possible execution of ND-RECOGNIZE on the input b a a a !, shown alongside the NFSA and its transition table. Steps 1 to 3 move from q0 to q1 to q2; at q2 there are two possibilities (q2 or q3); one path reaches a dead end (X), and the search continues with the alternative through step 8.]
Depth-First-Search
• Depth-First Search or Last-In-First-Out
(LIFO) strategy:
– Uses a stack data structure to implement the
function NEXT.
• NEXT returns the state at the front of the
agenda.
• Pitfall: Under certain circumstances it
can enter an infinite loop.
Depth-First Search of ND-RECOGNIZE
[Figure: a depth-first trace of ND-RECOGNIZE on the input b a a a !, shown with the NFSA transition table. At q2, one possibility is evaluated first; when it reaches a dead end (X), the search backs up and follows the other, ending in q4 at step 8.]
Breadth-First Search
• Breadth-First Search or First In First Out (FIFO)
strategy.
– All possible choices explored at once.
– Uses a queue data structure to implement NEXT function
• Pitfalls:
– As with depth-first if the state-space is infinite, the search
may never terminate.
– More importantly due to growth in the size of the agenda if
the state-space is even moderately large, the search may
require an impractically large amount of memory.
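The difference between the two strategies comes down to which end of the agenda NEXT takes from. The sketch below records the order in which search-states (node, index) are visited for the NFSA with the self-loop on a at q2. The function name visit_order is invented here, and ε-transitions are left out to keep the example short.

```python
from collections import deque

# NFSA for sheeptalk with the self-loop on a at q2 (no ε-arcs in this sketch).
NFSA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}


def visit_order(tape, strategy):
    """Return the search-states in the order NEXT hands them out."""
    agenda = deque([(0, 0)])
    order = []
    while agenda:
        # Stack (LIFO) for depth-first, queue (FIFO) for breadth-first.
        node, index = agenda.pop() if strategy == "dfs" else agenda.popleft()
        order.append((node, index))
        if node == 4 and index == len(tape):
            break  # accept
        if index < len(tape):
            for target in sorted(NFSA.get((node, tape[index]), ())):
                agenda.append((target, index + 1))
    return order


dfs_order = visit_order("baaa!", "dfs")
bfs_order = visit_order("baaa!", "bfs")
```

In the depth-first order the machine commits to q3 after the second a, hits a dead end, and only then returns to the stored alternative q2. In the breadth-first order both alternatives are examined at each tape position before moving on.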
Breadth-First Search of ND-RECOGNIZE
[Figure: a breadth-first trace of the FSA on some sheeptalk (input b a a a !), shown with the NFSA and its transition table. At q2, both possibilities are evaluated at each step; dead ends are marked X, and the surviving branch ends in q4 at step 6.]
Advanced Search Algorithms
• All and only the sets of languages which meet the above properties are regular
languages.
Regular Languages and FSAs
• All regular expression operators can be implemented by the three
operations which define regular languages:
– Concatenation,
– Disjunction/union (also written “|”),
– * closure.
• Example:
– The counters (*, +, {n,m}) are just special cases of repetition plus * closure.
– All the anchors can be thought of as individual special symbols.
– The square braces [] are a kind of disjunction:
• [ab] means “a or b”, or
• The disjunction of a and b.
Regular Languages and FSAs
• Regular languages are also closed under the following
operations:
– Intersection: if L1 and L2 are regular languages, then so is
L1 ∩ L2, the language consisting of the set of strings that are
in both L1 and L2.
– Difference: if L1 and L2 are regular languages, then so is L1
– L2, the language consisting of the set of strings that are in
L1 but not L2.
– Complementation: if L1 is a regular language, then
so is Σ* – L1, the set of all possible strings that are not
in L1.
– Reversal: if L1 is a regular language, then so is L1^R, the
language consisting of the set of reversals of the strings that
are in L1.
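Closure under intersection can be made concrete with the product construction, one standard way to show it for deterministic FSAs: the new machine runs both machines in lockstep and accepts only when both accept. The sketch below is illustrative; the second automaton (an even number of a's) and all names are assumptions introduced for the example.

```python
def accepts(dfa, string):
    """Run a deterministic FSA given as (transitions, start, final_states)."""
    transitions, state, finals = dfa
    for symbol in string:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in finals


def product(dfa1, dfa2):
    """Build a DFA for the intersection of the two languages."""
    t1, s1, f1 = dfa1
    t2, s2, f2 = dfa2
    transitions = {}
    for (p, a), p_next in t1.items():
        for (q, b), q_next in t2.items():
            if a == b:  # both machines must be able to read the same symbol
                transitions[((p, q), a)] = (p_next, q_next)
    finals = {(p, q) for p in f1 for q in f2}
    return transitions, (s1, s2), finals


SHEEP = ({(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}, 0, {4})

# Hypothetical second language: strings over {a, b, !} with an even number of a's.
EVEN_A = ({("e", "a"): "o", ("o", "a"): "e",
           ("e", "b"): "e", ("o", "b"): "o",
           ("e", "!"): "e", ("o", "!"): "o"}, "e", {"e"})

both = product(SHEEP, EVEN_A)
in_both = [s for s in ["baa!", "baaa!", "baaaa!", "bb"] if accepts(both, s)]
```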
Regular Expressions and FSA
• Regular expressions are equivalent to finite-state automata (proof:
Hopcroft and Ullman 1979).
• The proof is inductive. Each primitive operation of a regular expression
(concatenation, union, closure) is shown as part of the inductive step of the
proof.
[Figure: three automata, each with start state q0 and final state qf; the third has an arc labeled a from q0 to qf.]
Automata for the base case (no operators) for the induction showing that
any regular expression can be turned into an equivalent automaton
Concatenation
• The FSAs are placed next to each other by connecting all the final states of
FSA1 to the initial state of FSA2 by an ε-transition.
[Figure: the concatenation of two FSAs, FSA1 followed by FSA2.]
Closure
• Repetition: all final states of the FSA are connected back to the initial
state by an ε-transition.
• Zero occurrences case: a direct link from the initial state to the final state.
Union
• Add a single new initial state q0, and add new ε-transitions from it to the
former initial states of the two machines to be joined.
[Figure: a new initial state q0 with arcs to the former initial states of FSA1 and FSA2.]
The union (|) of two FSAs
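The three constructions can be sketched in Python as operations on small automaton fragments joined by ε-arcs. This is an illustrative sketch under assumptions, not the proof's exact construction: each fragment has a single final state, union and closure add a fresh initial and final state, and all names (Fragment, symbol, word, and so on) are invented here.

```python
import itertools
from functools import reduce

EPS = None                       # assumed label for ε-arcs
_ids = itertools.count()         # fresh state numbers


class Fragment:
    """An automaton with one start state, one final state, and a list of arcs."""
    def __init__(self, start, final, arcs):
        self.start, self.final, self.arcs = start, final, arcs


def symbol(a):                   # base case: a single arc labeled a
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, [(s, a, f)])


def concat(m1, m2):              # final state of FSA1 -ε-> initial state of FSA2
    return Fragment(m1.start, m2.final,
                    m1.arcs + m2.arcs + [(m1.final, EPS, m2.start)])


def union(m1, m2):               # new initial state with ε-arcs to both machines
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, m1.arcs + m2.arcs + [
        (s, EPS, m1.start), (s, EPS, m2.start),
        (m1.final, EPS, f), (m2.final, EPS, f)])


def closure(m):                  # loop back for repetition, direct link for zero
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, m.arcs + [
        (s, EPS, m.start), (m.final, EPS, f),
        (m.final, EPS, m.start), (s, EPS, f)])


def eps_closure(states, arcs):
    """All states reachable from `states` using only ε-arcs."""
    stack, closed = list(states), set(states)
    while stack:
        state = stack.pop()
        for src, label, dst in arcs:
            if src == state and label is EPS and dst not in closed:
                closed.add(dst)
                stack.append(dst)
    return closed


def accepts(m, string):
    current = eps_closure({m.start}, m.arcs)
    for ch in string:
        moved = {dst for src, label, dst in m.arcs if src in current and label == ch}
        current = eps_closure(moved, m.arcs)
    return m.final in current


def word(text):                  # concatenation of single-symbol fragments
    return reduce(concat, [symbol(ch) for ch in text])


sheep = concat(word("baa"), concat(closure(symbol("a")), symbol("!")))  # baa+!
pets = union(word("cat"), word("dog"))                                  # cat|dog
```

For example, accepts(sheep, "baaa!") and accepts(pets, "dog") both hold, while accepts(sheep, "ba!") does not.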