Optimization of DFA-Based Pattern Matchers
Introduction: In this section we present three algorithms that have been used to
implement and optimize pattern matchers constructed from regular expressions.
1. The first algorithm is useful in a Lex compiler, because it constructs a DFA directly
from a regular expression, without constructing an intermediate NFA. The resulting
DFA also may have fewer states than the DFA constructed via an NFA.
2. The second algorithm minimizes the number of states of any DFA, by combining
states that have the same future behavior. The algorithm itself is quite efficient, running
in time 0(n log n), where n is the number of states of the DFA.
3. The third algorithm produces more compact representations of transition tables
than the standard, two-dimensional table.
Important States of an NFA: To begin our discussion of how to go directly from a
regular expression to a DFA, we must first dissect the NFA construction of Algorithm
3.23 and consider the roles played by various states. We call a state of an NFA
important if it has a non-e out-transition. Notice that the subset construction (Algorithm
3.20) uses only the important states in a set T when it computes e-closure (move (T,
a)), the set of states reachable from T on input a. That is, the set of states move(s, a)
is nonempty only if state s is important. During the subset construction, two sets of
NFA states can be identified (treated as if they were the same set) if they:
1. Have the same important states, and
2. Either both have accepting states or neither does.
When the NFA is constructed from a regular expression, we can say more about the
important states. The only important states are those introduced as initial states in the
basis part for a particular symbol position in the regular expression. That is, each
important state corresponds to a particular operand in the regular expression.
The constructed NFA has only one accepting state, but this state, having no out-
transitions, is not an important state. By concatenating a unique right end marker # to
a regular expression r, we give the accepting state for r a transition on #, making it an
important state of the NFA for ( r ) # . In other words, by using the augmented regular
expression ( r ) # , we can forget about accepting states as the subset construction
proceeds; when the construction is complete, any state with a transition on # must be
an accepting state.
The important states of the NFA correspond directly to the positions in the regular
expression that hold symbols of the alphabet. It is useful, as we shall see, to present
the regular expression by its syntax tree, where the leaves correspond to operands
and the interior nodes correspond to operators. An interior node is called a cat-node,
or-node, or star-node if it is labeled by the concatenation operator (dot), union operator
|, or star operator *, respectively.
Example:Figure 3.56 shows the syntax tree for the regular expression of our running
example. Cat-nodes are represented by circles.
Leaves in a syntax tree are labeled by e or by an alphabet symbol. To each leaf not
labeled e, we attach a unique integer. We refer to this integer as the position of the
leaf and also as a position of its symbol. Note that a symbol can have several positions;
for instance, a has positions 1 and 3 in Fig. 3.56. The positions in the syntax tree
correspond to the important states of the constructed NFA.
E x a m p l e:Figure 3.57 shows the NFA for the same regular expression as Fig. 3.56,
with the important states numbered and other states represented by letters. The
numbered states in the NFA and the positions in the syntax tree correspond in a way
we shall soon see.