Text.Regex.DFA.Engine

regex-dfa-0.91: Replaces/Enhances Text.Regex

Description

By Chris Kuklewicz (haskell (at) list (dot) mightyreason (dot) com), 2006.

This file is licensed under the LGPL (version 2, see the LICENSE file), because it is a derivative work. This DFAEngine takes the lazy transition table from Manuel Chakravarty's lexer in CTK Light, but uses it for simpler purposes. The original CTK code can be found here http://www.cse.unsw.edu.au/~chak/haskell/ctk/

Don Stewart (http://www.cse.unsw.edu.au/~dons/contact.html) also contributed to this code.

I want the freedom to alter the types a bit, so this is a separate module.

The CTK and DFA code can be thought of as three parts:

1. The ability to compose Regexp combinators which will lazily assemble a DFA. This is mainly bound up in the Cont type and the internal functions that merge it (exported as >||<).

2. The interface of how to specify Failure and Success. This was bound up in LexAction holding an function and is now lexAcceptlesFailurelexContinue.

3. The traversal engine. At each longer and longer match the last seen match is updated. Different traversals keep track of different levels of detail.

As a descendent of the regex-dna entry at http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all, this module has contributions from Don Stewart, Alson Kemp, and Chris Kuklewicz.

Synopsis

data Lexer

= Lexer (Set DoPa) !LexAction Cont

| Predicate {

rIndex :: (Set DoPa)

whichBoundary :: Boundary

atBoundary :: Lexer

notAtBoundary :: Lexer

}

newtype LexAction = LexAction Int

type Regexp = Lexer -> Lexer

data Cont = Done

data Boundary

= BeginLine

| EndLine

| BeginInput

| EndInput

emptyOp :: Regexp

char :: DoPa -> Char -> Regexp

alt :: DoPa -> [Char] -> Regexp

altNot :: DoPa -> [Char] -> Regexp

allChar :: DoPa -> Regexp

beginLine :: DoPa -> Regexp

endLine :: DoPa -> Regexp

beginInput :: DoPa -> Regexp

endInput :: DoPa -> Regexp

(>|<) :: Regexp -> Regexp -> Regexp

orRE :: [Regexp] -> Regexp

(+>) :: Regexp -> Regexp -> Regexp

concatRE :: [Regexp] -> Regexp

quest :: Regexp -> Regexp -> Regexp

star :: Regexp -> Regexp -> Regexp

plus :: Regexp -> Regexp -> Regexp

failure :: Lexer

accept :: Regexp -> Lexer

(>||<) :: Lexer -> Lexer -> Lexer

findRegex :: Lexer -> String -> (String, Int, Maybe (String, Int, String))

matchesRegex :: Lexer -> [Char] -> Bool

testHere :: Lexer -> Int -> Char -> [Char] -> Bool

countRegex :: Lexer -> [Char] -> Int

findRegexS :: Lexer -> String -> (String, Maybe (String, Int, String))

peek :: Cont -> Char -> Lexer

inBounds :: Char -> BoundsNum -> Bool

lexFailure :: LexAction

lexContinue :: LexAction

lexAccept :: LexAction

Documentation

data Lexer

tree structure used to represent the lexer table

each node in the tree corresponds to a of the lexer; the associated actions are those that apply when the corresponding is reached

Constructors

Lexer (Set DoPa) !LexAction Cont

Predicate

rIndex :: (Set DoPa)
whichBoundary :: Boundary
atBoundary :: Lexer
notAtBoundary :: Lexer

Instances

Show Lexer

newtype LexAction

This is interface between the DFA table and the traversal engine, and is simpler than the original CTK version.

Constructors

LexAction Int

Instances

Eq LexAction

Ord LexAction

Show LexAction

type Regexp = Lexer -> Lexer

a regular expression

data Cont

Done or a table-like-thing to associate the next character with a Lexer

Constructors

Done

Instances

Show Cont

data Boundary

Need to encode, as data, the between character decision am I at this boundary?. The different types of boundary checks and their outcomes are all value of Boundary. Currently only the ^ and $ anchors are encoded.

Constructors

BeginLine
EndLine
BeginInput
EndInput

Instances

Eq Boundary

Show Boundary

emptyOp :: Regexp

These create Regexp

Empty lexeme (noop)

char :: DoPa -> Char -> Regexp

One character regexp

alt :: DoPa -> [Char] -> Regexp

accepts a list of alternative characters Equiv. to `(foldr1 (>|<) . map char) cs', but much faster

altNot :: DoPa -> [Char] -> Regexp

accepts an inverted list of alternative characters Equiv. to `(foldr1 (>|<) . map char) cs', but much faster

allChar :: DoPa -> Regexp

accepts any character

beginLine :: DoPa -> Regexp

endLine :: DoPa -> Regexp

beginInput :: DoPa -> Regexp

endInput :: DoPa -> Regexp

(>|<) :: Regexp -> Regexp -> Regexp

disjunctive combination of two regexps, corresponding to x|y.

This will find the longest match

orRE :: [Regexp] -> Regexp

(+>) :: Regexp -> Regexp -> Regexp

These combine two Regexp's

Concatenation of regexps is just concatenation of functions x +> y corresponds to xy

concatRE :: [Regexp] -> Regexp

quest :: Regexp -> Regexp -> Regexp

x quest y corresponds to the regular expression x?y

star :: Regexp -> Regexp -> Regexp

x star y corresponds to the regular expression x*y self is of type Lexer

plus :: Regexp -> Regexp -> Regexp

x plus y corresponds to the regular expression x+y

failure :: Lexer

accept :: Regexp -> Lexer

Have a match to Regexp be consider a success

(>||<) :: Lexer -> Lexer -> Lexer

disjunctive combination of two lexers (longest match, right biased)

findRegex

:: Lexer	The regular expression to match
-> String	The input string to scan along, looking for a match
-> (String, Int, Maybe (String, Int, String))	The input string before the match, length of the string before the match, Nothing if there was no match or Just input string at the start of the match, length of the match, input string starting just past the match
This is the ultra-lazy matching engine. It returns the longest match. This will not examine any more of the input than needed, checking and returning a character at a time. Once a character is read that leads to no possibility of future match it does not evaluate any deeper. When a match is found, the input past match is not examined at all. In the extreme case of the input string being (error _) this will still succeed if the Regexp matches only an empty string since the input will not be demanded at all. The input before matching in this case will be [] and its length is 0, and the length of the match is 0, which the input at start of match and the input past the match will both be (error _). This loops over `matchHere` to find the first match

matchesRegex :: Lexer -> [Char] -> Bool

This searches the input string for a match to the regex There is no need to wait for the longest match, so stop at first lexAccept

testHere

:: Lexer	current lexeme
-> Int	Origin offset
-> Char	previous input character
-> [Char]	current input
-> Bool
This checks for a match to the regex starting at the beginning of the input There is no need to wait for the longest match, so stop at first lexAccept

countRegex :: Lexer -> [Char] -> Int

This counts the number of matches to regex in the string, (it checks each possible starting position). This should be the same as ((length (splitRegex re input))-1) but more efficient ^^^ fix

findRegexS :: Lexer -> String -> (String, Maybe (String, Int, String))

This is a version of findRegex that does not compute the length of the prefix

peek :: Cont -> Char -> Lexer

inBounds :: Char -> BoundsNum -> Bool

lexFailure :: LexAction

lexContinue :: LexAction

lexAccept :: LexAction

Produced by Haddock version 0.8