Safe Haskell | None |
---|---|
Language | Haskell2010 |
Text.Regex.Pcre2
Synopsis
- match :: Alternative f => Text -> Text -> f Text
- matchOpt :: Alternative f => Option -> Text -> Text -> f Text
- matches :: Text -> Text -> Bool
- matchesOpt :: Option -> Text -> Text -> Bool
- captures :: Alternative f => Text -> Text -> f (NonEmpty Text)
- capturesOpt :: Alternative f => Option -> Text -> Text -> f (NonEmpty Text)
- sub :: Text -> Text -> Text -> Text
- gsub :: Text -> Text -> Text -> Text
- subOpt :: Option -> Text -> Text -> Text -> Text
- _match :: Text -> Traversal' Text Text
- _matchOpt :: Option -> Text -> Traversal' Text Text
- _captures :: Text -> Traversal' Text (NonEmpty Text)
- _capturesOpt :: Option -> Text -> Traversal' Text (NonEmpty Text)
- regex :: QuasiQuoter
- _regex :: QuasiQuoter
- data Captures (info :: CapturesInfo)
- capture :: forall {k} (i :: k) (info :: CapturesInfo) (num :: Nat). (CaptNum i info ~ num, KnownNat num) => Captures info -> Text
- _capture :: forall {k} (i :: k) (info :: CapturesInfo) (num :: Nat). (CaptNum i info ~ num, KnownNat num) => Lens' (Captures info) Text
- data Option
- = AllowEmptyClass
- | AltBsux
- | AltBsuxLegacy
- | AltCircumflex
- | AltVerbNames
- | Anchored
- | Bsr Bsr
- | Caseless
- | DepthLimit Word32
- | DollarEndOnly
- | DotAll
- | EndAnchored
- | EscapedCrIsLf
- | Extended
- | ExtendedMore
- | FirstLine
- | HeapLimit Word32
- | Literal
- | MatchLimit Word32
- | MatchLine
- | MatchUnsetBackRef
- | MatchWord
- | MaxPatternLength Word64
- | Multiline
- | NeverBackslashC
- | NeverUcp
- | Newline Newline
- | NoAutoCapture
- | NoAutoPossess
- | NoDotStarAnchor
- | NoStartOptimize
- | NotBol
- | NotEmpty
- | NotEmptyAtStart
- | NotEol
- | OffsetLimit Word64
- | ParensLimit Word32
- | PartialHard
- | PartialSoft
- | SubGlobal
- | SubLiteral
- | SubReplacementOnly
- | SubUnknownUnset
- | SubUnsetEmpty
- | Ucp
- | Ungreedy
- data Bsr
- data Newline
- data SomePcre2Exception
- data Pcre2Exception
- data Pcre2CompileException
- defaultBsr :: Bsr
- compiledWidths :: [Int]
- defaultDepthLimit :: Int
- defaultHeapLimit :: Int
- supportsJit :: Bool
- jitTarget :: Maybe Text
- linkSize :: Int
- defaultMatchLimit :: Int
- defaultNewline :: Newline
- defaultIsNeverBackslashC :: Bool
- defaultParensLimit :: Int
- defaultTablesLength :: Int
- unicodeVersion :: Maybe Text
- supportsUnicode :: Bool
- pcreVersion :: Text
Matching and substitution
Introduction
Atop the low-level binding to the C API, we present a high-level interface to add regular expressions to Haskell programs.
All input and output strings are strict Text
, which maps to how
PCRE2 operates in UTF-8 mode.
The C API requires pattern strings to be compiled and the compiled patterns to be executed on subject strings in discrete steps. We simplify this procedure, accepting pattern and subject as arguments in a single function, essentially:
pattern -> subject -> result
The implementation guarantees that, when partially applied to pattern but not subject, the resulting function will close on the underlying compiled object and reuse it for every subject it is subsequently applied to.
Likewise, we do not require the user to know whether a PCRE2 option is to be
applied at pattern compile time or match time. Instead we fold all possible
options into a single datatype, Option
. Most functions have vanilla and
configurable variants; the latter have "Opt
" in the name and accept a
value of this type.
Similar to how head :: [a] -> a
sacrifices totality for type simplicity,
we represent user errors as imprecise exceptions. Unlike with head
, these
exceptions are typed (as SomePcre2Exception
s); moreover, we offer Template
Haskell facilities that can intercept some of these errors before the
program is run. (Failure to match is not considered a user error and is
represented by empty
; see below.)
There's more than one way to do it with this library. The choices between functions and traversals,
poly-kinded Captures
and plain lists, string literals and
quasi-quotations, quasi-quoted expressions and quasi-quoted patterns...these
are left to the user. She will observe that advanced features' extra
safety, power, and convenience entail additional language extensions,
cognitive overhead, and (for lenses) library dependencies, so it's really a
matter of finding the best trade-offs for her case.
Definitions
- Pattern
- The string defining a regular expression. Refer to syntax here.
- Subject
- The string the compiled regular expression is executed on.
- Regex
- A function of the form
, where the argument is the subject. It is "compiled" in the course of producing results and may be reused via partial application as discussed above. (Lens users: A regex has the more abstract formText
-> result
, but the concept is the same.)Traversal'
Text
result - Capture (or capture group)
- Any substrings of the subject matched by the pattern, meaning the whole pattern and any parenthesized groupings. The PCRE2 docs do not refer to the former as a "capture"; however it is accessed the same way in the C API, just with index 0, so we will consider it the 0th capture for consistency. Parenthesized captures are implicitly numbered from 1.
- Unset capture
- A capture considered unset as distinct from empty. This
can arise from matching the pattern
(a)?
to an empty subject—the 0th capture will be set as empty, but the 1st will be unset altogether. We represent both as emptyText
for simplicity. See below for discussions about how unset captures may be detected or substituted using this library. - Named capture
- A parenthesized capture can be named like this:
(?<foo>...)
. Whether they have names or not, captures are always numbered as described above.
Handling results and errors
In contrast to other
APIs
where there are separate functions to request single versus global matching,
we accomplish this in a unified fashion using the
Alternative
typeclass (since 2.0.0). Typically the
user will choose from two instances, Maybe
and []
:
b2 :: (Alternative f) => Text -> f Text b2 = match "b.." -- Zero or one match findB2 :: Text -> Maybe Text findB2 = b2 -- Zero or more matches findAllB2s :: Text -> [Text] findAllB2s = b2
Other instances exist for niche uses, such as STM
.
By contrast, user errors are thrown purely. If a user error is to be caught, it must be at the site where the match or substitution results are evaluated. As a particular consequence, pattern compile errors are deferred to match sites.
>>>
broken = match "*"
>>>
:t broken
broken :: Alternative f => Text -> f Text>>>
broken "foo"
*** Exception: pcre2_compile: quantifier does not follow a repeatable item * ^
evaluate
comes in handy to force results into the IO
monad in order to catch errors reliably:
>>>
:set -XTypeApplications
>>>
handle @SomePcre2Exception (\_ -> return Nothing) $ evaluate $ broken "foo"
Nothing
Basic matching functions
match :: Alternative f => Text -> Text -> f Text Source #
Match a pattern to a subject and return the portion(s) that matched in an
Alternative
, or empty
if no match.
Since: 2.0.0
matchOpt :: Alternative f => Option -> Text -> Text -> f Text Source #
matchOpt mempty = match
Since: 2.0.0
captures :: Alternative f => Text -> Text -> f (NonEmpty Text) Source #
Match a pattern to a subject and return some non-empty list(s) of captures
in an Alternative
, or empty
if no match. The non-empty list constructor
:|
serves as a cue to differentiate the 0th capture from the others:
let parseDate = captures "(\\d{4})-(\\d{2})-(\\d{2})" in case parseDate "submitted 2020-10-20" of Just (date :| [y, m, d]) -> ... Nothing -> putStrLn "didn't match"
Since: 2.0.0
capturesOpt :: Alternative f => Option -> Text -> Text -> f (NonEmpty Text) Source #
capturesOpt mempty = captures
Since: 2.0.0
PCRE2-native substitution
Perform at most one substitution. See the docs for the special syntax of replacement.
>>>
sub "\\b(\\w+) calling the (\\w+)\\b" "$2 calling the $1" "the pot calling the kettle black"
"the kettle calling the pot black"
gsub :: Text -> Text -> Text -> Text Source #
Perform substitutions globally.
>>>
gsub "a" "o" "apples and bananas"
"opples ond bononos"
subOpt :: Option -> Text -> Text -> Text -> Text Source #
subOpt mempty = sub subOpt SubGlobal = gsub
Lens-powered matching and substitution
To use this portion of the library, there are two prerequisites:
- A basic working understanding of optics. Many tutorials exist online, such as this, and videos such as this.
- A library providing combinators. For lens newcomers, it is
recommended to grab
microlens-platform—all
of the examples in this library work with it,
packed
andunpacked
are included for working withText
, and it is upwards-compatible with the full lens library.
We expose a set of traversals that focus on matched substrings within a subject.
_nee :: Traversal' Text Text _nee = _matchOpt (Caseless <> MatchWord) "nee"
In addition to getting results, they support global substitution through setting; more generally, they can accrete effects while performing replacements.
>>>
promptNee = traverseOf (_nee . unpacked) $ \s -> print s >> getLine
>>>
promptNee "We are the knights who say...NEE!"
"NEE" NOO "We are the knights who say...NOO!">>>
In general these traversals are not law-abiding.
_captures :: Text -> Traversal' Text (NonEmpty Text) Source #
Given a pattern, produce a traversal (0 or more targets) that focuses from a subject to each non-empty list of captures that pattern matches.
Substitution works in the following way: If a capture is set such that the
new Text
is not equal to the old one, a substitution occurs, otherwise it
doesn't. This matters in cases where a capture encloses another
capture—notably, all parenthesized captures are enclosed by the 0th.
>>>
threeAndMiddle = _captures ". (.) ."
>>>
"A A A" & threeAndMiddle .~ "A A A" :| ["B"]
"A B A">>>
"A A A" & threeAndMiddle .~ "A B A" :| ["A"]
"A B A"
Changing multiple overlapping captures won't do what you want and is unsupported.
Changing an unset capture is unsupported because the PCRE2 match API does not
give location info about it. Currently we ignore all such attempts. (Native
substitution functions like sub
do not have this limitation. See also
SubUnknownUnset
and SubUnsetEmpty
.)
If the list becomes longer for some reason, the extra elements are ignored. If it's shortened, the absent elements are considered to be unchanged.
It's recommended that the list be modified capture-wise, using ix
.
let madlibs = _captures "(\\w+) my (\\w+)" print $ "Well bust my buttons!" &~ do zoom madlibs $ do ix 1 . _head .= 'd' ix 2 %= Text.reverse _last .= '?' -- "Well dust my snottub?"
_capturesOpt :: Option -> Text -> Traversal' Text (NonEmpty Text) Source #
_capturesOpt mempty = _captures
Compile-time validation
The API thus far has some hazards:
- pattern malformation such as mismatched parentheses (runtime error)
- out-of-bounds indexing of a capture group list (runtime error)
- out-of-bounds
ix
ing of aTraversal'
target (spurious failure to match) - case expression containing a Haskell list pattern of the wrong length (spurious failure to match)
- regex created and discarded inline (suboptimal performance)
- precariously many backslashes in a pattern. Matching a literal
backslash requires the sequence
"\\\\"
!
Using a combination of language extensions and pattern introspection features, we provide a Template Haskell API to mitigate these. To make use of it these must be enabled:
Extension | Required for | When |
---|---|---|
DataKinds
| Nat s (numbers),
Symbol s (strings), and
other type-level data powering
compile-time capture lookups | Using regex /_regex with
a pattern containing
parenthesized captures
|
QuasiQuotes | [ f| ...|] syntax | Using regex /_regex |
TypeApplications
| @i syntax for supplying type index
arguments to applicable functions
| Using regex /_regex with
a pattern containing
parenthesized captures;
using capture /_capture |
ViewPatterns
| Running code and binding variables in pattern context proper (pattern guards are off-limits for this) | Using regex as a Haskell
pattern
|
The inspiration for this portion of the library is Ruby, which supports regular expressions with superior ergonomics.
Quasi-quoters
regex :: QuasiQuoter Source #
As an expression
regex :: (Alternative
f) => String -> Text -> f (Captures
info)
if there are parenthesized captures, or
regex :: (Alternative f) => String -> Text -> f Text
if there are none. In other words, if there is more than the 0th capture,
this behaves like captures
(except returning an opaque Captures
instead
of a NonEmpty
list), otherwise it behaves like match
.
To retrieve an individual capture from a Captures
, use capture
.
case [regex|(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})|] "submitted 2020-10-20" of Just cs -> let date = capture @0 cs year = read @Int $ Text.unpack $ capture @"y" cs ...
forM_ @Maybe ([regex|\s+$|] line) $ \spaces -> printf "line has trailing spaces (%d characters)\n" (Text.length spaces)
As a pattern
This matches when the regex first matches. Any named captures are bound to variables of the same names.
case "submitted 2020-10-20" of [regex|(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})|] -> let year = read @Int $ Text.unpack y ...
Note that it is not possible to access the 0th capture this way. As a workaround, explicitly capture the whole pattern and name it.
If there are no named captures, this simply acts as a guard.
_regex :: QuasiQuoter Source #
An optical variant of regex
/a type-annotated variant of _captures
. Can
only be used as an expression.
_regex :: String ->Traversal'
Text (Captures
info) _regex :: String -> Traversal' Text Text
embeddedNumbers :: Traversal' String Int embeddedNumbers = packed . [_regex|\d+|] . unpacked . _Show main :: IO () main = putStrLn $ "There are 14 competing standards" & embeddedNumbers %~ (+ 1) -- There are 15 competing standards
Type-indexed capture groups
data Captures (info :: CapturesInfo) Source #
A wrapper around a list of captures that carries additional type-level information about the number and names of those captures.
This type is only intended to be created by regex
/_regex
and consumed by
capture
/_capture
, relying on type inference. Specifying the info
explicitly in a type signature is not supported—the definition of
CapturesInfo
is not part of the public API and may change.
After obtaining Captures
it's recommended to immediately consume them and
transform them into application-level data, to avoid leaking the types.
capture :: forall {k} (i :: k) (info :: CapturesInfo) (num :: Nat). (CaptNum i info ~ num, KnownNat num) => Captures info -> Text Source #
Safely lookup a capture in a Captures
result obtained from a Template
Haskell-generated matching function.
The type signature may be interpreted like this: Given some capture group
index i
and some info
about a regex, ensure that index exists and is
resolved to the number num
at compile time. Then, at runtime, get a
capture group (numbered num
) from a list of (at least num
) captures.
In practice the variable i
is specified by type application and the other
variables are inferred.
capture @3 capture @"bar"
Specifying an invalid index will result in a type error.
_capture :: forall {k} (i :: k) (info :: CapturesInfo) (num :: Nat). (CaptNum i info ~ num, KnownNat num) => Lens' (Captures info) Text Source #
Options
A Monoid
representing nearly every facility PCRE2 presents for tweaking
the behavior of regex compilation and execution.
All library functions that take options have the suffix Opt
in their names;
for each of them, there's also a non-Opt
convenience function that simply
has the (unexported) mempty
option. For many uses, options won't be
needed.
Some options can be enabled by special character sequences in the pattern as
an alternative to specifying them as an Option
. See Caseless
for
example.
Most options are exported in Text.Regex.Pcre2. The callout interface is found in Text.Regex.Pcre2.Unsafe.
Documentation is scant here. For more complete, accurate information, including discussions of corner cases arising from specific combinations of options and pattern items, please see the C API documentation.
Constructors
AllowEmptyClass | Make |
AltBsux | Like |
AltBsuxLegacy | Behave like ECMAScript 5 for |
AltCircumflex | Match a |
AltVerbNames | Enable backslash escapes in verb names. E.g.,
|
Anchored | Equivalent to beginning pattern with |
Bsr Bsr | Override what |
Caseless | Case-insensitive match. Equivalent to |
DepthLimit Word32 | Override maximum depth of nested backtracking
(default given by |
DollarEndOnly | Don't match |
DotAll | A dot also matches a (single-character) newline. Equivalent
to |
EndAnchored | More or less like ending pattern with |
EscapedCrIsLf | Interpret |
Extended | In the pattern, ignore whitespace, and enable comments
starting with |
ExtendedMore | Like |
FirstLine | The match must begin in the first line of the subject. |
HeapLimit Word32 | Override maximum heap memory (in kibibytes) used to
hold backtracking information (default given by |
Literal | Treat the pattern as a literal string. |
MatchLimit Word32 | Override maximum value of the main matching loop's
internal counter (default given by |
MatchLine | Only match complete lines. Equivalent to bracketing the
pattern with |
MatchUnsetBackRef | A backreference to an unset capture group matches an empty string. |
MatchWord | Only match subjects that have word boundaries at the
beginning and end. Equivalent to bracketing the pattern with
|
MaxPatternLength Word64 | Default is |
Multiline |
|
NeverBackslashC | Do not allow the unsafe |
NeverUcp | Don't count Unicode characters in some character classes
such as |
Newline Newline | Override what a newline is (default given by
|
NoAutoCapture | Disable numbered capturing parentheses. |
NoAutoPossess | Turn off some optimizations, possibly resulting in some callouts not being called. |
NoDotStarAnchor | Turn off an optimization involving |
NoStartOptimize | Turn off some optimizations normally performed at the beginning of a pattern. |
NotBol | First character of subject is not the beginning of
line. Only affects |
NotEmpty | The 0th capture doesn't match if it would be empty. |
NotEmptyAtStart | The 0th capture doesn't match if it would be empty and at the beginning of the subject. |
NotEol | End of subject is not the end of line. Only
affects |
OffsetLimit Word64 | Limit how far an unanchored search can advance in the subject. |
ParensLimit Word32 | Override max depth of nested parentheses (default
given by |
PartialHard | If the subject ends without finding a complete match,
stop trying alternatives and signal a partial match immediately.
Currently we do this by throwing a |
PartialSoft | If the subject ends and all alternatives have been tried,
but no complete match is found, signal a partial match. Currently we do
this by throwing a |
SubGlobal | Affects |
SubLiteral | Affects |
SubReplacementOnly | Affects |
SubUnknownUnset | Affects |
SubUnsetEmpty | Affects |
Ucp | Count Unicode characters in some character classes such as |
Ungreedy | Invert the effect of |
What \R
, backslash R, can mean.
Constructors
BsrUnicode | any Unicode line ending sequence |
BsrAnyCrlf |
|
What's considered a newline.
Constructors
NewlineCr |
|
NewlineLf |
|
NewlineCrlf |
|
NewlineAny | any Unicode line ending sequence |
NewlineAnyCrlf | any of the above |
NewlineNul | binary zero |
User errors
data SomePcre2Exception Source #
The root of the PCRE2 exception hierarchy.
Instances
Exception SomePcre2Exception Source # | |
Defined in Text.Regex.Pcre2.Internal Methods toException :: SomePcre2Exception -> SomeException # fromException :: SomeException -> Maybe SomePcre2Exception # | |
Show SomePcre2Exception Source # | |
Defined in Text.Regex.Pcre2.Internal Methods showsPrec :: Int -> SomePcre2Exception -> ShowS # show :: SomePcre2Exception -> String # showList :: [SomePcre2Exception] -> ShowS # |
data Pcre2Exception Source #
Vanilla PCRE2 exceptions with messages generated by the underlying C library.
Instances
Exception Pcre2Exception Source # | |
Defined in Text.Regex.Pcre2.Internal Methods toException :: Pcre2Exception -> SomeException # | |
Show Pcre2Exception Source # | |
Defined in Text.Regex.Pcre2.Internal Methods showsPrec :: Int -> Pcre2Exception -> ShowS # show :: Pcre2Exception -> String # showList :: [Pcre2Exception] -> ShowS # |
data Pcre2CompileException Source #
PCRE2 compile exceptions. Along with a message stating the cause, we show the pattern with a cursor pointing at where the error is (if not after the last character).
Instances
Exception Pcre2CompileException Source # | |
Defined in Text.Regex.Pcre2.Internal | |
Show Pcre2CompileException Source # | |
Defined in Text.Regex.Pcre2.Internal Methods showsPrec :: Int -> Pcre2CompileException -> ShowS # show :: Pcre2CompileException -> String # showList :: [Pcre2CompileException] -> ShowS # |
PCRE2 build configuration
defaultBsr :: Bsr Source #
See Bsr
.
compiledWidths :: [Int] Source #
Which code widths PCRE2 is compiled to operate on. Can be any combination
of 8, 16, and 32. Should be [8]
but provided here for completeness.
defaultDepthLimit :: Int Source #
See DepthLimit
.
defaultHeapLimit :: Int Source #
See HeapLimit
.
supportsJit :: Bool Source #
Was PCRE2 built with JIT support?
jitTarget :: Maybe Text Source #
A nice description of the CPU architecture JIT support is compiled for, if any.
defaultMatchLimit :: Int Source #
See MatchLimit
.
defaultNewline :: Newline Source #
See Newline
.
defaultParensLimit :: Int Source #
See ParensLimit
.
defaultTablesLength :: Int Source #
Size in bytes of PCRE2's built-in character processing tables.
unicodeVersion :: Maybe Text Source #
Unicode version string such as 8.0.0
, if Unicode is supported at all.
supportsUnicode :: Bool Source #
Was PCRE2 built with Unicode support?
pcreVersion :: Text Source #
Version of the built-in C library. The versioning scheme is that PCRE
legacy is 8.x and PCRE2 is 10.x, so this should be 10.
something.