Advanced Regular Expressions

Regular expressions are patterns used to find text. Quantifiers can be greedy, lazy, or possessive, and groups can be atomic. Lookarounds are zero-width, atomic assertions and can be positive or negative. For example, positive lookahead checks the upcoming characters for a match: q(?=u)i applied to "quit".

Uploaded by

Andrei Ursuleanu
Copyright
© Attribution Non-Commercial (BY-NC)
Back to Basics

Anchors and matching positions (e.g. \b \B)


More anchors (not in JavaScript): \A \Z \z \G

$ \Z \z and the multiline switch
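The anchors above can be tried out with a minimal Python sketch. The sample string is made up for illustration, and one flavor difference is worth noting: in Python's `re` module `\Z` matches only at the very end of the string, which is what other flavors write as `\z`.

```python
import re

# Hypothetical sample text: two lines, "cat" as a word and as a word ending.
text = "cat concat\ncat"

# \b matches at a word boundary, \B at any position that is not one.
whole_words = re.findall(r"\bcat\b", text)   # ['cat', 'cat']
word_tails = re.findall(r"\Bcat\b", text)    # ['cat'] (the end of "concat")

# Without the multiline switch, ^ only matches at the start of the string.
caret_single = [m.start() for m in re.finditer(r"^cat", text)]                # [0]
caret_multi = [m.start() for m in re.finditer(r"^cat", text, re.MULTILINE)]   # [0, 11]

# With the multiline switch, $ also matches before each newline,
# while \Z still matches only at the very end of the string.
dollar_multi = [m.start() for m in re.finditer(r"cat$", text, re.MULTILINE)]   # [7, 11]
end_only = [m.start() for m in re.finditer(r"cat\Z", text, re.MULTILINE)]      # [11]

print(whole_words, word_tails, caret_single, caret_multi, dollar_multi, end_only)
```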

Characters ranges and Unicode


Difference between String.match and RegExp.exec

Python, PHP and .NET support named capturing groups (which are also numbered):

(?P<name>group) and (?P=name)

(?<name>group) and \k<name>

(?'name'group) and \k'name'
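As a small illustration of the Python syntax listed above, the sketch below finds a doubled word using a named group and a named backreference. The sample sentence and the group name `word` are made up for the example.

```python
import re

# (?P<word>...) defines a named group; (?P=word) must match the same text again.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

m = doubled.search("this is is a test")

whole = m.group(0)          # 'is is'
by_name = m.group("word")   # 'is'
by_number = m.group(1)      # 'is' (named groups are also numbered)

print(whole, by_name, by_number)
```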

How the RegEx Engine Works


2 engine flavors: DFA (text-directed) and NFA (regex-directed)

Evaluation of text from left to right


2 cursors: one on the text, one on the expression

The engines create intermediate states

NFA tries one possibility at a time; full featured

DFA tries all possibilities at once; fast but restrictive

The match that begins earliest wins (leftmost, or first come, first served)

Backtracking (NFA): falling back to a previous step and trying another permutation

Backreferences and laziness: only for NFA
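Two of the points above, "the match that begins earliest wins" and backtracking, can be observed with Python's `re` module, which is a backtracking (NFA-style) engine. The sample strings are chosen only for illustration.

```python
import re

text = "The dragging belly indicates your cat is too fat"

# All four alternatives occur in the text, but the one that starts
# earliest in the text wins, regardless of its position in the pattern.
first = re.search(r"fat|cat|belly|your", text)
earliest = (first.group(), first.start())   # ('belly', 13)

# Backtracking: (\d+) first takes all five digits, then gives one back
# so that the final (\d) can match.
parts = re.fullmatch(r"(\d+)(\d)", "12345").groups()   # ('1234', '5')

print(earliest, parts)
```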

Greediness and Laziness

Quantifiers are GREEDY by default: they first match as much as possible and then fall back if forced

Adding ? after a quantifier makes it LAZY (non-greedy)

Examples: .*? \d+? _??(\w+)

Difference between greedy and lazy: whether the engine first tries to make the optional match or to skip it

Are alternations greedy? A traditional NFA uses ordered alternation: (tourn|to|tournament)
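The following Python sketch contrasts greedy and lazy quantifiers and shows ordered alternation. The HTML-like string and the `_name` input are made-up samples; the patterns are taken from the slide or are close variants of them.

```python
import re

html = "<em>Hello</em> world"

# Greedy: .+ runs to the end of the string, then backs off to the last ">".
greedy = re.search(r"<.+>", html).group()    # '<em>Hello</em>'
# Lazy: .+? takes one character, then extends only as far as needed.
lazy = re.search(r"<.+?>", html).group()     # '<em>'

# "Make or skip the attempt": the greedy _? consumes the underscore first,
# the lazy _?? skips it first and leaves it for \w+ (underscore is a \w char).
greedy_opt = re.match(r"_?(\w+)", "_name").group(1)    # 'name'
lazy_opt = re.match(r"_??(\w+)", "_name").group(1)     # '_name'

# Ordered alternation: the first alternative that leads to a match wins,
# not the longest one.
ordered = re.search(r"tourn|to|tournament", "tournament").group()     # 'tourn'
reordered = re.search(r"tournament|tourn|to", "tournament").group()   # 'tournament'

print(greedy, lazy, greedy_opt, lazy_opt, ordered, reordered)
```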

Possessive Quantifiers and Atomic Grouping

Atomic groups are handled as a single unit (?> )

After exiting the atomic group all inner states are thrown away

Atomic groups can eliminate several permutations / paths and speed up the fail

Examples: (?>.*?) (?>.+?) (?>\w+)

\b(?>int|integer)\b applied on "integer"

Possessive quantifiers are greedy and never give up the match

Adding + after a quantifier makes it possessive

Examples: .*+ \d++ _?+(\w+)
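A hedged Python sketch of the behaviour described above. Python's `re` module only accepts `(?>...)` and possessive quantifiers from version 3.11 on, so the main part of the example uses a common emulation instead: a lookahead is atomic, so `(?=(X))\1` behaves like `(?>X)`. The emulation is an assumption of this sketch, not something the slides prescribe. The native syntax is exercised only when the interpreter supports it.

```python
import re
import sys

# Plain group: "int" fails at the closing \b, the engine backtracks
# and tries the second alternative, "integer".
plain = re.search(r"\b(?:int|integer)\b", "integer")

# Emulated atomic group: the lookahead captures "int" and keeps no
# alternatives, so when \b fails after "int" there is nothing to retry.
atomic_emulated = re.search(r"\b(?=(int|integer))\1\b", "integer")

# Greedy \d+ gives back one digit so the trailing \d can match.
greedy_digits = re.search(r"\d+\d", "12345")

# Emulated possessive \d++ keeps all digits, so the trailing \d never matches.
possessive_emulated = re.search(r"(?=(\d+))\1\d", "12345")

print(plain.group())           # 'integer'
print(atomic_emulated)         # None
print(greedy_digits.group())   # '12345'
print(possessive_emulated)     # None

if sys.version_info >= (3, 11):
    # Native syntax, same results as the emulation.
    atomic_native = re.search(r"\b(?>int|integer)\b", "integer")
    possessive_native = re.search(r"\d++\d", "12345")
    print(atomic_native, possessive_native)   # None None
```

This is the "speed up the fail" case: the atomic and possessive versions fail immediately instead of exploring permutations that cannot succeed.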

Lookahead and Lookbehind


Non-capturing groups: (?: )

Lookarounds are zero-width assertions and atomic

Lookarounds can be positive or negative

Positive lookahead (?= ) checks that the upcoming characters match: q(?=u)i applied to "quit"

Negative lookahead (?! ) checks that the upcoming characters do not match: q(?!i)u applied to "quit"

Lookahead can contain a full regular expression

In most flavors, lookbehind can only contain fixed-length patterns

Positive lookbehind (?<= ): (?<=a)b applied to "thingamabob"

Negative lookbehind (?<! ): (?<!a)b applied to "thingamabob"
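The lookaround examples above can be checked with a short Python sketch. The extra pattern `(?<=a+)b` is added only to illustrate the fixed-length restriction as Python's `re` module enforces it; other flavors may behave differently.

```python
import re

# Positive lookahead is zero-width: after "q" it confirms "u" but consumes
# nothing, so "i" is then tried against "u" and the match fails.
pos_ahead = re.search(r"q(?=u)i", "quit")            # None

# Negative lookahead: "q" must not be followed by "i", then "u" is matched.
neg_ahead = re.search(r"q(?!i)u", "quit").group()    # 'qu'

word = "thingamabob"
# Positive lookbehind: the first "b" preceded by "a" is at index 8.
pos_behind = re.search(r"(?<=a)b", word).start()     # 8
# Negative lookbehind: the first "b" not preceded by "a" is at index 10.
neg_behind = re.search(r"(?<!a)b", word).start()     # 10

# Python rejects a variable-length lookbehind at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_length_ok = True
except re.error:
    variable_length_ok = False

print(pos_ahead, neg_ahead, pos_behind, neg_behind, variable_length_ok)
```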

References

http://www.regular-expressions.info/
