Regex Clinic

Andrei Zmievski hosted a regular expression clinic on March 12, 2010 at ConFoo 2010. The document provides terminology for regular expressions like regex, subject string, match, and engine. It also explains how regex engines work and covers syntax elements like characters, literals, metacharacters, character classes, quantifiers, greediness, assertions and anchors. Special attention is given to explaining common quantifiers like ?, *, +, {}, explaining character classes and ranges, and overcoming regex greediness.
Copyright
© Attribution Non-Commercial (BY-NC)

Andrei Zmievski
andrei@php.net
@a

ConFoo 2010

Andrei's Regex Clinic

Friday, March 12, 2010

userfriendly: line noise



what are they good for?

Literal string searches are fast but inflexible
With regular expressions you can:
Find out whether a certain pattern occurs in the text
Locate strings matching a pattern and remove them or replace them with something else
Extract the strings matching the pattern
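
The three uses listed above can be sketched as follows. This is only an illustration: Python's re module stands in for PCRE, and the sample text and pattern are made up.

```python
import re

text = "Order 66 shipped, order 77 pending"   # hypothetical sample text
pattern = r"\d+"                              # one or more digits

# 1. Find out whether the pattern occurs in the text
found = re.search(pattern, text) is not None

# 2. Replace every match with something else
masked = re.sub(pattern, "#", text)

# 3. Extract the strings matching the pattern
numbers = re.findall(pattern, text)
```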

terminology

Regex
a pattern describing a set of strings
Example: abcdef

Subject String
text that the regex is applied to
Example: apple

Match
a portion of the string that is successfully described by the regex
Example: regex a applied to apple matches the leading a

Engine
a program or a library that obtains matches given a regex and a string
Example: PCRE

how an NFA engine works

The engine bumps along the string trying to match the regex
Sometimes it goes back and tries again

Two basic things to understand about the engine:
It will always return the earliest (leftmost) match it finds
The topic of the day is isotopes.
Given a choice, it always favors a match over a non-match
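
A small check of the leftmost-match rule, using Python's re module as a stand-in for PCRE. The regex top is an assumed example; the slide text does not preserve which pattern was applied to the sentence.

```python
import re

subject = "The topic of the day is isotopes."

# "top" occurs twice: in "topic" and inside "isotopes"
first = re.search(r"top", subject)
all_starts = [m.start() for m in re.finditer(r"top", subject)]
```

A single search reports only the earliest occurrence; the later one is found only when the search is continued past the first match.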

color legend
regular expression

subject string

match

Syntax

characters

Special set is a well-defined subset of ASCII
Ordinary set consists of all characters not designated special
Special characters are also called metacharacters

a 4 0 K x . ! * ? ^

matching literals

The most basic regex consists of a single ordinary character
It matches the first occurrence of that character in the string
Characters can be added together to form longer regexes
Example: 123

extended characters

To match an extended character, use \xhh notation where hh are hexadecimal digits
To match Unicode characters (in UTF-8 mode) use \x{hhh..} notation

metacharacters

. [] () ^$ *+? {} |

To use one of these literally, escape it, that is prepend it with a backslash: \$
To escape a sequence of characters, put them between \Q and \E
Price is \Q$12.36\E will match Price is $12.36
So will the backslashed version
Price is \$12\.36 will match Price is $12.36
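
A sketch of the same escaping idea in Python, used here as a stand-in for PCRE. Python's re module has no \Q...\E, so re.escape() plays that role in this example.

```python
import re

subject = "Price is $12.36"

# Backslashed version, same as on the slide
manual = re.compile(r"Price is \$12\.36")

# re.escape() is the Python counterpart of wrapping text in \Q...\E
escaped = re.compile("Price is " + re.escape("$12.36"))

# Without escaping, $ acts as an anchor and the pattern cannot match
unescaped = re.compile(r"Price is $12.36")
```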

character classes

[]

Consist of a set of characters placed inside square brackets
Matches one and only one of the characters specified inside the class
[aeiou] matches an English vowel (lowercase)
[st]urf matches surf or turf

negated classes

[^]

Placing a caret as the first character after the opening bracket negates the class
Will match any character not in the class, including newlines
[^<>] would match a character that is not a left or right angle bracket
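
The class examples can be tried out like this. Python's re module is used as a stand-in for PCRE, and the subject strings are made up for illustration.

```python
import re

# Each vowel is a separate one-character match
vowels = re.findall(r"[aeiou]", "regex")

# [st] matches exactly one of s or t
surf_turf = re.findall(r"[st]urf", "surf and turf, not nurf")

# A negated class also matches a newline, because newline is not listed in it
not_brackets = re.findall(r"[^<>]", "<a\n>")
```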

character ranges

[-]

Placing a dash (-) between two characters creates a range from the first one to the second one
Useful for abbreviating a list of characters: [a-z]
Ranges cannot be reversed: [z-a] is rejected as a range out of order
A class can have more than one range and combine ranges with normal lists: [a-z0-9:]

shortcuts for ranges

\w   word character         [A-Za-z0-9_]
\d   decimal digit          [0-9]
\s   whitespace             [ \n\r\t\f]
\W   not a word character   [^A-Za-z0-9_]
\D   not a decimal digit    [^0-9]
\S   not whitespace         [^ \n\r\t\f]
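
A quick comparison of a shortcut with its spelled-out class, plus a class that mixes ranges with a single character. Python's re module stands in for PCRE; for non-ASCII text Python's \w is broader than the class shown, so the inputs here are ASCII only.

```python
import re

# A class can mix several ranges with single characters
tokens = re.findall(r"[a-z0-9:]+", "Port: eth0:8080!")

# On ASCII input the shortcut and the explicit class give the same result
sample = "foo_bar 42"
shortcut = re.findall(r"\w+", sample)
explicit = re.findall(r"[A-Za-z0-9_]+", sample)
```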

classes and metacharacters

Inside a character class, most metacharacters lose their meaning
Exceptions are:
]   closing bracket
\   backslash
^   caret
-   dash
To use them literally, either escape them with a backslash or put them where they do not have special meaning
[ab\]] [ab^] [a-z-]

dot metacharacter

.

By default matches any single character
Except a newline (\n)
Is equivalent to [^\n]
Use dot carefully - it might match something you did not intend
12.45 will match literal 12.45
But it will also match these: 12345, 12945, 12a45, 12-45, 78812 45839
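
The difference between an unescaped and an escaped dot, sketched with Python's re module as a stand-in for PCRE. The candidate list is illustrative.

```python
import re

candidates = ["12.45", "12945", "12a45", "12-45", "12\n45"]

# Unescaped dot: any single character except a newline
loose = [s for s in candidates if re.fullmatch(r"12.45", s)]

# Escaped dot: only a literal period
strict = [s for s in candidates if re.fullmatch(r"12\.45", s)]
```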

quantifiers

? * + {}

We are almost never sure about the contents of the text.
Quantifiers help us deal with this uncertainty
They specify how many times a regex component must repeat in order for the match to be successful

repeatable components

a                   literal character
.                   dot metacharacter
\w \d \s \W \D \S   range shortcuts
[]                  character class
()                  subpattern
\n                  backreference

zero-or-one

Indicates that the preceding component is optional
Regex welcome!? will match either welcome or welcome!
Regex super\s?strong means that super and strong may have an optional whitespace character between them
Regex hello[!?]? will match hello, hello!, or hello?

one-or-more

Indicates that the preceding component has to appear once or more
Regex a+h will match ah, aah, aaah, etc
Regex -\d+ will match negative integers, such as -33
Regex [^"]+ means to match a sequence (one or more) of characters until the next quote

zero-or-more

Indicates that the preceding component can match zero or more times
Regex \d+\.\d* will match 2., 3.1, 0.001
Regex <[a-z][a-z0-9]*> will match an opening HTML tag with no attributes, such as <b> or <h2>, but not <> or </i>
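
The quantifier examples can be checked with a few lines of Python (its re module is a stand-in for PCRE here). The helper matches() is a hypothetical name introduced only for this sketch.

```python
import re

def matches(pattern, subject):
    """True when the whole subject is described by the pattern."""
    return re.fullmatch(pattern, subject) is not None

optional = [s for s in ["welcome", "welcome!", "welcome!!"] if matches(r"welcome!?", s)]
one_or_more = [s for s in ["h", "ah", "aaah"] if matches(r"a+h", s)]
zero_or_more = [s for s in ["<b>", "<h2>", "<>", "</i>"] if matches(r"<[a-z][a-z0-9]*>", s)]
decimals = [s for s in ["2.", "3.1", "0.001", ".5"] if matches(r"\d+\.\d*", s)]
```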

general repetition

{}

Specifies the minimum and the maximum number of times a component has to match
Regex ha{1,3} matches ha, haa, haaa
Regex \d{8} matches exactly 8 digits
If the second number is omitted, no upper limit is set
Regex go{2,}al matches gooal, goooal, gooooal, etc

?   =   {0,1}
+   =   {1,}
*   =   {0,}

greediness

matching as much as possible, up to a limit

PHP 5? applied to PHP 5 is better than Perl 6 matches PHP 5
\d{2,4} applied to 10/26/2004 takes all four digits of 2004 rather than stopping at 20

Quantifiers try to grab as much as possible by default
Applying <.+> to <i>greediness</i> matches the whole string rather than just <i>
If the entire match fails because they consumed too much, then they are forced to give up as much as needed to make the rest of the regex succeed
To find words ending in ness, you will probably use \w+ness
On the first run \w+ takes the whole word
But since ness still has to match, it gives up the last 4 characters and the match succeeds
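
Greedy behaviour is easy to observe directly. The sketch below uses Python's re module as a stand-in for PCRE; the capturing group around \w+ is added only to show what the quantifier ends up holding.

```python
import re

# The greedy .+ runs to the last > in the subject
greedy = re.search(r"<.+>", "<i>greediness</i>").group()

# \w+ first takes the whole word, then gives back "ness" so the rest can match
m = re.search(r"(\w+)ness", "greediness")
stem = m.group(1)

# Each run of digits is taken as far as the {2,4} limit allows
digit_runs = re.findall(r"\d{2,4}", "10/26/2004")
```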

overcoming greediness

*? +? { , }? ??

The simplest solution is to make the repetition operators non-greedy, or lazy
Lazy quantifiers grab as little as possible
If the overall match fails, they grab a little more and the match is tried again
To make a greedy quantifier lazy, append ?
Note that this use of the question mark is different from its use as a regular quantifier
Applying <.+?> to <i>greediness</i> gets us <i>
Another option is to use negated character classes
More efficient and clearer than lazy repetition
<.+?> can be turned into <[^>]+>
Note that the second version will match tags spanning multiple lines
Single-line version: <[^>\r\n]+>
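
Both ways around greediness, side by side. Python's re module is used as a stand-in for PCRE, and the multi-line tag is a made-up subject.

```python
import re

html = "<i>greediness</i>"
lazy = re.findall(r"<.+?>", html)        # lazy quantifier
negated = re.findall(r"<[^>]+>", html)   # negated character class

# The negated class crosses line breaks; the single-line variant does not
multi = "<a\nhref='x'>"
spans_lines = re.search(r"<[^>]+>", multi) is not None
single_line = re.search(r"<[^>\r\n]+>", multi) is not None
```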

assertions and anchors

An assertion is a regex operator that
expresses a statement about the current matching point
consumes no characters

The most common type of an assertion is an anchor
An anchor matches a certain position in the subject string

caret

^

Caret, or circumflex, is an anchor that matches at the beginning of the subject string
^F basically means that the subject string has to start with an F
Example: ^F applied to Fandango matches the leading F

dollar sign

$

Dollar sign is an anchor that matches at the end of the subject string or right before the string-ending newline
\d$ means that the subject string has to end with a digit
The string may be top 10 or top 10\n, but either one will match
Example: \d$ applied to top 10 matches the final 0

multiline matching

Often subject strings consist of multiple lines
If the multiline option is set:
Caret (^) also matches immediately after any newlines
Dollar sign ($) also matches immediately before any newlines
Example: ^t.+ applied to the three lines one, two, three matches two, the first line that starts with t

absolute start/end

Sometimes you really want to match the absolute start or end of the subject string when in the multiline mode
These assertions are always valid:
\A matches only at the very beginning
\z matches only at the very end
\Z matches like $ used in single-line mode
Example: \At.+ applied to the three lines three, tasty, truffles matches only three
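
The anchor rules can be checked as below. Python's re module is a stand-in for PCRE and differs in one detail relevant here: Python's \Z matches only at the very end, which is what PCRE calls \z, so the sketch sticks to ^, $ and \A.

```python
import re

lines = "one\ntwo\nthree"
default_mode = re.findall(r"^t.+", lines)               # ^ only at the very start
multiline = re.findall(r"^t.+", lines, re.MULTILINE)    # ^ after every newline too

# \A ignores the multiline option
absolute = re.findall(r"\At.+", "three\ntasty\ntruffles", re.MULTILINE)

# $ also matches right before a single string-ending newline
ends_with_digit = [bool(re.search(r"\d$", s)) for s in ["top 10", "top 10\n", "top 10\n\n"]]
```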

word boundaries

\b \B

A word boundary is a position in the string with a word character (\w) on one side and a non-word character (or string boundary) on the other
\b matches when the current position is a word boundary
\B matches when the current position is not a word boundary
Example: \bto\b applied to right to vote matches the standalone to
Example: \B2\B applied to doc2html matches the 2
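
A short check of both boundary assertions, with Python's re module standing in for PCRE. The subject strings extend the slide examples slightly and are illustrative.

```python
import re

# Only the standalone "to" qualifies; "tomorrow" has a word character after "to"
whole_word = [m.start() for m in re.finditer(r"\bto\b", "right to vote tomorrow")]

# \B2\B needs word characters on both sides of the 2
inside = re.search(r"\B2\B", "doc2html")
at_edge = re.search(r"\B2\B", "doc2 html")
```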

subpatterns

()

Parentheses can be used to group a part of the regex together, creating a subpattern
You can apply regex operators to a subpattern as a whole

grouping

Regex is(land)? matches both is and island
Regex (\d\d,)*\d\d will match a comma-separated list of double-digit numbers

capturing subpatterns

All subpatterns by default are capturing
A capturing subpattern stores the corresponding matched portion of the subject string in memory for later use
Subpatterns are numbered by counting their opening parentheses from left to right
Regex (\d\d-(\w+)-\d{4}) has two subpatterns
When run against 12-May-2004 the second subpattern will capture May
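
Grouping and capture numbering, sketched with Python's re module as a stand-in for PCRE. The comma-separated list is a made-up subject.

```python
import re

# A quantifier applied to a whole subpattern
is_list = re.fullmatch(r"(\d\d,)*\d\d", "10,20,30") is not None

# Groups are numbered by their opening parentheses, left to right
m = re.fullmatch(r"(\d\d-(\w+)-\d{4})", "12-May-2004")
whole, month = m.group(1), m.group(2)
```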

non-capturing subpatterns

The capturing aspect of subpatterns is not always necessary
It requires more memory and more processing time
Using ?: after the opening parenthesis makes a subpattern a purely grouping one
Regex box(?:ers)? will match boxers but will not capture anything
The (?:) subpatterns are not included in the subpattern numbering

named subpatterns

It can be hard to keep track of subpattern numbers in a complicated regex
Using ?P<name> after the opening parenthesis creates a named subpattern
Named subpatterns are still assigned numbers
Pattern (?P<number>\d+) will match and capture 99 into the subpattern named number when run against 99 bottles
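
The same two patterns in a minimal Python sketch (re module as a stand-in for PCRE; the (?P<name>...) syntax is shared by both).

```python
import re

plain = re.fullmatch(r"box(?:ers)?", "boxers")
group_count = plain.re.groups          # non-capturing groups are not numbered

named = re.match(r"(?P<number>\d+)", "99 bottles")
by_name = named.group("number")
by_index = named.group(1)              # a named subpattern still gets a number
```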

alternation

Alternation operator allows testing several sub-expressions at a given point
The branches are tried in order, from left to right, until one succeeds
Empty alternatives are permitted
Regex sailing|cruising will match either sailing or cruising
Since alternation has the lowest precedence, grouping is often necessary
sixth|seventh sense will match the word sixth or the phrase seventh sense
(sixth|seventh) sense will match sixth sense or seventh sense
Remember that the regex engine is eager
It will return a match as soon as it finds one
camel|came|camera will only match came when run against camera
Put the more likely pattern as the first alternative
Applying ? to assertions is not permitted, but the branches may contain assertions, such as anchors
For example, (^|my|your) friend will match friend at the beginning of the string and after my or your
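
Branch order and precedence can be observed directly. Python's re module is used as a stand-in for PCRE; the reordered pattern is an illustration of the advice, not part of the original slides.

```python
import re

# Branches are tried left to right; "came" wins before "camera" is ever tried
eager = re.search(r"camel|came|camera", "camera").group()
reordered = re.search(r"camera|camel|came", "camera").group()

# Alternation has the lowest precedence, so grouping changes the meaning
ungrouped = re.fullmatch(r"sixth|seventh sense", "sixth sense") is not None
grouped = re.fullmatch(r"(sixth|seventh) sense", "sixth sense") is not None
```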

backtracking

Also known as "if at first you don't succeed, try, try again"
When faced with several options it could try to achieve a match, the engine picks one and remembers the others
If the picked option does not lead to an overall successful match, the engine backtracks to the decision point and tries another option
This continues until an overall match succeeds or all the options are exhausted
The decision points include quantifiers and alternation
Two important rules to remember:
With greedy quantifiers the engine always attempts the match, and with lazy ones it delays the match
If there were several decision points, the engine always goes back to the most recent one

backtracking example

\d+00 applied to 12300

start
add 1                                      matched so far: 1
add 2                                      matched so far: 12
add 3                                      matched so far: 123
add 0                                      matched so far: 1230
add 0                                      matched so far: 12300
string exhausted, still need to match 00
give up 0                                  matched so far: 1230
give up 0                                  matched so far: 123
add 00                                     matched so far: 12300
success

\d+ff applied to 123dd

start
add 1                                      matched so far: 1
add 2                                      matched so far: 12
add 3                                      matched so far: 123
cannot match f here
give up 3, still cannot match f            matched so far: 12
give up 2, still cannot match f            matched so far: 1
cannot give up more because of +
failure
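
The two traces can be reproduced with a toy simulation. This is not how PCRE is implemented; it is a minimal sketch, in Python, of the add and give-up steps for a pattern of the form \d+ followed by a literal, tried at position 0 only. The function name trace_digits_then is hypothetical.

```python
def trace_digits_then(subject, tail):
    """Trace \\d+ followed by a literal tail, attempted at position 0 only."""
    steps = []
    end = 0
    # Greedy phase: \d+ adds every digit it can reach
    while end < len(subject) and subject[end] in "0123456789":
        end += 1
        steps.append(("add", subject[:end]))
    # Backtracking phase: try the tail, then give up one digit at a time,
    # but never the last one, because + needs at least one digit
    while end >= 1:
        if subject.startswith(tail, end):
            steps.append(("success", subject[:end + len(tail)]))
            return steps
        if end == 1:
            break
        end -= 1
        steps.append(("give up", subject[:end]))
    steps.append(("failure", ""))
    return steps

ok = trace_digits_then("12300", "00")
bad = trace_digits_then("123dd", "ff")
```

Each step records what \d+ holds after the action, so the list for 12300 shows five additions, two give-ups and then success.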

atomic grouping

Disabling backtracking can be useful
The main goal is to speed up failed matches, especially with nested quantifiers
(?>regex) will treat regex as a single atomic token, no backtracking will occur inside it
All the saved states are forgotten
(?>\d+)ff will lock up all available digits and fail right away if the next two characters are not ff
Atomic groups are not capturing

possessive quantifiers

Atomic groups can be arbitrarily complex and nested
Possessive quantifiers are simpler and apply to a single repeated item
To make a quantifier possessive append a single +
\d++ff is equivalent to (?>\d+)ff
Other ones are *+, ?+, and {m,n}+
Possessive quantifiers are always greedy

do not over-optimize

Keep in mind that atomic grouping and possessive quantifiers can change the outcome of the match
When run against string abcdef
\w+d will match abcd
\w++d will not match at all
\w+ will match the whole string
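
The change of outcome can be reproduced in Python, used here as a stand-in for PCRE. Python's re module accepts (?>...) and possessive quantifiers only from version 3.11, so this sketch emulates the atomic group with a lookahead plus a backreference, a substitute technique rather than the syntax shown above.

```python
import re

subject = "abcdef"

# Ordinary greedy quantifier: \w+ gives back "ef" so that d can match
backtracking = re.search(r"\w+d", subject).group()

# (?=(\w+))\1 behaves like (?>\w+): the lookahead grabs the longest run once,
# and the backreference consumes it without giving anything back
atomic_like = re.search(r"(?=(\w+))\1d", subject)

whole = re.search(r"\w+", subject).group()
```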

backreferences

\n

A backreference is an alias to a capturing subpattern
It matches whatever the referent capturing subpattern has matched
(re|le)\w+\1 matches words that start with re or le and end with the same thing
For example, retire and legible, but not revocable or lecture
Reference to a named subpattern can be made with (?P=name)
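
The word list from the slide, run through both the numbered and the named form. Python's re module is a stand-in for PCRE; the group name prefix is chosen only for this sketch.

```python
import re

pattern = re.compile(r"(re|le)\w+\1")
words = ["retire", "legible", "revocable", "lecture"]
same_ends = [w for w in words if pattern.fullmatch(w)]

# Named form: (?P=prefix) refers back to (?P<prefix>...)
named = re.compile(r"(?P<prefix>re|le)\w+(?P=prefix)")
named_ends = [w for w in words if named.fullmatch(w)]
```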

lookaround

Assertions that test whether the characters before or after the current point match the given regex
Consume no characters
Do not capture anything
Includes lookahead and lookbehind

positive lookahead

(?=)

Tests whether the characters after the current point match the given regex
(\w+)(?=:)(.*) matches surfing: a sport but the colon ends up in the second subpattern

negative lookahead

(?!)

Tests whether the characters after the current point do not match the given regex
fish(?!ing) matches fish not followed by ing
Will match fisherman and fished
Difficult to do with character classes
fish[^i][^n][^g] might work but will consume more than needed and fail on subjects shorter than 7 letters
Character classes are no help at all with something like fish(?!hook|ing)

positive lookbehind

(?<=)

Tests whether the characters immediately preceding the current point match the given regex
The regex must be of fixed size, but branches are allowed
(?<=foo)bar matches bar only if preceded by foo, e.g. my foobar

negative lookbehind

(?<!)

Tests whether the characters immediately preceding the current point do not match the given regex
Once again, regex must be of fixed size
(?<!foo)bar matches bar only if not preceded by foo, e.g. in the bar but not my foobar
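
All four lookaround forms in one short sketch. Python's re module stands in for PCRE; the fishing/fisherman/fished word list is an illustrative set of subjects.

```python
import re

# Positive lookahead: the colon is checked but not consumed by the first group
m = re.match(r"(\w+)(?=:)(.*)", "surfing: a sport")
before, after = m.groups()

# Negative lookahead
words = ["fishing", "fisherman", "fished"]
not_ing = [w for w in words if re.match(r"fish(?!ing)", w)]

# Lookbehind, positive and negative
preceded = re.search(r"(?<=foo)bar", "my foobar") is not None
not_preceded = [s for s in ["in the bar", "my foobar"] if re.search(r"(?<!foo)bar", s)]
```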

conditionals

Conditionals let you apply a regex selectively or choose between two regexes depending on a previous match
(?(condition)yes-regex)
(?(condition)yes-regex|no-regex)
There are 3 kinds of conditions:
Subpattern match
Lookaround assertion
Recursive call (not discussed here)

subpattern conditions

(?(n))

This condition is satisfied if the capturing subpattern number n has previously matched
(")? \b\w+\b (?(1)") matches words optionally enclosed by quotes
There is a difference between (")? and ("?) in this case: the second one will always capture

assertion conditions

This type of condition relies on lookaround assertions to choose one path or the other
href=(? (?=['"]) (["'])\S+\1 | \S+)
Matches href=, then
If the next character is a single or double quote, match a sequence of non-whitespace inside the matching quotes
Otherwise just match it without quotes
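
A subpattern condition in action, sketched with Python's re module as a stand-in for PCRE. Python supports conditions on a group number or name but not lookaround conditions, so only the first kind is shown; the pattern assumes the quote character is a double quote.

```python
import re

# Group 1 is the optional opening quote; (?(1)") demands a closing quote
# only when the opening one was matched
quoted_word = re.compile(r'(")?\b\w+\b(?(1)")')

subjects = ['word', '"word"', '"word', 'word"']
accepted = [s for s in subjects if quoted_word.fullmatch(s)]
```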

inline options

The matching can be modified by options you put in the regular expression

(?i)   enables case-insensitive mode
(?m)   enables multiline matching for ^ and $
(?s)   makes dot metacharacter match newline also
(?x)   ignores literal whitespace
(?U)   makes quantifiers ungreedy (lazy) by default

Options can be combined and unset: (?im-sx)
At top level, apply to the whole pattern
Localized inside subpatterns: (a(?i)b)c

comments

?#

Here's a regex I wrote when working on Smarty templating engine

^\$\w+(?>(\[(\d+|\$\w+|\w+(\.\w+)?)\])|((\.|->)\$?\w+))*(?>\|@?\w+(:(?>"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'|[^|]+))*)*$

Would you like some comments with that?

Most regexes could definitely use some comments
(?#) specifies a comment
\d+(?# match some digits)
If (?x) option is set, anything after # outside a character class and up to the next newline is considered a comment
To match literal whitespace, escape it

(?x)
\w+    # start with word characters
[?!]   # and end with ? or !
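
Both comment styles also work in Python's re module, used here as a stand-in for PCRE. The two_words pattern is an added illustration of escaping a literal space under (?x).

```python
import re

exclamation = re.compile(r"""(?x)
    \w+      # start with word characters
    [?!]     # and end with ? or !
    """)

inline_comment = re.compile(r"\d+(?# match some digits)")

# Under (?x) a literal space has to be escaped to take part in the match
two_words = re.compile(r"(?x) \w+ \  \w+")
```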

Regex Toolkit

regex toolkit

In your day-to-day development, you will frequently find yourself running into situations calling for regular expressions
It is useful to have a toolkit from which you can quickly draw the solution
It is also important to know how to avoid problems in the regexes themselves

matching vs. validation

In matching (extraction) the regex must account for boundary conditions
In validation your boundary conditions are known: the whole string
Matching an English word starting with a capital letter
\b[A-Z][a-zA-Z-]*\b
Validating that a string fulfills the same condition
^[A-Z][a-zA-Z-]*$
Do not forget ^ and $ anchors for validation!
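
The same word pattern used both ways, sketched in Python (its re module stands in for PCRE). The names WORD and is_capitalized_word and the sample sentence are made up for the example.

```python
import re

WORD = r"[A-Z][a-zA-Z-]*"

# Matching: word boundaries delimit the word inside a longer text
found = re.findall(r"\b" + WORD + r"\b", "we met Anne-Marie in Paris")

# Validation: anchors force the pattern to describe the whole string
def is_capitalized_word(s):
    return re.search(r"^" + WORD + r"$", s) is not None
```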

using dot properly

One of the most used operators
One of the most misused
Remember - dot is a shortcut for [^\n]
May match more than you really want
<.> will match <b> but also <!>, < >, etc
Be explicit about what you want
<[a-z]> is better
When dot is combined with quantifiers it becomes greedy
<.+> will consume any characters between the first bracket in the line and the last one
Including any other brackets!
It's better to use a negated character class instead
<[^>]+> if bracketed expression spans lines
<[^>\r\n]+> otherwise
Lazy quantifiers can be used, but they are not as efficient, due to backtracking

optimizing unlimited repeats

One of the most common problems is combining an inner repetition with an outer one
If the initial match fails, the number of ways to split the string between the quantifiers grows exponentially
The problem gets worse when the inner regex contains a dot, because it can match anything!
(regex1|regex2|..)* (regex*)+ (regex+)* (.*?bar)*
PCRE has an optimization that helps in certain cases, and also has a hardcoded limit for the backtracking
The best way to solve this is to prevent unnecessary backtracking in the first place via atomic grouping or possessive quantifiers
Consider the expression that is supposed to match a sequence of words or spaces inside a quoted string
["'](\w+|\s{1,2})*["']
When applied to the string "aaaaaaaaaa" (with final quote), it matches quickly
When applied to the string "aaaaaaaaaa (no final quote), it runs 35 times slower!
We can prevent backtracking from going back to the matched portion by adding a possessive quantifier:
["'](\w+|\s{1,2})*+["']
With nested unlimited repeats, you should lock up as much of the string as possible right away

extracting markup

Possible to use preg_match_all() for grabbing marked up portions
But for tokenizing approach, preg_split() is better

    $s = 'a <b><I>test</I></b> of <br /> markup';
    $tokens = preg_split(
        '!( < /? [a-zA-Z][a-zA-Z0-9]* [^/>]* /? > ) !x',
        $s, -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

result is array('a ', '<b>', '<I>', 'test', '</I>', '</b>', ' of ', '<br />', ' markup')

restricting markup

Suppose you want to strip all markup except for some allowed subset. What are your possible approaches?
Use strip_tags() - which has limited functionality
Multiple invocations of str_replace() or preg_replace() to remove script blocks, etc
Custom tokenizer and processor, or..

    $s = preg_replace_callback(
        '! < (/?) ([a-zA-Z][a-zA-Z0-9]*) ([^/>]*) (/?) > !x',
        'my_strip', $s);

    function my_strip($match) {
        static $allowed_tags = array('b', 'i', 'p', 'br', 'a');
        $tag = $match[2];
        $attrs = $match[3];
        // drop any tag that is not on the allowed list
        if (!in_array($tag, $allowed_tags)) return '';
        // closing tags are emitted without attributes
        if (!empty($match[1])) return "</$tag>";
        /* strip evil attributes here */
        if ($tag == 'a') { $attrs = ''; }
        /* any other kind of processing here */
        return "<$tag$attrs$match[4]>";
    }

matching numbers

Integers are easy: \b\d+\b
Floating point numbers:
integer.fractional
.fractional
Can be covered by (\b\d+|\B)\.\d+\b
To match both integers and floating point numbers, either combine them with alternation or use: ((\b\d+)?\.)?\b\d+\b
[+-]? can be prepended to any of these, if sign matching is needed
\b can be substituted by more appropriate assertions based on the required delimiters
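
The number patterns can be exercised as below. Python's re module stands in for PCRE; the helper find_all and the sample sentences are made up for this sketch.

```python
import re

FLOAT = r"(\b\d+|\B)\.\d+\b"
NUMBER = r"((\b\d+)?\.)?\b\d+\b"

def find_all(pattern, text):
    # finditer + group(0) returns whole matches even though the pattern has groups
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "pi is 3.14, half is .5, answer 42"
floats = find_all(FLOAT, text)
numbers = find_all(NUMBER, text)
signed = find_all(r"[+-]?" + NUMBER, "from -3.5 to +7")
```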

matching quoted strings

A simple case is a string that does not contain escaped quotes inside it
Matching a quoted string that spans lines: "[^"]*"
Matching a quoted string that does not span lines: "[^"\r\n]*"
Matching a string with escaped quotes inside:
"([^"]+|(?<=\\\\)")*+"

"             opening quote
(             a component that is
[^"]+         a segment without any quotes
|             or
(?<=\\\\)"    a quote preceded by a backslash
)*+           component repeated zero or more times without backtracking
"             closing quote
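
A Python sketch of the escaped-quote pattern, with the re module as a stand-in for PCRE. Possessive quantifiers arrived in Python only with version 3.11, so the no-backtracking repetition is emulated here with a lookahead and a backreference; that substitution, and the sample subjects, are assumptions of this example.

```python
import re

# The lookahead captures the body once (segments without quotes, or quotes
# preceded by a backslash) and \1 consumes it without backtracking into it
QUOTED = re.compile(r'"(?=((?:[^"]+|(?<=\\)")*))\1"')

ok = QUOTED.search(r'say "a \"quoted\" word" now')
unterminated = QUOTED.search(r'say "a \"quoted word')
```

Without the emulated possessive step, the unterminated subject would still match, because the engine could hand the escaped quote back and use it as the closing quote.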

matching e-mail addresses

Yeah, right
The complete regex is about as long as a book page in 10-point type
Buy a copy of Jeffrey Friedl's book and steal it from there

matching phone numbers

Assuming we want to match US/Canada-style phone numbers
800-555-1212        1-800-555-1212
800.555.1212        1.800.555.1212
(800) 555-1212      1 (800) 555-1212
How would we do it?

The simplistic approach could be:
(1[ .-])? \(? \d{3} \)? [ .-] \d{3} [.-] \d{4}
But this would result in a lot of false positives:
1.(800)-555 1212
1-800 555-1212
800).555-1212
(800 555-1212

^(?:                              anchor to the start of the string
  (?:1([.-]))?                    may have 1. or 1- (remember the separator)
  \d{3}                           three digits
  ( (?(1) \1 | [.-] ) )           if we had a separator match the same (and remember), otherwise match . or - as a separator (and remember)
  \d{3}                           another three digits
  \2                              same separator as before
  \d{4}                           final four digits
  |                               or
  1[ ]?\(\d{3}\)[ ]\d{3}-\d{4}    just match the third format
)$                                anchor to the end of the string
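
The final pattern can be tried in Python, whose re module (a stand-in for PCRE) supports both verbose mode and the (?(1)...) conditional. One adjustment is an assumption of this sketch: as written above, the third branch requires a leading 1, so it is made optional here to accept (800) 555-1212 as well. The names PHONE and is_phone are hypothetical.

```python
import re

PHONE = re.compile(r"""
    ^(?:
        (?:1([.-]))?            # may have 1. or 1- (remember the separator)
        \d{3}                   # three digits
        ((?(1)\1|[.-]))         # same separator if we had one, else . or -
        \d{3}                   # another three digits
        \2                      # same separator as before
        \d{4}                   # final four digits
      |
        (?:1[ ])?\(\d{3}\)[ ]\d{3}-\d{4}   # third format, leading 1 optional here
    )$
    """, re.VERBOSE)

def is_phone(s):
    return PHONE.search(s) is not None

valid = ["800-555-1212", "800.555.1212", "(800) 555-1212",
         "1-800-555-1212", "1.800.555.1212", "1 (800) 555-1212"]
invalid = ["1.(800)-555 1212", "1-800 555-1212", "800).555-1212",
           "(800 555-1212", "800-555.1212"]
```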

tips

Don't do everything in regex: a lot of tasks are best left to PHP
Use string functions for simple tasks
Make sure you know how backtracking works
Be aware of the context
Capture only what you intend to use
Don't use single-character classes
Lazy vs. greedy, be specific
Put the most likely alternative first in the alternation list
Think!

regex tools

Rubular.com
Regex buddy
Komodo Regex tester (Rx toolkit)
Reggy (reggyapp.com)
RegExr (http://www.gskinner.com/RegExr/)
http://www.spaweditor.com/scripts/regex/index.php
http://regex.larsolavtorvik.com/

Thank You!
Questions?
http://zmievski.org/talks