Network Security - 4.2 Reg Ex Primer
Why Regex? Regex is useful in situations where you need to detect complicated patterns in text,
especially when those patterns can appear multiple times. Regular expressions are portable because
they can be used in almost every modern programming language (JavaScript, C#, Java, Ruby, Python,
PHP, etc.), although the exact syntax varies slightly from one language to another. There are regular
expressions for finding email addresses, URLs, HTML tags, or just about any other kind of pattern
that can be expressed in text.
How can we use Regex? Usually, we ask Regex to tell us if it finds a certain pattern in text, and if so,
where. Regex can also tell us where the next place is that it sees the pattern. So if we take the
statement, “there is a frog on a log” and ask Regex where the word “frog” is located, then it will tell us
“position 11.” This is the numbered position in the sentence where the word frog begins:
0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
t  h  e  r  e     i  s     a     f  r  o  g     o  n     a     l  o  g
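To try this yourself, here is a minimal sketch. It assumes Python and its built-in re module, which is our choice for illustration (the primer does not depend on any particular language). Positions are counted from 0, just like in the numbered row above.

```python
import re

sentence = "there is a frog on a log"

# re.search returns a match object for the first place the pattern
# appears, or None if it does not appear at all.
match = re.search(r"frog", sentence)
frog_position = match.start() if match else -1

print(frog_position)           # 11
print(sentence.find("frog"))   # 11, a plain string search agrees
```

For a literal word like this, an ordinary string search does the same job. Regex starts to pay off once the pattern is more than a fixed word.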
The expression in this case is literally “frog” but we can ask Regex to do more complicated things for
us. For example, we can ask Regex to find any word that ends in “og” and tell us where it is. In order to
say this to Regex, we break it down as follows: we want a word that has any amount of letters (we don't
really care what they are) followed by “og”. In Regex, the term “\w” signifies a “word character” (meaning
any letter, digit, or underscore). The “\” tells Regex “this is a special character” and the “w” says “this is a
word character (a-z, A-Z, 0-9, or _).” We can follow the “\w” with an “*” meaning “zero to many” or a “+” meaning
“one to many.” So to make an expression that says “one to many letters followed by 'og'” we construct
the following regular expression: \w+og
Now, when we ask Regex to tell us all of the places where it sees the pattern “\w+og”, it tells us
“positions 11 and 21.”
0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
t  h  e  r  e     i  s     a     f  r  o  g     o  n     a     l  o  g
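Asking for every place the pattern appears can be sketched like this, again assuming Python's re module as the illustration language. The variable names are ours.

```python
import re

sentence = "there is a frog on a log"

# re.finditer yields one match object per non-overlapping match,
# scanning the text from left to right.
og_matches = [(m.start(), m.group()) for m in re.finditer(r"\w+og", sentence)]

print(og_matches)   # [(11, 'frog'), (21, 'log')]
```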
This is the tip of the iceberg when it comes to the power of regular expressions to perform pattern
matching on text. Here, we tell Regex that we want one to many word characters. What if we want
exactly one? Just omit the “+” so that we have the following expression: \wog
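A small sketch of the difference between “*”, “+”, and no operator at all, assuming Python's re module. One thing worth noticing: a pattern does not have to line up with whole words, so “\wog” happily matches the “rog” inside “frog”.

```python
import re

sentence = "there is a frog on a log"

# Exactly one word character followed by "og".
one_char = [(m.start(), m.group()) for m in re.finditer(r"\wog", sentence)]
print(one_char)                       # [(12, 'rog'), (21, 'log')]

# "*" allows zero word characters, so a bare "og" also matches.
print(re.findall(r"\w*og", "og dog"))   # ['og', 'dog']

# "+" needs at least one word character in front of "og".
print(re.findall(r"\w+og", "og dog"))   # ['dog']
```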
But what if we wanted exactly thirteen characters followed by “og”? This situation probably does not
come along often, but you never know. We could write “\w\w\w\w\w\w\w\w\w\w\w\w\wog” but that is
getting a bit ridiculous. Apparently, the creators of regular expressions thought so too, so they created a
special notation that you can use to tell Regex exactly how many times you want it to do something.
Let's say we want to find all words in some text starting with an “s” and ending with an “n” with
exactly 7 letters in between. Here is how we write it: s\w{7}n
This says “s” followed by exactly 7 word characters followed by “n.” Pretty cool, huh? But, what if we
wanted, say, anywhere between 3 and 7 characters between? We just need to modify our last expression
a bit to accommodate this: s\w{3,7}n
When we wrote our expression “s\w{7}n” we were actually telling Regex “s\w{7,7}n” but we can just
put “{7}” for short.
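Here is a hedged sketch of the curly-brace notation, assuming Python's re module. The word list is made up for the example, and we use re.fullmatch (our choice, not something the primer requires) so that each word has to match the pattern from its first letter to its last.

```python
import re

words = ["sun", "season", "solution", "situation", "sophistication"]

# "s", exactly 7 word characters, then "n".
exactly_seven = [w for w in words if re.fullmatch(r"s\w{7}n", w)]
print(exactly_seven)     # ['situation']

# "s", between 3 and 7 word characters, then "n".
three_to_seven = [w for w in words if re.fullmatch(r"s\w{3,7}n", w)]
print(three_to_seven)    # ['season', 'solution', 'situation']

# {7} is shorthand for {7,7}.
same_as_short = [w for w in words if re.fullmatch(r"s\w{7,7}n", w)]
print(same_as_short == exactly_seven)   # True

# The thirteen-character case: the long form and \w{13} agree.
long_form = r"\w" * 13 + "og"
short_form = r"\w{13}og"
sample = "abcdefghijklm" + "og"   # 13 letters followed by "og"
print(bool(re.fullmatch(long_form, sample)),
      bool(re.fullmatch(short_form, sample)))   # True True
```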
Now on to escaped characters. What are escape characters? And why were they escaping in the first
place? Worry not, because we actually want some of our characters to escape. In regular expressions,
we use the backslash, \, to “escape” certain characters from expressions. This means that when a “w” or
an “s” follows a backslash, it carries special meaning of some kind for Regex. So far we have only
used the “\w” in our expressions to detect word characters. Here is a larger list of commonly used
escaped characters:
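Character   Meaning
\w          Word character (letter, digit, or underscore)
\W          Non-word character
\d          Numeric character (digit)
\D          Non-numeric character
\s          Whitespace character
.           ANY character (special without a backslash)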
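To see what each of these actually picks out, here is a small sketch assuming Python's re module. The sample string is made up. We also include “\d”, the lowercase counterpart of “\D”, which matches a single digit.

```python
import re

sample = "Room 42_B!"

# \w: letters, digits, and the underscore.
word_chars = "".join(re.findall(r"\w", sample))
print(word_chars)       # Room42_B

# \W: everything that \w does not match.
non_word = re.findall(r"\W", sample)
print(non_word)         # [' ', '!']

# \d: digits only; \D: everything except digits.
digits = re.findall(r"\d", sample)
non_digits = "".join(re.findall(r"\D", sample))
print(digits)           # ['4', '2']
print(non_digits)       # Room _B!

# . matches any character, so here it matches all ten of them.
any_count = len(re.findall(r".", sample))
print(any_count)        # 10
```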
So now, with our newly acquired expert knowledge of Regex escaped characters, we can track down
things like phone numbers and email addresses. There is no escape! Well, you might think that anyway.
Before we move on, we have to point out something here: the “.” character. Alone, in an expression, it
has special meaning: any character. If we were to say “s..p” then it would match with ship, soap, slip,
etc. But, if we say “s\.\.p” then literally “s..p” is what will be matched. This looks backwards
from everything else, but the rule is consistent: the backslash switches a character between its literal and
its special meaning, and “.” simply starts out special. It is worth knowing when you make your expressions. A
“.” means any character, and a “\.” means “.” literally. Now, moving on. Let's take a good example of
something that we might search for in text: a phone number. Phone numbers can take a variety of
forms:
(555)555-5555
(555) 555-5555
555-555-5555
1-555-555-5555
555.555.5555
Let's try to match the “555-555-5555” example. There are 3 digits, followed by a dash, followed by
three more digits, followed by a dash, followed by four more digits. So, three digits, \d{3}, followed by
a dash, \d{3}-, followed by three more digits, \d{3}-\d{3}, and another dash, \d{3}-\d{3}-, followed by
four more digits, \d{3}-\d{3}-\d{4}, and there you have it: \d{3}-\d{3}-\d{4}
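A quick sketch of this pattern at work, assuming Python's re module and a made-up sentence. It also shows why the “.” rule matters for the “555.555.5555” form: an unescaped dot accepts any separator, while “\.” accepts only a real dot.

```python
import re

text = "Call 555-555-5555 or 555.555.5555 today."

# Digits separated by dashes.
dashed = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(dashed)            # ['555-555-5555']

# Digits separated by literal dots (note the backslashes).
dotted_literal = re.findall(r"\d{3}\.\d{3}\.\d{4}", text)
print(dotted_literal)    # ['555.555.5555']

# Without the backslashes, "." means ANY character, so both forms match.
dotted_any = re.findall(r"\d{3}.\d{3}.\d{4}", text)
print(dotted_any)        # ['555-555-5555', '555.555.5555']
```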
This pattern will match phone numbers with the pattern “XXX-XXX-XXXX”. But if you look closely,
we are actually repeating information again (like in the “\w\w\w\w...” example). Right at the beginning
of the pattern, you see “\d{3}-” twice. Is there a way to compact this any more? There is, with logical
grouping. In Regex you can logically group different parts of a pattern together with parentheses, and
then treat the group as an individual element. So we could rewrite the expression, \d{3}-\d{3}-, as the
expression, (\d{3}-){2}, and that would make our final expression as follows: (\d{3}-){2}\d{4}
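The two spellings can be checked against each other with a short sketch, assuming Python's re module. The candidate strings are made up, and re.fullmatch is our choice so that the whole string must fit the pattern.

```python
import re

long_form = r"\d{3}-\d{3}-\d{4}"
grouped = r"(\d{3}-){2}\d{4}"

candidates = ["555-555-5555", "555-5555", "555-555-555-5555", "5555555555"]

long_results = [bool(re.fullmatch(long_form, c)) for c in candidates]
grouped_results = [bool(re.fullmatch(grouped, c)) for c in candidates]

print(long_results)      # [True, False, False, False]
print(grouped_results)   # [True, False, False, False]

# Python-specific caveat: with parentheses in the pattern, re.findall
# returns what the group captured, not the whole match. Use group(0)
# on a match object to get the full phone number.
m = re.search(grouped, "Call 555-555-5555 today")
print(m.group(0))        # 555-555-5555
```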
So parentheses, curly braces (the “{” and “}” characters) and backslashes have special meaning in these
patterns: curly braces indicate multiplicity, parentheses logically group things, and backslashes indicate
that the next character after it should be treated specially. This is all fine and dandy, until we want to
find curly braces, parentheses, or backslashes in text. Imagine a situation where we wanted to find, say,
a phone number like “(555) 555-5555.” If you tried to use the pattern, (\d{3})\s\d{3}-\d{4}, where the
first three digits are surrounded by parentheses, then this would actually match “XXX XXX-XXXX”
because Regex thinks that you are trying to logically group the three digits. In order to disillusion our
faithful search algorithm, we need to escape our parentheses. Run away! You escape parentheses just
like anything else: with a backslash. So in order to get this expression right, we have to write it like
this: \(\d{3}\)\s\d{3}-\d{4}
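The difference between the grouping parentheses and the escaped ones can be sketched as follows, assuming Python's re module. The variable names are ours.

```python
import re

unescaped = r"(\d{3})\s\d{3}-\d{4}"    # parentheses group the digits
escaped = r"\(\d{3}\)\s\d{3}-\d{4}"    # parentheses are literal characters

with_parens = "(555) 555-5555"
without_parens = "555 555-5555"

# The unescaped version matches the number WITHOUT parentheses.
print(bool(re.fullmatch(unescaped, without_parens)))   # True
print(bool(re.fullmatch(unescaped, with_parens)))      # False

# The escaped version matches the number WITH parentheses.
print(bool(re.fullmatch(escaped, with_parens)))        # True
print(bool(re.fullmatch(escaped, without_parens)))     # False
```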
Now that is what we meant by cryptic looking. If we had shown you something like this in the
beginning of this discussion, you probably would have run screaming in the opposite direction.
Fortunately for you, we're taking it one step at a time, which should hopefully make it easier to
understand. This new pattern is great, but it could be a little better. There is a difference between a
“(XXX) XXX-XXXX” phone number and a “(XXX)XXX-XXXX” number, although they look almost
exactly the same. One has a space after the parentheses, and one does not. It might not seem like much
of a difference, but to a computer the difference is significant. Is there a way to make a pattern that can
handle either scenario? Well, we could put a multiplicity constraint after the whitespace character
to make it zero to one, like this: \(\d{3}\)\s{0,1}\d{3}-\d{4}
That works, but there is a quicker way to say the same thing. Remember the “*” and “+” operators for
“zero to many” and “one to many”? There is also an operator for zero to one, the “?” character. So if we
rewrite our pattern as, \(\d{3}\)\s?\d{3}-\d{4}, then we have an equivalent expression that says,
“exactly three digits surrounded with parentheses followed by an optional space, followed by exactly
three digits, one dash, then exactly four digits.”
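As a final check, here is a small sketch (assuming Python's re module and a made-up list of numbers) showing that “\s?” and “\s{0,1}” accept exactly the same strings.

```python
import re

with_question_mark = r"\(\d{3}\)\s?\d{3}-\d{4}"
with_braces = r"\(\d{3}\)\s{0,1}\d{3}-\d{4}"

numbers = [
    "(555) 555-5555",    # one space: accepted
    "(555)555-5555",     # no space: accepted
    "(555)  555-5555",   # two spaces: rejected, only zero or one allowed
    "555-555-5555",      # no parentheses: rejected
]

question_results = [bool(re.fullmatch(with_question_mark, n)) for n in numbers]
brace_results = [bool(re.fullmatch(with_braces, n)) for n in numbers]

print(question_results)   # [True, True, False, False]
print(brace_results)      # [True, True, False, False]
```

Covering the other forms from the list earlier, such as “1-555-555-5555”, would need a different pattern. This one only handles the parenthesized area code.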
Hopefully, this has given you a basic understanding of what Regex can do. There are
entire books written on the subject, and countless programs and web sites all over the world use
regular expressions.
By Wolfgang Meyers