Lesson 1: Introducing Regular Expressions
In this lesson you’ll learn what regular expressions are and what they can do for you. Consider the following scenarios:
You are searching for a file containing the text car (regardless of case) but do not want to also locate car in the
middle of a word (for example, scar, carry, and incarcerate).
You are generating a Web page and need to display text retrieved from a database. Text may contain URLs, and you
want those URLs to be clickable in the generated page (so that instead of generating just text, you generate a valid
HTML <a href></a>).
You create an app with a form that prompts for user information including e-mail address. You need to verify that
specified addresses are formatted correctly (that they are syntactically valid).
You are editing source code and need to replace all occurrences of size with iSize, but only where size stands
alone as a whole word, not where it is part of another word.
You are displaying a list of all files in your computer file system and want to filter so that you locate only files
containing the text Application.
You are importing data into an application. The data is tab delimited and your application supports CSV format files
(one row per line, comma-delimited values, each possibly enclosed with quotes).
You need to search a file for some specific text, but only at a specific location (perhaps at the start of a line or at the
end of a sentence).
All these scenarios present unique programming challenges. And all of them can be solved in just about any language
that supports conditional processing and string manipulation. But how complex a task would the solution become?
You would need to loop through words or characters one at a time, perform all sorts of if statement tests, track lots of
flags so as to know what you had found and what you had not, check for whitespace and special characters, and more.
And you would need to do it all manually, over and over.
Or you could use regular expressions. Each of the preceding challenges can be solved using well-crafted statements—
highly concise strings containing text and special instructions—statements that may look like this:
\b[Cc][Aa][Rr]\b
Note
Don’t worry if the previous line does not make sense yet; it will shortly.
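To make the expression above a little more concrete, here is a minimal sketch of how it could be applied to the first scenario (finding car but not scar, carry, or incarcerate). It assumes Python and its standard re module, which is only one of many regex implementations, and the sample lines are invented for illustration.

```python
import re

# The pattern from the text: "car" in any mix of upper and lower case,
# with a word boundary (\b) required on both sides.
pattern = re.compile(r"\b[Cc][Aa][Rr]\b")

# Hypothetical sample lines, invented for this illustration.
samples = [
    "My car is red",
    "The CAR park is full",
    "A scar on his arm",
    "Please carry this",
    "They may incarcerate him",
]

# Keep only the lines in which the pattern is found somewhere.
matching = [line for line in samples if pattern.search(line)]
print(matching)
```

Only the first two lines are kept; in the other three, car appears inside a longer word, so at least one of the word boundaries is missing.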
It is worth noting that testing for equality (for example, does this user-specified e-mail address match this regular
expression) is a search operation. The entire user-provided string is being searched for a match (in contrast to a
substring search, which is what searches usually are).
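The difference between testing a whole string and searching for a substring can be sketched as follows. This assumes Python's re module, and the address pattern is deliberately simplistic; it is a hypothetical stand-in, not a real e-mail validator.

```python
import re

# Deliberately simplistic address shape, assumed only for illustration:
# characters other than "@" and whitespace, an "@", more such characters,
# a literal dot, and more such characters.
address = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_whole_match(text):
    """Return True only if the entire string fits the pattern."""
    return address.fullmatch(text) is not None

def contains_match(text):
    """Return True if the pattern matches anywhere inside the string."""
    return address.search(text) is not None

candidate = "write to ben@example.com today"   # hypothetical input
print(contains_match(candidate))               # substring search
print(is_whole_match(candidate))               # whole-string test
print(is_whole_match("ben@example.com"))       # whole-string test
```

The substring search succeeds on the sentence because an address appears inside it, while the whole-string test succeeds only when the input is an address and nothing else.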
RegEx Replaces
Regular expression searches are immensely powerful, very useful, and not that difficult to learn. As such, many of the
lessons and examples that you will run into are matches. However, the real power of regex is seen in replace
operations, such as in the earlier scenario in which you replace URLs with clickable URLs. For starters, this requires
that you be able to locate URLs within text (perhaps searching for strings that start with http:// or https:// and
end with a period, a comma, or whitespace). It then requires that you replace the found URL with two occurrences
of the matched string with embedded HTML, so that:
http://www.forta.com/
is replaced with
<a href="http://www.forta.com/">http://www.forta.com/</a>
Or perhaps the text being located is just an address, and not a fully qualified URL, like this:
www.forta.com
The Search and Replace option in most applications could not handle this type of replace operation, but this task is
incredibly easy using a regular expression.
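As a rough sketch of such a replace operation, the following uses Python's re.sub. The URL pattern is an assumption made for this illustration (it treats whitespace or a comma as the end of a URL and leaves a trailing period outside the match), and linkify is a hypothetical helper name. It handles only fully qualified URLs, and it does not escape HTML special characters.

```python
import re

# Assumption: a URL starts with http:// or https:// and runs up to the
# next whitespace or comma; its last character may not be a period, so
# a sentence-ending period stays outside the match.
url = re.compile(r"https?://[^\s,]+[^\s,.]")

def linkify(text):
    # \g<0> stands for the whole matched string, so it is written twice:
    # once as the href value and once as the visible link text.
    return url.sub(r'<a href="\g<0>">\g<0></a>', text)

print(linkify("Visit http://www.forta.com/ for more."))
```

The same matched text is reused twice in the replacement, which is exactly what a plain search-and-replace dialog usually cannot do.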
The regular expression language is not a full programming language. It is usually not even an actual program or utility
that you can install and use. More often than not, regular expressions are mini-languages built into other languages
or products. The good news is that just about any decent language or tool these days supports regular expressions.
The bad news is that the regular expression language itself is not going to look anything like the language or tool you
are using it with. The regular expression language is a language unto itself—and not the most intuitive or obvious
language at that.
Note
Regular expressions originate from research in the 1950s in the field of mathematics. Years later, the principles and
ideas derived from this early work made their way into the Unix world, appearing in the Perl language and in utilities such
as grep. For many years, regular expressions (used in the scenarios previously described) were the exclusive domain
of the Unix community, but this has changed, and now regular expressions are supported in a variety of forms on just
about every computing platform.
To put all this into perspective, the following are all valid regular expressions (and all will make sense shortly):
Ben
.
www\.forta\.com
[a-zA-Z0-9_.]*
<[Hh]1>.*</[Hh]1>
\r\n\r\n
\d{3,3}-\d{3,3}-\d{4,4}
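One quick way to see that these are all well-formed is to hand them to a regex engine. The sketch below assumes Python's re module; other implementations may accept a slightly different syntax. The two sample inputs at the end are invented for illustration.

```python
import re

# The expressions listed above, written as raw strings so that Python
# passes the backslashes through to the regex engine unchanged.
expressions = [
    r"Ben",
    r".",
    r"www\.forta\.com",
    r"[a-zA-Z0-9_.]*",
    r"<[Hh]1>.*</[Hh]1>",
    r"\r\n\r\n",
    r"\d{3,3}-\d{3,3}-\d{4,4}",
]

# re.compile raises re.error for an invalid pattern, so getting past
# this line means every expression was accepted.
compiled = [re.compile(expression) for expression in expressions]
print(len(compiled), "expressions compiled")

# Hypothetical sample inputs, invented for this illustration.
print(bool(compiled[2].search("See www.forta.com for details")))
print(bool(compiled[6].fullmatch("555-123-4567")))
```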
It is important to note that syntax is the easiest part of mastering regular expressions. The real challenge is
learning how to apply that syntax: how to break a problem down into parts that regular expressions can solve. That
cannot be taught by simply reading a book; as with any language, mastery comes with practice.
How regular expressions are used and how regular expression functionality is exposed varies from one application to
the next. Some applications have menu options and dialog boxes used to access regular expressions, whereas
programming languages typically provide functions or classes or objects that expose regex functionality.
Furthermore, not all regular expression implementations are the same. There are often subtle (and sometimes not so
subtle) differences in syntax and features between implementations.
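As one example of how a programming language might expose regex functionality, Python's standard re module offers both module-level functions and compiled pattern objects. This is a sketch of that one implementation only; other languages and tools expose the same idea differently. The sample text is invented for illustration.

```python
import re

text = "My name is Ben"   # hypothetical sample input

# Style 1: a module-level function that takes the pattern as a string.
found_by_function = re.search(r"Ben", text)

# Style 2: a compiled pattern object whose methods run the search.
ben = re.compile(r"Ben")
found_by_object = ben.search(text)

# Both routes return a match object describing the same span of text.
print(found_by_function.span(), found_by_object.span())
```

Compiling first is mostly a convenience when the same expression is reused many times; the matching behavior is the same either way.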
Appendix A, “Regular Expressions in Popular Applications and Languages,” provides usage details and notes for
many of the applications and languages that support regular expressions. Before you proceed to the next lesson,
consult that appendix to learn the specifics pertaining to the application or language that you will be using.
To help you get started quickly, you’ll find links to online regular expression testing tools on this book’s Web page.
These online tools are often the simplest way to experiment with regular expressions.
SUMMARY
Regular expressions are one of the most powerful tools available for text manipulation. The regular expression
language is used to construct regular expressions (the actual constructed string is called a regular expression), and
regular expressions are used to perform both search and replace operations.