COMP3.RegEx
• A set of words
A Natural Language is also governed by the rules that define a relationship between a sentence
and concepts in the real world.
Formal Language
A formal language is defined using two components:
• The alphabet
A formal Language for describing the hardness of lead in pencils would be:
Alphabet is:
{H, B, 2, 3, 4}
This formal language has a finite set of 9 strings. Formal languages do not always have a finite
set of strings.
To describe the construction of these strings out of the alphabet, we use notations called
metalanguages.
Two metalanguages are Regular Expressions and Backus-Naur Form (BNF).
Regular Language
A Regular Language is any language that a Finite State Machine will accept.
a(a|b)*
would generate these strings {a, aa, ab, aab, abb, ...}, where | means OR and * represents an
iteration of 0 or more.
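As a quick check of which strings this expression generates, here is a minimal sketch using Python's built-in re module. The name in_language is a hypothetical helper chosen for illustration; it is not part of the original notes.

```python
import re

# The expression from the text, with | meaning OR and * meaning zero or more.
PATTERN = re.compile(r"a(a|b)*")

def in_language(s: str) -> bool:
    """Return True if the whole of s is generated by a(a|b)*."""
    # fullmatch requires the pattern to cover the entire string.
    return PATTERN.fullmatch(s) is not None

for candidate in ["a", "aa", "ab", "aab", "abb", "", "b", "ba"]:
    print(repr(candidate), in_language(candidate))
```

Every accepted string starts with a; the empty string and anything starting with b are rejected.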
A regular expression, a deterministic finite state automaton and a non-deterministic finite state
machine are equivalent ways of defining a regular language. To find a language that is not
regular, we must construct a language that requires an infinite number of states.
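To illustrate the equivalence, the sketch below hand-builds a small deterministic finite state machine that is assumed to accept the same language as a(a|b)*, and compares it with the regular expression on every string over {a, b} up to a short length. The state names and the function names are hypothetical, and agreement on short strings is a sanity check rather than a proof.

```python
import itertools
import re

# Transition table for a three-state machine: (state, symbol) -> next state.
TRANSITIONS = {
    ("start", "a"): "accept",
    ("start", "b"): "dead",
    ("accept", "a"): "accept",
    ("accept", "b"): "accept",
    ("dead", "a"): "dead",
    ("dead", "b"): "dead",
}

def fsm_accepts(s: str) -> bool:
    """Run the machine over s and report whether it ends in the accepting state."""
    state = "start"
    for symbol in s:
        # A symbol outside the alphabet {a, b} sends the machine to the dead state.
        state = TRANSITIONS.get((state, symbol), "dead")
    return state == "accept"

def agrees_with_regex(max_length: int = 5) -> bool:
    """Compare the machine with a(a|b)* on all strings over {a, b} up to max_length."""
    pattern = re.compile(r"a(a|b)*")
    for n in range(max_length + 1):
        for letters in itertools.product("ab", repeat=n):
            s = "".join(letters)
            if fsm_accepts(s) != (pattern.fullmatch(s) is not None):
                return False
    return True

print(agrees_with_regex())  # True
```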
Regular Expressions
A regular expression is a notation for defining all the valid strings of a formal language, or a
special text string for describing a search pattern.
One of the most basic and important computing tasks is the manipulation of text, e.g. word-
processing documents, DNA sequences, Pascal or VB.net programs.
Another important text-processing problem is pattern matching. This often involves searching for
one occurrence or all occurrences of some specific string in a large document.
Often there is a need to check whether or not a string follows a set pattern, such as an email
address.
Regular expressions offer a way to specify such patterns and also to solve the corresponding
pattern-matching problem.
Regular expressions are used to match/recognise strings by defining patterns. They are useful
whenever it is necessary to match an input to what is expected (i.e. to validate the input).
The smallest building block for regular expressions is expressed by a single letter. Operators are
applied to those letters to expand the language that they represent.
A single building block can be {a} or cat. The difference between these two is that the first
building block lives in a set whereas the second building block is a single word.
Regular Expressions Computing Unit 3
2. Concatenation is the process of appending one string to the end of another, for instance
joining c with at gives cat.
4. Repeating a specific sequence is another powerful tool and is represented by a *, which
means zero or more repeats of the element it follows.
E.g. cat* represents the following strings: ca, cat, catt, cattt, catttt, cattttt, ...
If grouping were used, such as (cat)*, the expression would represent the following strings:
the empty string (nothing), cat, catcat, catcatcat, catcatcatcat, etc.
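The difference that grouping makes can be checked with a short sketch in Python's re module. The helper name matches is hypothetical.

```python
import re

def matches(pattern: str, s: str) -> bool:
    """Return True if pattern describes the whole of s."""
    return re.fullmatch(pattern, s) is not None

# Without grouping, * applies only to the t.
print(matches(r"cat*", "ca"))       # True
print(matches(r"cat*", "cattt"))    # True
print(matches(r"cat*", "catcat"))   # False

# With grouping, * applies to the whole of cat.
print(matches(r"(cat)*", ""))       # True
print(matches(r"(cat)*", "catcat")) # True
print(matches(r"(cat)*", "ca"))     # False
```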
Metacharacters
There are 12 characters with special meanings:
Examples
s[ea]d?
• Must start with an “s”
• Followed by “e” or “a”
• Ending in nothing or a d
10*1
• Must start with a “1”
• Followed by 0 or more “0”
• Must end with “1”
10+1
• Must start with a “1”
• Followed by 1 or more “0”
• Must end with “1”
.an
• Start with a single character
• Followed by “an”
Strings defined include ban, can, dan, fan, ian, lan, man, pan, ran, tan, van
col(o|ou)r
• starts with “col”
• followed by “o” or “ou”
• followed by r
Mich(a?e|el)l
• starts with Mich
• followed by ae OR e OR el
• ending with l
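The examples above can be tried out with a minimal sketch in Python's re module. The sample strings and the names EXAMPLES and check are illustrative assumptions, not part of the original notes.

```python
import re

# Each pattern from the examples, with strings it should accept and reject.
EXAMPLES = {
    r"s[ea]d?": (["se", "sa", "sed", "sad"], ["s", "sid", "sedd"]),
    r"10*1": (["11", "101", "1001"], ["1", "10", "121"]),
    r"10+1": (["101", "1001"], ["11", "1"]),
    r".an": (["ban", "can", "van"], ["an", "bean"]),
    r"col(o|ou)r": (["color", "colour"], ["colr", "coloor"]),
    r"Mich(a?e|el)l": (["Michael", "Michel", "Michell"], ["Michal", "Mich"]),
}

def check(pattern: str, accepted: list, rejected: list) -> bool:
    """Return True if pattern matches every accepted string and no rejected one."""
    regex = re.compile(pattern)
    ok_accept = all(regex.fullmatch(s) for s in accepted)
    ok_reject = not any(regex.fullmatch(s) for s in rejected)
    return ok_accept and ok_reject

for pattern, (accepted, rejected) in EXAMPLES.items():
    print(pattern, check(pattern, accepted, rejected))
```

Note the contrast between 10*1 and 10+1: only the first accepts 11, because * allows zero repeats of the 0.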
[a-zA-Z\d\-\.]+@([a-zA-Z\d\-]+(\.[_a-zA-Z\d\-\.]+)+)
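The expression above appears to describe the shape of an e-mail address. The sketch below applies it unchanged using Python's re module; the function name looks_like_email and the sample addresses are hypothetical. It only tests the pattern given here, and real e-mail validation is considerably more involved.

```python
import re

# The pattern exactly as given in the text.
EMAIL = re.compile(r"[a-zA-Z\d\-\.]+@([a-zA-Z\d\-]+(\.[_a-zA-Z\d\-\.]+)+)")

def looks_like_email(s: str) -> bool:
    """Return True if the whole of s fits the pattern from the text."""
    return EMAIL.fullmatch(s) is not None

print(looks_like_email("jane.doe@example.com"))  # True
print(looks_like_email("user@localhost"))        # False: no dot after the @ part
print(looks_like_email("no-at-sign.example.com"))  # False: no @
print(looks_like_email("a b@example.com"))       # False: space is not allowed
```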
Regular Expression   Meaning

[a-z]                A character set matching any single character that exists in the set. If the
                     character is a lower-case alphabetic character, the regular expression matches.

[a-z0-9]             A character set matching any single character that exists in the set. If the
                     character is alphanumeric (either a letter, which in this instance would be
                     lower case, or a digit from 0 to 9), the regular expression will match.

[a-z]*               Iteration: repeat 0 or more times. Therefore this will match strings such as
                     aaaaa, abcahqonbfk, bbbafe and the empty string.

[a-z]+               The same as [a-z]* except that + means one or more rather than zero or more.
                     If the string consists of one or more characters from [a-z], the regular
                     expression matches the string.

[a-z]?               The ‘?’ means that [a-z] can occur zero or one time. This matches any single
                     lower-case alphabetic character, or matches nothing.

[a-f]                Matches the characters a, b, c, d, e or f. It might be useful for recognising
                     hexadecimal numbers, for example.

[0-3]                Matches the digits 0, 1, 2 and 3.

^abc                 ^ anchors the match to the start of the string. ^abc would match abc,
                     abcdefg, abc123, ...

abc$                 $ anchors the match to the end of the string. abc$ would match abc,
                     endsinabc, 123abc, ...

a[^t]                If the caret appears inside the [ ] it negates the character set. This will
                     search for any string that contains an a followed by a character that is NOT t.

[a-z&&[^p]]          The && notation means intersection and is equivalent to a logical AND. This
                     matches any lower-case letter excluding p. This notation is supported in
                     Java but is not available in every regex engine.

.+end$               The ‘.’ symbol matches any character, so .+ means that any character is
                     repeated 1 or more times. This must be followed by the string end at the end
                     of the string. Examples of this include: ‘Here was the end’.

start(.)*end         The full stop character (or dot, i.e. ‘.’) represents any character, so this
                     expression represents start, followed by any string of characters (or
                     nothing), followed by end.

\s                   Shorthand for a whitespace character, such as a space.

\S                   Matches any single non-whitespace character and is the equivalent of [^\s].

\d                   Shorthand for a digit, i.e. it is the same as [0-9].

\D                   The equivalent of [^\d], in other words any character that is NOT a digit.

\.\*                 Placing a backslash (\) in front of a special character means that the
                     expression matches the character itself instead of interpreting it. So \.
                     would match a ‘.’ and \* would match a ‘*’, so the expression to the left
                     would match ‘.*’.
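Several rows of the table can be tried out with a minimal sketch in Python's re module. The helper name found is hypothetical. The && notation is left out because Python's re module does not provide it.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

print(re.findall(r"\d", "Room 42B"))   # ['4', '2']
print(re.findall(r"\D", "4a2"))        # ['a']
print(found(r"^abc", "abcdefg"))       # True: abc is at the start
print(found(r"^abc", "xabc"))          # False
print(found(r"abc$", "123abc"))        # True: abc is at the end
print(found(r"a[^t]", "cat"))          # False: the only a is followed by t
print(found(r"a[^t]", "cab"))          # True
print(found(r".+end$", "Here was the end"))  # True

# fullmatch requires the pattern to describe the whole string.
print(re.fullmatch(r"\.\*", ".*") is not None)    # True: both characters are escaped
print(re.fullmatch(r"[a-z]*", "") is not None)    # True: zero repeats allowed
print(re.fullmatch(r"[a-z]+", "") is not None)    # False: at least one needed
```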
The most basic regular expression consists of a single literal character. It will match the first
occurrence of that character in the string. The regex can match a second literal character only if
the regex engine is instructed to start searching through the string after the first match. In a text
editor, you can do this by using the Find Next or Search Forward function.
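The same idea can be sketched in Python, where re.search returns only the first match and re.finditer keeps searching after each match, much like pressing Find Next repeatedly. The sample text is an assumption chosen for illustration.

```python
import re

text = "banana"

# search stops at the first occurrence of the literal character a.
first = re.search(r"a", text)
print(first.start())  # 1

# finditer resumes after each match, so it reports every occurrence.
positions = [m.start() for m in re.finditer(r"a", text)]
print(positions)  # [1, 3, 5]
```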
Often when editing text, you need to search for a word in a block of text. You can use a regular
expression (regex) to find a word, even if it is misspelled.
For example, if searching for the word separate (correct spelling) the regex sep[ae]r[ae]te would
find separate, seperate, seperete and separete.
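A minimal sketch of this search in Python's re module follows. The sample sentence and the name SPELLINGS are illustrative assumptions.

```python
import re

# Each [ae] allows either vowel, so correct and misspelled forms are both found.
SPELLINGS = re.compile(r"sep[ae]r[ae]te")

text = "Please seperate the files, then keep them separate."
print(SPELLINGS.findall(text))  # ['seperate', 'separate']
```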
The regex [A-Za-z][A-Za-z_0-9]* could be used to search for an identifier in a programming
language.
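As a sketch, the identifier pattern can be used to validate candidate names in Python. The function name is_identifier and the sample names are hypothetical; actual identifier rules differ between programming languages.

```python
import re

# A letter first, then any mix of letters, digits and underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z_0-9]*")

def is_identifier(s: str) -> bool:
    """Return True if the whole of s fits the identifier pattern from the text."""
    return IDENTIFIER.fullmatch(s) is not None

print(is_identifier("total_1"))  # True
print(is_identifier("x"))        # True
print(is_identifier("1st"))      # False: starts with a digit
print(is_identifier("_tmp"))     # False: this pattern requires a leading letter
```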
• validating data-entry fields (e-mail, dates, URLs, debit and credit card numbers).
Backus-Naur form is a notation for expressing the rules for constructing valid strings in a formal
language.
Defining the syntax of a formal language by means of regular expressions can be very tedious for
languages with large alphabets. There are also some types of language whose syntax cannot be
defined by a regular expression. Instead we express the rules of the language in a notation known
as Backus-Naur form.
Backus-Naur Form (BNF) is a notation which is used to express the syntax of a language.
The basic structure of a BNF statement consists of a meta-component (the thing being defined)
and a definition.
The meta-component is enclosed in angle brackets (‘<>’), and comes first in a statement. Next
comes the ‘::=’ symbol, which indicates that the following statements are the definition of the
meta-component.
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
The pipe symbol (‘|’) represents ‘OR’, meaning in this case that a digit can be one of 0, 1, 2, etc.
Entities can also be defined in terms of previously defined entities, or in terms of themselves
(recursion).
<integer> ::= <digit> | <integer><digit>
This statement defines an integer as either a single digit, or as an integer followed by a digit. This
allows strings of digits to be classed as a single entity, an integer.
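The recursive rule described here (an integer is a single digit, or an integer followed by a digit) can be mirrored almost line for line in code. This is a minimal sketch in Python; the function names is_digit and is_integer are hypothetical.

```python
DIGITS = "0123456789"

def is_digit(s: str) -> bool:
    # <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    return len(s) == 1 and s in DIGITS

def is_integer(s: str) -> bool:
    # <integer> ::= <digit> | <integer><digit>
    if is_digit(s):
        return True
    # Otherwise the last character must be a digit and everything
    # before it must itself be an integer (the recursive case).
    return len(s) > 1 and is_digit(s[-1]) and is_integer(s[:-1])

print(is_integer("7"))     # True
print(is_integer("2024"))  # True
print(is_integer("20a4"))  # False
print(is_integer(""))      # False
```

Each recursive call strips one digit from the end, so the recursion stops when a single character is left.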
Using the rules for pencil hardness (see page 141) we can construct the BNF for all valid strings
if the alphabet is:
{H, B, 2, 3, 4}
Recursive Definitions
If we wanted to define an expression to represent an unsigned integer, we would need a BNF
definition similar to
The solution is to use a recursive definition, in which the term being defined is also used in the
definition:
NOTE: Unlike regular expressions, BNF does not support iteration (looping) so recursion is
needed.
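The contrast between iteration and recursion can be sketched in Python by recognising strings of digits both ways. The recursive rule in the comment is an assumed form of the definition, and the function names and sample strings are hypothetical.

```python
import re

def digits_iterative(s: str) -> bool:
    """Regular expression style: + iterates over one or more digits."""
    return re.fullmatch(r"[0-9]+", s) is not None

def digits_recursive(s: str) -> bool:
    """BNF style, assuming the rule
    <unsigned integer> ::= <digit> | <digit><unsigned integer>
    """
    if len(s) == 0 or s[0] not in "0123456789":
        return False
    # Either a single digit, or a digit followed by another unsigned integer.
    return len(s) == 1 or digits_recursive(s[1:])

samples = ["0", "7", "2024", "", "12a", "a12", "-3"]
print(all(digits_iterative(s) == digits_recursive(s) for s in samples))  # True
```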
[Diagram: <sentence> broken down into <definite article> <noun> <verb> <preposition> <definite article> <noun>]