Regular Expressions Computing Unit 3

13. Regular Expressions


Natural Language
A natural language is whichever language we learned as a child. All natural languages have the
following features:

• A set of words

• A set of rules – the grammar or syntax of the language

A Natural Language is also governed by the rules that define a relationship between a sentence
and concepts in the real world.

Clever young students work hard

The peanut the monkey

Work students your hard clever

Formal Language
A formal language is defined using two components:

• The alphabet

• Its rules of syntax

A formal Language for describing the hardness of lead in pencils would be:

Alphabet is:

{H, B, 2, 3, 4}

The strings of the alphabet in this language are:

{“4B”, “3B”, “2B”, “B”, “HB”, “H”, “4H”, “3H”, “2H”}

This formal language has a finite set of 9 strings. Formal languages do not always have a finite
set of strings.

To describe the construction of these strings out of the alphabet, we use notations called
metalanguages.

A metalanguage is a language that describes a language.

Two metalanguages are Regular Expressions and Backus-Naur Form (BNF).

KPY - August 15 141



Regular Language

A regular language is any language that a finite state machine will accept.

For example, the language composed of the strings generated by

a(a|b)*

contains the strings {a, aa, ab, aab, abb, ...}, where | means OR and * represents an
iteration of 0 or more.

A regular expression, a deterministic finite state automaton and a non-deterministic finite state
machine are equivalent ways of defining a regular language. To find a language that is not
regular, we must construct a language that requires an infinite number of states.
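
The equivalence between a regular expression and a finite state machine can be checked on a small case. The sketch below hand-builds a deterministic machine for a(a|b)* and compares it with Python's re module on every string over {a, b} up to length 5. The state names (start, accept, dead) and function names are illustrative assumptions, not part of these notes.

```python
import re
from itertools import product

# Transition table for a deterministic finite state machine accepting a(a|b)*.
# "dead" is a trap state for strings that can never be accepted.
TRANSITIONS = {
    ("start", "a"): "accept",
    ("start", "b"): "dead",
    ("accept", "a"): "accept",
    ("accept", "b"): "accept",
    ("dead", "a"): "dead",
    ("dead", "b"): "dead",
}


def dfa_accepts(text):
    """Run the machine over text; accept if it finishes in the accept state."""
    state = "start"
    for symbol in text:
        # Any symbol outside the alphabet sends the machine to the trap state.
        state = TRANSITIONS.get((state, symbol), "dead")
    return state == "accept"


def regex_accepts(text):
    """Accept if the whole of text matches the regular expression."""
    return re.fullmatch(r"a(a|b)*", text) is not None


def all_strings(alphabet, max_length):
    """Yield every string over alphabet with length 0..max_length."""
    for length in range(max_length + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


agree = all(dfa_accepts(s) == regex_accepts(s) for s in all_strings("ab", 5))
print(agree)
```

Only three states are needed here, which is what makes the language regular.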

Regular Expressions

A regular expression is a notation for defining all the valid strings of a formal language, or a
special text string for describing a search pattern.

One of the most basic and important computing tasks is the manipulation of text, e.g. word-
processing documents, DNA sequences, Pascal or VB.net programs.

Another important text-processing problem is pattern matching. This often involves searching for
one occurrence or all occurrences of some specific string in a large document.

Often there is a need to check whether or not a string follows a set pattern, such as an email
address.

Regular expressions offer a way to specify such patterns and also to solve the corresponding
pattern-matching problem.

Regular expressions are used to match/recognise strings by defining patterns. They are useful
whenever it is necessary to match an input to what is expected (i.e. to validate the input).

The smallest building block for regular expressions is expressed by a single letter. Operators are
applied to those letters to expand the language that they represent.

A single building block can be {a} or cat. The difference between these two is that the first
building block lives in a set whereas the second building block is a single word.

The set {a,b,c} can represent a OR b OR c.


Regular expression operators


1. The OR/Union operator is denoted as |. The OR symbol is used to choose between two definitions of
regular expressions. The regular expression may be defined as a|cat, indicating that either the
character ‘a’ or the string ‘cat’ is acceptable.

2. Concatenation is the process of appending one string to the end of another, for instance
joining c with at gives cat.

3. Grouping expressions together is another useful operation.

E.g. col(or|our), which represents either colour or color.

4. Repeating a specific sequence is another powerful tool. It is represented by a *, which
means zero or more repeats of the preceding item.

E.g. cat* represents the following strings: ca, cat, catt, cattt, catttt, cattttt…

If grouping were used, as in (cat)*, the expression would represent the following strings:
the empty string, cat, catcat, catcatcat, catcatcatcat, etc.
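
The four operators above can be tried out directly. This is a minimal sketch using Python's re module; the helper name accepts is an assumption made for illustration, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re


def accepts(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# 1. Union: either 'a' or 'cat'.
print(accepts(r"a|cat", "a"), accepts(r"a|cat", "cat"))

# 2. Concatenation: 'c' followed by 'at'.
print(accepts(r"cat", "c" + "at"))

# 3. Grouping: the bracket limits what the | applies to.
print(accepts(r"col(or|our)", "color"), accepts(r"col(or|our)", "colour"))

# 4. Repetition: * applies to the single item before it...
print(accepts(r"cat*", "ca"), accepts(r"cat*", "cattt"))

# ...unless a group is used, in which case the whole group repeats.
print(accepts(r"(cat)*", ""), accepts(r"(cat)*", "catcat"))
```

Note that cat* does not accept catcat, and (cat)* does not accept catt; the group decides what is repeated.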

Metacharacters
There are 12 characters with special meanings:

Symbol   Meaning                                     Example

[]       Defines an explicit set                     [ab] defines either a or b
()       Defines a logical group of an expression
|        Alternative                                 (a|b) would define either a or b
*        Iteration of 0 or more times                a(a|b)* would define a, aa, aaa, ab, abb, aba, ...
?        Iteration of 0 or 1 times                   a(a|b)? would define a, aa, ab
+        Iteration of 1 or more times                a(a|b)+ would define aa, aaa, ab, abb, aab, ...
\        Use the following character as a function, not data
^        (caret) Start of string or, if inside [ ], NOT
$        End of string
.        (period or dot) Any single character

These special characters are often called metacharacters.


Examples
s[ea]d?
• Must start with a “s”
• Followed by “e” or “a”
• Ending in nothing or a d

Strings defined are: se, sa, sed, sad

10*1
• Must start with a “1”
• Followed by 0 or more “0”
• Must end with “1”

Strings defined include 11, 101, 1001, 10001

10+1
• Must start with a “1”
• Followed by 1 or more “0”
• Must end with “1”

Strings defined include 101, 1001, 10001

.an
• Start with a single character
• Followed by “an”

Strings defined include ban, can, dan, fan, ian, lan, man, pan, ran, tan, van

col(o|ou)r
• starts with “col”
• followed by “o” or “ou”
• followed by r

This can also be re-expressed in many forms such as


color | colour
col(or|our)

Mich(a?e|el)l
• starts with Mich
• followed by ae OR e OR el
• ending with l

Strings defined are: Michael, Michel, Michell
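
The worked examples above can be checked mechanically. The sketch below filters a list of candidate strings for each pattern; the candidate lists and the names EXAMPLES and accepted are assumptions chosen for illustration.

```python
import re

# Each pattern from the examples, with some strings that should and
# should not be accepted.
EXAMPLES = {
    r"s[ea]d?": ["se", "sa", "sed", "sad", "sd", "seed"],
    r"10*1": ["11", "101", "1001", "10", "1"],
    r"10+1": ["11", "101", "1001"],
    r".an": ["ban", "van", "an", "bean"],
    r"col(o|ou)r": ["color", "colour", "colr"],
    r"Mich(a?e|el)l": ["Michael", "Michel", "Michell", "Michal"],
}


def accepted(pattern, candidates):
    """Return the candidates that match pattern in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


for pattern, candidates in EXAMPLES.items():
    print(pattern, accepted(pattern, candidates))
```

Comparing the 10*1 and 10+1 rows shows the only difference between * and +: the string 11 is accepted by the first and rejected by the second.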


Predefined shorthand classes


Class   Description                                      Example

\d      Matches a single numeric digit                   a\d gives a0, a1, a2, a3, a4, a5, a6, a7, a8, a9
\w      Matches a single alphanumeric character          zx\w includes zxa, zxb, zxA, zxB, zx0, zx1
\W      Matches a single non-alphanumeric character.     f\W includes f”, f£, f$, f%, f^
        Shorthand for [^\w]
\D      Matches a single non-numeric character.
        Shorthand for [^\d]
\s      Matches a single whitespace character
        (such as a space)
\S      Matches any single non-whitespace character.
        Shorthand for [^\s]
\       Use the special symbol following the \           \- gives a hyphen
        as an ordinary character                         \. gives a .
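
A short sketch of the shorthand classes, using Python's re module as one possible engine. The helper name accepts is an assumption. One detail beyond the table: in Python (and most engines) \w also accepts the underscore, and \s accepts tabs and newlines as well as spaces.

```python
import re


def accepts(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


print(accepts(r"a\d", "a7"))     # a followed by one digit
print(accepts(r"zx\w", "zxB"))   # zx followed by one alphanumeric character
print(accepts(r"zx\w", "zx_"))   # the underscore also counts for \w
print(accepts(r"f\W", "f$"))     # f followed by one non-alphanumeric character
print(accepts(r"\D", "5"))       # a digit is rejected by \D

# \s picks out every whitespace character in a string.
whitespace = re.findall(r"\s", "a b\tc\n")
print(whitespace)

# A backslash turns a metacharacter back into an ordinary character.
print(accepts(r"\.", "."), accepts(r"\.", "x"))
```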

[a-zA-Z\d\-\.]+@([a-zA-Z\d\-]+(\.[_a-zA-Z\d\-\.]+)+)

Defines a simplified pattern for a valid email address:

[a-zA-Z\d\-\.]+          At least 1 letter, digit, hyphen or full stop
@                        A literal @ sign
[a-zA-Z\d\-]+            At least 1 letter, digit or hyphen
(\.[_a-zA-Z\d\-\.]+)+    One or more groups of a dot followed by at least 1 letter, digit,
                         underscore, hyphen or full stop
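
As a rough check, the email pattern can be applied to a few sample addresses. This is a sketch only: the sample addresses and the name looks_like_email are made up for illustration, and the pattern is the simplified one from these notes rather than a complete definition of what mail systems accept.

```python
import re

# The pattern exactly as given in the notes.
EMAIL = re.compile(r"[a-zA-Z\d\-\.]+@([a-zA-Z\d\-]+(\.[_a-zA-Z\d\-\.]+)+)")


def looks_like_email(text):
    """Return True when the whole of text fits the simplified pattern."""
    return EMAIL.fullmatch(text) is not None


for sample in [
    "user.name@example.com",       # fits the pattern
    "a-b@mail.example.co.uk",      # several dot groups are allowed
    "user@localhost",              # rejected: no dot after the @ part
    "no-at-sign.example.com",      # rejected: no @ at all
    "user@@example.com",           # rejected: @ is not in the second set
]:
    print(sample, looks_like_email(sample))
```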


Examples of Regular Expressions:

Regular Expression  Meaning

[a-z]               [a-z] is a character set matching any single character that exists in
                    the set. If the string contains a lower-case alphabetic character then
                    the regular expression matches.
[a-z0-9]            [a-z0-9] is a character set matching any single character that exists
                    in the set. If the character is alphanumeric (either a letter, which in
                    this instance would be lower case, or a digit from 0 to 9) the regular
                    expression will match.
[a-z]*              * means iteration: repeat 0 or more times. Therefore this will match
                    patterns such as aaaaa, abcahqonbfk, bbbafe and nothing.
[a-z]+              [a-z]+ is the same as [a-z]* apart from the fact that + means one or
                    more rather than zero or more. If the string contains [a-z] once or
                    more then the regular expression matches the string.
[a-z]?              The ‘?’ means that [a-z] can exist zero or one time. This matches any
                    single lower-case alphabetic character or matches nothing.
[a-f]               [a-f] matches the characters a, b, c, d, e or f. It might be useful for
                    recognising hexadecimal numbers, for example.
[0-3]               [0-3] matches the digits 0, 1, 2 and 3.
^abc                ^ indicates the start of a string search. ^abc would match abc,
                    abcdefg, abc123, ...
abc$                $ indicates the end of a string match. abc$ would match abc,
                    endsinabc, 123abc, ...
a[^t]               If the caret appears inside the [ ] then it negates the set. This will
                    search for any string that contains an a followed by a character that
                    is NOT t.
[a-z&&[^p]]         The && notation (supported by some regex engines, such as Java's) means
                    intersection and is equivalent to a logical AND. This matches any
                    lower-case letter excluding p. Similarly, [a-z&&[^l-p]] matches any
                    lower-case letter excluding l to p.
.+end$              The ‘.’ symbol matches any character. Therefore the plus means that any
                    character has to be repeated 1 or more times, followed by the string
                    end at the end of the string. Examples of this include:
                    ‘Here was the end’.
start(.)*end        The full stop character (or dot, i.e. ‘.’) represents any character, so
                    this expression represents any string of characters (or nothing)
                    starting with start and finishing with end.
\s                  \s is shorthand and represents a space.
\S                  \S matches any single non-space character and is the equivalent of
                    [^\s].
\d                  \d is shorthand and represents a digit, i.e. it is the same as [0-9].
\D                  \D is the equivalent of [^\d], in other words any character that is
                    NOT a digit.
\.\*                Placing a backslash (\) in front of a special character means that the
                    expression matches the character instead of interpreting it. So \.
                    would match a ‘.’ and \* would match a ‘*’, so the expression to the
                    left would match ‘.*’
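
The anchors, negated sets and escapes from the table can be tried with Python's re.search, which looks for the pattern anywhere in the text. This is a sketch; the helper name found is an assumption. Python's re module does not support the && notation, so excluding p from the lower-case letters is written here as the two ranges [a-oq-z] instead.

```python
import re


def found(pattern, text):
    """Return True when pattern occurs somewhere in text."""
    return re.search(pattern, text) is not None


# ^ ties the match to the start, $ to the end.
print(found(r"^abc", "abcdefg"), found(r"^abc", "xabc"))
print(found(r"abc$", "endsinabc"), found(r"abc$", "abcd"))

# A caret inside [ ] negates the set: an 'a' followed by anything but 't'.
print(found(r"a[^t]", "cab"), found(r"a[^t]", "cat"))

# .+ needs at least one character before 'end'.
print(found(r".+end$", "Here was the end"), found(r".+end$", "end"))

# (.)* allows nothing at all between 'start' and 'end'.
print(found(r"start(.)*end", "startend"))

# Escaped metacharacters match themselves.
print(found(r"\.\*", "a.*b"), found(r"\.\*", "a*b"))

# Lower-case letters excluding p, written without &&.
print(found(r"^[a-oq-z]$", "q"), found(r"^[a-oq-z]$", "p"))
```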

A notation such as a(a|b)* is called a regular expression, regex or pattern.



Searching using Regular Expressions


A regular expression is also a special text string for describing a search pattern. Regular
expressions define patterns of characters that, applied to a block of text, enable specific strings of
characters to be located within the text.

The most basic regular expression consists of a single literal character. It will match the first
occurrence of that character in the string. The regex can match a second literal character only if
the regex engine is instructed to start searching through the string after the first match. In a text
editor, you can do this by using the Find Next or Search Forward function.
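
The "first match, then Find Next" behaviour described above can be sketched with Python's re module. The sample text banana is an assumption; the second search is told to start just after the end of the first match.

```python
import re

pattern = re.compile(r"a")
text = "banana"

# The first search stops at the first occurrence.
first = pattern.search(text)
print(first.start())

# "Find Next": search again, starting after the previous match.
second = pattern.search(text, first.end())
print(second.start())

# finditer repeats this step until the text is exhausted.
all_positions = [match.start() for match in pattern.finditer(text)]
print(all_positions)
```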

Applications of regular expressions


Regular expressions are used extensively in operating systems for pattern matching in
commands and when performing a search for files or folders.

Often when editing text, you need to search for a word in a block of text. You can use a regular
expression (regex) to find a word, even if it is misspelled.

For example, if searching for the word separate (correct spelling), the regex sep[ae]r[ae]te would
find separate, seperate, seperete and separete.

The regex [A-Za-z][A-Za-z_0-9]* could be used to search for an identifier in a programming
language.
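
Both regexes above can be tried in a few lines of Python. The sample sentence and the names found and is_identifier are assumptions for illustration; real programming languages often allow more in an identifier (for example a leading underscore) than this pattern does.

```python
import re

# Find the word 'separate' even when the vowels are misspelled.
text = "Please seperate the lists, then separate them; seperete is wrong too."
found = re.findall(r"sep[ae]r[ae]te", text)
print(found)

# An identifier: a letter, then any number of letters, digits or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z_0-9]*")


def is_identifier(name):
    """Return True when the whole of name fits the identifier pattern."""
    return IDENTIFIER.fullmatch(name) is not None


print(is_identifier("total_1"), is_identifier("1total"))
```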

Other applications are


• scanning for virus signatures,

• search and replace in word processors,

• searching for information using Google,

• filtering text (spam, Net Nanny, Carnivore, malware, firewall traffic),

• validating data-entry fields (e-mail, dates, URLs, debit and credit card numbers).


Backus-Naur Form (BNF)

Backus-Naur form is a notation for expressing the rules for constructing valid strings in a formal
language.

Defining the syntax of a formal language by means of regular expressions can be very tedious for
languages with large alphabets. There are also some types of language whose syntax cannot be
defined by a regular expression at all. Instead we express the rules of the language in a notation
known as Backus-Naur form.

Backus-Naur Form (BNF) is a notation which is used to express the syntax of a language.

The basic structure of a BNF statement consists of a meta-component (the thing being defined)
and a definition.

The meta-component is enclosed in angle brackets (‘<>’), and comes first in a statement. Next
comes the ‘::=’ symbol, which indicates that the following statements are the definition of the
meta-component.

<meta-component> ::= definition

To define what is meant by a digit, the BNF would be

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The pipe symbol (‘|’) represents ‘OR’, meaning in this case that a digit can be one of 0, 1, 2, etc.
Entities can also be defined in terms of previously defined entities, or in terms of themselves
(recursion).

<integer> ::= <digit> | <integer> <digit>

This statement defines an integer as either a single digit, or as an integer followed by a digit. This
allows strings of digits to be classed as a single entity, an integer.
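
The two BNF rules translate almost word for word into a pair of functions. This is a sketch of one way to do it; the function names is_digit and is_integer are assumptions, and each function mirrors one rule.

```python
DIGITS = "0123456789"


def is_digit(text):
    """<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9"""
    return len(text) == 1 and text in DIGITS


def is_integer(text):
    """<integer> ::= <digit> | <integer> <digit>"""
    # First alternative: a single digit.
    if is_digit(text):
        return True
    # Second alternative: a shorter integer followed by one digit.
    return len(text) > 1 and is_integer(text[:-1]) and is_digit(text[-1])


print(is_integer("7"), is_integer("2024"), is_integer("20a4"), is_integer(""))
```

Each recursive call removes the last character, so the recursion always ends at the single-digit case.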

Using the rules for pencil hardness (see the Formal Language section) we can construct the BNF
for all valid strings if the alphabet is:

{H, B, 2, 3, 4}

<lead hardness> ::= HB | <scale of hardness> | <simple hardness>
<scale of hardness> ::= <numeric value><simple hardness>
<numeric value> ::= 2 | 3 | 4
<simple hardness> ::= H | B
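
Because this grammar has no recursion, every string it defines can be listed. The sketch below builds the strings rule by rule and compares them with the nine strings given for the pencil language; the function and variable names are assumptions that follow the BNF names.

```python
from itertools import product

NUMERIC_VALUE = ["2", "3", "4"]      # <numeric value> ::= 2 | 3 | 4
SIMPLE_HARDNESS = ["H", "B"]         # <simple hardness> ::= H | B


def scale_of_hardness():
    """<scale of hardness> ::= <numeric value><simple hardness>"""
    return [n + s for n, s in product(NUMERIC_VALUE, SIMPLE_HARDNESS)]


def lead_hardness():
    """<lead hardness> ::= HB | <scale of hardness> | <simple hardness>"""
    return ["HB"] + scale_of_hardness() + SIMPLE_HARDNESS


generated = set(lead_hardness())
expected = {"4B", "3B", "2B", "B", "HB", "H", "4H", "3H", "2H"}
print(sorted(generated))
print(generated == expected)
```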


Recursive Definitions
If we wanted to define an expression to represent an unsigned integer without recursion, we would
need an endless BNF definition similar to

<unsigned integer> ::= <digit> | <digit><digit> | <digit><digit><digit> | <digit><digit><digit><digit> | ……
<digit> ::= 0|1|2|3|4|5|6|7|8|9

The solution is to use a recursive definition, in which the term being defined is also used in the
definition:

<unsigned integer>::= <digit>|<digit><unsigned integer>

NOTE: Unlike regular expressions, BNF does not support iteration (looping) so recursion is
needed.
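
The contrast between recursion and iteration can be seen side by side. In this sketch the first function follows the recursive BNF rule, while the second uses the iteration operator + of a regular expression; the function names and sample strings are assumptions.

```python
import re

DIGITS = "0123456789"


def is_unsigned_integer(text):
    """<unsigned integer> ::= <digit> | <digit><unsigned integer>"""
    # Both alternatives begin with a digit.
    if not text or text[0] not in DIGITS:
        return False
    # Either that digit is the whole string, or the rest is itself
    # an unsigned integer (the recursive alternative).
    return len(text) == 1 or is_unsigned_integer(text[1:])


def is_unsigned_integer_regex(text):
    """The same language written with iteration: one or more digits."""
    return re.fullmatch(r"[0-9]+", text) is not None


samples = ["7", "42", "007", "", "4a2", "-3"]
agree = all(
    is_unsigned_integer(s) == is_unsigned_integer_regex(s) for s in samples
)
print(agree)
```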

Natural language and BNF


Natural language has a very complicated set of rules for constructing grammatically correct
sentences. Sentences are constructed from nouns, verbs, adjectives, prepositions, the definite
article, the indefinite article and adverbs.
In BNF a subset of the English language can be specified as follows:

<sentence> ::= <noun phrase><verb phrase><noun phrase>
<noun phrase> ::= <article><noun> | <preposition><article><noun>
<verb phrase> ::= <verb> | <adverb><verb>
<article> ::= The | the | a | an | A | An
<preposition> ::= in | on | at | of
<noun> ::= cat | mat | fire | front | mouse
<verb> ::= sat | lay | eat | caught | slept
<adverb> ::= slowly | quickly | languidly

The non-terminal symbol <sentence> is the starting symbol.

The cat sat on the mat


<sentence>
    <noun phrase>
        <article>        The
        <noun>           cat
    <verb phrase>
        <verb>           sat
    <noun phrase>
        <preposition>    on
        <article>        the
        <noun>           mat
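
The sentence grammar can be turned into a small recogniser, with one function per phrase rule. This is a sketch under two assumptions: words are separated by spaces, and the noun phrase uses the <article> rule (which lists both definite and indefinite articles). All names are illustrative.

```python
ARTICLES = {"The", "the", "a", "an", "A", "An"}
PREPOSITIONS = {"in", "on", "at", "of"}
NOUNS = {"cat", "mat", "fire", "front", "mouse"}
VERBS = {"sat", "lay", "eat", "caught", "slept"}
ADVERBS = {"slowly", "quickly", "languidly"}


def parse_noun_phrase(words, i):
    """Return the index after a noun phrase starting at i, or None."""
    # Optional preposition, then an article and a noun.
    if i < len(words) and words[i] in PREPOSITIONS:
        i += 1
    if i + 1 < len(words) and words[i] in ARTICLES and words[i + 1] in NOUNS:
        return i + 2
    return None


def parse_verb_phrase(words, i):
    """Return the index after a verb phrase starting at i, or None."""
    # Optional adverb, then a verb.
    if i < len(words) and words[i] in ADVERBS:
        i += 1
    if i < len(words) and words[i] in VERBS:
        return i + 1
    return None


def is_sentence(text):
    """<sentence> ::= <noun phrase><verb phrase><noun phrase>"""
    words = text.split()
    i = parse_noun_phrase(words, 0)
    if i is None:
        return False
    i = parse_verb_phrase(words, i)
    if i is None:
        return False
    i = parse_noun_phrase(words, i)
    # Every word must have been used up.
    return i == len(words)


print(is_sentence("The cat sat on the mat"))
print(is_sentence("Cat the sat mat on the"))
```

The recogniser checks syntax only, so a sentence such as "The mouse quickly slept the fire" is accepted even though it has no sensible meaning.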
