
RegEx 1

Regular expressions (RegEx) allow users to search for patterns in text. In Python, the re module provides functions for working with RegEx. A regular expression pattern is compiled into a Regex object which can then be used to search strings and return match objects. Match objects have methods like group() to extract the matched text. Groups in patterns matched with parentheses can be accessed individually or together to retrieve specific parts of the matched string. The pipe symbol | is used to match one of multiple expressions.

Uploaded by

Sam

Python

Regular Expressions
(RegEx)
“Some people, when
confronted with a problem,
think ‘I know, I'll use regular
expressions.’ Now they
have two problems.”

-- Jamie Zawinski
Regular Expressions

• A RegEx is a powerful tool for matching text against a pre-defined pattern.
• It can detect the presence or absence of a piece of text by matching it against a particular pattern.
• It can split a pattern into one or more sub-patterns.
• The Python standard library provides the re module for regular expressions.
Regular Expressions

• Regular expressions are a powerful string manipulation tool
• All modern languages have similar library packages for regular expressions
• Use regular expressions to:
  • Search a string (search and match)
  • Replace parts of a string (sub)
  • Break strings into smaller pieces (split)
Python’s Regular Expression Syntax

• Most characters match themselves
  The regular expression “test” matches the string ‘test’, and only that string
• [x] matches any one of a list of characters
  “[abc]” matches ‘a’, ‘b’, or ‘c’
• [^x] matches any one character that is not included in x
  “[^abc]” matches any single character except ‘a’, ‘b’, or ‘c’
Python’s Regular Expression Syntax

• “.” matches any single character (except a newline, by default)
• Parentheses can be used for grouping
  “(abc)+” matches ‘abc’, ‘abcabc’, ‘abcabcabc’, etc.
• x|y matches x or y
  “this|that” matches ‘this’ and ‘that’, but not ‘thisthat’.
Python’s Regular Expression Syntax

• x* matches zero or more x’s
  “a*” matches ‘’, ‘a’, ‘aa’, etc.
• x+ matches one or more x’s
  “a+” matches ‘a’, ‘aa’, ‘aaa’, etc.
• x? matches zero or one x
  “a?” matches ‘’ or ‘a’
• x{m,n} matches i x’s, where m ≤ i ≤ n (write it with no space after the comma)
  “a{2,3}” matches ‘aa’ or ‘aaa’
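The quantifiers above can be checked quickly with re.fullmatch, which succeeds only when the entire string matches the pattern. This is a small sketch; the helper name matches is made up for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# x* : zero or more
star = [matches(r"a*", s) for s in ("", "a", "aaa")]
# x+ : one or more, so the empty string fails
plus = [matches(r"a+", s) for s in ("", "a", "aaa")]
# x? : zero or one
optional = [matches(r"a?", s) for s in ("", "a", "aa")]
# x{m,n} : between m and n repetitions, inclusive
bounded = [matches(r"a{2,3}", s) for s in ("a", "aa", "aaa", "aaaa")]

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True, False]
print(bounded)   # [False, True, True, False]
```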
Regular Expression Syntax
• “\d” matches any digit; “\D” any non-digit
• “\s” matches any whitespace character; “\S” any non-whitespace character
• “\w” matches any alphanumeric character or underscore; “\W” any other character
• “^” matches the beginning of the string; “$” the end of the string
• “\b” matches a word boundary; “\B” matches a position that is not a word boundary
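As a rough illustration of the shorthand classes, anchors, and word boundaries listed above, the sketch below runs a few of them over one sample sentence. The sentence is invented for the example.

```python
import re

# Made-up sample text for illustration
text = "Call 555-1234 or email cat_lover99 at 9am"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits, and underscores

starts_with_call = re.search(r"^Call", text) is not None  # ^ anchors at the start
ends_with_am = re.search(r"am$", text) is not None        # $ anchors at the end

# "cat" is followed by "_", which counts as a word character,
# so there is no word boundary after it and \bcat\b finds nothing.
whole_word_cat = re.search(r"\bcat\b", text) is not None

print(digits)          # ['555', '1234', '99', '9']
print(words)           # ['Call', '555', '1234', 'or', 'email', 'cat_lover99', 'at', '9am']
print(whole_word_cat)  # False
```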
Search and Match
• The two basic functions are re.search and re.match
• Search looks for a pattern anywhere in a string
• Match looks for a match starting at the beginning of the string
• Both return None (logical false) if the pattern isn’t found and a “match object” instance if it is
>>> import re
>>> pat = "a*b"
>>> re.search(pat, "fooaaabcde")
<_sre.SRE_Match object at 0x809c0>
>>> re.match(pat, "fooaaabcde")
Q: What’s a match object?
• A: an instance of the match class with the details
of the match result
>>> r1 = re.search("a*b","fooaaabcde")
>>> r1.group() # group returns string matched
'aaab'
>>> r1.start() # index of the match start
3
>>> r1.end() # index of the match end
7
>>> r1.span() # tuple of (start, end)
(3, 7)
What got matched?
• Here’s a pattern to match simple email addresses
\w+@(\w+\.)+(com|org|net|edu)

>>> pat1 = r"\w+@(\w+\.)+(com|org|net|edu)"
>>> r1 = re.match(pat1, "finin@cs.umbc.edu")
>>> r1.group()
'finin@cs.umbc.edu'
• We might want to extract the pattern parts, like the email name and host
What got matched?
• We can put parentheses around groups we want to be able to reference
>>> pat2 = r"(\w+)@((\w+\.)+(com|org|net|edu))"
>>> r2 = re.match(pat2, "finin@cs.umbc.edu")
>>> r2.group(1)
'finin'
>>> r2.group(2)
'cs.umbc.edu'
>>> r2.groups()
('finin', 'cs.umbc.edu', 'umbc.', 'edu')
• Note that the ‘groups’ are numbered by the position of their opening parenthesis, left to right (a preorder traversal of the nested groups)
• A repeated group such as (\w+\.)+ reports only its last repetition, here 'umbc.'
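To make the numbering concrete, this sketch prints each numbered group of the nested pattern. The address is an assumption chosen to be consistent with the group values shown above.

```python
import re

# Group 1 is (\w+), group 2 is the whole host, group 3 is (\w+\.),
# and group 4 is (com|org|net|edu): numbered by opening parenthesis.
pattern = r"(\w+)@((\w+\.)+(com|org|net|edu))"
m = re.match(pattern, "finin@cs.umbc.edu")  # assumed sample address

# m.re.groups is the number of capturing groups in the pattern
numbered = {i: m.group(i) for i in range(1, m.re.groups + 1)}
for number, text in numbered.items():
    print(number, text)
# 1 finin
# 2 cs.umbc.edu
# 3 umbc.
# 4 edu
```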
What got matched?
• We can ‘label’ the groups as well…
>>> pat3 = r"(?P<name>\w+)@(?P<host>(\w+\.)+(com|org|net|edu))"
>>> r3 = re.match(pat3, "finin@cs.umbc.edu")
>>> r3.group('name')
'finin'
>>> r3.group('host')
'cs.umbc.edu'
• And reference the matching parts by the labels
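Named groups can also be collected in one step with the match object's groupdict() method. The sketch below assumes the same sample address as the slides' outputs and uses non-capturing (?:...) groups for the inner parts, which is a choice made here rather than something the slides do.

```python
import re

# (?:...) groups without capturing, so only the two named groups are numbered
pat = re.compile(r"(?P<name>\w+)@(?P<host>(?:\w+\.)+(?:com|org|net|edu))")
m = pat.match("finin@cs.umbc.edu")  # assumed sample address

parts = m.groupdict()   # dictionary of all named groups
print(parts)            # {'name': 'finin', 'host': 'cs.umbc.edu'}

# A named group can still be reached by its number
same = m.group(1) == m.group("name")
print(same)             # True
```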
More re functions
• re.split() is like split but can use patterns
>>> re.split(r"\W+", "This... is a test, short and sweet, of split().")
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
• re.sub() replaces every match of a pattern with a string
>>> re.sub('(blue|white|red)', 'black', 'blue socks and red shoes')
'black socks and black shoes'
• re.findall() finds all matches
>>> re.findall(r"\d+", "12 dogs,11 cats, 1 egg")
['12', '11', '1']
Compiling regular expressions
• If you plan to use a re pattern more than once, compile it to a pattern object
• Python produces a special data structure that speeds up matching
>>> cpat3 = re.compile(pat3)
>>> cpat3
<_sre.SRE_Pattern object at 0x2d9c0>
>>> r3 = cpat3.search("finin@cs.umbc.edu")
>>> r3
<_sre.SRE_Match object at 0x895a0>
>>> r3.group()
'finin@cs.umbc.edu'
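A typical reason to compile is to reuse one pattern across many strings. The following is a minimal sketch of that idea; the sample lines and the variable names are invented for illustration.

```python
import re

# Compile once, outside the loop
email = re.compile(r"\w+@(?:\w+\.)+(?:com|org|net|edu)")

# Made-up input lines
lines = [
    "write to finin@cs.umbc.edu",
    "no address here",
    "ask help@example.org today",
]

found = []
for line in lines:
    m = email.search(line)      # reuse the compiled pattern object
    if m is not None:           # search() returns None when nothing matches
        found.append(m.group())

print(found)  # ['finin@cs.umbc.edu', 'help@example.org']
```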
Pattern object methods
Pattern objects have methods that parallel the re functions (e.g., match, search, split, findall, sub), e.g.:
>>> p1 = re.compile(r"\w+@\w+\.(?:com|org|net|edu)")   # email address
>>> p1.match("[email protected]").group(0)
'[email protected]'
>>> p1.search("Email [email protected] today.").group(0)
'[email protected]'
>>> p1.findall("Email [email protected] and [email protected] now.")
['[email protected]', '[email protected]']
>>> p2 = re.compile(r"[.?!]+\s+")   # sentence boundary
>>> p2.split("Tired? Go to bed! Now!! ")
['Tired', 'Go to bed', 'Now', '']
Example: pig latin

• Rules
• If word starts with consonant(s)
— Move them to the end, append “ay”
• Else word starts with vowel(s)
— Keep as is, but add “zay”
The pattern

([bcdfghjklmnpqrstvwxyz]+)(\w+)
piglatin.py

import re

# Group 1 captures the leading consonants, group 2 the rest of the word
pat = r'([bcdfghjklmnpqrstvwxyz]+)(\w+)'
cpat = re.compile(pat)

def piglatin(string):
    """Returns the pig latin form of every word in a string."""
    return " ".join([piglatin1(w) for w in string.split()])

def piglatin1(word):
    """Returns the pig latin form of a word. e.g.:
    piglatin1("dog") => "ogday". """
    match = cpat.match(word)
    if match:
        consonants = match.group(1)
        rest = match.group(2)
        return rest + consonants + "ay"
    else:
        return word + "zay"
Python
Regular Expressions
(RegEx)
Pattern matching in Python
with RegEx
• Regular expressions, called regexes for short, are descriptions of a pattern of text.
• The following regex matches a string of three digits, a hyphen, three more digits, another hyphen, and four digits.
Eg: \d\d\d-\d\d\d-\d\d\d\d
• Adding a 3 in curly brackets ({3}) after a pattern is like saying, “Match this pattern three times.” So the slightly shorter regex below matches the same format:
\d{3}-\d{3}-\d{4}
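One way to convince yourself that the long and short forms behave the same is to run both over a few strings. This is only a sketch; the sample strings are made up.

```python
import re

long_form = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
short_form = re.compile(r"\d{3}-\d{3}-\d{4}")

# Made-up samples: two contain a well-formed number, two do not
samples = ["415-555-4242", "415-555-424", "4155554242", "call 415-555-4242 now"]

results = []
for s in samples:
    found_long = long_form.search(s) is not None
    found_short = short_form.search(s) is not None
    results.append((found_long, found_short))

print(results)
# [(True, True), (False, False), (False, False), (True, True)]
```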
Creating RegEx Object
• All the regex functions in Python are in the re module
import re
• To create a Regex object that matches the phone number pattern, enter the following into the interactive shell.
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
Now the phoneNumRegex variable contains a Regex object.
Steps of Regular Expression
Matching
Steps:
1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)
3. Pass the string you want to search into the Regex object’s search() method. This returns a Match object, or None if there is no match.
4. Call the Match object’s group() method to return a string of the actual matched text.
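The four steps can be sketched as one small function. The function name find_phone and the choice to return None when nothing is found are assumptions made for this example, not part of the slides.

```python
import re  # step 1: import the regex module

# Step 2: create a Regex object from a raw string
phone_regex = re.compile(r"\d{3}-\d{3}-\d{4}")

def find_phone(text):
    """Return the first phone number in text, or None (hypothetical helper)."""
    mo = phone_regex.search(text)   # step 3: search the string
    if mo is None:                  # search() gives None when there is no match
        return None
    return mo.group()               # step 4: the actual matched text

print(find_phone("My number is 415-555-4242."))  # 415-555-4242
print(find_phone("No number here."))             # None
```

Checking for None before calling group() avoids an AttributeError on strings that contain no match.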
Matching RegEx Objects
• A Regex object’s search() method searches the string it is passed for any matches to the regex.
• Match objects have a group() method that will return the actual matched text from the searched string.

import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Output:
Phone number found: 415-555-4242
Grouping with parentheses
• Suppose you want to separate the area code from the rest of the phone number.
• Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d).
• Use the group() match object method to grab the matching text from just one group.
import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print(mo.group(1))
Output:
415
Retrieve all the groups at once

• Use the groups() method (note the plural form of the name).
import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print(mo.groups())

Output:
('415', '555-4242')
Using mo.groups
• mo.groups() returns a tuple of multiple values.
• You can use the multiple-assignment trick to assign each value to a separate variable, as in the areaCode, mainNumber = mo.groups() line below.

import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
areaCode, mainNumber = mo.groups()
print(mainNumber)
Output:
555-4242
Match a parenthesis
• Parentheses have a special meaning in regular expressions, but what do you do if you need to match a parenthesis in your text?
• For instance, maybe the phone numbers you are trying to match have the area code set in parentheses.
• In this case, you need to escape the ( and ) characters with a backslash.
import re
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
print(mo.group(1))

Output:
(415)
Matching Multiple Groups with
the Pipe
• The | character is called a pipe.
• We can use it anywhere, to match one of many
expressions.
• Eg: the regular expression r’Batman|Tina Fey’
will match either ‘Batman’ or ‘Tina Fey’.
• When both Batman and Tina Fey occur in the
searched string, the first occurrence of
matching text will be returned as the Match
object.
Matching Multiple Groups with
the Pipe
import re
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
print(mo1.group())

Output:
Batman
Matching Specific Repetitions
with Curly Brackets
• If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets.
• For example, the regex (Ha){3} will match the string ‘HaHaHa’, but it will not match ‘HaHa’, since the latter has only two repeats of the (Ha) group.
• Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets, with no spaces.
• For example, the regex (Ha){3,5} will match ‘HaHaHa’, ‘HaHaHaHa’, and ‘HaHaHaHaHa’.
Matching Specific Repetitions
with Curly Brackets
• Leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded.
• For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances.
• Curly brackets can help make your regular expressions shorter.
• These two regular expressions match identical patterns:
(Ha){3}
(Ha)(Ha)(Ha)
• And so do these two:
(Ha){3,5}
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))
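The range form can be tried out as below. The sketch also shows why the braces must not contain a space: Python's re module then treats them as ordinary characters instead of a repetition count. The test strings are made up.

```python
import re

ha_range = re.compile(r"(Ha){3,5}")

too_few = ha_range.search("HaHa") is None              # only two repeats
longest = ha_range.search("HaHaHaHaHaHaHa").group()    # stops after five repeats

# With a space after the comma the braces are matched literally
spaced = re.compile(r"(Ha){3, 5}")
spaced_misses = spaced.search("HaHaHa") is None
spaced_literal = spaced.search("Ha{3, 5}").group()

print(too_few)         # True
print(longest)         # HaHaHaHaHa
print(spaced_misses)   # True
print(spaced_literal)  # Ha{3, 5}
```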
Matching Specific Repetitions
with Curly Brackets
import re
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
print(mo1.group())
Output:
HaHaHa
import re
haRegex = re.compile(r'(Ha){3}')
mo2 = haRegex.search('Ha') is None
print(mo2)
Output:
True
• Here, (Ha){3} matches ‘HaHaHa’ but not ‘Ha’. Since it doesn’t match ‘Ha’, search() returns None.
Optional Matching with the
Question Mark
• Sometimes there is a pattern that you want to match
only optionally.
• That is, the regex should find a match whether or not
that bit of text is there.
• The ? character flags the group that precedes it as an
optional part of the pattern.
import re
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
Output:
Batman
Optional Matching with the
Question Mark
import re
batRegex = re.compile(r'Bat(wo)?man')
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
Output:
Batwoman
• The (wo)? part of the regular expression means that the pattern wo is an optional group.
• The regex will match text that has zero instances or one instance of wo in it. This is why the regex matches both ‘Batwoman’ and ‘Batman’.
• You can think of the ? as saying, “Match zero or one of the group preceding this question mark.”
• If you need to match an actual question mark character, escape it with \?.
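The difference between ? as a quantifier and \? as a literal character can be seen in a short sketch. The sample strings are invented, and re.escape is shown as one standard-library way to escape a whole literal string rather than something the slides cover.

```python
import re

optional_s = re.compile(r"dogs?")    # ? makes the preceding s optional
literal_q = re.compile(r"dogs\?")    # \? matches a real question mark

singular = optional_s.search("one dog").group()
no_question = literal_q.search("one dog") is None
question = literal_q.search("any dogs?").group()

# re.escape() adds the backslashes for you when the text is meant literally
escaped = re.compile(re.escape("Really?"))
literal_text = escaped.search("Really? Yes.").group()

print(singular)      # dog
print(no_question)   # True
print(question)      # dogs?
print(literal_text)  # Really?
```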
Matching Zero or More with the
Star
• The * (called the star or asterisk) means “match zero or more”: the group that precedes the star can occur any number of times in the text.
• It can be completely absent or repeated over and over again.
import re
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
Output:
Batman
import re
batRegex = re.compile(r'Bat(wo)*man')
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
Output:
Batwoman
Matching Zero or More with the
Star
import re
batRegex = re.compile(r'Bat(wo)*man')
mo3 = batRegex.search('The Adventures of Batwowowowoman')
print(mo3.group())
Output:
Batwowowowoman
• For ‘Batman’, the (wo)* part of the regex matches zero instances of wo in the string;
• for ‘Batwoman’, the (wo)* matches one instance of wo;
• for ‘Batwowowowoman’, (wo)* matches four instances of wo.
• If you need to match an actual star character, prefix the star in the regular expression with a backslash, \*.
Matching One or More with the
Plus
• While * means “match zero or more,” the + (or plus) means “match one or more.”
• Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional.
import re
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())
Output:
Batwoman
Matching One or More with the
Plus
import re
batRegex = re.compile(r'Bat(wo)+man')
mo2 = batRegex.search('The Adventures of Batwowowowoman')
print(mo2.group())
Output:
Batwowowowoman
import re
batRegex = re.compile(r'Bat(wo)+man')
mo3 = batRegex.search('The Adventures of Batman') is None
print(mo3)
Output:
True
• The regex Bat(wo)+man will not match the string ‘The Adventures of Batman’ because at least one wo is required by the plus sign.
• If you need to match an actual plus sign character, prefix the plus sign with a backslash: \+.
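To compare the three quantifiers side by side, the sketch below tests ?, *, and + against the same names using re.fullmatch. The dictionary layout and the arithmetic sample string are assumptions made for the example.

```python
import re

names = ["Batman", "Batwoman", "Batwowoman"]
patterns = {
    "?": r"Bat(wo)?man",   # zero or one wo
    "*": r"Bat(wo)*man",   # zero or more wo
    "+": r"Bat(wo)+man",   # one or more wo
}

# For each quantifier, record which names match in full
table = {
    q: [re.fullmatch(p, n) is not None for n in names]
    for q, p in patterns.items()
}
for q, row in table.items():
    print(q, row)
# ? [True, True, False]
# * [True, True, True]
# + [False, True, True]

# An escaped plus matches the character itself
literal_plus = re.search(r"\d\+\d", "what is 1+1?").group()
print(literal_plus)  # 1+1
```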