Regular Expression
RegEx Module
Python has a built-in package called re, which can be used to work with Regular Expressions.
Import the re module:
import re
RegEx in Python
When you have imported the re module, you can start using regular expressions:
Search the string to see if it starts with "The" and ends with "Spain":
import re
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
print("YES! We have a match!" if x else "No match")
Output:
YES! We have a match!
RegEx Functions
The re module offers a set of functions that allow us to search a string for a match:
Function Description
split Returns a list where the string has been split at each match
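As a quick orientation, the sketch below calls the four re functions that appear in this section (findall, search, split, and sub) on the same sample string. The variable names are illustrative only.

```python
import re

txt = "The rain in Spain"

# findall: list of every non-overlapping match
all_matches = re.findall("ai", txt)

# search: Match object for the first match, or None if there is none
first = re.search(r"\s", txt)

# split: list of pieces, split at each match
pieces = re.split(r"\s", txt)

# sub: new string with each match replaced
replaced = re.sub(r"\s", "9", txt)

print(all_matches)    # ['ai', 'ai']
print(first.start())  # 3
print(pieces)         # ['The', 'rain', 'in', 'Spain']
print(replaced)       # The9rain9in9Spain
```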
Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special
meaning:
Character Description Example
\D Returns a match for every character that is NOT a digit "\D"
Sets
A set is a set of characters inside a pair of square brackets [] with a special meaning:
Set Description
[arn] Returns a match where one of the specified characters (a, r, or n) is present
[a-n] Returns a match for any lower case character, alphabetically between a and n
[0123] Returns a match where any of the specified digits (0, 1, 2, or 3) are present
[a-zA-Z] Returns a match for any character alphabetically between a and z, lower case OR upper case
[+] In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string
The list contains the matches in the order they are found.
If no matches are found, an empty list is returned:
Example
Return an empty list if no match was found:
import re
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)
Output:
[]
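To illustrate the ordering statement above, here is a minimal sketch with a pattern that does occur in the sample string. The pattern is chosen for illustration only; the matches come back left to right.

```python
import re

txt = "The rain in Spain"

# Every run of two or more letters, in the order it appears in txt
words = re.findall(r"[A-Za-z]{2,}", txt)
print(words)  # ['The', 'rain', 'in', 'Spain']
```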
You can control the number of occurrences by specifying the maxsplit parameter:
Example
Split the string only at the first occurrence:
import re
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)
Output:
['The', 'rain in Spain']
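For comparison, the following sketch shows the effect of different maxsplit values on the same string. The variable names are illustrative.

```python
import re

txt = "The rain in Spain"

# Default (maxsplit=0): split at every match
split_all = re.split(r"\s", txt)

# maxsplit=2: at most two splits, the remainder stays in the last item
split_two = re.split(r"\s", txt, maxsplit=2)

print(split_all)  # ['The', 'rain', 'in', 'Spain']
print(split_two)  # ['The', 'rain', 'in Spain']
```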
You can control the number of replacements by specifying the count parameter:
Example
Replace the first 2 occurrences:
import re
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)
Output:
The9rain9in Spain
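As a small sketch of how count changes the result, compare the default (replace every match) with count=1. The variable names are illustrative.

```python
import re

txt = "The rain in Spain"

# Default (count=0): every match is replaced
replace_all = re.sub(r"\s", "9", txt)

# count=1: only the first match is replaced
replace_one = re.sub(r"\s", "9", txt, count=1)

print(replace_all)  # The9rain9in9Spain
print(replace_one)  # The9rain in Spain
```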
Match Object
A Match Object is an object containing information about the search and the result.
Note: If there is no match, the value None will be returned, instead of the Match Object.
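Because a failed search returns None, it is usually safer to test the result before using it. A minimal sketch of that check, with "Portugal" as a pattern that does not occur in the sample string:

```python
import re

txt = "The rain in Spain"

x = re.search("Portugal", txt)

# x is None here, so calling x.span() directly would raise AttributeError
if x is None:
    result = "No match"
else:
    result = x.group()

print(result)  # No match
```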
Example
Do a search that will return a Match Object:
import re
txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) #this will print an object
Output:
<re.Match object; span=(5, 7), match='ai'>
The Match object has properties and methods used to retrieve information about the search
and the result:
.span() returns a tuple containing the start and end positions of the match.
.string returns the string passed into the function.
.group() returns the part of the string where there was a match.
Example
Print the position (start- and end-position) of the first match occurrence.
The regular expression looks for any word that starts with an upper case "S":
import re
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())
Output:
(12, 17)
Example
Print the string passed into the function:
import re
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)
Output:
The rain in Spain
Example
Print the part of the string where there was a match.
The regular expression looks for any word that starts with an upper case "S":
import re
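The listing for this example is cut off in the source. A minimal sketch of what the description implies, assuming the same sample string and pattern as the two previous examples, could be:

```python
import re

txt = "The rain in Spain"

# Assumed pattern, reused from the previous examples: a word starting with "S"
x = re.search(r"\bS\w+", txt)
print(x.group())  # Spain
```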
Metacharacters
Metacharacters are characters with a special meaning:
[] A set of characters "[a-m]"
import re
txt = "The rain in Spain"
#Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", txt)
print(x)
Output:
['h', 'e', 'a', 'i', 'i', 'a', 'i']
\ Signals a special sequence (can also be used to escape special characters) "\d"
import re
txt = "That will be 59 dollars"
#Find all digit characters:
x = re.findall(r"\d", txt)
print(x)
Output:
['5', '9']
Output
['hello']
| Either or "falls|stays"
import re
txt = "The rain in Spain falls mainly in the plain!"
#Check if the string contains either "falls" or "stays":
x = re.findall("falls|stays", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
['falls']
Yes, there is at least one match!
Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special
meaning:
Character Description Example
Output:
[]
No match
import re
txt = "The rain in Spain"
#Check if "ain" is present at the end of a WORD:
x = re.findall(r"ain\b", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
['ain', 'ain']
Yes, there is at least one match!
\B Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word r"\Bain" r"ain\B"
(the "r" in the beginning is making sure that the string is being treated as a "raw string")
import re
txt = "The rain in Spain"
#Check if "ain" is present, but NOT at the beginning of a word:
x = re.findall(r"\Bain", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
['ain', 'ain']
Yes, there is at least one match!
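To make the difference between \b and \B concrete, the sketch below runs all four combinations against the same sample string. The variable names are illustrative.

```python
import re

txt = "The rain in Spain"

# \b: "ain" must touch a word boundary
at_word_start = re.findall(r"\bain", txt)   # no word starts with "ain"
at_word_end = re.findall(r"ain\b", txt)     # "rain" and "Spain" end with it

# \B: "ain" must NOT touch a word boundary on that side
not_at_start = re.findall(r"\Bain", txt)    # inside "rain" and "Spain"
not_at_end = re.findall(r"ain\B", txt)      # "ain" is never followed by a word character

print(at_word_start)  # []
print(at_word_end)    # ['ain', 'ain']
print(not_at_start)   # ['ain', 'ain']
print(not_at_end)     # []
```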
\d Returns a match where the string contains digits (numbers from 0-9) "\d"
import re
txt = "The rain in Spain"
#Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
[]
No match
\D Returns a match for every character that is NOT a digit "\D"
import re
txt = "The rain in Spain"
#Return a match at every non-digit character:
x = re.findall(r"\D", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!
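\d and \D are complements: every character in a string is matched by exactly one of them. A short sketch of this, reusing the "59 dollars" sample string from the metacharacter example:

```python
import re

txt = "That will be 59 dollars"

digits = re.findall(r"\d", txt)       # only the digit characters
non_digits = re.findall(r"\D", txt)   # everything else, spaces included

print(digits)                                   # ['5', '9']
print(len(digits) + len(non_digits) == len(txt))  # True
```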
\s Returns a match where the string contains a white space character "\s"
import re
txt = "The rain in Spain"
#Return a match at every white-space character:
x = re.findall(r"\s", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
[' ', ' ', ' ']
Yes, there is at least one match!
\S Returns a match for every character that is NOT a white space character "\S"
import re
txt = "The rain in Spain"
#Return a match at every NON white-space character:
x = re.findall(r"\S", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!
\w Returns a match where the string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ character) "\w"
import re
txt = "The rain in Spain"
#Return a match at every word character (characters from a to Z, digits from 0-9, and the underscore _ character):
x = re.findall(r"\w", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!
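The sample string above contains only letters, so it does not show that \w also covers digits and the underscore. A minimal sketch with a hypothetical string (not from the original text) makes that visible:

```python
import re

# Hypothetical sample containing letters, digits, an underscore and punctuation
txt = "user_42 logged in!"

# \w+ groups consecutive word characters; the space and "!" are excluded
tokens = re.findall(r"\w+", txt)
print(tokens)  # ['user_42', 'logged', 'in']
```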
\W Returns a match for every character that is NOT a word character "\W"
import re
Output:
[' ', ' ', ' ']
Yes, there is at least one match!
\Z Returns a match if the specified characters are at the end of the string "Spain\Z"
import re
txt = "The rain in Spain"
#Check if the string ends with "Spain":
x = re.findall(r"Spain\Z", txt)
print(x)
if x:
    print("Yes, there is a match!")
else:
    print("No match")
Output:
['Spain']
Yes, there is a match!
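One case where \Z differs from the $ metacharacter is a string that ends with a newline. The comparison below is an addition for illustration and is not part of the original examples; the variable names are hypothetical.

```python
import re

# Same sample text, but with a trailing newline
txt = "The rain in Spain\n"

# $ also matches just before a trailing newline
with_dollar = re.findall(r"Spain$", txt)

# \Z matches only at the very end of the string
with_z = re.findall(r"Spain\Z", txt)

print(with_dollar)  # ['Spain']
print(with_z)       # []
```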
Sets
A set is a set of characters inside a pair of square brackets [] with a special meaning:
[arn] Returns a match where one of the specified characters (a, r, or n) is present
import re
txt = "The rain in Spain"
#Check if the string has any a, r, or n characters:
x = re.findall("[arn]", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
['r', 'a', 'n', 'n', 'a', 'n']
Yes, there is at least one match!
[a-n] Returns a match for any lower case character, alphabetically between a and n
import re
txt = "The rain in Spain"
#Check if the string has any characters between a and n:
x = re.findall("[a-n]", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
Yes, there is at least one match!
Output:
['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']
Yes, there is at least one match!
[0123] Returns a match where any of the specified digits (0, 1, 2, or 3) are present
import re
txt = "The rain in Spain"
#Check if the string has any 0, 1, 2, or 3 digits:
x = re.findall("[0123]", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
[]
No match
Output:
['8', '1', '1', '4', '5']
Yes, there is at least one match!
Output:
['11', '45']
Yes, there is at least one match!
[a-zA-Z] Returns a match for any character alphabetically between a and z, lower case
OR upper case
import re
txt = "8 times before 11:45 AM"
#Check if the string has any characters from a to z lower case, and A to Z upper case:
x = re.findall("[a-zA-Z]", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']
Yes, there is at least one match!
[+] In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string
import re
txt = "8 times before 11:45 AM"
#Check if the string has any + characters:
x = re.findall("[+]", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
Output:
[]
No match
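Since the sample string above contains no + character, the following sketch uses a hypothetical string that does, and contrasts [+] with an unbracketed +, which acts as a quantifier.

```python
import re

# Hypothetical sample string containing a literal "+"
txt = "1+1=2"

in_set = re.findall("[+]", txt)     # inside a set, + is a literal character
escaped = re.findall(r"\+", txt)    # outside a set, + must be escaped to be literal
quantifier = re.findall("1+", txt)  # unescaped + means "one or more of the preceding item"

print(in_set)      # ['+']
print(escaped)     # ['+']
print(quantifier)  # ['1', '1']
```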
The “re” module, which comes with every Python installation, provides regular expression
support.
Matching patterns
Regular expressions are a complicated mini-language. They rely on special characters to match
unknown strings, but let's start with literal characters, such as letters, numbers, and the space
character, which always match themselves. Let's see a basic example:
Output:
regex matches: King
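The code for this example is missing from the source. As a hedged reconstruction, assuming a hypothetical sample sentence that contains the word "King", literal matching could look like this:

```python
import re

# Hypothetical sample text; the original example's input is not shown
text = "Long live the King"

# Each literal character in the pattern matches itself
m = re.search("King", text)
matched = m.group() if m else None

if matched:
    print("regex matches:", matched)  # regex matches: King
```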