CHAPTER 10
1. Introduction
Regular expressions (regex) are a powerful tool used across various programming
languages and tools to search, match, and manipulate text. They provide a concise and
flexible means for matching strings of text, such as particular characters, words, or
patterns of characters. Regular expressions are widely used for string parsing, data
validation, data extraction, and transformation.
import re
# Replacing text
pattern = re.compile(r"foo")  # pattern assumed for illustration
replaced_text = pattern.sub("baz", "The foo bar.")
print(replaced_text)  # Output: The baz bar.
Basic Operations
• re.match(): Determines if the regex matches at the beginning of the string.
• re.search(): Scans a string for a regex match.
• re.findall(): Finds all substrings where the regex matches and returns them as a list.
• re.sub(): Replaces occurrences of the regex pattern with another string.
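As a rough illustration of how these four functions differ, the sketch below runs each of them with the same pattern on one sample sentence. The sentence and the replacement word are made up for the example.

```python
import re

text = "The foo bar and the foo baz."  # sample sentence (assumed)

# re.match() only succeeds if the pattern matches at position 0
at_start = re.match(r"foo", text)  # None here: the text starts with "The"

# re.search() scans the whole string and returns the first match
first = re.search(r"foo", text)

# re.findall() returns every non-overlapping match as a list of strings
all_foo = re.findall(r"foo", text)

# re.sub() returns a new string with each match replaced
replaced = re.sub(r"foo", "qux", text)

print(at_start)      # None
print(first.span())  # (4, 7)
print(all_foo)       # ['foo', 'foo']
print(replaced)      # The qux bar and the qux baz.
```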
Examples
• Match an email address: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
• Validate a phone number: \+?\d{1,3}?[-.\s]?\(?\d{1,3}?\)?[-.\s]?\d{1,4}[-.\s]?\d{1,4}[-.\s]?\d{1,9}
• Find all hashtags in a tweet: #\w+
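To see two of these patterns in action, the following sketch applies the hashtag and email patterns with re.findall(). The tweet and the note are hypothetical sample strings.

```python
import re

tweet = "Learning #regex with #Python_3 today!"  # hypothetical tweet
hashtags = re.findall(r"#\w+", tweet)
print(hashtags)  # ['#regex', '#Python_3']

email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
note = "Write to jane.doe@example.org for details."  # hypothetical text
emails = re.findall(email_pattern, note)
print(emails)  # ['jane.doe@example.org']
```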
Tips
• Regular expressions are powerful but can be complex for beginners. Start with simple
patterns and gradually introduce more complexity.
• Debug and test your regex patterns using online tools like regex101.com, which provides
real-time regex matching, explanations, and a testing sandbox.
• Remember that very complex regexes can be difficult to read and maintain. Sometimes,
it's better to use several simpler expressions or other string manipulation techniques.
Literal Characters
The most basic regex pattern is matching literal characters. This means you can search
for exact matches within a string.
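import re
pattern = "World"
text = "Hello, World!"
match = re.search(pattern, text)
if match:
    print("Match found:", match.group())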
In this example, re.search() looks for the exact sequence "World" in the string "Hello,
World!" and prints "Match found: World" when it finds a match.
Character Sets
You can match any one of several characters using square brackets []. A character set
matches any one of the characters enclosed in the brackets.
pattern = "[Hh]ello"
This pattern matches both "hello" and "Hello", as either "H" or "h" is allowed by the
character set [Hh].
Ranges
Within character sets, you can specify a range of characters using a hyphen -. This is
particularly useful for matching any single letter or number within a specific range.
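A small sketch of ranges inside character sets, using a made-up sample string:

```python
import re

text = "Room B12, level 3"  # sample string (assumed)

lowercase = re.findall(r"[a-z]", text)     # any single lowercase ASCII letter
digits = re.findall(r"[0-9]", text)        # any single ASCII digit
codes = re.findall(r"[A-Z][0-9]+", text)   # one uppercase letter followed by digits

print(lowercase)  # ['o', 'o', 'm', 'l', 'e', 'v', 'e', 'l']
print(digits)     # ['1', '2', '3']
print(codes)      # ['B12']
```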
Special Characters
If you need to match a character that has a special meaning in regex (like ., *, ?, etc.), you
must escape it with a backslash \.
# Match a period
pattern = r"\."  # raw string, so the backslash reaches the regex engine unchanged
match = re.search(pattern, "Hello, world.")
if match:
    print("Match found:", match.group())
These examples illustrate the flexibility of regex for text searching tasks, even
with just simple character matches. As you become more familiar with regex,
you can combine these simple building blocks to match very complex patterns
with precision and efficiency.
3. Special Characters
In regular expressions (regex), special characters are symbols that have a specific meaning
beyond their literal interpretation. They're used to define patterns for matching a wide
variety of string sequences. Understanding these characters is key to leveraging the full
power of regex for tasks like pattern matching, validation, and parsing. Here's an
overview of commonly used special characters in regex:
1. Period .
• Matches any single character except for a newline character.
• Example: a.b matches "acb", "a&b", but not "ab" or "a\nb".
2. Caret ^
• Matches the start of a string.
• Example: ^Hello matches "Hello" in "Hello, world!" but not in "He said Hello".
3. Dollar Sign $
• Matches the end of a string or the end of a line if multiline mode is enabled.
• Example: world!$ matches "world!" in "Hello, world!" but not in "Hello, world! How are
you?".
4. Asterisk *
• Matches zero or more occurrences of the preceding character.
• Example: a*b matches "b", "ab", "aab", "aaab", etc.
5. Plus Sign +
• Matches one or more occurrences of the preceding character.
• Example: a+b matches "ab", "aab", "aaab", etc., but not "b".
6. Question Mark ?
• Makes the preceding character optional (matches zero or one occurrence).
• Example: colou?r matches both "color" and "colour".
7. Braces {} (Quantifiers)
• Matches a specific number of occurrences of the preceding character.
• Example: a{2}b matches "aab"; a{2,4}b matches "aab", "aaab", or "aaaab".
8. Brackets [] (Character Class)
• Matches any single character contained within the brackets.
• Example: [aeiou] matches any vowel.
9. Pipe | (Alternation)
• Logical OR operator between patterns.
• Example: cat|dog matches "cat" or "dog".
10. Parentheses ()
• Groups multiple characters into a single unit and captures matches for use with
backreferences.
• Example: (abc)+ matches "abc", "abcabc", "abcabcabc", etc.
11. Backslash \
• Escapes special characters or denotes character classes.
• Example: \d matches any digit; \\ matches a backslash.
12. Dot and Star .*
• Together, they're often used to match any sequence of characters.
• Example: a.*b matches any string that starts with "a" and ends with "b".
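Several of the examples in this list can be checked directly. The sketch below pairs each pattern with a sample string and the result the list predicts, then uses re.search() to confirm it. The names cases and results are just for this illustration.

```python
import re

# (pattern, text, expected result of re.search)
cases = [
    (r"a.b", "a&b", True),
    (r"a.b", "ab", False),
    (r"^Hello", "Hello, world!", True),
    (r"^Hello", "He said Hello", False),
    (r"world!$", "Hello, world!", True),
    (r"world!$", "Hello, world! How are you?", False),
    (r"colou?r", "colour", True),
    (r"a{2,4}b", "ab", False),
    (r"cat|dog", "hotdog", True),
]

results = [re.search(pattern, text) is not None for pattern, text, _ in cases]

for (pattern, text, expected), found in zip(cases, results):
    print(f"{pattern!r} on {text!r}: found={found}, expected={expected}")
```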
Character Classes
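• \d: Matches any digit (equivalent to [0-9] for ASCII input; in Python 3, str patterns also match other Unicode digits unless re.ASCII is set).
• \D: Matches any non-digit.
• \w: Matches any word character (equivalent to [a-zA-Z0-9_] for ASCII input; Unicode letters and digits are included by default in Python 3).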
• \W: Matches any non-word character.
• \s: Matches any whitespace character.
• \S: Matches any non-whitespace character.
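The shorthand classes can be tried out with re.findall(). This is a minimal sketch on an invented sample string:

```python
import re

text = "Order 66: ship_to Bob, ASAP!"  # sample string (assumed)

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters (underscore included)
spaces = re.findall(r"\s", text)    # individual whitespace characters
chunks = re.findall(r"\S+", text)   # runs of non-whitespace characters

print(digits)       # ['66']
print(words)        # ['Order', '66', 'ship_to', 'Bob', 'ASAP']
print(len(spaces))  # 4
print(chunks)       # ['Order', '66:', 'ship_to', 'Bob,', 'ASAP!']
```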
Using these special characters, regular expressions can create powerful patterns for
matching and manipulating strings in a highly flexible and efficient manner.
4. Character Classes
Character classes in regular expressions (regex) provide a way to match any one out of a
set of characters, making it easier to define patterns that can match various characters in
a single position. Character classes can be predefined (also known as "shorthand
character classes") or custom-defined within square brackets [].
Understanding and effectively utilizing character classes can significantly enhance your
capability to perform sophisticated text processing tasks using regular expressions.
Validating an email address using regular expressions and character classes involves
creating a pattern that matches the general structure of email addresses. Typically, an
email address consists of a local part, an @ symbol, and a domain part, with specific rules
for what characters can appear in each section.
Here's a basic approach to email validation using Python's re module. Keep in mind, this
example aims for simplicity and may not cover all edge cases or fully comply with the RFC
5322 standard for email addresses.
import re

def validate_email(email):
    # Basic email pattern
    # \w matches any word character (alphanumeric plus underscore)
    # The character class [\w\.-] also allows dots and hyphens within the name
    # {2,} in the domain part ensures at least two characters, a simple way to enforce a minimal TLD length
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w{2,}$'
    if re.match(pattern, email):
        print(f"{email} is a valid email address.")
    else:
        print(f"{email} is not a valid email address.")

# Examples
validate_email("[email protected]") # Valid
validate_email("[email protected]") # Valid
validate_email("invalid-email@") # Invalid
validate_email("[email protected]") # Invalid
validate_email("special@char*acters.com") # Invalid
Explanation
• ^ and $ are anchors that match the start and end of the string, respectively,
ensuring the entire string conforms to the pattern.
• [\w\.-]+ matches one or more word characters, dots, or hyphens. The same class is
used for the local part and for the portion of the domain that precedes the
top-level domain (TLD).
• @ matches the literal "@" character that separates the local part from the domain.
• \.\w{2,} matches a dot followed by two or more word characters, aiming to
validate the TLD part. Note that {2,} enforces a minimal length but doesn't cap the
maximum, allowing for longer TLDs.
• This regex is simplified for common use cases and might not validate all valid
email addresses according to the full specifications. For instance, it doesn't support
quoted strings or comments in the local part, which are rarely used features.
• Email validation through regex can get extremely complicated if aiming for full RFC
compliance. In many practical applications, further validation (like sending a
confirmation email) is used in conjunction with regex checks.
• Adjustments might be needed based on the specific requirements or expected
format of email addresses in your application context.
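As a variation on the function above, the following sketch returns a boolean instead of printing, which tends to be easier to reuse and test. The name is_valid_email and all sample addresses are hypothetical, chosen only to exercise the pattern.

```python
import re

# Same simplified pattern as above, compiled once for reuse
EMAIL_PATTERN = re.compile(r"^[\w.-]+@[\w.-]+\.\w{2,}$")

def is_valid_email(email: str) -> bool:
    """Return True if the address matches the simplified pattern."""
    return EMAIL_PATTERN.match(email) is not None

# Hypothetical sample addresses
samples = [
    "john.doe@example.com",     # matches
    "invalid-email@",           # nothing after the @
    "user@domain.c",            # TLD shorter than two characters
    "special@char*acters.com",  # '*' is not in the allowed class
]
for address in samples:
    print(address, is_valid_email(address))
```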
5. Quantifiers
1. * (Asterisk)
• Meaning: Matches zero or more occurrences of the preceding element.
• Example: ab* will match "a", "ab", "abb", "abbb", etc.
2. + (Plus)
• Meaning: Matches one or more occurrences of the preceding element.
• Example: ab+ will match "ab", "abb", "abbb", etc., but not "a".
3. ? (Question Mark)
• Meaning: Makes the preceding element optional. It matches zero or one occurrence of
the preceding element.
• Example: ab? will match "a" or "ab", but not "abb".
4. {n} (Exact Number)
• Meaning: Matches exactly n occurrences of the preceding element.
• Example: a{3} will match exactly three "a" characters in a row, such as in "aaa".
5. {n,} (At Least n)
• Meaning: Matches n or more occurrences of the preceding element.
• Example: a{2,} will match "aa", "aaa", "aaaa", etc.
6. {n,m} (Between n and m)
• Meaning: Matches from n to m occurrences of the preceding element. If m is omitted, it's
considered as infinity.
• Example: a{2,4} will match "aa", "aaa", or "aaaa".
7. *?, +?, ??, {n,m}? (Non-Greedy or Lazy)
• Meaning: By default, quantifiers are greedy, meaning they match as many occurrences as
possible. Adding ? after a quantifier makes it lazy (non-greedy), meaning it matches as few
occurrences as possible.
• Example: In the string "aaaa", a+? will match a single "a", whereas a+ would match all four
"a"s.
Practical Examples
• Matching an HTML Tag: <.*?> is a lazy quantifier pattern that matches any HTML
tag. The ? after * makes the match non-greedy, ensuring it stops at the first >.
• Validating Password Strength: ^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}$ ensures a
string has at least one letter, at least one digit, and is at least 8 characters long.
The {8,} quantifier enforces the minimum length requirement.
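The quantifier behaviour described in this section can be verified with a few quick checks. In this sketch, full() is a small helper invented for the example; it wraps re.fullmatch() so that the whole string has to match.

```python
import re

def full(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(full(r"ab*", "a"), full(r"ab*", "abbb"))  # True True
print(full(r"ab+", "a"))                        # False: at least one 'b' is required
print(full(r"ab?", "abb"))                      # False: at most one 'b' is allowed
print(full(r"a{2,4}", "aaaaa"))                 # False: five 'a's exceed the maximum

# Greedy versus lazy on the same input
greedy = re.search(r"a+", "aaaa").group()
lazy = re.search(r"a+?", "aaaa").group()
print(greedy, lazy)  # aaaa a
```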
Usage Tips
• Use quantifiers to create flexible and powerful regex patterns that can match a wide range
of text data.
• Be mindful of greedy vs. lazy quantifiers, especially when matching patterns in large texts
or strings with multiple potential matches.
• Regular expressions can become difficult to read and maintain; consider breaking complex
patterns into simpler, well-commented parts when possible.
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@#$%]).{8,}$
Explanation
• ^ and $ are anchors that match the start and the end of the string, respectively,
ensuring that the entire string fits the pattern.
• (?=.*[a-z]) is a positive lookahead assertion that ensures there's at least one
lowercase letter somewhere in the string. The .* part (any character except
newline, 0 or more times) allows characters before the lowercase letter.
• (?=.*[A-Z]) ensures at least one uppercase letter is present.
• (?=.*\d) checks for at least one digit.
• (?=.*[@#$%]) requires at least one special character from the set @, #, $, %.
• .{8,} stipulates that the entire string must be at least 8 characters long. The dot .
matches any character except newline, and the quantifier {8,} denotes 8 or more
of those characters.
Python Example
Let's use this pattern in a Python function to validate passwords:
import re

def validate_password(password):
    pattern = r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@#$%]).{8,}$"
    if re.match(pattern, password):
        return "Password is valid."
    else:
        return "Password is invalid."

# Examples
print(validate_password("Password123@")) # Valid
print(validate_password("pass")) # Invalid - too short, lacks uppercase, digit, special char
print(validate_password("Password123")) # Invalid - lacks special character
print(validate_password("password123@")) # Invalid - lacks uppercase
Conclusion
This example showcases how you can use quantifiers along with other regex features to
create a powerful pattern for password validation. By adjusting the pattern, you can
enforce various password policies, making your applications more secure. Regular
expressions are a versatile tool for string matching and validation tasks, allowing for
precise and efficient text processing.
import re
pattern = r'a.c'
text = "abc aac adc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['abc', 'aac', 'adc']
In this example, a.c will match "abc", "aac", "adc", and so on, where the dot . matches 'b',
'a', 'd', respectively.
if match:
    print(match.group(1)) # Output: Hello, World!
Here, "(.*?)" matches the shortest sequence of any characters between double quotes.
The ? after * makes the quantifier lazy (non-greedy), so it matches as few characters as
possible, allowing it to stop at the first closing quote it encounters.
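A self-contained sketch of this behaviour, comparing the lazy pattern with its greedy counterpart. The sample sentence is an assumption, not taken from the original example.

```python
import re

# Hypothetical sample text containing two quoted passages
text = 'She said "Hello, World!" and then "Goodbye".'

lazy = re.search(r'"(.*?)"', text)    # stops at the first closing quote
greedy = re.search(r'"(.*)"', text)   # runs on to the last closing quote

print(lazy.group(1))    # Hello, World!
print(greedy.group(1))  # Hello, World!" and then "Goodbye
```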
pattern = r'.*'
Second line."""
if match:
pattern = r'\.'
In this case, \. will match only the literal dot characters in the text.
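To make the difference between the plain dot, the dot under re.DOTALL, and the escaped dot concrete, here is a minimal sketch on an assumed two-line sample string:

```python
import re

text = "First line.\nSecond line."  # two-line sample (assumed)

# By default the dot stops at the newline
default = re.search(r".*", text).group()

# With re.DOTALL the dot also matches the newline
dotall = re.search(r".*", text, re.DOTALL).group()

# An escaped dot matches only literal '.' characters
literal_dots = re.findall(r"\.", text)

print(repr(default))  # 'First line.'
print(repr(dotall))   # 'First line.\nSecond line.'
print(literal_dots)   # ['.', '.']
```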
Conclusion
The dot character is a versatile tool in regex patterns, enabling the matching of
variable text sequences with ease. Its utility can be further enhanced by
combining it with quantifiers and flags to suit complex matching needs.
Understanding how to effectively use the dot character and its nuances is
crucial for mastering regular expressions in Python.
7. Greedy Matches
In the context of regular expressions (regex), "greedy" matching refers to the
tendency of certain quantifiers to match as much of the input text as possible. These
quantifiers include * (zero or more), + (one or more), and ? (zero or one), along with
{n,} (at least n), and {n,m} (between n and m, inclusive). When applied, they attempt to
consume as much of the string as they can while still allowing the overall pattern to
match.
import re
pattern = r'<.*>'
text = "<div>Hello, <span>World!</span></div>"
match = re.search(pattern, text)
print(match.group())
output:
<div>Hello, <span>World!</span></div>
The pattern <.*> is intended to match HTML-like tags. However, due to the greedy nature
of .*, it matches the entire string from the first < to the last >, instead of stopping at the
first closing >.
pattern = r'<.*?>'
match = re.search(pattern, text)
print(match.group())
output:
<div>
Here, <.*?> matches the shortest possible string that starts with < and ends with >, which
is the opening <div> tag.
# Greedy matching
greedy_pattern = r'<title>.*</title>'
# Non-Greedy matching
lazy_pattern = r'<title>.*?</title>'
Output:
Greedy match: <title>The Greedy Fox</title><body>Once upon a time...</body>
The greedy pattern mistakenly captures beyond the intended </title> tag, including
<body> tag content, whereas the non-greedy pattern correctly stops at the first </title>
tag encountered.
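The effect is easiest to reproduce when the text contains more than one closing </title> tag. The sketch below uses a hypothetical HTML fragment built for that purpose, so its output differs from the listing above.

```python
import re

# Hypothetical fragment with two <title> elements
html = "<title>The Greedy Fox</title><body><title>Inner</title></body>"

greedy = re.search(r"<title>.*</title>", html).group()
lazy = re.search(r"<title>.*?</title>", html).group()

print("Greedy match:", greedy)  # runs to the last </title>
print("Lazy match:", lazy)      # stops at the first </title>
```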
Conclusion
Greedy matching is powerful but requires careful handling to avoid over-matching. In
situations where precision is crucial, or you're dealing with nested structures, switching to
non-greedy matching by appending a ? to your quantifiers can provide more control and
accuracy. Remember, the choice between greedy and non-greedy matching depends on
the specific requirements of your pattern matching task.
8. Grouping
Grouping in regular expressions is a powerful feature that allows you to
match and isolate parts of a pattern. This is achieved using parentheses ()
around the parts of the regex pattern you wish to group. Grouping serves
multiple purposes: it can be used to apply a quantifier to a part of a pattern,
to isolate data within a match, or to specify alternatives.
Basic Grouping
Here's a simple example to illustrate grouping:
import re
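A minimal sketch of basic grouping, using sample strings chosen for illustration: parentheses let a quantifier apply to a whole sequence, and each group's text can be read back from the match.

```python
import re

# Quantifier applied to the whole group "abc"
repeated = re.search(r"(abc)+", "abcabcabc xyz")
print(repeated.group(0))  # abcabcabc  (the entire match)
print(repeated.group(1))  # abc        (the last repetition captured by the group)

# Several groups isolate parts of one match
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2023-04-01")
print(date.groups())      # ('2023', '04', '01')
```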
Non-Capturing Groups
If you want to group parts of your pattern without creating a capturing
group, use (?:...):
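As a sketch of the difference, the two patterns below match the same text, but only the capturing version stores a group. The sample string is made up.

```python
import re

text = "hahaha!"  # sample string (assumed)

capturing = re.search(r"(ha)+!", text)
non_capturing = re.search(r"(?:ha)+!", text)

print(capturing.group(0), capturing.groups())          # hahaha! ('ha',)
print(non_capturing.group(0), non_capturing.groups())  # hahaha! ()
```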
Named Groups
For complex patterns, you can name your groups for easier access:
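Named groups use the (?P<name>...) syntax. The sketch below assumes a date-like sample string and the group names year, month, and day, all chosen for illustration.

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.search(pattern, "Due: 2023-04-01")  # sample string (assumed)

print(match.group("year"))  # 2023
print(match.groupdict())    # {'year': '2023', 'month': '04', 'day': '01'}
```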
text = "Goaaaaaal!"
# The quantifier {2,} applies only to the preceding 'a', so this pattern
# already matches the whole text
pattern = r"goa{2,}l!"
match = re.search(pattern, text, re.IGNORECASE)  # re.IGNORECASE makes the match case-insensitive
if not match:
    # Grouping 'a{2,}' does not change what is matched; it only captures the run of a's
    pattern = r"go(a{2,})l!"
    match = re.search(pattern, text, re.IGNORECASE)
if match:
    print(match.group(0))  # 'Goaaaaaal!'
import re
text = "Hello, how are you?"  # Sample text, assumed for illustration
pattern = r"^Hello"  # '^' anchors the match to the start of the string
if re.search(pattern, text):
    print("The text starts with 'Hello'.")
else:
    print("The text does not start with 'Hello'.")
pattern = r"you\?$"  # Note: The '?' needs to be escaped with '\' because it's a special character in regex.
if re.search(pattern, text):
    print("The text ends with 'you?'.")
else:
    print("The text does not end with 'you?'.")
Key Points:
• Use ^ to match the start of a string.
• Use $ to match the end of a string.
• These symbols help us find out if our text starts or ends with specific words or characters.
Example Usage
Let's see an example to illustrate the use of these methods and attributes:
import re
if match:
    # Entire match
    print("Matched text:", match.group())
    # Subgroups by name
    print("Email:", match.group('email'))
    # All subgroups
    print("All groups:", match.groups())
    # Match position
    print("Start, End:", match.span())
Output:
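As an illustration of the same Match methods with concrete values, the sketch below assumes a hypothetical pattern with a named group email and a made-up sample sentence; neither comes from the original example.

```python
import re

# Hypothetical pattern and text, chosen only to exercise the Match methods
pattern = r"(?P<email>[\w.-]+@[\w.-]+\.\w+)"
text = "Contact us at support@example.com for help."

match = re.search(pattern, text)
if match:
    print("Matched text:", match.group())    # support@example.com
    print("Email:", match.group("email"))    # support@example.com
    print("All groups:", match.groups())     # ('support@example.com',)
    print("Start, End:", match.span())       # (14, 33)
```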
11. Substituting
Substituting text using regular expressions in Python involves replacing parts of a string
that match a specific pattern with another string. This is commonly done with the re.sub()
function from Python's re module. The re.sub() function is powerful for various tasks,
such as cleaning data, making corrections, or formatting text according to specific rules.
Basic Syntax of re.sub()
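The basic syntax of re.sub() is as follows:
re.sub(pattern, repl, string, count=0, flags=0)
Here pattern is the regex to search for, repl is the replacement string (or a function that
takes a match object and returns the replacement), and string is the input text. The
optional count limits the number of replacements; 0 means replace all occurrences.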
date = "2023-04-01"
formatted_date = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)
print(formatted_date) # Output: 01/04/2023
In the replacement string, \1, \2, and \3 refer to the contents of the first, second, and third
capturing groups in the pattern, respectively.
def censor(match):
    word = match.group(0) # The matched word
    return "*" * len(word) # Replace with asterisks
print(censored_text) # Output: Avoid using words like ***** or **** in user messages.
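A complete, runnable sketch of function-based replacement follows. The word list and the input sentence are hypothetical stand-ins, so the output differs slightly from the one shown above.

```python
import re

def censor(match):
    """Replace the matched word with asterisks of the same length."""
    return "*" * len(match.group(0))

# Hypothetical input and word list
text = "Avoid using words like spam or junk in user messages."
censored_text = re.sub(r"\b(?:spam|junk)\b", censor, text)

print(censored_text)  # Avoid using words like **** or **** in user messages.
```

Passing a function as the replacement lets each substitution depend on the text that was matched, which a fixed replacement string cannot do.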
Substituting text with regular expressions in Python is a versatile way to search for and
replace patterns within strings, making it a valuable tool for text processing and data
cleaning tasks.
import re
text = "one, two,three, four , five"
parts = re.split(r"\s*,\s*", text) # Split on a comma with optional whitespace on either side
print(parts) # Output: ['one', 'two', 'three', 'four', 'five']
print(parts) # Output: ['The', ' ', 'rain', ' ', 'in', ' ', 'Spain.']
In this example, the string is only split at the first two spaces, leaving the rest of the string
intact.
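Two further options of re.split() can be sketched on one sample sentence: a capturing group in the pattern keeps the separators in the result, and maxsplit limits how many splits are made. The sentence is assumed for illustration.

```python
import re

text = "The rain in Spain."  # sample sentence (assumed)

# A capturing group keeps the separators in the result list
with_separators = re.split(r"(\s)", text)
print(with_separators)  # ['The', ' ', 'rain', ' ', 'in', ' ', 'Spain.']

# maxsplit=2 splits only at the first two spaces
limited = re.split(r"\s", text, maxsplit=2)
print(limited)  # ['The', 'rain', 'in Spain.']
```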
Conclusion
The re.split() function offers flexible and powerful options for splitting strings and
handles cases the basic str.split() method cannot, such as complex patterns or
multiple delimiters. It's especially useful in data parsing, cleaning, and preprocessing tasks
where the structure of the input data might not be uniform.
13. Compiling Regular Expressions
Compiling a regular expression in Python refers to converting a regex pattern, specified
as a string, into a re.Pattern object using the re.compile() method. This compiled object
can then be used to perform various regex operations, such as matching, searching, and
replacing, more efficiently, especially when the same pattern is used multiple times.
pattern = re.compile(r'\bfoo\b')
# Replace
replaced = pattern.sub("bar", "foo foofoo foo")
print("After replacement:", replaced)  # After replacement: bar foofoo bar
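pattern = re.compile(r'(?P<word>\bfoo\b)')
match = pattern.search("foo bar")  # Sample input, assumed for illustration
if match:
    print(match.group('word'))  # foo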
Conclusion
Compiling regular expressions is a best practice when the same pattern is applied
multiple times, improving both the efficiency and readability of your code. It offers a
structured approach to performing regex operations in Python, making your pattern
matching tasks more manageable and maintainable.
14. Flags
In regular expressions (regex), flags are special settings that modify the behavior of
the pattern matching. Flags can make your regex more powerful and adaptable,
allowing for case-insensitive matching, multiline searches, and more. In Python's re
module, flags are passed as additional arguments to functions like re.compile(),
re.search(), re.match(), re.findall(), and others.
re.IGNORECASE or re.I
Makes the regex case-insensitive, allowing it to match letters regardless of case.
import re
pattern = r'python'
text = 'I love Python!'
match = re.search(pattern, text, re.IGNORECASE)
print(match.group()) # Output: Python
re.MULTILINE or re.M
Changes the behavior of ^ and $ so they match the beginning and end of each line,
not just the beginning and end of the entire string.
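A minimal sketch of the difference, using an assumed three-line sample string:

```python
import re

text = "first line\nsecond line\nthird line"  # sample text (assumed)

# Without the flag, ^ matches only at the very start of the string
without_flag = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ and $ also match at the start and end of each line
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(without_flag)  # ['first']
print(line_starts)   # ['first', 'second', 'third']
print(line_ends)     # ['line', 'line', 'line']
```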
re.DOTALL or re.S
Allows the dot . to match any character, including newline characters. By default, .
does not match newlines.
text = 'Hello\nWorld'
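One way to see the effect on this text is to search with a pattern whose dot has to span the newline. The pattern Hello.World is an assumption made for this sketch.

```python
import re

text = 'Hello\nWorld'

# Without re.DOTALL the dot cannot match the newline, so there is no match
no_flag = re.search(r'Hello.World', text)

# With re.DOTALL the dot matches the newline as well
with_flag = re.search(r'Hello.World', text, re.DOTALL)

print(no_flag)                  # None
print(repr(with_flag.group()))  # 'Hello\nWorld'
```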
re.VERBOSE or re.X
Allows you to add whitespace and comments to the regex string, making it more
readable.
pattern = re.compile(r"""
\b # Word boundary
\w+ # One or more word characters
\b # Word boundary
""", re.VERBOSE)
Combining Flags
You can combine multiple flags using the bitwise OR operator |.
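text = '''Python
python
PYTHON'''
pattern = r'^python$'
matches = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)
print(matches) # Output: ['Python', 'python', 'PYTHON']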
15. % method
The % method, often referred to as the string formatting or interpolation operator in
Python, is a way to format strings. It allows you to insert values into a string template. The
% operator is considered somewhat outdated in modern Python versions, with the
str.format() method and formatted string literals (f-strings, introduced in Python 3.6)
being preferred for readability and flexibility. However, understanding the % method can
still be useful, especially for maintaining older codebases.
Basic Usage
The % operator is used with a format string on the left and the values to be inserted on
the right, which can be a single value, a tuple of values, or a dictionary of values.
# Single value
name = "John"
greeting = "Hello, %s!" % name
print(greeting) # Output: Hello, John!
# Multiple values
age = 30
info = "Name: %s, Age: %d" % (name, age)
print(info) # Output: Name: John, Age: 30
Format Specifiers
The % method uses format specifiers like %s for strings, %d for integers, and %f for floating-
point numbers. Each specifier can be preceded by additional parameters to specify
minimum widths, alignment, padding, decimal precision, etc.
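The width, alignment, and padding parameters can be sketched as follows. The variable names and values are arbitrary examples.

```python
count = 42
label = "left"
ratio = 3.14159

padded_int = "%5d" % count       # right-aligned in a field of width 5
left_aligned = "%-10s|" % label  # left-aligned in a field of width 10
zero_padded = "%08.3f" % ratio   # width 8, zero-padded, 3 decimal places

print(repr(padded_int))    # '   42'
print(repr(left_aligned))  # 'left      |'
print(repr(zero_padded))   # '0003.142'
```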
price = 9.99
message = "Price: $%.2f" % price # 2 decimal places for a float
print(message) # Output: Price: $9.99
Dictionary-Based Formatting
You can also use a dictionary for string formatting with the % operator. This approach
uses named placeholders.
data = {'name': 'Alice', 'age': 28}
message = "Name: %(name)s, Age: %(age)d" % data
print(message) # Output: Name: Alice, Age: 28
name = "Alice"
age = 28
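For comparison, the same message can be produced with the % operator, with str.format(), and with an f-string. This is a small sketch; the variable names on the left are chosen only for the example.

```python
name = "Alice"
age = 28

percent_style = "Name: %s, Age: %d" % (name, age)
format_style = "Name: {}, Age: {}".format(name, age)
fstring_style = f"Name: {name}, Age: {age}"

print(percent_style)  # Name: Alice, Age: 28
print(format_style)   # Name: Alice, Age: 28
print(fstring_style)  # Name: Alice, Age: 28
```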
While the % method is part of Python's history and is still supported, the newer string
formatting options are generally recommended for their improved functionality and ease
of use.