Lec 07 II Dsfa23
Goals for this Lecture
Deal with a major challenge of EDA: cleaning text
• Operate on text data using str methods
• Apply regex to identify patterns in strings
Lecture 07
This Week
[Figure: the data science lifecycle: Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions]
(Last weeks) Data Wrangling, Intro to EDA
(Today) Working with Text Data, Regular Expressions
(Next) Visualization, Code for plotting data
Agenda
• Why work with text?
• pandas str methods
• Why regex?
• Regex basics
• Regex functions
Why work with text?
Why Work With Text? Two Common Goals
1. Canonicalization: Convert data that has more than one possible presentation into a standard form.
2. Extract information into a new feature.
pandas str Methods
From String to str
In “base” Python, we have various string operations to work with text data. Recall:
split        s.split(…)
substring    s[1:4]
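As a quick refresher, the two base Python operations recalled above can be tried on a single string. The sample string is made up for illustration.

```python
# A made-up county name (illustrative only).
s = "St. John the Baptist Parish"

words = s.split(" ")   # split on spaces -> list of substrings
middle = s[1:4]        # substring: characters at positions 1, 2, and 3

print(words)
print(middle)
```

Both operations act on one string at a time, which is why applying them to a whole column calls for something vectorized.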
str Methods
Fortunately, pandas offers a way to vectorize text operations: the .str accessor.
Series.str.string_operation()
.str Methods
Most base Python string operations have a pandas str equivalent:
Operation               Python (single string)    pandas (Series of strings)
transformation          s.lower()                 ser.str.lower()
                        s.upper()                 ser.str.upper()
replacement/deletion    s.replace(…)              ser.str.replace(…)
membership              'ab' in s                 ser.str.contains(…)
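A minimal sketch of these equivalents on a small Series. The county names are invented for illustration, and `regex=False` is passed so the patterns are treated as plain text.

```python
import pandas as pd

# Illustrative data, not taken from the lecture's dataset.
ser = pd.Series(["De Witt County", "Lewis & Clark County", "St. Martin Parish"])

lowered = ser.str.lower()                              # like s.lower()
replaced = ser.str.replace("&", "and", regex=False)    # like s.replace("&", "and")
has_county = ser.str.contains("County", regex=False)   # like "County" in s

print(lowered.tolist())
print(replaced.tolist())
print(has_county.tolist())
```

Each call returns a new Series of the same length, so the calls can be chained.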
Demo 1: Canonicalization
Example:
def canonicalize_county(county_series):
    return (county_series
            .str.lower()                         # lowercase
            .str.replace(' ', '')                # remove spaces
            .str.replace('&', 'and')             # replace & with "and"
            .str.replace('.', '', regex=False)   # remove dots (a literal ".", not the regex wildcard)
            .str.replace('county', '')           # remove the word "county"
            .str.replace('parish', ''))          # remove the word "parish"
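To see why canonicalization helps, here is a sketch that applies the function to two invented Series spelling the same counties differently. The function is repeated so the block runs on its own, and the sample names are assumptions rather than the lecture's data.

```python
import pandas as pd

def canonicalize_county(county_series):
    # Same steps as the demo: lowercase, then strip spaces, "&", ".", and suffixes.
    return (county_series
            .str.lower()
            .str.replace(' ', '', regex=False)
            .str.replace('&', 'and', regex=False)
            .str.replace('.', '', regex=False)
            .str.replace('county', '', regex=False)
            .str.replace('parish', '', regex=False))

# Hypothetical spellings of the same three counties from two sources.
source_a = pd.Series(["De Witt County", "St. John the Baptist Parish", "Lewis & Clark County"])
source_b = pd.Series(["DeWitt", "St John the Baptist", "Lewis and Clark County"])

canon_a = canonicalize_county(source_a).tolist()
canon_b = canonicalize_county(source_b).tolist()
print(canon_a)
print(canon_b)
```

After canonicalization the two columns agree, so they could serve as a join key.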
Why regex?
Flexibility Matters!
We made a big assumption in the previous example: that we knew for certain what changes needed to be made to the text, by “eyeballing” the steps needed for canonicalization.
Consider our data extraction task, pulling out dates from log data:
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"
Formatting varies:
● Different # of characters before the date
● Different format for the day of the month
● Different # of characters after the date
Demo: Extracting Date Information
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
String Extraction: An Alternate Approach
While we can hack together code that uses replace/split…
pertinent = line.split("[")[1].split(']')[0]    # text between [ and ]
day, month, rest = pertinent.split('/')         # 26, Jan, 2014:10:47:58 -0800
year, hour, minute, rest = rest.split(':')      # 2014, 10, 47, 58 -0800
seconds, time_zone = rest.split(' ')            # 58, -0800
Seem impossible?
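For comparison, here is a sketch of the same extraction done with one regular expression. The pattern is an illustrative assumption, not necessarily the one shown in lecture; it uses capture groups and `re.findall`, both covered later in these slides.

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

# One capture group per field inside the square brackets.
pattern = r'\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]'

fields = re.findall(pattern, line)[0]
day, month, year, hour, minute, seconds, time_zone = fields
print(fields)
```

Because `\d+` accepts one or more digits, the same pattern also handles a one-digit day or second, as in the second log line.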
Regex Basics
A Regular Expression Describes a Set of Strings Through Patterns
A regular expression (“regex”) is a sequence of characters that specifies a search pattern.
“Regex” pronunciation?
Goals for regex
The goal of today is NOT to memorize regex!
Instead, at a high level:
1. Understand what regex is capable of.
2. Parse and create regex, with a reference table.
Regex Basics
There are four basic operations in regex.
● concatenation: “look for consecutive characters”. AABAAB matches only AABAAB.
● | : “or”. AA|BAAB matches AA or BAAB.
● * : “zero or more”. AB*A matches AA, ABA, ABBA, …
● ( ) : “consider a group”. (AB)*A matches A, ABA, ABABA, …; A(A|B)AAB matches AAAAB or ABAAB.
*, ( ), and | are called metacharacters: they represent an operation, rather than a literal text character.
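The examples above can be checked with Python's `re.fullmatch`, which succeeds only if the whole string matches. The helper name `fully_matches` is made up for this sketch.

```python
import re

def fully_matches(pattern, s):
    """Return True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# * applies only to the B directly before it.
star = [fully_matches(r"AB*A", s) for s in ["AA", "ABA", "ABBA", "ABABA"]]
# Parentheses make * apply to the whole group AB.
group = [fully_matches(r"(AB)*A", s) for s in ["A", "ABA", "ABABA", "ABBA"]]
# | inside a group chooses between alternatives.
either = [fully_matches(r"A(A|B)AAB", s) for s in ["AAAAB", "ABAAB", "ACAAB"]]

print(star, group, either)
```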
Resources for Practicing regex
There are many nice resources out there to experiment with regular expressions
(e.g. regex101.com, regexone.com, Sublime Text).
In the Python docs: the RegEx HOWTO, Not The re Module Documentation
The Python Regular Expression HOWTO: https://docs.python.org/3/howto/regex.html
Make an empty notebook and play with the examples therein. Regex101 is phenomenal for learning basic regex syntax, but less so for learning the full functionality of regular expressions in programming (matching, splitting, search and replace, group management, …).
The examples can be pasted in directly.
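As an illustration of the kind of snippet that can be pasted into a notebook, here is a short sketch in the spirit of the HOWTO's introductory matching examples. Treat it as an approximation, not a verbatim copy of the documentation.

```python
import re

p = re.compile(r"[a-z]+")   # one or more lowercase letters

no_match = p.match("")      # + needs at least one character, so this is None
m = p.match("tempo")        # a match object

print(no_match)
print(m.group(), m.span())
```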
Summary So Far
Operation           Order   Example    Matches         Doesn’t match
or, |               4       AA|BAAB    AA, BAAB        every other string
* (zero or more)    2       AB*A       AA, ABBBBBBA    AB, ABABA
. – “look for any character”: .U.U.U. matches CUMULUS, JUGULUM
[ ] – “define a character class”: [A-Za-z] matches A, a, B, b…
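A small sketch of the wildcard and a character class in Python. The word SUCCUBUS and the string passed to `re.findall` are added here as illustrative inputs.

```python
import re

# . stands for any single character, so the pattern fixes only the U positions.
dot_matches = [re.fullmatch(r".U.U.U.", w) is not None
               for w in ["CUMULUS", "JUGULUM", "SUCCUBUS"]]

# [A-Za-z] matches exactly one letter, upper or lower case.
letters = re.findall(r"[A-Za-z]", "a1 B2, c3")

print(dot_matches)
print(letters)
```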
(yes, it means these are the same: * = {0,}, + = {1,}, and ? = {0,1})
Character Classes
A character class describes a set of characters belonging to the class.
Use ^ at the start of a class to negate it, i.e., match any character other than what follows (for example, [^0-9]).
Similarly, the capitalized versions of the built-in classes \w, \d, and \s are negations: \W, \D, and \S.
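A brief sketch of negation in Python, using made-up input strings.

```python
import re

digits = re.findall(r"\d", "a1b22")       # built-in class: digits
non_digits = re.findall(r"\D", "a1b22")   # capitalized form: anything but a digit
not_ab = re.findall(r"[^ab]", "a1b22")    # ^ negates the class [ab]

# \W matches any character that is not a letter, digit, or underscore.
cleaned = re.sub(r"\W", "", "hi, there!")

print(digits, non_digits, not_ab, cleaned)
```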
Summary So Far
Operation                            Example          Matches            Doesn’t match
repeated exactly a times: {a}        j[aeiou]{3}hn    jaoehn, jooohn     jhn, jaeiouhn
repeated from a to b times: {a,b}    j[ou]{1,2}hn     john, juohn        jhn, jooohn
at least one: +                      jo+hn            john, joooooohn    jhn, jjohn
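The rows of this table can be verified in Python. The sketch below uses a made-up helper, `fully_matches`, and also checks on a few sample strings that the shorthands *, +, and ? behave like their brace forms.

```python
import re

def fully_matches(pattern, s):
    """Return True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

exactly_three = [fully_matches(r"j[aeiou]{3}hn", s)
                 for s in ["jaoehn", "jooohn", "jhn", "jaeiouhn"]]
one_or_two = [fully_matches(r"j[ou]{1,2}hn", s)
              for s in ["john", "juohn", "jhn", "jooohn"]]
at_least_one = [fully_matches(r"jo+hn", s)
                for s in ["john", "joooooohn", "jhn", "jjohn"]]

# Compare each shorthand with its brace form on a handful of strings.
samples = ["jhn", "john", "joohn", "jooohn"]
shorthand_pairs = [(r"jo*hn", r"jo{0,}hn"),
                   (r"jo+hn", r"jo{1,}hn"),
                   (r"jo?hn", r"jo{0,1}hn")]
shorthands_agree = all(fully_matches(a, s) == fully_matches(b, s)
                       for a, b in shorthand_pairs for s in samples)

print(exactly_three, one_or_two, at_least_one, shorthands_agree)
```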
Greediness
Regex is greedy: it will look for the longest possible match in a string.
<div>.*</div>
In English:
● “Look for the exact string <div>”
● then, “look for any character 0 or more times”
● then, “look for the exact string </div>”
On a line containing several <div>…</div> blocks, the greedy pattern therefore matches from the first <div> all the way to the last </div>.
We can fix this by making the pattern non-greedy. This is another meaning of the ? modifier: it tags multipliers as non-greedy.
<div>.*?</div>
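A minimal sketch contrasting the greedy and non-greedy patterns on a made-up string with two div blocks.

```python
import re

text = "<div>First</div><div>Second</div>"

greedy = re.findall(r"<div>.*</div>", text)    # .* grabs as much as possible
lazy = re.findall(r"<div>.*?</div>", text)     # .*? grabs as little as possible

print(greedy)
print(lazy)
```

The greedy version returns one match spanning both blocks, while the non-greedy version returns each block separately.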
Regex Even More Expanded
The last set.
Operation              Example     Matches            Doesn’t match
end of line: $         ark$        dark, ark o ark    ark two
escape character: \    cow\.com    cow.com            cowscom
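These two rows can be checked in Python. The sketch uses `re.search` for the anchor (the match may start anywhere but must end at the end of the string) and `re.fullmatch` for the escaped dot.

```python
import re

# $ anchors the match to the end of the string.
ends_with_ark = [re.search(r"ark$", s) is not None
                 for s in ["dark", "ark o ark", "ark two"]]

# \. matches a literal dot; an unescaped . matches any character.
escaped = [re.fullmatch(r"cow\.com", s) is not None for s in ["cow.com", "cowscom"]]
unescaped = re.fullmatch(r"cow.com", "cowscom") is not None

print(ends_with_ark, escaped, unescaped)
```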
Regex Functions
Before We Begin: Raw Strings in Python
When specifying a pattern, we strongly suggest using raw strings.
pattern = r"[0-9]+"
● A raw string is created by prepending r to the string delimiters (r"...", r'...', r"""...""", r'''...''')
● The exact reason is a bit tedious.
  ○ Rough idea: Regular expressions and Python strings both use \ as a special character.
  ○ Using non-raw strings leads to uglier regular expressions.
For more information see “The Backslash Plague” under https://docs.python.org/3/howto/regex.html#the-backslash-plague
(We stopped here on Lecture 6.)
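To make the rough idea concrete, here is a sketch using the word-boundary pattern `\b`. In an ordinary string literal, `"\b"` is the backspace character, so the non-raw pattern silently means something else.

```python
import re

plain = "\bcat\b"       # Python turns each \b into a backspace character
raw = r"\bcat\b"        # backslashes are kept, so regex sees \b (word boundary)
doubled = "\\bcat\\b"   # the non-raw way to write the same pattern: uglier

text = "cat concat cat"
raw_hits = re.findall(raw, text)      # standalone words "cat"
plain_hits = re.findall(plain, text)  # looks for backspace + "cat" + backspace

print(len(plain), len(raw), raw_hits, plain_hits)
```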
Extraction
re.findall(pattern, text)
Return a list of all matches to pattern. A match is a substring that matches the provided regex.
Example output: ['123-45-6789', '321-45-6789']
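The code that produced this output did not survive on the slide. The sketch below is a reconstruction under assumptions: the sentence and the pattern are made up, chosen so that `re.findall` returns the list shown.

```python
import re

# Hypothetical input text containing two SSN-like numbers.
text = "My social security number is 123-45-6789 bro, or maybe it's 321-45-6789."

# Three digits, a dash, two digits, a dash, four digits.
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

matches = re.findall(pattern, text)
print(matches)
```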
Extraction
re.findall(pattern, text): Return a list of all matches to pattern.
ser.str.findall(pattern): Return a Series of lists, one list of matches per element.
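A sketch of the pandas version on an invented Series. Elements with no match get an empty list.

```python
import pandas as pd

# Illustrative data only.
ser = pd.Series(["987-65-4321",
                 "forty",
                 "123-45-6789 bro or 321-45-6789",
                 "999-99-9999"])

pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
found = ser.str.findall(pattern)   # Series whose values are lists of matches

print(found.tolist())
```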
Extraction with Capture Groups
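The slide's example is missing, so here is a minimal sketch of capture groups under assumed inputs. When a pattern contains parenthesized groups, `re.findall` returns a tuple of the group contents for each match, and `ser.str.extract` returns a DataFrame with one column per group (first match only).

```python
import re
import pandas as pd

# Hypothetical sentence with a time in it.
text = "I will meet you at 08:30:00 pm tomorrow"
pattern = r"(\d\d):(\d\d):(\d\d)"    # groups: hour, minute, second

tuples = re.findall(pattern, text)   # list of tuples, one per match

ser = pd.Series([text, "no time mentioned"])
parts = ser.str.extract(pattern)     # rows with no match are filled with NaN

print(tuples)
print(parts)
```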
Substitution
re.sub(pattern, repl, text)
Return text with every match of pattern replaced by repl.
Example output: Moo
How it works:
● pattern matches HTML tags
● Then, sub/replace HTML tags with repl='' (i.e., empty string)
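The input behind the "Moo" output is not visible on the slide. The sketch below assumes a small HTML snippet and a tag-matching pattern, chosen so the result is the string shown.

```python
import re

# Hypothetical HTML input.
text = "<div><td valign='top'>Moo</td></div>"

# A tag: "<", then one or more characters that are not ">", then ">".
pattern = r"<[^>]+>"

cleaned = re.sub(pattern, "", text)   # replace every tag with the empty string
print(cleaned)
```

Using `[^>]+` instead of `.+` keeps each match inside a single tag, which sidesteps the greediness issue discussed earlier.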
Substitution
Base Python: re.sub(pattern, repl, text)
pandas: ser.str.replace(pattern, repl, …)
Example output: Moo
Base Python        pandas
s.lower()          ser.str.lower()
s.upper()          ser.str.upper()
s[1:4]             ser.str[1:4]
re.findall(…)      ser.str.findall(…)
                   ser.str.extractall(…)
                   ser.str.extract(…)
len(s)             ser.str.len()
s.strip()          ser.str.strip()
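A sketch that exercises several of these pandas methods together on made-up data. `regex=True` is passed explicitly to `ser.str.replace`, since recent pandas versions treat the pattern as literal text by default.

```python
import pandas as pd

# Illustrative data only.
ser = pd.Series(["<div>Moo</div>", "  <b>Oink</b>  "])

no_tags = ser.str.replace(r"<[^>]+>", "", regex=True)   # vectorized re.sub
stripped = no_tags.str.strip()                          # like s.strip()
lengths = stripped.str.len()                            # like len(s)
sliced = stripped.str[1:4]                              # like s[1:4]

# extractall returns one row per match; here, the name of each opening tag.
opening_tags = ser.str.extractall(r"<(\w+)>")[0].tolist()

print(stripped.tolist(), lengths.tolist(), sliced.tolist(), opening_tags)
```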
Limitations of Regular Expressions
Writing regular expressions is like writing a program.
● Need to know the syntax well.
● Can be easier to write than to read.
● Can be difficult to debug.
Regular expressions are sometimes jokingly referred to as a “write only language”.
A famous 1997 quote from Jamie Zawinski (co-creator of Firefox's predecessor):
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.