
Lec 07 II Dsfa23

Uploaded by

Malik Arslan

LECTURE 7

Text Wrangling and Regex


Using string methods and regular expressions (regex) to work with
textual data

Data Science@ Knowledge Stream


Sana Jabbar

1
Goals for this Lecture
Lecture 07

Deal with a major challenge of EDA: cleaning text
• Operate on text data using str methods
• Apply regex to identify patterns in strings

2
This Week

[Figure: the data science lifecycle, with stages Question & Problem Formulation; Data Acquisition; Exploratory Data Analysis; Prediction and Inference; Reports, Decisions, and Solutions]

(Last weeks) Data Wrangling; Intro to EDA
(Today) Working with Text Data; Regular Expressions
(Next) Visualization; Code for plotting data

3
Agenda
Lecture 07

• Why work with text?
• pandas str methods
• Why regex?
• Regex basics
• Regex functions

4
Why Work With Text?
Lecture 07

5
Why Work With Text? Two Common Goals
1. Canonicalization: Convert data that has more than one possible presentation into a standard form.

Ex: Join tables with mismatched labels

2. Extract information into a new feature.

Ex: Extract dates and times from log files

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

day, month, year = "26", "Jan", "2014"
hour, minute, seconds = "10", "47", "58"

7
pandas str Methods
Lecture 07

8
From String to str
In “base” Python, we have various string operations to work with text data. Recall:

Operation              Python (single string)
transformation         s.lower()
                       s.upper()
replacement/deletion   s.replace(…)
split                  s.split(…)
substring              s[1:4]
membership             'ab' in s
length                 len(s)

Problem: Python assumes we are working with one string at a time.
Need to loop over each entry – slow in large datasets!

9
str Methods
Fortunately, pandas offers a way to vectorize text operations: the .str accessor.

Series.str.string_operation()

Applies the function string_operation to every string contained in the Series.

populations["County"].str.lower()
populations["County"].str.replace('&', 'and')
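To make the contrast with base Python concrete, here is a minimal sketch. The Series `counties` and its values are hypothetical and are not taken from the lecture's `populations` table; the point is only that the explicit loop and the `.str` call produce the same result.

```python
import pandas as pd

# Hypothetical county names (an assumption for illustration only).
counties = pd.Series(["De Witt County", "Lewis & Clark County", "St. Louis County"])

# Base Python works on one string at a time, so we loop explicitly.
lowered_loop = [name.lower() for name in counties]

# The .str accessor applies the same operation to every entry of the Series.
lowered_series = counties.str.lower()

# Literal replacement of '&' on every entry (regex=False: treat '&' as plain text).
with_and = counties.str.replace('&', 'and', regex=False)

print(lowered_series.tolist())
print(with_and.tolist())
```

Both approaches return the same strings; the `.str` version simply keeps the result as a Series, so it can be chained with further `.str` calls.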

10
.str Methods
Most base Python string operations have a pandas str equivalent.

Operation              Python (single string)   pandas (Series of strings)
transformation         s.lower()                ser.str.lower()
                       s.upper()                ser.str.upper()
replacement/deletion   s.replace(…)             ser.str.replace(…)
split                  s.split(…)               ser.str.split(…)
substring              s[1:4]                   ser.str[1:4]
membership             'ab' in s                ser.str.contains(…)
length                 len(s)                   ser.str.len()

11
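Demo 1: Canonicalization

Example

def canonicalize_county(county_series):
    # Assumes county_series is a pandas Series of county-name strings.
    # regex=False treats each pattern as literal text; this matters for '.',
    # which as a regex would match every character.
    return (county_series
            .str.lower()                              # lowercase
            .str.replace(' ', '', regex=False)        # remove spaces
            .str.replace('&', 'and', regex=False)     # replace &
            .str.replace('.', '', regex=False)        # remove dots
            .str.replace('county', '', regex=False)   # remove the word "county"
            .str.replace('parish', '', regex=False)   # remove the word "parish"
            )

12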

Why regex?
Lecture 07

13
Flexibility Matters!

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"

Formatting varies:
● Different # of characters before the date
● Different format for the day of the month
● Different # of characters after the date

We don’t always know the exact format of our data in advance.

14
Flexibility Matters!

In the previous example we made a big assumption: that we knew for certain which changes needed to be made to the text. We were “eyeballing” the steps needed for canonicalization.

Consider our data extraction task from before – pulling out dates from log data:

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"

15
Demo: Extracting Date Information

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

day, month, year = "26", "Jan", "2014"
hour, minute, seconds = "10", "47", "58"

One possible solution:

Example

# line holds the log entry shown above
pertinent = line.split("[")[1].split(']')[0]
day, month, rest = pertinent.split('/')
year, hour, minute, rest = rest.split(':')
seconds, time_zone = rest.split(' ')

16
String Extraction: An Alternate Approach
While we can hack together code that uses replace/split…

pertinent = line.split("[")[1].split(']')[0]
day, month, rest = pertinent.split('/')
year, hour, minute, rest = rest.split(':')
seconds, time_zone = rest.split(' ')

An alternate approach is to use a regular expression:
● Implementation provided in the Python re library and the pandas str accessor.
● Next, we’ll spend some time working up to expressions like this one:

import re
pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'
day, month, year, hour, minute, second, time_zone = re.findall(pattern, line)[0]
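As a quick check, the sketch below applies the slide's pattern to both log lines from the “Flexibility Matters!” slides. The helper name `parse_timestamp` is made up for this illustration; only `pattern` and the log lines come from the lecture.

```python
import re

# The two log lines from the earlier slides.
lines = [
    '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"',
    '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"',
]

pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'

def parse_timestamp(line):
    """Return (day, month, year, hour, minute, second, time_zone), or None if nothing matches."""
    matches = re.findall(pattern, line)
    return matches[0] if matches else None

parsed = [parse_timestamp(line) for line in lines]
for fields in parsed:
    print(fields)
# ('26', 'Jan', '2014', '10', '47', '58', '-0800')
# ('2', 'Feb', '2005', '17', '23', '6', '-0800')
```

Because `\d+` accepts one or more digits, the same pattern handles both the two-digit day `26` and the one-digit day `2`, without depending on how many characters come before or after the date.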

Related: How would you extract all the moon-like patterns in this string?

"moon moo moooooon mon moooon"

Seem impossible?

17
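One possible answer, as a sketch: it assumes “moon-like” means an `m`, then one or more `o`s, then an `n`. That reading is an interpretation, not something the slide states, and it uses the `+` (“one or more”) operation.

```python
import re

text = "moon moo moooooon mon moooon"

# m, then one or more o's, then n (assumed meaning of "moon-like").
moon_like = re.findall(r"mo+n", text)

print(moon_like)  # ['moon', 'moooooon', 'mon', 'moooon']
```

Note that `moo` is skipped because it has no trailing `n`.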
Regex Basics
Lecture 07

18
A Regular Expression Describes a Set of Strings Through Patterns
A regular expression (“regex”) is a sequence of characters that specifies a search
pattern.

Example: [0-9]{3}-[0-9]{2}-[0-9]{4}
3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.

The language of Social Security Numbers is described by this regular expression.
Formal language, described implicitly

“Regex” pronunciation?
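To see the “set of strings” idea in action, this small sketch tests a few candidate strings against the Social Security Number pattern above. The candidate strings are invented for illustration; `re.fullmatch` requires the whole string to match.

```python
import re

ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# Made-up candidates: one in the described set, two outside it.
candidates = ["123-45-6789", "123-456-789", "12a-45-6789"]

results = {c: re.fullmatch(ssn_pattern, c) is not None for c in candidates}
print(results)
# {'123-45-6789': True, '123-456-789': False, '12a-45-6789': False}
```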

19
Goals for regex
The goal of today is NOT to memorize regex!
Instead:
1. Understand what regex is capable of.
2. Parse and create regex, with a reference table.
3. Use vocabulary (metacharacter, escape character, groups, etc.) to describe regex metacharacters.
4. Differentiate between (), [], {}
5. Design your own character classes with \d, \w, \s, […-…], ^, etc.
6. Use Python and pandas regex methods.

(Slide annotations: “high-level” beside the first items; “details; hone with practice” beside the later ones.)

20
Regex Basics
There are four basic operations in regex.

Concatenation – “look for consecutive characters”
AABAAB matches AABAAB

| – “or”
AA|BAAB matches AA or BAAB

* – “zero or more”
AB*A matches AA, ABA, ABBA, …

( ) – “consider a group”
(AB)*A matches A, ABA, ABABA, …
A(A|B)AAB matches AAAAB or ABAAB

*, ( ), and | are called metacharacters – they represent an operation, rather than a literal text character
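The four operations can be checked directly in Python. This is a minimal sketch: the helper `matches` is a name introduced here, and it uses `re.fullmatch` so that a pattern must describe the entire string.

```python
import re

def matches(pattern, s):
    """True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# (pattern, string) pairs taken from the examples on the slide,
# plus a few strings that should NOT match.
cases = [
    ("AABAAB", "AABAAB"),      # concatenation
    ("AA|BAAB", "AA"),         # or: left alternative
    ("AA|BAAB", "BAAB"),       # or: right alternative
    ("AA|BAAB", "AABAAB"),     # neither alternative covers the whole string
    ("AB*A", "AA"),            # zero B's
    ("AB*A", "ABBA"),          # two B's
    ("AB*A", "ABABA"),         # * applies only to B, so this fails
    ("(AB)*A", "ABABA"),       # * applies to the whole group AB
    ("(AB)*A", "ABBA"),        # BB is not a repetition of AB
    ("A(A|B)AAB", "ABAAB"),    # group limits the scope of |
]

results = [matches(p, s) for p, s in cases]
for (p, s), ok in zip(cases, results):
    print(f"{p!r:12} {s!r:10} {ok}")
```

Comparing `AB*A` with `(AB)*A` on the string `ABABA` shows why grouping matters: without parentheses, `*` binds only to the single character before it.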
21
Resources for Practicing regex
There are many nice resources out there to experiment with regular expressions
(e.g. regex101.com, regexone.com, Sublime Text).

I recommend trying out regex101.com, which provides a visually appealing and easy to use platform for experimenting with regular expressions.
● Important: choose the Python “flavor” in the left sidebar

22
In the Python docs: the RegEx HOWTO, Not The re Module Documentation
The Python Regular Expression HOWTO:
https://docs.python.org/3/howto/regex.html

Make an empty notebook and play with the examples therein. Regex101 is
phenomenal for learning basic regex syntax, but less so for learning the full
functionality of regular expressions in programming (matching, splitting, search
and replace, group management, …).
The examples can be pasted in directly:

23
Summary So Far

Operation                           Order   Example     Matches           Doesn’t match
concatenation (consecutive chars)   3       AABAAB      AABAAB            every other string
or, |                               4       AA|BAAB     AA, BAAB          every other string
* (zero or more)                    2       AB*A        AA, ABBBBBBA      AB, ABABA
group (parenthesis)                 1       A(A|B)AAB   AAAAB, ABAAB      every other string
                                            (AB)*A      A, ABABABABA      AA, ABBA

The regex order of operations. Grouping is evaluated first.

25
Regex Expanded
Six more regex operations.

. – “look for any character”
.U.U.U. matches CUMULUS, JUGULUM

[ ] – “define a character class”
[A-Za-z] matches A, a, B, b…

+ – “one or more”
AB+ matches AB, ABB, ABBB, …

? – “zero or one” ("optional")
AB? matches A, AB

{x} – “repeat exactly x times”
AB{2} matches ABB

{x,y} – “repeat between x and y times”
AB{0,2} matches A, AB, ABB

(yes, it means these are the same: * = {0,}, + = {1,}, and ? = {0,1})

26
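A short sketch of the six operations, again using a hypothetical helper `matches` built on `re.fullmatch`. The test strings come from the slide where possible; the non-matching ones are added for contrast.

```python
import re

def matches(pattern, s):
    """True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

cases = [
    (r".U.U.U.", "CUMULUS"),   # any character in the odd positions
    (r".U.U.U.", "SUCCUBUS"),  # wrong length and wrong letters
    (r"[A-Za-z]", "b"),        # one character from the class
    (r"[A-Za-z]", "7"),        # digit is not in the class
    (r"AB+", "A"),             # + needs at least one B
    (r"AB+", "ABBB"),
    (r"AB?", "AB"),            # ? allows zero or one B
    (r"AB?", "ABB"),
    (r"AB{2}", "ABB"),         # exactly two B's
    (r"AB{0,2}", "ABBB"),      # at most two B's
]

results = [matches(p, s) for p, s in cases]
print(results)

# The shorthand quantifiers agree with their {x,y} spellings on this sample.
sample = ["A", "AB", "ABB", "ABBB"]
star_same = all(matches(r"AB*", s) == matches(r"AB{0,}", s) for s in sample)
plus_same = all(matches(r"AB+", s) == matches(r"AB{1,}", s) for s in sample)
opt_same = all(matches(r"AB?", s) == matches(r"AB{0,1}", s) for s in sample)
print(star_same, plus_same, opt_same)
```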
Character Classes
A character class describes a set of characters belonging to the class.

[A-Z] – any uppercase letter between A and Z
[0-9] – any digit between 0 and 9
[A-Za-z0-9] – any letter, any digit

Regex built-in classes (equivalences shown for ASCII text):
\w is equivalent to [A-Za-z0-9_] (note that it includes the underscore)
\d is equivalent to [0-9]
\s matches whitespace

Use ^ to negate a class = match any character other than what follows

[^A-Z] – anything that is not an uppercase letter between A and Z

Equivalently, the capital versions of the regex built-in classes are negations: \W, \D, and \S

27
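The sketch below runs a few character classes over one made-up string (the text is an assumption for illustration) so the differences between `\d`, `\w`, and a negated class are visible side by side.

```python
import re

# Invented sample text containing letters, digits, an underscore, and punctuation.
text = "user_42 paid $3.50 on 2014-01-26"

digits = re.findall(r"\d+", text)               # runs of digits
words = re.findall(r"\w+", text)                # runs of letters, digits, or underscore
others = re.findall(r"[^A-Za-z0-9\s]", text)    # anything that is not a letter, digit, or whitespace

print(digits)  # ['42', '3', '50', '2014', '01', '26']
print(words)   # ['user_42', 'paid', '3', '50', 'on', '2014', '01', '26']
print(others)  # ['_', '$', '.', '-', '-']
```

The underscore shows up inside `user_42` for `\w+` but is reported as an "other" character by the handwritten class `[^A-Za-z0-9\s]`, which is the practical difference between `\w` and `[A-Za-z0-9]`.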
Summary So Far

Operation                           Example          Matches             Doesn’t match
any character (except newline)      .U.U.U.          CUMULUS, JUGULUM    SUCCUBUS, TUMULTUOUS
character class                     [A-Za-z][a-z]*   word, Capitalized   camelCase, 4illegal
repeated exactly a times: {a}       j[aeiou]{3}hn    jaoehn, jooohn      jhn, jaeiouhn
repeated from a to b times: {a,b}   j[ou]{1,2}hn     john, juohn         jhn, jooohn
at least one                        jo+hn            john, joooooohn     jhn, jjohn
28
Greediness
Regex is greedy – it will look for the longest possible match in a string

<div>.*</div>

In English:
● “Look for the exact string <div>”
● then, “look for any character 0 or more times”
● then, “look for the exact string </div>”

“This is a <div>example</div> of greediness <div>in</div> regular expressions.”

We can fix this by making the pattern non-greedy:

<div>.*?</div>

This is another meaning of the ? modifier! It tags multipliers as non-greedy.

31
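Running both patterns on the slide's sentence shows the difference. This is a small sketch using `re.findall`; the variable names are chosen here for illustration.

```python
import re

text = "This is a <div>example</div> of greediness <div>in</div> regular expressions."

# Greedy: .* grabs as much as possible, so the match runs to the LAST </div>.
greedy = re.findall(r"<div>.*</div>", text)

# Non-greedy: .*? grabs as little as possible, so each <div>...</div> is matched separately.
non_greedy = re.findall(r"<div>.*?</div>", text)

print(greedy)      # ['<div>example</div> of greediness <div>in</div>']
print(non_greedy)  # ['<div>example</div>', '<div>in</div>']
```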
Regex Even More Expanded
The last set.

\ – “read the next character literally”
a\+b matches a+b

^ – “match the beginning of a string”
^abc does not match “123 abc”

$ – “match the end of a string”
abc$ does not match “abc 123”

Be careful: ^ has different behavior inside/outside of character classes!
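A brief sketch of escaping and anchors, including the two meanings of `^`. It uses `re.search`, which looks for the pattern anywhere in the string unless an anchor restricts it; the helper `found` is a name introduced here.

```python
import re

def found(pattern, s):
    """True if pattern occurs somewhere in s."""
    return re.search(pattern, s) is not None

escaped_plus = found(r"a\+b", "a+b")     # \+ is a literal plus sign
unescaped_plus = found(r"a+b", "a+b")    # a+ means "one or more a", so no match here

starts_ok = found(r"^abc", "abc 123")    # abc is at the beginning
starts_bad = found(r"^abc", "123 abc")   # abc is not at the beginning
ends_ok = found(r"abc$", "123 abc")      # abc is at the end
ends_bad = found(r"abc$", "abc 123")     # abc is not at the end

# Inside a character class, ^ negates instead of anchoring.
not_abc = re.findall(r"[^abc]", "abcd")

print(escaped_plus, unescaped_plus)               # True False
print(starts_ok, starts_bad, ends_ok, ends_bad)   # True False True False
print(not_abc)                                    # ['d']
```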
32
Summary So Far

Operation           Example    Matches              Doesn’t match
beginning of line   ^ark       ark two, ark o ark   dark
end of line         ark$       dark, ark o ark      ark two
escape character    cow\.com   cow.com              cowscom
33
Regex Functions
Lecture 07

34
Before We Begin: Raw Strings in Python
When specifying a pattern, we strongly suggest using raw strings.

pattern = r"[0-9]+"

● A raw string is created by prepending r to the string delimiters (r"...", r'...', r"""...""", r'''...''')
● The exact reason is a bit tedious.
  ○ Rough idea: Regular expressions and Python strings both use \ as a special character.
  ○ Using non-raw strings leads to uglier regular expressions.

For more information see “The Backslash Plague” under
https://docs.python.org/3/howto/regex.html#the-backslash-plague

We stopped here on Lecture 6
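The “rough idea” can be made concrete with a few lines. This sketch uses only the standard `re` module; the sample text is invented for illustration.

```python
import re

# In an ordinary string literal, Python interprets the backslash before regex ever sees it.
newline_len = len("\n")     # 1: a single newline character
raw_len = len(r"\n")        # 2: a backslash followed by the letter n

# Without a raw string, every regex backslash has to be doubled.
same_pattern = ("\\d+\\.\\d+" == r"\d+\.\d+")

# "\b" is a backspace character in a normal string, but a word boundary in a regex.
text = "a cat sat"
non_raw_matches = re.findall("\bcat\b", text)   # searches for backspace + cat + backspace
raw_matches = re.findall(r"\bcat\b", text)      # searches for the whole word cat

print(newline_len, raw_len, same_pattern)  # 1 2 True
print(non_raw_matches, raw_matches)        # [] ['cat']
```

The `\b` case is the kind of silent failure that raw strings prevent: the non-raw pattern is valid, it just matches something entirely different.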
35
Extraction
re.findall(pattern, text)
Return a list of all matches to pattern.

text = "My social security number is 123-45-6789 bro, or actually maybe it’s 321-45-6789."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text)

['123-45-6789', '321-45-6789']

A match is a substring that matches the provided regex.

36
Extraction
re.findall(pattern, text)
Return a list of all matches to pattern.

ser.str.findall(pattern)
Returns a Series of lists.

pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
df["SSN"].str.findall(pattern)

0                 [987-65-4321]
1                            []
2    [123-45-6789, 321-45-6789]
3                 [999-99-9999]
Name: SSN, dtype: object
37
Extraction with Capture Groups
Earlier we used parentheses to specify the order of operations.

Parenthesis can have another meaning:


● When using certain regex functions, ( ) specifies a capture group.
● Extract only the portion of the regex pattern inside the capture group

text = """I will meet you at 08:30:00 pm tomorrow"""


The capture groups each
pattern = ".*(\d\d):(\d\d):(\d\d).*"
capture two digits.
matches = re.findall(pattern, text)
matches

[('08', '30', '00')]

38
Extraction with Capture Groups

ser.str.extract(pattern)
Returns a DataFrame of each capture group’s first match in the string

ser.str.extractall(pattern)
Returns a multi-indexed DataFrame of all matches for each capture group

pattern_cg = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"
df["SSN"].str.extract(pattern_cg)
df["SSN"].str.extractall(pattern_cg)
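The slide does not show the resulting DataFrames, so here is a minimal sketch. The contents of `df` are hypothetical, chosen only so that one row has no match and one row has two; the lecture's actual `df` is not shown in the slides.

```python
import pandas as pd

# Hypothetical data (an assumption): row 1 has no SSN, row 2 has two.
df = pd.DataFrame({"SSN": [
    "987-65-4321",
    "forty",
    "123-45-6789 bro or 321-45-6789",
    "999-99-9999",
]})

pattern_cg = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

# One row per original row; only the FIRST match is kept, NaN where nothing matches.
first_matches = df["SSN"].str.extract(pattern_cg)

# One row per match; the index has two levels: original row label and match number.
all_matches = df["SSN"].str.extractall(pattern_cg)

print(first_matches)
print(all_matches)
```

With this data, `extract` keeps four rows (row 1 is all NaN), while `extractall` drops the row without a match and gives row 2 two entries, one per SSN found.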

39
Substitution
re.sub(pattern, repl, text)
Returns text with all instances of pattern replaced by repl.

text = '<div><td valign="top">Moo</td></div>'
pattern = r"<[^>]+>"
re.sub(pattern, '', text) # returns Moo

Moo

How it works:
● pattern matches HTML tags
● Then, sub/replace HTML tags with repl='' (i.e., empty string)
40
Substitution
re.sub(pattern, repl, text)
Returns text with all instances of pattern replaced by repl.

text = '<div><td valign="top">Moo</td></div>'
pattern = r"<[^>]+>"
re.sub(pattern, '', text) # returns Moo

Moo

How it works:
● pattern matches HTML tags
● Then, sub/replace HTML tags with repl='' (i.e., empty string)

ser.str.replace(pattern, repl, regex=True)
Returns Series with all instances of pattern in Series ser replaced by repl.

df["Html"].str.replace(pattern, '', regex=True)

0          Moo
1         Link
2    Bold text
Name: Html, dtype: object
41
String Function Summary

Base Python     re              pandas str
s.lower()                       ser.str.lower()
s.upper()                       ser.str.upper()
s.replace(…)    re.sub(…)       ser.str.replace(…)
s.split(…)      re.split(…)     ser.str.split(…)
s[1:4]                          ser.str[1:4]
                re.findall(…)   ser.str.findall(…)
                                ser.str.extractall(…)
                                ser.str.extract(…)
'ab' in s       re.search(…)    ser.str.contains(…)
len(s)                          ser.str.len()
s.strip()                       ser.str.strip()

42
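Two functions in the table, `re.split` and `re.search`, are not demonstrated elsewhere in the lecture. The sketch below shows one way they might be used; the sample strings are invented for illustration.

```python
import re

# re.split: like s.split, but the separator is a pattern.
# Here: split on any run of commas, semicolons, or whitespace.
line = "apple, banana;cherry   date"
parts = re.split(r"[,;\s]+", line)

# re.search: like 'ab' in s, but for a pattern. It returns a match object or None.
match = re.search(r"\d{4}", "Filed in 2014")
year = match.group() if match else None
has_ab = re.search(r"ab", "cab") is not None

print(parts)         # ['apple', 'banana', 'cherry', 'date']
print(year, has_ab)  # 2014 True
```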
Limitations of Regular Expressions
Writing regular expressions is like writing a program.
● Need to know the syntax well.
● Can be easier to write than to read.
● Can be difficult to debug.
Regular expressions are sometimes jokingly referred to as a “write only language”.
A famous 1997 quote from Jamie Zawinski (co-creator of Firefox's predecessor):
“Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.”

Regular expressions are terrible at certain types of problems:
● For parsing a hierarchical structure, such as JSON, use the json.load() parser, not regex!
● Parsing real-world HTML/XML (lots of <div>...<tag>..</tag>..</div>): use html.parser.
● Complex features (e.g. valid email address).
● Counting (same number of instances of a and b). (impossible)
● Complex properties (palindromes, balanced parentheses). (impossible)

43
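As a sketch of the first two recommendations, the example below uses the standard-library `json` and `html.parser` modules instead of regex. The JSON text and the class name `TextCollector` are made up for illustration; `json.loads` is the string counterpart of `json.load`, which reads from a file object.

```python
import json
from html.parser import HTMLParser

# Hierarchical data: let the JSON parser handle the nesting.
raw = '{"course": {"name": "Lecture 07", "topics": ["str methods", "regex"]}}'
record = json.loads(raw)
topics = record["course"]["topics"]

# HTML: subclass HTMLParser and collect only the text between tags.
class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text that is not part of a tag.
        self.chunks.append(data)

collector = TextCollector()
collector.feed('<div><td valign="top">Moo</td></div>')
collector.close()
text = "".join(collector.chunks)

print(topics)  # ['str methods', 'regex']
print(text)    # Moo
```

For this simple snippet the parser gives the same result as the `re.sub` approach from the Substitution slide, but it keeps working when tags are nested or contain `>` inside attribute values, where a short regex tends to break.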
Start Work on notebook

44
