Lec 07 II Dsfa23

Uploaded by

Malik Arslan

LECTURE 7

Text Wrangling and Regex

Using string methods and regular expressions (regex) to work with
textual data

Data Science @ Knowledge Stream

Sana Jabbar

Goals for this Lecture
Lecture 07

Deal with a major challenge of EDA: cleaning text
• Operate on text data using str methods
• Apply regex to identify patterns in strings

This Week

[Figure: the data science lifecycle. Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions]

(Last weeks) Data Wrangling, Intro to EDA
(Today) Working with Text Data, Regular Expressions
(Next) Visualization, Code for plotting data

Agenda
Lecture 07

• Why work with text?
• pandas str methods
• Why regex?
• Regex basics
• Regex functions

Why Work With Text?
Lecture 07

Why Work With Text? Two Common Goals

1. Canonicalization: Convert data that has more than one possible presentation into a standard form.
   Ex: Join tables with mismatched labels

2. Extract information into a new feature.
   Ex: Extract dates and times from log files

   169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

   day, month, year = "26", "Jan", "2014"
   hour, minute, seconds = "10", "47", "58"

pandas str Methods
Lecture 07

From String to str
In "base" Python, we have various string operations to work with text data. Recall:

Operation               Python (single string)
transformation          s.lower(), s.upper()
replacement/deletion    s.replace(…)
split                   s.split(…)
substring               s[1:4]
membership              'ab' in s
length                  len(s)

Problem: Python assumes we are working with one string at a time.

Need to loop over each entry, which is slow in large datasets!

str Methods
Fortunately, pandas offers a method of vectorizing text operations: the .str accessor

    Series.str.string_operation()

Apply the function string_operation to every string contained in the Series

    populations["County"].str.lower()
    populations["County"].str.replace('&', 'and')

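To make the loop-versus-vectorized contrast concrete, here is a minimal sketch. The county names in `populations` are made up for illustration and are not the lecture's dataset.

```python
import pandas as pd

# Hypothetical data: county names in inconsistent formats.
populations = pd.DataFrame(
    {"County": ["De Witt County", "Lac qui Parle County", "St. John & Baptist Parish"]}
)

# Base Python: handle one string at a time inside an explicit loop.
lowered_loop = [name.lower() for name in populations["County"]]

# pandas: the .str accessor applies the operation to every entry of the Series.
lowered_vec = populations["County"].str.lower()

# regex=False makes pandas treat '&' as literal text, not as a regex pattern.
replaced = populations["County"].str.replace("&", "and", regex=False)

print(lowered_vec.tolist())
print(replaced.tolist())
```

Both approaches give the same strings; the `.str` version simply avoids writing the loop by hand.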
.str Methods
Most base Python string operations have a pandas str equivalent

Operation               Python (single string)    pandas (Series of strings)
transformation          s.lower()                 ser.str.lower()
                        s.upper()                 ser.str.upper()
replacement/deletion    s.replace(…)              ser.str.replace(…)
split                   s.split(…)                ser.str.split(…)
substring               s[1:4]                    ser.str[1:4]
membership              'ab' in s                 ser.str.contains(…)
length                  len(s)                    ser.str.len()

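The remaining rows of the table can be tried on a small Series. This is only a sketch; the Series `ser` and its contents are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical Series of strings.
ser = pd.Series(["Alameda County", "Contra Costa", "San Mateo County"])

parts = ser.str.split(" ")                          # like s.split(" "): a Series of lists
middle = ser.str[1:4]                               # like s[1:4]
has_county = ser.str.contains("County", regex=False)  # like 'County' in s
lengths = ser.str.len()                             # like len(s)

print(parts.tolist())
print(middle.tolist())
print(has_county.tolist())
print(lengths.tolist())
```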
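Demo 1: Canonicalization

Example

    def canonicalize_county(county_series):
        return (county_series
                .str.lower()                              # lowercase
                .str.replace(' ', '', regex=False)        # remove spaces
                .str.replace('&', 'and', regex=False)     # replace & with "and"
                .str.replace('.', '', regex=False)        # remove dots (literal '.', not regex "any character")
                .str.replace('county', '', regex=False)   # remove the word "county"
                .str.replace('parish', '', regex=False)   # remove the word "parish"
               )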
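As a rough illustration of why canonicalization helps when joining tables with mismatched labels, the sketch below applies the same chain of replacements to two small tables and then merges them. The table contents (names, populations, states) are invented for this example.

```python
import pandas as pd

def canonicalize_county(county_series):
    # Same chain of literal replacements as in the demo above.
    return (county_series
            .str.lower()
            .str.replace(" ", "", regex=False)
            .str.replace("&", "and", regex=False)
            .str.replace(".", "", regex=False)
            .str.replace("county", "", regex=False)
            .str.replace("parish", "", regex=False))

# Hypothetical tables whose county labels do not match exactly.
populations = pd.DataFrame({
    "County": ["DeWitt", "Lac Qui Parle", "St. Martin"],
    "Population": [16798, 8067, 53835],   # made-up numbers
})
states = pd.DataFrame({
    "County": ["De Witt County", "Lac qui Parle County", "St. Martin Parish"],
    "State": ["IL", "MN", "LA"],
})

# Join on the canonical form instead of the raw labels.
populations["clean"] = canonicalize_county(populations["County"])
states["clean"] = canonicalize_county(states["County"])
merged = populations.merge(states, on="clean")

print(merged[["clean", "Population", "State"]])
```

A merge on the raw `County` columns would find no matching rows here; merging on the canonical column matches all three.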

Why regex?
Lecture 07

Flexibility Matters!

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"

Formatting varies:
● Different # of characters before the date
● Different format for the day of the month
● Different # of characters after the date

We don’t always know the exact format of our data in advance.

Flexibility Matters!

We made a big assumption in the previous example: that we knew for certain what changes needed to be made to the text. In effect, we were "eyeballing" the steps needed for canonicalization.

Consider our data extraction task from before, pulling out dates from log data:

Demo: Extracting Date Information

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

day, month, year = "26", "Jan", "2014"
hour, minute, seconds = "10", "47", "58"

One possible solution:

    pertinent = line.split("[")[1].split(']')[0]
    day, month, rest = pertinent.split('/')
    year, hour, minute, rest = rest.split(':')
    seconds, time_zone = rest.split(' ')

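The split-based solution can be run against both log lines shown earlier. The sketch below wraps it in a helper; the function name `extract_datetime_fields` is an assumption made for this example.

```python
def extract_datetime_fields(line):
    # Keep only the text between the first '[' and the following ']'.
    pertinent = line.split("[")[1].split("]")[0]
    day, month, rest = pertinent.split("/")
    year, hour, minute, rest = rest.split(":")
    seconds, time_zone = rest.split(" ")
    return day, month, year, hour, minute, seconds, time_zone

line1 = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
         '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"')
line2 = ('193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] '
         '"GET /stat141/Notes/dim.html HTTP/1.0" 404 302 '
         '"http://eeyore.ucdavis.edu/stat141/Notes/session.html"')

fields1 = extract_datetime_fields(line1)
fields2 = extract_datetime_fields(line2)
print(fields1)
print(fields2)
```

This works on both lines, but only because each field happens to be separated by exactly the delimiters the code expects; any change in layout means rewriting the chain of splits.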
String Extraction: An Alternate Approach
While we can hack together code that uses replace/split…

    pertinent = line.split("[")[1].split(']')[0]
    day, month, rest = pertinent.split('/')
    year, hour, minute, rest = rest.split(':')
    seconds, time_zone = rest.split(' ')

An alternate approach is to use a regular expression:

● Implementation provided in the Python re library and the pandas str accessor.
● Next, we’ll spend some time working up to expressions like this one:

    import re
    pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'
    day, month, year, hour, minute, second, time_zone = re.findall(pattern, line)[0]
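For reference, here is the regular-expression version as a self-contained, runnable sketch. It uses the pattern from the slide unchanged and applies it to both of the log lines quoted earlier in the lecture.

```python
import re

# Pattern from the slide: seven capture groups inside the square brackets.
pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'

line1 = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
         '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"')
line2 = ('193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] '
         '"GET /stat141/Notes/dim.html HTTP/1.0" 404 302 '
         '"http://eeyore.ucdavis.edu/stat141/Notes/session.html"')

# findall returns a list of tuples, one tuple per match, one entry per group.
day, month, year, hour, minute, second, time_zone = re.findall(pattern, line1)[0]
fields2 = re.findall(pattern, line2)[0]

print(day, month, year, hour, minute, second, time_zone)
print(fields2)
```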

Related: How would you extract all the moon-like patterns in this string?

    "moon moo moooooon mon moooon"

Seem impossible?
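One possible answer, sketched under the assumption that "moon-like" means an m, then one or more o's, then an n. The slide does not pin the definition down, so a second variant requiring at least two o's is shown as well.

```python
import re

text = "moon moo moooooon mon moooon"

# Assumption: "moon-like" = 'm', one or more 'o', then 'n'.
one_or_more_o = re.findall(r"mo+n", text)

# Stricter variant: at least two 'o' characters.
two_or_more_o = re.findall(r"moo+n", text)

print(one_or_more_o)
print(two_or_more_o)
```

The `+` quantifier used here is introduced later in the lecture.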

Regex Basics
Lecture 07

A Regular Expression Describes a Set of Strings Through Patterns
A regular expression ("regex") is a sequence of characters that specifies a search pattern.

Example: [0-9]{3}-[0-9]{2}-[0-9]{4}
3 of any digit, then a dash,
then 2 of any digit, then a dash,
then 4 of any digit.

The language of Social Security Numbers is described by this regular expression.
(A formal language, described implicitly.)

"Regex" pronunciation?

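A quick way to see that this pattern describes a set of strings is to test candidate strings for membership. The sketch below uses `re.fullmatch`, which succeeds only when the entire string fits the pattern; the sample strings are made up.

```python
import re

ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

def is_ssn_shaped(s):
    # True only if the whole string has the form ddd-dd-dddd.
    return re.fullmatch(ssn_pattern, s) is not None

print(is_ssn_shaped("123-45-6789"))
print(is_ssn_shaped("123-456-789"))
print(is_ssn_shaped("123456789"))
```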
Goals for regex
The goal of today is NOT to memorize regex!
Instead:
1. Understand what regex is capable of.
2. Parse and create regex, with a reference table.
3. Use vocabulary (metacharacter, escape character, groups, etc.) to describe regex metacharacters.
4. Differentiate between (), [], {}
5. Design your own character classes with \d, \w, \s, […-…], ^, etc.
6. Use Python and pandas regex methods.

(Slide annotations: the first goals are high-level; the later ones are details to hone with practice.)

Regex Basics
There are four basic operations in regex.

Concatenation: "look for consecutive characters"
    AABAAB matches AABAAB

| : "or"
    AA|BAAB matches AA or BAAB

* : "zero or more"
    AB*A matches AA, ABA, ABBA, …

( ) : "consider a group"
    A(A|B)AAB matches AAAAB or ABAAB
    (AB)*A matches A, ABA, ABABA, …

*, ( ), and | are called metacharacters: they represent an operation, rather than a literal text character.
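The four basic operations can be checked directly in Python. This is a small sketch; the helper `fully_matches` is a hypothetical name, and it uses `re.fullmatch` so that a pattern must describe the whole string, which is how the "matches" examples on the slide are meant.

```python
import re

def fully_matches(pattern, s):
    # True when the entire string s is described by pattern.
    return re.fullmatch(pattern, s) is not None

# Concatenation
print(fully_matches(r"AABAAB", "AABAAB"))

# Or: the alternatives are the whole left side (AA) and the whole right side (BAAB)
print(fully_matches(r"AA|BAAB", "AA"), fully_matches(r"AA|BAAB", "BAAB"))

# Zero or more: * applies only to the B immediately before it
print([s for s in ["AA", "ABA", "ABBA", "AB"] if fully_matches(r"AB*A", s)])

# Grouping changes what * and | apply to
print([s for s in ["A", "ABA", "ABABA", "AA"] if fully_matches(r"(AB)*A", s)])
print([s for s in ["AAAAB", "ABAAB", "AAB"] if fully_matches(r"A(A|B)AAB", s)])
```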
Resources for Practicing regex
There are many nice resources out there to experiment with regular expressions (e.g. regex101.com, regexone.com, Sublime Text).

I recommend trying out regex101.com, which provides a visually appealing and easy to use platform for experimenting with regular expressions.
● Important: choose the Python "flavor" in the left sidebar

In the Python docs: the RegEx HOWTO, Not The re Module Documentation
The Python Regular Expression HOWTO: https://docs.python.org/3/howto/regex.html

Make an empty notebook and play with the examples therein. Regex101 is phenomenal for learning basic regex syntax, but less so for learning the full functionality of regular expressions in programming (matching, splitting, search and replace, group management, …).
The examples can be pasted in directly.

Summary So Far

Operation                          Order   Example      Matches             Doesn’t match
concatenation (consecutive chars)  3       AABAAB       AABAAB              every other string
or, |                              4       AA|BAAB      AA, BAAB            every other string
* (zero or more)                   2       AB*A         AA, ABBBBBBA        AB, ABABA
group (parenthesis)                1       A(A|B)AAB    AAAAB, ABAAB        every other string
                                           (AB)*A       A, ABABABABA        AA, ABBA

The regex order of operations. Grouping is evaluated first.

Regex Expanded
Six more regex operations.

. : "look for any character"
    .U.U.U. matches CUMULUS, JUGULUM

[ ] : "define a character class"
    [A-Za-z] matches A, a, B, b…

+ : "one or more"
    AB+ matches AB, ABB, ABBB, …

? : "zero or one" ("optional")
    AB? matches A, AB

{x} : "repeat exactly x times"
    AB{2} matches ABB

{x,y} : "repeat between x and y times"
    AB{0,2} matches A, AB, ABB

(yes, it means these are the same: * = {0,}, + = {1,}, and ? = {0,1})
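These six operations can be tried the same way. The sketch below again uses a hypothetical `fully_matches` helper built on `re.fullmatch`, with the examples from the slide plus a few non-matching strings added for contrast.

```python
import re

def fully_matches(pattern, s):
    # True when the entire string s is described by pattern.
    return re.fullmatch(pattern, s) is not None

def matching(pattern, candidates):
    # Keep only the candidates that the pattern fully describes.
    return [s for s in candidates if fully_matches(pattern, s)]

print(matching(r".U.U.U.", ["CUMULUS", "JUGULUM", "SUCCUBUS"]))   # any character
print(matching(r"[A-Za-z]", ["A", "b", "7", "ab"]))               # character class
print(matching(r"AB+", ["A", "AB", "ABBB"]))                      # one or more
print(matching(r"AB?", ["A", "AB", "ABB"]))                       # zero or one
print(matching(r"AB{2}", ["AB", "ABB", "ABBB"]))                  # exactly 2
print(matching(r"AB{0,2}", ["A", "AB", "ABB", "ABBB"]))           # between 0 and 2

# The quantifier shorthands are equivalent to their {m,n} forms.
print(matching(r"AB{1,}", ["A", "AB", "ABBB"]) == matching(r"AB+", ["A", "AB", "ABBB"]))
```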
Character Classes
A character class describes a set of characters belonging to the class.

[A-Z] : any uppercase letter between A and Z
[0-9] : any digit between 0 and 9
[A-Za-z0-9] : any letter, any digit

Regex built-in classes (equivalences stated for ASCII text):
\w is equivalent to [A-Za-z0-9_] (letters, digits, and the underscore)
\d is equivalent to [0-9]
\s matches whitespace

Use ^ as the first character inside [ ] to negate a class = match any character other than what follows

[^A-Z] : anything that is not an uppercase letter between A and Z

Similarly, the capital versions of the regex built-in classes are negations: \W, \D, and \S
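A short sketch of how the character classes behave in Python's `re` module; the sample strings are made up. Note in particular that `\w` also matches the underscore, which `[A-Za-z0-9]` does not.

```python
import re

sample = "a_1-!"

word_chars = re.findall(r"\w", sample)               # letters, digits, underscore
alnum_chars = re.findall(r"[A-Za-z0-9]", sample)     # letters and digits only
non_word_chars = re.findall(r"\W", sample)           # negation of \w

not_upper = re.findall(r"[^A-Z]", "AbC1")            # ^ negates inside [ ]
non_digits = re.findall(r"\D", "a1b2")               # negation of \d
chunks = re.findall(r"\S+", "two  words\there")      # runs of non-whitespace

print(word_chars, alnum_chars, non_word_chars)
print(not_upper, non_digits, chunks)
```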
Summary So Far

Operation                              Example          Matches               Doesn’t match
any character (except newline)         .U.U.U.          CUMULUS, JUGULUM      SUCCUBUS, TUMULTUOUS
character class                        [A-Za-z][a-z]*   word, Capitalized     camelCase, 4illegal
repeated exactly a times: {a}          j[aeiou]{3}hn    jaoehn, jooohn        jhn, jaeiouhn
repeated from a to b times: {a,b}      j[ou]{1,2}hn     john, juohn           jhn, jooohn
at least one                           jo+hn            john, joooooohn       jhn, jjohn

Greediness
Regex quantifiers are greedy by default: they look for the longest possible match in a string.

Pattern: <div>.*</div>

In English:
● "Look for the exact string <div>"
● then, "look for any character 0 or more times"
● then, "look for the exact string </div>"

"This is a <div>example</div> of greediness <div>in</div> regular expressions."

Because .* is greedy, the match runs from the first <div> all the way to the last </div>, instead of stopping at the first </div>.

We can fix this by making the pattern non-greedy: <div>.*?</div>

This is another meaning of the ? modifier! It tags multipliers as non-greedy.
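The difference between the greedy and non-greedy patterns is easy to see with `re.findall` on the sentence from the slide:

```python
import re

text = ("This is a <div>example</div> of greediness "
        "<div>in</div> regular expressions.")

# Greedy: .* grabs as much as possible, so one long match spans both tags.
greedy = re.findall(r"<div>.*</div>", text)

# Non-greedy: .*? grabs as little as possible, so each tag pair matches separately.
non_greedy = re.findall(r"<div>.*?</div>", text)

print(greedy)
print(non_greedy)
```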
Regex Even More Expanded
The last set.

\ : "read the next character literally"
    a\+b matches a+b

^ : "match the beginning of a string"
    ^abc does not match "123 abc"

$ : "match the end of a string"
    abc$ does not match "abc 123"

Be careful: ^ has different behavior inside/outside of character classes!
Summary So Far

Operation            Example     Matches               Doesn’t match
beginning of line    ^ark        ark two, ark o ark    dark
end of line          ark$        dark, ark o ark       ark two
escape character     cow\.com    cow.com               cowscom

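The anchors and the escape character can be checked with `re.search`, which looks for the pattern anywhere in the string unless an anchor restricts it. A minimal sketch using the table's examples; the helper name `found` is hypothetical.

```python
import re

def found(pattern, s):
    # True if the pattern occurs somewhere in s (subject to any anchors).
    return re.search(pattern, s) is not None

print(found(r"^ark", "ark two"), found(r"^ark", "dark"))       # anchored at the start
print(found(r"ark$", "dark"), found(r"ark$", "ark two"))       # anchored at the end

# Escaped dot is a literal '.', unescaped dot is "any character".
print(found(r"cow\.com", "cow.com"), found(r"cow\.com", "cowscom"))
print(found(r"cow.com", "cowscom"))

# Inside a character class, ^ means negation rather than "start of string".
print(re.findall(r"[^a]", "banana"))
```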

Regex Functions
Lecture 07

Before We Begin: Raw Strings in Python
When specifying a pattern, we strongly suggest using raw strings.

    pattern = r"[0-9]+"

● A raw string is created by prepending r to the string delimiters (r"...", r'...', r"""...""", r'''...''')
● The exact reason is a bit tedious.
  ○ Rough idea: Regular expressions and Python strings both use \ as a special character.
  ○ Using non-raw strings leads to uglier regular expressions.

For more information see "The Backslash Plague" under https://docs.python.org/3/howto/regex.html#the-backslash-plague

(We stopped here on Lecture 6.)
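A small sketch of what the raw-string prefix changes. In a regular string the backslash has to be doubled to reach the regex engine intact; in a raw string it is passed through as written. The sample text is made up.

```python
import re

# Same regex, written two ways.
plain = "\\d+"    # regular string: backslash doubled
raw = r"\d+"      # raw string: backslash written once
print(plain == raw)

# Escape sequences are interpreted in regular strings but not in raw strings.
print(len("\n"), len(r"\n"))   # one newline character vs. a backslash plus 'n'

numbers = re.findall(raw, "room 12, floor 3")
print(numbers)

# Matching one literal backslash: two characters in a raw pattern,
# four in a regular string.
path = "C:\\temp"              # the text C:\temp
print(re.findall(r"\\", path) == re.findall("\\\\", path))
```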
Extraction

re.findall(pattern, text)
Return a list of all matches to pattern. A match is a substring that matches the provided regex.

    text = "My social security number is 123-45-6789 bro, or actually maybe it’s 321-45-6789."
    pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
    re.findall(pattern, text)

    ['123-45-6789', '321-45-6789']

ser.str.findall(pattern)
Returns a Series of lists.

    df["SSN"].str.findall(pattern)

    0                  [987-65-4321]
    1                             []
    2    [123-45-6789, 321-45-6789]
    3                  [999-99-9999]
    Name: SSN, dtype: object
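Both forms of extraction can be run side by side. The DataFrame below is a hypothetical stand-in, built so that its output has the same shape as the one on the slide; the lecture's actual `df` is not shown in the source.

```python
import re
import pandas as pd

text = ("My social security number is 123-45-6789 bro, "
        "or actually maybe it's 321-45-6789.")
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# Single string: a plain list of matches.
matches = re.findall(pattern, text)

# Hypothetical Series of free-text entries.
df = pd.DataFrame({"SSN": [
    "987-65-4321",
    "forty",
    "123-45-6789 bro or 321-45-6789",
    "999-99-9999",
]})

# Series: one list of matches per row (empty list when nothing matches).
ssn_lists = df["SSN"].str.findall(pattern)

print(matches)
print(ssn_lists)
```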
Extraction with Capture Groups
Earlier we used parentheses to specify the order of operations.

Parentheses can have another meaning:

● When using certain regex functions, ( ) specifies a capture group.
● Extract only the portion of the regex pattern inside the capture group

    text = """I will meet you at 08:30:00 pm tomorrow"""
    pattern = r".*(\d\d):(\d\d):(\d\d).*"   # raw string; the capture groups each capture two digits
    matches = re.findall(pattern, text)
    matches

    [('08', '30', '00')]

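To see what capture groups change, it helps to run `re.findall` with zero, one, and several groups on the same text. This is a sketch based on the slide's example sentence:

```python
import re

text = "I will meet you at 08:30:00 pm tomorrow"

# No groups: findall returns the full matched substrings.
whole = re.findall(r"\d\d:\d\d:\d\d", text)

# One group: findall returns only what the group captured.
hours_only = re.findall(r"(\d\d):\d\d:\d\d", text)

# Several groups: findall returns one tuple per match.
parts = re.findall(r"(\d\d):(\d\d):(\d\d)", text)
hour, minute, second = parts[0]

print(whole)
print(hours_only)
print(parts, hour, minute, second)
```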
Extraction with Capture Groups

    pattern_cg = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

ser.str.extract(pattern)
Returns a DataFrame of each capture group’s first match in the string

    df["SSN"].str.extract(pattern_cg)

ser.str.extractall(pattern)
Returns a multi-indexed DataFrame of all matches for each capture group

    df["SSN"].str.extractall(pattern_cg)

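The sketch below runs both methods on a small, made-up `df` so the difference in output shape is visible: `extract` gives one row per original row, while `extractall` gives one row per match.

```python
import pandas as pd

pattern_cg = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

# Hypothetical data: one clean entry, one with no match, one with two matches.
df = pd.DataFrame({"SSN": [
    "987-65-4321",
    "forty",
    "123-45-6789 bro or 321-45-6789",
    "999-99-9999",
]})

# One row per input row; only the first match is kept, NaN where there is none.
first_matches = df["SSN"].str.extract(pattern_cg)

# One row per match; the index has two levels: (original row, match number).
all_matches = df["SSN"].str.extractall(pattern_cg)

print(first_matches)
print(all_matches)
```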
Substitution

re.sub(pattern, repl, text)
Returns text with all instances of pattern replaced by repl.

    text = '<div><td valign="top">Moo</td></div>'
    pattern = r"<[^>]+>"
    re.sub(pattern, '', text)   # returns Moo

    Moo

How it works:
● pattern matches HTML tags
● Then, sub/replace HTML tags with repl='' (i.e., empty string)

ser.str.replace(pattern, repl, regex=True)
Returns Series with all instances of pattern in Series ser replaced by repl.

    df["Html"].str.replace(pattern, '', regex=True)

    0          Moo
    1         Link
    2    Bold text
    Name: Html, dtype: object

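A runnable sketch of both substitution forms. The `Html` column is invented to reproduce the output shown on the slide. Passing `regex=True` explicitly matters because recent pandas versions treat the pattern as literal text by default.

```python
import re
import pandas as pd

pattern = r"<[^>]+>"   # '<', then one or more characters that are not '>', then '>'

text = '<div><td valign="top">Moo</td></div>'
stripped = re.sub(pattern, "", text)

# Hypothetical column of HTML snippets.
df = pd.DataFrame({"Html": [
    '<div><td valign="top">Moo</td></div>',
    '<a href="http://example.com">Link</a>',
    "<b>Bold text</b>",
]})

# regex=True tells pandas to interpret pattern as a regular expression.
cleaned = df["Html"].str.replace(pattern, "", regex=True)

print(stripped)
print(cleaned)
```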
String Function Summary

Base Python     re               pandas str
s.lower()                        ser.str.lower()
s.upper()                        ser.str.upper()
s.replace(…)    re.sub(…)        ser.str.replace(…)
s.split(…)      re.split(…)      ser.str.split(…)
s[1:4]                           ser.str[1:4]
                re.findall(…)    ser.str.findall(…)
                                 ser.str.extractall(…)
                                 ser.str.extract(…)
'ab' in s       re.search(…)     ser.str.contains(…)
len(s)                           ser.str.len()
s.strip()                        ser.str.strip()

Limitations of Regular Expressions
Writing regular expressions is like writing a program.
● Need to know the syntax well.
● Can be easier to write than to read.
● Can be difficult to debug.

Regular expressions are sometimes jokingly referred to as a "write only language".
A famous 1997 quote from Jamie Zawinski (co-creator of Firefox's predecessor):

    Some people, when confronted with a problem, think "I know, I'll use
    regular expressions." Now they have two problems.

Regular expressions are terrible at certain types of problems:

● For parsing a hierarchical structure, such as JSON, use the json.load() parser, not regex!
● Parsing real-world HTML/XML (lots of <div>...<tag>..</tag>..</div>): use html.parser.
● Complex features (e.g. valid email address).
● Counting (same number of instances of a and b). (impossible)
● Complex properties (palindromes, balanced parentheses). (impossible)
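As a brief illustration of the first point, here is a sketch of reading a nested value with the `json` module rather than a regex. The JSON document is made up, and `json.loads` is used because the input is a string; `json.load` is the counterpart for file objects.

```python
import json

# Hypothetical nested record; the nesting is what makes regex a poor fit.
raw = '{"user": {"name": "Ada", "logins": [{"year": 2014, "count": 3}, {"year": 2015, "count": 7}]}}'

record = json.loads(raw)

# Navigate the hierarchy with ordinary indexing instead of pattern matching.
name = record["user"]["name"]
total_logins = sum(entry["count"] for entry in record["user"]["logins"])

print(name, total_logins)
```

The parser handles arbitrary nesting depth, escaped quotes, and whitespace, all of which a hand-written pattern would need to anticipate.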
Start Work on notebook
