
HW_regex

September 18, 2021

1 Instructions HW - Regular Expression - 10 Points


• You have to submit two files for this part of the HW: (1) ipynb (Colab notebook) and (2) pdf file.
• You have to use only regular expressions for this HW. Please do not use spaCy for the tasks in this notebook.

[2]: import os
import re
import json
import pandas as pd
import numpy as np
from pathlib import Path
import tarfile

import warnings
warnings.filterwarnings("ignore")

[3]: from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

2 Task 1: Download data and combine data from multiple files into a single dataframe - 2 Points.

In this task you have to download the movie reviews from the following link:
https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz

Instructions:
• The data has movie reviews from four different reviewers: (1) Dennis+Schwartz, (2) James+Berardinelli, (3) Scott+Renshaw and (4) Steve+Rhodes.
• You have to extract the reviews of the four reviewers into a single dataframe.
• The final dataframe should have two columns: (1) Movie Review and (2) Reviewer Name.

[43]: folder = Path('/content/drive/MyDrive/')
movie_rev = folder / 'movie_rege1'
!mkdir {str(movie_rev)}

[44]: basepath1 = str(movie_rev)

[45]: url = 'https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz'
!wget {url} -P {basepath1}

--2021-09-18 07:22:45-- https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)… 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 8853204 (8.4M) [application/x-gzip]
Saving to: ‘/content/drive/MyDrive/movie_rege1/scale_whole_review.tar.gz’

scale_whole_review. 100%[===================>] 8.44M 10.7MB/s in 0.8s

2021-09-18 07:22:47 (10.7 MB/s) - ‘/content/drive/MyDrive/movie_rege1/scale_whole_review.tar.gz’ saved [8853204/8853204]

[108]: !tar -xvf '/content/drive/MyDrive/movie_rege1/scale_whole_review.tar.gz' -C "/content/drive/MyDrive/movie_rege1"

[105]: label_names = []
movies = []
main_path = "/content/drive/MyDrive/movie_rege1/scale_whole_review"
for path in os.scandir(main_path):
    sub_path = main_path + f"/{path.name}/txt.parag"
    if os.path.isdir(sub_path):
        for each_path in os.listdir(sub_path):
            full_path = os.path.join(sub_path, each_path)
            # read the file, ignoring undecodable bytes
            movie = open(full_path, encoding="utf8", errors='ignore').read()
            # add the review text and the reviewer name to the lists
            movies.append(movie)
            label_names.append(path.name)
print("****DONE READING*************")

****DONE READING*************
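For reference, the same traversal can be written more compactly with pathlib's glob. This is just an equivalent sketch: it assumes the extracted layout scale_whole_review/<reviewer>/txt.parag/<review file>, and uses fresh lists movies_alt / labels_alt (hypothetical names) so it does not double-append to the lists above.

[ ]: # equivalent pathlib-based sketch (assumes the same directory layout)
from pathlib import Path

root = Path(main_path)
movies_alt, labels_alt = [], []
for review_file in root.glob("*/txt.parag/*"):
    # each file's grandparent directory is the reviewer name
    movies_alt.append(review_file.read_text(encoding="utf8", errors="ignore"))
    labels_alt.append(review_file.parent.parent.name)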

[106]: # check lengths
len(label_names), len(movies)

[106]: (5006, 5006)

[109]: # create a dataframe for the results
df = pd.DataFrame()

# add the data to the df
df['Movie Review'] = movies
df['Reviewer Name'] = label_names

[110]: # preview the results
df.head(10)

[110]: Movie Review Reviewer Name


0 DENNIS SCHWARTZ "Movie Reviews and Poetry"\nUN… Dennis+Schwartz
1 A brilliant, witty mock documentary of Jean Se… Dennis+Schwartz
2 NOSTALGHIA (director: Andrei Tarkovsky; cast: … Dennis+Schwartz
3 PAYBACK (director: Brian Helgeland; cast:(Port… Dennis+Schwartz
4 WAKING NED DEVINE (director: Kirk Jones (III);… Dennis+Schwartz
5 HAPPINESS (director: Todd Solondz; cast: Dylan… Dennis+Schwartz
6 LEON MORIN, PRIEST (director: Jean-Pierre Melv… Dennis+Schwartz
7 LES BICHES (THE DOES)(director: Claude Chabrol… Dennis+Schwartz
8 FUNNY GAMES ( director: Michael Haneke; cast: … Dennis+Schwartz
9 MEN WITH GUNS (director: John Sayles; cast: Fe… Dennis+Schwartz

3 Task 2: We will perform the following tasks - 8 Points

• Download data (using wget)
• Clean data using regular expressions

[4]: folder = Path('/content/drive/MyDrive/')
movie_regex = folder / 'movie_rege'
!mkdir {str(movie_regex)}

[6]: basepath = str(movie_regex)

3.1 2.1 Download the data and create dataframe

Download the dataset from the following URL: “http://www.trackmyhashtag.com/data/COVID-19.zip”

[7]: # Now we will use wget to get the data
url = 'http://www.trackmyhashtag.com/data/COVID-19.zip'
!wget {url} -P {basepath}

--2021-09-18 06:20:55-- http://www.trackmyhashtag.com/data/COVID-19.zip
Resolving www.trackmyhashtag.com (www.trackmyhashtag.com)… 138.197.74.186
Connecting to www.trackmyhashtag.com (www.trackmyhashtag.com)|138.197.74.186|:80… connected.
HTTP request sent, awaiting response… 301 Moved Permanently
Location: https://www.trackmyhashtag.com/data/COVID-19.zip [following]
--2021-09-18 06:20:55-- https://www.trackmyhashtag.com/data/COVID-19.zip
Connecting to www.trackmyhashtag.com (www.trackmyhashtag.com)|138.197.74.186|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 13772919 (13M) [application/zip]
Saving to: ‘/content/drive/MyDrive/movie_rege/COVID-19.zip’

COVID-19.zip 100%[===================>] 13.13M 14.4MB/s in 0.9s

2021-09-18 06:20:56 (14.4 MB/s) - ‘/content/drive/MyDrive/movie_rege/COVID-19.zip’ saved [13772919/13772919]

[9]: !unzip /content/drive/MyDrive/movie_rege/COVID-19.zip

Archive: /content/drive/MyDrive/movie_rege/COVID-19.zip
inflating: COVID.csv
inflating: COVID-images.csv
inflating: COVID-videos.csv

[38]: # read the file


data2 = pd.read_csv("COVID.csv")

[39]: # extract only the Tweet Content column into a new dataframe
df = pd.DataFrame(data2['Tweet Content'])
df.columns = ["text"]

# preview
df.head()

[39]: text
0 Also the entire Swiss Football League is on ho…
1 World Health Org Official: Trump’s press confe…
2 I mean, Liberals are cheer-leading this #Coron…
3 Under repeated questioning, Pompeo refuses to …
4 #coronavirus comments now from @larry_kudlow h…

3.2 2.2 Change the hashtags to lower case


e.g. replace #Coronavirus with #coronavirus
Create a new column clean_text. The clean_text should have modified hashtags.

[40]: hashtag_pattern = r"#\w+"

def convert_hashtag_tolower(text):
    # lower-case each hashtag match in place; a callable replacement
    # handles every hashtag independently, instead of re.sub'ing all
    # hashtags in the tweet with the first match found
    return re.sub(hashtag_pattern, lambda m: m.group(0).lower(), text)

# apply the function to convert
df["clean_text"] = df['text'].apply(convert_hashtag_tolower)

# preview
df.head()

[40]: text clean_text
0 Also the entire Swiss Football League is on ho… Also the entire Swiss Football League is on ho…
1 World Health Org Official: Trump’s press confe… World Health Org Official: Trump’s press confe…
2 I mean, Liberals are cheer-leading this #Coron… I mean, Liberals are cheer-leading this #coron…
3 Under repeated questioning, Pompeo refuses to … Under repeated questioning, Pompeo refuses to …
4 #coronavirus comments now from @larry_kudlow h… #coronavirus comments now from @larry_kudlow h…
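As a quick sanity check, the callable replacement lower-cases each hashtag independently while leaving the rest of the tweet untouched. The sample string below is made up for illustration, not taken from the dataset:

[ ]: # illustrative check on a made-up string
convert_hashtag_tolower("Stay home! #COVID19 #StaySafe @WHO")
# expected: 'Stay home! #covid19 #staysafe @WHO'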

3.3 2.3 Remove RT from text in column clean_text

[41]: # remove the retweet marker RT as a whole word (not the letters R and T)
def remove_RT(text):
    return re.sub(r'\bRT\b', "", text)

# apply to clean_text so the earlier cleaning step is preserved
df['clean_text'] = df['clean_text'].apply(remove_RT)

# preview
df.tail()

[41]: text clean_text
60155 El #coronavirus entérico felino es un virus in… El #coronavirus entérico felino es un virus in…
60156 RT @timhquotes: It's my party, you're invited!… @timhquotes: It's my party, you're invited!\n…
60157 It's my party, you're invited!\n\nPS, this is … It's my party, you're invited!\n\nPS, this is …
60158 Amy’s a survivor! #bariclab #pnnl #movingon #c… Amy’s a survivor! #bariclab #pnnl #movingon #c…
60159 A review of asymptomatic and sub-clinical Midd… A review of asymptomatic and sub-clinical Midd…
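The word boundaries matter here: a character class such as r'[RT]+' removes the letters R and T wherever they occur, mangling ordinary words. A small illustration on made-up strings:

[ ]: # why \bRT\b rather than a character class (made-up examples)
re.sub(r'[RT]+', '', "RT Trump tweeted")   # -> ' rump tweeted' (word mangled)
re.sub(r'\bRT\b', '', "RT Trump tweeted")  # -> ' Trump tweeted'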

3.4 2.4 Remove URLs and links from text in column clean_text

[42]: # function to remove urls; \S+ also matches hyphens, query strings, etc.
def remove_urls(text):
    return re.sub(r'https?://\S+', '', text)

# apply to clean_text so the earlier cleaning steps are preserved
df['clean_text'] = df['clean_text'].apply(remove_urls)

# preview
df.tail(10)

[42]: text clean_text
60150 “There may be specific interactions that are c… “There may be specific interactions that are c…
60151 El #coronavirus entérico felino (FECV) es un v… El #coronavirus entérico felino (FECV) es un v…
60152 Mediante microscopía electrónica, investigador… Mediante microscopía electrónica, investigador…
60153 El #coronavirus entérico felino es un virus in… El #coronavirus entérico felino es un virus in…
60154 RT @timhquotes: It's my party, you're invited!… @timhquotes: It's my party, you're invited!…
60155 El #coronavirus entérico felino es un virus in… El #coronavirus entérico felino es un virus in…
60156 RT @timhquotes: It's my party, you're invited!… @timhquotes: It's my party, you're invited!…
60157 It's my party, you're invited!\n\nPS, this is … It's my party, you're invited!\n\nPS, this is …
60158 Amy’s a survivor! #bariclab #pnnl #movingon #c… Amy’s a survivor! #bariclab #pnnl #movingon #c…
60159 A review of asymptomatic and sub-clinical Midd… A review of asymptomatic and sub-clinical Midd…
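The \S+ in the pattern is deliberately broad: a narrower class like [A-Za-z0-9./] would stop matching at characters such as hyphens and query strings that are common in tweet URLs. A quick check on a made-up link:

[ ]: # made-up URL for illustration
remove_urls("read this https://example.com/covid-19?src=tw now")
# expected: 'read this  now'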

3.5 2.5 Removing Punctuations from text in column clean_text.

Hint: Use the following function
text.translate(str.maketrans('', '', string.punctuation))

[28]: import string

# function to remove punctuation
def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# apply to clean_text so the earlier cleaning steps are preserved
df['clean_text'] = df['clean_text'].apply(remove_punct)

# preview
df.head()

[28]: text clean_text
0 Also the entire Swiss Football League is on ho… Also the entire Swiss Football League is on ho…
1 World Health Org Official: Trump’s press confe… World Health Org Official Trump’s press confer…
2 I mean, Liberals are cheer-leading this #Coron… I mean Liberals are cheerleading this coronavi…
3 Under repeated questioning, Pompeo refuses to … Under repeated questioning Pompeo refuses to s…
4 #coronavirus comments now from @larry_kudlow h… coronavirus comments now from larrykudlow here…
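Note that string.punctuation includes the # and @ characters, so this step also strips hashtag and mention markers from clean_text; that is why the extraction steps in 2.6 and 2.7 below work on the original text column instead. A quick look:

[ ]: # string.punctuation covers # and @ as well as ordinary punctuation
print(string.punctuation)   # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
remove_punct("#coronavirus update from @WHO")
# expected: 'coronavirus update from WHO'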

3.6 2.6 Extract number of Hashtags in a new column num_hashtags

[31]: # extract hashtags
def extract_hashtags(text):
    hash_tag_pattern = r"#\w+"
    hashtag_list = re.findall(hash_tag_pattern, text)
    # strip the leading # from each hashtag
    hashtag_list = [word[1:] for word in hashtag_list]
    return hashtag_list

# apply the function; the column holds the list of hashtags (counted in 2.8)
df['num_hashtags'] = df['text'].apply(extract_hashtags)

# preview
df.tail()

[31]: text … num_hashtags
60155 El #coronavirus entérico felino es un virus in… … [coronavirus, enfermedades, gatos, veterinaria]
60156 RT @timhquotes: It's my party, you're invited!… … [Q, DevilSticks, TimAndEricDotCom, Matthew, Ch…
60157 It's my party, you're invited!\n\nPS, this is … … [Q, DevilSticks, TimAndEricDotCom, Matthew, Ch…
60158 Amy’s a survivor! #bariclab #pnnl #movingon #c… … [bariclab, pnnl, movingon, coronavirus, bsl3, …
60159 A review of asymptomatic and sub-clinical Midd… … [Coronavirus]

[5 rows x 3 columns]

3.7 2.7 Extract number of user mentions in a new column num_mentions

[32]: # extract all user mentions from the text
def extract_user_mentions(text):
    mention_pattern = r"@\w+"
    user_mentions = re.findall(mention_pattern, text)
    # strip the @ symbol
    user_mentions = [word[1:] for word in user_mentions]
    return user_mentions

# apply the function; the column holds the list of mentions (counted in 2.8)
df['num_mentions'] = df['text'].apply(extract_user_mentions)

# preview
df.head()

[32]: text … num_mentions
0 Also the entire Swiss Football League is on ho… … []
1 World Health Org Official: Trump’s press confe… … []
2 I mean, Liberals are cheer-leading this #Coron… … []
3 Under repeated questioning, Pompeo refuses to … … []
4 #coronavirus comments now from @larry_kudlow h… … [larry_kudlow]

[5 rows x 4 columns]

3.8 2.8 Count number of mentions and hashtags


[33]: # count the mentions
df['total_mentions'] = df['num_mentions'].apply(len)

[36]: # count the total number of hashtags
df['total_hashtags'] = df['num_hashtags'].apply(len)

[37]: # preview the data
df.head()

[37]: text … total_hashtags
0 Also the entire Swiss Football League is on ho… … 1
1 World Health Org Official: Trump’s press confe… … 1
2 I mean, Liberals are cheer-leading this #Coron… … 2
3 Under repeated questioning, Pompeo refuses to … … 1
4 #coronavirus comments now from @larry_kudlow h… … 1

[5 rows x 6 columns]

[ ]:
