HW - Regex: 1 Instructions HW - Regular Expression - 10 Points
[2]: import os
import re
import json
import pandas as pd
import numpy as np
from pathlib import Path
import tarfile
import warnings
warnings.filterwarnings("ignore")
Mounted at /content/drive
2 Task1: Download data and combine data from multiple files into
a single dataframe - 2 Points.
In this task you have to download the movie reviews from the following link:
https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
Instructions:
• The data has movie reviews from four different reviewers:
(1) Dennis+Schwartz, (2) James+Berardinelli, (3) Scott+Renshaw and (4) Steve+Rhodes.
• You have to extract the reviews of the four reviewers into a single dataframe.
• The final dataframe should have two columns: (1) Movie Review and (2) Reviewer Name.
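The steps in the instructions can be sketched with the standard library alone. This is a minimal sketch, not the notebook's own solution: the helper name `collect_reviews` is hypothetical, the folder layout `<reviewer>/txt.parag/` is assumed from the path used later in this notebook, and a tiny stand-in archive is built so the example runs without the real download.

```python
import tarfile
import tempfile
from pathlib import Path


def collect_reviews(root):
    """Return (review_text, reviewer_name) pairs found under root/<reviewer>/txt.parag/."""
    rows = []
    for reviewer_dir in sorted(Path(root).iterdir()):
        parag_dir = reviewer_dir / "txt.parag"  # assumed layout
        if not parag_dir.is_dir():
            continue
        for review_file in sorted(parag_dir.iterdir()):
            text = review_file.read_text(encoding="utf8", errors="ignore")
            rows.append((text, reviewer_dir.name))
    return rows


# Build a small stand-in archive (hypothetical content) instead of downloading.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src" / "scale_whole_review"
    for reviewer in ["Dennis+Schwartz", "Steve+Rhodes"]:
        parag = src / reviewer / "txt.parag"
        parag.mkdir(parents=True)
        (parag / "1.txt").write_text(f"review by {reviewer}", encoding="utf8")

    archive = Path(tmp) / "sample.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="scale_whole_review")

    # Extract the archive, then walk the reviewer folders.
    out_dir = Path(tmp) / "out"
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)

    rows = collect_reviews(out_dir / "scale_whole_review")

print(rows)
```

With pandas available, the pairs could then be turned into the two-column dataframe with something like `pd.DataFrame(rows, columns=["Movie Review", "Reviewer Name"])`.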
[44]: basepath1 = str(movie_rev)
[105]: import os
label_names = []
movies = []
main_path = "/content/drive/MyDrive/movie_rege1/scale_whole_review"
for path in os.scandir(main_path):
    # each reviewer folder keeps its reviews under txt.parag
    sub_path = os.path.join(main_path, path.name, "txt.parag")
    if os.path.isdir(sub_path):
        for each_path in os.listdir(sub_path):
            full_path = os.path.join(sub_path, each_path)
            # read the file and close it afterwards
            with open(full_path, encoding="utf8", errors="ignore") as f:
                movie = f.read()
            # add the review text and its reviewer label to the lists
            movies.append(movie)
            label_names.append(path.name)
print("****DONE READING*************")
****DONE READING*************
[106]: (5006, 5006)
HTTP request sent, awaiting response… 301 Moved Permanently
Location: https://www.trackmyhashtag.com/data/COVID-19.zip [following]
--2021-09-18 06:20:55-- https://www.trackmyhashtag.com/data/COVID-19.zip
Connecting to www.trackmyhashtag.com (www.trackmyhashtag.com)|138.197.74.186|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 13772919 (13M) [application/zip]
Saving to: ‘/content/drive/MyDrive/movie_rege/COVID-19.zip’
Archive: /content/drive/MyDrive/movie_rege/COVID-19.zip
inflating: COVID.csv
inflating: COVID-images.csv
inflating: COVID-videos.csv
df.columns = ["text"]
# preview
df.head()
[39]: text
0 Also the entire Swiss Football League is on ho…
1 World Health Org Official: Trump’s press confe…
2 I mean, Liberals are cheer-leading this #Coron…
3 Under repeated questioning, Pompeo refuses to …
4 #coronavirus comments now from @larry_kudlow h…
    for word in text.split():
        if re.search(hashtag_pattern, word):
            replace = re.findall(hashtag_pattern, word)[0].lower()
            text = re.sub(hashtag_pattern, replace, text)
            #print(text)
        else:
            text = text
    return text
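The loop above calls `re.sub` on the whole tweet, so every hashtag in the text appears to be overwritten by the one found in the current word. A possible alternative, assuming the goal is simply to lowercase each hashtag in place, is to pass a function as the replacement. The pattern `#\w+` and the name `lowercase_hashtags` are assumptions for illustration; the notebook's own `hashtag_pattern` is not shown.

```python
import re

# Assumed pattern: a '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#\w+")


def lowercase_hashtags(text):
    """Lowercase each hashtag individually, leaving the rest of the text alone."""
    return HASHTAG_PATTERN.sub(lambda match: match.group(0).lower(), text)


print(lowercase_hashtags("Stay safe #Coronavirus #TDS now"))
```

Because the replacement function receives one match at a time, each hashtag keeps its own text and only its case changes.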
# preview
df.head()
[40]: text                                              clean_text
0     Also the entire Swiss Football League is on ho…  Also the entire Swiss Football League is on ho…
1     World Health Org Official: Trump’s press confe…  World Health Org Official: Trump’s press confe…
2     I mean, Liberals are cheer-leading this #Coron…  I mean, Liberals are cheer-leading this #tds l…
3     Under repeated questioning, Pompeo refuses to …  Under repeated questioning, Pompeo refuses to …
4     #coronavirus comments now from @larry_kudlow h…  #coronavirus comments now from @larry_kudlow h…
df['clean_text'] = df['text'].apply(remove_RT)
# preview
df.tail()
[41]: text                                              clean_text
60155 El #coronavirus entérico felino es un virus in…  El #coronavirus entérico felino es un virus in…
60156 RT @timhquotes: It's my party, you're invited!…  @timhquotes: It's my party, you're invited!\n…
60157 It's my party, you're invited!\n\nPS, this is …  It's my party, you're invited!\n\nPS, this is …
60158 Amy’s a survivor! #bariclab #pnnl #movingon #c…  Amy’s a survivor! #bariclab #pnnl #movingon #c…
60159 A review of asymptomatic and sub-clinical Midd…  A review of asymptomatic and sub-clinical Midd…
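The definition of `remove_RT` is not shown here. Judging only from the preview, where a leading `RT ` disappears from row 60156, it might look roughly like the sketch below. The name `remove_rt` and the pattern are assumptions, not the notebook's actual code.

```python
import re


def remove_rt(text):
    """Drop a retweet marker ('RT' plus whitespace) at the start of a tweet."""
    # '^' anchors the match to the start, so 'RT' inside a word is untouched.
    return re.sub(r"^RT\s+", "", text)


print(remove_rt("RT @timhquotes: It's my party, you're invited!"))
```

Anchoring at the start of the string is a deliberate choice in this sketch: it avoids stripping the letters "RT" from ordinary words such as "START".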
3.4 2.4 Remove URLs and links from text in column clean_text
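[42]: # function to remove urls
def remove_urls(text):
    text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)
    return text
# note: this reads from the raw 'text' column, so the earlier RT removal is not
# carried over; apply to df['clean_text'] instead to chain the cleaning steps
df['clean_text'] = df['text'].apply(remove_urls)
# preview
df.tail(10)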
[42]: text                                              clean_text
60150 “There may be specific interactions that are c…  “There may be specific interactions that are c…
60151 El #coronavirus entérico felino (FECV) es un v…  El #coronavirus entérico felino (FECV) es un v…
60152 Mediante microscopía electrónica, investigador…  Mediante microscopía electrónica, investigador…
60153 El #coronavirus entérico felino es un virus in…  El #coronavirus entérico felino es un virus in…
60154 RT @timhquotes: It's my party, you're invited!…  RT @timhquotes: It's my party, you're invited!…
60155 El #coronavirus entérico felino es un virus in…  El #coronavirus entérico felino es un virus in…
60156 RT @timhquotes: It's my party, you're invited!…  RT @timhquotes: It's my party, you're invited!…
60157 It's my party, you're invited!\n\nPS, this is …  It's my party, you're invited!\n\nPS, this is …
60158 Amy’s a survivor! #bariclab #pnnl #movingon #c…  Amy’s a survivor! #bariclab #pnnl #movingon #c…
60159 A review of asymptomatic and sub-clinical Midd…  A review of asymptomatic and sub-clinical Midd…
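The character class `[A-Za-z0-9./]` stops at the first hyphen, question mark, or other symbol, so a link such as `https://t.co/abc-123?x=1` would leave `-123?x=1` behind. A broader sketch, assuming that a URL ends at the next whitespace, is shown below. The name `remove_urls_broad` and the sample tweet are hypothetical.

```python
import re

# Assumption: a link runs until the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def remove_urls_broad(text):
    """Remove http(s) links and bare www. links up to the next whitespace."""
    return URL_PATTERN.sub("", text)


print(remove_urls_broad("See https://t.co/abc-123?x=1 now"))
```

The trade-off is that trailing punctuation glued to a link, such as a closing parenthesis, is removed along with it.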
[28]: # function to remove punctuations
def remove_punct(text):
    import string
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text
# apply the function
df['clean_text'] = df['text'].apply(remove_punct)
# preview
df.head()
[28]: text                                              clean_text
0     Also the entire Swiss Football League is on ho…  Also the entire Swiss Football League is on ho…
1     World Health Org Official: Trump’s press confe…  World Health Org Official Trump’s press confer…
2     I mean, Liberals are cheer-leading this #Coron…  I mean Liberals are cheerleading this Coronavi…
3     Under repeated questioning, Pompeo refuses to …  Under repeated questioning Pompeo refuses to s…
4     #coronavirus comments now from @larry_kudlow h…  coronavirus comments now from larrykudlow here…
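`string.punctuation` covers ASCII punctuation only, which is why the curly apostrophe in "Trump’s" survives in row 1 of the preview. Since this homework is about regular expressions, a regex-based variant could be sketched as follows. The name `remove_punct_regex` is hypothetical, and unlike the `translate` version it keeps underscores, because `_` counts as a word character.

```python
import re


def remove_punct_regex(text):
    """Remove every character that is neither a word character nor whitespace."""
    # In Python 3, \w is Unicode-aware, so accented letters such as 'é' are kept
    # while symbols like the curly apostrophe are removed.
    return re.sub(r"[^\w\s]", "", text)


print(remove_punct_regex("Trump’s press, #Corona!"))
print(remove_punct_regex("El #coronavirus entérico"))
```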
[31]: text                                              …  num_hashtags
60155 El #coronavirus entérico felino es un virus in…  …  [coronavirus, enfermedades, gatos, veterinaria]
60156 RT @timhquotes: It's my party, you're invited!…  …  [Q, DevilSticks, TimAndEricDotCom, Matthew, Ch…
60157 It's my party, you're invited!\n\nPS, this is …  …  [Q, DevilSticks, TimAndEricDotCom, Matthew, Ch…
60158 Amy’s a survivor! #bariclab #pnnl #movingon #c…  …  [bariclab, pnnl, movingon, coronavirus, bsl3, …
60159 A review of asymptomatic and sub-clinical Midd…  …  [Coronavirus]
[5 rows x 3 columns]
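The cell that produced the hashtag lists is not shown. One way such lists could be obtained, assuming a hashtag is a `#` followed by word characters, is `re.findall` with a capturing group so that the `#` itself is dropped. The name `extract_hashtags` is hypothetical.

```python
import re


def extract_hashtags(text):
    """Return the hashtags in a tweet, without the leading '#'."""
    # With one capturing group, findall returns only the captured part.
    return re.findall(r"#(\w+)", text)


print(extract_hashtags("Amy’s a survivor! #bariclab #pnnl #movingon"))
```

Applied to a column, this would be used as `df['text'].apply(extract_hashtags)`; a tweet without hashtags yields an empty list.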
# preview
df.head()
[5 rows x 4 columns]
[5 rows x 6 columns]