N.G. ACHARYA & D.K. MARATHE COLLEGE OF
ARTS, SCIENCE & COMMERCE
(Affiliated to University of Mumbai)
MUMBAI-400 071

PRACTICAL JOURNAL
PSCSP516c
Web Data Analytics

SUBMITTED BY
PAWAN PRABHUNATH MALLAHA
SEAT NO :
2023-2024

N.G. ACHARYA & D.K. MARATHE COLLEGE OF ARTS, SCIENCE & COMMERCE
CERTIFICATE
This is to certify that Mr. Pawan Prabhunath Mallaha, Seat No. , studying in Master of Science in Computer Science Part I Semester II, has satisfactorily completed the practicals of PSCSP516c Web Data Analytics as prescribed by the University of Mumbai during the academic year 2023-24.
Practical 1
Aim: Scrape details such as color, dimensions and material, or customer ratings by feature, from an e-commerce site.
For this project, we will use ParseHub, a free and powerful web scraping tool that can work with any website. Make sure to download and install ParseHub for free before getting started.
Scraping Amazon Product Data
For this example, we will scrape product data from Amazon.com’s
results page for “computer monitor”. We will extract information
available both on the results page and information available on each
of the product pages.
Getting Started
1. First, make sure to download and install ParseHub. We will use this
web scraper for this project.
2. Open ParseHub, click on “New Project” and use the URL from
Amazon’s result page. The page will now be rendered inside the app.
Scraping Amazon Results Page
1. Once the site is rendered, click on the product name of the first result
on the page. In this case, we will ignore the sponsored listings. The
name you’ve clicked will become green to indicate that it’s been
selected.
2. The rest of the product names will be highlighted in yellow. Click on
the second one on the list. Now all of the items will be highlighted in
green.
3. On the left sidebar, rename your selection to product. You will notice
that ParseHub is now extracting the product name and URL for each
product.
4. On the left sidebar, click the PLUS(+) sign next to the product
selection and choose the Relative Select command.
5. Using the Relative Select command, click on the first product name
on the page and then on its listing price. You will see an arrow
connect the two selections.
6. Expand the new command you’ve created and then delete the URL
that is also being extracted by default.
7. Repeat steps 4 through 6 to also extract the product star rating, the
number of reviews and product image. Make sure to rename your
new selections accordingly.
Pro Tip: The method above will only extract the image URL for each
product. Want to download the actual image file from the site? Read
our guide on how to scrape and download images with ParseHub.
We have now selected all the data we wanted to scrape from the
results page. Your project should now look like this:
Scraping Amazon Product Page
Now, we will tell ParseHub to click on each of the products we’ve
selected and extract additional data from each page. In this case, we
will extract the product ASIN, Screen Size and Screen Resolution.
1. First, on the left sidebar, click on the 3 dots next to the
main_template text.
2. Rename your template to search_results_page. Templates help
ParseHub keep different page layouts separate.
3. Now use the PLUS(+) button next to the product selection and choose
the “Click” command. A pop-up will appear asking you if this link is a
“next page” button. Click “No” and next to Create New Template
input a new template name, in this case, we will use product_page.
4. ParseHub will now automatically create this new template and
render the Amazon product page for the first product on the list.
5. Scroll down the “Product Information” part of the page and using the
Select command, click on the first element of the list. In this case, it
will be the Screen Size item.
6. Like we have done before, keep on selecting the items until they all
turn green. Rename this selection to labels.
7. Expand the labels selection and remove the begin new entry in labels
command.
8. Now click the PLUS(+) sign next to the labels selection and use the
Conditional command. This will allow us to only pull some of the info
from these items.
9. For our first Conditional command, we will use the following
expression:
$e.text.contains("Screen Size")
10. We will then use the PLUS(+) sign next to our conditional
command to add a Relative Select command. We will now use this
Relative Select command to first click on the Screen Size text and
then on the actual measurement next to it (in this case, 21.5 inches).
11. Now ParseHub will extract the product’s screen size into its own
column. We can copy-paste the conditional command we just
created to pull other information. Just make sure to edit the
conditional expression. For example, the ASIN expression will be:
$e.text.contains("ASIN")
12. Lastly, make sure that your conditional selections are aligned
properly so they are not nested amongst themselves. You can drag
and drop the selections to fix this. The final template should look like
this:
Want to scrape reviews as well? Check our guide on how to Scrape
Amazon reviews using a free web scraper.
Adding Pagination
Now, you might want to scrape several pages worth of data for this
project. So far, we are only scraping page 1 of the search results. Let’s
set up ParseHub to navigate to the next 10 results pages.
5. Now, click on the PLUS(+) sign of your next_button selection and use
the Click command.
6. A pop-up will appear asking if this is a “Next” link. Click Yes and enter
the number of pages you’d like to navigate to. In this case, we will
scrape 9 additional pages.
Running and Exporting your Project
Now that we are done setting up the project, it’s time to run our
scrape job.
On the left sidebar, click on the "Get Data" button and click on the
"Run" button to run your scrape. For longer projects, we recommend
doing a Test Run to verify that your data will be formatted correctly.
After the scrape job is completed, you will now be able to download
all the information you’ve requested as a handy spreadsheet or as
a JSON file.
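Once exported, the spreadsheet can be loaded into Python for further analysis. A minimal sketch (the file name here is just a placeholder for whatever you named your export):

import pandas as pd

# load the CSV exported from ParseHub and preview it
results = pd.read_csv("amazon_monitors.csv")
print(results.head())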
Practical 2
Aim: Scrape an online social media site for data. Use Python to scrape information from Twitter.
Installation
pip install ntscraper
How to use
The log_level parameter of the Nitter scraper controls how much logging is shown:
● None = no logs
● 0 = only warning and error logs
● 1 = previous + informational logs (default)
The skip_instance_check parameter is used to skip the check of the Nitter instances altogether during the execution of the script. If you use your own instance or trust the instance you are relying on, you can set it to True; otherwise it is better to leave it as False.
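A minimal initialisation sketch following the library's documented interface (the argument values are just examples):

from ntscraper import Nitter

# log_level=1 keeps informational logs; skip_instance_check=False keeps the default
# check of available Nitter instances before scraping
scraper = Nitter(log_level=1, skip_instance_check=False)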
Then, choose the proper function for what you want to do from the following.
Scrape tweets
github_hash_tweets = scraper.get_tweets("github", mode='hashtag')
Parameters: the search term, the scraping mode ('term', 'hashtag' or 'user') and the number of tweets to fetch.

import pandas as pd
from ntscraper import Nitter

scraper = Nitter(0)

def get_tweets(name, modes, no):
    # fetch the tweets and keep only the fields we need
    tweets = scraper.get_tweets(name, mode=modes, number=no)
    final_tweets = []
    for x in tweets['tweets']:
        data = [x['link'], x['text'], x['date'], x['stats']['likes'], x['stats']['comments']]
        final_tweets.append(data)
    # collect the results in a DataFrame
    dat = pd.DataFrame(final_tweets, columns=['link', 'text', 'date', 'likes', 'comments'])
    return dat

data = get_tweets('narendramodi', 'user', 6)
print(data)
Output (the Hindi tweet text column is omitted here; link, date, likes and comments are shown):

   link                                                            date                        likes  comments
0  https://twitter.com/narendramodi/status/1778352254736592998#m  Apr 11, 2024 · 9:19 AM UTC  2587   202
1  https://twitter.com/narendramodi/status/1778351927404765363#m  Apr 11, 2024 · 9:18 AM UTC  1559   108
2  https://twitter.com/narendramodi/status/1778351302357000565#m  Apr 11, 2024 · 9:16 AM UTC  1646   147
3  https://twitter.com/narendramodi/status/1778350943462891774#m  Apr 11, 2024 · 9:14 AM UTC  1866   155
4  https://twitter.com/narendramodi/status/1778350356293926953#m  Apr 11, 2024 · 9:12 AM UTC  4923   178
5  https://twitter.com/narendramodi/status/1778347711336427917#m  Apr 11, 2024 · 9:01 AM UTC  5174   337
Solution2- Getting Started with snscrape
Requirements
2. pandas
Installing snscrape
pip3 install snscrape
Perfect, now that we’ve set up snscrape and its peripheral requirements, let’s jump right into using snscrape!
Using snscrape
We use the Python wrapper method because it makes it easy to interact with the scraped data.
Code :
import snscrape.modules.twitter as sntwitter
import pandas as pd

attributes_container = []
# scrape the first 100 tweets from the @narendramodi timeline
for i, tweet in enumerate(sntwitter.TwitterUserScraper('narendramodi').get_items()):
    if i > 100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])
print(tweets_df)
Output
Date Created ... Tweets
4 2023-01-16 04:38:35+00:00 ... திருவை் ளுவர் தினத்தில் , அறிவில் சிறந்த திருவை் ...
98 2023-01-05 02:50:52+00:00 ... Birthday wishes to Dr. Murli Manohar Joshi Ji....
100 2023-01-04 14:22:25+00:00 ... National Green Hydrogen Mission, which the Uni...
Code (PageRank computed by the iterative method with taxation):
import numpy as np
import scipy as sc
#import pandas as pd
from fractions import Fraction

# helper to display vectors rounded to a fixed number of decimals
def float_format(vector, decimal):
    return np.round(vector.astype(float), decimals=decimal)

# uniform probability over the 3 pages
dp = Fraction(1, 3)
# WWW matrix
M = np.matrix([[0, 0, 1],
               [Fraction(1, 2), 0, 0],
               [Fraction(1, 2), 1, 0]])
E = np.zeros((3, 3))
E[:] = dp
# taxation
beta = 0.8
# Google matrix
A = beta * M + ((1 - beta) * E)
# initial vector
r = np.matrix([dp, dp, dp])
r = np.transpose(r)
previous_r = r
for it in range(1, 10):
    r = A * r
    print(float_format(r, 3))
    # check if converged
    if (float_format(previous_r, 3) == float_format(r, 3)).all():
        break
    previous_r = r
print("Final:")
print(float_format(r, 3))
print("sum", np.sum(r))
The output would be:
[[ 0.333]
 [ 0.217]
 [ 0.45 ]]
[[ 0.415]
 [ 0.217]
 [ 0.368]]
[[ 0.358]
 [ 0.245]
 [ 0.397]]
.
.
.
[[ 0.378]
 [ 0.225]
 [ 0.397]]
[[ 0.378]
 [ 0.232]
 [ 0.39 ]]
[[ 0.373]
 [ 0.232]
 [ 0.395]]
[[ 0.376]
 [ 0.231]
 [ 0.393]]
[[ 0.375]
 [ 0.232]
 [ 0.393]]
Final:
[[ 0.375]
 [ 0.232]
 [ 0.393]]
sum 1.0
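As an optional sanity check (not part of the original script), the vector the loop converges to should be the principal eigenvector of the matrix A = beta*M + (1-beta)*E, rescaled to sum to 1:

import numpy as np

# eigen-decomposition of the Google matrix A defined above
vals, vecs = np.linalg.eig(np.asarray(A, dtype=float))
principal = np.real(vecs[:, np.argmax(np.real(vals))])
# rescale so the entries sum to 1; this should match the "Final" vector printed by the iteration
print(principal / principal.sum())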
Practical 5 Date:
Aim: Perform basic text pre-processing: remove numbers, punctuation and whitespace, tokenize the text, remove stop words, and apply stemming and lemmatization.
2. Remove numbers
Remove numbers if they are not relevant to your analyses. Usually, regular
expressions are used to remove numbers.
Example 2. Numbers removing
Python code:
import re
input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4
red and 2 blue balls.'
result = re.sub(r'\d+', '', input_str)
print(result)
Output:
Box A contains red and white balls, while Box B contains red and blue balls.
3. Remove punctuation
The following code removes this set of symbols [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]:
Example 3. Punctuation removal
Python code:
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!" #
Sample string
result = input_str.translate(str.maketrans("","", string.punctuation))
print(result)
Output:
This is an example of string with punctuation
4. Remove whitespaces
To remove leading and ending spaces, you can use the strip() function:
Example 4. White spaces removal
Python code:
input_str = ' \t a string example\t '
input_str = input_str.strip()
input_str
Output:
'a string example'
5. Tokenization
Tokenization is the process of splitting the given text into smaller pieces called
tokens. Words, numbers, punctuation marks, and others can be considered as
tokens.
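A small tokenization sketch with NLTK (the sentence is just an example):

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# words and punctuation marks become separate tokens
print(word_tokenize("Tokens can be words, numbers, or punctuation marks."))
# ['Tokens', 'can', 'be', 'words', ',', 'numbers', ',', 'or', 'punctuation', 'marks', '.']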
Stop words are commonly used words that are removed from the text because they carry little or no meaning and add no value to the analysis.
Code :
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

input_str = 'NLTK is a leading platform for building Python programs to work with human language data.'
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(input_str)
# keep only the tokens that are not stop words
result = [token for token in tokens if token not in stop_words]
print(result)
Output:
['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
6. Stemming
Stemming is also known as the text standardization step, where words are reduced to their root/base form. For example, words like 'programmer', 'programming' and 'program' will all be stemmed to 'program'. The disadvantage of stemming is that the root form may lose its meaning or may not be a proper English word. We will see this in the steps done below.
Code:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer= PorterStemmer()
input_str='There are several types of stemming algorithms.'
input_str=word_tokenize(input_str)
for word in input_str:
print(stemmer.stem(word))
Output (each stemmed token is printed on its own line):
there are sever type of stem algorithm .
7. Lemmatization
Lemmatization, unlike stemming, reduces a word to its dictionary (lemma) form, so the result is always a valid word.
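A minimal lemmatization sketch with NLTK's WordNetLemmatizer, reusing the sentence from the stemming example:

import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
input_str = 'There are several types of stemming algorithms.'
for word in word_tokenize(input_str):
    print(lemmatizer.lemmatize(word))
# printed one token per line: There are several type of stemming algorithm .
# unlike the stemmer, every lemma is still a valid English word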
ARM Measures
Support: The support of the rule X⇒Y in the transaction database D is the support of the itemset X ∪ Y in D:
    support(X⇒Y) = count(X ∪ Y) / N
where 'N' is the total number of transactions in the database and count(X ∪ Y) is the number of transactions that contain X ∪ Y.
Confidence: The confidence of the rule X⇒Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions that contain X in D:
    confidence(X⇒Y) = count(X ∪ Y) / count(X)
The measure 'lift' is newly added in this context. It measures how much more often X and Y occur together than would be expected if they were independent:
    lift(X⇒Y) = confidence(X⇒Y) / support(Y)
A greater lift value indicates a stronger association. We will use this measure in our experiment.
Dataset Description
transaction.csv contains 30 transactions, each listing the items bought together (up to 6 items per transaction). The first few rows look like:
Chips Banana
Juice Chips
Juice Chips
The rules are mined with the apriori algorithm using the code below.
arm.py
import pandas as pd
from apyori import apriori

df = pd.read_csv('transaction.csv', header=None)
print("Display statistics:")
print("===================")
print(df.describe())
print("\nShape:", df.shape)
# convert the dataframe into a list of transactions, skipping empty cells
database = []
for i in range(0, 30):
    database.append([str(df.values[i, j]) for j in range(0, 6) if str(df.values[i, j]) != 'nan'])
# mine association rules with the thresholds described in the explanation below
arm_rules = apriori(database, min_support=0.5, min_confidence=0.7, min_lift=1.2)
arm_results = list(arm_rules)
print("\nNo. of rule(s):", len(arm_results))
print("\nResults:")
print("========")
print(arm_results)
Output:
Display statistics:
===================
0 1 2 3 4 5
count 19 18 23 23 20 22
unique 1 1 1 1 1 1
freq 19 18 23 23 20 22
Shape: (30, 6)
No. of rule(s): 1
Results:
========
[RelationRecord(items=frozenset({'Butter', 'Bread', 'Milk'}),
support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'Bread',
'Milk'}),
items_add=frozenset({'Butter'}), confidence=0.9375,
lift=1.2228260869565217)])]
Explanation
The program generates only one rule based on user-specified input measures
such as: min_support = 0.5, min_confidence = 0.7, and min_lift = 1.2.
The support count value for the rule is 0.5. This number is calculated by dividing
the number of transactions containing ‘Butter’, ‘Bread’, and ‘Milk’ by the total
number of transactions.
The confidence level for the rule is 0.9375, which shows that out of all the
transactions that contain both ‘Bread’ and ‘Milk’, 93.75 % contain ‘Butter’ too.
The lift of 1.22 tells us that customers who buy both 'Bread' and 'Milk' are 1.22 times more likely to also buy 'Butter' than the baseline likelihood of a 'Butter' purchase.
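A quick hand check of these numbers (the counts are inferred from the reported measures: 15 transactions contain Butter, Bread and Milk, 16 contain Bread and Milk, 23 contain Butter, N = 30):

support = 15 / 30              # 0.5
confidence = 15 / 16           # 0.9375
lift = confidence / (23 / 30)  # 1.2228...
print(support, confidence, lift)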
Practical 7 Date:
Aim: Develop a basic crawler to search the web for user-defined keywords.
Solution 1
import requests
from bs4 import BeautifulSoup

def web(page, WebUrl):
    if page > 0:
        url = WebUrl
        code = requests.get(url)
        plain = code.text
        s = BeautifulSoup(plain, 'html.parser')
        #print(s)
        # print the id and href of every link on the page
        for link in s.find_all('a'):
            tet = link.get('id')
            print(tet)
            tet_2 = link.get('href')
            print(tet_2)

web(1, 'https://www.amazon.in/mobile-phones/b?ie=UTF8&node=1389401031&ref_=nav-progressive-content')
Solution 2
import requests
from bs4 import BeautifulSoup
#url=("www.amazon.in")
#url=("www.nytimes.com")
url=("www.timesofindia.indiatimes.com")
code=requests.get("http://"+url)
plain=code.text
s=BeautifulSoup(plain,'html.parser')
for link in s.find_all("a"):
    print(link.get("href"))
    #print(link.prettify())
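The solutions above list every link on a page; the aim also asks for user-defined keywords. A hedged sketch of one way to filter the crawl by a keyword (the keyword and URL below are arbitrary examples):

import requests
from bs4 import BeautifulSoup

def crawl_for_keyword(url, keyword):
    # fetch the page and keep only the links whose anchor text mentions the keyword
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for link in soup.find_all('a'):
        text = link.get_text(strip=True)
        if keyword.lower() in text.lower():
            print(text, "->", link.get('href'))

crawl_for_keyword('https://timesofindia.indiatimes.com', 'cricket')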
Solution 3
import requests
from bs4 import BeautifulSoup

# The element ids/classes used below follow Amazon's product-page markup at the time of
# writing and may need adjusting if the page layout changes.
def get_title(soup):
    try:
        title = soup.find("span", attrs={"id": "productTitle"})
        title_value = title.string
        title_string = title_value.strip()
    except AttributeError:
        title_string = ""
    return title_string

def get_price(soup):
    try:
        price = soup.find("span", attrs={"id": "priceblock_ourprice"}).string.strip()
    except AttributeError:
        price = ""
    return price

def get_rating(soup):
    try:
        rating = soup.find("i", attrs={"class": "a-icon a-icon-star a-star-4-5"}).string.strip()
    except AttributeError:
        try:
            rating = soup.find("span", attrs={"class": "a-icon-alt"}).string.strip()
        except:
            rating = ""
    return rating

def get_availability(soup):
    try:
        available = soup.find("div", attrs={"id": "availability"})
        available = available.find("span").string.strip()
    except AttributeError:
        available = ""
    return available

if __name__ == '__main__':
    # a browser-like User-Agent header (any recent browser string works here)
    HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
               'Accept-Language': 'en-US, en;q=0.5'}
    URL = "https://www.amazon.com/Sony-PlayStation-Pro-1TB-Console-4/dp/B07K14XKZH/"
    # HTTP Request
    webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(webpage.content, "html.parser")
    print("Product Title =", get_title(soup))
    print("Product Price =", get_price(soup))
    print("Product Rating =", get_rating(soup))
    print("Product Availability =", get_availability(soup))