N.G. ACHARYA & D.K. MARATHE COLLEGE OF
ARTS, SCIENCE & COMMERCE
(Affiliated to University of Mumbai)
MUMBAI-400 071

PRACTICAL JOURNAL
PSCSP516c
Web Data Analytics

SUBMITTED BY
PAWAN PRABHUNATH MALLAHA
SEAT NO :
2023-2024

N.G. ACHARYA & D.K. MARATHE COLLEGE OF ARTS, SCIENCE & COMMERCE
CERTIFICATE
This is to certify that Mr. Pawan Prabhunath Mallaha, Seat No. , studying in Master of Science in Computer Science Part I Semester II, has satisfactorily completed the practicals of PSCSP516c Web Data Analytics as prescribed by the University of Mumbai during the academic year 2023-24.
Practical 1
Aim: Scrape details such as color, dimensions and material, or customer ratings by feature, from an e-commerce site.
For this project, we will use ParseHub, a free and powerful web scraping tool that can work with any website. Make sure to download and install ParseHub for free before getting started.
Scraping Amazon Product Data
For this example, we will scrape product data from Amazon.com’s
results page for “computer monitor”. We will extract information
available both on the results page and information available on each
of the product pages.
Getting Started
1. First, make sure to download and install ParseHub. We will use this
web scraper for this project.
2. Open ParseHub, click on “New Project” and use the URL from
Amazon’s result page. The page will now be rendered inside the app.
Scraping Amazon Results Page
1. Once the site is rendered, click on the product name of the first result
on the page. In this case, we will ignore the sponsored listings. The
name you’ve clicked will become green to indicate that it’s been
selected.
2. The rest of the product names will be highlighted in yellow. Click on
the second one on the list. Now all of the items will be highlighted in
green.
3. On the left sidebar, rename your selection to product. You will notice
that ParseHub is now extracting the product name and URL for each
product.
4. On the left sidebar, click the PLUS(+) sign next to the product
selection and choose the Relative Select command.
5. Using the Relative Select command, click on the first product name
on the page and then on its listing price. You will see an arrow
connect the two selections.
6. Expand the new command you’ve created and then delete the URL
that is also being extracted by default.
7. Repeat steps 4 through 6 to also extract the product star rating, the
number of reviews and product image. Make sure to rename your
new selections accordingly.
Pro Tip: The method above will only extract the image URL for each
product. Want to download the actual image file from the site? Read
our guide on how to scrape and download images with ParseHub.
We have now selected all the data we wanted to scrape from the
results page. Your project should now look like this:
Scraping Amazon Product Page
Now, we will tell ParseHub to click on each of the products we’ve
selected and extract additional data from each page. In this case, we
will extract the product ASIN, Screen Size and Screen Resolution.
1. First, on the left sidebar, click on the 3 dots next to the
main_template text.
2. Rename your template to search_results_page. Templates help
ParseHub keep different page layouts separate.
3. Now use the PLUS(+) button next to the product selection and choose
the “Click” command. A pop-up will appear asking you if this link is a
“next page” button. Click “No” and next to Create New Template
input a new template name, in this case, we will use product_page.
4. ParseHub will now automatically create this new template and
render the Amazon product page for the first product on the list.
5. Scroll down the “Product Information” part of the page and using the
Select command, click on the first element of the list. In this case, it
will be the Screen Size item.
6. Like we have done before, keep on selecting the items until they all
turn green. Rename this selection to labels.
7. Expand the labels selection and remove the begin new entry in labels
command.
8. Now click the PLUS(+) sign next to the labels selection and use the
Conditional command. This will allow us to only pull some of the info
from these items.
9. For our first Conditional command, we will use the following
expression:
$e.text.contains("Screen Size")
10. We will then use the PLUS(+) sign next to our conditional
command to add a Relative Select command. We will now use this
Relative Select command to first click on the Screen Size text and
then on the actual measurement next to it (in this case, 21.5 inches).
11. Now ParseHub will extract the product’s screen size into its own
column. We can copy-paste the conditional command we just
created to pull other information. Just make sure to edit the
conditional expression. For example, the ASIN expression will be:
$e.text.contains("ASIN")
12. Lastly, make sure that your conditional selections are aligned
properly so they are not nested amongst themselves. You can drag
and drop the selections to fix this. The final template should look like
this:
Want to scrape reviews as well? Check our guide on how to Scrape
Amazon reviews using a free web scraper.
Adding Pagination
Now, you might want to scrape several pages worth of data for this
project. So far, we are only scraping page 1 of the search results. Let’s
set up ParseHub to navigate to the next 10 results pages.
5. Now, click on the PLUS(+) sign of your next_button selection and use
the Click command.
6. A pop-up will appear asking if this is a “Next” link. Click Yes and enter
the number of pages you’d like to navigate to. In this case, we will
scrape 9 additional pages.
Running and Exporting your Project
Now that we are done setting up the project, it’s time to run our
scrape job.
On the left sidebar, click on the "Get Data" button and click on the
"Run" button to run your scrape. For longer projects, we recommend
doing a Test Run to verify that your data will be formatted correctly.
After the scrape job is completed, you will now be able to download
all the information you’ve requested as a handy spreadsheet or as
a JSON file.
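Once exported, the spreadsheet can be loaded into Python for further analysis. A minimal sketch (the file name here is just a placeholder for whatever you named your export):

import pandas as pd

# load the CSV exported from ParseHub and preview it
results = pd.read_csv("amazon_monitors.csv")
print(results.head())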
Practical 2
Aim: Scrape an online social media site for data. Use Python to scrape information from Twitter.
Installation
pip install ntscraper
How to use
The log_level parameter of the Nitter scraper controls how much logging is shown:
● None = no logs
● 0 = only warning and error logs
● 1 = previous + informational logs (default)
The skip_instance_check parameter is used to skip the check of the Nitter instances altogether during the execution of the script. If you use your own instance or trust the instance you are relying on, you can set it to True; otherwise it is better to leave it as False.
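A minimal initialisation sketch following the library's documented interface (the argument values are just examples):

from ntscraper import Nitter

# log_level=1 keeps informational logs; skip_instance_check=False keeps the default
# check of available Nitter instances before scraping
scraper = Nitter(log_level=1, skip_instance_check=False)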
Then, choose the proper function for what you want to do from the following.
Scrape tweets
github_hash_tweets = scraper.get_tweets("github", mode='hashtag')
Parameters: the search term, the scraping mode ('term', 'hashtag' or 'user') and the number of tweets to fetch.

import pandas as pd
from ntscraper import Nitter

scraper = Nitter(0)

def get_tweets(name, modes, no):
    # fetch the tweets and keep only the fields we need
    tweets = scraper.get_tweets(name, mode=modes, number=no)
    final_tweets = []
    for x in tweets['tweets']:
        data = [x['link'], x['text'], x['date'], x['stats']['likes'], x['stats']['comments']]
        final_tweets.append(data)
    # collect the results in a DataFrame
    dat = pd.DataFrame(final_tweets, columns=['link', 'text', 'date', 'likes', 'comments'])
    return dat

data = get_tweets('narendramodi', 'user', 6)
print(data)
Output (the Hindi tweet text column is omitted here; link, date, likes and comments are shown):

   link                                                            date                        likes  comments
0  https://twitter.com/narendramodi/status/1778352254736592998#m  Apr 11, 2024 · 9:19 AM UTC  2587   202
1  https://twitter.com/narendramodi/status/1778351927404765363#m  Apr 11, 2024 · 9:18 AM UTC  1559   108
2  https://twitter.com/narendramodi/status/1778351302357000565#m  Apr 11, 2024 · 9:16 AM UTC  1646   147
3  https://twitter.com/narendramodi/status/1778350943462891774#m  Apr 11, 2024 · 9:14 AM UTC  1866   155
4  https://twitter.com/narendramodi/status/1778350356293926953#m  Apr 11, 2024 · 9:12 AM UTC  4923   178
5  https://twitter.com/narendramodi/status/1778347711336427917#m  Apr 11, 2024 · 9:01 AM UTC  5174   337
Solution2- Getting Started with snscrape
Requirements
2. pandas
Installing snscrape
pip3 install snscrape
Perfect, now that we’ve set up snscrape and its peripheral requirements, let’s jump right into using snscrape!
Using snscrape
We use the Python wrapper method because it makes it easy to interact with the scraped data.
Code :
import snscrape.modules.twitter as sntwitter
import pandas as pd

attributes_container = []
# scrape the first 100 tweets from the @narendramodi timeline
for i, tweet in enumerate(sntwitter.TwitterUserScraper('narendramodi').get_items()):
    if i > 100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])
print(tweets_df)
Output
Date Created ... Tweets
4 2023-01-16 04:38:35+00:00 ... திருவை் ளுவர் தினத்தில் , அறிவில் சிறந்த திருவை் ...
98 2023-01-05 02:50:52+00:00 ... Birthday wishes to Dr. Murli Manohar Joshi Ji....
100 2023-01-04 14:22:25+00:00 ... National Green Hydrogen Mission, which the Uni...
Code (PageRank computed by the iterative method with taxation):
import numpy as np
import scipy as sc
#import pandas as pd
from fractions import Fraction

# helper to display vectors rounded to a fixed number of decimals
def float_format(vector, decimal):
    return np.round(vector.astype(float), decimals=decimal)

# uniform probability over the 3 pages
dp = Fraction(1, 3)
# WWW matrix
M = np.matrix([[0, 0, 1],
               [Fraction(1, 2), 0, 0],
               [Fraction(1, 2), 1, 0]])
E = np.zeros((3, 3))
E[:] = dp
# taxation
beta = 0.8
# Google matrix
A = beta * M + ((1 - beta) * E)
# initial vector
r = np.matrix([dp, dp, dp])
r = np.transpose(r)
previous_r = r
for it in range(1, 10):
    r = A * r
    print(float_format(r, 3))
    # check if converged
    if (float_format(previous_r, 3) == float_format(r, 3)).all():
        break
    previous_r = r
print("Final:")
print(float_format(r, 3))
print("sum", np.sum(r))
The output would be:
[[ 0.333]
 [ 0.217]
 [ 0.45 ]]
[[ 0.415]
 [ 0.217]
 [ 0.368]]
[[ 0.358]
 [ 0.245]
 [ 0.397]]
.
.
.
[[ 0.378]
 [ 0.225]
 [ 0.397]]
[[ 0.378]
 [ 0.232]
 [ 0.39 ]]
[[ 0.373]
 [ 0.232]
 [ 0.395]]
[[ 0.376]
 [ 0.231]
 [ 0.393]]
[[ 0.375]
 [ 0.232]
 [ 0.393]]
Final:
[[ 0.375]
 [ 0.232]
 [ 0.393]]
sum 1.0
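As an optional sanity check (not part of the original script), the vector the loop converges to should be the principal eigenvector of the matrix A = beta*M + (1-beta)*E, rescaled to sum to 1:

import numpy as np

# eigen-decomposition of the Google matrix A defined above
vals, vecs = np.linalg.eig(np.asarray(A, dtype=float))
principal = np.real(vecs[:, np.argmax(np.real(vals))])
# rescale so the entries sum to 1; this should match the "Final" vector printed by the iteration
print(principal / principal.sum())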
Practical 5 Date:
Aim: Perform basic text pre-processing: remove numbers, punctuation and whitespace, tokenize the text, remove stop words, and apply stemming and lemmatization.
2. Remove numbers
Remove numbers if they are not relevant to your analyses. Usually, regular
expressions are used to remove numbers.
Example 2. Numbers removing
Python code:
import re
input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4
red and 2 blue balls.'
result = re.sub(r'\d+', '', input_str)
print(result)
Output:
Box A contains red and white balls, while Box B contains red and blue balls.
3. Remove punctuation
The following code removes this set of symbols [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]:
Example 3. Punctuation removal
Python code:
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!" #
Sample string
result = input_str.translate(str.maketrans("","", string.punctuation))
print(result)
Output:
This is an example of string with punctuation
4. Remove whitespaces
To remove leading and ending spaces, you can use the strip() function:
Example 4. White spaces removal
Python code:
input_str = ' \t a string example\t '
input_str = input_str.strip()
input_str
Output:
'a string example'
5. Tokenization
Tokenization is the process of splitting the given text into smaller pieces called
tokens. Words, numbers, punctuation marks, and others can be considered as
tokens.
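A small tokenization sketch with NLTK (the sentence is just an example):

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# words and punctuation marks become separate tokens
print(word_tokenize("Tokens can be words, numbers, or punctuation marks."))
# ['Tokens', 'can', 'be', 'words', ',', 'numbers', ',', 'or', 'punctuation', 'marks', '.']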
Stop words are commonly used words that are removed from the text because they carry little or no meaning and add no value to the analysis.
Code :
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

input_str = 'NLTK is a leading platform for building Python programs to work with human language data.'
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(input_str)
# keep only the tokens that are not stop words
result = [token for token in tokens if token not in stop_words]
print(result)
Output:
['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
6. Stemming
Stemming is also known as the text standardization step, where words are reduced to their root/base form. For example, words like 'programmer', 'programming' and 'program' will all be stemmed to 'program'. The disadvantage of stemming is that the root form may lose its meaning or may not be a proper English word. We will see this in the steps done below.
Code:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer= PorterStemmer()
input_str='There are several types of stemming algorithms.'
input_str=word_tokenize(input_str)
for word in input_str:
print(stemmer.stem(word))
Output (each stemmed token is printed on its own line):
there are sever type of stem algorithm .
7. Lemmatization
Lemmatization, unlike stemming, reduces a word to its dictionary (lemma) form, so the result is always a valid word.
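A minimal lemmatization sketch with NLTK's WordNetLemmatizer, reusing the sentence from the stemming example:

import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
input_str = 'There are several types of stemming algorithms.'
for word in word_tokenize(input_str):
    print(lemmatizer.lemmatize(word))
# printed one token per line: There are several type of stemming algorithm .
# unlike the stemmer, every lemma is still a valid English word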
ARM Measures
Support: The support of the rule X⇒Y in the transaction database D is the support of the itemset X ∪ Y in D:
    support(X⇒Y) = count(X ∪ Y) / N
where 'N' is the total number of transactions in the database and count(X ∪ Y) is the number of transactions that contain X ∪ Y.
Confidence: The confidence of the rule X⇒Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions that contain X in D:
    confidence(X⇒Y) = count(X ∪ Y) / count(X)
The measure 'lift' is newly added in this context. It measures how much more often X and Y occur together than would be expected if they were independent:
    lift(X⇒Y) = confidence(X⇒Y) / support(Y)
A greater lift value indicates a stronger association. We will use this measure in our experiment.
Dataset Description
transaction.csv contains 30 transactions, each listing the items bought together (up to 6 items per transaction). The first few rows look like:
Chips Banana
Juice Chips
Juice Chips
The rules are mined with the apriori algorithm using the code below.
arm.py
import pandas as pd
from apyori import apriori

df = pd.read_csv('transaction.csv', header=None)
print("Display statistics:")
print("===================")
print(df.describe())
print("\nShape:", df.shape)
# convert the dataframe into a list of transactions, skipping empty cells
database = []
for i in range(0, 30):
    database.append([str(df.values[i, j]) for j in range(0, 6) if str(df.values[i, j]) != 'nan'])
# mine association rules with the thresholds described in the explanation below
arm_rules = apriori(database, min_support=0.5, min_confidence=0.7, min_lift=1.2)
arm_results = list(arm_rules)
print("\nNo. of rule(s):", len(arm_results))
print("\nResults:")
print("========")
print(arm_results)
Output:
Display statistics:
===================
0 1 2 3 4 5
count 19 18 23 23 20 22
unique 1 1 1 1 1 1
freq 19 18 23 23 20 22
Shape: (30, 6)
No. of rule(s): 1
Results:
========
[RelationRecord(items=frozenset({'Butter', 'Bread', 'Milk'}),
support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'Bread',
'Milk'}),
items_add=frozenset({'Butter'}), confidence=0.9375,
lift=1.2228260869565217)])]
Explanation
The program generates only one rule based on user-specified input measures
such as: min_support = 0.5, min_confidence = 0.7, and min_lift = 1.2.
The support count value for the rule is 0.5. This number is calculated by dividing
the number of transactions containing ‘Butter’, ‘Bread’, and ‘Milk’ by the total
number of transactions.
The confidence level for the rule is 0.9375, which shows that out of all the
transactions that contain both ‘Bread’ and ‘Milk’, 93.75 % contain ‘Butter’ too.
The lift of 1.22 tells us that customers who buy both 'Bread' and 'Milk' are 1.22 times more likely to also buy 'Butter' than the baseline likelihood of a 'Butter' purchase.
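A quick hand check of these numbers (the counts are inferred from the reported measures: 15 transactions contain Butter, Bread and Milk, 16 contain Bread and Milk, 23 contain Butter, N = 30):

support = 15 / 30              # 0.5
confidence = 15 / 16           # 0.9375
lift = confidence / (23 / 30)  # 1.2228...
print(support, confidence, lift)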
Practical 7 Date:
Aim: Develop a basic crawler to search the web for user-defined keywords.
Solution 1
import requests
from bs4 import BeautifulSoup

def web(page, WebUrl):
    if page > 0:
        url = WebUrl
        code = requests.get(url)
        plain = code.text
        s = BeautifulSoup(plain, 'html.parser')
        #print(s)
        # print the id and href of every link on the page
        for link in s.find_all('a'):
            tet = link.get('id')
            print(tet)
            tet_2 = link.get('href')
            print(tet_2)

web(1, 'https://www.amazon.in/mobile-phones/b?ie=UTF8&node=1389401031&ref_=nav-progressive-content')
Solution 2
import requests
from bs4 import BeautifulSoup
#url=("www.amazon.in")
#url=("www.nytimes.com")
url=("www.timesofindia.indiatimes.com")
code=requests.get("http://"+url)
plain=code.text
s=BeautifulSoup(plain,'html.parser')
for link in s.find_all("a"):
    print(link.get("href"))
    #print(link.prettify())
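The solutions above list every link on a page; the aim also asks for user-defined keywords. A hedged sketch of one way to filter the crawl by a keyword (the keyword and URL below are arbitrary examples):

import requests
from bs4 import BeautifulSoup

def crawl_for_keyword(url, keyword):
    # fetch the page and keep only the links whose anchor text mentions the keyword
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for link in soup.find_all('a'):
        text = link.get_text(strip=True)
        if keyword.lower() in text.lower():
            print(text, "->", link.get('href'))

crawl_for_keyword('https://timesofindia.indiatimes.com', 'cricket')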
Solution 3
import requests
from bs4 import BeautifulSoup

# The element ids/classes used below follow Amazon's product-page markup at the time of
# writing and may need adjusting if the page layout changes.
def get_title(soup):
    try:
        title = soup.find("span", attrs={"id": "productTitle"})
        title_value = title.string
        title_string = title_value.strip()
    except AttributeError:
        title_string = ""
    return title_string

def get_price(soup):
    try:
        price = soup.find("span", attrs={"id": "priceblock_ourprice"}).string.strip()
    except AttributeError:
        price = ""
    return price

def get_rating(soup):
    try:
        rating = soup.find("i", attrs={"class": "a-icon a-icon-star a-star-4-5"}).string.strip()
    except AttributeError:
        try:
            rating = soup.find("span", attrs={"class": "a-icon-alt"}).string.strip()
        except:
            rating = ""
    return rating

def get_availability(soup):
    try:
        available = soup.find("div", attrs={"id": "availability"})
        available = available.find("span").string.strip()
    except AttributeError:
        available = ""
    return available

if __name__ == '__main__':
    # a browser-like User-Agent header (any recent browser string works here)
    HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
               'Accept-Language': 'en-US, en;q=0.5'}
    URL = "https://www.amazon.com/Sony-PlayStation-Pro-1TB-Console-4/dp/B07K14XKZH/"
    # HTTP Request
    webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(webpage.content, "html.parser")
    print("Product Title =", get_title(soup))
    print("Product Price =", get_price(soup))
    print("Product Rating =", get_rating(soup))
    print("Product Availability =", get_availability(soup))