BDA Assignment (Savi Bilandi)

This document summarizes a student's submission of their data analytics lab work. It includes responses to 4 questions analyzing grocery store transaction data, a diabetes dataset, and a car theft dataset using association rule mining and naive Bayes classification. Screenshots of Weka outputs are included. The student traces the Apriori algorithm on sample transaction data and identifies frequent itemsets and strong association rules. They also classify an unlabeled example in the car theft data using naive Bayes.


INTERNATIONAL SCHOOL OF INFORMATICS & MANAGEMENT

Submitted in partial fulfillment of the requirement for the award of the degree of
Diploma in Business Data Analytics
2018-20

Subject: Data Analytics Lab

Submitted To: Dr. Monika Rathore
Submitted By: Savi Bilandi
MBA/2018/3731
Roll No. 18MIIXX699
Data Analytics Roll No. 183172


Q.1 Find association rules for a supermarket/grocery dataset using R or
Weka. Share screenshots and explain the rules in your own words.
Solution:
1. Start the WEKA EXPLORER

2. LOAD THE SUPERMARKET DATABASE

3. RULES :

In principle the algorithm is quite simple. It builds up attribute-value (item) sets that maximize the number of instances
that can be explained (the coverage of the dataset). The search through item space is very similar to the problem
faced in attribute selection and subset search.
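This coverage/support idea can be sketched in a few lines of Python (an illustration, not Weka's actual implementation); the sample baskets are the transaction table from Q.3:

```python
# Support = fraction of transactions containing every item in the candidate set.
transactions = [
    {"HotDogs", "Buns", "Ketchup"},
    {"HotDogs", "Buns"},
    {"HotDogs", "Coke", "Chips"},
    {"Chips", "Coke"},
    {"Chips", "Ketchup"},
    {"HotDogs", "Coke", "Chips"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)  # set subset test
    return hits / len(transactions)

print(support({"Coke", "Chips"}, transactions))  # 3 of 6 baskets -> 0.5
```

Apriori grows such item sets level by level, keeping only those whose support stays above the chosen threshold.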

4. ANALYSE RESULTS:
Q.2 Provide a decision tree for the diabetes dataset (or any
inbuilt dataset of your choice) using R or Weka.
Solution:
Q.3 Trace the results of using the Apriori algorithm on the grocery
store example with support threshold s = 33.34% and confidence
threshold c = 60%. Show the candidate and frequent item sets for each
database scan. Enumerate all the final frequent item sets. Also
indicate the association rules that are generated, highlight the
strong ones, and sort them by confidence.

Transaction ID  Items
T1              HotDogs, Buns, Ketchup
T2              HotDogs, Buns
T3              HotDogs, Coke, Chips
T4              Chips, Coke
T5              Chips, Ketchup
T6              HotDogs, Coke, Chips

Solution:
A support threshold of 33.34% of the 6 transactions means an item set must appear in at least 2 transactions. Applying Apriori:

Pass k=1
  Candidates (support): HotDogs(4), Buns(2), Ketchup(2), Coke(3), Chips(4)
  Frequent: HotDogs, Buns, Ketchup, Coke, Chips

Pass k=2
  Candidates (support): {HotDogs, Buns}(2), {HotDogs, Ketchup}(1), {HotDogs, Coke}(2), {HotDogs, Chips}(2), {Buns, Ketchup}(1), {Buns, Coke}(0), {Buns, Chips}(0), {Ketchup, Coke}(0), {Ketchup, Chips}(1), {Coke, Chips}(3)
  Frequent: {HotDogs, Buns}, {HotDogs, Coke}, {HotDogs, Chips}, {Coke, Chips}

Pass k=3
  Candidates (support): {HotDogs, Coke, Chips}(2)
  Frequent: {HotDogs, Coke, Chips}

Pass k=4
  Candidates: none

Note that {HotDogs, Buns, Coke} and {HotDogs, Buns, Chips} are not candidates at
k=3 because their subsets {Buns, Coke} and {Buns, Chips} are not frequent.
Note also that normally there is no need to go to k=4, since the longest transaction has only
3 items.

All Frequent Itemsets: {HotDogs}, {Buns}, {Ketchup}, {Coke}, {Chips}, {HotDogs, Buns}, {HotDogs, Coke}, {HotDogs, Chips}, {Coke, Chips}, {HotDogs, Coke, Chips}.
Association rules (support, confidence):
{HotDogs, Buns} generates: HotDogs → Buns (2/6=0.33, 2/4=0.5) and Buns → HotDogs (2/6=0.33, 2/2=1);
{HotDogs, Coke} generates: HotDogs → Coke (0.33, 0.5) and Coke → HotDogs (2/6=0.33, 2/3=0.66);
{HotDogs, Chips} generates: HotDogs → Chips (0.33, 0.5) and Chips → HotDogs (2/6=0.33, 2/4=0.5);
{Coke, Chips} generates: Coke → Chips (3/6=0.5, 3/3=1) and Chips → Coke (3/6=0.5, 3/4=0.75);
{HotDogs, Coke, Chips} generates: HotDogs → Coke ∧ Chips (2/6=0.33, 2/4=0.5), Coke → Chips ∧ HotDogs (2/6=0.33, 2/3=0.66), Chips → Coke ∧ HotDogs (2/6=0.33, 2/4=0.5), HotDogs ∧ Coke → Chips (2/6=0.33, 2/2=1), HotDogs ∧ Chips → Coke (2/6=0.33, 2/2=1) and Coke ∧ Chips → HotDogs (2/6=0.33, 2/3=0.66).

With the confidence threshold set to 60%, the Strong Association Rules, sorted by confidence, are:
1. Coke → Chips (0.5, 1)
2. Buns → HotDogs (0.33, 1)
3. HotDogs ∧ Coke → Chips (0.33, 1)
4. HotDogs ∧ Chips → Coke (0.33, 1)
5. Chips → Coke (0.5, 0.75)
6. Coke → HotDogs (0.33, 0.66)
7. Coke → Chips ∧ HotDogs (0.33, 0.66)
8. Coke ∧ Chips → HotDogs (0.33, 0.66)
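The trace above can be verified mechanically. The following Python sketch (plain stdlib, an illustration rather than a library implementation) regenerates the candidate sets level-wise with subset pruning and then derives the strong rules:

```python
from itertools import combinations

transactions = [
    {"HotDogs", "Buns", "Ketchup"},
    {"HotDogs", "Buns"},
    {"HotDogs", "Coke", "Chips"},
    {"Chips", "Coke"},
    {"Chips", "Ketchup"},
    {"HotDogs", "Coke", "Chips"},
]
MIN_COUNT = 2  # 33.34% of 6 transactions

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Level-wise candidate generation (Apriori) with subset pruning.
items = {i for t in transactions for i in t}
frequent = {1: {frozenset([i]) for i in items if count(frozenset([i])) >= MIN_COUNT}}
k = 1
while frequent[k]:
    candidates = {a | b for a in frequent[k] for b in frequent[k] if len(a | b) == k + 1}
    # Prune candidates with an infrequent k-subset (e.g. {HotDogs, Buns, Coke}).
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[k] for s in combinations(c, k))}
    frequent[k + 1] = {c for c in candidates if count(c) >= MIN_COUNT}
    k += 1

all_frequent = set().union(*frequent.values())
print(len(all_frequent))  # 10 frequent item sets, as enumerated above

# Association rules with confidence >= 60%.
rules = []
for f in all_frequent:
    for r in range(1, len(f)):          # all proper non-empty antecedents
        for lhs in map(frozenset, combinations(f, r)):
            confidence = count(f) / count(lhs)
            if confidence >= 0.6:
                rules.append((set(lhs), set(f - lhs), round(confidence, 2)))
print(len(rules))  # 8 strong rules
```

Note that `round(2/3, 2)` gives 0.67 where the hand trace truncated to 0.66; the rule sets are otherwise identical.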
Q.4 For the following data set, classify a Red, Domestic SUV using the
Naive Bayes classifier. Note that there is no example of a Red
Domestic SUV in the data set. What will be the class label?
No.  Color   Type    Origin    Stolen?
1    Red     Sports  Domestic  Yes
2    Red     Sports  Domestic  No
3    Red     Sports  Domestic  Yes
4    Yellow  Sports  Domestic  No
5    Yellow  Sports  Imported  Yes
6    Yellow  SUV     Imported  No
7    Yellow  SUV     Imported  Yes
8    Yellow  SUV     Domestic  No
9    Red     SUV     Imported  No
10   Red     Sports  Imported  Yes

Solution
1 The Classifier
The Naive Bayes classifier selects the most likely classification v_NB given the attribute values
a1, a2, ..., an. This results in:

    v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)        (1)

We generally estimate P(ai | vj) using m-estimates:

    P(ai | vj) = (nc + m·p) / (n + m)                   (2)

where:
    n  = the number of training examples for which v = vj
    nc = the number of examples for which v = vj and a = ai
    p  = a priori estimate for P(ai | vj)
    m  = the equivalent sample size

2 Car Theft Example

The attributes are Color, Type, and Origin, and the class attribute, Stolen, can be either Yes or
No.

2.1 Data set

Example No.  Color   Type    Origin    Stolen?
1            Red     Sports  Domestic  Yes
2            Red     Sports  Domestic  No
3            Red     Sports  Domestic  Yes
4            Yellow  Sports  Domestic  No
5            Yellow  Sports  Imported  Yes
6            Yellow  SUV     Imported  No
7            Yellow  SUV     Imported  Yes
8            Yellow  SUV     Domestic  No
9            Red     SUV     Imported  No
10           Red     Sports  Imported  Yes

2.2 Training example

We want to classify a Red Domestic SUV. Note there is no example of a Red
Domestic SUV in our data set. Looking back at equation (1) we can see how to
compute this. We need to calculate the probabilities

P(Red|Yes), P(SUV|Yes), P(Domestic|Yes), P(Red|No), P(SUV|No), and P(Domestic|No)

and multiply them by P(Yes) and P(No) respectively. We can estimate these
values using equation (2).
For v = Yes (n = 5, p = .5, m = 3):
  Red: n_c = 3;  SUV: n_c = 1;  Domestic: n_c = 2
For v = No (n = 5, p = .5, m = 3):
  Red: n_c = 2;  SUV: n_c = 3;  Domestic: n_c = 3
Looking at P(Red|Yes), we have 5 cases where vj = Yes, and in 3 of those
cases ai = Red. So for P(Red|Yes), n = 5 and nc = 3. Note that all attributes
are binary (two possible values). We are assuming no other information, so p =
1/(number of attribute values) = 0.5 for all of our attributes. Our m value is
arbitrary (we will use m = 3) but consistent for all attributes. Now we simply
apply equation (2) using the precomputed values of n, nc, p, and m.
P(Red|Yes)      = (3 + 3·.5)/(5 + 3) = .56      P(Red|No)      = (2 + 3·.5)/(5 + 3) = .43
P(SUV|Yes)      = (1 + 3·.5)/(5 + 3) = .31      P(SUV|No)      = (3 + 3·.5)/(5 + 3) = .56
P(Domestic|Yes) = (2 + 3·.5)/(5 + 3) = .43      P(Domestic|No) = (3 + 3·.5)/(5 + 3) = .56

We have P(Yes) = .5 and P(No) = .5, so we can apply equation (1). For v = Yes,
we have
P(Yes) * P(Red | Yes) * P(SUV | Yes) * P(Domestic|Yes)

= .5 * .56 * .31 * .43 = .037


and for v = No, we have
P(No) * P(Red | No) * P(SUV | No) * P (Domestic | No)

= .5 * .43 * .56 * .56 = .069


Since 0.069 > 0.037, our example gets classified as 'No'.
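The hand computation can be cross-checked with a short Python sketch of the m-estimate (equation (2)) and the Naive Bayes product (equation (1)); this is a minimal illustration, not a library implementation:

```python
# Car theft training data: (Color, Type, Origin, Stolen?)
data = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]
M, P_PRIOR = 3, 0.5  # equivalent sample size m; prior p = 1/2 for binary attributes

def m_estimate(attr_index, value, label):
    """P(a_i = value | v = label) via equation (2): (nc + m*p) / (n + m)."""
    n = sum(1 for row in data if row[-1] == label)
    nc = sum(1 for row in data if row[-1] == label and row[attr_index] == value)
    return (nc + M * P_PRIOR) / (n + M)

def score(label, example):
    """P(label) times the product of m-estimated conditionals, as in equation (1)."""
    p = sum(1 for row in data if row[-1] == label) / len(data)
    for i, value in enumerate(example):
        p *= m_estimate(i, value, label)
    return p

example = ("Red", "SUV", "Domestic")
print(round(score("Yes", example), 3), round(score("No", example), 3))  # 0.038 0.069
```

Keeping full precision gives .038 for Yes (the hand calculation's .037 comes from rounding the conditionals first); either way .069 > .038, so the example is classified as No.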

Q.5 What do you understand by linear regression? For the
following dataset, is there a relation between Quantity Sold
(output) and Price and Advertising (inputs)? Predict
Quantity Sold using regression analysis in Excel.

Solution:
Linear regression is an important tool in analytics. The technique uses
statistical calculations to fit a trend line through a set of data points. Linear
regression shows the relationship between one or more independent variables and a
dependent variable being studied. There are a number of ways to calculate
linear regression.
Analysis: if R Square is greater than 0.8, as it is in this case, there is a good fit to
the data. Some statistics references recommend using the adjusted R Square value.

Interpretation: an R Square of .961 means that 96.1% of the variation in Quantity Sold
can be explained by Price and Advertising. The adjusted R Square of .942 means
94.2%.

Since the p-value is less than 0.05, we reject the null hypothesis that the variables are
unrelated. In other words, there is a relation between Quantity Sold and the inputs.
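Outside Excel, the same multiple regression can be run with NumPy's least squares. The rows below are made-up placeholders for illustration (the assignment's actual dataset lives in the Excel sheet), so only the mechanics carry over, not the .961 result:

```python
import numpy as np

# Hypothetical (Price, Advertising, Quantity Sold) rows, for illustration only.
price = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
advertising = np.array([100.0, 150.0, 120.0, 180.0, 200.0, 160.0])
quantity = np.array([95.0, 90.0, 80.0, 78.0, 72.0, 60.0])

# Fit quantity = b0 + b1*price + b2*advertising by ordinary least squares.
X = np.column_stack([np.ones_like(price), price, advertising])
coef, *_ = np.linalg.lstsq(X, quantity, rcond=None)

# R Square = 1 - SS_residual / SS_total, the statistic Excel reports.
pred = X @ coef
ss_res = float(np.sum((quantity - pred) ** 2))
ss_tot = float(np.sum((quantity - quantity.mean()) ** 2))
r_squared = 1 - ss_res / ss_tot
print(coef.round(3), round(r_squared, 3))
```

With the real spreadsheet values substituted in, `r_squared` should reproduce Excel's R Square output.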
