BDA Assignment (Savi Bilandi)
MANAGEMENT
MBA/2018/3731
3. RULES:
In principle the algorithm is quite simple: it builds up attribute-value (item) sets that maximize the number of instances that can be explained, i.e. its coverage of the dataset. The search through item space is very similar to the problem faced in attribute selection and subset search. A sketch of one such greedy covering step follows.
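As a rough illustration only, the R sketch below performs one greedy step of such a covering search: it finds the attribute-value pair that explains the most instances of a target class. The data frame `train` and its `class` column are hypothetical names, not part of the original assignment.

```r
## Hypothetical sketch: one greedy covering step. Scan every
## attribute-value pair and keep the one covering the most
## training instances of the target class.
best_pair <- function(train, target_class) {
  best <- list(attribute = NA, value = NA, coverage = 0)
  for (attr in setdiff(names(train), "class")) {
    for (val in unique(train[[attr]])) {
      covered <- sum(train[[attr]] == val & train$class == target_class)
      if (covered > best$coverage) {
        best <- list(attribute = attr, value = val, coverage = covered)
      }
    }
  }
  best
}
```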
4. ANALYSE RESULTS:
Q.2 Provide a decision tree for the diabetes dataset (or take any inbuilt dataset of your choice) using R or Weka.
Solution:
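As a sketch, a classification tree can be grown on one of R's inbuilt datasets with the rpart package; iris is used below as a stand-in, and the Pima Indians diabetes data (PimaIndiansDiabetes in the mlbench package) could be substituted the same way.

```r
## Grow and display a classification tree on the built-in iris data.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)               # tree in text form, with split conditions
plot(fit, margin = 0.1)  # draw the tree
text(fit, use.n = TRUE)  # label the splits and leaf counts
```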
Q.3 Trace the results of using the Apriori algorithm on the grocery store example with support threshold s = 33.34% and confidence threshold c = 60%. Show the candidate and frequent itemsets for each database scan. Enumerate all the final frequent itemsets. Also indicate the association rules that are generated, highlight the strong ones, and sort them by confidence.
Transaction ID | Items
T1             | HotDogs, Buns, Ketchup
T2             | HotDogs, Buns
T3             | HotDogs, Coke, Chips
T4             | Chips, Coke
T5             | Chips, Ketchup
T6             | HotDogs, Coke, Chips
Solution:
Support threshold = 33.34% of 6 transactions, so an itemset must appear in at least 2 transactions. Applying Apriori:
k = 1:
  Candidates (support): HotDogs (4), Buns (2), Ketchup (2), Coke (3), Chips (4)
  Frequent: HotDogs, Buns, Ketchup, Coke, Chips
k = 2:
  Candidates (support): {HotDogs, Buns} (2), {HotDogs, Ketchup} (1), {HotDogs, Coke} (2), {HotDogs, Chips} (2), {Buns, Ketchup} (1), {Buns, Coke} (0), {Buns, Chips} (0), {Ketchup, Coke} (0), {Ketchup, Chips} (1), {Coke, Chips} (3)
  Frequent: {HotDogs, Buns}, {HotDogs, Coke}, {HotDogs, Chips}, {Coke, Chips}
k = 3:
  Candidates (support): {HotDogs, Coke, Chips} (2)
  Frequent: {HotDogs, Coke, Chips}
k = 4:
  Candidates: {} (none generated)
The final frequent itemsets are therefore: {HotDogs}, {Buns}, {Ketchup}, {Coke}, {Chips}, {HotDogs, Buns}, {HotDogs, Coke}, {HotDogs, Chips}, {Coke, Chips}, and {HotDogs, Coke, Chips}.
Note that {HotDogs, Buns, Coke} and {HotDogs, Buns, Chips} are not candidates at k = 3 because their subsets {Buns, Coke} and {Buns, Chips} are not frequent. Note also that there is normally no need to go to k = 4, since the longest transaction has only 3 items.
With the confidence threshold set to 60%, the strong association rules, sorted by confidence and listed as (support, confidence), are:
1. Coke → Chips (0.5, 1)
2. Buns → HotDogs (0.33, 1)
3. HotDogs ∧ Coke → Chips (0.33, 1)
4. HotDogs ∧ Chips → Coke (0.33, 1)
5. Chips → Coke (0.5, 0.75)
6. Coke → HotDogs (0.33, 0.66)
7. Coke → Chips ∧ HotDogs (0.33, 0.66)
8. Coke ∧ Chips → HotDogs (0.33, 0.66)
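As a cross-check, the trace can be reproduced mechanically; a sketch using the arules package in R, with the transactions taken from the table above:

```r
## Re-run the grocery example with arules and list the strong rules.
library(arules)

trans <- as(list(
  T1 = c("HotDogs", "Buns", "Ketchup"),
  T2 = c("HotDogs", "Buns"),
  T3 = c("HotDogs", "Coke", "Chips"),
  T4 = c("Chips", "Coke"),
  T5 = c("Chips", "Ketchup"),
  T6 = c("HotDogs", "Coke", "Chips")
), "transactions")

## supp = 2/6 matches the 33.34% threshold, conf = 0.6 the 60% one;
## minlen = 2 suppresses rules with an empty left-hand side.
rules <- apriori(trans, parameter = list(supp = 2/6, conf = 0.6, minlen = 2))
inspect(sort(rules, by = "confidence"))
```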
Q.4 For the following data set, classify a Red Domestic SUV using the Naive Bayes classifier algorithm. Note that there is no example of a Red Domestic SUV in the data set, so determine what class label the classifier assigns to (Red, Domestic, SUV).
No.  Color   Type    Origin    Stolen?
1    Red     Sports  Domestic  Yes
2    Red     Sports  Domestic  No
3    Red     Sports  Domestic  Yes
4    Yellow  Sports  Domestic  No
5    Yellow  Sports  Imported  Yes
6    Yellow  SUV     Imported  No
7    Yellow  SUV     Imported  Yes
8    Yellow  SUV     Domestic  No
9    Red     SUV     Imported  No
10   Red     Sports  Imported  Yes
Solution:
1. The Classifier
The Naive Bayes classifier selects the most likely classification v_NB given the attribute values a_1, a_2, ..., a_n. This results in:

    v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)        (1)
To classify the example (Red, SUV, Domestic) we compute P(Red | v_j), P(SUV | v_j) and P(Domestic | v_j) for each class and multiply them by P(Yes) and P(No) respectively:

    v_NB = argmax_{v_j ∈ {Yes, No}} P(v_j) P(Red | v_j) P(SUV | v_j) P(Domestic | v_j)        (2)

We can estimate these values using the m-estimate of equation (3):

    P(a_i | v_j) = (n_c + m·p) / (n + m)        (3)

where n is the number of training examples with class v_j, n_c the number of those examples in which the attribute takes value a_i, p a prior estimate of the probability, and m the equivalent sample size.
            Yes                No
Red:        n = 5, n_c = 3     n = 5, n_c = 2
SUV:        n = 5, n_c = 1     n = 5, n_c = 3
Domestic:   n = 5, n_c = 2     n = 5, n_c = 3

(p = 0.5 and m = 3 for every attribute)
Looking at P(Red | Yes), we have 5 cases where v_j = Yes, and in 3 of those cases a_i = Red. So for P(Red | Yes), n = 5 and n_c = 3. Note that all attributes are binary (two possible values). We are assuming no other information, so p = 1/(number of attribute values) = 0.5 for all of our attributes. Our m value is arbitrary (we will use m = 3) but consistent for all attributes. Now we simply apply equation (3) using the precomputed values of n, n_c, p, and m.
    P(Red | Yes)      = (3 + 3 × 0.5) / (5 + 3) = 0.56
    P(Red | No)       = (2 + 3 × 0.5) / (5 + 3) = 0.43
    P(SUV | Yes)      = (1 + 3 × 0.5) / (5 + 3) = 0.31
    P(SUV | No)       = (3 + 3 × 0.5) / (5 + 3) = 0.56
    P(Domestic | Yes) = (2 + 3 × 0.5) / (5 + 3) = 0.43
    P(Domestic | No)  = (3 + 3 × 0.5) / (5 + 3) = 0.56
We have P(Yes) = 0.5 and P(No) = 0.5, so we can apply equation (2). For v = Yes, we have

    P(Yes) × P(Red | Yes) × P(SUV | Yes) × P(Domestic | Yes) = 0.5 × 0.56 × 0.31 × 0.43 ≈ 0.037

and for v = No we have

    P(No) × P(Red | No) × P(SUV | No) × P(Domestic | No) = 0.5 × 0.43 × 0.56 × 0.56 ≈ 0.067

Since 0.067 > 0.037, the Red Domestic SUV is classified as No: the example is not stolen.
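As a cross-check, a minimal R sketch that reproduces the m-estimate arithmetic, assuming the 10-row table above as the training data (the data frame and function names are illustrative, not from the original):

```r
## m-estimate of P(a_i | v_j): (n_c + m*p) / (n + m), per equation (3).
m_est <- function(nc, n, p = 0.5, m = 3) (nc + m * p) / (n + m)

## Training data, transcribed from the table in the question.
cars <- data.frame(
  Color  = c("Red","Red","Red","Yellow","Yellow","Yellow","Yellow","Yellow","Red","Red"),
  Type   = c("Sports","Sports","Sports","Sports","Sports","SUV","SUV","SUV","SUV","Sports"),
  Origin = c("Domestic","Domestic","Domestic","Domestic","Imported",
             "Imported","Imported","Domestic","Imported","Imported"),
  Stolen = c("Yes","No","Yes","No","Yes","No","Yes","No","No","Yes")
)

## Smoothed conditional probability of attribute == value given the class.
cond <- function(attr, value, class)
  m_est(sum(cars[[attr]] == value & cars$Stolen == class),
        sum(cars$Stolen == class))

p_yes <- 0.5 * cond("Color", "Red", "Yes") * cond("Type", "SUV", "Yes") *
         cond("Origin", "Domestic", "Yes")  # ~0.038 on the exact fractions
p_no  <- 0.5 * cond("Color", "Red", "No")  * cond("Type", "SUV", "No") *
         cond("Origin", "Domestic", "No")   # ~0.069 on the exact fractions
ifelse(p_yes > p_no, "Yes", "No")           # "No": the example is not stolen
```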
Solution:
Linear regression is an important tool in analytics. The technique uses statistical calculations to fit a trend line to a set of data points. ... Linear regression shows the relationship between an independent variable and a dependent variable being studied. There are a number of ways to calculate a linear regression.
Analysis: if R-squared is greater than 0.8, as it is in this case, there is a good fit to the data. Some statistics references recommend using the adjusted R-squared value instead.
Interpretation: an R-squared of 0.961 means that 96.1% of the variation in quantity sold can be explained by price and advertising; the adjusted R-squared of 0.942 puts this at 94.2% after adjusting for the number of predictors.
Since the p-value is less than 0.05, we reject the null hypothesis that the variables are unrelated; in other words, there is a statistically significant relationship between them.
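As a sketch of one way to calculate such a regression, the same statistics can be read off an lm() fit in R; since the original price/advertising data set is not shown here, the example below uses the inbuilt mtcars data with two predictors as a stand-in:

```r
## Fit a two-predictor linear regression and extract the fit
## statistics discussed above.
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)$r.squared            # R-squared: share of variance explained
summary(fit)$adj.r.squared        # adjusted for the number of predictors
coef(summary(fit))[, "Pr(>|t|)"]  # p-values for each coefficient
```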