Data Mining BITS-PILANI Mid Semester Sample

PCA helps reduce data dimensionality to identify patterns and correlations. FOIL_Prune assesses rule quality using the positive and negative tuples a rule covers. DIC dynamically adds and drops itemsets while the transactions are being read, reducing the number of passes over the data needed to count frequent itemsets.


1.

Provide brief answers to the following: (3 X 2 marks)


a. How does PCA (principal component analysis) impact data mining activity?
b. How can FOIL_Prune enhance quality of rules?
c. How can DIC (Dynamic Itemset Counting) enhance efficiency of Apriori algorithm?

2. You found that bread and cheese sell together often in a store. Individually bread sells more
often than cheese. Will you conclude that bread=>cheese or cheese=>bread? Why? (2 marks)

3. After mining a transaction database for frequent itemsets, there is one largest frequent itemset
of size 6. Let N, C, M be the total number of frequent itemsets, closed frequent itemsets, and
maximal frequent itemsets (including the one of size 6). What are the minimum values of N, C,
and M? (2 marks)

4. Consider the following observations of an attribute. Answer the following: (4 marks)

2, 124, 131, 132, 133, 135, 135, 136, 136, 136, 140, 141, 142, 143, 249

(a) What is the five-number summary for the given data?


(b) Draw boxplot for the data.
(c) Identify the outliers, if any.
(d) How do outliers impact the measures of central tendency (Mean, Mode and Median) of
data? Comment using the given data set.

5. Compare the proximity between the following pairs with supporting computations. (Assume all
attributes are ordinal) (4 marks):
a) C and E
b) A and D

Object   Income       Height   Weight
A        <50K         Normal   Underweight
B        75K – 100K   Short    Normal
C        >100K        Tall     Overweight
D        50K – 75K    Short    Normal
E        50K – 75K    Normal   Overweight

6. During decision tree induction for 2-class data, a node has {10, 10} objects of class1 and class2.
Upon split using an attribute A, two child nodes were created which have {7, 2} and {3, 8}
objects of class1 and class2. Calculate the improvement in (im)purity using Gini index, Entropy
and Misclassification error. (3 marks)
7. Given below are the confusion matrices of two classifiers, classifier A and classifier B, when
tested with the same dataset. The classification involved two classes Class 1 and Class 2.
Compare performance of 2 classifiers with supporting computations. (4 marks)

Classifier A (rows = Actual, columns = Predicted):

            Predicted 1   Predicted 2
Actual 1        42             8
Actual 2         8            42

Classifier B (rows = Actual, columns = Predicted):

            Predicted 1   Predicted 2
Actual 1        41             9
Actual 2         7            43

8. A market basket database has five transactions. Let minimum support = 60%.

TID items bought


T100 Bread, Butter, Beans, Potato, Jam, Milk
T200 Bread, Butter, Shampoo, Potato, Jam, Milk
T300 Beans, Soap, Butter, Bread
T400 Beans, Onion, Apple, Butter, Milk
T500 Apple, Banana, Jam, Bread, Butter

Find the frequent itemsets using the Apriori algorithm. Show all intermediate steps clearly.
[5 marks]
Answer 1. (a)
Principal Component Analysis (PCA) is used to reduce the dimensionality of large data sets.

It does this by transforming a large set of variables into a smaller set of new variables (principal components) that still retains most of the information contained in the original data.

Of all the dimensionality-reduction approaches, PCA is by far the most well known.

In data mining, reducing the dimensionality of a large data set makes the data much easier to work with: the components are chosen so that most of the variance in the data set is retained along the first few axes, which is what allows patterns and correlations to be identified with far less computation.

Answer 1. (b)

Rule pruning is an essential step in rule-based mining because it assesses whether a rule learned from the training data will also hold on subsequent data sets: a rule that merely fits the training data tends to perform worse when applied to new data.

FOIL_Prune is a simple and effective measure for guiding this pruning.

FOIL_Prune(R) = (T+ - T-) / (T+ + T-), where

T+ is the number of positive tuples covered by rule R, and

T- is the number of negative tuples covered by rule R.

The value of FOIL_Prune increases as the accuracy of rule R on the pruning set increases.

This means that if the value is higher for the pruned version of R, we go ahead and prune R (i.e. drop the attribute test).

Hence, pruning guided by FOIL_Prune improves the quality of the resulting rules.
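
A minimal sketch of this pruning decision in Python (illustrative only; foil_prune and should_prune are hypothetical helper names, and the T+ / T- counts would come from evaluating each rule version on the pruning set):

def foil_prune(pos, neg):
    # FOIL_Prune(R) = (T+ - T-) / (T+ + T-), with T+ = pos and T- = neg:
    # the positive and negative tuples covered by rule R on the pruning set.
    if pos + neg == 0:
        return 0.0  # the rule covers nothing; treat it as neutral
    return (pos - neg) / (pos + neg)

def should_prune(pos_full, neg_full, pos_pruned, neg_pruned):
    # Drop the attribute test if the pruned rule scores at least as high.
    return foil_prune(pos_pruned, neg_pruned) >= foil_prune(pos_full, neg_full)

# Hypothetical example: the full rule covers 40 positive / 10 negative tuples;
# after removing one attribute test it covers 45 positive / 11 negative tuples.
print(foil_prune(40, 10))            # 0.6
print(foil_prune(45, 11))            # ~0.607
print(should_prune(40, 10, 45, 11))  # True -> keep the shorter, pruned rule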

Answer 1.(c)

Dynamic Itemset Counting (DIC) is an alternative to Apriori's level-wise itemset generation.

In DIC, candidate itemsets are added (and their counting finished) dynamically while the transactions are being read, rather than only at the boundaries of full database passes.

As in Apriori, for an itemset to be frequent all of its subsets must also be frequent, so DIC only starts counting itemsets whose subsets have already been found (or are estimated) to be frequent.

DIC thus addresses a question Apriori cannot: when, during a pass, to start counting which itemsets.

This allows DIC to find all frequent itemsets in substantially fewer passes over the data than Apriori, which needs roughly one full pass per itemset length.

Data mining: Data mining is the process of extracting useful patterns and correlations from a large set of raw data using machine-learning methods, and of reducing the dimensionality of complex data sets.

A. How PCA impacts data mining: High-dimensional data is complex and expensive to process; redundant, correlated features increase computation time and make the analysis harder. PCA reduces the dimensionality of the data, which lets us identify the correlations and patterns in the data set and transform it into a data set of significantly lower dimension without losing the important information. Removing that redundancy is directly in line with the main goals of data mining.

Step-by-step PCA (see the sketch that follows):

1. Standardize the data.

2. Compute the covariance matrix.

3. Calculate the eigenvalues and eigenvectors of the covariance matrix.

4. Compute the principal components (eigenvectors ordered by decreasing eigenvalue).

5. Reduce the dimensionality of the data by projecting it onto the top components.
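
A minimal sketch of these five steps with NumPy (illustrative only; the matrix X is assumed to hold samples in rows and features in columns):

import numpy as np

def pca(X, n_components=2):
    # 1. Standardize the data (zero mean, unit variance per feature).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Compute the covariance matrix of the features.
    cov = np.cov(X_std, rowvar=False)
    # 3. Calculate eigenvalues and eigenvectors of the (symmetric) covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Principal components = eigenvectors sorted by decreasing eigenvalue.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    # 5. Reduce the dimensionality by projecting the data onto those components.
    return X_std @ components

# Hypothetical example: 100 samples with 5 features reduced to 2 dimensions.
X = np.random.rand(100, 5)
print(pca(X, n_components=2).shape)  # (100, 2)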

B. FOIL_Prune: Assessments of rule quality are made with tuples from the original training data.

A rule R is pruned by removing an attribute test (conjunct); we choose to prune R if the pruned version has higher quality, as assessed on a separate pruning data set. FOIL_Prune is a simple and effective measure for this assessment.

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R. The value of FOIL_Prune increases with the accuracy of R on the pruning set, so it favours rules that are accurate and cover many positive tuples. If FOIL_Prune is higher for the pruned version of R, then we prune R.

C. Although the Apriori algorithm is easy to implement, it needs many database scans, which hurts overall performance when the database is very large, and many variations of Apriori have been proposed to improve its efficiency. Dynamic Itemset Counting reduces the number of passes made over the data while keeping the number of itemsets that are counted low. In DIC the database is partitioned into blocks marked by start points, and each candidate itemset maintains its count so far toward the minimum support. Counting of a new candidate can begin at any start point, and once an itemset crosses the minimum support it is added to the collection of frequent itemsets and used to generate longer candidate itemsets. By starting (and finishing) the counting of itemsets dynamically at these block boundaries instead of once per full pass, DIC finds the frequent itemsets in fewer passes, which increases the algorithm's efficiency.
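
For contrast, a minimal sketch of the baseline level-wise Apriori in Python (illustrative only, not an implementation of DIC itself); it makes one full pass over the transactions for every candidate length, which is exactly the cost DIC reduces:

from itertools import combinations

def apriori(transactions, min_support):
    # transactions: list of sets of items; min_support: absolute count threshold.
    # Pass 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: v for k, v in counts.items() if v >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune with the
        # Apriori property (every (k-1)-subset must itself be frequent).
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                    frozenset(s) in frequent for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One more full pass over the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

# Example with the five transactions of question 8 (min support 60% of 5 = 3).
db = [
    {"Bread", "Butter", "Beans", "Potato", "Jam", "Milk"},
    {"Bread", "Butter", "Shampoo", "Potato", "Jam", "Milk"},
    {"Beans", "Soap", "Butter", "Bread"},
    {"Beans", "Onion", "Apple", "Butter", "Milk"},
    {"Apple", "Banana", "Jam", "Bread", "Butter"},
]
for itemset, count in sorted(apriori(db, 3).items(), key=lambda x: (len(x[0]), -x[1])):
    print(sorted(itemset), count)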
Q2.
Both candidate rules have the same support, since support(bread => cheese) = support(cheese => bread) = sup({bread, cheese}). The choice therefore rests on confidence:

confidence(bread => cheese) = sup({bread, cheese}) / sup({bread})

confidence(cheese => bread) = sup({bread, cheese}) / sup({cheese})

Since bread individually sells more often than cheese, sup({bread}) > sup({cheese}), and therefore confidence(cheese => bread) > confidence(bread => cheese).

So we conclude cheese => bread: a customer who buys cheese is very likely to also buy bread, whereas a large share of bread purchases happen without cheese.


Q3.
The largest frequent itemset has size k = 6.

To find: the minimum values of N (frequent itemsets), C (closed frequent itemsets) and M (maximal frequent itemsets).

Solution:

By the Apriori (anti-monotone) property, every non-empty subset of a frequent itemset is also frequent. The size-6 itemset therefore forces at least 2^6 - 1 = 63 frequent itemsets, so the minimum value of N is 63.

The size-6 itemset has no frequent superset (it is the largest frequent itemset), so it is maximal; in the minimum case every other frequent itemset is one of its subsets and hence not maximal, so the minimum value of M is 1.

A frequent itemset is closed if no proper superset has the same support. In the extreme case where all 63 itemsets have exactly the same support as the size-6 itemset, only the size-6 itemset is closed, so the minimum value of C is 1.


Q4.
The data is already in ascending order (15 observations):

2, 124, 131, 132, 133, 135, 135, 136, 136, 136, 140, 141, 142, 143, 249

minimum = 2

maximum = 249

Median = middle (8th) value = 136

Lower half (the 7 values below the median) = 2, 124, 131, 132, 133, 135, 135, so Q1 = 132

Upper half (the 7 values above the median) = 136, 136, 140, 141, 142, 143, 249, so Q3 = 141

a.) Five-number summary of the given data:

minimum = 2, Q1 = 132, median = 136, Q3 = 141, maximum = 249

b.) Boxplot for the data (sketch): the box runs from Q1 = 132 to Q3 = 141 with the median line at 136; the whiskers extend to the smallest and largest non-outlier values (124 and 143); the outliers 2 and 249 are plotted as individual points beyond the whiskers.

c.) Outliers:

IQR = Q3 - Q1 = 141 - 132 = 9

1.5 * IQR = 13.5

Lower fence = Q1 - 13.5 = 118.5

Upper fence = Q3 + 13.5 = 154.5

2 < 118.5 and 249 > 154.5, so both 2 and 249 lie outside the interval (118.5, 154.5) and are outliers.

d.) Outliers have a significant impact on the mean but little or no impact on the median and mode. For the given data: mean = 2015 / 15 ≈ 134.3, median = 136, mode = 136. If the outliers 2 and 249 are removed: mean = 1764 / 13 ≈ 135.7 (changed), median = 136 (same), mode = 136 (same).
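
A short sketch that reproduces these numbers in Python (the quartile convention excludes the median from both halves, as above):

import statistics

data = [2, 124, 131, 132, 133, 135, 135, 136, 136, 136, 140, 141, 142, 143, 249]

def five_number_summary(values):
    # Minimum, Q1, median, Q3, maximum, with the median excluded from both halves.
    s = sorted(values)
    n = len(s)
    median = statistics.median(s)
    lower = s[: n // 2]        # values below the median position
    upper = s[(n + 1) // 2 :]  # values above the median position
    return min(s), statistics.median(lower), median, statistics.median(upper), max(s)

mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lo_fence or x > hi_fence]

print((mn, q1, med, q3, mx))    # (2, 132, 136, 141, 249)
print(iqr, lo_fence, hi_fence)  # 9, 118.5, 154.5
print(outliers)                 # [2, 249]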


Q5
The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has Mf states. These ordered states define
the ranking 1,..., Mf. The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between
objects. Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the
following steps:

1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking 1,...,Mf. Replace each xif by its corresponding rank, rif
∈{1,...,Mf}.

2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This can be achieved by replacing the rank rif of the ith object in the fth variable by

zif = (rif - 1) / (Mf - 1)

3. Dissimilarity can then be computed using any of the distance measures described for interval-scaled variables, using zif to represent the f value for
the ith object.

Let the income be ranked as follows: 1 = <50K, 2 = 50K – 75K, 3 = 75K – 100K, 4 = >100K.

rif ∈ {1, 2, 3, 4}; zif = {0, 0.33, 0.67, 1.00}

Let the height be ranked as follows: 1 = Short, 2 = Normal, 3 = Tall.

rif ∈ {1, 2, 3}; zif = {0, 0.5, 1}

Let the weight be ranked as follows: 1 = Underweight, 2 = Normal, 3 = Overweight.

rif ∈ {1, 2, 3}; zif = {0, 0.5, 1}

1) Distance between C and E


The values after ranking and mapping are (Table 5.1): C = (1.00, 1.0, 1.0) and E = (0.33, 0.5, 1.0) for (Income, Height, Weight).

Now the distance (proximity) can be calculated using the Euclidean distance:

= SQRT((1 - 0.33)^2 + (1 - 0.5)^2 + (1 - 1)^2)

= SQRT((0.67)^2 + (0.5)^2)

= SQRT(0.4489 + 0.25)

= SQRT(0.6989) = 0.836

2) Distance between A and D

The values after ranking and mapping are (Table 5.2): A = (0, 0.5, 0) and D = (0.33, 0, 0.5) for (Income, Height, Weight).

Now the distance (proximity) can be calculated using the Euclidean distance:

= SQRT((0.33 - 0)^2 + (0 - 0.5)^2 + (0.5 - 0)^2)

= SQRT((0.33)^2 + (-0.5)^2 + (0.5)^2)

= SQRT(0.1089 + 0.25 + 0.25)

= SQRT(0.6089) = 0.78

The distance between A and D is smaller than the distance between C and E. In other words, A and D are more similar to each other than C and E are.
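A small sketch of the same computation in Python (it keeps the exact ranks rather than the two-decimal rounding used above, so the third decimal differs slightly):

import math

# Ordered states for each ordinal attribute (rank 1 = first element).
SCALES = {
    "Income": ["<50K", "50K-75K", "75K-100K", ">100K"],
    "Height": ["Short", "Normal", "Tall"],
    "Weight": ["Underweight", "Normal", "Overweight"],
}

objects = {
    "A": {"Income": "<50K",     "Height": "Normal", "Weight": "Underweight"},
    "B": {"Income": "75K-100K", "Height": "Short",  "Weight": "Normal"},
    "C": {"Income": ">100K",    "Height": "Tall",   "Weight": "Overweight"},
    "D": {"Income": "50K-75K",  "Height": "Short",  "Weight": "Normal"},
    "E": {"Income": "50K-75K",  "Height": "Normal", "Weight": "Overweight"},
}

def z(attr, value):
    # Map an ordinal value to [0, 1]: z = (rank - 1) / (M - 1).
    states = SCALES[attr]
    return states.index(value) / (len(states) - 1)

def distance(a, b):
    # Euclidean distance between two objects on the normalized ranks.
    return math.sqrt(sum((z(f, objects[a][f]) - z(f, objects[b][f])) ** 2 for f in SCALES))

print(round(distance("C", "E"), 3))  # ~0.833 (0.836 above with rounded ranks)
print(round(distance("A", "D"), 3))  # ~0.782 (0.78 above)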
Q6. Formula:

For Binary classification:

gini-index = 2 * p * (1 - p)

entropy = -[ p * log2(p) + (1 - p) * log2(1 - p) ]

misclassification error = p

where p = proportion of the minority class in the node.

At parent:{10,10}

p = 10/20 = 0.5

gini-index = 2 * 0.5 * 0.5 = 0.5

entropy = - [0.5 * log(2,0.5) + 0.5 * log(2,0.5) ] = 1

misclassification error = p = 0.5

At left child : {7,2}

p = 2 / 9 = 0.22

gini-index = 2 * 0.22 * 0.78 = 0.3432

entropy = - [0.22 * log(2,0.22) + 0.78 * log(2,0.78) ] = 0.76

misclassification error = p = 0.22

note: log(2,n) = log n to the base 2.

At right child : {3,8}

p = 3 / (3+8) = 0.27

gini-index = 2 * 0.27 * 0.73 = 0.394

entropy = - [0.27 * log(2,0.27) + 0.73 * log(2,0.73) ] = 0.84

misclassification error = p = 0.27

Since all of the impurity measures went down in both children compared with the parent node, the decision tree algorithm could consider taking this split.

Improvement in purity at the left child:

based on gini-index: 0.5 - 0.3432 = 0.1568

based on entropy: 1 - 0.76 = 0.24

based on misclassification error: 0.5 - 0.22 = 0.28

Improvement in purity at the right child:

based on gini-index: 0.5 - 0.394 = 0.106

based on entropy: 1 - 0.84 = 0.16

based on misclassification error: 0.5 - 0.27 = 0.23

Overall improvement for the split = parent impurity minus the weighted average of the children's impurities (weights 9/20 = 0.45 and 11/20 = 0.55):

based on gini-index: 0.5 - (0.45 * 0.3432 + 0.55 * 0.394) = 0.5 - 0.371 = 0.129

based on entropy (information gain): 1 - (0.45 * 0.76 + 0.55 * 0.84) = 1 - 0.804 = 0.196

based on misclassification error: 0.5 - (0.45 * 0.22 + 0.55 * 0.27) = 0.5 - 0.2475 ≈ 0.25
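
A compact sketch of the same calculation in Python (exact fractions rather than rounded values, so the gains differ slightly in the third decimal):

import math

def gini(p):        # Gini index for a binary node
    return 2 * p * (1 - p)

def entropy(p):     # entropy for a binary node (0 if the node is pure)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def misclass(p):    # misclassification error, p = minority-class proportion
    return min(p, 1 - p)

def gain(measure, parent, children):
    # parent = (n1, n2); children = list of (n1, n2); returns the weighted impurity decrease.
    def impurity(counts):
        n1, n2 = counts
        return measure(min(n1, n2) / (n1 + n2))
    total = sum(sum(c) for c in children)
    weighted = sum(sum(c) / total * impurity(c) for c in children)
    return impurity(parent) - weighted

parent, children = (10, 10), [(7, 2), (3, 8)]
for name, m in [("gini", gini), ("entropy", entropy), ("misclassification", misclass)]:
    print(name, round(gain(m, parent, children), 3))
# gini ~0.126, entropy ~0.191, misclassification 0.25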


Q7

The two main performance measures here are accuracy (ACC) and the F1 score. For classifier A: accuracy = (42 + 42) / 100 = 84%, precision = 42/50 = 0.84, recall = 42/50 = 0.84, so F1 = 0.8400. For classifier B: accuracy = (41 + 43) / 100 = 84%, precision = 41/48 = 0.854, recall = 41/50 = 0.82, so F1 = 2 * 0.854 * 0.82 / (0.854 + 0.82) = 0.8367. Both classifiers therefore have the same accuracy; on F1 for class 1, classifier A is marginally better, while classifier B trades a little recall for slightly higher precision.
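
A quick sketch to reproduce these numbers in Python (class 1 taken as the positive class):

def metrics(tp, fn, fp, tn):
    # Accuracy, precision, recall and F1 for the positive class (class 1).
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Rows = actual, columns = predicted; counts taken from the confusion matrices above.
classifier_a = metrics(tp=42, fn=8, fp=8, tn=42)
classifier_b = metrics(tp=41, fn=9, fp=7, tn=43)

print([round(x, 4) for x in classifier_a])  # [0.84, 0.84, 0.84, 0.84]
print([round(x, 4) for x in classifier_b])  # [0.84, 0.8542, 0.82, 0.8367]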
