Data Mining BITS-PILANI Mid Semester Sample

PCA helps reduce data dimensionality to identify patterns and correlations. FOIL_Prune assesses rule quality using the positive and negative tuples a rule covers. DIC dynamically adds and drops itemsets while the transactions are being read, reducing the number of passes over the data needed to count frequent itemsets.


1.

Provide brief answers to the following: (3 X 2 marks)


a. How does PCA (principal component analysis) impact data mining activity?
b. How can FOIL_Prune enhance quality of rules?
c. How can DIC (Dynamic Itemset Counting) enhance efficiency of Apriori algorithm?

2. You found that bread and cheese sell together often in a store. Individually bread sells more
often than cheese. Will you conclude that bread=>cheese or cheese=>bread? Why? (2 marks)

3. After mining a transaction database for frequent itemsets, there is one largest frequent itemset
of size 6. Let N, C, M be the total number of frequent itemsets, closed frequent itemsets, and
maximal frequent itemsets (including the one of size 6). What are the minimum values of N, C,
and M? (2 marks)

4. Consider the following observations of an attribute. Answer the following: (4 marks)

2, 124, 131, 132, 133, 135, 135, 136, 136, 136, 140, 141, 142, 143, 249

(a) What is the five-number summary for the given data?


(b) Draw boxplot for the data.
(c) Identify the outliers, if any.
(d) How do outliers impact the measures of central tendency (Mean, Mode and Median) of
data? Comment using the given data set.

5. Compare the proximity between the following pairs with supporting computations. (Assume all
attributes are ordinal) (4 marks):
a) C and E
b) A and D

Object   Income       Height   Weight
A        <50K         Normal   Underweight
B        75K – 100K   Short    Normal
C        >100K        Tall     Overweight
D        50K – 75K    Short    Normal
E        50K – 75K    Normal   Overweight

6. During decision tree induction for 2-class data, a node has {10, 10} objects of class1 and class2.
Upon split using an attribute A, two child nodes were created which have {7, 2} and {3, 8}
objects of class1 and class2. Calculate the improvement in (im)purity using Gini index, Entropy
and Misclassification error. (3 marks)
7. Given below are the confusion matrices of two classifiers, classifier A and classifier B, when
tested with the same dataset. The classification involved two classes Class 1 and Class 2.
Compare performance of 2 classifiers with supporting computations. (4 marks)

Classifier A (rows = Actual, columns = Predicted):

            Predicted 1   Predicted 2
Actual 1        42             8
Actual 2         8            42

Classifier B (rows = Actual, columns = Predicted):

            Predicted 1   Predicted 2
Actual 1        41             9
Actual 2         7            43

8. A market basket database has five transactions. Let minimum support = 60%.

TID items bought


T100 Bread, Butter, Beans, Potato, Jam, Milk
T200 Bread, Butter, Shampoo, Potato, Jam, Milk
T300 Beans, Soap, Butter, Bread
T400 Beans, Onion, Apple, Butter, Milk
T500 Apple, Banana, Jam, Bread, Butter

Find the frequent itemsets using the Apriori algorithm. Show all intermediate steps clearly.
[5 marks]
Answer 1. (a)
Principal Component Analysis (PCA) is used to reduce the dimensionality of large data sets.

It does this by transforming a large set of variables into a smaller set of new variables (principal components) that still retains most of the information contained in the original data.

Of all the dimensionality-reduction approaches, PCA is by far the most well known.

In data mining, reducing the dimensionality of a large data set makes the data much easier to work with: the components are chosen so that most of the variance in the data set is retained along the first few axes, which is what allows patterns and correlations to be identified with far less computation.

Answer 1. (b)

Rule pruning is an essential step in rule-based mining because it assesses whether a rule learned from the training data will also hold on subsequent data sets: a rule that merely fits the training data tends to perform worse when applied to new data.

FOIL_Prune is a simple and effective measure for guiding this pruning.

FOIL_Prune(R) = (T+ - T-) / (T+ + T-), where

T+ is the number of positive tuples covered by rule R, and

T- is the number of negative tuples covered by rule R.

The value of FOIL_Prune increases as the accuracy of rule R on the pruning set increases.

This means that if the value is higher for the pruned version of R, we go ahead and prune R (i.e. drop the attribute test).

Hence, pruning guided by FOIL_Prune improves the quality of the resulting rules.
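
A minimal sketch of this pruning decision in Python (illustrative only; foil_prune and should_prune are hypothetical helper names, and the T+ / T- counts would come from evaluating each rule version on the pruning set):

def foil_prune(pos, neg):
    # FOIL_Prune(R) = (T+ - T-) / (T+ + T-), with T+ = pos and T- = neg:
    # the positive and negative tuples covered by rule R on the pruning set.
    if pos + neg == 0:
        return 0.0  # the rule covers nothing; treat it as neutral
    return (pos - neg) / (pos + neg)

def should_prune(pos_full, neg_full, pos_pruned, neg_pruned):
    # Drop the attribute test if the pruned rule scores at least as high.
    return foil_prune(pos_pruned, neg_pruned) >= foil_prune(pos_full, neg_full)

# Hypothetical example: the full rule covers 40 positive / 10 negative tuples;
# after removing one attribute test it covers 45 positive / 11 negative tuples.
print(foil_prune(40, 10))            # 0.6
print(foil_prune(45, 11))            # ~0.607
print(should_prune(40, 10, 45, 11))  # True -> keep the shorter, pruned rule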

Answer 1.(c)

Dynamic Itemset Counting (DIC) is an alternative to Apriori's level-wise itemset generation.

In DIC, candidate itemsets are added (and their counting finished) dynamically while the transactions are being read, rather than only at the boundaries of full database passes.

As in Apriori, for an itemset to be frequent all of its subsets must also be frequent, so DIC only starts counting itemsets whose subsets have already been found (or are estimated) to be frequent.

DIC thus addresses a question Apriori cannot: when, during a pass, to start counting which itemsets.

This allows DIC to find all frequent itemsets in substantially fewer passes over the data than Apriori, which needs roughly one full pass per itemset length.

Data mining: Data mining is the process of extracting useful patterns and correlations from a large set of raw data using machine-learning methods, and of reducing the dimensionality of complex data sets.

A. How PCA impacts data mining: High-dimensional data is complex and expensive to process; redundant, correlated features increase computation time and make the analysis harder. PCA reduces the dimensionality of the data, which lets us identify the correlations and patterns in the data set and transform it into a data set of significantly lower dimension without losing the important information. Removing that redundancy is directly in line with the main goals of data mining.

Step-by-step PCA (see the sketch that follows):

1. Standardize the data.

2. Compute the covariance matrix.

3. Calculate the eigenvalues and eigenvectors of the covariance matrix.

4. Compute the principal components (eigenvectors ordered by decreasing eigenvalue).

5. Reduce the dimensionality of the data by projecting it onto the top components.
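
A minimal sketch of these five steps with NumPy (illustrative only; the matrix X is assumed to hold samples in rows and features in columns):

import numpy as np

def pca(X, n_components=2):
    # 1. Standardize the data (zero mean, unit variance per feature).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Compute the covariance matrix of the features.
    cov = np.cov(X_std, rowvar=False)
    # 3. Calculate eigenvalues and eigenvectors of the (symmetric) covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Principal components = eigenvectors sorted by decreasing eigenvalue.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    # 5. Reduce the dimensionality by projecting the data onto those components.
    return X_std @ components

# Hypothetical example: 100 samples with 5 features reduced to 2 dimensions.
X = np.random.rand(100, 5)
print(pca(X, n_components=2).shape)  # (100, 2)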

B. FOIL_Prune: Assessments of rule quality are made with tuples from the original training data.

A rule R is pruned by removing an attribute test (conjunct); we choose to prune R if the pruned version has higher quality, as assessed on a separate pruning data set. FOIL_Prune is a simple and effective measure for this assessment.

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R. The value of FOIL_Prune increases with the accuracy of R on the pruning set, so it favours rules that are accurate and cover many positive tuples. If FOIL_Prune is higher for the pruned version of R, then we prune R.

C. Although the Apriori algorithm is easy to implement, it needs many database scans, which hurts overall performance when the database is very large, and many variations of Apriori have been proposed to improve its efficiency. Dynamic Itemset Counting reduces the number of passes made over the data while keeping the number of itemsets that are counted low. In DIC the database is partitioned into blocks marked by start points, and each candidate itemset maintains its count so far toward the minimum support. Counting of a new candidate can begin at any start point, and once an itemset crosses the minimum support it is added to the collection of frequent itemsets and used to generate longer candidate itemsets. By starting (and finishing) the counting of itemsets dynamically at these block boundaries instead of once per full pass, DIC finds the frequent itemsets in fewer passes, which increases the algorithm's efficiency.
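
For contrast, a minimal sketch of the baseline level-wise Apriori in Python (illustrative only, not an implementation of DIC itself); it makes one full pass over the transactions for every candidate length, which is exactly the cost DIC reduces:

from itertools import combinations

def apriori(transactions, min_support):
    # transactions: list of sets of items; min_support: absolute count threshold.
    # Pass 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: v for k, v in counts.items() if v >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune with the
        # Apriori property (every (k-1)-subset must itself be frequent).
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                    frozenset(s) in frequent for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One more full pass over the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

# Example with the five transactions of question 8 (min support 60% of 5 = 3).
db = [
    {"Bread", "Butter", "Beans", "Potato", "Jam", "Milk"},
    {"Bread", "Butter", "Shampoo", "Potato", "Jam", "Milk"},
    {"Beans", "Soap", "Butter", "Bread"},
    {"Beans", "Onion", "Apple", "Butter", "Milk"},
    {"Apple", "Banana", "Jam", "Bread", "Butter"},
]
for itemset, count in sorted(apriori(db, 3).items(), key=lambda x: (len(x[0]), -x[1])):
    print(sorted(itemset), count)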
Q2.
Both candidate rules have the same support, since support(bread => cheese) = support(cheese => bread) = sup({bread, cheese}). The choice therefore rests on confidence:

confidence(bread => cheese) = sup({bread, cheese}) / sup({bread})

confidence(cheese => bread) = sup({bread, cheese}) / sup({cheese})

Since bread individually sells more often than cheese, sup({bread}) > sup({cheese}), and therefore confidence(cheese => bread) > confidence(bread => cheese).

So we conclude cheese => bread: a customer who buys cheese is very likely to also buy bread, whereas a large share of bread purchases happen without cheese.


Q3.
The largest frequent itemset has size k = 6.

To find: the minimum values of N (frequent itemsets), C (closed frequent itemsets) and M (maximal frequent itemsets).

Solution:

By the Apriori (anti-monotone) property, every non-empty subset of a frequent itemset is also frequent. The size-6 itemset therefore forces at least 2^6 - 1 = 63 frequent itemsets, so the minimum value of N is 63.

The size-6 itemset has no frequent superset (it is the largest frequent itemset), so it is maximal; in the minimum case every other frequent itemset is one of its subsets and hence not maximal, so the minimum value of M is 1.

A frequent itemset is closed if no proper superset has the same support. In the extreme case where all 63 itemsets have exactly the same support as the size-6 itemset, only the size-6 itemset is closed, so the minimum value of C is 1.


Q4.
The data is already in ascending order (15 observations):

2, 124, 131, 132, 133, 135, 135, 136, 136, 136, 140, 141, 142, 143, 249

minimum = 2

maximum = 249

Median = middle (8th) value = 136

Lower half (the 7 values below the median) = 2, 124, 131, 132, 133, 135, 135, so Q1 = 132

Upper half (the 7 values above the median) = 136, 136, 140, 141, 142, 143, 249, so Q3 = 141

a.) Five-number summary of the given data:

minimum = 2, Q1 = 132, median = 136, Q3 = 141, maximum = 249

b.) Boxplot for the data (sketch): the box runs from Q1 = 132 to Q3 = 141 with the median line at 136; the whiskers extend to the smallest and largest non-outlier values (124 and 143); the outliers 2 and 249 are plotted as individual points beyond the whiskers.

c.) Outliers:

IQR = Q3 - Q1 = 141 - 132 = 9

1.5 * IQR = 13.5

Lower fence = Q1 - 13.5 = 118.5

Upper fence = Q3 + 13.5 = 154.5

2 < 118.5 and 249 > 154.5, so both 2 and 249 lie outside the interval (118.5, 154.5) and are outliers.

d.) Outliers have a significant impact on the mean but little or no impact on the median and mode. For the given data: mean = 2015 / 15 ≈ 134.3, median = 136, mode = 136. If the outliers 2 and 249 are removed: mean = 1764 / 13 ≈ 135.7 (changed), median = 136 (same), mode = 136 (same).
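
A short sketch that reproduces these numbers in Python (the quartile convention excludes the median from both halves, as above):

import statistics

data = [2, 124, 131, 132, 133, 135, 135, 136, 136, 136, 140, 141, 142, 143, 249]

def five_number_summary(values):
    # Minimum, Q1, median, Q3, maximum, with the median excluded from both halves.
    s = sorted(values)
    n = len(s)
    median = statistics.median(s)
    lower = s[: n // 2]        # values below the median position
    upper = s[(n + 1) // 2 :]  # values above the median position
    return min(s), statistics.median(lower), median, statistics.median(upper), max(s)

mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lo_fence or x > hi_fence]

print((mn, q1, med, q3, mx))    # (2, 132, 136, 141, 249)
print(iqr, lo_fence, hi_fence)  # 9, 118.5, 154.5
print(outliers)                 # [2, 249]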


Q5
The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has Mf states. These ordered states define
the ranking 1,..., Mf. The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between
objects. Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the
following steps:

1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking 1,...,Mf. Replace each xif by its corresponding rank, rif
∈{1,...,Mf}.

2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This can be achieved by replacing the rank rif of the ith object in the fth variable by

zif = (rif - 1) / (Mf - 1)

3. Dissimilarity can then be computed using any of the distance measures described for interval-scaled variables, using zif to represent the f value for
the ith object.

Let the income be ranked as follows: 1 = <50K, 2 = 50K – 75K, 3 = 75K – 100K, 4 = >100K.

rif ∈ {1, 2, 3, 4}; zif = {0, 0.33, 0.67, 1.00}

Let the height be ranked as follows: 1 = Short, 2 = Normal, 3 = Tall.

rif ∈ {1, 2, 3}; zif = {0, 0.5, 1}

Let the weight be ranked as follows: 1 = Underweight, 2 = Normal, 3 = Overweight.

rif ∈ {1, 2, 3}; zif = {0, 0.5, 1}

1) Distance between C and E


The values after ranking and mapping are (Table 5.1): C = (1.00, 1.0, 1.0) and E = (0.33, 0.5, 1.0) for (Income, Height, Weight).

Now the distance (proximity) can be calculated using the Euclidean distance:

= SQRT((1 - 0.33)^2 + (1 - 0.5)^2 + (1 - 1)^2)

= SQRT((0.67)^2 + (0.5)^2)

= SQRT(0.4489 + 0.25)

= SQRT(0.6989) = 0.836

2) Distance between A and D

The values after ranking and mapping are (Table 5.2): A = (0, 0.5, 0) and D = (0.33, 0, 0.5) for (Income, Height, Weight).

Now the distance (proximity) can be calculated using the Euclidean distance:

= SQRT((0.33 - 0)^2 + (0 - 0.5)^2 + (0.5 - 0)^2)

= SQRT((0.33)^2 + (-0.5)^2 + (0.5)^2)

= SQRT(0.1089 + 0.25 + 0.25)

= SQRT(0.6089) = 0.78

The distance between A and D is smaller than the distance between C and E. In other words, A and D are more similar to each other than C and E are.
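A small sketch of the same computation in Python (it keeps the exact ranks rather than the two-decimal rounding used above, so the third decimal differs slightly):

import math

# Ordered states for each ordinal attribute (rank 1 = first element).
SCALES = {
    "Income": ["<50K", "50K-75K", "75K-100K", ">100K"],
    "Height": ["Short", "Normal", "Tall"],
    "Weight": ["Underweight", "Normal", "Overweight"],
}

objects = {
    "A": {"Income": "<50K",     "Height": "Normal", "Weight": "Underweight"},
    "B": {"Income": "75K-100K", "Height": "Short",  "Weight": "Normal"},
    "C": {"Income": ">100K",    "Height": "Tall",   "Weight": "Overweight"},
    "D": {"Income": "50K-75K",  "Height": "Short",  "Weight": "Normal"},
    "E": {"Income": "50K-75K",  "Height": "Normal", "Weight": "Overweight"},
}

def z(attr, value):
    # Map an ordinal value to [0, 1]: z = (rank - 1) / (M - 1).
    states = SCALES[attr]
    return states.index(value) / (len(states) - 1)

def distance(a, b):
    # Euclidean distance between two objects on the normalized ranks.
    return math.sqrt(sum((z(f, objects[a][f]) - z(f, objects[b][f])) ** 2 for f in SCALES))

print(round(distance("C", "E"), 3))  # ~0.833 (0.836 above with rounded ranks)
print(round(distance("A", "D"), 3))  # ~0.782 (0.78 above)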
Q6. Formula:

For Binary classification:

gini-index = 2 * p * (1 - p)

entropy = -[ p * log2(p) + (1 - p) * log2(1 - p) ]

misclassification error = p

where p = proportion of the minority class in the node.

At parent:{10,10}

p = 10/20 = 0.5

gini-index = 2 * 0.5 * 0.5 = 0.5

entropy = - [0.5 * log(2,0.5) + 0.5 * log(2,0.5) ] = 1

misclassification error = p = 0.5

At left child : {7,2}

p = 2 / 9 = 0.22

gini-index = 2 * 0.22 * 0.78 = 0.3432

entropy = - [0.22 * log(2,0.22) + 0.78 * log(2,0.78) ] = 0.76

misclassification error = p = 0.22

note: log(2,n) = log n to the base 2.

At right child : {3,8}

p = 3 / (3+8) = 0.27

gini-index = 2 * 0.27 * 0.73 = 0.394

entropy = - [0.27 * log(2,0.27) + 0.73 * log(2,0.73) ] = 0.84

misclassification error = p = 0.27

Since all of the impurity measures went down in both children compared with the parent node, the decision tree algorithm could consider taking this split.

Improvement in purity at the left child:

based on gini-index: 0.5 - 0.3432 = 0.1568

based on entropy: 1 - 0.76 = 0.24

based on misclassification error: 0.5 - 0.22 = 0.28

Improvement in purity at the right child:

based on gini-index: 0.5 - 0.394 = 0.106

based on entropy: 1 - 0.84 = 0.16

based on misclassification error: 0.5 - 0.27 = 0.23

Overall improvement for the split = parent impurity minus the weighted average of the children's impurities (weights 9/20 = 0.45 and 11/20 = 0.55):

based on gini-index: 0.5 - (0.45 * 0.3432 + 0.55 * 0.394) = 0.5 - 0.371 = 0.129

based on entropy (information gain): 1 - (0.45 * 0.76 + 0.55 * 0.84) = 1 - 0.804 = 0.196

based on misclassification error: 0.5 - (0.45 * 0.22 + 0.55 * 0.27) = 0.5 - 0.2475 ≈ 0.25
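
A compact sketch of the same calculation in Python (exact fractions rather than rounded values, so the gains differ slightly in the third decimal):

import math

def gini(p):        # Gini index for a binary node
    return 2 * p * (1 - p)

def entropy(p):     # entropy for a binary node (0 if the node is pure)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def misclass(p):    # misclassification error, p = minority-class proportion
    return min(p, 1 - p)

def gain(measure, parent, children):
    # parent = (n1, n2); children = list of (n1, n2); returns the weighted impurity decrease.
    def impurity(counts):
        n1, n2 = counts
        return measure(min(n1, n2) / (n1 + n2))
    total = sum(sum(c) for c in children)
    weighted = sum(sum(c) / total * impurity(c) for c in children)
    return impurity(parent) - weighted

parent, children = (10, 10), [(7, 2), (3, 8)]
for name, m in [("gini", gini), ("entropy", entropy), ("misclassification", misclass)]:
    print(name, round(gain(m, parent, children), 3))
# gini ~0.126, entropy ~0.191, misclassification 0.25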


Q7

The two main performance measures here are accuracy (ACC) and the F1 score. For classifier A: accuracy = (42 + 42) / 100 = 84%, precision = 42/50 = 0.84, recall = 42/50 = 0.84, so F1 = 0.8400. For classifier B: accuracy = (41 + 43) / 100 = 84%, precision = 41/48 = 0.854, recall = 41/50 = 0.82, so F1 = 2 * 0.854 * 0.82 / (0.854 + 0.82) = 0.8367. Both classifiers therefore have the same accuracy; on F1 for class 1, classifier A is marginally better, while classifier B trades a little recall for slightly higher precision.
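
A quick sketch to reproduce these numbers in Python (class 1 taken as the positive class):

def metrics(tp, fn, fp, tn):
    # Accuracy, precision, recall and F1 for the positive class (class 1).
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Rows = actual, columns = predicted; counts taken from the confusion matrices above.
classifier_a = metrics(tp=42, fn=8, fp=8, tn=42)
classifier_b = metrics(tp=41, fn=9, fp=7, tn=43)

print([round(x, 4) for x in classifier_a])  # [0.84, 0.84, 0.84, 0.84]
print([round(x, 4) for x in classifier_b])  # [0.84, 0.8542, 0.82, 0.8367]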
