
Introduction to Data Mining

Lab 2: Evaluation

Nguyễn Công Sáng – ITITIU20292


2.1. Be a classifier

In this second class we learn how to use datasets to evaluate data mining algorithms in
Weka (see the Class 2 lectures by Ian H. Witten [1]).

Interactive decision tree construction

➔ Follow the instructions in [1] to see how decision trees are created for different combinations of
attributes in a dataset. First, a training dataset and a supplied test set are selected. Second, we
choose UserClassifier, start it, and view the decision tree in the Tree Visualizer. Third, in the
Data Visualizer we select the attributes to use for X and Y, select the instances in a region of the
plot, and submit them. At this point the Tree Visualizer shows the updated tree.
➔ Examine the segment-challenge dataset to draw a decision tree for the following pair of attributes
by selecting and submitting regions class by class, then note how many instances are predicted
correctly.

[1] Ian H. Witten, Data Mining with Weka (MOOC), http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

Attributes: split on region-centroid-row and intensity-mean
Decision tree: (tree built in the Tree Visualizer; image not reproduced here)

Remark
The number of instances predicted correctly for each class is as follows:

Brickface (a): 124 instances predicted correctly.
Sky (b): 110 instances predicted correctly.
Foliage (c): 95 instances predicted correctly.
Cement (d): 94 instances predicted correctly.
Window (e): 38 instances predicted correctly.
Path (f): 93 instances predicted correctly.
Grass (g): 122 instances predicted correctly.

These counts sum to 676 of the 810 test instances predicted correctly (about 83.5%) across all classes.

Build a tree: what strategy do you use?

To build a decision tree, various strategies can be employed. A common one is recursive
partitioning: at each step the algorithm selects the attribute that best splits the data into subsets
that are as pure as possible with respect to the target variable. The purity of the subsets is
typically measured with metrics such as entropy, Gini impurity, or classification error.
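To make these measures concrete, here is a small self-contained Java sketch (not part of the lab and not Weka code) that computes the entropy and Gini impurity of a node's class distribution; the class counts are hypothetical:

    // Purity measures for a candidate split; the class counts are made up.
    public class Purity {

        // Entropy: -sum p_i * log2(p_i); 0 for a pure node, maximal when uniform.
        static double entropy(int[] counts) {
            double total = 0, h = 0;
            for (int c : counts) total += c;
            for (int c : counts) {
                if (c == 0) continue;
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        // Gini impurity: 1 - sum p_i^2; also 0 for a pure node.
        static double gini(int[] counts) {
            double total = 0, s = 0;
            for (int c : counts) total += c;
            for (int c : counts) {
                double p = c / total;
                s += p * p;
            }
            return 1.0 - s;
        }

        public static void main(String[] args) {
            int[] node = {40, 10};  // hypothetical node: 40 of class A, 10 of class B
            System.out.printf("entropy = %.3f%n", entropy(node)); // ~0.722 bits
            System.out.printf("gini    = %.3f%n", gini(node));    // ~0.320
        }
    }

A pure node (all instances in one class) scores 0 on both measures, so the splitting strategy prefers regions that drive these values down.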

Can you build a “perfect” tree?

While it's theoretically possible to build a "perfect" decision tree that perfectly classifies every instance
in the training data, achieving this in practice may not be feasible or desirable due to overfitting.
Overfitting occurs when the tree captures noise or random fluctuations in the training data, making it
perform poorly on unseen data.

2.2. Training and testing


See the lecture on evaluation ([1]-2.2).

Follow the instructions in [1]-2.3: use J48 to analyze the segment-challenge dataset, and record the
accuracy it achieves with different random seeds. (If a random number seed is provided, the dataset
is shuffled before the training subset is extracted.)

Random seed   Accuracy (proportion correct)
1             0.967
2             0.940
3             0.940
4             0.967
5             0.953
6             0.967
7             0.920
8             0.940
9             0.933
10            0.947

Sample mean          0.9474
Standard deviation   0.015

Remark: the classifier's performance is quite consistent, with a high average accuracy and low
variance across the different random seeds. This suggests that the classifier is robust and not
heavily influenced by the particular seed, indicating stable performance across runs.
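The table above was produced in the Weka Explorer. A roughly equivalent run through Weka's Java API is sketched below; the file path and the 90% training split are assumptions chosen to match the lecture setup, so the individual accuracies may differ slightly from the Explorer's percentage-split output:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepeatedHoldout {
        public static void main(String[] args) throws Exception {
            // Assumed location of the dataset; adjust to your installation.
            Instances data = new DataSource("data/segment-challenge.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            double sum = 0;
            for (int seed = 1; seed <= 10; seed++) {
                // Shuffle with the given seed before extracting the subsets.
                Instances shuffled = new Instances(data);
                shuffled.randomize(new Random(seed));

                int trainSize = (int) Math.round(shuffled.numInstances() * 0.9);
                Instances train = new Instances(shuffled, 0, trainSize);
                Instances test = new Instances(shuffled, trainSize,
                        shuffled.numInstances() - trainSize);

                J48 tree = new J48();
                tree.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);
                double acc = eval.pctCorrect() / 100.0;
                System.out.printf("seed %2d: %.3f%n", seed, acc);
                sum += acc;
            }
            System.out.printf("sample mean: %.4f%n", sum / 10);
        }
    }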

2.3. Baseline accuracy


Follow the instructions in [1]-2.4 to run some classifiers on the diabetes dataset:

Classifier    Accuracy (%)
J48           76.2452
NaiveBayes    77.0115
IBk           72.7969
PART          74.3295
ZeroR         65.1042
What is baseline accuracy?

Baseline accuracy is the accuracy achieved by a trivial (baseline) classifier that always predicts the
majority class in the dataset. In binary classification tasks the majority class is often the negative
class and the minority class the positive class.

For example, if 70% of the instances in a dataset belong to class A and 30% to class B, the baseline
accuracy is 70%: a trivial classifier that always predicts class A is correct 70% of the time. This is
what ZeroR does above: in the diabetes dataset, 500 of the 768 instances belong to the majority class
(tested_negative), and 500/768 ≈ 65.10%, matching ZeroR's accuracy in the table.
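A minimal sketch of this arithmetic in Java, using the well-known diabetes class counts (500 tested_negative, 268 tested_positive):

    // Baseline (ZeroR-style) accuracy: always predict the majority class.
    public class Baseline {
        public static void main(String[] args) {
            int[] classCounts = {500, 268}; // diabetes: tested_negative, tested_positive
            int total = 0, majority = 0;
            for (int c : classCounts) {
                total += c;
                majority = Math.max(majority, c);
            }
            // Prints 0.6510, matching ZeroR's 65.1042% in the table above.
            System.out.printf("baseline accuracy = %.4f%n", (double) majority / total);
        }
    }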

For the supermarket dataset:

Classifier    Accuracy (%)
ZeroR         63.713
J48           62.6828
NaiveBayes    62.6828
IBk           38.2708
PART          62.6828

Why do the classifiers achieve lower accuracy?

On this dataset none of the learned classifiers beats the ZeroR baseline of 63.713%: J48,
NaiveBayes, and PART all fall slightly below it, and IBk does far worse. When no classifier improves
on simply predicting the majority class, the most likely explanation is that the attributes carry
little predictive information about the class. Data quality is a related factor: a noisy or
inconsistent dataset makes it hard for classifiers to learn meaningful patterns.

2.4. Cross-validation
The holdout procedure: a certain amount of the data is held over for testing and the remainder is used for training.

Stratification: each class is properly represented in both the training and test sets.

The repeated holdout method of error rate estimation: In each iteration a certain proportion, say
two-thirds, of the data is randomly selected for training (using different random-number seeds),
possibly with stratification, and the remainder is used for testing. The error rates on the different
iterations are averaged to yield an overall error rate.

See the lectures on cross-validation, 10-fold cross-validation, and stratified cross-validation ([1]-2.5).

In cross-validation, you decide on a fixed number of folds, or partitions, of the data. Suppose we
use three. Then the data is split into three approximately equal partitions; each in turn is used
for testing and the remainder is used for training. That is, use two-thirds of the data for training
and one-third for testing, and repeat the procedure three times so that in the end, every instance
has been used exactly once for testing. This is called three-fold cross-validation, and if
stratification is adopted as well—which it often is—it is stratified three-fold cross-validation.

Weka does stratified cross-validation by default.
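In Weka's Java API, stratified cross-validation is a single call on an Evaluation object. A minimal sketch (the file path is an assumption):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("data/diabetes.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Stratifies the folds, then trains and tests 10 times,
            // so every instance is used exactly once for testing.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.printf("10-fold CV accuracy: %.4f%%%n", eval.pctCorrect());
        }
    }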

Follow the instructions in [1]-2.5, and examine J48 on the diabetes dataset:

Seed   Holdout (10% test), accuracy (%)   10-fold cross-validation, accuracy (%)
1      75.3247                            73.8281
2      77.9221                            75
3      80.5195                            75.5208
4      74.026                             75.5208
5      71.4286                            74.349
6      70.1299                            75.651
7      79.2208                            73.5677
8      71.4286                            73.9583
9      80.5195                            74.4792
10     67.5325                            73.0469

Sample mean          74.8                 74.5
Standard deviation   4.39                 0.9

Examine PART on the diabetes dataset:

Seed   Holdout (10% test), accuracy (%)   10-fold cross-validation, accuracy (%)
1      75.3247                            75.2604
2      75.3247                            73.0469
3      71.4286                            72.7865
4      72.7273                            74.8698
5      77.9221                            74.2188
6      71.4286                            73.0469
7      74.026                             73.4375
8      68.8312                            71.875
9      75.3247                            74.6904
10     66.2338                            71.3542

Sample mean          72.9                 73.5
Standard deviation   3.3                  1.2

Examine NaiveBayes on the diabetes dataset:

Seed   Holdout (10% test), accuracy (%)   10-fold cross-validation, accuracy (%)
1      77.9221                            76.3021
2      75.3247                            75.2604
3      72.7273                            76.1719
4      68.8312                            75.5208
5      80.5195                            75.1302
6      76.6234                            75.7813
7      76.6234                            76.1719
8      74.026                             75.2604
9      76.6234                            76.0417
10     71.4286                            75.9115

Sample mean          75.1                 75.7
Standard deviation   3.2                  0.4

Remark: for all three classifiers the two procedures give similar mean accuracies, but 10-fold
cross-validation has a much smaller standard deviation (0.4 to 1.2, versus 3.2 to 4.4 for the 10%
holdout), so it provides a noticeably more reliable estimate of performance.
