
Introduction to Data Mining

Lab 2: Evaluation

Nguyễn Công Sáng – ITITIU20292


2.1. Be a classifier

In this second class we learn how to use datasets to evaluate data mining algorithms in
Weka (see the Class 2 lectures by Ian H. Witten [1]).

Interactive decision tree construction

➔ Follow the instructions in [1] to see how decision trees are created for different combinations of
attributes in a dataset. First, a training dataset and a supplied test set are selected. Second, we
choose UserClassifier, start it, and view the decision tree in the Tree Visualizer. Third, in the
Data Visualizer we select the attributes to use for X and Y, select the instances in a region of the
plot, and submit them. At this point the Tree Visualizer shows the updated tree.
➔ Examine the segment-challenge dataset to draw a decision tree for the following pair of attributes
by selecting and submitting regions class by class, then note how many instances are predicted
correctly.

[1] Ian H. Witten, Data Mining with Weka (MOOC), http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

Attributes: split on region-centroid-row and intensity-mean
Decision tree: (tree built in the Tree Visualizer; image not reproduced here)

Remark
The number of instances predicted correctly for each class is as follows:

Brickface (a): 124 instances predicted correctly.
Sky (b): 110 instances predicted correctly.
Foliage (c): 95 instances predicted correctly.
Cement (d): 94 instances predicted correctly.
Window (e): 38 instances predicted correctly.
Path (f): 93 instances predicted correctly.
Grass (g): 122 instances predicted correctly.

These counts sum to 676 of the 810 test instances predicted correctly (about 83.5%) across all classes.

Build a tree: what strategy do you use?

To build a decision tree, various strategies can be employed. A common one is recursive
partitioning: at each step the algorithm selects the attribute that best splits the data into subsets
that are as pure as possible with respect to the target variable. The purity of the subsets is
typically measured with metrics such as entropy, Gini impurity, or classification error.
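To make these measures concrete, here is a small self-contained Java sketch (not part of the lab and not Weka code) that computes the entropy and Gini impurity of a node's class distribution; the class counts are hypothetical:

    // Purity measures for a candidate split; the class counts are made up.
    public class Purity {

        // Entropy: -sum p_i * log2(p_i); 0 for a pure node, maximal when uniform.
        static double entropy(int[] counts) {
            double total = 0, h = 0;
            for (int c : counts) total += c;
            for (int c : counts) {
                if (c == 0) continue;
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        // Gini impurity: 1 - sum p_i^2; also 0 for a pure node.
        static double gini(int[] counts) {
            double total = 0, s = 0;
            for (int c : counts) total += c;
            for (int c : counts) {
                double p = c / total;
                s += p * p;
            }
            return 1.0 - s;
        }

        public static void main(String[] args) {
            int[] node = {40, 10};  // hypothetical node: 40 of class A, 10 of class B
            System.out.printf("entropy = %.3f%n", entropy(node)); // ~0.722 bits
            System.out.printf("gini    = %.3f%n", gini(node));    // ~0.320
        }
    }

A pure node (all instances in one class) scores 0 on both measures, so the splitting strategy prefers regions that drive these values down.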

Can you build a “perfect” tree?

While it's theoretically possible to build a "perfect" decision tree that perfectly classifies every instance
in the training data, achieving this in practice may not be feasible or desirable due to overfitting.
Overfitting occurs when the tree captures noise or random fluctuations in the training data, making it
perform poorly on unseen data.

2.2. Training and testing


See the lecture on evaluation ([1]-2.2).

Follow the instructions in [1]-2.3: use J48 to analyze the segment-challenge dataset, and record the
accuracy it achieves with different random seeds. (If a random number seed is provided, the dataset
is shuffled before the training subset is extracted.)

Random seed   Accuracy (proportion correct)
1             0.967
2             0.940
3             0.940
4             0.967
5             0.953
6             0.967
7             0.920
8             0.940
9             0.933
10            0.947

Sample mean          0.9474
Standard deviation   0.015

Remark: the classifier's performance is quite consistent, with a high average accuracy and low
variance across the different random seeds. This suggests that the classifier is robust and not
heavily influenced by the particular seed, indicating stable performance across runs.
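The table above was produced in the Weka Explorer. A roughly equivalent run through Weka's Java API is sketched below; the file path and the 90% training split are assumptions chosen to match the lecture setup, so the individual accuracies may differ slightly from the Explorer's percentage-split output:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepeatedHoldout {
        public static void main(String[] args) throws Exception {
            // Assumed location of the dataset; adjust to your installation.
            Instances data = new DataSource("data/segment-challenge.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            double sum = 0;
            for (int seed = 1; seed <= 10; seed++) {
                // Shuffle with the given seed before extracting the subsets.
                Instances shuffled = new Instances(data);
                shuffled.randomize(new Random(seed));

                int trainSize = (int) Math.round(shuffled.numInstances() * 0.9);
                Instances train = new Instances(shuffled, 0, trainSize);
                Instances test = new Instances(shuffled, trainSize,
                        shuffled.numInstances() - trainSize);

                J48 tree = new J48();
                tree.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);
                double acc = eval.pctCorrect() / 100.0;
                System.out.printf("seed %2d: %.3f%n", seed, acc);
                sum += acc;
            }
            System.out.printf("sample mean: %.4f%n", sum / 10);
        }
    }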

2.3. Baseline accuracy


Follow the instructions in [1]-2.4 to run some classifiers on the diabetes dataset:

Classifier    Accuracy (%)
J48           76.2452
NaiveBayes    77.0115
IBk           72.7969
PART          74.3295
ZeroR         65.1042
What is baseline accuracy?

Baseline accuracy is the accuracy achieved by a trivial (baseline) classifier that always predicts the
majority class in the dataset. In binary classification tasks the majority class is often the negative
class and the minority class the positive class.

For example, if 70% of the instances in a dataset belong to class A and 30% to class B, the baseline
accuracy is 70%: a trivial classifier that always predicts class A is correct 70% of the time. This is
what ZeroR does above: in the diabetes dataset, 500 of the 768 instances belong to the majority class
(tested_negative), and 500/768 ≈ 65.10%, matching ZeroR's accuracy in the table.
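A minimal sketch of this arithmetic in Java, using the well-known diabetes class counts (500 tested_negative, 268 tested_positive):

    // Baseline (ZeroR-style) accuracy: always predict the majority class.
    public class Baseline {
        public static void main(String[] args) {
            int[] classCounts = {500, 268}; // diabetes: tested_negative, tested_positive
            int total = 0, majority = 0;
            for (int c : classCounts) {
                total += c;
                majority = Math.max(majority, c);
            }
            // Prints 0.6510, matching ZeroR's 65.1042% in the table above.
            System.out.printf("baseline accuracy = %.4f%n", (double) majority / total);
        }
    }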

For the supermarket dataset:

Classifier    Accuracy (%)
ZeroR         63.713
J48           62.6828
NaiveBayes    62.6828
IBk           38.2708
PART          62.6828

Why do the classifiers achieve lower accuracy?

On this dataset none of the learned classifiers beats the ZeroR baseline of 63.713%: J48,
NaiveBayes, and PART all fall slightly below it, and IBk does far worse. When no classifier improves
on simply predicting the majority class, the most likely explanation is that the attributes carry
little predictive information about the class. Data quality is a related factor: a noisy or
inconsistent dataset makes it hard for classifiers to learn meaningful patterns.

2.4. Cross-validation
The holdout procedure: a certain amount of the data is held over for testing and the remainder is used for training.

Stratification: each class is properly represented in both the training and test sets.

The repeated holdout method of error rate estimation: In each iteration a certain proportion, say
two-thirds, of the data is randomly selected for training (using different random-number seeds),
possibly with stratification, and the remainder is used for testing. The error rates on the different
iterations are averaged to yield an overall error rate.

See the lectures on cross-validation, 10-fold cross-validation, and stratified cross-validation ([1]-2.5).

In cross-validation, you decide on a fixed number of folds, or partitions, of the data. Suppose we
use three. Then the data is split into three approximately equal partitions; each in turn is used
for testing and the remainder is used for training. That is, use two-thirds of the data for training
and one-third for testing, and repeat the procedure three times so that in the end, every instance
has been used exactly once for testing. This is called three-fold cross-validation, and if
stratification is adopted as well—which it often is—it is stratified three-fold cross-validation.

Weka does stratified cross-validation by default.
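In Weka's Java API, stratified cross-validation is a single call on an Evaluation object. A minimal sketch (the file path is an assumption):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("data/diabetes.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Stratifies the folds, then trains and tests 10 times,
            // so every instance is used exactly once for testing.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.printf("10-fold CV accuracy: %.4f%%%n", eval.pctCorrect());
        }
    }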

Follow the instructions in [1]-2.5, and examine J48 on the diabetes dataset:

Seed   Holdout (10% test), accuracy (%)   10-fold cross-validation, accuracy (%)
1      75.3247                            73.8281
2      77.9221                            75
3      80.5195                            75.5208
4      74.026                             75.5208
5      71.4286                            74.349
6      70.1299                            75.651
7      79.2208                            73.5677
8      71.4286                            73.9583
9      80.5195                            74.4792
10     67.5325                            73.0469

Sample mean          74.8                 74.5
Standard deviation   4.39                 0.9

Examine PART on the diabetes dataset:

Seed   Holdout (10% test), accuracy (%)   10-fold cross-validation, accuracy (%)
1      75.3247                            75.2604
2      75.3247                            73.0469
3      71.4286                            72.7865
4      72.7273                            74.8698
5      77.9221                            74.2188
6      71.4286                            73.0469
7      74.026                             73.4375
8      68.8312                            71.875
9      75.3247                            74.6904
10     66.2338                            71.3542

Sample mean          72.9                 73.5
Standard deviation   3.3                  1.2

Examine NaiveBayes on the diabetes dataset:

Seed   Holdout (10% test), accuracy (%)   10-fold cross-validation, accuracy (%)
1      77.9221                            76.3021
2      75.3247                            75.2604
3      72.7273                            76.1719
4      68.8312                            75.5208
5      80.5195                            75.1302
6      76.6234                            75.7813
7      76.6234                            76.1719
8      74.026                             75.2604
9      76.6234                            76.0417
10     71.4286                            75.9115

Sample mean          75.1                 75.7
Standard deviation   3.2                  0.4

Remark: for all three classifiers the two procedures give similar mean accuracies, but 10-fold
cross-validation has a much smaller standard deviation (0.4 to 1.2, versus 3.2 to 4.4 for the 10%
holdout), so it provides a noticeably more reliable estimate of performance.
