Data Mining - Weka 3.6.0
JAYAKODY N. M.
2014/E/011
SEMESTER 7
12/02/2018
1. BRIEF DESCRIPTION ABOUT THE DATASET
The training data set contains 280 attributes, including the class label; examples are age, sex,
height and weight. Some of the attributes are defined as Numeric while the rest are defined as
Nominal. Numeric attributes hold numeric values, while the nominal attributes, apart from the
16-valued class label, hold values of either zero or one. Numeric attributes include age, height,
weight, etc., and nominal attributes include sex, chDI_RRwaveExists, the class label, etc.
The data set contains 16 classes, numbered from 1 to 16 (labelled a to p in Table 1). The training
data set contains the attribute values of 480 patients, but certain attributes, such as T, P and
QRST, contain missing values.
The following table shows the number of instances of each class.
Table 1 Number of instances of each class
Class      a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p
Instances  221  43   14   15   13   24   3    2    8    44   0    0    0    4    4    21
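The counts in Table 1 can be tallied with a few lines of Python. This is only a minimal sketch: the counts are copied from the table, while the names class_counts and summarize are illustrative and not part of Weka. It reports the total of the listed counts, the largest class with its share of that total, and the classes that have no instances, which gives a rough reference point for the accuracies reported later.

```python
# Class counts copied from Table 1 (classes a..p).
class_counts = {
    "a": 221, "b": 43, "c": 14, "d": 15, "e": 13, "f": 24, "g": 3, "h": 2,
    "i": 8, "j": 44, "k": 0, "l": 0, "m": 0, "n": 4, "o": 4, "p": 21,
}


def summarize(counts):
    """Return the total, the majority class, its share, and the empty classes."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    share = counts[majority] / total
    empty = [label for label, n in counts.items() if n == 0]
    return total, majority, share, empty


total, majority, share, empty = summarize(class_counts)
print(f"total={total}, majority={majority} ({share:.1%}), empty={empty}")
```

The total printed here is the sum of the counts as listed in Table 1 and can be compared against the number of patients stated in the text.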
The parameters were tuned by trial, comparing the accuracy obtained with each setting. The highest
accuracy was obtained with 10-fold cross validation.
Table 3 Results with different test options
Test Option                   Accuracy
Cross Validation – 5 folds    69.2308 %
Cross Validation – 10 folds   69.7115 %
Percentage split – 66%        64.539 %
Percentage split – 75%        66.3462 %
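The percentage split options in Table 3 hold out part of the data for testing. The idea can be sketched as below. This is a generic illustration and not Weka's own code: the function name percentage_split, the seed value, the rounding of the training size, and the 100 placeholder instances are all assumptions made for the example.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle a copy of the instances and split it into train and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the training size is rounded to the nearest whole instance.
    train_size = round(len(shuffled) * train_percent / 100)
    return shuffled[:train_size], shuffled[train_size:]


# Hypothetical data: 100 placeholder instances, split 66% / 34%.
train, test = percentage_split(range(100), 66)
print(len(train), len(test))
```

With a 66% split, roughly two thirds of the instances are used for training and the rest for testing, so the accuracy is estimated from a smaller test set than in cross validation.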
7. CROSS-VALIDATION
Cross validation is a technique used to evaluate predictive models by partitioning the original
sample into a training set and a test set.
In k-fold cross validation, the original sample is randomly partitioned into ‘k’ equal-sized
subsamples. One of these subsamples is retained as the validation data for testing the model, and
the remaining k-1 subsamples are used as training data. This process is repeated ‘k’ times, with
each subsample used exactly once as the validation data. The ‘k’ results can then be averaged to
produce a single estimate.
The purpose of cross validation is to check how well the model performs, not to build the model.
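The k-fold procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Weka's implementation: the names k_fold_indices, cross_validate and evaluate_fold are hypothetical, the folds are not stratified by class, and Weka's own procedure may differ (for example in how it shuffles or stratifies the data).

```python
import random


def k_fold_indices(n_instances, k, seed=1):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_instances, k, evaluate_fold):
    """Average the score returned by evaluate_fold over all k folds."""
    scores = [evaluate_fold(train, val)
              for train, val in k_fold_indices(n_instances, k)]
    return sum(scores) / len(scores)


# Placeholder evaluation: returns the held-out share of 20 instances per fold.
# A real evaluate_fold would train on `train` and score on `val`.
mean_score = cross_validate(20, 5, lambda train, val: len(val) / 20)
print(mean_score)
```

Each instance is used for validation exactly once and for training in the other k-1 rounds, which is why the averaged score is an estimate of performance rather than a step in building the model.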
5-fold cross validation
ACCURACY
The number of correct predictions was counted using the given testlabel.arff file and the predicted
values shown above.
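The counting step can be expressed as a small function: accuracy is the number of predictions that match the true label, divided by the number of test instances. The sketch below assumes the true and predicted labels have already been read into two lists of equal length; the function name accuracy and the sample labels are hypothetical and are not taken from testlabel.arff.

```python
def accuracy(true_labels, predicted_labels):
    """Return the percentage of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("no labels to compare")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical labels for illustration only; not taken from testlabel.arff.
true_labels = ["1", "2", "1", "10", "1"]
predicted_labels = ["1", "1", "1", "10", "2"]
print(f"{accuracy(true_labels, predicted_labels):.4f} %")
```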