Weka
Weka is a collection of machine learning algorithms for data mining tasks. The
algorithms can either be applied directly to a dataset or called from your own Java code. Weka
contains tools for data pre-processing, classification, regression, clustering, association rules,
and visualization. It is also well-suited for developing new machine learning schemes.
Weka contains a collection of visualization tools and algorithms for data
analysis and predictive modeling, together with graphical user interfaces for easy access to
these functions.[1] The original non-Java version of Weka was a Tcl/Tk front-end to (mostly
third-party) modeling algorithms implemented in other programming languages, plus data
preprocessing utilities in C, and a Makefile-based system for running machine learning
experiments. This original version was primarily designed as a tool for analyzing data from
agricultural domains,[2][3] but the more recent fully Java-based version (Weka 3), for which
development started in 1997, is now used in many different application areas, in particular for
educational purposes and research. Advantages of Weka include:
• The Preprocess panel has facilities for importing data from a database, a comma-separated
values (CSV) file, etc., and for preprocessing this data using a so-called filtering algorithm.
These filters can be used to transform the data (e.g., turning numeric attributes into
discrete ones) and make it possible to delete instances and attributes according to specific
criteria.
• The Classify panel enables applying classification and regression algorithms
(indiscriminately called classifiers in Weka) to the resulting dataset, to estimate
the accuracy of the resulting predictive model, and to visualize erroneous
predictions, receiver operating characteristic (ROC) curves, etc., or the model itself (if the
model is amenable to visualization like, e.g., a decision tree).
• The Associate panel provides access to association rule learners that attempt to identify all
important interrelationships between attributes in the data.
• The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-
means algorithm. There is also an implementation of the expectation maximization
algorithm for learning a mixture of normal distributions.
• The Select attributes panel provides algorithms for identifying the most predictive
attributes in a dataset.
• The Visualize panel shows a scatter plot matrix, where individual scatter plots can be
selected and enlarged, and analyzed further using various selection operators.
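As noted above, all of these algorithms can also be called from your own Java code. The following is a minimal sketch of that workflow, assuming weka.jar (Weka 3.x) is on the classpath; the dataset path is a placeholder.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class WekaFromJava {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset (path is a placeholder).
        Instances data = DataSource.read("data/glass.arff");
        // By convention the last attribute is the class attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // Build a J48 decision tree and estimate its accuracy
        // with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}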
Dataset: glass.arff
The preprocessed, normalized, standardized, and discretized views of this dataset were produced in the Preprocess panel (Explorer screenshots omitted).
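These preprocessing steps can be reproduced programmatically with Weka's unsupervised attribute filters; a minimal sketch follows, with the file path again a placeholder.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.Standardize;

public class PreprocessGlass {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("data/glass.arff");  // path is a placeholder

        // Normalize: rescale each numeric attribute to [0, 1].
        Normalize norm = new Normalize();
        norm.setInputFormat(raw);
        Instances normalized = Filter.useFilter(raw, norm);

        // Standardize: rescale each numeric attribute to zero mean, unit variance.
        Standardize std = new Standardize();
        std.setInputFormat(raw);
        Instances standardized = Filter.useFilter(raw, std);

        // Discretize: bin each numeric attribute into nominal ranges.
        Discretize disc = new Discretize();
        disc.setInputFormat(raw);
        Instances discretized = Filter.useFilter(raw, disc);

        System.out.println(normalized.numInstances() + " instances preprocessed.");
    }
}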
NaiveBayes ( Dataset: diabetes.arff ) – Run 1
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
Weighted Avg.  0.763    0.307    0.759      0.763   0.760      0.468  0.819     0.815

   a   b   <-- classified as
 422  78 |  a = tested_negative

NaiveBayes ( Dataset: diabetes.arff ) – Run 2
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
Weighted Avg.  0.781    0.281    0.781      0.781   0.781      0.500  0.850     0.866

   a   b   <-- classified as
 109  21 |  a = tested_negative
  21  41 |  b = tested_positive
Conclusion: The evaluation with more splits and folds gave faster and more accurate results.
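A sketch of how such NaiveBayes runs can be reproduced from Java is shown below. The report does not record the exact evaluation settings, so 10-fold cross-validation and a 75%/25% percentage split are illustrative assumptions.

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class NaiveBayesDiabetes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff");  // path is a placeholder
        data.setClassIndex(data.numAttributes() - 1);

        // Run 1: 10-fold cross-validation over the full dataset (folds assumed).
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(cv.toClassDetailsString());
        System.out.println(cv.toMatrixString());

        // Run 2: percentage split, 75% train / 25% test (split assumed).
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.75);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);
        Evaluation split = new Evaluation(train);
        split.evaluateModel(nb, test);
        System.out.println(split.toClassDetailsString());
        System.out.println(split.toMatrixString());
    }
}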
KNN ( glass.arff ) – Run 1
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.786    0.167    0.696      0.786   0.738      0.602  0.806     0.628     build wind float
               0.671    0.130    0.739      0.671   0.703      0.554  0.765     0.629     build wind non-float
               0.294    0.051    0.333      0.294   0.313      0.258  0.590     0.144     vehic wind float
Weighted Avg.  0.706    0.109    0.709      0.706   0.704      0.598  0.792     0.598

=== Confusion Matrix ===
 a  b  c  d   e  f   g   <-- classified as
 0  2  0  0  10  0   1 |  e = containers
 0  1  0  0   1  7   0 |  f = tableware
 0  3  0  0   2  1  23 |  g = headlamps
KNN ( glass.arff ) – Run 2
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC     ROC Area  PRC Area  Class
               0.533    0.184    0.533      0.533   0.533      0.349   0.675     0.417     build wind float
               0.667    0.219    0.667      0.667   0.667      0.448   0.724     0.577     build wind non-float
               0.000    0.082    0.000      0.000   0.000      -0.082  0.459     0.075     vehic wind float

=== Confusion Matrix ===
 a  b  c  d  e  f  g   <-- classified as
 0  0  0  0  3  0  0 |  e = containers
 0  0  0  0  0  1  0 |  f = tableware
 0  1  0  0  0  0  8 |  g = headlamps
Conclusion: The evaluation with fewer splits or folds gave faster and more accurate results.
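The KNN runs correspond to Weka's IBk classifier; a minimal sketch follows. The value of k and the evaluation settings are assumptions, since the report does not record them.

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class KnnGlass {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/glass.arff");  // path is a placeholder
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk(3);  // k = 3 nearest neighbours (assumed)
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));  // 10 folds (assumed)
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}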
K – Means ( cpu.arff )
Number of iterations: 10
Cluster 0: 600,768,2000,0,1,1,16
Cluster 1: 59,4000,12000,32,6,12,113
Cluster 2: 124,1000,8000,0,1,8,42
Clustered Instances
0 20 ( 10%)
1 31 ( 15%)
2 158 ( 76%)
K – Means ( cpu.arff )
Number of iterations: 8
Cluster 0: 600,768,2000,0,1,1,16
Cluster 1: 59,4000,12000,32,6,12,113
Cluster 2: 124,1000,8000,0,1,8,42
Cluster 3: 125,2000,8000,0,2,14,52
Cluster 4: 26,8000,32000,64,12,16,248
Clustered Instances
0 20 ( 10%)
1 49 ( 23%)
2 119 ( 57%)
3 7 ( 3%)
4 14 ( 7%)
Conclusion: The number of iterations was not proportional to the number of clusters: the five-cluster run converged in fewer iterations (8) than the three-cluster run (10). The run generating more clusters converged in a comparable number of iterations and with a comparable sum of squared errors, while distributing the instances across more, smaller clusters.
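Both k-means runs can be reproduced with SimpleKMeans by varying the number of clusters; a sketch follows (the seed is an assumption).

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansCpu {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/cpu.arff");  // path is a placeholder

        for (int k : new int[] {3, 5}) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.setSeed(10);  // seed is an assumption
            km.buildClusterer(data);

            // Report iterations, centroids, and per-cluster instance counts.
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(km);
            eval.evaluateClusterer(data);
            System.out.println(eval.clusterResultsToString());
        }
    }
}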
DBSCAN ( cpu.arff ), % split = 66%
Number of iterations: 17
Cluster 0: 1100,768,2000,0,1,1,13
Cluster 1: 140,2000,4000,0,4,8,40
(Per-cluster distribution estimates for the attributes MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX, and class omitted.)
Clustered Instances
0      57 ( 79%)
1      15 ( 21%)
DBSCAN ( cpu.arff ), % split = 30%
Number of iterations: 12
Cluster 0: 600,768,2000,0,1,1,16
Cluster 1: 59,4000,12000,32,6,12,113
(Per-cluster distribution estimates largely omitted; one surviving entry, under CACH: Normal Distribution, Mean = 85.7368, StdDev = 57.2472.)
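For the DBSCAN runs, note that recent Weka versions ship DBSCAN as the optional optics_dbScan package (class weka.clusterers.DBSCAN) rather than in the core jar. The sketch below assumes that package is installed; the epsilon and minPoints values are illustrative assumptions, not the settings used in the report.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.DBSCAN;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DbscanCpu {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/cpu.arff");  // path is a placeholder

        DBSCAN db = new DBSCAN();
        db.setEpsilon(0.9);   // neighbourhood radius (assumed)
        db.setMinPoints(6);   // minimum points per dense region (assumed)
        db.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(db);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}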
MakeDensityBasedClusterer ( cpu.arff )
Wrapped clusterer: kMeans
Number of iterations: 7
Cluster 0: 225,1000,4000,2,3,6,24
Cluster 1: 160,512,2000,2,3,8,32
(Per-cluster distribution estimates largely omitted; one surviving entry, under MYCT: Normal Distribution, Mean = 35.4545, StdDev = 8.9581.)
Clustered Instances
0 36 ( 24%)
1 111 ( 76%)
Conclusion: Comparing results across different parameter settings, the algorithm gives accurate results, with higher prior probabilities assigned to clusters when the minimum standard deviation is set to a higher value.
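The final experiment wraps kMeans in MakeDensityBasedClusterer, which fits a normal distribution per attribute within each cluster (the Mean/StdDev lines above) and computes cluster prior probabilities. A sketch follows; the minimum standard deviation value is an illustrative assumption.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.MakeDensityBasedClusterer;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DensityBasedCpu {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/cpu.arff");  // path is a placeholder

        // Base clusterer: kMeans with two clusters, as in the output above.
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);

        // Wrap it so each cluster gets per-attribute normal distributions
        // and a prior probability.
        MakeDensityBasedClusterer density = new MakeDensityBasedClusterer();
        density.setClusterer(km);
        density.setMinStdDev(1e-6);  // minimum per-attribute std dev (assumed)
        density.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(density);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}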