ASSIGNMENT 6 - WEKA

Weka is a collection of machine learning algorithms for data mining tasks. The
algorithms can either be applied directly to a dataset or called from your own Java code. Weka
contains tools for data pre-processing, classification, regression, clustering, association rules,
and visualization. It is also well-suited for developing new machine learning schemes.
Weka contains a collection of visualization tools and algorithms for data
analysis and predictive modeling, together with graphical user interfaces for easy access to
these functions.[1] The original non-Java version of Weka was a Tcl/Tk front-end to (mostly
third-party) modeling algorithms implemented in other programming languages, plus data
preprocessing utilities in C, and a Makefile-based system for running machine learning
experiments. This original version was primarily designed as a tool for analyzing data from
agricultural domains,[2][3] but the more recent fully Java-based version (Weka 3), for which
development started in 1997, is now used in many different application areas, in particular for
educational purposes and research. Advantages of Weka include:

• Free availability under the GNU General Public License.


• Portability, since it is fully implemented in the Java programming language and thus runs on
almost any modern computing platform.
• A comprehensive collection of data preprocessing and modeling techniques.
• Ease of use due to its graphical user interfaces.
The Explorer interface features several panels providing access to the main components
of the workbench:

• The Preprocess panel has facilities for importing data from a database, a comma-separated
values (CSV) file, etc., and for preprocessing this data using a so-called filtering algorithm.
These filters can be used to transform the data (e.g., turning numeric attributes into
discrete ones) and make it possible to delete instances and attributes according to specific
criteria.
• The Classify panel enables applying classification and regression algorithms
(indiscriminately called classifiers in Weka) to the resulting dataset, to estimate
the accuracy of the resulting predictive model, and to visualize erroneous
predictions, receiver operating characteristic (ROC) curves, etc., or the model itself (if the
model is amenable to visualization like, e.g., a decision tree).
• The Associate panel provides access to association rule learners that attempt to identify all
important interrelationships between attributes in the data.
• The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-
means algorithm. There is also an implementation of the expectation maximization
algorithm for learning a mixture of normal distributions.
• The Select attributes panel provides algorithms for identifying the most predictive
attributes in a dataset.
• The Visualize panel shows a scatter plot matrix, where individual scatter plots can be
selected and enlarged, and analyzed further using various selection operators.
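The Associate panel's rule learners (e.g., Apriori) rank candidate rules by support and confidence, which are simple counts over the dataset. A minimal illustrative sketch in Python (hypothetical helpers, not Weka's API):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of transactions containing the antecedent, the fraction that
    also contain the consequent: sup(A ∪ C) / sup(A)."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Tiny example: 4 market-basket transactions
tx = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}, {'b', 'c'}]
print(support(tx, {'a', 'b'}))          # 0.5
print(confidence(tx, {'a'}, {'b'}))     # 2/3: of 3 baskets with 'a', 2 contain 'b'
```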
Dataset – glass.arff

• Preprocessed data
• Normalized data
• Standardized data
• Discretized data
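The preprocessing steps listed above are per-attribute transforms. A minimal Python sketch of what each filter computes (illustrative only, not Weka's implementation; assumes each attribute has more than one distinct value):

```python
def normalize(values):
    """Min-max scale to [0, 1], as Weka's Normalize filter does by default."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and unit variance, like Weka's Standardize filter."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def discretize(values, bins=3):
    """Equal-width binning into integer bin indices (unsupervised Discretize)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

print(normalize([0, 5, 10]))                # [0.0, 0.5, 1.0]
print(discretize([0, 1, 2, 3, 4, 5], 3))    # [0, 0, 1, 1, 2, 2]
```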
NaiveBayes ( Dataset: diabetes.arff )

- Cross validation Folds=10


- % split = 66%

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 586 76.3021 %

Incorrectly Classified Instances 182 23.6979 %

Kappa statistic 0.4664

Mean absolute error 0.2841

Root mean squared error 0.4168

Relative absolute error 62.5028 %

Root relative squared error 87.4349 %

Total Number of Instances 768

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0.844 0.388 0.802 0.844 0.823 0.468 0.819 0.892 tested_negative

0.612 0.156 0.678 0.612 0.643 0.468 0.819 0.671 tested_positive

Weighted Avg. 0.763 0.307 0.759 0.763 0.760 0.468 0.819 0.815

=== Confusion Matrix ===

a b <-- classified as

422 78 | a = tested_negative

104 164 | b = tested_positive
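The summary statistics reported above can be recomputed directly from the confusion matrix; a short Python check of the accuracy and kappa figures:

```python
# Confusion matrix from the 10-fold run above (rows = actual, cols = predicted)
cm = [[422, 78],    # tested_negative
      [104, 164]]   # tested_positive

n = sum(sum(row) for row in cm)           # 768 instances
accuracy = (cm[0][0] + cm[1][1]) / n      # observed agreement p_o

# Cohen's kappa compares p_o with the chance agreement p_e,
# where p_e = sum over classes of (row total * column total) / n^2
row_totals = [sum(row) for row in cm]
col_totals = [cm[0][j] + cm[1][j] for j in range(2)]
p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
kappa = (accuracy - p_e) / (1 - p_e)

print(round(accuracy * 100, 4))   # 76.3021
print(round(kappa, 4))            # 0.4664
```

Both values match the Weka output, confirming the kappa statistic is derived from the same matrix.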


NaiveBayes ( Dataset: diabetes.arff )

- Cross validation Folds=25


- % split = 75%

Time taken to test model on test split: 0 seconds

=== Summary ===

Correctly Classified Instances 150 78.125 %

Incorrectly Classified Instances 42 21.875 %

Kappa statistic 0.4998

Mean absolute error 0.2642

Root mean squared error 0.3848

Relative absolute error 58.7363 %

Root relative squared error 82.0682 %

Total Number of Instances 192

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0.838 0.339 0.838 0.838 0.838 0.500 0.850 0.911 tested_negative

0.661 0.162 0.661 0.661 0.661 0.500 0.850 0.772 tested_positive

Weighted Avg. 0.781 0.281 0.781 0.781 0.781 0.500 0.850 0.866

=== Confusion Matrix ===

a b <-- classified as

109 21 | a = tested_negative

21 41 | b = tested_positive

Conclusion: For Naive Bayes on diabetes.arff, the second configuration (75% split) evaluated faster and was slightly more accurate than the first (78.13% vs. 76.30% correctly classified).
KNN ( glass.arff )

- Cross validation Folds=10


- % split = 66%

Time taken to build model: 0 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 151 70.5607 %

Incorrectly Classified Instances 63 29.4393 %

Kappa statistic 0.6005

Mean absolute error 0.0897

Root mean squared error 0.2852

Relative absolute error 42.3747 %

Root relative squared error 87.8627 %

Total Number of Instances 214

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0.786 0.167 0.696 0.786 0.738 0.602 0.806 0.628 build wind float

0.671 0.130 0.739 0.671 0.703 0.554 0.765 0.629 build wind non-float

0.294 0.051 0.333 0.294 0.313 0.258 0.590 0.144 vehic wind float

? 0.000 ? ? ? ? ? ? vehic wind non-float

0.769 0.030 0.625 0.769 0.690 0.671 0.895 0.456 containers

0.778 0.015 0.700 0.778 0.737 0.726 0.838 0.598 tableware

0.793 0.011 0.920 0.793 0.852 0.834 0.884 0.772 headlamps

Weighted Avg. 0.706 0.109 0.709 0.706 0.704 0.598 0.792 0.598
=== Confusion Matrix ===

a b c d e f g <-- classified as

55 9 6 0 0 0 0 | a = build wind float

15 51 4 0 3 2 1 | b = build wind non-float

9 3 5 0 0 0 0 | c = vehic wind float

0 0 0 0 0 0 0 | d = vehic wind non-float

0 2 0 0 10 0 1 | e = containers

0 1 0 0 1 7 0 | f = tableware

0 3 0 0 2 1 23 | g = headlamps
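Weka implements k-nearest neighbours as the IBk classifier. Its core prediction step can be sketched in a few lines of stdlib Python (a simplified version: Euclidean distance, unweighted majority vote):

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """train: list of (feature_vector, label) pairs.
    Returns the majority label among the k nearest training instances."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D example with two well-separated classes
train = [((0, 0), 'a'), ((0, 1), 'a'), ((5, 5), 'b'), ((5, 6), 'b')]
print(knn_predict(train, (1, 1), k=3))   # 'a'
```

Weka's IBk additionally supports distance weighting and attribute normalization, which this sketch omits.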

KNN ( glass.arff )

- Cross validation Folds=25


- % split = 75%

Time taken to test model on test split: 0.01 seconds

=== Summary ===

Correctly Classified Instances 34 64.1509 %

Incorrectly Classified Instances 19 35.8491 %

Kappa statistic 0.5069

Mean absolute error 0.1084

Root mean squared error 0.3136

Relative absolute error 51.1589 %

Root relative squared error 96.8122 %

Total Number of Instances 53


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0.533 0.184 0.533 0.533 0.533 0.349 0.675 0.417 build wind float

0.667 0.219 0.667 0.667 0.667 0.448 0.724 0.577 build wind non-float

0.000 0.082 0.000 0.000 0.000 -0.082 0.459 0.075 vehic wind float

? 0.000 ? ? ? ? ? ? vehic wind non-float

1.000 0.020 0.750 1.000 0.857 0.857 0.990 0.750 containers

1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 tableware

0.889 0.000 1.000 0.889 0.941 0.932 0.944 0.908 headlamps

Weighted Avg. 0.642 0.146 0.646 0.642 0.642 0.496 0.748 0.567

=== Confusion Matrix ===

a b c d e f g <-- classified as

8 4 3 0 0 0 0 | a = build wind float

5 14 1 0 1 0 0 | b = build wind non-float

2 2 0 0 0 0 0 | c = vehic wind float

0 0 0 0 0 0 0 | d = vehic wind non-float

0 0 0 0 3 0 0 | e = containers

0 0 0 0 0 1 0 | f = tableware

0 1 0 0 0 0 8 | g = headlamps

Conclusion: For kNN on glass.arff, the configuration with fewer folds and a smaller split percentage gave the more accurate result (70.56% vs. 64.15% correctly classified).
K – Means ( cpu.arff )

- Using Training dataset.


- K=3

Number of iterations: 10

Within cluster sum of squared errors: 16.2226477893351

Initial starting points (random):

Cluster 0: 600,768,2000,0,1,1,16

Cluster 1: 59,4000,12000,32,6,12,113

Cluster 2: 124,1000,8000,0,1,8,42

Missing values globally replaced with mean/mode

Final cluster centroids:

Cluster#

Attribute Full Data 0 1 2

(209.0) (20.0) (31.0) (158.0)

=======================================================

MYCT 203.823 906 40.4194 147

MMIN 2867.9809 664.8 9050.5161 1933.8354

MMAX 11796.1531 4000.6 31407.0968 8935.2152

CACH 25.2057 1 95 14.5759

CHMIN 4.6986 0.95 15.1935 3.1139

CHMAX 18.2679 2.05 47.8065 14.5253

class 105.622 17.75 392.4839 60.462


Time taken to build model (full training data) : 0 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 20 ( 10%)

1 31 ( 15%)

2 158 ( 76%)

K – Means ( cpu.arff )

- Using Training dataset.


- K=5

Number of iterations: 8

Within cluster sum of squared errors: 11.763305243629095

Initial starting points (random):

Cluster 0: 600,768,2000,0,1,1,16

Cluster 1: 59,4000,12000,32,6,12,113

Cluster 2: 124,1000,8000,0,1,8,42

Cluster 3: 125,2000,8000,0,2,14,52

Cluster 4: 26,8000,32000,64,12,16,248

Missing values globally replaced with mean/mode


Final cluster centroids:

Cluster#

Attribute Full Data 0 1 2 3 4

(209.0) (20.0) (49.0) (119.0) (7.0) (14.0)

=============================================================================

MYCT 203.823 906 51.5306 169.8571 174 37.3571

MMIN 2867.9809 664.8 4843.6735 1358.0168 1844.5714 12446.8571

MMAX 11796.1531 4000.6 19860 6805.4118 7485.7143 39285.7143

CACH 25.2057 1 52.8367 7.8908 8 118.8571

CHMIN 4.6986 0.95 7.6327 2.1513 4.1429 21.7143

CHMAX 18.2679 2.05 23.551 9.2857 76.2857 70.2857

class 105.622 17.75 169.7755 42.605 53.2857 568.4286

Time taken to build model (full training data) : 0.01 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 20 ( 10%)

1 49 ( 23%)

2 119 ( 57%)

3 7 ( 3%)

4 14 ( 7%)

Conclusion: Increasing k from 3 to 5 lowered the within-cluster sum of squared errors (16.22 → 11.76), since more centroids can fit the data more tightly; in this run the k=5 model also converged in fewer iterations (8 vs. 10).
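The k-means procedure used above (Weka's SimpleKMeans) follows Lloyd's algorithm: pick random initial centroids, then alternate assigning points to the nearest centroid and recomputing centroids until they stop moving. A minimal stdlib-Python sketch:

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm: random initial centroids, Euclidean distance,
    mean update. points: list of equal-length tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # New centroid = mean of assigned points (keep old one if cluster empty)
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:     # converged: assignments stable
            break
        centroids = new
    return centroids, clusters

# Hypothetical example: two well-separated blobs
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
```

SimpleKMeans additionally normalizes attributes and handles nominal values and missing data, which this sketch omits.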
MakeDensityBasedClusterer ( cpu.arff )

- % split = 66%

Number of iterations: 17

Within cluster sum of squared errors: 15.876103534324052

Initial starting points (random):

Cluster 0: 1100,768,2000,0,1,1,13

Cluster 1: 140,2000,4000,0,4,8,40

Missing values globally replaced with mean/mode

Final cluster centroids:

Cluster#

Attribute Full Data 0 1

(137.0) (113.0) (24.0)

============================================

MYCT 207.2701 242.4425 41.6667

MMIN 3118.2482 1846.3186 9106.9167

MMAX 12202.9635 8090.1416 31567.5

CACH 26.4818 11.2832 98.0417

CHMIN 5.2263 3 15.7083

CHMAX 19.8978 12.8142 53.25

class 115.1752 52.5664 409.9583


Fitted estimators (with ML estimates of variance):

Cluster: 0 Prior probability: 0.8201

Attribute: MYCT

Normal Distribution. Mean = 242.4425 StdDev = 288.4516

Attribute: MMIN

Normal Distribution. Mean = 1846.3186 StdDev = 1662.4062

Attribute: MMAX

Normal Distribution. Mean = 8090.1416 StdDev = 5718.3184

Attribute: CACH

Normal Distribution. Mean = 11.2832 StdDev = 15.2875

Attribute: CHMIN

Normal Distribution. Mean = 3 StdDev = 3.1595

Attribute: CHMAX

Normal Distribution. Mean = 12.8142 StdDev = 16.7953

Attribute: class

Normal Distribution. Mean = 52.5664 StdDev = 40.4748

Cluster: 1 Prior probability: 0.1799

Attribute: MYCT

Normal Distribution. Mean = 41.6667 StdDev = 26.1943

Attribute: MMIN

Normal Distribution. Mean = 9106.9167 StdDev = 6721.6047


Attribute: MMAX

Normal Distribution. Mean = 31567.5 StdDev = 17310.5159

Attribute: CACH

Normal Distribution. Mean = 98.0417 StdDev = 52.1188

Attribute: CHMIN

Normal Distribution. Mean = 15.7083 StdDev = 12.5946

Attribute: CHMAX

Normal Distribution. Mean = 53.25 StdDev = 47.2222

Attribute: class

Normal Distribution. Mean = 409.9583 StdDev = 280.2213

Time taken to build model (percentage split) : 0 seconds

Clustered Instances

0 57 ( 79%)

1 15 ( 21%)
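The density-based clusterer assigns an instance to the cluster maximizing prior × likelihood, where the likelihood is a product of the per-attribute normal densities it fitted. A single-attribute Python sketch, using the priors and MYCT estimates copied from the 66%-split output above (and assuming attribute independence, as the wrapper does):

```python
import math

def normal_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# Priors and MYCT (cycle time) estimates from the run above
clusters = {
    0: {"prior": 0.8201, "mean": 242.4425, "sd": 288.4516},  # low-spec machines
    1: {"prior": 0.1799, "mean": 41.6667, "sd": 26.1943},    # high-performance
}

def assign(myct):
    """Cluster with the highest prior * likelihood for this one attribute."""
    return max(clusters, key=lambda c: clusters[c]["prior"] *
               normal_pdf(myct, clusters[c]["mean"], clusters[c]["sd"]))

print(assign(50))    # 1: a fast (low-MYCT) machine lands in the high-end cluster
print(assign(500))   # 0: a slow machine lands in the large low-spec cluster
```

The full model multiplies such densities across all seven attributes before comparing clusters.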
MakeDensityBasedClusterer ( cpu.arff )

- % split = 30 %

Number of iterations: 12

Within cluster sum of squared errors: 21.17961501115821

Initial starting points (random):

Cluster 0: 600,768,2000,0,1,1,16

Cluster 1: 59,4000,12000,32,6,12,113

Missing values globally replaced with mean/mode

Final cluster centroids:

Cluster#

Attribute Full Data 0 1

(209.0) (171.0) (38.0)

============================================

MYCT 203.823 238.9064 45.9474

MMIN 2867.9809 1721.0058 8029.3684

MMAX 11796.1531 7743.7778 30031.8421

CACH 25.2057 11.7544 85.7368

CHMIN 4.6986 2.7076 13.6579

CHMAX 18.2679 12.3626 44.8421

class 105.622 50.2632 354.7368

Fitted estimators (with ML estimates of variance):

Cluster: 0 Prior probability: 0.8152


Attribute: MYCT

Normal Distribution. Mean = 238.9064 StdDev = 274.5998

Attribute: MMIN

Normal Distribution. Mean = 1721.0058 StdDev = 1570.8293

Attribute: MMAX

Normal Distribution. Mean = 7743.7778 StdDev = 5370.527

Attribute: CACH

Normal Distribution. Mean = 11.7544 StdDev = 16.8647

Attribute: CHMIN

Normal Distribution. Mean = 2.7076 StdDev = 2.7756

Attribute: CHMAX

Normal Distribution. Mean = 12.3626 StdDev = 16.6476

Attribute: class

Normal Distribution. Mean = 50.2632 StdDev = 37.2153

Cluster: 1 Prior probability: 0.1848

Attribute: MYCT

Normal Distribution. Mean = 45.9474 StdDev = 31.3629

Attribute: MMIN

Normal Distribution. Mean = 8029.3684 StdDev = 6219.804

Attribute: MMAX

Normal Distribution. Mean = 30031.8421 StdDev = 14712.8937

Attribute: CACH
Normal Distribution. Mean = 85.7368 StdDev = 57.2472

Attribute: CHMIN

Normal Distribution. Mean = 13.6579 StdDev = 11.0246

Attribute: CHMAX

Normal Distribution. Mean = 44.8421 StdDev = 39.8646

Attribute: class

Normal Distribution. Mean = 354.7368 StdDev = 243.9342

Time taken to build model (full training data) : 0 seconds

=== Model and evaluation on test split ===

MakeDensityBasedClusterer:

Wrapped clusterer:

kMeans

======

Number of iterations: 7

Within cluster sum of squared errors: 7.885124470107327

Initial starting points (random):

Cluster 0: 225,1000,4000,2,3,6,24
Cluster 1: 160,512,2000,2,3,8,32

Missing values globally replaced with mean/mode

Final cluster centroids:

Cluster#

Attribute Full Data 0 1

(62.0) (11.0) (51.0)

============================================

MYCT 203.7419 35.4545 240.0392

MMIN 2908.3548 8545.4545 1692.5098

MMAX 11606.6129 31272.7273 7364.902

CACH 22.5806 81.4545 9.8824

CHMIN 5.5968 16.7273 3.1961

CHMAX 18.371 49.4545 11.6667

class 104.6129 357.6364 50.0392

Fitted estimators (with ML estimates of variance):

Cluster: 0 Prior probability: 0.1875

Attribute: MYCT
Normal Distribution. Mean = 35.4545 StdDev = 8.9581

Attribute: MMIN

Normal Distribution. Mean = 8545.4545 StdDev = 4008.2559

Attribute: MMAX

Normal Distribution. Mean = 31272.7273 StdDev = 12038.5057

Attribute: CACH

Normal Distribution. Mean = 81.4545 StdDev = 38.8666

Attribute: CHMIN

Normal Distribution. Mean = 16.7273 StdDev = 16.955

Attribute: CHMAX

Normal Distribution. Mean = 49.4545 StdDev = 51.8255

Attribute: class

Normal Distribution. Mean = 357.6364 StdDev = 270.4072

Cluster: 1 Prior probability: 0.8125

Attribute: MYCT

Normal Distribution. Mean = 240.0392 StdDev = 280.8856

Attribute: MMIN

Normal Distribution. Mean = 1692.5098 StdDev = 1449.9494

Attribute: MMAX

Normal Distribution. Mean = 7364.902 StdDev = 4778.0192

Attribute: CACH

Normal Distribution. Mean = 9.8824 StdDev = 10.9931


Attribute: CHMIN

Normal Distribution. Mean = 3.1961 StdDev = 3.4924

Attribute: CHMAX

Normal Distribution. Mean = 11.6667 StdDev = 14.3167

Attribute: class

Normal Distribution. Mean = 50.0392 StdDev = 31.2799

Time taken to build model (percentage split) : 0 seconds

Clustered Instances

0 36 ( 24%)

1 111 ( 76%)

Conclusion: Under both split percentages the density-based clusterer found the same two-cluster structure: a large cluster of low-specification machines (prior probability ≈ 0.82) and a small cluster of high-performance machines (prior ≈ 0.18). The larger 66% split produced a lower within-cluster sum of squared errors than the 30% split (15.88 vs. 21.18).
