WEKA Practical Protocol
COLLEGE, BHOPAL
SESSION: 2023-24
Data Mining And Warehousing.
(Practical Protocol)
First, you will start with the raw data collected from the field. This data may
contain several null values and irrelevant fields. You use the data preprocessing
tools provided in WEKA to cleanse the data.
Then, you would save the preprocessed data in your local storage for applying ML
algorithms.
Next, depending on the kind of ML model that you are trying to develop, you
would select one of the options such as Classify, Cluster, or Associate. The
Select Attributes option allows the automatic selection of features to create a
reduced dataset.
Note that under each category, WEKA provides the implementation of several
algorithms. You would select an algorithm of your choice, set the desired
parameters and run it on the dataset. Then, WEKA would give you the statistical
output of the model processing. It also provides a visualization tool to inspect
the data. The various models can be applied to the same dataset.
You can then compare the outputs of different models and select the best that
meets your purpose. Thus, the use of WEKA results in a quicker development of
machine learning models on the whole.
Weka – Installation.
To install WEKA on your machine, visit WEKA’s official website and download
the installation file. WEKA supports installation on Windows, Mac OS X and
Linux. You just need to follow the
instructions on this page to install WEKA for your OS.
The steps for installing on Mac are as follows –
Download the Mac installation file.
Double click on the downloaded weka-3-8-3-corretto-jvm.dmg file.
You will see the following screen on successful installation.
Click on the weka-3-8-3-corretto-jvm icon to start Weka.
Optionally, you may start it from the command line: java -jar weka.jar
The WEKA GUI Chooser application will start and you would see the following
screen.
The GUI Chooser application allows you to run five different types of applications
as listed here –
Explorer
Experimenter
KnowledgeFlow
Workbench
Simple CLI
Weka Environment:
Weka - Launching Explorer
In this chapter, let us look into various functionalities that the explorer provides for
working with big data. When you click on the Explorer button in the Applications
selector, it opens the following screen-
Preprocess Tab
Initially as you open the explorer, only the Preprocess tab is enabled. The first
step in machine learning is to preprocess the data. Thus, in the Preprocess
option, you will select the data file, process it and make it fit for applying the
various machine learning algorithms.
Classify Tab
The Classify tab provides you several machine learning algorithms for the
classification of your data. To list a few, you may apply algorithms such as
Linear Regression, Logistic Regression, Support Vector Machines, Decision
Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is very
exhaustive and provides both supervised and unsupervised machine learning
algorithms.
Cluster Tab
Under the Cluster tab, there are several clustering algorithms provided - such
as SimpleKMeans, FilteredClusterer, HierarchicalClusterer, and so on.
Associate Tab
Under the Associate tab, you would find Apriori, FilteredAssociator and
FPGrowth.
Select Attributes Tab
The Select Attributes tab allows feature selection based on several algorithms
such as ClassifierSubsetEval, PrincipalComponents, etc.
Visualize Tab
Lastly, the Visualize option allows you to visualize your processed data for
analysis.
As you noticed, WEKA provides several ready-to-use algorithms for testing
and building your machine learning applications. To use WEKA effectively,
you must have a sound knowledge of these algorithms, how they work, which
one to choose under what circumstances, what to look for in their processed
output, and so on. In short, you must have a solid foundation in machine
learning to use WEKA effectively in building your apps.
Weka Experimenter
The Experimenter configures the test options for you with sensible defaults. The experiment is
configured to use Cross Validation with 10 folds. It is a “Classification” type problem and each
algorithm + dataset combination is run 10 times (iteration control).
Weka Workbench
WEKA is a workbench for machine learning that is intended to aid in the application of machine
learning techniques to a variety of real-world problems, in particular, those arising from
agricultural and horticultural domains.
Simple CLI
The Simple CLI (Command Line Interface) provides a command line interface to run the WEKA
API. This is useful for running shell scripts to automate processes, or for
calling the WEKA API from other applications.
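For example, typing the following at the Simple CLI prompt trains a J48 decision
tree on a training file and prints its evaluation (the dataset path is
illustrative and depends on your installation):

    java weka.classifiers.trees.J48 -t ./data/weather.nominal.arff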
Practical 2
Weka - Loading Data
In this chapter, we start with the first tab that you use to preprocess the data.
This is common to all algorithms that you would apply to your data for building
the model and is a common step for all subsequent operations in WEKA.
For a machine learning algorithm to give acceptable accuracy, it is important
that you cleanse your data first. This is because the raw data collected
from the field may contain null values, irrelevant columns and so on.
In this chapter, you will learn how to preprocess the raw data and create a clean,
meaningful dataset for further use. First, you will learn to load the data file into
the WEKA explorer.
Just under the Machine Learning tabs that you studied in the previous lesson,
you would find the following three buttons –
Open file …
Open URL…
Open DB …
Click on the Open file ... button. A directory navigator window opens as shown
in the following screen –
Now, navigate to the folder where your data files are stored. The WEKA installation
comes with many sample databases for you to experiment with. These are available
in the data folder of the WEKA installation.
Loading Data from Web
Once you click on the Open URL … button, you can see a
window as follows –
We will open the file from a public URL. Type the following URL in the popup
box –
https://storm.cis.fordham.edu/~gweiss/data-mining/wekadata/weather.nominal.arff
You may specify any other URL where your data is stored. The Explorer will load
the data from the remote site into its environment.
Loading Data from DB
Once you click on the Open DB ... button, you can see a
window as follows –
Set the connection string to your database, set up the query for data selection,
run the query and load the selected records in WEKA.
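These loading operations can also be scripted against the WEKA Java API, whose
DataSource helper accepts both file paths and URLs. A minimal sketch using the
public URL given above:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadFromUrl {
        public static void main(String[] args) throws Exception {
            // DataSource resolves both local paths and remote URLs.
            Instances data = DataSource.read(
                "https://storm.cis.fordham.edu/~gweiss/data-mining/wekadata/weather.nominal.arff");
            System.out.println("Loaded " + data.numInstances()
                + " instances with " + data.numAttributes() + " attributes.");
        }
    }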
Weka - File Formats
WEKA supports a large number of file formats for the data. Here is
the complete list −
arff
arff.gz
bsi
csv
dat
data
json
json.gz
libsvm
m
names
xrff
xrff.gz
The types of files that it supports are listed in the drop-down list box at the
bottom of the screen. This is shown in the screenshot given below. As you would
notice, it supports several formats including CSV and JSON; the default file type
is Arff.
Arff Format
An Arff file contains two sections - header and data. As an example of the Arff
format, the Weather data file loaded from the WEKA sample databases is shown
below. From the screenshot, you can infer the following points −
The @relation tag defines the name of the database.
The @attribute tag defines the attributes.
The @data tag starts the list of data rows, each containing comma-separated
fields.
The attributes can take nominal values, as in the case of outlook shown here −
@attribute outlook {sunny, overcast, rainy}
The attributes can take real values, as in this case −
@attribute temperature real
You can also set a Target or a Class variable called play, as shown here −
@attribute play {yes, no}
The Target assumes two nominal values, yes or no.
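For reference, here is an abridged sketch of that Weather file in Arff format
(the attribute lines match the points above; only the first few data rows are
shown):

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes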
Other Formats
The Explorer can load the data in any of the earlier mentioned
formats. As arff is the preferred format in WEKA, you may load the
data from any format and save it to arff format for later use. After
preprocessing the data, just save it to arff format for further analysis.
Now that you have learned how to load data into WEKA, in the next chapter, you
will learn how to preprocess the data.
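This load-and-save conversion can also be scripted with the WEKA Java API; a
minimal sketch (file names are illustrative):

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CsvToArff {
        public static void main(String[] args) throws Exception {
            // Load from any supported format; the loader is chosen by extension.
            Instances data = DataSource.read("weather.csv");

            // Save the loaded instances in Arff format for later use.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File("weather.arff"));
            saver.writeBatch();
        }
    }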
Practical-3&4
ETL is a process in Data Warehousing and it stands for Extract, Transform and
Load. It is a process in which an ETL tool extracts the data from various data
source systems, transforms it in the staging area, and finally loads it into the
Data Warehouse.
Extract: The first stage in the ETL process is to extract data from various
sources such as transactional systems, spreadsheets and flat files. This step
involves reading data from the source systems and storing it in a staging area.
Transform: In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning and
validating the data, converting data types, combining data from multiple sources,
and creating new data fields.
Load: After the data is transformed, it is loaded into the data warehouse. This
step involves creating the physical data structures and loading the data into the
warehouse.
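As a toy illustration of the three stages (the file names and the cleaning rule
are assumptions made for this sketch, not part of the practical):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class MiniEtl {
        public static void main(String[] args) throws IOException {
            // Extract: read raw rows from a flat source file.
            List<String> raw = Files.readAllLines(Paths.get("source.csv"));

            // Transform: drop blank rows and normalise the field text.
            List<String> clean = new ArrayList<>();
            for (String row : raw) {
                if (!row.trim().isEmpty()) {
                    clean.add(row.trim().toLowerCase());
                }
            }

            // Load: write the cleansed rows into the warehouse staging area.
            Files.write(Paths.get("warehouse_staging.csv"), clean);
        }
    }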
Using the Open file ... option under the Preprocess tab, select the weather-
nominal.arff file.
When you open the file, your screen looks as shown here –
Understanding Data.
Let us first look at the highlighted Current relation sub window. It shows the name
of the database that is currently loaded. You can infer two points from this sub
window −
There are 14 instances - the number of rows in the table.
The table contains 5 attributes - the fields, which are discussed in the
upcoming sections.
On the left side, notice the Attributes sub window that displays the various
fields in the database.
The weather database contains five fields - outlook, temperature, humidity, windy
and play. When you select an attribute from this list by clicking on it, further
details on the attribute itself are displayed on the right hand side.
Let us select the temperature attribute first. When you click on it, you would see
the following screen-
If you click on the Visualize All button, you will be able to see all features in one
single window as shown here-
Practical 6
Procedure for Implementation
The process of building a Data Mart can be complex, but it generally involves the
following five steps:
Step 1: Design
Step 2: Build / Construct
Step 3: Populate / Data Transfer
Step 4: Data Access
Step 5: Manage
Step 1: Design
This is the first step when building a Data Mart. It includes tasks such as
initiating a request for the Data Mart and collecting information about the
requirements. Other tasks involved in this step include identifying the data
sources and selecting the right data subset. The output of this step is the
logical and physical design of the Data Mart.
Step 2: Build / Construct
This is the step during which both the physical and the logical structures for the
Data Mart are created. In this step, you create the tables, indexes, fields, and
access controls.
Step 3: Populate / Data Transfer
This is the step in which you populate the Data Mart by transferring data into it.
You can also set the frequency with which data transfer will be done, whether daily
or weekly.
To ensure that information stored in the structure is clean, it is always overwritten
during the population of the Data Mart. In this step, the source information is
extracted, cleaned, transformed, and loaded into the Data Mart.
Step 4: Data Access
In this step, the data that has been loaded into the Data Mart is put into active use.
Activities involved here include querying, generating graphs and reports, and
publishing.
To make it easy for non-technical users to use the Data Mart, a meta-layer should
be set up and item names and database structures translated into corporate
expressions. If possible, interfaces and APIs should be set up to ease the process
of data access.
Step 5: Manage
This is the last step when building a Data Mart and it involves the following tasks:
Controlling user access.
Refining and optimizing the target system to improve its performance.
Adding new data into the Data Mart and managing it.
Configuring recovery settings and ensuring that the system remains available
even after a disaster occurs.
Practical-7
Build a classification model to classify data using Naïve Bayes Algorithm.
Regression is the easiest technique to use, but it is also probably the least
powerful. The regression model is then used to predict the result of an unknown
dependent variable, given the values of the independent variables.
Steps include:
Step-1: Open WEKA Explorer.
Step-2: Select the salesdata.arff file from the Open file option under the
Preprocess tab.
Step-3: Go to the Classify tab for classifying the unclassified data. Click on the
Choose button and, from the list, select functions.Logistic.
Step-4: Click on the Start button. The classifier output will be seen on the
right-hand panel. We can see that with the default configuration, logistic
regression achieves an accuracy of 63%.
It shows the run information in the panel –
Correctly classified instances: 2948
Incorrectly classified instances: 1679
Total number of instances: 4627
Detailed Accuracy By Class
Confusion Matrix
Step-5: To visualize the tree, right click on the result and select Visualize
tree.
* Margin curve
* Threshold curve (Class value high)
* Threshold curve (Class value low)
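Since this practical is about the Naïve Bayes algorithm, the same classification
can also be run from the WEKA Java API. A minimal sketch (the data file name is
an assumption; use any Arff file whose last attribute is a nominal class):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesDemo {
        public static void main(String[] args) throws Exception {
            // Load the data and mark the last attribute as the class.
            Instances data = DataSource.read("salesdata.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Evaluate Naive Bayes with 10-fold cross-validation,
            // the same default the Explorer uses.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

            System.out.println(eval.toSummaryString());      // correctly/incorrectly classified
            System.out.println(eval.toClassDetailsString()); // detailed accuracy by class
            System.out.println(eval.toMatrixString());       // confusion matrix
        }
    }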
Practical-10
Implement Clustering Algorithms over different Datasets.
Clustering Algorithm.
Clustering is an unsupervised machine learning technique that groups data points
into clusters so that objects in the same group are similar to each other.
Clustering helps to split data into several subsets. Each of these subsets
contains data similar to each other, and these subsets are called clusters. Once
the data from our customer base is divided into clusters, we can make an informed
decision about who we think is best suited for this product.
Density Based Spatial Clustering of Applications with Noise
The DBSCAN algorithm is based on the intuitive notion of “clusters”
and “noise”. The key idea is that, for each point of a cluster, the
neighbourhood of a given radius has to contain at least a minimum
number of points.
Steps to be followed:
Step 1: Open the WEKA Explorer in the preprocessing interface and
import the appropriate dataset.
Step 2: To perform clustering, go to the Explorer “Cluster” tab and
click the Choose button. As a result of this step, a dropdown list of
available clustering algorithms is displayed. Pick the Hierarchical
or DBSCAN algorithm.
Step 3: Click on the Start button. The resulting window displays the
centroid of each cluster, as well as data on the number and proportion
of instances assigned to each cluster. A mean vector is used to
represent each cluster centroid.
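In recent WEKA releases DBSCAN is distributed as an optional package, so the
sketch below uses the built-in SimpleKMeans to show the same clustering workflow
from the Java API (the dataset path and k = 3 are assumptions):

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class KMeansDemo {
        public static void main(String[] args) throws Exception {
            // Load a dataset; clusterers use no class attribute, so we
            // drop the last attribute here, assuming it holds the class.
            Instances data = DataSource.read("iris.arff");
            data.deleteAttributeAt(data.numAttributes() - 1);

            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(3); // k = 3 clusters
            kmeans.setSeed(1);        // fixed seed for reproducible centroids
            kmeans.buildClusterer(data);

            // Prints each cluster centroid (a mean vector) and cluster sizes.
            System.out.println(kmeans);
        }
    }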
Steps to be followed:
Step 1: Open the WEKA Explorer and, under the Preprocess tab,
choose the “diabetic and non-diabetic.csv” file.
Step 2: The file now gets loaded in the WEKA Explorer.
Step 3: Go to the “Classify” tab. The algorithm can be chosen
from here.
Step 4: The textbox next to the Choose button shows
“RandomForest -I 10 -K 0 -S 1” or “Logistic -R 1.0E-8 -M -1”,
which depicts the summarized option set for the specific
algorithm in the settings tab.
Step 5: Click on the Start button. The classification rules are
generated in the right panel. This panel shows -
Summary
Detailed Accuracy By Class
Confusion Matrix
Run Information.
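Those option strings map directly onto the WEKA Java API; a minimal sketch (the
file name is an assumption) that builds the same RandomForest configuration from
its option string and trains it:

    import weka.classifiers.AbstractClassifier;
    import weka.classifiers.Classifier;
    import weka.core.Instances;
    import weka.core.Utils;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OptionStringDemo {
        public static void main(String[] args) throws Exception {
            // Instantiate the classifier from the option string shown
            // next to the Choose button in the Explorer.
            Classifier rf = AbstractClassifier.forName(
                    "weka.classifiers.trees.RandomForest",
                    Utils.splitOptions("-I 10 -K 0 -S 1"));

            // Load the data, mark the last attribute as the class, train.
            Instances data = DataSource.read("diabetic.arff");
            data.setClassIndex(data.numAttributes() - 1);
            rf.buildClassifier(data);
            System.out.println(rf);
        }
    }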
PRACTICAL -13
Analyze the IRIS dataset in WEKA and apply a suitable data mining technique.
IRIS is an open-access, flower-based dataset, normally available in the UCI
dataset repository. The major objective of this practical is to examine the IRIS
data using the data mining techniques supported in WEKA.
Steps to be followed:
Step 1: Open the WEKA Explorer and, under the Preprocess tab, choose the
“iris.csv” file.
Step 2: The file now gets loaded in the WEKA Explorer.
Step 3: Go to the “Classify” tab for classifying the unclassified data. Click on
the “Choose” button and, from the list, select bayes.NaiveBayes or
trees.DecisionStump.
Step 4: To perform clustering, go to the Explorer “Cluster” tab and click the
Choose button. As a result of this step, a dropdown list of available clustering
algorithms is displayed. Pick the Hierarchical or DBSCAN algorithm.
Step 5: Visualize the model.
bayes.NaiveBayes
trees.DecisionStump