
SANT HIRDARAM GIRLS COLLEGE, BHOPAL

SESSION: 2023-24
Data Mining and Warehousing
(Practical Protocol)

SUBMITTED BY: Aishwarya Purswani, BCA III Year
SUBMITTED TO: Ms. Kanchan Chaturvedi, Asst. Prof., Dept. of CS


Practical 1
What is Weka?
WEKA is an open source software that provides tools for data preprocessing, implementations of several machine learning algorithms, and visualization tools, so that you can develop machine learning techniques and apply them to real-world data mining problems. What WEKA offers is summarized in the following diagram –


First, you will start with the raw data collected from the field. This data may
contain several null values and irrelevant fields. You use the data preprocessing
tools provided in WEKA to cleanse the data.
Then, you would save the preprocessed data in your local storage for applying ML
algorithms.
Next, depending on the kind of ML model that you are trying to develop you
would select one of the options such as Classify, Cluster, or Associate. The
Attributes Selection allows the automatic selection of features to create a reduced
dataset.
Note that under each category, WEKA provides the implementation of several
algorithms. You would select an algorithm of your choice, set the desired
parameters and run it on the dataset. Then, WEKA would give you the statistical
output of the model processing. It provides you a visualization tool to inspect the
data. The various models can be applied on the same dataset.
You can then compare the outputs of different models and select the best that
meets your purpose. Thus, the use of WEKA results in a quicker development of
machine learning models on the whole.

Weka – Installation.
To install WEKA on your machine, visit WEKA’s official website and download
the installation file. WEKA supports installation on Windows, Mac OS X and
Linux. You just need to follow the
instructions on this page to install WEKA for your OS.
The steps for installing on Mac are as follows –
 Download the Mac installation file.
 Double click on the downloaded weka-3-8-3-corretto-jvm.dmg file.
You will see the following screen on successful installation.
Click on the weka-3-8-3-corretto-jvm icon to start Weka.
 Optionally you may start it from the command line − java -jar weka.jar .
The WEKA GUI Chooser application will start and you would see the following
screen.

The GUI Chooser application allows you to run five different types of applications
as listed here –
 Explorer
 Experimenter
 KnowledgeFlow
 Workbench
 Simple CLI

Weka Environment:-
Weka - Launching Explorer
In this chapter, let us look into various functionalities that the explorer provides for
working with big data. When you click on the Explorer button in the Applications
selector, it opens the following screen-

On the top, you will see several tabs as listed here –


 Preprocess
 Classify
 Cluster
 Associate
 Select Attributes
 Visualize
WEKA contains a collection of visualization tools and algorithms for data analysis and predictive modeling, coupled with a graphical user interface. It supports several standard data mining tasks, more specifically: data pre-processing, clustering, classification, regression, visualization and feature selection.

Preprocess Tab

Initially as you open the explorer, only the Preprocess tab is enabled. The first
step in machine learning is to preprocess the data. Thus, in the Preprocess
option, you will select the data file, process it and make it fit for applying the
various machine learning algorithms.

Classify Tab

The Classify tab provides you with several machine learning algorithms for the classification of your data. To list a few, you may apply algorithms such as Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is very exhaustive and provides both supervised and unsupervised machine learning algorithms.

Cluster Tab

Under the Cluster tab, there are several clustering algorithms provided - such
as SimpleKMeans, FilteredClusterer, HierarchicalClusterer, and so on.

Associate Tab

Under the Associate tab, you would find Apriori, FilteredAssociator and FPGrowth.

Select Attributes Tab

Select Attributes allows you to perform feature selection based on several algorithms such as ClassifierSubsetEval, PrincipalComponents, etc.

Visualize Tab

Lastly, the Visualize option allows you to visualize your processed data for
analysis.
As you noticed, WEKA provides several ready-to-use algorithms for testing
and building your machine learning applications. To use WEKA effectively,
you must have a sound knowledge of these algorithms, how they work, which
one to choose under what circumstances, what to look for in their processed
output, and so on. In short, you must have a solid foundation in machine
learning to use WEKA effectively in building your apps.

Weka experimenter

The experimenter configures the test options for you with sensible defaults. The experiment is
configured to use Cross Validation with 10 folds. It is a “Classification” type problem and each
algorithm + dataset combination is run 10 times (iteration control).

Weka Knowledge Flow

The KnowledgeFlow presents a "dataflow" inspired interface to Weka. The user can select Weka components from a tool bar, place them on a layout canvas and connect them together in order to form a "knowledge flow" for processing and analyzing data.

Weka Workbench

WEKA is a workbench for machine learning that is intended to aid in the application of machine
learning techniques to a variety of real-world problems, in particular, those arising from
agricultural and horticultural domains.

Simple CLI
The Simple CLI (Command Line Interface) provides a command line interface for running the WEKA API. This is useful for running shell scripts to automate processes, or for calling the WEKA API from other applications.
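For example, typing the following command in the Simple CLI trains and cross-validates a J48 decision tree on a training file (the path is illustrative; adjust it to your WEKA data folder):

java weka.classifiers.trees.J48 -t data/weather.nominal.arff

Here -t names the training file; by default WEKA reports a 10-fold cross-validation of the model on that file.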
Practical 2
Weka - Loading Data

In this chapter, we start with the first tab that you use to preprocess the data. This is common to all algorithms that you would apply to your data for building the model and is a common step for all subsequent operations in WEKA.
For a machine learning algorithm to give acceptable accuracy, it is important that you cleanse your data first. This is because the raw data collected from the field may contain null values, irrelevant columns and so on.
In this chapter, you will learn how to preprocess the raw data and create a clean, meaningful dataset for further use. First, you will learn to load the data file into the WEKA explorer.

The data can be loaded from the following sources –


 Local file system
 Web
 Database
In this chapter, we will see all the three options of loading data in detail.

Loading Data from Local File System

Just under the Machine Learning tabs that you studied in the previous lesson,
you would find the following three buttons –

 Open file …
 Open URL…
 Open DB …

Click on the Open file ... button. A directory navigator window opens as shown
in the following screen –
Now, navigate to the folder where your data files are stored. WEKA installation
comes up with many sample databases for you to experiment. These are available
in the data folder of the WEKA installation.
Loading Data from Web

Once you click on the Open URL … button, you can see a window as follows –

We will open the file from a public URL. Type the following URL in the popup box –
https://storm.cis.fordham.edu/~gweiss/data-mining/wekadata/weather.nominal.arff
You may specify any other URL where your data is stored. The Explorer will load
the data from the remote site into its environment.

Loading Data from DB

Once you click on the Open DB ... button, you can see a window as follows –

Set the connection string to your database, set up the query for data selection,
process the query and load the selected records in WEKA.
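The same loading step can be scripted against the WEKA Java API. The sketch below is a minimal example, assuming weka.jar is on the classpath; the file name and URL follow this practical:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // Load from the local file system
        Instances fromFile = DataSource.read("weather.nominal.arff");
        // DataSource.read also accepts a URL string (loading from the web)
        Instances fromUrl = DataSource.read(
            "https://storm.cis.fordham.edu/~gweiss/data-mining/wekadata/weather.nominal.arff");
        System.out.println(fromFile.numInstances() + " instances loaded from file");
        System.out.println(fromUrl.numInstances() + " instances loaded from URL");
    }
}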
Weka - File Formats
WEKA supports a large number of file formats for the data. Here is the complete list −
 arff
 arff.gz
 bsi
 csv
 dat
 data
 json
 json.gz
 libsvm
 m
 names
 xrff
 xrff.gz
The types of files that it supports are listed in the drop-down list box at the bottom of the screen. This is shown in the screenshot given below.

As you would notice, it supports several formats including CSV and JSON. The default file type is Arff.

Arff Format
An Arff file contains two sections - header and data.

 The header describes the attribute types.

 The data section contains a comma separated list of data.

As an example for Arff format, the Weather data file loaded from the WEKA
sample databases is shown below –
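Since the screenshot does not reproduce here, the opening portion of weather.nominal.arff, as shipped with the WEKA sample data, is transcribed below (only the first few data rows are shown):

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes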

From the screenshot, you can infer the following points –

 The @relation tag defines the name of the database.

 The @attribute tag defines the attributes.

 The @data tag starts the list of data rows each containing the comma separated
fields.

 The attributes can take nominal values as in the case of outlook shown here –

@attribute outlook {sunny, overcast, rainy}

 The attributes can take real values as in this case −

@attribute temperature real

 You can also set a Target or a Class variable called play as shown here −
@attribute play {yes, no}

 The Target assumes two nominal values yes or no.

Other Formats
The Explorer can load the data in any of the earlier mentioned formats. As arff is
the preferred format in WEKA, you may load the data from any format and save it
to arff format for later use. After preprocessing the data, just save it to arff format
for further analysis.
Now that you have learned how to load data into WEKA, in the next chapter, you
will learn how to preprocess the data.

Practical-3&4

Implement attribute selection and visualization in WEKA & perform ETL operation over dataset.


Perform ETL operation over data set.

The ETL process can be broken down into three stages:

Extract: The first stage in the ETL process is to extract data from various sources such as transactional systems, spreadsheets and flat files. This step involves reading data from the source systems and storing it in a staging area.

Transform: In this stage, the extracted data is transformed into a format that is suitable for loading into the data warehouse. This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields.

Load: After the data is transformed, it is loaded into the data warehouse. This step involves creating the physical data structures and loading the data into the warehouse.

In short, ETL is a process in data warehousing that stands for Extract, Transform and Load: an ETL tool extracts the data from various data source systems, transforms it in the staging area, and finally loads it into the data warehouse.
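A small ETL-style flow can be reproduced with WEKA's Java API: extract records from a CSV source, transform them with a preprocessing filter, and load the result as an ARFF file. This is a minimal sketch under those assumptions; the file names are illustrative:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class SimpleEtl {
    public static void main(String[] args) throws Exception {
        // Extract: read raw records from a CSV source
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("salesdata.csv"));
        Instances raw = loader.getDataSet();

        // Transform: clean the data (here, fill in missing values)
        ReplaceMissingValues clean = new ReplaceMissingValues();
        clean.setInputFormat(raw);
        Instances transformed = Filter.useFilter(raw, clean);

        // Load: write the cleaned data into the target store (an ARFF file)
        ArffSaver saver = new ArffSaver();
        saver.setInstances(transformed);
        saver.setFile(new File("salesdata.arff"));
        saver.writeBatch();
    }
}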

• Attribute Selection and Visualization.
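Attribute selection can likewise be scripted. The following sketch, assuming weka.jar on the classpath and a loaded dataset whose class is the last attribute, uses the CfsSubsetEval evaluator with a GreedyStepwise search (one standard combination; the Select Attributes tab also offers ClassifierSubsetEval, PrincipalComponents, etc.):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());  // scores attribute subsets
        selector.setSearch(new GreedyStepwise());    // searches the subset space
        selector.SelectAttributes(data);

        // Print the names of the retained attributes
        for (int index : selector.selectedAttributes()) {
            System.out.println(data.attribute(index).name());
        }
    }
}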


Practical 5
Weka - Preprocessing the Data
The data that is collected from the field contains many unwanted things that lead to wrong analysis. For example, the data may contain null fields, it may contain columns that are irrelevant to the current analysis, and so on. Thus, the data must be preprocessed to meet the requirements of the type of analysis you are seeking. This is done in the preprocessing module.

To demonstrate the available features in preprocessing, we will use the Weather


database that is provided in the installation.

Using the Open file ... option under the Preprocess tab, select the weather.nominal.arff file.

When you open the file, your screen looks as shown here –
Understanding Data.

Let us first look at the highlighted Current relation sub window. It shows the name
of the database that is currently loaded. You can infer two points from this sub
window −
 There are 14 instances - the number of rows in the table.
 The table contains 5 attributes - the fields, which are discussed in the
upcoming sections.

On the left side, notice the Attributes sub window that displays the various
fields in the database.
The weather database contains five fields - outlook, temperature, humidity, windy
and play. When you select an attribute from this list by clicking on it, further
details on the attribute itself are displayed on the right hand side.
Let us select the temperature attribute first. When you click on it, you would see
the following screen-

In the Selected Attribute subwindow, you can observe the following –


 The name and the type of the attribute are displayed.
 The type for the temperature attribute is Nominal.
 The number of Missing values is zero.
 There are three distinct values with no unique value.
 The table underneath this information shows the nominal values for this field as hot, mild and cool.
 It also shows the count and weight in terms of a percentage for each nominal
value.
At the bottom of the window, you see the visual representation of the class values.

If you click on the Visualize All button, you will be able to see all features in one
single window as shown here-
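The same attribute statistics that the Selected Attribute subwindow displays can be queried programmatically. A minimal sketch, assuming weka.jar on the classpath and the sample file used in this practical:

import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        System.out.println("Instances:  " + data.numInstances());   // 14 rows
        System.out.println("Attributes: " + data.numAttributes());  // 5 fields

        // Statistics for the temperature attribute (index 1)
        AttributeStats stats = data.attributeStats(1);
        System.out.println("Missing:  " + stats.missingCount);
        System.out.println("Distinct: " + stats.distinctCount);
        for (int i = 0; i < stats.nominalCounts.length; i++) {
            System.out.println(data.attribute(1).value(i) + ": " + stats.nominalCounts[i]);
        }
    }
}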

Practical 6
Procedure for Implementation
The process of building a Data Mart can be complex, but it generally involves the
following 5 easy steps:
 Step 1: Design
 Step 2: Build / Construct
 Step 3: Populate / Data Transfer
 Step 4: Data Access
 Step 5: Manage
Step 1: Design
This is the first step when building a Data Mart. It includes tasks such as initiating a request for the Data Mart and collecting information about the requirements. Other tasks involved in this step include identifying the data sources and selecting the right data subset. The output of this step is the logical and physical design of the Data Mart.
Step 2: Build / Construct
This is the step during which both the physical and the logical structures for the Data Mart are created. In this step, you create the tables, indexes, fields, and access controls.
Step 3: Populate / Data Transfer
This is the step in which you populate the Data Mart by transferring data into it.
You can also set the frequency with which data transfer will be done, whether daily
or weekly.
To ensure that information stored in the structure is clean, it is always overwritten
during the population of the Data Mart. In this step, the source information is
extracted, cleaned, transformed, and loaded into the Data Mart.
Step 4: Data Access
In this step, the data that has been loaded into the Data Mart is put into active use.
Activities involved here include querying, generating graphs and reports, and
publishing.
To make it easy for non-technical users to use the Data Mart, a meta-layer should be set up and item names and database structures translated into corporate expressions. If possible, interfaces and APIs should be set up to ease the process of data access.
Step 5: Manage
This is the last step when building a Data Mart and it involves the following tasks:
 Controlling user access.
 Refining and optimizing the target system to improve its performance.
 Adding new data into the Data Mart and managing it.
 Configuring recovery settings and ensuring that the system is available even after the occurrence of disasters.

Practical-7
Build a classification model to classify data using Naïve Bayes Algorithm.

 Naive Bayes Algorithm.

The Naïve Bayes classifier is a supervised machine learning algorithm, which is


used for classification tasks, like text classification. It is also part of a family of
generative learning algorithms, meaning that it seeks to model the distribution of
inputs of a given class or category.

 Build a Naïve Bayes classifier by following these steps:-


Step-1: Click on Explorer.
Step-2: Click on Open file.
Step-3: Choose the file/folder which is saved in .arff format.
Step-4: Click the Open button.
Step-5: Click on Choose in the top corner of the Classify tab.
Step-6: Select NaiveBayes under bayes.
Step-7: Click on the Start button.
Step-8: Visualize the model.
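The same model can be built and evaluated from the WEKA Java API. A minimal sketch, assuming weka.jar on the classpath and any nominal-class .arff file (the weather data is used as a stand-in):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class is the last attribute

        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));  // 10-fold cross-validation

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}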
Practical-8
Build a classification model using different Decision Tree Algorithm.

 Decision Tree Algorithm-


Decision trees are a popular machine learning algorithm that can be used for both
regression and classification tasks. They are easy to understand, interpret, and
implement, making them an ideal choice for beginners in the field of machine
learning. In this comprehensive guide, we will cover all aspects of the decision tree
algorithm, including the working principles, different types of decision trees, the
process of building decision trees, and how to evaluate and optimize decision trees.

 Build a Decision Tree classifier by following these steps:-


Step-1: Click on Explorer.
Step-2: Click on Open file.
Step-3: Choose the file/folder which is saved in .arff format.
Step-4: Click the Open button.
Step-5: Click on Choose in the top corner of the Classify tab.
Step-6: Select J48 under trees.
Step-7: Click on the Start button.
Step-8: Visualize the model.
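A corresponding Java sketch builds a J48 tree (WEKA's C4.5 implementation) and prints its structure; the file name is illustrative:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DecisionTreeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // toString() renders the learned tree in text form,
        // equivalent to the model shown in the Classifier output panel
        System.out.println(tree);
    }
}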
Practical-9
Apply Regression to make Marketing Forecast over Sales Data.

Regression is the easiest technique to use, but it is also probably the least powerful. The regression model is then used to predict the result of an unknown dependent variable, given the values of the independent variables.

Logistic regression is a binary classification algorithm.

The algorithm learns a coefficient for each input value; the inputs are linearly combined into a regression function and transformed using a logistic (S-shaped) function. It is a fast and simple technique, but it can be very effective on some problems.
Logistic regression only supports binary classification problems, although the WEKA implementation has been adapted to support multi-class classification problems.

Steps including:
Step-1: Open WEKA Explorer.
Step-2: Select the salesdata.arff file from the Open file option under the Preprocess tab.
Step-3: Go to the Classify tab for classifying the unclassified data. Click on the Choose button and select "functions.Logistic".
Step-4: Click on the Start button. The classification output will be seen on the right-hand panel. We can see that with the default configuration, logistic regression achieves an accuracy of about 63%.
 It shows run information in the panel –
 Correctly classified instances: 2948
 Incorrectly classified instances: 1679; total no. of instances: 4627
 Detailed Accuracy By Class.
 Confusion Matrix.

Step-5: To visualize the results, right click on the result entry and select among –
* Margin curve
* Threshold curve (class value high)
* Threshold curve (class value low)
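A corresponding Java sketch trains Logistic on a train/test split and reports the accuracy; the file name and the 66% split are illustrative choices:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SalesForecast {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("salesdata.arff");
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        // 66% of the instances for training, the rest held out for testing
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        Logistic model = new Logistic();
        model.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
        System.out.println(eval.toMatrixString());
    }
}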

Practical-10
Implement Clustering Algorithm over different Dataset.

Clustering Algorithm.
Clustering is an unsupervised machine learning technique that groups data points into clusters so that objects in the same group are similar to each other. Clustering helps to split data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters. Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.
Density Based Spatial Clustering of Applications with Noise (DBSCAN).
The DBSCAN algorithm is based on the intuitive notions of "clusters" and "noise". The key idea is that for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Steps to be followed:-
Step1: Open the WEKA Explorer in the preprocessing interface and import the appropriate dataset.
Step2: To perform clustering, go to the explorer's "Cluster" tab and click the Choose button. As a result of this step, a drop-down list of available clustering algorithms is displayed. Pick the Hierarchical or DBSCAN algorithm.
Step3: Click on the Start button. The resulting window displays the centroid of each cluster, as well as data on the number and proportion of instances assigned to each cluster. A mean vector is used to represent each cluster centroid.

It shows run information in the panel-


 Scheme.
 Relation.
 Instances.
 Attributes.
 Test mode.
 Clustering results.
 Clustered instances.
 Unclustered instances.
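As a scripted counterpart, the sketch below clusters a dataset with SimpleKMeans (built into WEKA; DBSCAN is available as an optional package). The class attribute is removed first, since clustering is unsupervised; the file name and cluster count are illustrative:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");

        // Drop the class attribute so only the input fields are clustered
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances inputs = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);
        kmeans.buildClusterer(inputs);
        System.out.println(kmeans);  // prints the cluster centroids

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(inputs);
        System.out.println(eval.clusterResultsToString());
    }
}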
PRACTICAL-11
Apply Apriori algorithm to find out association rules in data set.
Association rules are mined out after frequent itemsets in a big dataset are found.
These datasets are found out using mining algorithms such as Apriori and FP
Growth. Frequent Itemset mining mines data using support and confidence
measures.
WEKA contains an implementation of the Apriori algorithm for learning
association rules. Apriori works only with binary attributes and categorical (nominal) data; if the data set contains any numerical values, convert them into nominal first.
Apriori finds out all rules with minimum support and confidence threshold.
Follow the steps below :-
Step :1 Open WEKA Explorer and under Preprocess tab choose “apriori.csv” file.
Step :2 The file now gets loaded in the WEKA Explorer.
Step :3 Go to the Associate tab. The apriori rules can be mined from here.
Step :4 The textbox next to the Choose button shows "Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1", which depicts the summarized rules set for the algorithm in the settings tab.
Step :5 Click on Start Button. The association rules are generated in the right
panel. This panel consists of 2 sections. First is the algorithm, the dataset was
chosen to run. The second part shows the Apriori Information.
Let us understand the run information in the right panel:
 The scheme used is Apriori.
 Instances and Attributes: It has 14 instances and 5 attributes.
 Minimum support and minimum confidence are 0.15 and 0.9 respectively. Out of 14 instances, 2 instances are found with min support.
 The number of cycles performed for the mining association rule is 17.
 The large itemsets generated are 4: L(1), L(2), L(3) and L(4), with sizes 12, 47, 39 and 6 respectively.
 Rules found are ranked. The interpretation of these rules is as follows: Outlook overcast 4 => Play yes 4 means that out of 14 instances, 4 instances show that for outlook overcast, play is true. This gives a strong association: the confidence level is 1.0.
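The same mining run can be scripted. A minimal sketch, assuming a nominal dataset such as the weather data (Apriori requires nominal attributes, as noted above); the parameter values mirror the settings shown in the textbox:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);              // -N 10: report the 10 best rules
        apriori.setMinMetric(0.9);            // -C 0.9: minimum confidence
        apriori.setLowerBoundMinSupport(0.1); // -M 0.1: lower bound on support
        apriori.buildAssociations(data);

        System.out.println(apriori);  // prints the ranked association rules
    }
}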
Practical-12
Build a classifier to identify diabetic and non-diabetic patients.
Naive Bayes :-
Naive Bayes is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems. It is mainly used in text classification involving high-dimensional training datasets.
Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast
machine learning models that can make quick predictions.
Decision Tree :-
A decision tree is a type of supervised learning algorithm that
is commonly used in machine learning to model and predict
outcomes based on input data. It is a tree-like structure where
each internal node tests on attribute, each branch corresponds
to attribute value and each leaf node represents the final
decision or prediction. The decision tree algorithm falls under
the category of supervised learning. They can be used to solve
both regression and classification problems.
Logistic Regression :-
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or false, etc.; but instead of giving the exact values 0 and 1, it gives the probabilistic values which lie between 0 and 1.
Logistic Regression is much similar to the Linear Regression
except that how they are used. Linear Regression is used for
solving Regression problems, whereas Logistic regression is
used for solving the classification problems.
DATASET --
Here we use the PIMA Indian diabetes data set, which consists of 9 attributes:
 preg - the number of times the subject had been pregnant
 plas - the concentration of blood plasma glucose (two hours after drinking a glucose solution)
 pres - diastolic blood pressure in mmHg
 skin - triceps skin fold thickness in mm
 insu - serum insulin (two hours after drinking glucose
solution)
 mass - body mass index ((weight/height)**2)
 pedi - 'diabetes pedigree function' (a measure of the extent to which an individual has a hereditary or genetic risk of diabetes higher than the norm)
 age - in years
Here, the class label is binary. It has two values
 Tested positive (1) which means diabetic and
 Tested negative (0) which means non diabetic.

Steps to be followed:
Step : 1 Open WEKA Explorer and under the Preprocess tab choose the "diabetic and non-diabetic.csv" file.
Step : 2 The file now gets loaded in the WEKA Explorer.
Step : 3 Go to the "Classify" tab. The algorithm can be chosen from here.
Step : 4 The textbox next to the Choose button shows "RandomForest -I 10 -K 0 -S 1" or "Logistic -R 1.0E-8 -M -1", which depicts the summarized rules set for the specific algorithm in the settings tab.
Step : 5 Click on the Start button. The classification results are generated in the right panel. This panel shows –
 Summary
 Detailed Accuracy By Class
 Confusion Matrix
 Run Information.
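To compare the classifiers discussed above in one run, the sketch below cross-validates each of them on the PIMA data. The file name diabetes.arff matches the copy shipped in WEKA's data folder; RandomForest could be substituted for J48 as in the step above:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DiabetesClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);  // tested_positive / tested_negative

        Classifier[] models = { new NaiveBayes(), new J48(), new Logistic() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-12s accuracy: %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}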
PRACTICAL -13
Analyze the IRIS datasets in weka and apply suitable data mining technique.

IRIS is an open access flower-based dataset, normally available from the UCI repository. The major objective of this practical is to examine the IRIS data using the data mining techniques supported in WEKA.

Steps to be followed:-
Step : 1 Open WEKA Explorer and under the Preprocess tab choose the "iris.csv" file.
Step : 2 The file now gets loaded in the WEKA Explorer.
Step : 3 Go to the "Classify" tab for classifying the unclassified data. Click on the "Choose" button and select "bayes.NaiveBayes" and "trees.DecisionStump".
Step : 4 To perform clustering, go to the explorer's "Cluster" tab and click the Choose button. As a result of this step, a drop-down list of available clustering algorithms is displayed. Pick the Hierarchical or DBSCAN algorithm.
Step : 5 Visualize the model.
The classifiers used are bayes.NaiveBayes and trees.DecisionStump.
Density Based Spatial Clustering of Applications with Noise (DBSCAN).
As noted in Practical 10, the DBSCAN algorithm is based on the intuitive notions of "clusters" and "noise": for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
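The classification half of this practical can be scripted as below: both classifiers named in step 3 are cross-validated on the iris data (iris.arff as shipped with WEKA; load iris.csv with CSVLoader instead if you are using the CSV copy):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IrisAnalysis {
    public static void main(String[] args) throws Exception {
        Instances iris = DataSource.read("iris.arff");
        iris.setClassIndex(iris.numAttributes() - 1);  // the species attribute

        for (Classifier model : new Classifier[]{ new NaiveBayes(), new DecisionStump() }) {
            Evaluation eval = new Evaluation(iris);
            eval.crossValidateModel(model, iris, 10, new Random(1));
            System.out.println(model.getClass().getSimpleName());
            System.out.println(eval.toSummaryString());
        }
    }
}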
