
HMR INSTITUTE OF TECHNOLOGY

& MANAGEMENT

Laboratory Manual
DATA MINING AND BUSINESS INTELLIGENCE

Department: Computer Science & Engineering

SUBMITTED TO: Ms. Shristy Goswami, Assistant Professor, CSE, HMRITM
SUBMITTED BY: Deepak Bhora, 20196502720, CSE-7C
INDEX

S.No.  Aim of Experiment                                                          Date  Signature
1      Introduction to WEKA
2      Study of ETL process and its tools.
3      To create an ARFF file.
4      Implementation of Classification technique on ARFF files using WEKA.
5      Implementation of Clustering technique on ARFF files using WEKA.
6      Implementation of Association Rule technique on ARFF files using WEKA.
7      To explore 2 graphs, view their ARFF files and apply an algorithm on them.
8      To use the NumericTransform filter and floor function to obtain the same
       precision across values.
Experiment - 01

Aim: Introduction to WEKA

WEKA, formally called the Waikato Environment for Knowledge Analysis, is a computer program
that was developed at the University of Waikato in New Zealand for the purpose of identifying
information in raw data gathered from agricultural domains. WEKA supports many standard data
mining tasks such as data preprocessing, classification, clustering, regression, visualization
and feature selection. The basic premise of the application is to utilize a computer program
that can be trained to perform machine learning tasks and derive useful information in the form
of trends and patterns. WEKA is an open-source application that is freely available under the
GNU General Public License. Originally written in C, the WEKA application has been completely
rewritten in Java and is compatible with almost every computing platform. It is user friendly,
with a graphical interface that allows for quick set-up and operation. WEKA operates on the
assumption that the user data is available as a flat file or relation; this means that each
data object is described by a fixed number of attributes, usually of a specific type, normally
alphanumeric or numeric values. The WEKA application gives novice users a tool to identify
hidden information in databases and file systems with simple-to-use options and visual
interfaces.

Installation

Information about the program can be found by searching the Web for WEKA Data Mining
or by going directly to the site at www.cs.waikato.ac.nz/~ml/WEKA . The site has a very large
amount of useful information on the program's benefits and background. New users may benefit
from consulting the user manual for the program. The main WEKA site has links to this
information as well as to past experiments, which can help new users identify the uses most
relevant to them. When ready to download the software, it is best to select the latest version
offered on the site. The application is distributed as a self-installing package, and
installation is a simple procedure that places the complete program on the end user's machine,
ready to use once extracted.
Opening the program

Once the program has been loaded on the user's machine, it is opened by navigating to the
program's start option, which depends on the user's operating system. Figure 1 is an example of
the initial opening screen on a computer running Windows XP.

Figure 1 Chooser screen

There are four options available on this initial screen:


 Simple CLI – provides users without a graphical interface option the ability to execute
commands from a terminal window (a sample command follows this list).
 Explorer – the graphical interface used to conduct experimentation on raw data.
 Experimenter – this option allows users to conduct different experimental variations on data
sets and perform statistical manipulation.
 Knowledge Flow – basically the same functionality as Explorer, with drag-and-drop
functionality. The advantage of this option is that it supports incremental learning from
previous results.
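
As an illustration, the Simple CLI accepts the same commands one would type in an ordinary
terminal. With weka.jar on the classpath, the following single command (the file name is only
illustrative) trains and evaluates a J48 decision tree:

    java -cp weka.jar weka.classifiers.trees.J48 -t weather.arff

The -t option names the training file; the pruned tree and its evaluation statistics are
printed to the terminal.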

While all of these options can be useful for different applications, the remainder of this
guide focuses on the Explorer option, whose tabs are described below.
After selecting the Explorer option, the program starts and presents the user with a separate
graphical interface.

Figure 2

Figure 2 shows the opening screen with the available options. At first only the Preprocess tab
in the top left corner can be selected. This is because a data set must first be presented to
the application before it can be manipulated. After the data has been pre-processed, the other
tabs become active.

There are six tabs:


1. Preprocess – used to choose the data file to be used by the application
2. Classify – used to test and train different learning schemes on the pre-processed data
file under experimentation
3. Cluster – used to apply different tools that identify clusters within the data file
4. Associate – used to apply different rules to the data file that identify associations
within the data
5. Select attributes – used to apply different rules to reveal changes based on the inclusion
or exclusion of selected attributes from the experiment
6. Visualize – used to see what the various manipulations produced on the data set in a 2D
format, as scatter plot and bar graph output

Once the initial preprocessing of the data set has been completed, the user can move between
the tab options to make changes to the experiment and view the results in real time. This
provides the ability to move from one option to the next, so that when a condition is exposed
it can be examined in a different view immediately.

Preprocessing

In order to experiment with the application, the data set needs to be presented to WEKA in a
format that the program understands. There are rules for the type of data that WEKA will
accept. There are three options for presenting data to the program:
 Open File – allows the user to select files residing on the local machine or recorded medium
 Open URL – provides a mechanism to locate a file or data source at a different location
specified by the user
 Open Database – allows the user to retrieve files or data from a database source provided by
the user

There are restrictions on the type of data that can be accepted by the program. Originally the
software was designed to import only ARFF files; newer versions also accept file types such
as CSV, C4.5 and serialized instance formats. The extensions for these files include .csv,
.arff, .names, .bsi and .data.
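
Such files can also be converted programmatically. Below is a minimal Java sketch using WEKA's
converter classes, assuming weka.jar is on the classpath; the file names are illustrative:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.CSVLoader;

    public class CsvToArff {
        public static void main(String[] args) throws Exception {
            // Load a CSV file (the file name here is only an example)
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("weather.csv"));
            Instances data = loader.getDataSet();

            // Save the data in ARFF format so the Explorer can open it directly
            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File("weather.arff"));
            saver.writeBatch();
        }
    }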
Figure 3 shows an example of selection of the file weather.arff.
Figure 3

Once the initial data has been selected and loaded, the user can select options for refining
the experimental data. The options in the Preprocess window include the selection of optional
filters to apply, and the user can select or remove different attributes of the data set as
necessary to identify specific information. The ability to pick from the available attributes
allows users to separate different parts of the data set for clarity in the experimentation.
The user can modify the attribute selection and change the relationships among the different
attributes by deselecting choices from the original data set. There are many filtering options
available within the Preprocess window, and the user can select among them based on need and
the type of data present.

Classify

The user has the option of applying many different algorithms to the data set that would, in
theory, produce a representation of the information that makes observation easier. It is
difficult to identify in advance which of the options will provide the best output for the
experiment. The best approach is to apply a mixture of the available choices independently and
see which yields something close to the desired results. The Classify tab is where the user
selects the classifier choices. Figure 4 shows some of the categories.
Figure 4

Again, there are several options to be selected inside the Classify tab. The Test options box
gives the user the choice of four different test modes for the data set:
1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split

Any or all of these modes can be applied to produce results that can be compared by the user.
Additionally, inside the Test options box there is a dropdown menu from which the user can
select various items that, depending on the choice, provide output options such as saving the
results to a file or specifying the random seed value to be applied for the classification.

The classifiers in WEKA are trained to produce output that is classified based on the
characteristics of the last attribute in the data set. For a different attribute to be used as
the class, it must be selected by the user in the options menu before testing is performed.
Once the results have been calculated, they are shown in the text box at the lower right. They
can be saved to a file and retrieved later for comparison, or viewed within the window after
changes have been made and new results derived.
Cluster

The Cluster tab opens the process used to identify commonalities, or clusters, of occurrences
within the data set and produce information for the user to analyze. A few options within the
Cluster window are similar to those described for the Classify tab: use training set, supplied
test set, and percentage split. The fourth option, classes to clusters evaluation, compares how
well the clusters match a pre-assigned class within the data. While in cluster mode, users have
the option of ignoring some of the attributes from the data set. This can be useful if there
are specific attributes causing the results to be out of range, or for large data sets.
Figure 5 shows the Cluster window and some of its options.

Figure 5

Associate

The Associate tab opens a window for selecting the options for finding associations within the
data set. The user selects one of the choices and presses Start to produce the results. There
are few options for this window; they are shown in Figure 6 below.
Figure 6

Select Attributes

The next tab is used to select the specific attributes used for the calculation process. By
default, all of the available attributes are used in the evaluation of the data set. If the
user wanted to exclude certain categories of the data, they would deselect those specific
choices from the list in this window. This is useful if some of the attributes are of a
different form, such as alphanumeric data, that could alter the results. The software searches
through the selected attributes to decide which of them best fit the desired calculation. To
perform this, the user has to select two options: an attribute evaluator and a search method.
Once this is done, the program evaluates the data based on the subset of attributes, then
performs the necessary search for commonality within the data.
Figure 7 shows the options for attribute evaluation.
Figure 8 shows the options for the search method.

Figure-7
Figure-8
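
The same evaluator-plus-search pairing can also be driven from WEKA's Java API. This is a
minimal sketch, assuming weka.jar on the classpath; the CfsSubsetEval evaluator, BestFirst
search and weather.arff file are illustrative choices, not the only ones available:

    import java.util.Arrays;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SelectAttributesDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new CfsSubsetEval()); // the attribute evaluator
            selector.setSearch(new BestFirst());        // the search method
            selector.SelectAttributes(data);            // evaluate on the data set

            // Indices of the attributes judged to best fit the calculation
            System.out.println(Arrays.toString(selector.selectedAttributes()));
        }
    }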

Visualization

The last tab in the window is the visualization tab. Within the program calculations and
comparisons have occurred on the data set. Selections of attributes and methods of manipulation
have been chosen. The final piece of the puzzle is looking at the information that has been derived
throughout the process. The user can now actually see the fruit of their efforts in a two dimensional
representation of the information. The first screen that the user sees when they select the
visualization option is a matrix of plots representing the different attributes within the data set
plotted against the other attributes. If necessary, there is a scroll bar to view all of the produced
plots. The user can select a specific plot from the matrix to view its contents for analyzation. A
grid pattern of the plots allows the user to select the attribute positioning to their liking and for
better understanding. Once a specific plot has been selected the user can change the attributes from
one view to another providing flexibility. Figure 9 shows the plot matrix view.
Figure 9

The scatter plot matrix gives the user a visual representation of the manipulated data sets for
selection and analysis. The attributes appear across the top and again from top to bottom,
giving the user easy access to the area of interest. Clicking on a plot brings up a separate
window of the selected scatter plot. The user can then look at a visualization of the data for
the selected attributes and select areas of the scatter plot with a selection window, or click
on points within the plot to identify each point's specific information. Figure 10 shows the
scatter plot for two attributes and the points derived from the data set. There are a few
viewing options that may be helpful to the user. The plot is formatted like an X/Y graph, yet
it can show any of the attribute classes that appear on the main scatter plot matrix. This is
handy when the scale of one attribute cannot be ascertained on one axis relative to the other.
Within the plot, the points can be adjusted using a feature called jitter. This option shifts
the individual points so that, where data points lie close together, the user can reveal hidden
multiple occurrences within the initial plot. Figure 11 shows an example of this point
selection and the results the user sees.
Figure 10

Figure 11
There are a few options to manipulate the view for the identification of subsets or to separate the
data points on the plot.
 Polyline — can be used to segment different values for additional visualization clarity on the
plot. This is useful when there are many data points represented on the graph.
 Rectangle — this tool is helpful to select instances within the graph for copying or clarification.
 Polygon — Users can connect points to segregate information and isolate points for reference.
Experiment - 02

Aim: Study of ETL process and its tools.

ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process
in which an ETL tool extracts the data from various data source systems, transforms it in the
staging area, and then finally, loads it into the Data Warehouse system.

1. Data Extraction:

The first step of the ETL process is extraction. In this step, data from various source
systems, which may exist in various formats such as relational databases, NoSQL stores, XML,
and flat files, is extracted into the staging area. It is important to extract the data from
the source systems and store it in the staging area first, rather than directly in the data
warehouse, because the extracted data is in various formats and can also be corrupted. Loading
it directly into the data warehouse may therefore damage it, and rollback would be much more
difficult. This makes extraction one of the most important steps of the ETL process.

2. Data Transformation:

The second step of the ETL process is transformation. In this step, a set of rules or functions
is applied to the extracted data to convert it into a single standard format. It may involve
the following processes/tasks (a small illustrative sketch follows the list):
 Filtering – loading only certain attributes into the data warehouse.
 Cleaning – filling up NULL values with some default values; mapping U.S.A, United
States, and America to USA; etc.
 Joining – joining multiple attributes into one.
 Splitting – splitting a single attribute into multiple attributes.
 Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
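
As a minimal sketch of the cleaning, filtering, and sorting tasks above (the Customer record
and its fields are hypothetical and do not come from any particular ETL tool):

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    public class TransformDemo {
        // Hypothetical shape of a row extracted from a source system
        record Customer(String name, String country, double revenue) {}

        // Cleaning: map spelling variants of a country name to one standard value
        static final Map<String, String> COUNTRY_MAP =
                Map.of("U.S.A", "USA", "United States", "USA", "America", "USA");

        static List<Customer> transform(List<Customer> extracted) {
            return extracted.stream()
                    .map(c -> new Customer(c.name(),
                            COUNTRY_MAP.getOrDefault(c.country(), c.country()),
                            c.revenue()))                          // cleaning
                    .filter(c -> c.revenue() > 0)                  // filtering
                    .sorted(Comparator.comparing(Customer::name))  // sorting on a key attribute
                    .toList();
        }

        public static void main(String[] args) {
            List<Customer> staged = List.of(
                    new Customer("Acme", "U.S.A", 120.0),
                    new Customer("Globex", "United States", 0.0));
            transform(staged).forEach(System.out::println);
        }
    }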

3. Data Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is
finally loaded into the data warehouse. Sometimes the data is loaded into the warehouse very
frequently, and sometimes after longer but regular intervals. The rate and period of loading
depend solely on the requirements and vary from system to system.
The ETL process can also use the pipelining concept: as soon as some data is extracted, it can
be transformed, and during that period some new data can be extracted. Likewise, while the
transformed data is being loaded into the data warehouse, the already extracted data can be
transformed. The block diagram of the pipelined ETL process is shown below:

ETL Tools:

Extraction, transformation, and loading help an organization make its data accessible,
meaningful, and usable across different data systems. An ETL tool is software used to extract,
transform, and load data. In effect, it is a set of libraries, written in some programming
language, that simplifies data integration and transformation work for any need.
Commonly used ETL tools include Sybase, Oracle Warehouse Builder, CloverETL, and MarkLogic.
Experiment - 03

Aim: To create an ARFF file.

To create an ARFF file, the steps involved are:

1. Open a Notepad file. Make a relation student and provide the attributes and data for the
relation.
2. Save the file with the .arff extension. Open WEKA and load the file with the Open file
option, then go to the Classify tab.
3. Select a classifier. At the top of the Classify section is the classifier box. This box has
a text field that gives the name of the currently selected classifier/algorithm and its
options.
4. Select a classifier by clicking on the Choose button, which lists the classifiers available
in WEKA.
5. Choose any classifier as per your need and close the list; for instance, choose the J48
classifier for your file and press the Start button. The result is obtained in the Classifier
output window.
6. Now right-click the highlighted entry in the Result list window and select the type of graph
you want to evaluate.

We have successfully created an ARFF file and obtained the output.

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2


Relation: student
Instances: 10
Attributes: 3
            name
            percentage
            play
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree


------------------
: yes (10.0/4.0)

Number of Leaves : 1

Size of the tree : 1

Time taken to build model: 0.01 seconds


=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances 6 60 %


Incorrectly Classified Instances 4 40 %
Kappa statistic 0
Mean absolute error 0.5333
Root mean squared error 0.5443
Relative absolute error 101.1494 %
Root relative squared error 101.793 %
Total Number of Instances 10

=== Detailed Accuracy by Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  MCC  ROC Area  PRC Area  Class
1.000    1.000    0.600      1.000   0.750      ?    0.000     0.600     yes
0.000    0.000    ?          0.000   ?          ?    0.000     0.400     no
Weighted Avg.  0.600  0.600  ?  0.600  ?  ?  0.000  0.520

=== Confusion Matrix ===

a b <-- classified as
6 0 | a = yes
4 0 | b = no

The contents of the notepad file are:-


@relation student
@attribute name {Alex,Twinkle,Geetu,Shalu,Kanika,Karishma,Ayush,Akshay,Sarvesh,Mahesh}
@attribute percentage real
@attribute play {yes, no}
@data
Alex,88,no
Twinkle,90,yes
Geetu,85,yes
Shalu,84,no
Kanika,75,yes
Karishma,80,yes
Ayush,83,no
Akshay,74,yes
Sarvesh,88,yes
Mahesh,78,no
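
The same relation can also be built programmatically with WEKA's Java API instead of Notepad.
The sketch below is illustrative: it assumes weka.jar on the classpath and abbreviates the
nominal name attribute to three of the ten values, with a single data row:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.Arrays;
    import weka.core.Attribute;
    import weka.core.DenseInstance;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;

    public class CreateStudentArff {
        public static void main(String[] args) throws Exception {
            // Declare the attributes: nominal, numeric, nominal
            ArrayList<Attribute> attrs = new ArrayList<>();
            attrs.add(new Attribute("name", Arrays.asList("Alex", "Twinkle", "Geetu")));
            attrs.add(new Attribute("percentage"));
            attrs.add(new Attribute("play", Arrays.asList("yes", "no")));

            // Empty dataset for the "student" relation
            Instances data = new Instances("student", attrs, 0);

            // One data row: Alex,88,no
            Instance row = new DenseInstance(3);
            row.setDataset(data);
            row.setValue(0, "Alex");
            row.setValue(1, 88);
            row.setValue(2, "no");
            data.add(row);

            // Write the relation out as student.arff
            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File("student.arff"));
            saver.writeBatch();
        }
    }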
Experiment - 04

Aim : Implementation of Classification technique on ARFF files using WEKA.

This experiment illustrates the use of the J48 classifier in WEKA. The sample data set used in
this experiment is the "student" data, available in .arff format. This document assumes that
appropriate data pre-processing has been performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (exp-4.arff) into WEKA.

Step 2: Next we select the "Classify" tab and click the "Choose" button to select the "J48"
classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking in the text
box to the right of the Choose button. In this example, we accept the default values. The
default version does perform some pruning but does not perform error pruning.

Step 4: Under the "Test options" in the main panel, we select 10-fold cross-validation as our
evaluation approach. Since we don't have a separate evaluation data set, this is necessary to
get a reasonable idea of the accuracy of the generated model.

Step 5: We now click "Start" to generate the model. The ASCII version of the tree as well as
the evaluation statistics will appear in the right panel when the model construction is
complete.

Step 6: Note that the classification accuracy of the model is about 79%. This indicates that
more work may be needed (either in preprocessing or in selecting better parameters for the
classification).

Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done
by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.

Step 8: We will use our model to classify new instances.

Step 9: In the main panel, under "Test options", click the "Supplied test set" radio button and
then click the "Set" button. This will pop up a window which allows you to open the file
containing the test instances.
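
For reference, the same 10-fold cross-validation can be run from WEKA's Java API rather than
the Explorer. This is a minimal sketch, assuming weka.jar on the classpath and exp-4.arff in
the working directory:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassifyDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("exp-4.arff");
            data.setClassIndex(data.numAttributes() - 1); // buyspc is the class

            J48 tree = new J48(); // default options: -C 0.25 -M 2

            // 10-fold cross-validation, as chosen in step 4
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());

            // Build on the full training set to print the pruned tree
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }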
=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2


Relation: student
Instances: 14
Attributes: 5
            age
            income
            student
            credit-rating
            buyspc
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree


------------------

age = <30: no (5.0/1.0)
age = 30-40: yes (4.0)
age = >40
|   credit-rating = fair: yes (3.0)
|   credit-rating = excellent: no (2.0)

Number of Leaves : 4

Size of the tree : 6

Time taken to build model: 0.01 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 11 78.5714 %


Incorrectly Classified Instances 3 21.4286 %
Kappa statistic 0.5532
Mean absolute error 0.25
Root mean squared error 0.4058
Relative absolute error 49.5283 %
Root relative squared error 79.6745 %
Total Number of Instances 14

=== Detailed Accuracy by Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.875    0.333    0.778      0.875   0.824      0.559  0.854     0.919     yes
0.667    0.125    0.800      0.667   0.727      0.559  0.854     0.727     no
Weighted Avg.  0.786  0.244  0.787  0.786  0.782  0.559  0.854  0.837

=== Confusion Matrix ===


a b <-- classified as
7 1 | a = yes
2 4 | b = no

Dataset exp-4.arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low, medium, high}

@attribute student {yes, no}

@attribute credit-rating {fair, excellent}

@attribute buyspc {yes, no}

@data

<30, high, no, fair, no

<30, high, no, excellent, no

30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes

>40, low, yes, excellent, no

30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes


30-40, high, yes, fair, yes

>40, medium, no, excellent, no

The following screenshot shows the classification rules that were generated when the J48
algorithm is applied to the given dataset.
Experiment - 05

Aim : Implementation of Clustering technique on ARFF files using WEKA.

This experiment illustrates the use of simple k-means clustering with the WEKA Explorer. The
sample data set used for this example is based on the iris data, available in ARFF format. This
document assumes that appropriate preprocessing has been performed. The iris dataset includes
150 instances.

Steps involved in this Experiment

Step 1: Run the WEKA Explorer and load the data file iris.arff in the preprocessing interface.

Step 2: In order to perform clustering, select the 'Cluster' tab in the Explorer and click on
the Choose button. This step results in a dropdown list of available clustering algorithms.

Step 3: In this case we select 'SimpleKMeans'.

Step 4: Next click on the text box to the right of the Choose button to get the popup window
shown in the screenshots. In this window we enter six as the number of clusters and we leave
the seed value as it is. The seed value is used in generating a random number, which in turn is
used for making the internal assignments of instances to clusters.

Step 5: Once the options have been specified, we run the clustering algorithm. In the 'Cluster
mode' panel we make sure that the 'Use training set' option is selected, and then we click the
'Start' button. This process and the resulting window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster as well as statistics on the
number and percentage of instances assigned to the different clusters. Here the cluster
centroids are mean vectors for each cluster; they can be used to characterize the clusters. For
example, the centroid of cluster 1 (class Iris-versicolor) has a mean sepal length of 5.4706,
sepal width of 2.4765, petal width of 1.1294, and petal length of 3.7941.

Step 7: Another way of understanding the characteristics of each cluster is through
visualization. To do this, right-click the result set in the result list panel and select
'Visualize cluster assignments'.

The following screenshot shows the clustering rules that were generated when the simple k-means
algorithm is applied to the given dataset.
Interpretation of the above visualization

From the above visualization, we can understand the distribution of sepal length and petal
length in each cluster. For instance, each cluster here is dominated by petal length. By
changing the color dimension to other attributes, we can see their distribution within each of
the clusters.

Step 8: We can save the resulting dataset, which includes each instance along with its assigned
cluster. To do so, we click the Save button in the visualization window and save the result as
iris-k-means. The top portion of this file is shown in the following figure.
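
The same clustering run can be reproduced through WEKA's Java API. This is a minimal sketch,
assuming weka.jar on the classpath and iris.arff in the working directory; it mirrors steps 3
to 5 (six clusters, default seed of 10) and, like the 'ignore attributes' facility mentioned
for the Cluster tab, removes the class attribute before clustering:

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClusterDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");

            // Drop the class attribute (the last one in iris.arff) before clustering
            Remove remove = new Remove();
            remove.setAttributeIndices("last");
            remove.setInputFormat(data);
            Instances numeric = Filter.useFilter(data, remove);

            SimpleKMeans kMeans = new SimpleKMeans();
            kMeans.setNumClusters(6); // as entered in step 4
            kMeans.setSeed(10);       // default seed left unchanged
            kMeans.buildClusterer(numeric);

            // Print centroids and the per-cluster instance counts
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(kMeans);
            eval.evaluateClusterer(numeric);
            System.out.println(eval.clusterResultsToString());
        }
    }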
Experiment - 06
Aim: Implementation of Association Rule technique on ARFF files using WEKA.

This experiment illustrates some of the basic elements of association rule mining using WEKA.
The sample dataset used for this example is contactlenses.arff.

Step 1: Open the data file in the WEKA Explorer. It is presumed that the required data fields
have been discretized; in this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for the association rule
algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (e.g., support, confidence), we click on
the text box immediately to the right of the Choose button. A programmatic sketch of the same
run follows.
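
The sketch below drives the same Apriori run through WEKA's Java API, assuming weka.jar on the
classpath; the parameter values shown are illustrative defaults, not tuned settings:

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AssociateDemo {
        public static void main(String[] args) throws Exception {
            // Apriori needs nominal attributes; contactlenses.arff is all nominal
            Instances data = DataSource.read("contactlenses.arff");

            Apriori apriori = new Apriori();
            apriori.setNumRules(10);              // -N: number of rules to find
            apriori.setMinMetric(0.9);            // -C: minimum confidence
            apriori.setLowerBoundMinSupport(0.1); // -M: lower bound on support
            apriori.buildAssociations(data);

            System.out.println(apriori); // prints the discovered rules
        }
    }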

Dataset contactlenses.arff
The following screenshot shows the association rules that were generated when the Apriori
algorithm is applied to the given dataset.
Experiment - 07
AIM: To explore 2 graphs, view their ARFF files, and apply an algorithm on them.

Steps for exploring a graph:

1. Open WEKA.
2. In the Applications panel, click Explorer.
3. In the Preprocess tab, click 'Open file'.
4. Set the path… Desktop > select > weather.arff.
5. Click on 'Visualize All'.
Graphs of weather.arff

Graphs of weathernominal.arff
Steps for viewing an ARFF file's data:
1. Tools > ArffViewer
2. File > Open > data
3. weather.arff & weathernominal.arff

Weather.arff

weathernominal.arff

Steps for applying the algorithm to the ARFF file (a programmatic sketch follows the list):

1. Classify
2. Choose > OneR -B 6
3. Right-click on the result list
4. Click Visualize cost curve > yes
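
The same OneR run can be scripted through WEKA's Java API. This is a minimal sketch, assuming
weka.jar on the classpath; the -B 6 bucket size chosen in step 2 maps to setMinBucketSize, and
weather.arff is the file explored above:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OneRDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);

            OneR oneR = new OneR();
            oneR.setMinBucketSize(6); // the -B 6 option from step 2

            // Evaluate the one-rule classifier with 10-fold cross-validation
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(oneR, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }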
Graph of weather
Graph of weather nominal
Experiment – 08

AIM: To use the NumericTransform filter and the floor function to obtain the same precision
across all values.

Steps to be followed (a programmatic sketch follows the list):
1. Open segment-challenge.arff in the Explorer in WEKA.
2. Then select the Choose button after opening the file.
3. Perform the following: Choose -> filters -> unsupervised -> attribute -> NumericTransform.
4. Click the filter's text box and fill in the index of the column whose values are to be
rounded down.
5. Apply these changes to all the columns by selecting all of them and clicking Apply.
6. Click Edit to see the result in the viewer as shown.
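
The same filter can be applied through WEKA's Java API. This is a minimal sketch, assuming
weka.jar on the classpath; it applies java.lang.Math.floor to the numeric columns selected by
the first-last index range:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NumericTransform;

    public class FloorFilterDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("segment-challenge.arff");

            NumericTransform transform = new NumericTransform();
            transform.setClassName("java.lang.Math");    // class providing the method
            transform.setMethodName("floor");            // round every value down
            transform.setAttributeIndices("first-last"); // apply across all columns
            transform.setInputFormat(data);

            Instances rounded = Filter.useFilter(data, transform);
            System.out.println(rounded.firstInstance()); // inspect the result
        }
    }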

CONCLUSION:
By using the NumericTransform filter with the floor method, the required values have been
obtained.
