DM Lab
This section is a hands-on exercise using RStudio with a college basketball dataset.
o The first step is to set the working directory, which is used as the default
location for reading and writing datasets.
o setwd() sets the working directory in R.
o getwd() returns the current working directory.
o Following is a screenshot of RStudio with the setwd() and getwd() functions.
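The two calls above can be tried as in the following minimal sketch. The use of tempdir() is only an assumption made so that the example runs anywhere; in practice you would pass the folder that holds your own data files.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# tempdir() is only a stand-in; replace it with your own data folder.
setwd(tempdir())

# Check the present working directory after the change.
print(getwd())

# Restore the original working directory.
setwd(old_dir)
```

Restoring the previous directory at the end keeps the sketch from affecting the rest of an R session.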
summary(sampleData)
hist(sampleData$W,col='gray')
We can use the table() function to create a frequency table, which shows how often
each value occurs in a variable, using table(sampleData$W)
The frequency table shows that the value 20 occurs most often in the data. This
function is very useful when analysing categorical variables.
We can also plot this frequency table using the plot function in R.
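The univariate steps above can be sketched as follows. The data frame here is a made-up stand-in for the college basketball data, not the real dataset, and it is only an assumption that G and W are numeric columns; passing the table to plot() is one possible way to draw the frequencies.

```r
# Hypothetical stand-in for the college basketball data (values are made up).
sampleData <- data.frame(
  G = c(30, 31, 29, 32, 30, 33, 31, 30),
  W = c(20, 22, 18, 20, 15, 25, 20, 18)
)

summary(sampleData)                 # minimum, quartiles, mean, maximum per column
hist(sampleData$W, col = "gray")    # histogram of W

winCounts <- table(sampleData$W)    # frequency of each distinct value of W
print(winCounts)

# Value(s) of W with the highest frequency
mostFrequent <- names(winCounts)[winCounts == max(winCounts)]
print(mostFrequent)

plot(winCounts)                     # one vertical line per distinct value
```

In this made-up sample the value 20 appears three times, so it is reported as the most frequent value.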
Next, we will discuss bivariate statistical analysis with R.
Bivariate analysis compares two variables present in the data set.
It helps to identify the correlation and patterns between the two variables.
The symbol ‘~’ is used for bivariate analysis in R. In a formula y~x, the variable on
the left is plotted on the y-axis and the variable on the right on the x-axis.
In this example, we create a scatter plot of the G and W variables
using
plot(sampleData$G~sampleData$W,col='blue')
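A small sketch of the bivariate step is shown below. The data frame is again a made-up stand-in for the real dataset, and the call to cor() is an addition that is not part of the original exercise; it is included only to put a number on the relationship seen in the plot.

```r
# Hypothetical stand-in for the college basketball data (values are made up).
sampleData <- data.frame(
  G = c(30, 31, 29, 32, 30, 33, 31, 30),
  W = c(20, 22, 18, 20, 15, 25, 20, 18)
)

# Formula interface: G goes on the y-axis, W on the x-axis.
plot(G ~ W, data = sampleData, col = "blue")

# Pearson correlation between the two variables (always between -1 and 1).
r <- cor(sampleData$G, sampleData$W)
print(r)
```

Writing the formula with the data argument is equivalent to plot(sampleData$G~sampleData$W) and gives shorter axis labels.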
Introduction
Description:
Open the program. Once the program has been installed on the user's machine, it is opened
by navigating to the programs start option, whose location depends on the user's operating
system. Figure 1.1 is an example of the initial opening screen on a computer.
There are four options available on this initial screen:
1. Explorer - the graphical interface used to conduct experimentation on raw data. After
clicking the Explorer button, the Weka Explorer interface appears.
Fig: 1.1 Weka GUI
2. Classify - the Explorer tab used to train and test different learning schemes on the
preprocessed data file under experimentation.
Fig: 1.3 Choosing ZeroR from Classify
Again, there are several options to be selected inside the Classify tab. The Test options
box gives the user a choice of four test modes for the data set: use training set,
supplied test set, cross-validation, and percentage split.
3. Cluster - used to apply different tools that identify clusters within the data file.
The Cluster tab opens the process that is used to identify commonalities or clusters of
occurrences within the data set and produce information for the user to analyze.
4. Associate - used to apply different rules to the data file that identify associations
within the data. The Associate tab opens a window to select the options for associations
within the dataset.
5. Select attributes - used to apply different rules to reveal changes based on the
inclusion or exclusion of selected attributes from the experiment.
6. Visualize - used to see what the various manipulations produced on the data set in a 2D
format, as scatter plot and bar graph output.
3. Knowledge Flow - basically the same functionality as Explorer, with drag-and-drop
operation. The advantage of this option is that it supports incremental learning from
previous results.
4. Simple CLI - gives users who do not have a graphical interface the ability to execute
commands from a terminal window.
Explore the default datasets in the Weka tool.
Click the “Open file…” button to open a data set and double click on the “data”
directory. Weka provides a number of small, common machine learning datasets that
you can use to practice on. Select the “iris.arff” file to load the Iris dataset.
Step1: Open the data file in Weka Explorer. It is presumed that the required data fields have
been discretized. In this example it is the age attribute.
Step2: Clicking on the Associate tab brings up the interface for the association rule algorithms.
Step4: In order to change the parameters for the run (for example support, confidence, etc.), we
click on the text box immediately to the right of the Choose button.
Dataset contactlenses.arff
The following screenshot shows the association rules that were generated when the Apriori
algorithm was applied to the given dataset.
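The support and confidence parameters mentioned in Step4 can be illustrated with a short base R sketch. This is not the Apriori algorithm itself and does not call Weka; it only shows how the two measures are computed for one rule. The transactions and item names are made up for illustration.

```r
# Hypothetical records: each row is one record, TRUE means the item is present.
transactions <- data.frame(
  bread  = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  butter = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  milk   = c(FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Support: fraction of records that contain every item in the set.
support <- function(items) {
  mean(rowSums(transactions[, items, drop = FALSE]) == length(items))
}

# Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs).
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

print(support(c("bread", "butter")))    # 3 of the 5 records contain both items
print(confidence("bread", "butter"))    # 3 of the 4 bread records contain butter
```

A rule is only reported by an association rule miner when both measures reach the minimum values set in the parameter dialog, so raising either threshold shortens the list of rules.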
4. Demonstration of classification rule process on dataset student.arff
using J48 algorithm
Aim: This experiment illustrates the use of the J48 classifier in Weka. The sample data set used
in this experiment is the “student” data, available in ARFF format. This document assumes that
appropriate data preprocessing has been performed.
Step2: Next we select the “Classify” tab and click the “Choose” button to select the
“J48” classifier.
Step3: Now we specify the various parameters. These can be specified by clicking in the text
box to the right of the Choose button. In this example, we accept the default values. The
default version does perform some pruning but does not perform reduced-error pruning.
Step4: Under “Test options” in the main panel, we select 10-fold cross-validation as
our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to
get a reasonable idea of the accuracy of the generated model.
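The idea behind 10-fold cross-validation can be sketched in base R as follows. This is not Weka's implementation and it does not use J48: a simple nearest-centroid classifier is swapped in so that the example needs no extra packages, and R's built-in iris data is used because the contents of student.arff are not available here. The function name predictNearestCentroid and the seed are assumptions made for the illustration.

```r
set.seed(1)      # arbitrary seed so the fold assignment is repeatable
data(iris)
k <- 10

# Randomly assign each of the 150 rows to one of 10 equally sized folds.
fold <- sample(rep(1:k, length.out = nrow(iris)))

# Stand-in classifier (not J48): predict the class whose mean vector is closest.
predictNearestCentroid <- function(train, test) {
  centroids <- aggregate(. ~ Species, data = train, FUN = mean)
  feats <- as.matrix(test[, 1:4])
  cents <- as.matrix(centroids[, -1])
  # Squared Euclidean distance from every test row to every class centroid.
  d <- sapply(seq_len(nrow(cents)), function(j) {
    rowSums(sweep(feats, 2, cents[j, ])^2)
  })
  as.character(centroids$Species[apply(d, 1, which.min)])
}

# Each fold is held out once while the model is built on the other nine.
correct <- 0
for (i in 1:k) {
  train <- iris[fold != i, ]
  test  <- iris[fold == i, ]
  pred  <- predictNearestCentroid(train, test)
  correct <- correct + sum(pred == as.character(test$Species))
}

accuracy <- correct / nrow(iris)
print(accuracy)
```

Every instance is used for testing exactly once, which is why the resulting accuracy is a fairer estimate than one measured on the training data.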
Step-5: We now click “Start” to generate the model. The ASCII version of the tree, as well as the
evaluation statistics, will appear in the right panel when the model construction is complete.
Step-6: Note that the classification accuracy of the model is about 69%. This indicates that more
work may be needed (either in preprocessing or in selecting the parameters for the
classification).
Step-7: Weka also lets us view a graphical version of the classification tree. This can
be done by right-clicking the last result set and selecting “Visualize tree” from the pop-up
menu.
Step-9: In the main panel under “Test options”, click the “Supplied test set” radio button and
then click the “Set” button. This will pop up a window which allows you to open the file
containing the test instances.
Dataset student.arff
@relation student
@data
%
The following screenshot shows the classification rules that were generated when the J48
algorithm was applied to the given dataset.
5. Demonstration of clustering rule process on dataset iris.arff using simple
k-means
Aim: This experiment illustrates the use of simple k-means clustering with Weka Explorer.
The sample data set used for this example is based on the iris data available in ARFF format.
This document assumes that appropriate preprocessing has been performed. This iris dataset
includes 150 instances.
Step 1: Run the Weka Explorer and load the data file iris.arff in the Preprocess interface.
Step 2: In order to perform clustering, select the ‘Cluster’ tab in the Explorer and click on the
Choose button. This step results in a dropdown list of available clustering algorithms.
Step 4: Next, click in the text box to the right of the Choose button to get the popup window shown
in the screenshots. In this window we enter six as the number of clusters and leave the
value of the seed as it is. The seed value is used to generate a random number, which is
used for making the initial assignments of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. We
must make sure that, in the ‘Cluster mode’ panel, the ‘Use training set’ option is
selected, and then we click the ‘Start’ button. This process and the resulting window are shown in the
following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on the
number and the percentage of instances assigned to the different clusters. Here the cluster centroids are
the mean vectors of each cluster. These centroids can be used to characterize the clusters. For example,
the centroid of cluster 1 shows the class Iris-versicolor, with a mean sepal length of
5.4706, sepal width of 2.4765, petal width of 1.1294, and petal length of 3.7941.
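The same kind of output (centroids plus the number and percentage of instances per cluster) can be reproduced outside Weka with base R's kmeans() function, as in the rough sketch below. The built-in iris data frame is assumed to match iris.arff, the seed value 10 and nstart = 5 are arbitrary choices, and the class column is left out of the clustering. R's algorithm and initialisation differ from Weka's SimpleKMeans, so the centroids will not match the values quoted above.

```r
set.seed(10)                      # arbitrary seed; controls the random starting centroids
data(iris)
features <- iris[, 1:4]           # the four numeric attributes, class column excluded

# Six clusters, as in Step 4; nstart = 5 tries five random starts and keeps the best.
fit <- kmeans(features, centers = 6, nstart = 5)

# Centroids: the mean vector of each cluster.
print(fit$centers)

# Number and percentage of instances assigned to each cluster.
clusterStats <- data.frame(
  cluster   = seq_along(fit$size),
  instances = fit$size,
  percent   = round(100 * fit$size / sum(fit$size), 1)
)
print(clusterStats)
```

Changing the seed changes the starting centroids and can therefore change which instances end up in which cluster.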
Step 7: Another way of understanding the characteristics of each cluster is through visualization.
To do this, right-click the result set in the Result list panel and select ‘Visualize
cluster assignments’.
The following screenshot shows the clusters that were generated when the simple k-means
algorithm was applied to the given dataset.
Interpretation of the above visualization
From the above visualization, we can understand the distribution of sepal length and petal
length in each cluster. For instance, each cluster is dominated by a particular range of petal length. By
changing the color dimension to other attributes, we can see their distribution within each
of the clusters.
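A comparable view can be sketched in base R by colouring a scatter plot of sepal length against petal length by cluster number. As before, this assumes R's built-in iris data and kmeans() rather than Weka, with an arbitrary seed, so the picture will differ from the Weka screenshot. The cross-tabulation at the end is an addition that shows how the clusters line up with the species labels.

```r
set.seed(10)                                  # arbitrary seed
data(iris)
fit <- kmeans(iris[, 1:4], centers = 6, nstart = 5)

# Scatter plot of two attributes; the colour of each point is its cluster number.
plot(iris$Sepal.Length, iris$Petal.Length,
     col = fit$cluster, pch = 19,
     xlab = "Sepal length", ylab = "Petal length")

# Count how many instances of each species fall into each cluster.
crossTab <- table(cluster = fit$cluster, species = iris$Species)
print(crossTab)
```

Swapping the two plotted columns for other attributes plays the same role as changing the axes or the colour dimension in Weka's visualization window.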
Step 8: We can also save the resulting dataset, which includes each instance along with its
assigned cluster. To do so, we click the Save button in the visualization window and save the
result as iris k-mean. The top portion of this file is shown in the following figure.
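The equivalent step in base R is sketched below: the cluster number is added as an extra column and the table is written to disk. Weka saves the result as an ARFF file, whereas this sketch writes CSV; the file name iris_kmeans.csv, the column name Cluster, and the use of tempdir() are all assumptions made for the illustration.

```r
set.seed(10)                                  # arbitrary seed
data(iris)
fit <- kmeans(iris[, 1:4], centers = 6, nstart = 5)

# Append the assigned cluster to every instance.
clustered <- cbind(iris, Cluster = paste0("cluster", fit$cluster))

# Hypothetical output location; replace tempdir() with your working directory.
outFile <- file.path(tempdir(), "iris_kmeans.csv")
write.csv(clustered, outFile, row.names = FALSE)

# Read the file back and show its top portion.
saved <- read.csv(outFile)
print(head(saved))
```

Each row of the saved file holds the original four measurements, the species label, and the cluster the instance was assigned to.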