DM Lab

1. The document discusses using the R programming language and RStudio IDE to perform statistical analysis on a college basketball dataset. It covers downloading and installing R and RStudio, importing and exploring the dataset, and conducting both univariate and bivariate statistical analyses such as summary statistics, histograms, scatter plots, and t-tests. 2. The document then shifts to discussing the Weka data mining tool. It covers downloading and using the Weka GUI to preprocess data files, classify models, identify clusters, find associations between attributes, and visualize data. The Explorer interface is the main tool used for experimentation and modeling.

Uploaded by

kavitha Mookkan
1. Statistical Analysis with R language


Aim: This experiment illustrates basic statistical analysis with the R language.

• The R language needs to be installed on the system.
o R can be installed on Windows, Linux, and macOS.
o The installer for R can be downloaded from https://cran.r-project.org/.
• Next, an IDE such as RStudio needs to be installed on the system.
o RStudio provides GUI support along with some enterprise-ready features like syntax highlighting, debugging, package management, and workspace management.
o RStudio can be downloaded and installed from https://www.rstudio.com/.
o Once RStudio is installed, it can be used directly to develop R scripts, which run on the installed version of the R language.
• Once the environment is ready, the next step is to import the data set into the R workspace.
o For example, we will import a .csv file into RStudio for statistical analysis.
o We will download an open-source data set from https://www.kaggle.com/ for this demonstration.
o The data file we will use is ‘cbb.csv’, which is a college basketball dataset.
The practical approach of statistical analysis with R

• This section works hands-on with the college basketball dataset in RStudio.
o The first step is to set the working directory, which will be used as the default location to read and write datasets.
o setwd() is used in R to set the working directory.
o getwd() is used to check the current working directory.
o Following is a screenshot of RStudio with the setwd() and getwd() functions.
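In place of the screenshot, the two calls can be sketched as follows. This is only an illustration: the folder used here is R's temporary directory, a stand-in for whatever folder actually holds cbb.csv on your machine.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# Hypothetical project folder: tempdir() is only a stand-in for the folder
# that actually contains cbb.csv.
work_dir <- tempdir()

setwd(work_dir)   # set the working directory
print(getwd())    # confirm that the change took effect

setwd(old_dir)    # restore the original directory
```

Restoring the previous directory at the end is optional; it simply keeps the sketch from changing the state of an existing session.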

• Next, we will import the data set using the read.csv() command and assign it to a data frame called sampleData, using the following syntax:

sampleData = read.csv("cbb.csv")

• To check that the dataset was imported correctly and to review the first few lines of data, use the head() command in R.
• Next, we will use the summary() command to do basic statistical analysis. It shows the minimum, maximum, mean, median, and first and third quartiles (from which the interquartile range can be derived) for each quantitative variable in the data set.
• The summary of the basketball data set shows that the variable G has a minimum value of 24.00, a maximum value of 40.00, a median of 31.00, and a mean of 31.52.

summary(sampleData)
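The import, head() and summary() steps can be tried end to end without the Kaggle file. The sketch below writes a tiny made-up table to a temporary CSV first; the rows, the TEAM column and the file name cbb_toy.csv are assumptions for illustration and are not taken from the real cbb.csv.

```r
# Made-up stand-in rows; the real cbb.csv has many more rows and columns.
toy <- data.frame(
  TEAM = c("A", "B", "C", "D"),
  G    = c(24, 31, 33, 40),
  W    = c(10, 20, 20, 33)
)

# Write the toy data to a temporary CSV so read.csv() has something to load.
csv_path <- file.path(tempdir(), "cbb_toy.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Import, inspect the first rows, and summarise one numeric variable.
sampleData <- read.csv(csv_path)
print(head(sampleData, 3))

g_summary <- summary(sampleData$G)
print(g_summary)
```

With the real file, only the read.csv() call changes; head() and summary() are used in the same way.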

• Next, we will discuss univariate data analysis.
o R data frames are an efficient way to store and reference data.
o A particular variable can be accessed from the data frame using the $ symbol.
o For example, to view the statistical summary of the W variable, we will use summary(sampleData$W).
• The data can be plotted as a histogram using the hist.default() command to view the overall data distribution:

hist.default(sampleData$W,col='gray')

• We can use the table() function to create a frequency table, which shows how often each value occurs in the variable, using table(sampleData$W).
• The frequency table shows that the value 20 has the highest frequency in the data. This function is very useful when analyzing categorical variables.
• Also, we can plot this frequency table using the plot() function in R.
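A compact sketch of the univariate steps, using made-up W values rather than the real dataset. The call plot(table(...)) is an assumption about how the frequency table was plotted in the original lab, and the plots are written to a temporary PDF only so that the script also runs outside RStudio.

```r
# Made-up win counts standing in for sampleData$W.
sampleData <- data.frame(W = c(18, 20, 20, 22, 20, 25, 18))

# Frequency table and the most frequent value.
w_freq <- table(sampleData$W)
print(w_freq)
most_common <- names(w_freq)[which.max(w_freq)]
print(most_common)

# Draw the histogram and the frequency plot into a temporary PDF file.
pdf_path <- file.path(tempdir(), "univariate_plots.pdf")
pdf(pdf_path)
hist(sampleData$W, col = "gray", main = "Histogram of W", xlab = "W")
plot(w_freq, main = "Frequency of W", xlab = "W", ylab = "Count")
dev.off()
```

In RStudio the pdf() and dev.off() lines can be dropped so that the plots appear in the Plots pane instead.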
• Next, we will discuss bivariate statistical analysis with R.
• This statistical analysis is a comparison between two variables present in the data set.
• It helps to identify the correlation and patterns between the two variables.
• The symbol ‘~’ is used in R to write a formula relating two variables, as in y ~ x.
• In this example, we create a scatter plot for the G and W variables using

plot(sampleData$G~sampleData$W,col='blue')

• This scatter plot represents the graph for the bivariate analysis.
o Apart from the scatter plot, several other plots, such as histograms, line plots, and boxplots, are used for bivariate data analysis.
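The scatter plot can be complemented by a numeric measure of association. The sketch below uses made-up G and W values (not the real cbb.csv) and adds cor(), which the original text does not call, to put a number on the pattern the plot shows.

```r
# Made-up games played (G) and wins (W); not taken from the real dataset.
sampleData <- data.frame(
  G = c(28, 30, 31, 33, 35, 40),
  W = c(10, 14, 15, 20, 24, 33)
)

# Scatter plot via the formula interface: G on the y-axis, W on the x-axis.
pdf_path <- file.path(tempdir(), "scatter_g_w.pdf")
pdf(pdf_path)
plot(G ~ W, data = sampleData, col = "blue")
dev.off()

# Pearson correlation between the two variables.
r <- cor(sampleData$G, sampleData$W)
print(r)
```

Writing plot(G ~ W, data = sampleData) is equivalent to the sampleData$G ~ sampleData$W form used above, and it gives cleaner axis labels.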
• Next, we will discuss the t-test, which is a statistical hypothesis testing process, using R.
o The t.test() function is used in R to perform the t-test.
o We will use the G variable of the data frame sampleData for the t-test.
o t.test(sampleData$G) is the syntax we will apply on the RStudio console.
o The t-test output shows the statistical inferences and the confidence interval.
o The p-value is the probability of observing a result at least as extreme as the sample if the null hypothesis is true. The percentage value is the confidence level of the reported interval.
• In this t-test, the p-value is <2.2e-16 and the confidence level is 95%. It also shows the mean value of 31.52205.
• Because the p-value is far below 0.05, the null hypothesis (by default in a one-sample t.test(), that the true mean is 0) is rejected in favor of the alternative hypothesis.
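The pieces of the t-test output can also be read programmatically. The sketch below uses made-up G values, so its numbers will not match the ones reported above; the second call, with mu = 31, is a hypothetical addition showing how a non-zero null hypothesis would be tested.

```r
# Made-up games-played values standing in for sampleData$G.
g <- c(24, 28, 30, 31, 31, 32, 33, 35, 36, 40)

# One-sample t-test against the default null hypothesis mu = 0.
res <- t.test(g)
print(res$p.value)    # p-value
print(res$conf.int)   # confidence interval (95% level by default)
print(res$estimate)   # sample mean

# Hypothetical variation: test whether the true mean differs from 31.
res_mu <- t.test(g, mu = 31)
print(res_mu$p.value)
```

Testing against mu = 0 almost always rejects the null for a variable such as games played, which is why the p-value is so small; a null value close to the expected mean is usually the more informative test.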

2. Study of WEKA Tool

Aim: This experiment illustrates the study of the Weka tool.

Introduction

Weka (pronounced to rhyme with Mecca) is a workbench that contains a collection of
visualization tools and algorithms for data analysis and predictive modeling, together
with graphical user interfaces for easy access to these functions. The original non-Java
version of Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms
implemented in other programming languages, plus data preprocessing utilities in C, and
a Makefile-based system for running machine learning experiments. This original version
was primarily designed as a tool for analyzing data from agricultural domains, but the
more recent fully Java-based version (Weka 3), for which development started in 1997, is
now used in many different application areas, in particular for educational purposes and
research. Advantages of Weka include:

• Free availability under the GNU General Public License
• Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform
• A comprehensive collection of data preprocessing and modeling techniques
• Ease of use due to its graphical user interfaces

Description:
Open the program. Once the program has been installed on the user's machine, it is opened
by navigating to the program's start option, which depends on the user's operating
system. Figure 1.1 is an example of the initial opening screen on a computer.
There are four options available on this initial screen:

Fig: 1.1 Weka GUI

1. Explorer - the graphical interface used to conduct experimentation on raw data. After
clicking the Explorer button, the Weka Explorer interface appears.

Fig: 1.2 Pre-processor


Inside the Weka Explorer window there are six tabs:
1. Preprocess - used to choose the data file to be used by the application.
Open File - allows the user to select files residing on the local machine or on recorded media.
Open URL - provides a mechanism to locate a file or data source at a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.

2. Classify - used to train and test different learning schemes on the preprocessed data file
under experimentation.
Fig: 1.3 choosing Zero set from classify

Again, there are several options to be selected inside the Classify tab. Test options give
the user the choice of four different test modes on the data set:

1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split
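The last two test modes differ in how they partition the data before training. The following R sketch illustrates the idea only; it is not Weka's own code, and the 66% split, the 10 folds and the use of R's built-in iris data frame are illustrative assumptions.

```r
set.seed(1)          # make the random partitions repeatable
n <- nrow(iris)      # 150 rows in R's built-in iris data

# Percentage split: a random 66% of rows for training, the rest for testing.
train_idx <- sample(n, size = round(0.66 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# 10-fold cross-validation: every row gets a fold label from 1 to 10;
# each fold serves once as the test set while the other nine train the model.
folds <- sample(rep(1:10, length.out = n))
fold_sizes <- table(folds)
print(fold_sizes)
```

Cross-validation uses every instance for testing exactly once, which is why it is preferred when no separate test file is available.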

3. Cluster - used to apply different tools that identify clusters within the data file.
The Cluster tab opens the process that is used to identify commonalities or clusters of
occurrences within the data set and produce information for the user to analyze.
4. Associate - used to apply different rules to the data file that identify associations
within the data. The Associate tab opens a window to select the options for associations
within the dataset.
5. Select attributes - used to apply different rules to reveal changes based on the
inclusion or exclusion of selected attributes from the experiment.

6. Visualize - used to see what the various manipulations produced on the data set in a 2D
format, as scatter plot and bar graph output.

2. Experimenter - this option allows users to conduct different experimental variations
on data sets and perform statistical manipulation. The Weka Experiment Environment
enables the user to create, run, modify, and analyze experiments in a more convenient
manner than is possible when processing the schemes individually. For example, the user
can create an experiment that runs several schemes against a series of datasets and then
analyze the results to determine if one of the schemes is (statistically) better than the other
schemes.

Fig: 1.6 Weka experiment

Results destination: ARFF file, CSV file, JDBC database.


Experiment type: Cross-validation (default), Train/Test Percentage Split (data randomized).
Iteration control: Number of repetitions, Data sets first/Algorithms first.
Algorithms: filters

3. Knowledge Flow - basically the same functionality as Explorer, with drag-and-drop
functionality. The advantage of this option is that it supports incremental learning from
previous results.
4. Simple CLI - provides users who do not have a graphical interface option with the ability to
execute commands from a terminal window.
Explore the default datasets in the Weka tool.

Click the “Open file…” button to open a data set and double-click on the “data”
directory. Weka provides a number of small common machine learning datasets that
you can use to practice on. Select the “iris.arff” file to load the Iris dataset.

Fig: 1.7 Different Data Sets in weka
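Since the first experiment already uses R, the Iris data can be inspected there as well: R ships it as the built-in data frame iris. The sketch assumes this built-in copy corresponds to the 150-instance iris.arff bundled with Weka.

```r
# Load R's built-in copy of the Iris data (no file download needed).
data(iris)

# Show the structure: four numeric measurements plus the Species factor.
str(iris)

# Count the instances per class.
species_counts <- table(iris$Species)
print(species_counts)
```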


3. Demonstration of Association rule process on dataset
contactlenses.arff using apriori algorithm
Aim: This experiment illustrates some of the basic elements of association rule mining
using WEKA. The sample dataset used for this example is contactlenses.arff.

Step1: Open the data file in Weka Explorer. It is presumed that the required data fields have
been discretized. In this example it is the age attribute.

Step2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.

Step3: We will use the Apriori algorithm. This is the default algorithm.

Step4: In order to change the parameters for the run (for example support, confidence, etc.) we
click on the text box immediately to the right of the Choose button.
Dataset contactlenses.arff
The following screenshot shows the association rules that were generated when the Apriori
algorithm is applied on the given dataset.
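Support and confidence, the two parameters mentioned in Step4, can be computed by hand for a single rule. The R sketch below uses a small made-up table in the style of the contact lenses data; the column names, the rows and the rule are assumptions for illustration, not the contents of contactlenses.arff and not Weka's output.

```r
# Made-up nominal records (hypothetical columns and values).
records <- data.frame(
  tear_rate      = c("reduced", "normal", "reduced", "normal", "reduced", "normal"),
  astigmatism    = c("no", "no", "yes", "yes", "no", "yes"),
  contact_lenses = c("none", "soft", "none", "hard", "none", "hard"),
  stringsAsFactors = FALSE
)

# Candidate rule: tear_rate = reduced  ->  contact_lenses = none
antecedent <- records$tear_rate == "reduced"
consequent <- records$contact_lenses == "none"

# Support: share of all records that satisfy both sides of the rule.
support <- mean(antecedent & consequent)

# Confidence: share of records matching the antecedent that also match
# the consequent.
confidence <- sum(antecedent & consequent) / sum(antecedent)

print(c(support = support, confidence = confidence))
```

Apriori keeps only the rules whose support and confidence reach the chosen minimum values, which is why changing those parameters changes the list of generated rules.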
4. Demonstration of classification rule process on dataset student.arff
using J48 algorithm
Aim: This experiment illustrates the use of the J48 classifier in Weka. The sample data set used
in this experiment is the “student” data, available in ARFF format. This document assumes that
appropriate data preprocessing has been performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (student.arff) into Weka.

Step 2: Next we select the “Classify” tab and click the “Choose” button to select the
“J48” classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking in the text
box to the right of the Choose button. In this example, we accept the default values. The
default version does perform some pruning but does not perform reduced-error pruning.

Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as
our evaluation approach. Since we don’t have a separate evaluation data set, this is necessary to
get a reasonable idea of the accuracy of the generated model.

Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the
evaluation statistics will appear in the right panel when the model construction is complete.

Step 6: Note that the classification accuracy of the model is about 69%. This indicates that
more work may be needed (either in preprocessing or in selecting the parameters for the
classification).

Step 7: Weka also lets us view a graphical version of the classification tree. This can
be done by right-clicking the last result set and selecting “Visualize tree” from the pop-up
menu.

Step 8: We will use our model to classify new instances.

Step 9: In the main panel, under “Test options”, click the “Supplied test set” radio button and
then click the “Set” button. This will pop up a window which will allow you to open the file
containing the test instances.
Dataset student.arff
@relation student
@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%
The following screenshot shows the classification rules that were generated when the J48
algorithm is applied on the given dataset.
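J48 is Weka's implementation of the C4.5 decision tree learner, which picks split attributes using entropy-based measures. As a rough illustration of that idea (not Weka's code, and using plain information gain rather than the gain ratio C4.5 applies), the R sketch below computes the information gain of the age attribute from the 14 rows of student.arff. The helper names entropy() and info_gain() are hypothetical.

```r
# The age and buyspc columns, copied row by row from student.arff.
age <- c("<30", "<30", "30-40", ">40", ">40", ">40", "30-40",
         "<30", "<30", ">40", "<30", "30-40", "30-40", ">40")
buyspc <- c("no", "no", "yes", "yes", "yes", "no", "yes",
            "no", "no", "yes", "yes", "yes", "yes", "no")

# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                 # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain: class entropy minus the weighted entropy that
# remains after splitting the rows by the attribute's values.
info_gain <- function(attribute, labels) {
  weights <- table(attribute) / length(attribute)
  remaining <- sapply(split(labels, attribute), entropy)
  entropy(labels) - sum(weights[names(remaining)] * remaining)
}

gain_age <- info_gain(age, buyspc)
print(gain_age)
```

The 30-40 group contains only "yes" rows, so it contributes zero remaining entropy; that is what makes age an attractive first split on this data.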
5. Demonstration of clustering rule process on dataset iris.arff using simple
k-means
Aim: This experiment illustrates the use of simple k-means clustering with Weka Explorer.
The sample data set used for this example is based on the iris data available in ARFF format.
This document assumes that appropriate preprocessing has been performed. This iris dataset
includes 150 instances.

Step 1: Run the Weka Explorer and load the data file iris.arff in the Preprocess interface.

Step 2: In order to perform clustering, select the ‘Cluster’ tab in the Explorer and click on the
Choose button. This step results in a dropdown list of available clustering algorithms.

Step 3: In this case we select ‘SimpleKMeans’.

Step 4: Next, click in the text box to the right of the Choose button to get the popup window shown
in the screenshots. In this window we enter six as the number of clusters and leave the
seed value as it is. The seed value is used in generating a random number, which is
used for making the initial assignments of instances to clusters.

Step 5: Once the options have been specified, we run the clustering algorithm. We
must make sure that, in the ‘Cluster mode’ panel, the ‘Use training set’ option is
selected, and then we click the ‘Start’ button. This process and the resulting window are shown in the
following screenshots.

Step 6: The result window shows the centroid of each cluster as well as statistics on the
number and the percentage of instances assigned to the different clusters. Here, the cluster centroids are
the mean vectors of each cluster. These centroids can be used to characterize the clusters. For example,
the centroid of cluster 1 shows that, for the class iris.versicolor, the mean sepal length is
5.4706, sepal width 2.4765, petal width 1.1294, and petal length 3.7941.

Step 7: Another way of understanding the characteristics of each cluster is through visualization. We
can do this by right-clicking the result set in the result list panel and selecting ‘Visualize
cluster assignments’.
The following screenshot shows the clusters that were generated when the simple k-means
algorithm is applied on the given dataset.
Interpretation of the above visualization
From the above visualization, we can understand the distribution of sepal length and petal
length in each cluster. For instance, each cluster is dominated by petal length. In this case,
by changing the color dimension to other attributes, we can see their distribution within each
of the clusters.

Step 8: We can also save the resulting dataset, which includes each instance along with its
assigned cluster. To do so, we click the Save button in the visualization window and save the
result as iris k-mean. The top portion of this file is shown in the following figure.
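For comparison, a similar clustering can be sketched in R with the base kmeans() function on the built-in iris data frame. This is a different implementation from Weka's SimpleKMeans, so the centroids and cluster sizes are not expected to match the values quoted in Step 6; the seed and the nstart value are arbitrary choices.

```r
set.seed(10)   # arbitrary seed so the random starting centroids are repeatable

# Cluster on the four numeric measurements only; Species is left out.
features <- iris[, 1:4]
fit <- kmeans(features, centers = 6, nstart = 10)

# Centroids (mean vectors) and the number of instances per cluster.
print(fit$centers)
print(fit$size)

# Attach the assigned cluster to each instance, as in Step 8,
# and cross-tabulate clusters against the known species.
clustered <- cbind(iris, cluster = fit$cluster)
print(table(clustered$cluster, clustered$Species))
```

Setting centers = 3 instead of 6 gives one cluster per species to compare against, which makes the cross-tabulation easier to read.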
