Assignment 4: R Program 1
Data analysis case study using R on a readily available dataset, using any one machine learning
algorithm
1. Supervised Learning
How it works: This algorithm consists of a target/outcome variable (or dependent variable) which is to
be predicted from a given set of predictors (independent variables). Using this set of variables,
we generate a function that maps inputs to desired outputs. The training process continues until the
model achieves a desired level of accuracy on the training data. Examples of Supervised Learning:
Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.
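For instance, a minimal sketch in R (using the built-in cars dataset purely for illustration, not
part of this assignment): a linear regression learns a function mapping the predictor speed to the
target dist, and can then predict outputs for new inputs.
# supervised learning sketch: learn a mapping from speed (predictor)
# to stopping distance (target) using the built-in cars dataset
> fit <- lm(dist ~ speed, data=cars)
# predict the target for new, unseen inputs
> predict(fit, data.frame(speed=c(10, 20)))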
2. Unsupervised Learning
How it works: In this algorithm, we do not have any target or outcome variable to predict/estimate. It
is used for clustering a population into different groups, which is widely used for segmenting customers
into different groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm,
K-means.
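For instance, a minimal sketch in R: k-means groups the iris measurements into three clusters
without ever seeing the Species labels.
# unsupervised learning sketch: cluster the iris measurements into
# 3 groups without using the Species column
> clusters <- kmeans(iris[, 1:4], centers=3, nstart=20)
# number of observations assigned to each cluster
> table(clusters$cluster)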
3. Reinforcement Learning:
How it works: Using this algorithm, the machine is trained to make specific decisions. It works this way:
the machine is exposed to an environment where it trains itself continually using trial and error. The
machine learns from past experience and tries to capture the best possible knowledge to make accurate
business decisions. Example of Reinforcement Learning: Markov Decision Process
Here is the list of commonly used machine learning algorithms. These algorithms can be applied to
almost any data problem:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naive Bayes
6. KNN
7. K-Means
8. Random Forest
A machine learning project can be broken down into five steps:
1. Define Problem.
2. Prepare Data.
3. Evaluate Algorithms.
4. Improve Results.
5. Present Results.
The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris
dataset).
This is a good project because it is so well understood.
Attributes are numeric, so you have to figure out how to load and handle data.
It is a classification problem, allowing you to practice with perhaps an easier type of
supervised learning algorithm.
It is a multi-class classification problem (multinomial) that may require some specialized
handling.
It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a
screen or A4 page).
All of the numeric attributes are in the same units and the same scale, not requiring any
special scaling or transforms to get started.
Let’s get started with your hello world machine learning project in R.
The caret package provides a consistent interface into hundreds of machine learning algorithms and
provides useful convenience methods for data visualization, data resampling, model tuning and model
comparison, among other features. It is a must-have tool for machine learning projects in R.
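You can install and load caret as follows (installing with dependencies also pulls in the many
model packages that caret wraps):
# install caret and the packages it depends on (only needed once)
> install.packages("caret", dependencies=c("Depends", "Suggests"))
# load the package
> library(caret)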
2. Load The Data
We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello
world” dataset in machine learning and statistics by pretty much everyone.
The dataset contains 150 observations of iris flowers. There are four columns of measurements of the
flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers
belong to one of three species.
You can learn more about this dataset on Wikipedia.
Here is what we are going to do in this step:
1. Load the iris data the easy way.
2. Load the iris data from CSV (optional, for purists).
3. Separate the data into a training dataset and a validation dataset.
Choose your preferred way to load data or try both methods.
2.1 Load Data The Easy Way
Fortunately, the R platform provides the iris dataset for us. Load the dataset as follows:
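# attach the iris dataset to the environment
> data(iris)
# rename the dataset
> dataset <- iris
(Note: the built-in copy uses species labels such as "setosa"; the outputs shown later in this
document, such as "Iris-setosa", come from the optional CSV version of the data, which uses
slightly different labels.)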
You now have the iris data loaded in R and accessible via the dataset variable.
I like to name the loaded data “dataset”. This is helpful if you want to copy-paste code between
projects and the dataset always has the same name.
Later, we will use statistical methods to estimate the accuracy of the models that we create on
unseen data. We also want a more concrete estimate of the accuracy of the best model on
unseen data by evaluating it on actual unseen data.
That is, we are going to hold back some data that the algorithms will not get to see and we will
use this data to get a second and independent idea of how accurate the best model might
actually be.
We will split the loaded dataset into two: 80% of it we will use to train our models and 20%
we will hold back as a validation dataset. We can do this with caret's createDataPartition function:
# create a list of 80% of the rows in the original dataset we can use for training
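> validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# select 20% of the data for validation
> validation <- dataset[-validation_index,]
# use the remaining 80% of data to train and test the models
> dataset <- dataset[validation_index,]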
3. Summarize Dataset
In this step we are going to take a look at the data a few different ways: its dimensions, the
types of the attributes, a peek at the data itself, the levels of the class attribute, the class
distribution, and a statistical summary of each attribute.
Don’t worry, each look at the data is one command. These are useful commands that you can
use again and again on future projects.
3.1 Dimensions of Dataset
We can get a quick idea of how many instances (rows) and how many attributes (columns) the
data contains with the dim function.
# dimensions of dataset
> dim(dataset)
[1] 120 5
3.2 Types of Attributes
It is a good idea to know the types of the attributes. They could be doubles, integers,
strings, factors and other types.
Knowing the types is important as it will give you an idea of how to better summarize the data
you have and the types of transforms you might need to use to prepare the data before you
model it.
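We can list the type of each attribute with sapply:
# list the data type of each attribute
> sapply(dataset, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
3.3 Peek at the Data
It is also always a good idea to actually eyeball your data: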
> head(dataset)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
9 4.4 2.9 1.4 0.2 Iris-setosa
10 4.9 3.1 1.5 0.1 Iris-setosa
3.4 Levels of the Class
The class variable is a factor. A factor is a data type that holds multiple class labels, or levels.
Let’s look at the levels:
> levels(dataset$Species)
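[1] "Iris-setosa"     "Iris-versicolor" "Iris-virginica"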
This is a multi-class or a multinomial classification problem. If there were two levels, it would
be a binary classification problem.
3.5 Class Distribution
Let’s now take a look at the number of instances (rows) that belong to each class. We can
view this as an absolute count and as a percentage.
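A quick way to get both, following the referenced tutorial:
# summarize the class distribution as a count and a percentage
> percentage <- prop.table(table(dataset$Species)) * 100
> cbind(freq=table(dataset$Species), percentage=percentage)
                freq percentage
Iris-setosa       40   33.33333
Iris-versicolor   40   33.33333
Iris-virginica    40   33.33333
3.6 Statistical Summary
Finally, we can take a look at a summary of each attribute with the summary function: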
> summary(dataset)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.400 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.854 Mean :3.047 Mean :3.768 Mean :1.192
3rd Qu.:6.425 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.200 Max. :6.900 Max. :2.500
Species
Iris-setosa :40
Iris-versicolor:40
Iris-virginica :40
We can see that all of the numerical values have the same scale (centimeters) and similar
ranges [0,8] centimeters.
4. Visualize Dataset
We now have a basic idea about the data. We need to extend that with some visualizations.
4.1 Univariate Plots
We start with some univariate plots, that is, plots of each individual variable.
It is helpful with visualization to have a way to refer to just the input attributes and just the
output attribute. Let’s set that up, calling the input attributes x and the output attribute (or
class) y.
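# split input attributes (x) and output attribute (y)
> x <- dataset[,1:4]
> y <- dataset[,5]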
# boxplot for each input attribute on one image
> par(mfrow=c(1,4))
> for(i in 1:4) {
+ boxplot(x[,i], main=names(dataset)[i])
+ }
This gives us a much clearer idea of the distribution of the input attributes.
We can also create a barplot of the class variable to get a graphical representation of the
class distribution:
# barplot for class breakdown
> plot(y)
4.2 Multivariate Plots
First let’s look at scatterplots of all pairs of attributes and color the points by class. In
addition, because the scatterplots show that points for each class are generally separate,
we can draw ellipses around them.
# scatterplot matrix
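# featurePlot is provided by caret; plot="ellipse" additionally requires
# the ellipse package to be installed
> featurePlot(x=x, y=y, plot="ellipse")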
We can also look at box and whisker plots of each input variable again, but this time broken
down into separate plots for each class. This can help to tease out obvious linear
separations between the classes.
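Using caret's featurePlot again:
# box and whisker plots for each attribute, broken down by class
> featurePlot(x=x, y=y, plot="box")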
Next we can get an idea of the distribution of each attribute, again like the box and whisker
plots, broken down by class value. Sometimes histograms are good for this, but in this case
we will use some probability density plots to give nice smooth lines for each distribution.
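Again with featurePlot, using free scales so each attribute gets its own axis range:
# density plots for each attribute by class value
> scales <- list(x=list(relation="free"), y=list(relation="free"))
> featurePlot(x=x, y=y, plot="density", scales=scales)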
This confirms what we learned in the last section, that the instances are evenly distributed
across the three classes.
Density plot:
5. Evaluate Some Algorithms
Now it is time to create some models of the data and estimate their accuracy on unseen
data.
Here is what we are going to cover in this step:
1. Set up the test harness to use 10-fold cross validation.
2. Build 5 different models to predict species from flower measurements.
3. Select the best model.
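First we set up the test harness. A minimal setup consistent with the resampling summaries
shown later (10-fold cross validation, accuracy as the metric) looks like this; to also repeat
the whole process 3 times as described below, use method="repeatedcv" with repeats=3 instead:
# run algorithms using 10-fold cross validation
> control <- trainControl(method="cv", number=10)
> metric <- "Accuracy"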
This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all
combinations of train-test splits. We will also repeat the process 3 times for each algorithm
with different splits of the data into 10 groups, in an effort to get a more accurate estimate.
We are using the metric of “Accuracy” to evaluate models. This is the ratio of the number of
correctly predicted instances divided by the total number of instances in the dataset,
multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using
the metric variable when we build and evaluate each model next.
We don’t know which algorithms would be good on this problem or what configurations to
use. We get an idea from the plots that some of the classes are partially linearly separable
in some dimensions, so we are expecting generally good results.
Let’s evaluate 5 different algorithms: Linear Discriminant Analysis (LDA), Classification and
Regression Trees (CART), k-Nearest Neighbors (kNN), Support Vector Machines (SVM) with
a radial kernel, and Random Forest (RF).
This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex
nonlinear methods (SVM, RF). We reset the random number seed before each run to
ensure that the evaluation of each algorithm is performed using exactly the same data
splits. It ensures the results are directly comparable.
# a) linear algorithms
# LDA
> set.seed(7)
> fit.lda <- train(Species~., data=dataset, method="lda", metric=metric,
trControl=control)
# b) nonlinear algorithms
# CART
> set.seed(7)
> fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric,
trControl=control)
# kNN
> set.seed(7)
> fit.knn <- train(Species~., data=dataset, method="knn", metric=metric,
trControl=control)
# c) advanced algorithms
# SVM
> set.seed(7)
> fit.svm <- train(Species~., data=dataset, method="svmRadial",
metric=metric, trControl=control)
#Random Forest
> set.seed(7)
> fit.rf <- train(Species~., data=dataset, method="rf", metric=metric,
trControl=control)
We now have 5 models and accuracy estimations for each. We need to compare the
models to each other and select the most accurate.
We can report on the accuracy of each model by first creating a list of the created models
and using the summary function:
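# summarize accuracy of models
> results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn,
svm=fit.svm, rf=fit.rf))
> summary(results)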
Call:
summary.resamples(object = results)
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lda 0.8333333 1.0000000 1.0000000 0.9750000 1.0000000 1 0
cart 0.8333333 0.8541667 0.9166667 0.9166667 0.9791667 1 0
knn 0.8333333 0.9166667 1.0000000 0.9583333 1.0000000 1 0
svm 0.7500000 0.9166667 0.9583333 0.9333333 1.0000000 1 0
rf 0.8333333 0.9166667 0.9583333 0.9416667 1.0000000 1 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lda 0.750 1.00000 1.0000 0.9625 1.00000 1 0
cart 0.750 0.78125 0.8750 0.8750 0.96875 1 0
knn 0.750 0.87500 1.0000 0.9375 1.00000 1 0
svm 0.625 0.87500 0.9375 0.9000 1.00000 1 0
rf 0.750 0.87500 0.9375 0.9125 1.00000 1 0
We can see the accuracy of each classifier and also other metrics like Kappa in the output above.
We can also create a plot of the model evaluation results and compare the spread and the
mean accuracy of each model. There is a population of accuracy measures for each
algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
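A dot plot of the resampling results does this:
# compare accuracy of models
> dotplot(results)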
We can see that the most accurate model in this case was LDA:
> print(fit.lda)
Linear Discriminant Analysis
120 samples
4 predictor
3 classes: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
Resampling results:
Accuracy Kappa
0.975 0.9625
This gives a nice summary of what was used to train the model and the mean and standard
deviation (SD) accuracy achieved, specifically 97.5% accuracy +/- 4%.
6. Make Predictions
The LDA was the most accurate model. Now we want to get an idea of the accuracy of the
model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable
to keep a validation set just in case you made a slip during training, such as overfitting to the
training set or a data leak. Both will result in an overly optimistic result.
We can run the LDA model directly on the validation set and summarize the results in a
confusion matrix.
# estimate skill of LDA on the validation dataset
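> predictions <- predict(fit.lda, validation)
> confusionMatrix(predictions, validation$Species)
Confusion Matrix and Statistics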
Reference
Prediction Iris-setosa Iris-versicolor Iris-virginica
Iris-setosa 10 0 0
Iris-versicolor 0 10 0
Iris-virginica 0 0 10
Overall Statistics
Accuracy : 1
95% CI : (0.8843, 1)
No Information Rate : 0.3333
P-Value [Acc > NIR] : 4.857e-15
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
References:
1. https://machinelearningmastery.com/machine-learning-in-r-step-by-step/
2. https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/
**************************************THE END************************************************