The validation set approach is a cross-validation technique in machine learning. Cross-validation techniques are used to judge the performance and accuracy of a machine learning model. In the validation set approach, the dataset used to build the model is randomly divided into 2 parts, namely the training set and the validation set (or testing set). The model is trained on the training set, and its accuracy is measured by predicting the target variable for the data points it did not see during training, i.e. the validation set. Splitting the data, training the model, and testing the model would be tedious to do by hand, but the R language provides numerous libraries and inbuilt functions that carry out all of these tasks easily and efficiently.
Steps Involved in the Validation Set Approach
- A random splitting of the dataset into a certain ratio (generally a 70-30 or 80-20 ratio is preferred)
- Training of the model on the training set
- Applying the resultant model to the validation set
- Calculating the model's accuracy from the prediction error, using model performance metrics (a compact sketch of the whole workflow follows this list)
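Putting these steps together, the whole workflow takes only a few lines of R. Below is a minimal sketch, assuming a generic data frame df with a numeric target column y (both hypothetical placeholder names); the rest of this article walks through each step in detail.
R
# minimal end-to-end sketch of the validation set approach;
# `df` and `y` are hypothetical placeholder names
library(caret)

set.seed(1)                                       # reproducible split
idx <- createDataPartition(df$y, p = 0.8, list = FALSE)
train <- df[idx, ]                                # 80% training set
validation <- df[-idx, ]                          # 20% validation set

fit <- lm(y ~ ., data = train)                    # train on the training set
preds <- predict(fit, validation)                 # apply to the validation set
RMSE(preds, validation$y)                         # model performance metric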
This article discusses the step-by-step method of implementing the validation set approach as a cross-validation technique for both classification and regression machine learning models.
For Classification Machine Learning Models
This type of machine learning model is used when the target variable is categorical, e.g. positive/negative or diabetic/non-diabetic. The model predicts the class label of the dependent variable. Here, the logistic regression algorithm will be applied to build the classification model.
Step 1: Loading the dataset and other required packages
Before doing any exploratory or manipulation task, one must load all the required libraries and packages, along with the dataset, in order to use the various inbuilt functions that make the whole process easier.
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross-validation methods
library(caret)
# package used to split the data
# used during classification into
# train and test subsets
library(caTools)
# loading package to
# import desired dataset
library(ISLR)
Step 2: Exploring the dataset
It is necessary to understand the structure and dimensions of the dataset, as this helps in building a correct model. Also, since this is a classification model, one must know the different categories present in the target variable.
R
# assigning the complete dataset
# Smarket to a variable
dataset <- Smarket[complete.cases(Smarket), ]
# display the dataset with details
# like column name and its data type
# along with values in each row
glimpse(dataset)
# checking values present
# in the Direction column
# of the dataset
table(dataset$Direction)
Output:
Rows: 1,250
Columns: 9
$ Year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1 <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2 <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3 <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4 <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5 <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...
> table(dataset$Direction)
Down Up
602 648
According to the above information, the imported dataset has 1,250 rows and 9 columns. The column data type <dbl> denotes a double-precision floating-point number (dbl is short for double). In classification models, the target variable must be of factor data type. Since the Direction column is already <fct>, there is no need to change anything.
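Had the target column been stored as a character or numeric vector instead, it would need converting first; a one-line sketch:
R
# hypothetical: convert the target to a factor if it
# were not one already (not needed for this dataset)
dataset$Direction <- as.factor(dataset$Direction)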
Moreover, the response (target) variable is a binary categorical variable, as the column contains only the values Down and Up, and the two class labels are in an approximately 1:1 proportion, meaning they are balanced. If there were a case of class imbalance, say a 1:2 proportion of class labels, we would have to bring both categories into approximately equal proportion first. Many techniques exist for this purpose (a short sketch follows the list), such as:
- Down Sampling
- Up Sampling
- Hybrid Sampling using SMOTE and ROSE
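As an illustration, caret ships with downSample() and upSample() helpers; a hedged sketch of how they could be applied to this dataset (not actually needed here, since the classes are already balanced):
R
# illustrative only: rebalancing helpers from caret,
# not required here since Down/Up are already ~1:1
balanced_down <- downSample(x = dataset[, -9],      # predictor columns
                            y = dataset$Direction,  # class labels
                            yname = "Direction")
balanced_up <- upSample(x = dataset[, -9],
                        y = dataset$Direction,
                        yname = "Direction")
table(balanced_down$Direction)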
Step 3: Building the model and generating the validation set
This step involves randomly splitting the dataset, generating the training and validation sets, and training the model. Below is the implementation.
R
# setting seed to generate a
# reproducible random sampling
set.seed(100)
# dividing the complete dataset
# into 2 parts having ratio of
# 70% and 30%
spl = sample.split(dataset$Direction, SplitRatio = 0.7)
# selecting that part of dataset
# which belongs to the 70% of the
# dataset divided in previous step
train = subset(dataset, spl == TRUE)
# selecting that part of dataset
# which belongs to the 30% of the
# dataset divided in previous step
test = subset(dataset, spl == FALSE)
# checking number of rows and columns
# in training and testing dataset
print(dim(train))
print(dim(test))
# Building the model
# training the model by assigning the Direction column
# as target variable and all other columns
# as independent variables
model_glm = glm(Direction ~ . , family = "binomial",
data = train, maxit = 100)
Output:
> print(dim(train))
[1] 875 9
> print(dim(test))
[1] 375 9
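Before predicting, the fitted model can optionally be inspected with summary(), which reports the estimated coefficients and their significance:
R
# optional: inspect the fitted coefficients
# and their significance levels
summary(model_glm)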
Step 4: Predicting the target variable
As the training of the model is complete, it is time to make predictions on the unseen data. Here, the target variable has only 2 possible values, so in the predict() function it is desirable to use type = "response" so that the model returns the probability score of each observation, a value between 0 and 1, rather than the log-odds.
There is an optional step of transforming the probability scores into a factor variable of class labels: if the probability score of a data point is above a certain threshold, it is treated as Up, and if below that threshold, as Down. Here, the probability cutoff is set as 0.5. Below is the code to implement these steps.
R
# predictions on the validation set
predictTest = predict(model_glm, newdata = test,
type = "response")
# assigning the probability cutoff as 0.5
predicted_classes <- as.factor(ifelse(predictTest >= 0.5,
"Up", "Down"))
Step 5: Evaluating the accuracy of the model
The best way to judge the accuracy of a classification machine learning model is through a confusion matrix. This matrix shows how many data points were predicted correctly and incorrectly, with reference to the actual values of the target variable in the testing dataset. Along with the confusion matrix, other statistical details of the model, like accuracy and kappa, can be calculated using the code below.
R
# generating confusion matrix and
# other details from the
# prediction made by the model
print(confusionMatrix(predicted_classes, test$Direction))
Output:
Confusion Matrix and Statistics
Reference
Prediction Down Up
Down 177 5
Up 4 189
Accuracy : 0.976
95% CI : (0.9549, 0.989)
No Information Rate : 0.5173
P-Value [Acc > NIR] : <2e-16
Kappa : 0.952
Mcnemar's Test P-Value : 1
Sensitivity : 0.9779
Specificity : 0.9742
Pos Pred Value : 0.9725
Neg Pred Value : 0.9793
Prevalence : 0.4827
Detection Rate : 0.4720
Detection Prevalence : 0.4853
Balanced Accuracy : 0.9761
'Positive' Class : Down
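Incidentally, confusionMatrix() returns an object from which individual statistics can also be pulled programmatically; a short usage sketch:
R
# storing the result to access individual statistics
cm <- confusionMatrix(predicted_classes, test$Direction)
cm$overall["Accuracy"]     # overall accuracy
cm$overall["Kappa"]        # Cohen's kappa
cm$byClass["Sensitivity"]  # sensitivity for the positive class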
For Regression Machine Learning Models
Regression models are used to predict a continuous quantity, like the price of a house or the sales of a product. Generally, in a regression problem, the target variable is a real number, such as an integer or floating-point value. The accuracy of this kind of model is calculated by averaging the errors made when predicting the output for the various data points. Below are the steps to implement the validation set approach for linear regression models.
Step 1: Loading the dataset and required packages
The R language contains a variety of inbuilt datasets. Here we are using the trees dataset, an inbuilt dataset, for the linear regression model. Below is the code to import the required dataset and the packages used to perform the various operations to build the model.
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross-validation methods
library(caret)
# access the data from R’s datasets package
data(trees)
# look at the first several rows of the data
head(trees)
Output:
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
So, in this dataset there are a total of 3 columns, among which Volume is the target variable. Since this variable is continuous in nature, a linear regression algorithm can be used to predict the outcome.
Step 2: Building the model and generating the validation set
In this step, the dataset is split randomly in a ratio of 80-20: 80% of the data points are used to train the model, while the remaining 20% act as the validation set, which gives us the accuracy of the model. Below is the code for the same.
R
# reproducible random sampling
set.seed(123)
# creating training data as 80% of the dataset
random_sample <- createDataPartition(trees$Volume,
                                     p = 0.8, list = FALSE)
# generating training dataset
# from the random_sample
training_dataset <- trees[random_sample, ]
# generating testing dataset
# from rows which are not
# included in random_sample
testing_dataset <- trees[-random_sample, ]
# Building the model
# training the model by assigning the Volume column
# as target variable and all other columns
# as independent variables
model <- lm(Volume ~ ., data = training_dataset)
Step 3: Predicting the target variable
After building and training the model, predictions of the target variable are made for the data points belonging to the validation set.
R
# predicting the target variable
predictions <- predict(model, testing_dataset)
Step 4: Evaluating the accuracy of the model
Statistical metrics used for evaluating the performance of a linear regression model are Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R-squared (R2) error. Among these, R2 gives the most direct judgement of fit, and its value should be high for a better model. Below is the code to calculate the prediction error of the model.
R
# computing model performance metrics
data.frame(R2 = R2(predictions, testing_dataset$Volume),
           RMSE = RMSE(predictions, testing_dataset$Volume),
           MAE = MAE(predictions, testing_dataset$Volume))
Output:
R2 RMSE MAE
1 0.9564487 5.274129 4.73567
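To make explicit what these metrics measure, they can also be computed by hand from the prediction errors; a base-R sketch that should reproduce the values above (caret's R2() defaults to the squared correlation between predicted and observed values):
R
# computing the same metrics manually from the errors
errors <- testing_dataset$Volume - predictions
rmse <- sqrt(mean(errors^2))  # root mean square error
mae <- mean(abs(errors))      # mean absolute error
# R-squared as the squared correlation between
# predictions and observed values
r2 <- cor(predictions, testing_dataset$Volume)^2
c(R2 = r2, RMSE = rmse, MAE = mae)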
Advantages of the Validation Set approach
- One of the most basic and simple techniques for evaluating a model.
- No complex steps for implementation.
Disadvantages of the Validation Set approach
- Predictions made by the model are highly dependent upon the particular subset of observations used for training and validation (see the sketch after this list).
- Using only one subset of the data for training can make the model biased.
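The first disadvantage can be demonstrated directly by repeating the split with different random seeds; a small sketch on the trees data showing how the measured RMSE varies from split to split:
R
# illustrating the split dependence: the measured RMSE
# changes noticeably across different random 80-20 splits
rmse_per_seed <- sapply(1:5, function(s) {
  set.seed(s)
  idx <- createDataPartition(trees$Volume, p = 0.8, list = FALSE)
  fit <- lm(Volume ~ ., data = trees[idx, ])
  RMSE(predict(fit, trees[-idx, ]), trees[-idx, "Volume"])
})
rmse_per_seed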