Training vs Testing vs Validation Sets
Last Updated: 22 Nov, 2021
In this article, we look at how a dataset is split into training, testing and validation sets, and what each set is used for.
The fundamental purpose of splitting the dataset is to assess how well the trained model generalizes to new, unseen data. The split can be made with the train_test_split function of scikit-learn.
Training Set
This is the data the model actually trains on, i.e. the model sees and learns from this data in order to predict the outcome or make the right decisions. Training data is usually collected from several sources and then preprocessed and organized so that the model can perform well. The nature of the training data largely determines the model's ability to generalize, i.e. the better the quality and diversity of the training data, the better the model is likely to perform. The training set typically makes up more than 60% of the total data available for the project.
Example:
Python3
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Making dummy data to represent x and y:
# x holds the integers 0-15, reshaped
# into a matrix of shape 8x2
x = np.arange(16).reshape((8, 2))

# y holds the integers 0-7 and
# represents the target variable
y = range(8)

# Splitting the dataset in 80-20 fashion, i.e.
# the training set is 80% of the total data
# and the testing set is 20% of the total data
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    train_size=0.8,
                                                    random_state=42)

# Training set
print("Training set x: ", x_train)
print("Training set y: ", y_train)
Output:
Training set x: [[ 0 1]
[14 15]
[ 4 5]
[ 8 9]
[ 6 7]
[12 13]]
Training set y: [0, 7, 2, 4, 3, 6]
Explanation:
- First, we created a dummy 8x2 matrix with the NumPy library to represent the input x, and a sequence of the integers 0 to 7 to represent the target variable y.
- To split the dataset into training and testing data, we used the train_test_split function of the sklearn library.
- The input data x and the target variable y are passed as arguments to the function, which divides the dataset into 2 parts according to train_size, i.e. with train_size=0.8 the training set is 80% of the given dataset and the testing set is the remaining 20%. With 8 samples, that is 6 training samples and 2 testing samples.
- train_test_split shuffles the data before splitting by default. Setting random_state to a fixed integer (42 here) makes that shuffle reproducible, so the same split is produced on every run.
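The effect of random_state can be checked directly. The sketch below reuses the dummy data from the example and relies only on the documented random_state and shuffle parameters of train_test_split; the variable names are illustrative, and y is built as a list so the results can be compared with ==.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(16).reshape((8, 2))
y = list(range(8))

# Same random_state -> the same shuffled split on every call
_, _, y_train_a, y_test_a = train_test_split(x, y, train_size=0.8, random_state=42)
_, _, y_train_b, y_test_b = train_test_split(x, y, train_size=0.8, random_state=42)
same_split = (y_train_a == y_train_b) and (y_test_a == y_test_b)
print("Same split with the same seed:", same_split)

# shuffle=False turns shuffling off: the first 6 samples
# go to the training set and the last 2 to the testing set
_, _, y_train_ns, y_test_ns = train_test_split(x, y, train_size=0.8, shuffle=False)
print("Unshuffled training y:", y_train_ns)
print("Unshuffled testing y:", y_test_ns)
```

Turning shuffling off is mainly useful when the order of the samples matters, for example with time-ordered data.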
Testing Set
This dataset is independent of the training set but follows a similar probability distribution of classes. It is used as a benchmark to evaluate the model, and only after training is complete. A testing set is usually a well-organized dataset covering the kinds of scenarios the model is likely to face in the real world. The validation set and the testing set are often merged and used as a single testing set, which is not considered good practice because the data used to tune the model then also scores it. If the accuracy of the model on the training data is considerably higher than on the testing data, the model is said to be overfitting. The testing set is typically about 20-25% of the total data available for the project.
Example:
Python3
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Making dummy data to represent x and y:
# x holds the integers 0-15, reshaped
# into a matrix of shape 8x2
x = np.arange(16).reshape((8, 2))

# y holds the integers 0-7 and
# represents the target variable
y = range(8)

# Splitting the dataset in 80-20 fashion, i.e.
# the training set is 80% of the total data
# and the testing set is 20% of the total data
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    random_state=42)

# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)
Output:
Testing set x: [[ 2 3]
[10 11]]
Testing set y: [1, 5]
Explanation:
- To show how the train_test_split function works, we first created a dummy 8x2 matrix with the NumPy library to represent the input x, and a sequence of the integers 0 to 7 to represent the target variable y.
- To split the dataset into training and testing data, the input data x and the target variable y are passed as arguments to the function, which divides the dataset into 2 parts according to test_size, i.e. with test_size=0.2 the testing set is 20% of the given dataset and the training set is the remaining 80%.
- train_test_split shuffles the data before splitting by default. Setting random_state to a fixed integer (42 here) makes that shuffle reproducible, so the same split is produced on every run.
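The two ideas from this section, keeping a similar class distribution in both sets and comparing training accuracy with testing accuracy, can be sketched on a real dataset. This is only an illustration under assumptions that are not part of the example above: it uses scikit-learn's bundled iris dataset, the stratify parameter of train_test_split, and an unpruned DecisionTreeClassifier as a model that fits its training data very closely.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# An unpruned decision tree can fit the training data very closely
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# The testing set is touched only once, after training is complete
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Training accuracy: {train_acc:.3f}")
print(f"Testing accuracy:  {test_acc:.3f}")
print(f"Gap:               {train_acc - test_acc:.3f}")
```

A small gap is normal. A large gap between the two scores is the usual sign of overfitting.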
Validation Set
The validation set is used to fine-tune the hyperparameters of the model and is considered part of the model's training process. The model does not learn from this data; it is only evaluated on it, which gives a more objective estimate of performance than the training data does while the model is being tuned. The validation set can also be used for regularization by means of early stopping, i.e. interrupting training when the validation loss stops improving and starts to rise even though the training loss keeps falling, which limits overfitting. This set is typically about 10-15% of the total data available for the project, but this can change with the number of hyperparameters, i.e. a model with many hyperparameters generally benefits from a larger validation set. When the accuracy of the model on the validation data is close to (or higher than) its accuracy on the training data, the model is said to have generalized well.
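Interrupting training based on the validation loss is commonly called early stopping. The following is a minimal sketch of the idea, not a definitive implementation: it assumes synthetic regression data, a linear model trained by plain gradient descent in NumPy, and illustrative settings (learning_rate, patience, max_epochs and the 1e-6 improvement threshold are all arbitrary choices).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for this sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def mse(w, X, y):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((X @ w - y) ** 2))


w = np.zeros(X.shape[1])
learning_rate = 0.05
patience = 5        # epochs to wait for a validation improvement
max_epochs = 500
best_val_loss = float("inf")
best_w = w.copy()
epochs_without_improvement = 0

for epoch in range(max_epochs):
    # The gradient step uses the training data only
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= learning_rate * grad

    # The validation data only monitors progress; it never updates w
    val_loss = mse(w, X_val, y_val)
    if val_loss < best_val_loss - 1e-6:
        best_val_loss = val_loss
        best_w = w.copy()
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation loss has stopped improving

print(f"Stopped after epoch {epoch + 1}")
print(f"Best validation MSE: {best_val_loss:.4f}")
```

The weights kept at the end are best_w, the ones with the lowest validation loss, rather than the weights from the last epoch.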
Example:
Python3
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Making dummy data to represent x and y:
# x holds the integers 0-23, reshaped
# into a matrix of shape 8x3
x = np.arange(24).reshape((8, 3))

# y holds the integers 0-7 and
# represents the target variable
y = range(8)

# Splitting the dataset in 80-20 fashion, i.e.
# the training set is 80% of the total data and
# the combined testing & validation set is
# 20% of the total data
x_train, x_combine, y_train, y_combine = train_test_split(x, y,
                                                          train_size=0.8,
                                                          random_state=42)

# Splitting the combined set in 50-50 fashion, i.e.
# the testing set is 50% of the combined set and
# the validation set is 50% of the combined set
x_val, x_test, y_val, y_test = train_test_split(x_combine,
                                                y_combine,
                                                test_size=0.5,
                                                random_state=42)

# Training set
print("Training set x: ", x_train)
print("Training set y: ", y_train)
print(" ")

# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)
print(" ")

# Validation set
print("Validation set x: ", x_val)
print("Validation set y: ", y_val)
Output:
Training set x: [[ 0 1 2]
[21 22 23]
[ 6 7 8]
[12 13 14]
[ 9 10 11]
[18 19 20]]
Training set y: [0, 7, 2, 4, 3, 6]
Testing set x: [[15 16 17]]
Testing set y: [5]
Validation set x: [[3 4 5]]
Validation set y: [1]
Explanation:
- To obtain a validation set, a dummy 8x3 matrix is created with the NumPy library to represent the input x, and a sequence of the integers 0 to 7 represents the target variable y.
- Dividing the dataset into 3 parts takes two steps. First, the input data x and the target variable y are passed to train_test_split, which divides the dataset into 2 parts according to train_size, i.e. with train_size=0.8 the training set is 80% of the given dataset and the other part is the remaining 20%.
- That remaining part is the combined validation and testing set, holding 20% of the original dataset. It is passed to train_test_split again, which divides it into 2 parts according to test_size, i.e. with test_size=0.5 the testing set and the validation set are each 50% of the combined set, or 10% of the original dataset. With 8 samples, this yields 6 training samples, 1 validation sample and 1 testing sample.
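The two-step split can be wrapped in a small helper and then used the way the validation set is meant to be used: to pick a hyperparameter before the testing set is touched. This is a sketch under stated assumptions: three_way_split is a hypothetical helper (it is not part of scikit-learn), and the iris dataset, the KNeighborsClassifier model and the candidate values of k are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def three_way_split(X, y, train_frac=0.8, val_frac=0.1, seed=42):
    """Hypothetical helper: split data into train/validation/test subsets."""
    test_frac = 1.0 - train_frac - val_frac
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=train_frac, random_state=seed
    )
    # Share of the held-out part that becomes the testing set, e.g.
    # 0.1 / 0.2 = 0.5 for an 80/10/10 split (rounded to remove
    # floating-point noise)
    test_share = round(test_frac / (1.0 - train_frac), 10)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=test_share, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


X, y = load_iris(return_X_y=True)
X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y)
print(len(X_train), len(X_val), len(X_test))  # 120 15 15

# Choose the hyperparameter k using the validation set only
candidate_ks = (1, 3, 5, 7, 9)
best_k, best_val_acc = None, -1.0
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# The testing set is used once, for the final evaluation
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"Best k: {best_k} (validation accuracy {best_val_acc:.3f})")
print(f"Testing accuracy: {test_acc:.3f}")
```

Keeping the testing set out of the selection loop is what lets its score serve as an estimate of performance on unseen data.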