How To Do Train Test Split Using Sklearn In Python
Last Updated: 27 Jun, 2022
In this article, let's learn how to do a train test split using Sklearn in Python.
Train Test Split Using Sklearn
The train_test_split() function is used to split our data into train and test sets.
First, we need to divide our data into features (X) and labels (y). train_test_split() then divides them into X_train, X_test, y_train and y_test. The X_train and y_train sets are used for training and fitting the model. The X_test and y_test sets are used to test whether the fitted model predicts the right outputs/labels on data it has not seen. We can explicitly set the size of the train and test sets. It is suggested to keep the train set larger than the test set.
Train set: The training dataset is the data used to fit the model. This data is seen and learned by the model.
Test set: The test dataset is a subset of the full dataset that is held out from training and used to give an unbiased evaluation of the final model fit.
Validation set: A validation dataset is a sample of data held back from your model's training set that is used to estimate model performance while tuning the model's hyperparameters.
By default, 25% of our data goes into the test set and 75% goes into the training set.
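train_test_split() produces only two parts per call, so one common way to obtain a separate validation set is to call it twice. The following is a minimal sketch on synthetic data; the array contents and the 60/20/20 proportions are assumptions chosen for illustration, not a recommendation from this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 100 samples, one feature each.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Step 1: hold out 20% of all samples as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: hold out 25% of the remaining 80 samples as the validation set,
# which leaves 60 samples for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The model would then be fitted on the train part, tuned against the validation part, and evaluated once on the test part.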
Syntax: sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
Parameters:
- *arrays: sequence of indexables with the same length. Lists, NumPy arrays, SciPy sparse matrices, and pandas dataframes are all valid inputs.
- test_size: float or int, by default None. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, it is the absolute number of test samples. If None, the complement of the train size is used. It will be set to 0.25 if train_size is also None.
- train_size: float or int, by default None. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, it is the absolute number of train samples. If None, the complement of the test size is used.
- random_state: int, RandomState instance or None, by default None. Controls how the data is shuffled before the split is applied. For repeatable output across several function calls, pass an int.
- shuffle: bool, by default True. Whether or not the data should be shuffled before splitting. If shuffle=False, stratify must be None.
- stratify: array-like, by default None. If not None, the data is split in a stratified fashion, using this array as the class labels.
Returns: splitting: list of length 2 * len(arrays), containing the train-test split of each input.
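To make two of these parameters concrete, here is a small sketch of stratify and of an integer test_size. The imbalanced labels are synthetic and chosen only so that the class counts are easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration: 90 of class 0, 10 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio in both parts of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(np.bincount(y_train).tolist())  # [72, 8]
print(np.bincount(y_test).tolist())   # [18, 2]

# An int test_size is an absolute number of test samples, not a fraction.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=10, random_state=42)
print(len(X_train2), len(X_test2))  # 90 10
```

Without stratify, a random split of such a small minority class could leave the test set with very few (or no) samples of class 1.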
Example 1:
The numpy, pandas, and scikit-learn packages are imported, and the CSV file is read into a dataframe. X contains the features and y contains the labels. We split the dataframe into X and y and perform a train test split on them. random_state acts like a NumPy seed and makes the split reproducible. test_size is given as 0.25, which means 25% of our data goes into the test set. The train size is 1 - test_size, so we don't need to specify it. shuffle=True shuffles our data before splitting. The X_train and y_train sets are used to fit and train our model, and the X_test and y_test sets are used for testing and validating.
Python3
# import packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# importing data
df = pd.read_csv('headbrain1.csv')

# head of the data
print(df.head())

# features (X) and labels (y)
X = df['Head Size(cm^3)']
y = df['Brain Weight(grams)']

# using the train test split function
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.25, shuffle=True)

# printing out train and test sets, with their shapes
print('X_train : ')
print(X_train.head())
print(X_train.shape)
print('')
print('X_test : ')
print(X_test.head())
print(X_test.shape)
print('')
print('y_train : ')
print(y_train.head())
print(y_train.shape)
print('')
print('y_test : ')
print(y_test.head())
print(y_test.shape)
Output:
Head Size(cm^3) Brain Weight(grams)
0 4512 1530
1 3738 1297
2 4261 1335
3 3777 1282
4 4177 1590
X_train :
99 3478
52 4270
184 3479
139 3171
107 3399
Name: Head Size(cm^3), dtype: int64
(177,)
X_test :
66 3415
113 3594
135 3436
227 4204
68 4430
Name: Head Size(cm^3), dtype: int64
(60,)
y_train :
99 1270
52 1335
184 1160
139 1127
107 1226
Name: Brain Weight(grams), dtype: int64
(177,)
y_test :
66 1310
113 1290
135 1235
227 1380
68 1510
Name: Brain Weight(grams), dtype: int64
(60,)
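Example 1 passes random_state=104 so that the split is repeatable. The short sketch below, on a small synthetic array rather than the CSV data, illustrates what that means and how shuffle=False behaves instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only.
X = np.arange(8)

# The same integer random_state gives the same shuffle on every call.
train_a, test_a = train_test_split(X, test_size=0.25, random_state=104)
train_b, test_b = train_test_split(X, test_size=0.25, random_state=104)
print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))  # True True

# With shuffle=False the data is split in its original order.
train_c, test_c = train_test_split(X, test_size=0.25, shuffle=False)
print(train_c.tolist(), test_c.tolist())  # [0, 1, 2, 3, 4, 5] [6, 7]
```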
Example 2:
In this example, the same steps are followed, but instead of specifying test_size we specify train_size. The test size is 1 - train_size: 80% of the data is the train set, so 20% of our data is the test set. If we specify neither size, test_size defaults to 0.25. X_train and y_train have the same shape and indexes, as y_train holds the labels for the X_train features. The same goes for X_test and y_test.
Python3
# import packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# importing data
df = pd.read_csv('headbrain1.csv')
print(df.shape)
# head of the data
print(df.head())
# features (X) and labels (y)
X = df['Head Size(cm^3)']
y = df['Brain Weight(grams)']
# using the train test split function
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, train_size=0.8, shuffle=True)
# printing out train and test sets
print('X_train : ')
print(X_train.head())
print(X_train.shape)
print('')
print('X_test : ')
print(X_test.head())
print(X_test.shape)
print('')
print('y_train : ')
print(y_train.head())
print(y_train.shape)
print('')
print('y_test : ')
print(y_test.head())
print(y_test.shape)
Output:
(237, 2)
Head Size(cm^3) Brain Weight(grams)
0 4512 1530
1 3738 1297
2 4261 1335
3 3777 1282
4 4177 1590
X_train :
110 3695
164 3497
58 3935
199 3297
182 4005
Name: Head Size(cm^3), dtype: int64
(189,)
X_test :
66 3415
113 3594
135 3436
227 4204
68 4430
Name: Head Size(cm^3), dtype: int64
(48,)
y_train :
110 1310
164 1280
58 1330
199 1220
182 1280
Name: Brain Weight(grams), dtype: int64
(189,)
y_test :
66 1310
113 1290
135 1235
227 1380
68 1510
Name: Brain Weight(grams), dtype: int64
(48,)
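The examples above stop at the split itself. As a rough sketch of the next step, the code below fits a model on the train set and scores it on the test set. It uses synthetic stand-in data with hypothetical column names ('feature', 'target') rather than the CSV file, and LinearRegression is an assumed choice of model for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the CSV from the examples): a noisy linear relation.
rng = np.random.default_rng(0)
feature = rng.uniform(0, 100, size=200)
df = pd.DataFrame({
    'feature': feature,
    'target': 3.0 * feature + 5.0 + rng.normal(0, 2.0, size=200),
})

# Double brackets keep X two-dimensional, which scikit-learn estimators expect.
X = df[['feature']]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=104)

# Fit on the training set only, then score on the held-out test set.
model = LinearRegression()
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(f'R^2 on the test set: {r2:.3f}')
```

Because the test rows are never passed to fit(), the score reflects how the model performs on data it has not seen.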