Datasets and DataLoaders in PyTorch
Last Updated: 18 Jul, 2021
PyTorch is a Python library developed by Facebook to run and train machine learning and deep learning models. Training a deep learning model requires us to convert the data into a format the model can process. PyTorch provides the torch.utils.data module to make data loading easy, with the Dataset and DataLoader classes.
A Dataset object is passed as the first argument to the DataLoader constructor and indicates where to load samples from. There are two types of datasets:
- map-style datasets: These implement the __getitem__() and __len__() methods, which return the sample at a given index and the total number of samples, respectively. The example in this article uses this type of dataset.
- iterable-style datasets: These represent the data as a stream of samples and implement the __iter__() method; a minimal sketch follows this list.
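For contrast with the map-style dataset used in the rest of this article, here is a minimal sketch of an iterable-style dataset. SquaresStream is a hypothetical toy class for illustration only, not part of the heart example:
Python3
import torch
from torch.utils.data import IterableDataset, DataLoader

class SquaresStream(IterableDataset):
    # yields (x, x**2) pairs one at a time instead of by index
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield torch.tensor([float(i)]), torch.tensor([float(i) ** 2])

# iterable-style datasets cannot be shuffled by the DataLoader,
# since there is no index to permute
loader = DataLoader(SquaresStream(8), batch_size=4)
for xb, yb in loader:
    print(xb.squeeze().tolist(), yb.squeeze().tolist())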
The DataLoader, on the other hand, not only lets us iterate through the dataset in batches but also gives us access to built-in functionality such as multiprocessing (loading multiple batches of data in parallel rather than one batch at a time) and shuffling; a short sketch of these arguments follows the syntax below.
Syntax:
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, *, prefetch_factor=2, persistent_workers=False)
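As a quick illustration of the more common arguments, here is a minimal sketch using a small TensorDataset so the snippet is self-contained; the toy data below is not the heart dataset:
Python3
import torch
from torch.utils.data import TensorDataset, DataLoader

# a tiny stand-in dataset: ten samples with three features each
features = torch.randn(10, 3)
labels = torch.arange(10)
toy = TensorDataset(features, labels)

# num_workers > 0 loads batches in parallel worker processes;
# drop_last discards the final short batch (10 % 4 = 2 samples here)
loader = DataLoader(toy, batch_size=4, shuffle=True,
                    num_workers=2, drop_last=True)

# the main guard matters when num_workers > 0 on platforms that
# spawn worker processes (Windows, macOS)
if __name__ == '__main__':
    for xb, yb in loader:
        print(xb.shape, yb.tolist())  # torch.Size([4, 3]) each time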
Dataset Used: heart
Let us deal with an example so that the concept becomes clearer.
First, import all the required libraries and the dataset to work with. We load the dataset into torch tensors and implement __getitem__() so that dataset[i] returns the i-th sample, and __len__() so that len(dataset) returns the number of samples. Then we unpack the first sample and print its features and label.
Example:
Python3
# importing libraries
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np


# class to represent the dataset
class HeartDataSet(Dataset):

    def __init__(self):
        # loading the csv file from the folder path
        data1 = np.loadtxt('heart.csv', delimiter=',',
                           dtype=np.float32, skiprows=1)

        # here the 14th column (index 13) is the class label
        # and the rest are features
        self.x = torch.from_numpy(data1[:, :13])
        self.y = torch.from_numpy(data1[:, [13]])
        self.n_samples = data1.shape[0]

    # support indexing such that dataset[i] can
    # be used to get the i-th sample
    def __getitem__(self, index):
        return self.x[index], self.y[index]

    # we can call len(dataset) to return the size
    def __len__(self):
        return self.n_samples


dataset = HeartDataSet()

# get the first sample and unpack it
first_data = dataset[0]
features, labels = first_data
print(features, labels)
Output:
tensor([ 63.0000, 1.0000, 3.0000, 145.0000, 233.0000, 1.0000, 0.0000,
150.0000, 0.0000, 2.3000, 0.0000, 0.0000, 1.0000]) tensor([1.])
The DataLoader takes this dataset as input, along with arguments such as batch_size and shuffle. Below we compute the total number of samples and the number of iterations per epoch, then iterate over the loader and print each batch of features and labels.
Example:
Python3
# loading the whole dataset with DataLoader
# shuffling the data is good practice for training
dataloader = DataLoader(dataset=dataset, batch_size=4, shuffle=True)

# total number of samples and number of iterations per epoch
total_samples = len(dataset)
n_iterations = total_samples // 4
print(total_samples, n_iterations)

for i, (features, labels) in enumerate(dataloader):
    print(features, labels)
Output:
303 75, followed by the shuffled batches: each iteration prints a 4x13 tensor of features and a 4x1 tensor of labels.
We now iterate over the data for training by looping first over the epochs and then over the batches, printing the epoch number, the step, and the shapes of the input and label tensors at every fifth iteration.
Example:
Python3
num_epochs = 2

for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(dataloader):
        # here: 303 samples, batch_size = 4, n_iterations = 303 // 4 = 75
        # the actual training step (forward pass, loss, backward pass)
        # would run here
        if (i + 1) % 5 == 0:
            print(f'Epoch: {epoch+1}/{num_epochs}, Step {i+1}/{n_iterations} | '
                  f'Inputs {inputs.shape} | Labels {labels.shape}')
Output:
Epoch: 1/2, Step 5/75 | Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 1/2, Step 10/75 | Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
...and so on, for every fifth step of both epochs.
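The comment in the loop above only marks where training would happen. As a hedged sketch of what that step could look like, here is a minimal logistic-regression setup; the model, loss, and optimizer below are illustrative choices, not part of the original example:
Python3
import torch.nn as nn

# a hypothetical minimal model for the 13 heart features
model = nn.Sequential(nn.Linear(13, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        outputs = model(inputs)            # forward pass
        loss = criterion(outputs, labels)  # compute the loss
        optimizer.zero_grad()              # reset accumulated gradients
        loss.backward()                    # backward pass
        optimizer.step()                   # update the weights
    print(f'Epoch {epoch+1}: loss = {loss.item():.4f}')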