
PyTorch’s DataLoader

PyTorch’s DataLoader is a useful feature that keeps your data organized and simplifies your machine learning pipeline. It allows us to iterate over the data, manage batches, and shuffle the samples to avoid overfitting. In this article, we’ll go through the DataLoader implementation in PyTorch. Before that, we will go through the basic terminology that we will be using while implementing the data loader. We’ll then start with the Fashion-MNIST dataset bundled with PyTorch. Later, we will use DataLoader with our own custom dataset.

What is PyTorch?

PyTorch is an open-source deep learning framework for constructing neural network architectures, including high-level models such as RNNs, CNNs, and LSTMs. It is utilized by researchers, businesses, and the ML and AI communities.

Facebook’s artificial intelligence research group is principally responsible for its development.

What is Tensor (mathematics-based approach)?

Exert a force on a surface and watch how it deflects in different directions. You might anticipate it to move in the same direction as the force, but this does not always happen; the reason is that the material is not uniform in all directions; it may have some structure, such as crystals or layers. Your starting point is a force, which is a vector (with three components, one each for the x, y, and z directions). You receive back a deflection vector (movement in x, y, and z). The motion, however, need not point in the same direction as the force! Let’s assume that the response is proportional to the force, meaning doubling the force doubles the movement. This is called a “linear response.”

How can you put all of this into mathematical terms? With a tensor, of course. Consider a tensor to be a mathematical instrument that takes a vector (such as the force) and returns a new vector (the motion). A tensor is only strictly required when the two vectors can point in different directions; if the response were always parallel to the force, a single number would do.
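To make this concrete, here is a minimal sketch (not from the original article) that represents such a linear response as a rank-2 tensor, i.e., a 3 × 3 matrix that maps a force vector to a deflection vector. The numbers are made up purely for illustration.

import torch

# illustrative rank-2 tensor describing an anisotropic material
response = torch.tensor([[1.0, 0.2, 0.0],
                         [0.2, 0.5, 0.0],
                         [0.0, 0.0, 2.0]])

force = torch.tensor([1.0, 0.0, 0.0])  # push along x only

# the deflection is a linear function of the force
deflection = response @ force
print(deflection)  # tensor([1.0000, 0.2000, 0.0000]) -- not parallel to the force

# doubling the force doubles the deflection (the "linear response")
print(response @ (2 * force))  # tensor([2.0000, 0.4000, 0.0000])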

Iterations, Batches, Epochs. What are they in terms of Neural Networks?

The number of iterations (denoted by n here) is the number of times a batch of training instances is used to estimate the gradient and update the neural network’s parameters.

The batch size B refers to how many training instances are employed in a single iteration. Batching is usually used when the total number of training instances N is quite large, and it is usually effective to divide the entire dataset into mini-batches, each of size 1 < B < N.

To use the full training data once takes n = N/B iterations; one such full pass is called an epoch. So (N/B) × E, where E is the number of epochs, is the total number of times the parameters are updated, as the sketch below illustrates.
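As a quick sanity check, this arithmetic can be written out directly. The numbers below are illustrative (they match Fashion-MNIST, which we load later):

N = 60000  # total number of training instances
B = 32     # batch size
E = 5      # number of epochs

iterations_per_epoch = N // B               # n = N/B
total_updates = iterations_per_epoch * E    # (N/B) * E
print(iterations_per_epoch, total_updates)  # 1875 9375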

There are three types of gradient descent. There is a tradeoff between them, as one can give good accuracy but is slow, while another is faster but does not guarantee good accuracy (see the sketch after this list):

Batch mode: B = N, so one epoch equals one iteration.

Mini-batch mode: 1 < B < N, with N/B iterations per epoch.

Stochastic mode: B = 1, so one epoch takes N iterations.
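In DataLoader terms (covered in detail below), the three modes differ only in the batch_size argument. A minimal sketch, using a toy dataset of N = 8 samples:

import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.arange(8).float())    # N = 8 toy samples

batch_gd      = DataLoader(data, batch_size=8)   # B = N: 1 iteration per epoch
mini_batch_gd = DataLoader(data, batch_size=4)   # 1 < B < N: N/B = 2 iterations
stochastic_gd = DataLoader(data, batch_size=1)   # B = 1: N = 8 iterations

for name, loader in [("batch", batch_gd),
                     ("mini-batch", mini_batch_gd),
                     ("stochastic", stochastic_gd)]:
    print(name, "->", len(loader), "iterations per epoch")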

Implementing DataLoader on Fashion MNIST Dataset

Loading the Fashion-MNIST Dataset from PyTorch

Fashion-MNIST is an image dataset that includes 60,000 training and 10,000 test instances. Each example is a 28 × 28 grayscale image paired with a label from one of ten categories. Below are some parameters that you specify while loading the dataset.

root: This is the directory in which the dataset is saved.

train: indicates whether the training or the test split should be loaded.

download: If the data is not available at root, it is downloaded via the internet.

transform and target_transform: These parameters specify the feature and label transformations.

import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt


train = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

Custom Dataset of your Files

import os
import pandas as pd
from torchvision.io import read_image

# create class for Custom Dataset
class CustomDataset(Dataset):

    # initialize the dataset
    def __init__(self, annotations, img_dir, trans=None, target_trans=None):
        self.labels = pd.read_csv(annotations)
        self.img_dir = img_dir
        self.trans = trans
        self.target_trans = target_trans

    # function to return length of data
    def __len__(self):
        return len(self.labels)

    # function to get sample at given index
    def __getitem__(self, index):
        path = os.path.join(self.img_dir, self.labels.iloc[index, 0])
        img = read_image(path)
        label = self.labels.iloc[index, 1]
        if self.trans:
            img = self.trans(img)
        if self.target_trans:
            label = self.target_trans(label)
        return img, label

In the above code, we see three important methods:

__init__: This function is called when the Dataset object is created. It sets up the directory containing the images, the annotations file, and both transforms.

__len__: This function returns the length of the dataset, i.e., the number of samples it contains.

__getitem__: This method returns the sample present at a given index.
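To use the class, you would point it at your own files. The paths below are hypothetical placeholders; the code assumes an annotations CSV whose first column holds image file names and whose second column holds labels:

# hypothetical paths -- replace with your own annotations CSV and image folder
dataset = CustomDataset(annotations="annotations.csv", img_dir="images/")

print(len(dataset))      # number of rows in the annotations CSV
img, label = dataset[0]  # __getitem__ reads and returns the first sample
print(img.shape, label)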

 

Training with DataLoader

Store the datasets in data loaders. A data loader is an iterable that allows you to retrieve the samples in mini-batches during training and to reshuffle the data after all batches have been processed.

from torch.utils.data import DataLoader

train_loader = DataLoader(train, batch_size=32, shuffle=True)

test_loader = DataLoader(test, batch_size=32, shuffle=True)

Iterate the DataLoader

# Display image and label.
train_features, train_labels = next(iter(train_loader))

print(f"Features shape of the current batch is {train_features.size()}")
print(f"Labels shape of the current batch shape is {train_labels.size()}")

img = train_features[0].squeeze()
label = train_labels[0]
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label}")

Output

Features shape of the current batch is torch.Size([32, 1, 28, 28])

Labels shape of the current batch shape is torch.Size([32])

Label: 5

Each iteration in the above code returns a batch of training features and the corresponding labels. Because we set shuffle=True, the data is shuffled after all of the batches have been processed, which helps avoid overfitting.
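For completeness, here is a minimal sketch of how these batches are typically consumed in a training loop. The model, loss function, and optimizer below are illustrative placeholders, not part of the original example:

import torch.nn as nn

# toy classifier: flatten each 28 x 28 image into a 10-way linear layer
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2):                     # E = 2 epochs
    for features, labels in train_loader:  # N/B iterations per epoch
        optimizer.zero_grad()              # reset gradients from the last step
        loss = loss_fn(model(features), labels)
        loss.backward()                    # estimate the gradient for this batch
        optimizer.step()                   # update the network's parameters
    print(f"epoch {epoch}: last batch loss = {loss.item():.4f}")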

Implementing Data Loader on a Custom Dataset

# importing the libraries we will be using
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# defining the Dataset class
class Datasets(Dataset):
    # initializing the dataset here
    def __init__(self):
        numbers = list(range(0, 20, 1))
        self.data = numbers

    # get length of the dataset here
    def __len__(self):
        return len(self.data)

    # get the item at an index
    def __getitem__(self, index):
        return self.data[index]

# create an object of the Datasets class
dataset = Datasets()

# implementing data loader on the dataset and specifying the parameters
dataloader = DataLoader(dataset, batch_size=5, shuffle=True)
for i, batch in enumerate(dataloader):
    print(i, batch)

Output

0 tensor([ 0, 4, 9, 15, 14])

1 tensor([11, 16, 12, 3, 10])

2 tensor([ 6, 8, 2, 17, 1])

3 tensor([ 7, 18, 5, 13, 19])

Conclusion

We went through the implementation of PyTorch’s DataLoader to manage the loading of our training data. We now see how easily we can manage batching, shuffling, and iteration of our datasets using DataLoader. This helps us analyze our models better and ultimately improve them.

About the author

Simran Kaur

Simran works as a technical writer. She holds an MS in Computer Science from a university in the well-known CS hub, Silicon Valley, and is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Travelling, sketching, and gardening are the hobbies that interest her.