Dataset class
Any custom dataset class, such as our Dogs dataset class, has to inherit from the PyTorch Dataset class. The custom class has to implement two main methods, __len__(self) and __getitem__(self, idx). Any custom class acting as a Dataset class should look like the following code snippet:
from torch.utils.data import Dataset

class DogsAndCatsDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):
        pass
We do any required initialization inside the __init__ method; for example, reading the index of a table or, in our case, reading the filenames of the images. The __len__(self) method returns the number of elements in our dataset. The __getitem__(self, idx) method returns the element at position idx every time it is called. The following code implements our DogsAndCatsDataset class:
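from glob import glob
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class DogsAndCatsDataset(Dataset):
    def __init__(self, root_dir, size=(224, 224)):
        # root_dir is passed straight to glob, so it must be a pattern that matches the image files
        self.files = glob(root_dir)
        self.size = size

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img = np.asarray(Image.open(self.files[idx]).resize(self.size))
        # The label is the name of the folder that contains the image
        label = self.files[idx].split('/')[-2]
        return img, label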
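The label in __getitem__ comes from the name of the folder that holds each image. The following is a minimal sketch of that step in isolation. The helper name label_from_path and the example paths are hypothetical and used only for illustration; the sketch uses pathlib instead of splitting on '/' so that it does not depend on the path separator being written by hand:

```python
from pathlib import Path


def label_from_path(file_path):
    """Return the name of the folder that directly contains the file."""
    return Path(file_path).parent.name


# Hypothetical paths, used only for illustration.
example_paths = ["data/train/dog/dog.1.jpg", "data/train/cat/cat.7.jpg"]
labels = [label_from_path(p) for p in example_paths]
print(labels)
```

This assumes a directory layout with one folder per class, which is the same assumption the split('/')[-2] expression makes.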
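Once the DogsAndCatsDataset class is created, we can create an object and iterate over it, as shown in the following code:
dogsdset = DogsAndCatsDataset('path/to/train/*/*.jpg')  # placeholder pattern; point it at your own images
for image, label in dogsdset:
    # Apply your DL on the dataset.
    pass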
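A plain for loop works on such an object because, when a class defines __getitem__ but not __iter__, Python calls __getitem__ with 0, 1, 2, and so on until an IndexError is raised. The following is a minimal sketch of that behavior using a hypothetical in-memory class, SquaresDataset, which is not part of PyTorch and needs no image files:

```python
class SquaresDataset:
    """Hypothetical stand-in for a dataset; it only defines __len__ and __getitem__."""

    def __init__(self, n):
        self.values = [i * i for i in range(n)]

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        # Indexing past the end of the list raises IndexError, which ends a for loop.
        value = self.values[idx]
        label = "even" if value % 2 == 0 else "odd"
        return value, label


squares = SquaresDataset(4)
items = [(value, label) for value, label in squares]
print(len(squares))
print(items)
```

In DogsAndCatsDataset the same thing happens: indexing self.files past its end raises the IndexError that stops the loop.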
Applying a deep learning algorithm to one data instance at a time is inefficient. We need batches of data, because modern GPUs perform best when they process a whole batch at once. The DataLoader class creates batches for us and abstracts away much of the complexity.
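As a rough illustration of batching, the following sketch wraps a small dataset in a DataLoader. ToyImageDataset is a hypothetical class that returns blank tensors in place of real images, and the batch size of 4 is an arbitrary choice for the example:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyImageDataset(Dataset):
    """Hypothetical dataset of blank 3x8x8 'images' with alternating 0/1 labels."""

    def __init__(self, n_samples=10):
        self.n_samples = n_samples

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        if idx >= self.n_samples:
            raise IndexError(idx)
        image = torch.zeros(3, 8, 8)
        label = idx % 2
        return image, label


# The DataLoader groups individual (image, label) pairs into batches.
loader = DataLoader(ToyImageDataset(), batch_size=4, shuffle=False)
batch_shapes = [(tuple(images.shape), tuple(labels.shape)) for images, labels in loader]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the loader yields two full batches and a final batch of 2, because the last incomplete batch is kept by default.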