Deep Learning with PyTorch

Dataset class

Any custom dataset class, such as our Dogs and Cats dataset class, has to inherit from the PyTorch Dataset class (torch.utils.data.Dataset). The custom class has to implement two methods: __len__(self) and __getitem__(self, idx). Any custom class acting as a Dataset class should look like the following code snippet:

from torch.utils.data import Dataset

class DogsAndCatsDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):
        pass

We do any required initialization inside the __init__ method: for example, reading an index table or, in our case, reading the filenames of the images. The __len__(self) method returns the number of elements in our dataset. The __getitem__(self, idx) method returns the element at position idx each time it is called. The following code implements our DogsAndCatsDataset class:

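from glob import glob

import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class DogsAndCatsDataset(Dataset):

    def __init__(self, root_dir, size=(224, 224)):
        # root_dir is passed straight to glob, so it must be a glob pattern
        self.files = glob(root_dir)
        self.size = size

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Load the image, resize it, and convert it to a NumPy array
        img = np.asarray(Image.open(self.files[idx]).resize(self.size))
        # The label is the parent folder name (assumes '/' path separators)
        label = self.files[idx].split('/')[-2]
        return img, label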

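Once the DogsAndCatsDataset class is defined, we can create an object and iterate over it, as shown in the following code:

dogsdset = DogsAndCatsDataset('path/to/images/*/*.jpg')  # placeholder glob pattern
for image, label in dogsdset:
    # Apply your deep learning algorithm to each (image, label) pair.
    pass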
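
To try these access patterns without any image files on disk, the following is a minimal sketch that uses a stand-in dataset. The class name RandomImageDataset, the random pixel values, and the alternating "dog"/"cat" labels are all assumptions made for illustration; they are not part of the original DogsAndCatsDataset. The sketch shows how len() and indexing map onto __len__ and __getitem__, and why a plain for loop over the dataset stops at the end.

```python
import numpy as np
from torch.utils.data import Dataset


class RandomImageDataset(Dataset):
    """Hypothetical stand-in for DogsAndCatsDataset that needs no files."""

    def __init__(self, num_items=6, size=(224, 224)):
        rng = np.random.default_rng(0)
        # Fake RGB images shaped (height, width, 3), like a resized photo.
        self.images = [
            rng.integers(0, 256, size=(size[1], size[0], 3), dtype=np.uint8)
            for _ in range(num_items)
        ]
        # Alternate the two class names so both labels appear.
        self.labels = ["dog" if i % 2 == 0 else "cat" for i in range(num_items)]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # List indexing raises IndexError past the end, which is what lets
        # a plain for loop over the dataset terminate.
        return self.images[idx], self.labels[idx]


dataset = RandomImageDataset()
print(len(dataset))          # calls __len__
image, label = dataset[0]    # calls __getitem__(0)
print(image.shape, label)

# Iterating calls __getitem__ with 0, 1, 2, ... until IndexError is raised.
labels_seen = [label for _, label in dataset]
```

Because the Dataset base class does not define its own iteration method, the loop relies on Python's fallback of calling __getitem__ with increasing indices. The original DogsAndCatsDataset behaves the same way, since indexing self.files past the end also raises IndexError.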

Applying a deep learning algorithm to one instance of data at a time is inefficient. We need batches of data, because modern GPUs perform best when they process a whole batch at once. The DataLoader class creates batches for us and hides much of the complexity involved.
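
As a rough illustration of batching, the sketch below wraps a small in-memory dataset in a DataLoader. The all-zero tensors, the integer labels, and batch_size=4 are placeholder assumptions chosen to keep the example self-contained; with the DogsAndCatsDataset above you would pass that dataset object instead.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 10 fake RGB images of 224x224 pixels with integer labels
# (0 = cat, 1 = dog). These values are placeholders, not real photos.
images = torch.zeros(10, 3, 224, 224)
labels = torch.tensor([0, 1] * 5)
dataset = TensorDataset(images, labels)

# batch_size=4 is an arbitrary choice for illustration.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for image_batch, label_batch in loader:
    # Each iteration yields a stacked batch instead of a single sample.
    batch_shapes.append(tuple(image_batch.shape))
    print(image_batch.shape, label_batch.shape)
```

With 10 samples and a batch size of 4, the loader yields two full batches and a final batch of 2. The default collation stacks samples along a new first dimension, so every sample needs the same shape; resizing each image to a fixed size in __getitem__ serves that purpose.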