
A deeper dive into loading data

1. A deeper dive into loading data

It's time to train our neural network! But before we can train, we need to load our data.

2. Our animals dataset

Efficient data handling is key to training deep learning models. Our animal classification data is in a CSV file and can be loaded using pd.read_csv(). We'll use hair, feathers, eggs, milk, predator, legs, and tail as features to predict an animal's type. The animal_name column isn't needed since names don't determine classification. Note the type column has three categories: bird (0), mammal (1), and reptile (2).
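
A minimal sketch of this loading step (the animals.csv filename is an assumption, not given in the slides):

import pandas as pd

# Load the animals dataset; the filename is assumed here
animals = pd.read_csv("animals.csv")
print(animals.head())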

3. Our animals dataset: defining features

We'll use .iloc to select all columns except the first (animal_name) and last (type), giving us our input features. These are converted into a NumPy array, X, for easier handling with PyTorch.
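
Based on that description, the feature selection might look like this (the animals DataFrame name carries over from the loading sketch above):

# Select all columns except the first (animal_name) and the last (type),
# then convert the result to a NumPy array
X = animals.iloc[:, 1:-1].to_numpy()
print(X.shape)  # (number of animals, 7 features)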

4. Back to our animals dataset: defining target values

Similarly, we can extract the last column (type), and store it as an array of our target values, which represent the class labels for each animal. We'll call this y.
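
A matching sketch for the target values:

# The last column (type) holds the class labels:
# bird (0), mammal (1), reptile (2)
y = animals.iloc[:, -1].to_numpy()
print(y.shape)  # one label per animal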

5. TensorDataset

We'll use TensorDataset to prepare data for PyTorch models. We first import torch and TensorDataset from torch.utils.data. This allows us to store our features (X) and target labels (y) as tensors, making them easy to manage. We convert X and y into tensors using PyTorch's tensor method and pass them to TensorDataset. To access an individual sample, we use square bracket indexing. dataset[0] returns a tuple containing the input features and label, which we unpack into input_sample and label_sample.
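
Putting that together (casting the features to float is a common choice for model inputs, assumed here):

import torch
from torch.utils.data import TensorDataset

# Wrap the features and labels as tensors in a TensorDataset
dataset = TensorDataset(torch.tensor(X).float(), torch.tensor(y))

# Square-bracket indexing returns a (features, label) tuple
input_sample, label_sample = dataset[0]
print("Input sample:", input_sample)
print("Label sample:", label_sample)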

6. DataLoader

Once we've created our dataset using TensorDataset, we can pass it to DataLoader to efficiently manage data loading during training. We start by importing DataLoader from torch.utils.data. Next, we define two key parameters: batch_size determines how many samples are included in each iteration. Since deep learning models require large datasets, batching helps process multiple samples at once, making training more efficient. shuffle randomizes the data order at each epoch, helping improve model generalization. One epoch is a full pass through the training dataloader, and generalization means the model performs well on unseen data rather than just memorizing the training set. We then create a DataLoader instance with these parameters, making it easy to iterate through our dataset in batches.
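
A sketch of this setup, using the small batch size from the example on the next slide:

from torch.utils.data import DataLoader

# batch_size: samples per iteration; shuffle: reorder the data each epoch
batch_size = 2
shuffle = True
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)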

7. DataLoader

Let's iterate through the DataLoader! Each element in the dataloader is a tuple, which we unpack as batch_inputs and batch_labels. Since our dataset contains five animals and we set a batch size of two, the first iteration randomly selects two animals and their corresponding labels. On the second iteration, two more samples are randomly selected with their labels. Finally, the last remaining sample is returned: since our dataset has an odd number of samples, the last batch contains just one item. In real-world deep learning, datasets are much larger, and batch sizes are typically 32 or more for better computational efficiency.
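
The iteration described above might look like this:

# Each element of the dataloader is a (batch_inputs, batch_labels) tuple
for batch_inputs, batch_labels in dataloader:
    print("batch inputs", batch_inputs)
    print("batch labels", batch_labels)

With five animals and a batch size of two, this prints two batches of two samples followed by a final batch of one.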

8. Let's practice!

Time to practice!
