Optimizers, training, and evaluation
1. Optimizers, training, and evaluation
Welcome back! Let's look at model training and evaluation and the optimizer's role in the training process.
2. Training loop
Let's review the PyTorch training loop. First, we define the loss function, conventionally called criterion, and the optimizer. We'll use Binary Cross-Entropy, or BCE Loss, commonly used for binary classification tasks. We use Stochastic Gradient Descent, or SGD, as the optimizer, telling it which parameters to optimize - here, all of net's parameters - and passing it a learning rate of 0.01. We start the loop by iterating over epochs and batches of training data. For each batch, we clear the gradients to start from zero, then run a forward pass to get the model's outputs. Next, we compare the outputs to the ground-truth labels to compute the loss, reshaping the labels with the view method to match the shape of the outputs. Calling the backward method computes the gradients of the loss with respect to the model's parameters; these gradients carry information about the direction and size of the changes required to minimize the loss. Finally, the optimizer uses these gradients to perform an optimization step, that is, it updates the values of the model's parameters. Let's take a closer look at the optimization step.
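Putting those steps together, here is a minimal sketch of the loop just described. It assumes net, dataloader_train, and num_epochs are already defined, as in the earlier lessons.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Loss function and optimizer, as described above
criterion = nn.BCELoss()                          # binary cross-entropy loss
optimizer = optim.SGD(net.parameters(), lr=0.01)  # SGD over all of net's parameters

for epoch in range(num_epochs):
    for features, labels in dataloader_train:
        optimizer.zero_grad()                          # clear gradients from the previous batch
        outputs = net(features)                        # forward pass
        loss = criterion(outputs, labels.view(-1, 1))  # reshape labels to match outputs
        loss.backward()                                # compute gradients of the loss
        optimizer.step()                               # update parameters using the gradients
```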
3. How an optimizer works
In practice, neural networks can have billions of parameters, but let's consider an example with only two. Imagine we have the following parameter values and gradients. They are passed to the optimizer, which computes an update for each parameter. The updates are applied to the parameters, and the optimizer step is finished. But how does the optimizer know how much to update each parameter, and in which direction? The direction depends on the gradient's sign. The first parameter, for example, has a positive gradient, so it should be decreased in order to decrease the loss; hence, its update is negative. What about the size of the update? Different optimizers use different approaches to decide how much to update.
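To make the direction rule concrete, here is a small sketch with two hypothetical parameters; the values and gradients are made up for illustration, not taken from the slides.

```python
import torch

# Two hypothetical parameters with hand-set gradients (illustrative values only)
params = torch.tensor([0.8, -1.2], requires_grad=True)
params.grad = torch.tensor([0.5, -0.3])    # first gradient positive, second negative

learning_rate = 0.01
with torch.no_grad():
    update = -learning_rate * params.grad  # update has the opposite sign of the gradient
    params += update

print(update)  # tensor([-0.0050,  0.0030]): positive gradient -> negative update
print(params)  # params now holds [0.7950, -1.1970]
```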
8. Stochastic Gradient Descent (SGD)
In Stochastic Gradient Descent, or SGD, the size of the parameter update is governed by the learning rate, a predefined hyperparameter that stays the same for every parameter throughout training. SGD is computationally efficient, but because of its simplicity, it's rarely used in practice.
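The same update can be reproduced with torch.optim.SGD, reusing the illustrative numbers from the sketch above.

```python
import torch
from torch import optim

# Same illustrative parameters and gradients, now updated by torch.optim.SGD
params = torch.tensor([0.8, -1.2], requires_grad=True)
params.grad = torch.tensor([0.5, -0.3])

optimizer = optim.SGD([params], lr=0.01)
optimizer.step()   # applies params <- params - lr * grad

print(params)      # params now holds [0.7950, -1.1970], matching the manual update
```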
9. Adaptive Gradient (Adagrad)
Using the same learning rate for every parameter is not always optimal. Adaptive Gradient, or Adagrad, improves on SGD by adapting the learning rate of each parameter during training: parameters that are updated infrequently keep a larger effective learning rate. This makes it well-suited for sparse data, that is, data in which some features are not often observed. However, Adagrad tends to decrease the learning rate too fast.
10. Root Mean Square Propagation (RMSprop)
Root Mean Square Propagation, or RMSprop, addresses Adagrad's aggressive learning rate decay by adapting the learning rate for each parameter based on a moving average of its recent gradient magnitudes rather than the entire gradient history.
11. Adaptive Moment Estimation (Adam)
Finally, Adaptive Moment Estimation, or Adam, is arguably the most versatile and widely used optimizer. It combines RMSprop with the concept of momentum: an average of past gradients in which the most recent gradients carry more weight. Basing the update on both gradient size and momentum helps accelerate training. Adam is often the default go-to optimizer, and we will use it throughout the course.
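In code, switching between these optimizers is a one-line change. A sketch, assuming net is defined; the learning rates shown are illustrative, not tuned values.

```python
from torch import optim

# Any of these can replace the SGD optimizer in the training loop above.
optimizer = optim.SGD(net.parameters(), lr=0.01)
optimizer = optim.Adagrad(net.parameters(), lr=0.01)   # per-parameter learning rate adaptation
optimizer = optim.RMSprop(net.parameters(), lr=0.01)   # moving average of squared gradients
optimizer = optim.Adam(net.parameters(), lr=0.001)     # RMSprop plus momentum
```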
12. Model evaluation
Once the model is trained, we can evaluate its performance on test data. First, we set up the binary accuracy metric from torchmetrics. Then we put the model in evaluation mode with net.eval and iterate over dataloader_test with gradient calculation disabled, doing a forward pass to get predicted probabilities, which we transform into predicted labels using a 0.5 threshold. We update the accuracy metric with each batch, then compute and print the overall accuracy. We got over 67%, not bad for a basic model and a small dataset!
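A minimal sketch of that evaluation loop, assuming net and dataloader_test are defined and torchmetrics is installed.

```python
import torch
from torchmetrics import Accuracy

# Binary accuracy metric from torchmetrics
acc = Accuracy(task="binary")

net.eval()                                    # put the model in evaluation mode
with torch.no_grad():                         # no gradient calculation during evaluation
    for features, labels in dataloader_test:
        outputs = net(features)               # forward pass: predicted probabilities
        preds = (outputs >= 0.5).float()      # threshold at 0.5 to get predicted labels
        acc(preds, labels.view(-1, 1))        # update the metric with this batch

accuracy = acc.compute()                      # overall accuracy across all batches
print(f"Test accuracy: {accuracy}")
```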
13. Let's practice!
Let's practice!