
Learning rate and momentum

1. Learning rate and momentum

The learning rate has come up a few times. The time has come for us to dig deeper.

2. Updating weights with SGD

Training a neural network means solving an optimization problem: we minimize the loss function by adjusting the model's parameters. To do this, we use an algorithm called stochastic gradient descent, or SGD, implemented in PyTorch. Recall that this is the optimizer we used to find the global minimum of loss functions. The optimizer takes the model's parameters along with two key arguments: the learning rate, which controls the step size of the updates, and momentum, which adds inertia so the optimizer moves smoothly and avoids getting stuck. Understanding how these two parameters affect training helps us train models more efficiently.
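
As a minimal sketch, here is how such an optimizer can be set up in PyTorch; the tiny linear model and the specific lr and momentum values are illustrative placeholders, not values from the slides.

```python
import torch
import torch.nn as nn

# A small placeholder model; any model with parameters works the same way
model = nn.Linear(in_features=8, out_features=2)

optimizer = torch.optim.SGD(
    model.parameters(),  # the parameters SGD will update
    lr=0.01,             # learning rate: controls the step size of each update
    momentum=0.9,        # momentum: adds inertia from previous updates
)

# In a training loop, each iteration would call:
# optimizer.zero_grad(), loss.backward(), optimizer.step()
```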

3. Impact of the learning rate: optimal learning rate

Let's try to find the minimum of a U-shaped function. We start at x = -2 and run the SGD optimizer for ten steps. After these steps, we observe that the optimizer is close to the minimum. We can also note that as we approach the minimum, the step size gradually decreases. This happens because the step size is the gradient multiplied by the learning rate. Since the function is less steep near zero, the gradient, and thus the step size, gets smaller.
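
Here is a minimal sketch of this kind of experiment; the U-shaped function (x squared), the learning rate of 0.1, and the printout are assumptions chosen so the optimizer lands close to the minimum after ten steps.

```python
import torch

# Start at x = -2 and treat x itself as the parameter to optimize
x = torch.tensor(-2.0, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)

for step in range(10):
    optimizer.zero_grad()
    loss = x ** 2          # U-shaped function with its minimum at x = 0
    loss.backward()        # gradient is 2 * x
    optimizer.step()       # step size = lr * gradient, so steps shrink near 0
    print(f"step {step}: x = {x.item():.3f}")
```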

4. Impact of the learning rate: small learning rate

However, if we use the same algorithm with a learning rate ten times smaller, we realize that we are still far from the minimum of the function after ten steps. The optimizer will take much longer to find the function's minimum.

5. Impact of the learning rate: high learning rate

If we use a high value for the learning rate, we observe that the optimizer cannot find the minimum and bounces back and forth between the two sides of the valley.
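
A minimal sketch comparing the three regimes on the same x squared example; the specific learning rate values below are assumptions picked to illustrate each behaviour, including the small and high rates from the last two steps.

```python
import torch

def run_sgd(lr, steps=10):
    # Minimize x**2 starting from x = -2 with the given learning rate
    x = torch.tensor(-2.0, requires_grad=True)
    optimizer = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = x ** 2
        loss.backward()
        optimizer.step()
    return x.item()

print(run_sgd(lr=0.1))   # close to the minimum at x = 0
print(run_sgd(lr=0.01))  # ten times smaller: still far from 0 after ten steps
print(run_sgd(lr=1.1))   # too high: bounces across the minimum and diverges
```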

6. Convex and non-convex functions

Recall that the loss functions of neural networks are typically non-convex. One of the challenges when trying to find the minimum of a non-convex function is getting stuck in a local minimum.

7. Without momentum

Let's run our optimizer for a hundred steps with momentum set to zero on this non-convex function. We see that the optimizer gets stuck in the first dip of the function, which is not the global minimum.

8. With momentum

However, when using a momentum of 0.9, we can find the global minimum of the function. This parameter gives the optimizer inertia, enabling it to roll through local dips, as shown here. The momentum keeps the step size large when previous steps were also large, even if the current gradient is small.
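
As a hedged sketch of both runs, the snippet below contrasts momentum values of 0.0 and 0.9 on a non-convex curve; the function x ** 4 - 4 * x ** 2 + x, the starting point x = 2, and the learning rate of 0.03 are assumptions chosen to make the effect visible, not the exact setup from the slides.

```python
import torch

def minimize(momentum, steps=100, lr=0.03):
    # Assumed setup: start at x = 2 and descend a double-well function with a
    # shallow dip near x ~ 1.35 and a deeper, global minimum near x ~ -1.47
    x = torch.tensor(2.0, requires_grad=True)
    optimizer = torch.optim.SGD([x], lr=lr, momentum=momentum)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = x ** 4 - 4 * x ** 2 + x
        loss.backward()
        optimizer.step()
    return x.item()

print(minimize(momentum=0.0))  # stalls in the first dip, roughly x ~ 1.35
print(minimize(momentum=0.9))  # rolls past the dip, ends near x ~ -1.47
```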

9. Summary

In summary, two key optimizer parameters impact training: the learning rate and momentum. The learning rate controls the step size, and typical values range from 0.01 to 0.0001. If it's too high, the optimizer may not find the minimum. If it's too low, training slows down. Momentum helps the optimizer move past local minima. Without it, the optimizer may get stuck. Here, typical values range from 0.85 to 0.99.

10. Let's practice!

Let's experiment with different learning rate and momentum values and discover their impact on training.