1. Using derivatives to update model parameters
Excellent job handling those loss functions. Let's now see how we can minimize loss.
2. An analogy for derivatives
We know a model predicts poorly when loss is high. We can use derivatives, or gradients, to minimize this loss.
Imagine the loss function as a valley. The derivative represents the slope, how steeply the curve rises or falls.
Steep slopes, shown by red arrows, indicate high derivatives and large steps. Gentler slopes, represented by green arrows, have smaller derivatives and smaller steps. On the valley floor, shown by the blue arrow, the slope is flat, and the derivative is zero. This point is the loss function's minimum, which we aim to reach.
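The valley analogy can be sketched with plain Python: gradient descent on the parabola loss(x) = x**2, whose derivative is 2*x. The starting point and learning rate below are illustrative assumptions.

```python
# A minimal sketch of the valley analogy: gradient descent
# on loss(x) = x**2, whose derivative (slope) is 2*x.
def loss(x):
    return x ** 2

def derivative(x):
    return 2 * x

x = 4.0      # start high on the valley wall
lr = 0.1     # learning rate scales each step
for step in range(20):
    slope = derivative(x)  # steep slope -> large step
    x = x - lr * slope     # move downhill
# x ends close to 0, the flat valley floor where the derivative is zero
```

Notice that the step size shrinks automatically as the slope flattens, so the updates slow down near the minimum.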
3. Convex and non-convex functions
Convex functions have a single global minimum. Non-convex functions, like most deep learning loss functions, can have multiple local minima: points where the value is lower than at nearby points but not the lowest overall.
When minimizing loss functions, we aim to locate the global minimum; in this example, it occurs where x is approximately one.
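A short sketch of why non-convexity matters: the polynomial below (an illustrative assumption, not the course's plot) has a local minimum on the right and a lower, global minimum on the left, and plain gradient descent lands in whichever one its starting point slopes toward.

```python
# f has a local minimum near x = 0.93 and a lower,
# global minimum near x = -1.06.
def f(x):
    return x**4 - 2 * x**2 + 0.5 * x

def df(x):
    return 4 * x**3 - 4 * x + 0.5

def descend(x, lr=0.01, steps=500):
    for _ in range(steps):
        x -= lr * df(x)  # follow the negative slope
    return x

right_min = descend(1.5)   # slides into the local minimum
left_min = descend(-1.5)   # slides into the global minimum
```

Both runs stop where the derivative is zero, but only one of those points is the global minimum.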
4. Connecting derivatives and model training
During training, we run a forward pass on the features and compute loss by comparing predictions to the target value.
Recall that layer weights and biases are randomly initialized when a model is created. We update them during training using a backward pass or backpropagation.
5. Connecting derivatives and model training
In deep learning, derivatives are known as gradients. We compute the loss function gradients and use them to update the model parameters, including weights and biases, with backpropagation, repeating until the layers are tuned.
6. Backpropagation concepts
During backpropagation, if we consider a network of three linear layers, we can calculate local loss gradients with respect to each layer's parameters.
We first calculate loss gradients with respect to the last layer's parameters, then propagate them back to the second and first layers in turn, repeating until we reach the first layer.
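This chain of local gradients can be sketched with plain arithmetic. The three "layers" below are single scalar weights (an illustrative assumption, far simpler than real linear layers), but the order of computation matches backpropagation: last layer first.

```python
# Three scalar "linear layers": y = w3 * (w2 * (w1 * x))
w1, w2, w3 = 0.5, -1.0, 2.0
x, target = 1.0, 3.0

# Forward pass, keeping intermediate activations
h1 = w1 * x
h2 = w2 * h1
y = w3 * h2
loss = (y - target) ** 2

# Backward pass: chain rule, starting from the last layer
dloss_dy = 2 * (y - target)
dloss_dw3 = dloss_dy * h2      # gradient for the last layer
dloss_dh2 = dloss_dy * w3
dloss_dw2 = dloss_dh2 * h1     # then the middle layer
dloss_dh1 = dloss_dh2 * w2
dloss_dw1 = dloss_dh1 * x      # finally the first layer
```

Each layer's gradient reuses the gradient already computed for the layer after it, which is why the pass runs backward.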
Let's see this with PyTorch.
7. Backpropagation in PyTorch
After running a forward pass, we define a loss function, here CrossEntropyLoss(), and use it to compare predictions with target values. Using .backward(), we calculate gradients based on this loss, which are stored in the .grad attributes of each layer's weights and biases.
Each layer in the model can be indexed starting from zero to access its weights, biases, and gradients.
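Putting these steps together, a minimal sketch might look like the following; the layer sizes and sample tensors are illustrative assumptions, not values from the course.

```python
import torch
import torch.nn as nn

# Illustrative model: layer 0 is the first Linear layer
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
features = torch.randn(1, 4)   # one sample with 4 features
target = torch.tensor([2])     # class index for CrossEntropyLoss

criterion = nn.CrossEntropyLoss()
prediction = model(features)               # forward pass
loss = criterion(prediction, target)       # compare to target
loss.backward()                            # compute gradients

# Gradients are stored in the .grad attribute of each parameter
weight_grad = model[0].weight.grad
bias_grad = model[0].bias.grad
```

Each gradient tensor has the same shape as the parameter it belongs to.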
8. Updating model parameters manually
To manually update model parameters, we access each layer's gradient, multiply it by the learning rate, and subtract this product from the weight. The learning rate is another tunable hyperparameter. We'll discuss it and the training loop later in the course.
9. Gradient descent
We use a mechanism called "gradient descent" to find the global minimum of loss functions.
PyTorch simplifies this with optimizers, like stochastic gradient descent (SGD).
We use optim to instantiate SGD. .parameters() returns an iterable of all model parameters, which we pass to the optimizer. We use a standard learning rate, "lr", here.
Once .backward() has computed the gradients, the optimizer updates every model parameter at once with .step(). Magic!
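A minimal sketch of one optimization step with SGD; the model sizes and sample data are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Illustrative model and data (assumed, not from the course)
model = nn.Sequential(nn.Linear(4, 2))
optimizer = optim.SGD(model.parameters(), lr=0.001)

features = torch.randn(1, 4)
target = torch.tensor([0])

loss = nn.CrossEntropyLoss()(model(features), target)
before = model[0].weight.detach().clone()
loss.backward()        # compute gradients
optimizer.step()       # update all parameters using those gradients
optimizer.zero_grad()  # reset gradients before the next iteration
```

Calling zero_grad() matters because .backward() accumulates gradients by default; without it, the next iteration's gradients would be added to the old ones.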
10. Let's practice!
Let's practice!