1. Gradient descent
With gradient descent, you repeatedly found a slope capturing how your loss function changes as a weight changes. You then made a small change to the weight to get to a lower point, and you repeated this until you couldn't go downhill any more.
13. Gradient descent
If the slope is positive, going opposite the slope means moving to lower numbers. Subtracting the slope from the current value achieves this. But too big a step might lead us far astray. So, instead of directly subtracting the slope, we multiply the slope by a small number, called the learning rate, and subtract that product from the weight. Learning rates are frequently around point-01. This ensures we take small steps, so we reliably move towards the optimal weights. But how do we find the relevant slope for each weight we need to update? Working this out for yourself involves calculus, especially the application of the chain rule. Don't worry if you don't remember or don't know the underlying calculus. We'll explain some basic concepts here, and Keras and TensorFlow do the calculus for us.
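As a quick sketch of that update rule in Python, here is the arithmetic on its own; the weight, slope, and learning rate below are made-up numbers chosen only for illustration.

# Made-up values, purely to show the update arithmetic.
weight = 1.5          # current value of the weight
slope = 8.0           # slope of the loss with respect to this weight
learning_rate = 0.01  # small multiplier so we take small steps

# Update: subtract the learning rate times the slope.
weight = weight - learning_rate * slope
print(weight)  # 1.42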
14. Slope calculation example
Here is a first example to calculate a slope for a weight, and in this example we will look at a single data point. Weights feed from one node into another, and you always get the slope you need by multiplying three things. First, the slope of the loss function with respect to the value at the node we feed into. Second, the value of the node that feeds into our weight. Third, the slope of the activation function with respect to the value we feed into. Let's start with the slope of the loss function with respect to the value of the node our weight feeds into. In this case,
15. Slope calculation example
that node is the model's prediction. If you work through some calculus,
16. Slope calculation example
you will find that the slope of the mean-squared loss function with respect to the prediction is 2 times (predicted value - actual value), which is 2 times the error.
Here, the prediction from forward propagation was 6. The actual target value is 10, so the error is 6 minus 10, which is -4.
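If you want to check that slope numerically rather than with calculus, here is a small finite-difference sketch using the prediction of 6 and target of 10 from this example; the step size eps is an arbitrary choice for illustration.

# Check that the slope of the squared error at prediction=6, target=10
# is approximately 2 * error = 2 * (6 - 10) = -8.
prediction, target = 6.0, 10.0

def squared_error(p):
    return (p - target) ** 2

eps = 1e-6  # small step for a finite-difference estimate
numerical_slope = (squared_error(prediction + eps) - squared_error(prediction - eps)) / (2 * eps)
print(numerical_slope)            # approximately -8.0
print(2 * (prediction - target))  # exactly -8.0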
17. Slope calculation example
The second thing we multiply is the value at the node we are feeding from. Here, that is 3.
18. Slope calculation example
Finally, the slope of the activation function at the value we feed into.
19. Slope calculation example
Since we don't have an activation function here,
20. Slope calculation example
we can leave that out.
So our final result for the slope of the loss
21. Slope calculation example
if we graphed it against this weight is 2 times -4 times 3, or negative 24. We would now improve this weight by subtracting the learning rate times that slope of minus 24. If the learning rate were point-01, we would update this weight to be 2-point-24. That gives us a better model, and it would continue improving if we repeated this process. For multiple weights feeding into the output, we repeat this calculation separately for each weight. Then we update all of those weights simultaneously using their respective slopes.
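Here is a minimal sketch of that single-weight calculation in Python. The input value of 3, target of 10, and learning rate of point-01 come from the example; the starting weight of 2 is inferred from the stated update to 2-point-24.

# Worked single-weight example: input node value 3, weight 2, target 10.
input_value = 3.0
weight = 2.0             # starting weight inferred from the update to 2.24
target = 10.0
learning_rate = 0.01

prediction = input_value * weight        # 6.0, as in forward propagation
error = prediction - target              # -4.0

# Slope of the loss with respect to this weight:
# 2 * error * value of the node feeding into the weight.
slope = 2 * error * input_value          # -24.0

# Update the weight by a small step against the slope.
weight = weight - learning_rate * slope  # 2.24
print(weight)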
22. Network with two inputs affecting prediction
Here is a network with two weights going directly to an output, and again with no activation function.
23. Code to calculate slopes and update weights
Let’s see the code to calculate slopes and update the weights. First, we set up the weights, input data, and a target value to predict.
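The slide's exact values are not reproduced in this transcript, so here is one possible setup, with weights, input data, and a target chosen so that the initial error is 5 and one update brings it down to 2-point-5, matching the numbers discussed next.

import numpy as np

# Assumed values, chosen to be consistent with the errors quoted below (5, then 2.5).
weights = np.array([1.0, 2.0])
input_data = np.array([3.0, 4.0])
target = 6.0

# Forward pass: prediction and error.
prediction = (weights * input_data).sum()  # 11.0
error = prediction - target                # 5.0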
24. Code to calculate slopes and update weights
Here is the slope calculation. We use numpy broadcasting, which multiplies an array by a number so that each entry in the array is multiplied by that number. We multiply 2 times the error times the array of input node values. This gives us an array that uses the first node's value for the first calculated slope and the second node's value for the second calculated slope. This is exactly what we wanted. Incidentally, the mathematical term for this array of slopes is a "gradient", and this is where the name gradient descent comes from. We update the weights by some small step in that direction, where the step size is partially determined by the learning rate. And the new error is 2-point-5, which is an improvement over the old error, which was 5. Repeating that process from the new values would give further improvements.
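Continuing the assumed setup above, here is a sketch of that slope calculation and weight update.

# Gradient: 2 * error * input values, via numpy broadcasting.
gradient = 2 * error * input_data                       # array([30., 40.])

learning_rate = 0.01
weights_updated = weights - learning_rate * gradient    # array([0.7, 1.6])

# New prediction and error after one step of gradient descent.
new_prediction = (weights_updated * input_data).sum()   # 8.5
new_error = new_prediction - target                     # 2.5
print(new_error)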
25. Let's practice!
Now it’s your turn.