
The need for optimization

1. The need for optimization

You've seen the forward-propagation algorithm that neural networks use to make predictions. However, the mere fact that a model has the structure of a neural network does not guarantee that it will make good predictions. To see the importance of model weights,

2. A baseline neural network

we'll go back to a network you saw in the previous chapter. We'll use a simple example for the sake of explanation. For the moment, we won't use an activation function in this example, or if you prefer, you might think of an activation function that returns the input, sometimes called the identity function. We have values

3. A baseline neural network

of 2 and 3 for the inputs, and the true value of the target

4. A baseline neural network

is 13. So, the closer our prediction is to 13, the more accurate this model is for this data point. We use forward propagation to fill in the values of the hidden layer. That gives us hidden node values

5. A baseline neural network

of 5 and 1. Continuing forward propagation, we use those hidden node values to make a prediction

6. A baseline neural network

of 9. Since the true target value is 13, our error is

7. A baseline neural network

13 minus 9, which is 4. Changing any weight will change our prediction. Let's see what happens if we change the two weights from the hidden layer to the output. In this case,

8. A baseline neural network

we make the top weight 3 and the bottom weight -2. Now forward propagation gives us a prediction

9. A baseline neural network

of 13. That is exactly

10. A baseline neural network

the value we wanted to predict. So, this change in weights improved the model for this data point.
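The forward pass described above can be sketched in Python. The transcript gives the inputs (2 and 3), the hidden node values (5 and 1), and the predictions (9, then 13), but not every weight; the input-to-hidden weights and the original output weights below are one assumed set consistent with those values.

```python
import numpy as np

input_data = np.array([2, 3])

# Assumed weights that reproduce the values in the example
weights = {
    'node_0': np.array([1, 1]),    # 1*2 + 1*3 = 5
    'node_1': np.array([-1, 1]),   # -1*2 + 1*3 = 1
    'output': np.array([2, -1]),   # 2*5 + (-1)*1 = 9
}

def predict(input_data, weights):
    """Forward propagation with an identity activation function."""
    node_0 = (input_data * weights['node_0']).sum()
    node_1 = (input_data * weights['node_1']).sum()
    hidden_values = np.array([node_0, node_1])
    return (hidden_values * weights['output']).sum()

print(predict(input_data, weights))  # 9, so the error is 13 - 9 = 4

# Change the hidden-to-output weights to 3 and -2, as in the example
weights['output'] = np.array([3, -2])
print(predict(input_data, weights))  # 13, exactly the target
```

The same `predict` function works for both sets of output weights, which is the point of the example: only the weights changed, not the structure of the network.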

11. Predictions with multiple points

Making accurate predictions gets harder with multiple points. First of all, at any set of weights, we have many values of the error, corresponding to the many points we make predictions for. We use something called

12. Loss function

a loss function to aggregate all the errors into a single measure of the model's predictive performance. For example,

13. Squared error loss function

a common loss function for regression tasks is mean-squared error. You square each error,

14. Squared error loss function

and take the average of that as a measure of model quality.

15. Squared error loss function

The loss function aggregates all of the errors into a single score. For an illustration,
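Mean squared error can be written in a few lines of Python; the prediction and target values below are hypothetical, just to show the calculation.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Square each error, then average: a single score for model quality."""
    errors = np.array(targets) - np.array(predictions)
    return (errors ** 2).mean()

# Hypothetical predictions and true values for three data points
print(mean_squared_error([9, 11, 14], [13, 10, 14]))  # (16 + 1 + 0) / 3
```

Squaring keeps positive and negative errors from cancelling out, and it penalizes large errors more heavily than small ones.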

16. Loss function

consider a model with only two weights. We could plot the model's performance for each set of weights like this.

17. Loss function

The values of the weights are plotted on the x and y axes,

18. Loss function

and the loss function is on the vertical or z axis.

19. Loss function

Lower values mean a better model, so our goal is to find the weights giving the lowest value for the loss function. We do this with an algorithm called gradient descent. An analogy may be helpful.

20. Gradient descent

Imagine you are in a pitch dark field, and you want to find the lowest point. You might feel the ground to see how it slopes, and take a small step downhill. This gives an improvement, but not necessarily the lowest point yet. So you repeat this process until it is uphill in every direction. This is roughly how gradient descent works.

21. Gradient descent steps

The steps are: start at a random point, and then, until you are somewhere flat, find the slope and take a step downhill. Let's look at optimizing a model with a single weight, and then we'll scale up
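The steps above can be sketched as a generic loop. Here `loss_gradient` is a hypothetical function that returns the slope of the loss at the current weights, and the learning rate controls the size of each downhill step.

```python
def gradient_descent(weights, loss_gradient, learning_rate=0.01, n_steps=1000):
    """Start at the given point, then repeatedly step downhill."""
    for _ in range(n_steps):
        slope = loss_gradient(weights)              # find the slope
        weights = weights - learning_rate * slope   # take a step downhill
    return weights

# Example: loss = (w - 3)**2 has gradient 2*(w - 3) and its minimum at w = 3
best_weight = gradient_descent(0.0, lambda w: 2 * (w - 3))
print(best_weight)  # close to 3.0
```

When the slope is (near) zero, the update stops moving the weights, which matches the analogy of stopping once it is uphill in every direction.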

22. Optimizing a model with a single weight

to optimizing multiple weights. We have a curve showing

23. Optimizing a model with a single weight

the loss function on the vertical axis, at different values of the weight, which is on the horizontal axis. We are looking for the low point on this curve, because that means our model is as accurate as possible.

24. Optimizing a model with a single weight

We have drawn this tangent line to the curve at our current point. The slope of that tangent line captures the slope of the loss function at our current weight. That slope corresponds to something called

25. Optimizing a model with a single weight

the derivative from calculus. We use this slope to decide what direction we step. In this case,

26. Optimizing a model with a single weight

the slope is positive. So if we want to go downhill,

27. Optimizing a model with a single weight

we go in the direction opposite the slope, towards lower numbers.

28. Optimizing a model with a single weight

If we repeatedly take small steps opposite the slope,

29. Optimizing a model with a single weight

recalculating the slope each time,

30. Optimizing a model with a single weight

we will eventually get to the minimum value. You will see more detail in the next video.
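The single-weight case can be made concrete with a small worked example. The model and numbers below are assumptions for illustration: the prediction is `weight * input_value`, the loss is the squared error, and its derivative with respect to the weight is `2 * input_value * (prediction - target)`.

```python
# Single-weight model: prediction = weight * input_value
# Loss = (prediction - target)**2, so slope = 2 * input_value * (prediction - target)
input_value, target = 3.0, 6.0
weight = 0.0
learning_rate = 0.01

for _ in range(200):
    prediction = weight * input_value
    slope = 2 * input_value * (prediction - target)  # derivative of the loss
    weight = weight - learning_rate * slope          # small step opposite the slope

print(weight)  # close to 2.0, where the prediction equals the target
```

Note that the slope is recalculated on every iteration, as in the transcript: each step changes the weight, which changes the slope at the new point.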

31. Let's practice!
