The need for optimization
1. The need for optimization
You've seen the forward-propagation algorithm that neural networks use to make predictions. However, the mere fact that a model has the structure of a neural network does not guarantee that it will make good predictions.

2. A baseline neural network
To see the importance of model weights, we'll go back to a network you saw in the previous chapter. We'll use a simple example for the sake of explanation. For the moment, we won't use an activation function in this example; or, if you prefer, you can think of an activation function that returns its input, sometimes called the identity function. We have values
of 2 and 3 for the inputs, and the true value of the target is 13. So, the closer our prediction is to 13, the more accurate this model is for this data point. We use forward propagation to fill in the values of the hidden layer. That gives us hidden node values of 5 and 1. Continuing forward propagation, we use those hidden node values to make a prediction of 9. Since the true target value is 13, our error is 13 minus 9, which is 4.

Changing any weight will change our prediction. Let's see what happens if we change the two weights from the hidden layer to the output. In this case, we make the top weight 3 and the bottom weight -2. Now forward propagation gives us a prediction of 13. That is exactly the value we wanted to predict. So, this change in weights improved the model for this data point.
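The walkthrough above can be sketched in code. A minimal sketch: the transcript only states the input values, the resulting hidden node values (5 and 1), and the changed output weights (3 and -2), so the specific input-to-hidden weights and the original output weights below are illustrative assumptions chosen to reproduce those numbers.

```python
# Forward propagation through the 2-input, 2-hidden-node network,
# with an identity activation function (each node passes its input through).

def forward(inputs, hidden_weights, output_weights):
    """Return (prediction, hidden node values) for one data point."""
    hidden = [sum(i * w for i, w in zip(inputs, node_w))
              for node_w in hidden_weights]
    prediction = sum(h * w for h, w in zip(hidden, output_weights))
    return prediction, hidden

inputs = [2, 3]
hidden_weights = [[1, 1], [-1, 1]]   # assumed: yields hidden values 5 and 1

prediction, hidden = forward(inputs, hidden_weights, [2, -1])  # assumed output weights
print(hidden)       # [5, 1]
print(prediction)   # 9, so the error is 13 - 9 = 4

prediction, _ = forward(inputs, hidden_weights, [3, -2])  # the changed weights
print(prediction)   # 13, matching the target exactly
```

Changing only the hidden-to-output weights from [2, -1] to [3, -2] moves the prediction from 9 to the target value of 13.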
11. Predictions with multiple points
Making accurate predictions gets harder with multiple points. First of all, at any set of weights we have many values of the error, one for each point we make predictions for.

12. Loss function
We use something called a loss function to aggregate all of the errors into a single measure of the model's predictive performance.

13. Squared error loss function
For example, a common loss function for regression tasks is mean squared error: you square each error, and take the average of those squared errors as a measure of model quality.
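The mean squared error described above can be written in a few lines. The helper name below is our own; the values plugged in come from the single-point example earlier in the section.

```python
# Mean squared error: square each prediction error, then average.
def mean_squared_error(predictions, targets):
    errors = [p - t for p, t in zip(predictions, targets)]
    return sum(e ** 2 for e in errors) / len(errors)

# With our single data point: prediction 9, target 13 -> error of 4.
print(mean_squared_error([9], [13]))    # 16.0
print(mean_squared_error([13], [13]))   # 0.0, with the improved weights
```

Squaring keeps positive and negative errors from cancelling out, and penalizes large errors more heavily than small ones.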
The loss function aggregates all of the errors into a single score. For an illustration, consider a model with only two weights. We could plot the model's performance for each set of weights: the values of the weights are plotted on the x and y axes, and the loss function is on the vertical, or z, axis. Lower values mean a better model, so our goal is to find the weights that give the lowest value of the loss function. We do this with an algorithm called gradient descent. An analogy may be helpful.

20. Gradient descent
Imagine you are in a pitch-dark field, and you want to find the lowest point. You might feel the ground to see how it slopes, and take a small step downhill. This gives an improvement, but not necessarily the lowest point yet. So you repeat this process until it is uphill in every direction. This is roughly how gradient descent works.

21. Gradient descent steps
The steps are: start at a random point; then, until you are somewhere flat, find the slope and take a step downhill. Let's look at optimizing a model with a single weight, and then we'll scale up to optimizing multiple weights.

22. Optimizing a model with a single weight
We have a curve showing
the loss function on the vertical axis at different values of the weight, which is on the horizontal axis. We are looking for the low point on this curve, because that is where our model is as accurate as possible. We have drawn a tangent line to the curve at our current point. The slope of that tangent line captures the slope of the loss function at our current weight. That slope corresponds to something called the derivative from calculus. We use this slope to decide which direction to step. In this case, the slope is positive, so if we want to go downhill, we go in the direction opposite the slope, towards lower numbers. If we repeatedly take small steps opposite the slope, recalculating the slope each time, we will eventually get to the minimum value. You will see more detail in the next video.

31. Let's practice!
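As a warm-up, the single-weight procedure described above can be sketched in code. A minimal sketch under assumed details: the model is simply prediction = weight * x, the loss is squared error (so the slope has a closed form), and the learning rate of 0.01 is an illustrative choice.

```python
# Gradient descent for a model with a single weight.

def slope(weight, x, target):
    """Derivative of the squared error (weight * x - target)**2
    with respect to the weight."""
    return 2 * x * (weight * x - target)

weight, x, target = 0.0, 2.0, 13.0   # start at an arbitrary point
learning_rate = 0.01

for _ in range(200):
    # Take a small step opposite the slope, then recalculate the slope.
    weight -= learning_rate * slope(weight, x, target)

print(round(weight * x, 3))   # the prediction approaches the target of 13
```

Each update moves the weight a small amount in the downhill direction; as the weight nears the minimum, the slope shrinks toward zero and the steps become smaller on their own.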