
Understanding model optimization

1. Understanding model optimization

At this point, you have a good understanding of how neural networks work, and how to build them in Keras. But you probably don't yet have a great intuition for how to choose things like model architecture and model optimization arguments. You'll learn that in this last chapter.

2. Why optimization is hard

In practice, optimization is a hard problem. The optimal value for any one weight depends on the values of the other weights, and we are optimizing many weights at once. Even if the slope tells us which weights to increase, and which to decrease, our updates may not improve our model meaningfully. A small learning rate might cause us to make such small updates to the model's weights that our model doesn't improve materially. A very large learning rate might take us too far in the direction that seemed good. A smart optimizer like Adam helps, but optimization problems can still occur. The easiest way to see the effect of different learning rates is to use the simplest optimizer,

3. Stochastic gradient descent

Stochastic Gradient Descent, sometimes abbreviated to SGD. This optimizer uses a fixed learning rate. Learning rates around 0.01 are common, but you can specify the learning rate you need with the lr argument (named learning_rate in newer Keras versions), as shown here. We have a function that creates a new model. We create models in a for loop, and each time around we compile the model using SGD with a different learning rate. We pass the optimizer in through the same argument where we previously passed the string for "Adam". In an exercise, you will compare the results of training models with low, medium, and high learning rates. Even if your learning rate is well tuned, you can run into the so-called

4. The dying neuron problem

"dying-neuron" problem. This problem occurs when a neuron's input is less than 0 for all rows of your data. Recall that, with the ReLU activation function, any node with a negative input value produces an output of 0, and it also has a slope of 0, as you see in this graph. Because the slope is 0, the slopes of any weights flowing into that node are also 0, so those weights don't get updated. In other words, once the node starts always getting negative inputs, it may continue getting only negative inputs. It contributes nothing to the model at this point, hence the claim that the node or neuron is "dead." At first, this might suggest using an activation function whose slope is never exactly zero. However, functions of that type were used for many years, and they ran into a problem of their own.
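To make the dead-neuron idea concrete, here is a minimal sketch in plain Python rather than Keras. It assumes a single ReLU node with two input weights, no bias, and a mean squared error loss. The rows, targets, and starting weights are made-up values chosen only for illustration. It computes the slope of the loss with respect to each weight for a node whose input is negative on every row, and for a node whose input is positive.

```python
def relu(z):
    """ReLU activation: negative inputs produce 0."""
    return max(0.0, z)


def relu_slope(z):
    """Slope of ReLU: 0 for negative inputs, 1 for positive inputs."""
    return 1.0 if z > 0 else 0.0


# Illustrative data (assumed): every feature value is positive.
rows = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.1]]
targets = [1.0, 0.5, 2.0]


def weight_slopes(weights):
    """Slope of the mean squared error with respect to each weight."""
    slopes = [0.0] * len(weights)
    for row, target in zip(rows, targets):
        node_input = sum(w * x for w, x in zip(weights, row))
        error = relu(node_input) - target
        for i, x in enumerate(row):
            # Chain rule: loss slope * activation slope * input value.
            slopes[i] += 2 * error * relu_slope(node_input) * x / len(rows)
    return slopes


# Negative weights on positive data: the node input is negative on every row.
dead_weights = [-1.0, -0.5]
dead_slopes = weight_slopes(dead_weights)

# Positive weights: the node input is positive, so the slopes are non-zero.
live_weights = [0.5, 0.5]
live_slopes = weight_slopes(live_weights)

# One gradient descent step leaves the dead node's weights where they were.
learning_rate = 0.01
updated_dead = [w - learning_rate * s for w, s in zip(dead_weights, dead_slopes)]

print("dead node unchanged after update:", updated_dead == dead_weights)
print("live node has a non-zero slope:", any(s != 0 for s in live_slopes))
```

Because every slope for the dead node is zero, no number of further updates can move its weights, which is why the node stays dead.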

5. Vanishing gradients

For example, in an earlier video we used an s-shaped function called tanh. However, values that were outside the middle of the S were

6. Vanishing gradients

relatively flat, or had small slopes. A small but non-zero slope might work in a network with only a few hidden layers. But in a deep network, one with many layers, the repeated multiplication of small slopes causes the slopes to get close to 0, which means updates in backprop are close to 0. This is called the vanishing gradient problem. This in turn might suggest using an activation function that isn't even close to flat anywhere. There is research in this area, including variations on ReLU, though those aren't widely used. For now, it's a phenomenon worth keeping in mind if you are ever pondering why your model isn't training better. If it happens, changing the activation function may be the solution.
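The repeated multiplication can be sketched with a few lines of plain Python. This is a simplification under stated assumptions: it ignores the weights entirely and assumes every layer's tanh node receives the same input value of 2.0, which sits outside the middle of the S. The function name and the input value are illustrative choices, not something taken from the course code.

```python
import math


def tanh_slope(x):
    """Slope of tanh at x, using the identity 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2


def backprop_slope_factor(n_layers, node_input=2.0):
    """Product of the tanh slopes across n_layers layers.

    Assumes every layer sees the same node_input and ignores the weights,
    so this only illustrates the multiplication effect.
    """
    factor = 1.0
    for _ in range(n_layers):
        factor *= tanh_slope(node_input)
    return factor


shallow = backprop_slope_factor(2)   # network with a few hidden layers
deep = backprop_slope_factor(20)     # deep network

print("slope of tanh at 2.0:", tanh_slope(2.0))
print("factor after 2 layers:", shallow)
print("factor after 20 layers:", deep)
```

With two layers the factor is small but still usable. With twenty layers it is so close to 0 that the corresponding weight updates would be negligible.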

7. Let's practice!