1. Hyperparameter tuning in deep learning
In this lesson, we will go over some of the hyperparameters introduced in the last lesson in more detail and discuss how to tune them more effectively.
2. Learning rate and number of iterations
A neural network iteratively updates its weights by minimizing a loss function; for classification, this is typically the log-loss. An optimization algorithm carries out this process using backpropagation, which propagates the training errors back through the network to adjust the weights. The full details are beyond this course, but if you were to plot the loss over time, you would see something like the following: with a good learning rate, the loss drops fairly quickly during training and then stabilizes. A learning rate that is too high overshoots and ends up with a very high loss, whereas one that is too low learns too slowly, so the loss decreases only gradually over time. For simplicity, our models will use the default settings for the learning rate.
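As a minimal sketch of where this setting lives in sklearn, the snippet below uses MLPClassifier, whose learning_rate_init parameter controls the initial learning rate (left at its default here, as we do in this course); the dataset is a hypothetical one generated with make_classification, and loss_curve_ holds the loss values you could plot over the iterations.

```python
# A minimal sketch: the learning rate in sklearn's MLPClassifier is set via
# learning_rate_init (default 0.001); X, y below are hypothetical example data.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Leave learning_rate_init at its default, as we do in this course
model = MLPClassifier(max_iter=200, random_state=42)
model.fit(X, y)

# loss_curve_ stores the loss at each iteration; plotting it shows whether the
# loss drops quickly and then stabilizes
print(model.loss_curve_[:5])
```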
3. Choosing hidden layers
The number of hidden layers, as well as the number of units within those layers, is an important hyperparameter. Larger networks, with more layers and more units, can model more complex functions, but they do not necessarily achieve better test performance than smaller networks. Usually, test performance improves up to a certain level of complexity and then drops off, as larger networks overfit more. For example, in the graph shown, where the y-axis is error and the x-axis is the hidden layer size for a small dataset, performance is best at around 8 to 10 units and is much worse at 1-2 units or 30+ units.
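A minimal sketch of trying a few hidden layer sizes is shown below, assuming the same kind of hypothetical generated dataset as before; in MLPClassifier, hidden_layer_sizes takes a tuple with one entry per hidden layer, and the chosen sizes here are illustrative only.

```python
# A minimal sketch comparing a few hidden layer sizes with cross-validation;
# the data and the sizes tried are hypothetical examples.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=42)

for size in [2, 8, 30]:
    # A single hidden layer with `size` units
    model = MLPClassifier(hidden_layer_sizes=(size,), max_iter=500, random_state=42)
    scores = cross_val_score(model, X, y, cv=3)
    print(size, scores.mean())
```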
4. Grid search
Just as in chapter 3, we can use grid search to optimize over the hyperparameters. As a recap, we create a list of values for each hyperparameter, pass them into a dictionary called param_grid, and then use that dictionary in the GridSearchCV object as follows. In the examples shown, we vary the number of iterations through max_iter and the hidden layer sizes through hidden_layer_sizes. Additionally, because training neural networks is computationally intensive, we can pass an extra argument to the GridSearchCV object called n_jobs, which specifies the number of jobs that can be run in parallel to speed up the computation. Lastly, remember that the best_score_ attribute returns the best score according to the scoring function we specify, and the best_estimator_ attribute returns the model configuration that yielded that best score.
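A minimal sketch of this grid search is shown below; the value lists in param_grid are hypothetical choices, the data is again generated, and n_jobs=-1 requests all available cores.

```python
# A minimal sketch of grid search over max_iter and hidden_layer_sizes;
# the candidate values and the dataset are hypothetical examples.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=42)

param_grid = {
    'max_iter': [200, 500, 1000],
    'hidden_layer_sizes': [(5,), (10,), (20,)],
}

# n_jobs=-1 runs the candidate fits in parallel on all available cores
search = GridSearchCV(MLPClassifier(random_state=42), param_grid,
                      scoring='accuracy', cv=3, n_jobs=-1)
search.fit(X, y)

print(search.best_score_)      # best cross-validated score
print(search.best_estimator_)  # model configuration that achieved it
```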
5. Real life extensions
For real-life large-scale neural networks, there are other hyperparameters to tune and methods to apply. For example, when there is a very large volume of data, training often happens in batches, each called a mini-batch. The batch size then becomes a hyperparameter, and it goes hand in hand with the number of epochs, which is the number of times the algorithm cycles through all of the training data. In the examples we have seen, the number of epochs is simply the number of training iterations, because the entire training dataset is used at once. There are also other concerns, such as how to initialize the weights, which can be done in many ways. Lastly, it is important to mention that building and training these networks is often done with Keras or TensorFlow rather than sklearn, because sklearn's deep learning functionality is more limited. For reference, you can check out some of the courses on DataCamp covering deep learning.
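To illustrate where batch size and epochs appear in practice, here is a minimal Keras sketch (separate from this course's sklearn workflow); the network architecture and the randomly generated binary-classification data are hypothetical, and batch_size and epochs are the mini-batch hyperparameters mentioned above.

```python
# A minimal Keras sketch, assuming a hypothetical binary-classification dataset;
# batch_size and epochs are the hyperparameters discussed in this slide.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X = np.random.rand(500, 20)
y = np.random.randint(0, 2, size=500)

model = Sequential([
    Dense(10, activation='relu', input_shape=(20,)),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Each epoch cycles through all 500 samples in mini-batches of 32
model.fit(X, y, batch_size=32, epochs=5)
```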
6. Let's practice!
Now that we've covered hyperparameter tuning in neural networks, let's practice!