1. Layer initialization and transfer learning
We've explored how neural networks learn by updating weights during training. This final chapter looks at techniques to evaluate models and improve their performance efficiently.
Before we get started, please note the topics here are more advanced and we'll cover them at a high-level.
2. Layer initialization
Data normalization scales input features for stability; similarly, the weights of a linear layer are also initialized to small values. This is known as layer initialization.
Let's create a small linear layer and check its weight range. We observe that the weights are between -0.125 and +0.125.
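A minimal sketch of this check, assuming a layer with 64 inputs and 128 outputs (the sizes are an assumption; with 64 inputs, PyTorch's default initialization gives the -0.125 to +0.125 range mentioned above):

```python
import torch.nn as nn

# Hypothetical layer sizes: 64 input features, 128 output features
layer = nn.Linear(in_features=64, out_features=128)

# By default, PyTorch draws linear-layer weights uniformly from
# [-1/sqrt(in_features), 1/sqrt(in_features)], here [-0.125, 0.125]
print(layer.weight.min(), layer.weight.max())
```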
Why is this important? The output of a neuron in a linear layer is a weighted sum of inputs from the previous layer. Keeping both the input data and layer weights small ensures stable outputs, preventing extreme values that could slow training.
Layers can be initialized in different ways and it remains an active area of research.
3. Layer initialization
PyTorch provides an easy way to initialize layer weights with the nn.init module. For example, here we initialize a linear layer with a uniform distribution. As you can see, the weight values now range from 0 to 1.
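A short sketch of this step, reusing the same hypothetical layer sizes as before; nn.init.uniform_ defaults to the range [0, 1):

```python
import torch.nn as nn

layer = nn.Linear(64, 128)

# Re-initialize the weights in place with a uniform distribution
# (default range is 0 to 1)
nn.init.uniform_(layer.weight)

print(layer.weight.min(), layer.weight.max())
```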
4. Transfer learning
In practice, engineers rarely train a model from randomly initialized weights. Instead, they rely on a concept called transfer learning.
Transfer learning takes a model that was trained on a first task and reuses it for a second task. For example, we trained a model on US data scientist salaries. We now have new data of salaries in Europe.
Instead of training a model using randomly initialized weights, we can load the weights from the first model and use them as a starting point to train on this new dataset.
Saving and loading weights can be done using the torch.save and torch.load functions. These functions work on any kind of PyTorch object.
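One common pattern is to save a layer's state dictionary; a minimal sketch, where the filename and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

layer = nn.Linear(64, 128)

# Save the learned weights to disk (filename is an assumption)
torch.save(layer.state_dict(), 'layer_weights.pth')

# Later, load them into a layer of the same shape as a starting point
new_layer = nn.Linear(64, 128)
new_layer.load_state_dict(torch.load('layer_weights.pth'))
```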
5. Fine-tuning
Sometimes, the second task is similar to the first task and we want to perform a specific type of transfer learning, called fine-tuning.
In this case, we load weights from a previously trained model, but train the model with a smaller learning rate.
We can even train part of a network, if we decide some of the network layers do not need to be trained and choose to freeze them. A rule of thumb is to freeze early layers of the network and fine-tune layers closer to the output layer.
This can be achieved by setting each parameter's requires_grad attribute to False. Here, we use the model's named_parameters() method, which returns the name and the parameter itself. We set requires_grad of the first layer's weight to False.
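A minimal sketch of freezing and fine-tuning, assuming the model is an nn.Sequential container (so the first layer's weight is named '0.weight'); the layer sizes and the smaller learning rate value are illustrative assumptions:

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical model: two linear layers in an nn.Sequential container
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

for name, param in model.named_parameters():
    # In an nn.Sequential, the first layer's weight is named '0.weight'
    if name == '0.weight':
        param.requires_grad = False  # freeze this parameter

# Fine-tune the remaining parameters with a smaller learning rate
optimizer = optim.SGD(model.parameters(), lr=1e-4)
```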
6. Let's practice!
Let's practice.