
ReLU activation functions

1. ReLU activation functions

We've seen how activation functions introduce non-linearity to help neural networks learn complex patterns, and we've learned about gradients and their role within the training loop.

2. Sigmoid and softmax functions

Sometimes activation functions can shrink gradients too much, making training inefficient. So far, we've worked with two activation functions: sigmoid and softmax, which are typically used in a model's final layer.

3. Limitations of the sigmoid and softmax function

We'll begin by understanding some of the limitations of the sigmoid function. The sigmoid's outputs are bounded between 0 and 1, meaning that for any input, the output will always fall in this range. In principle, sigmoid could be used at any point in a network. However, its gradients, shown in orange, are very small for very large and very small values of x. This phenomenon is called saturation. During backpropagation, this becomes problematic because each layer's gradient depends on the one before it, so extremely small gradients are multiplied together and fail to update the weights effectively. This issue is known as the vanishing gradients problem, and it can make training deep networks very difficult. The softmax function, which also produces bounded outputs between 0 and 1, suffers from saturation in a similar way. Therefore, neither of these activation functions is ideal for hidden layers; both are best used in the final layer only.
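To make the saturation concrete, here is a minimal sketch that computes the sigmoid of a large input with PyTorch and inspects its gradient; the input value 10.0 is just an arbitrary large number chosen for illustration.

```python
import torch

# A large input: the sigmoid output is pushed against its upper bound of 1
x = torch.tensor([10.0], requires_grad=True)
y = torch.sigmoid(x)
y.backward()

print(y)       # tensor([1.0000], grad_fn=...)  output saturates near 1
print(x.grad)  # tensor([4.5396e-05])           gradient is nearly zero
```

A gradient this small contributes almost nothing to the weight updates, which is exactly the vanishing gradients problem described above.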

4. ReLU

We'll now look at two widely used activation functions designed for hidden layers, that is, for use between linear layers. The first is the rectified linear unit, or ReLU. ReLU outputs the maximum of its input and zero, as shown in the graph. For positive inputs, the output equals the input; for negative inputs, the output is zero. The function has no upper bound, and its gradients do not approach zero for large values of x, which helps overcome the vanishing gradients problem. In PyTorch, ReLU is available through the torch.nn module, and it is a reliable default activation function for many deep learning tasks.
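As a quick sketch of how this looks in code (the tensor values here are arbitrary examples), ReLU can be applied through torch.nn, and its gradient stays at 1 for positive inputs no matter how large they are:

```python
import torch
import torch.nn as nn

relu = nn.ReLU()

# Negative inputs are clipped to zero; positive inputs pass through unchanged
x = torch.tensor([-2.0, 0.0, 3.0])
print(relu(x))  # tensor([0., 0., 3.])

# Even for a very large input, the gradient does not shrink toward zero
x = torch.tensor([100.0], requires_grad=True)
relu(x).backward()
print(x.grad)   # tensor([1.])
```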

5. Leaky ReLU

The leaky ReLU is a variation of ReLU. For positive inputs, it behaves exactly like ReLU. For negative inputs, it multiplies the input by a small coefficient, which defaults to 0.01 in PyTorch. This keeps the gradients for negative inputs non-zero, so neurons never stop learning entirely, which can happen with standard ReLU. In PyTorch, leaky ReLU is also implemented in the torch.nn module, and the negative_slope parameter controls the coefficient applied to negative inputs.
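Here is a minimal sketch of the same idea with nn.LeakyReLU; the negative_slope value of 0.05 is an arbitrary choice to make the effect visible (PyTorch's default is 0.01), and the input tensor is just an example.

```python
import torch
import torch.nn as nn

# negative_slope controls the coefficient applied to negative inputs
leaky_relu = nn.LeakyReLU(negative_slope=0.05)

x = torch.tensor([-2.0, 0.0, 3.0])
print(leaky_relu(x))  # tensor([-0.1000, 0.0000, 3.0000])
```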

6. Let's practice!

Let's put this into action.