
Activation functions

1. Activation functions

In the previous video, we discussed dense layers. We also briefly introduced the concept of an activation function through the sigmoid function. We will now return to activation functions.

2. What is an activation function?

A typical hidden layer consists of two operations. The first performs matrix multiplication, which is a linear operation, and the second applies an activation function, which is a nonlinear operation.
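Here is a minimal sketch of those two operations in TensorFlow; the input values, weights, and bias are assumptions chosen for illustration.

```python
import tensorflow as tf

# Assumed toy values: one example with two features.
inputs = tf.constant([[0.3, 0.25]])
weights = tf.constant([[1.0], [2.0]])  # hypothetical weights
bias = tf.constant([0.0])              # hypothetical bias

# Operation 1: matrix multiplication, a linear operation.
linear = tf.matmul(inputs, weights) + bias

# Operation 2: an activation function, a nonlinear operation.
output = tf.keras.activations.sigmoid(linear)

print(linear.numpy())  # [[0.8]]
print(output.numpy())  # [[0.69]] (approximately)
```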

3. Why nonlinearities are important

Why do we need this nonlinear component? Consider a simple model using the credit card data. The features are borrower age and credit card bill amount. The target variable is default.

4. Why nonlinearities are important

Let's say we create a scatterplot of age and bill amount. We can see that bill amount usually increases early in life and decreases later in life. This suggests that a high bill for young and older borrowers may mean something different for default. If we want our model to capture this, it can't be linear. It must allow the impact of the bill amount to depend on the borrower's age. This is what an activation function does.

5. A simple example

Let's look at a simple example, where we assume that the weight on age is 1 and the weight on the bill amount is 2. Note that ages are divided by 100 and bill amounts are divided by 10,000. We then perform the matrix multiplication step for all combinations of features: young with a high bill, young with a low bill, old with a high bill, and old with a low bill.

6. A simple example

If we don't apply an activation function and we assume the bias is zero, we find that the impact of bill size on default does not depend on age. In both cases, we predict a value of 0 point 8. Note that our target is a binary variable that is equal to 1 when the borrower defaults; however, predictions will be real numbers between 0 and 1, where values over 0 point 5 will be treated as predicting default.
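Here is a sketch of that matrix multiplication step. The scaled feature values below are assumptions chosen for illustration; only the weights, the scaling, and the zero bias come from the slides.

```python
import tensorflow as tf

# Rows: [age / 100, bill / 10000]; assumed illustrative values.
borrowers = tf.constant([
    [0.3, 0.25],  # young, high bill
    [0.3, 0.05],  # young, low bill
    [0.7, 0.25],  # old, high bill
    [0.7, 0.05],  # old, low bill
])
weights = tf.constant([[1.0], [2.0]])  # weight on age = 1, on bill amount = 2

# Matrix multiplication with a bias of zero and no activation function.
linear = tf.matmul(borrowers, weights)
print(linear.numpy())  # [[0.8], [0.4], [1.2], [0.8]]

# The impact of moving from a low to a high bill is the same at every age:
# young: 0.8 - 0.4 = 0.4; old: 1.2 - 0.8 = 0.4
```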

7. A simple example

But what if we apply a sigmoid activation function? The impact of bill amount on default now depends on the borrower's age. In particular, we can see that the change in the predicted value for default is larger for young borrowers than it is for old borrowers.
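Continuing the same sketch, applying the sigmoid activation to those linear outputs makes the impact of the bill amount depend on age.

```python
import tensorflow as tf

# Linear outputs from the previous sketch.
linear = tf.constant([[0.8], [0.4], [1.2], [0.8]])

# Apply the sigmoid activation function.
predictions = tf.keras.activations.sigmoid(linear)
print(predictions.numpy())  # approximately [[0.69], [0.60], [0.77], [0.69]]

# The change from a low to a high bill is now larger for young borrowers:
# young: 0.69 - 0.60 = 0.09; old: 0.77 - 0.69 = 0.08
```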

8. The sigmoid activation function

In this course, we'll use the three most common activation functions: sigmoid, relu, and softmax. The sigmoid activation function is used primarily in the output layer of binary classification problems. When we use the low-level approach, we'll pass the sum of the product of weights and inputs into tf dot keras dot activations dot sigmoid. When we use the high-level approach, we'll simply pass sigmoid as a parameter to a keras dense layer.
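In code, the two approaches look like this; the inputs and weights are toy values for illustration.

```python
import tensorflow as tf

# Low-level approach: pass the product of inputs and weights
# into tf.keras.activations.sigmoid.
inputs = tf.constant([[0.3, 0.25]])    # assumed toy values
weights = tf.constant([[1.0], [2.0]])
product = tf.matmul(inputs, weights)
dense = tf.keras.activations.sigmoid(product)

# High-level approach: pass sigmoid as a parameter to a keras dense layer.
dense_layer = tf.keras.layers.Dense(1, activation='sigmoid')
```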

9. The relu activation function

We'll typically use the rectified linear unit or relu activation in all layers other than the output layer. This activation simply takes the maximum of the value passed to it and 0.
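A quick sketch of relu on a few assumed values:

```python
import tensorflow as tf

# relu returns the maximum of each value and 0.
values = tf.constant([-2.0, -0.5, 0.0, 1.5])
print(tf.keras.activations.relu(values).numpy())  # [0.  0.  0.  1.5]

# In a hidden layer, using the high-level approach.
hidden = tf.keras.layers.Dense(16, activation='relu')
```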

10. The softmax activation function

Finally, the softmax activation function is used in the output layer of classification problems with more than two classes. Its outputs can be interpreted as predicted class probabilities.
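A quick sketch with assumed raw outputs for three classes:

```python
import tensorflow as tf

# Assumed raw outputs (logits) for one example and three classes.
logits = tf.constant([[2.0, 1.0, 0.1]])

# Softmax maps them to positive values that sum to 1.
probs = tf.keras.activations.softmax(logits)
print(probs.numpy())                 # approximately [[0.66, 0.24, 0.10]]
print(tf.reduce_sum(probs).numpy())  # 1.0

# In an output layer, using the high-level approach.
output_layer = tf.keras.layers.Dense(3, activation='softmax')
```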

11. Activation functions in neural networks

Let's wrap up by applying some activation functions in a neural network. We'll do this using the high-level approach, starting with an input layer. We'll pass this to our first dense layer, which has 16 output nodes and a relu activation. Dense layer 2 then reduces the number of nodes from 16 to 8 and applies a sigmoid activation. Finally, we apply a softmax activation function in the output layer, since there are more than two classes.
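Here is a sketch of that network using the high-level keras functional API; the number of input features and output classes are assumptions, since the slides only say there are more than two classes.

```python
import tensorflow as tf

# Assumed: 2 input features and 4 output classes.
inputs = tf.keras.Input(shape=(2,))

# Dense layer 1: 16 output nodes with a relu activation.
dense1 = tf.keras.layers.Dense(16, activation='relu')(inputs)

# Dense layer 2: reduces 16 nodes to 8 and applies a sigmoid activation.
dense2 = tf.keras.layers.Dense(8, activation='sigmoid')(dense1)

# Output layer: softmax, since there are more than two classes.
outputs = tf.keras.layers.Dense(4, activation='softmax')(dense2)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
```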

12. Let's practice!

We've now seen what an activation function is and how to use the most common activation functions. Let's put that knowledge to work in some exercises!