
The bias-variance tradeoff

1. The bias-variance tradeoff

Hello again. Let's try to identify when we have a well-fitting model.

2. Variance

One way to do this is to consider bias and variance. Variance occurs when a model pays too much attention to the training data and fails to generalize to the testing data. Such models perform well on the training data but poorly on the testing data, and are considered overfit.

3. Overfitting models (high variance)

Overfitting occurs when our model starts to attach meaning to the noise in the training data. In this graphic, you can see the natural quadratic shape of the orange dots. However, our blue prediction line is hugging the data and would likely not extend well to new orange dots. Overfitting is easy to identify, though, because the training error will be much lower than the testing error.
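If you want to see that check outside of the slides, here is a minimal sketch (not part of the video) that uses scikit-learn on a synthetic quadratic dataset rather than the course data, grows deep, fully grown trees, and compares the training error to the testing error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: a noisy quadratic, similar in shape to the slide's data
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Deep, fully grown trees are free to chase the noise in the training data
rf = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=42)
rf.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, rf.predict(X_train))
test_mse = mean_squared_error(y_test, rf.predict(X_test))
print(f"Training MSE: {train_mse:.2f}")  # much lower than the testing MSE
print(f"Testing MSE:  {test_mse:.2f}")   # the gap is the sign of overfitting
```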

4. Bias

The second term, bias, occurs when the model fails to find the relationships between the features and the response. Bias leads to high errors on both the training and testing datasets and is associated with an underfit model.

5. Underfitting models (high bias)

Underfitting occurs when the model cannot find the underlying patterns in the data. This might happen if we don't have enough trees or the trees aren't deep enough. In this example, we have the average of the actual values acting as our prediction. Underfitting is more difficult to identify because the training and testing errors will both be high, and it's hard to know whether we got the most out of the data or whether we can still improve the testing error.
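For contrast, here is a similar sketch (again on synthetic stand-in data, not the course data) where the forest has too few, too shallow trees, so both errors stay high and close together.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic quadratic stand-in as before
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two one-split trees can only predict a handful of distinct values,
# so they miss the quadratic pattern entirely
rf = RandomForestRegressor(n_estimators=2, max_depth=1, random_state=42)
rf.fit(X_train, y_train)

print(f"Training MSE: {mean_squared_error(y_train, rf.predict(X_train)):.2f}")
print(f"Testing MSE:  {mean_squared_error(y_test, rf.predict(X_test)):.2f}")
# Both errors are similarly high; the numbers alone don't tell us whether
# a better parameter set could do much better
```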

6. Optimal performance

When our model is getting the most out of the training data while still performing well on the testing data, we have optimal performance. Notice how the blue line matches the natural quadratic shape of the data without touching every orange dot. The blue line is a well-fit prediction line for future data. So how do we tell if we have a good fit, or if we are just underfitting?

7. Parameters causing over/under fitting

For random forest models, some parameters that affect performance are the maximum depth and the maximum number of features. One way to check for a poorly fit model is to try additional parameter sets and check both the training and testing error metrics. Notice that the overall training accuracy is a bit higher than the testing accuracy. We might have some past experience with this type of data that suggests we can expect a much higher accuracy, so we conclude that we are probably underfitting. As you run more random forest models, you will get a better sense of which parameters you should tweak. But in this case, a max depth of 4 is probably not deep enough.
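As a rough illustration of this check, the sketch below fits one candidate parameter set and prints both accuracies. The make_classification data is a synthetic stand-in for the course dataset, and the specific parameter values are only examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the course uses its own dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One candidate parameter set: shallow trees, a subset of features per split
rf = RandomForestClassifier(n_estimators=50, max_depth=4,
                            max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

print(f"Training accuracy: {accuracy_score(y_train, rf.predict(X_train)):.2f}")
print(f"Testing accuracy:  {accuracy_score(y_test, rf.predict(X_test)):.2f}")
# If both numbers are lower than experience with this kind of data suggests,
# the model is probably underfitting and the trees are not deep enough
```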

8. Parameters continued

This time around, we may have made the max depth too large and are overfitting. Achieving 100% accuracy on the training dataset while only getting 83% on the testing dataset is a clear sign that we are overfitting. We always compare how well the model performs on the data it has seen with how well it performs on the data it has not seen.

9. Parameters continued

Finally, a max depth of 10 has brought the testing accuracy up while also bringing it closer to the training accuracy, indicating that the model generalizes well to new data while still performing well overall. We will never know if 86% is the best accuracy possible for this dataset. However, we have explored various parameter sets, checked the difference between the testing and training errors at each stage, and improved our accuracy by almost 10% over the first model that we created.
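The whole workflow can be sketched as a small loop over max depth values; again, the data and the exact depths here are stand-ins rather than the course's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in [4, 10, None]:  # shallow, moderate, fully grown
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth,
                                random_state=42)
    rf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, rf.predict(X_train))
    test_acc = accuracy_score(y_test, rf.predict(X_test))
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
# Look for the depth where the testing accuracy is highest and closest to
# the training accuracy; that is the model that generalizes well
```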

10. Remember, only you can prevent overfitting!

We will explore parameter tuning later in this course. For now, let's see how changing a single parameter value affects model performance.
