1. Generalization Error
Welcome to chapter 2! In this video, you'll learn what the generalization error of a supervised machine learning model is.
2. Supervised Learning - Under the Hood
In supervised learning, you make the assumption that there's a mapping f between features and labels. You can express this as y=f(x).
f, shown here in red, is an unknown function that you want to determine.
In reality, data generation is always accompanied by randomness, or noise, as illustrated by the blue points shown here.
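To make this concrete, here is a minimal sketch of such a data-generating process; the sine-shaped f and the Gaussian noise level are assumptions chosen for illustration, not the function shown on the slide.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # Hypothetical "true" mapping between feature and label (unknown in practice)
    return np.sin(2 * np.pi * x)

x = rng.uniform(0, 1, size=100)        # features
noise = rng.normal(0, 0.2, size=100)   # randomness that accompanies data generation
y = f(x) + noise                       # observed labels: true signal plus noise
```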
3. Goals of Supervised Learning
Your goal is to find a model fhat that best approximates f.
When training fhat, you want to make sure that noise is discarded as much as possible.
Ultimately, fhat should achieve a low predictive error on unseen datasets.
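One common way to measure that predictive error is the mean squared error on a held-out test set. The sketch below assumes synthetic sine-plus-noise data and a scikit-learn DecisionTreeRegressor; the video's actual dataset and model settings may differ.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, size=200)

# Hold out a test set that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

fhat = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_train, y_train)

# The test-set MSE is a proxy for the predictive error on unseen data
print(mean_squared_error(y_test, fhat.predict(X_test)))
```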
4. Difficulties in Approximating f
You may encounter two difficulties when approximating f.
The first is overfitting: it occurs when fhat fits the noise in the training set.
The second is underfitting: it occurs when fhat is not flexible enough to approximate f.
5. Overfitting
When a model overfits the training set, its predictive power on unseen datasets is pretty low.
This is illustrated by the predictions of the decision tree regressor shown here in red. The model clearly memorized the noise present in the training set.
Such a model achieves a low training-set error and a high test-set error.
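You can reproduce this symptom with a quick sketch: grow a tree with no depth limit on synthetic noisy data (an assumption made here for illustration) and compare the two errors.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# With no depth limit, the tree keeps splitting until it memorizes every noisy training point
overfit_tree = DecisionTreeRegressor(max_depth=None, random_state=1).fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, overfit_tree.predict(X_train)))  # close to 0
print("test MSE: ", mean_squared_error(y_test, overfit_tree.predict(X_test)))    # noticeably higher
```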
6. Underfitting
When a model underfits the data, like the regression tree whose predictions are shown here in red, the training set error is roughly equal to the test set error. However, both errors are relatively high.
Here, the trained model isn't flexible enough to capture the complex dependency between features and labels.
By analogy, it's like teaching calculus to a 3-year-old: the child does not yet have the level of mental abstraction needed to understand calculus.
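The opposite symptom shows up if you constrain the same kind of tree to a single split; again, the synthetic data below is only an illustrative assumption.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A depth-1 tree (a "stump") is too rigid to follow the sine-shaped signal
underfit_tree = DecisionTreeRegressor(max_depth=1, random_state=1).fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, underfit_tree.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, underfit_tree.predict(X_test)))
# Expect both errors to be relatively high and roughly equal
```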
7. Generalization Error
The generalization error of a model tells you how well it generalizes to unseen data.
It can be decomposed into three terms: bias squared, variance, and irreducible error, where the irreducible error is the contribution of noise.
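For squared-error loss, this decomposition is conventionally written as follows, where the expectations are taken over training sets and sigma squared denotes the noise variance (the notation is standard, not taken from the slides):

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```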
8. Bias
The bias term tells you, on average, how much fhat differs from f.
To illustrate this, consider the high-bias model shown here in black; this model is not flexible enough to approximate the true function f, shown in red. High-bias models lead to underfitting.
9. Variance
The variance term tells you how inconsistent fhat is across different training sets.
Consider the high-variance model shown here in black; in this case, fhat follows the training data points so closely that it misses the true function f, shown in red. High-variance models lead to overfitting.
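A rough way to see this inconsistency numerically is to refit the same flexible model on many freshly generated training sets and measure how much its predictions vary at fixed points; the synthetic data below is, again, an assumption for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x_grid = np.linspace(0, 1, 50).reshape(-1, 1)  # fixed evaluation points

# Refit a deep (flexible) tree on many different noisy training sets
predictions = []
for seed in range(200):
    rs = np.random.default_rng(seed)
    X = rs.uniform(0, 1, size=(100, 1))
    y = np.sin(2 * np.pi * X).ravel() + rs.normal(0, 0.3, size=100)
    tree = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X, y)
    predictions.append(tree.predict(x_grid))

# Spread of the predictions across training sets corresponds to the variance term
variance = np.var(np.vstack(predictions), axis=0).mean()
print(variance)
```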
10. Model Complexity
The complexity of a model determines its flexibility in approximating the true function f.
For example, increasing the maximum tree depth increases the complexity of a decision tree.
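As a sketch of that knob in scikit-learn (assuming DecisionTreeRegressor and illustrative depth values):

```python
from sklearn.tree import DecisionTreeRegressor

# A larger max_depth allows more splits, making the tree more flexible (more complex)
simple_tree = DecisionTreeRegressor(max_depth=2)    # low complexity
complex_tree = DecisionTreeRegressor(max_depth=12)  # high complexity
```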
11. Bias-Variance Tradeoff
The diagram here shows how the best model complexity corresponds to the lowest generalization error.
When the model complexity increases, the variance increases while the bias decreases. Conversely, when model complexity decreases, variance decreases and bias increases.
Your goal is to find the model complexity that achieves the lowest generalization error.
Since this error is the sum of three terms, and the irreducible error is constant, you need to find a balance between bias and variance, because as one increases the other decreases. This is known as the bias-variance trade-off.
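One practical sketch of this search, assuming synthetic data and cross-validated MSE as the estimate of generalization error, is to sweep max_depth and keep the depth with the lowest estimated error:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=300)

# Sweep model complexity and estimate generalization error with 5-fold cross-validation
depths = range(1, 11)
cv_mse = []
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1)
    scores = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

best_depth = list(depths)[int(np.argmin(cv_mse))]
print(best_depth)  # complexity with the lowest estimated generalization error
```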
12. Bias-Variance Tradeoff: A Visual Explanation
Visually, you can think of approximating f with fhat as aiming at the center of a shooting target, where the center represents the true function f.
If fhat is low bias and low variance, your shots will be closely clustered around the center.
If fhat has high variance and high bias, not only will your shots miss the center, but they will also be spread all around the target.
13. Let's practice!
Time to put this into practice.