1. Diagnosing Bias and Variance Problems
In this video, you'll learn how to diagnose bias and variance problems.
2. Estimating the Generalization Error
Given that you've trained a supervised machine learning model labeled fhat, how do you estimate fhat's generalization error?
This cannot be done directly because:
- f is unknown,
- usually you only have one dataset,
- you don't have access to the error term due to noise.
3. Estimating the Generalization Error
A solution to this is to first split the data into a training and test set.
The model fhat can then be fit to the training set and its error can be evaluated on the test set.
The generalization error of fhat is roughly approximated by fhat's error on the test set.
4. Better Model Evaluation with Cross-Validation
Usually, the test set should be kept untouched until one is confident about fhat's performance. It should only be used to evaluate fhat's final performance or error.
Now, evaluating fhat's performance on the training set would produce an optimistic estimate of the error because fhat was already exposed to the training set when it was fit.
To obtain a reliable estimate of fhat's performance, you should use a technique called cross-validation or CV.
CV can be performed using K-fold CV or hold-out CV. In this lesson, we'll only be explaining K-fold CV.
5. K-Fold CV
The diagram here illustrates this technique for K=10:
- First, the training set (T) is split randomly into 10 partitions or folds,
- The error of fhat is evaluated 10 times on the 10 folds,
- Each time, one fold is picked for evaluation after training fhat on the other 9 folds.
- At the end, you'll obtain a list of 10 errors.
6. K-Fold CV
Finally, as shown in this formula, the CV error is computed as the mean of the 10 obtained errors: CV error = (E_1 + E_2 + ... + E_10) / 10.
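To make the mechanics concrete, here is a minimal sketch of 10-fold CV done by hand with scikit-learn's KFold; the toy data and the estimator settings are assumptions for illustration, and the lesson itself will use cross_val_score() instead.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Toy data standing in for a real training set (assumption, not the lesson's dataset)
X = np.random.rand(100, 3)
y = np.random.rand(100)
fhat = DecisionTreeRegressor(max_depth=4, random_state=1)

kf = KFold(n_splits=10, shuffle=True, random_state=1)
errors = []
for train_idx, eval_idx in kf.split(X):
    # Train on 9 folds, then evaluate the error on the held-out fold
    fhat.fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[eval_idx], fhat.predict(X[eval_idx])))

# The CV error is the mean of the 10 fold errors
cv_error = np.mean(errors)
```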
7. Diagnose Variance Problems
Once you have computed fhat's cross-validation error, you can check whether it is greater than fhat's training set error.
If it is greater, fhat is said to suffer from high variance.
In this case, fhat has overfit the training set.
To remedy this, try decreasing fhat's complexity.
For example, in a decision tree you can reduce the maximum tree depth or increase the minimum number of samples per leaf.
You can also gather more data to train fhat.
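For instance, with scikit-learn's DecisionTreeRegressor, decreasing complexity could look like this; the parameter values are arbitrary choices for illustration, not values from the lesson.

```python
from sklearn.tree import DecisionTreeRegressor

# Deep tree with tiny leaves: high complexity, prone to overfitting
dt_high_variance = DecisionTreeRegressor(max_depth=10, min_samples_leaf=1)

# Shallower tree with larger leaves: lower complexity, less variance
dt_simpler = DecisionTreeRegressor(max_depth=3, min_samples_leaf=0.1)
```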
8. Diagnose Bias Problems
On the other hand, fhat is said to suffer from high bias if its cross-validation error is roughly equal to the training error but much greater than the desired error.
In this case, fhat underfits the training set. To remedy this, try increasing the model's complexity or gathering more relevant features for the problem.
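The two diagnoses can be summarized in a small helper function. This is a rough sketch: the 5% closeness tolerance is an arbitrary assumption, not a standard rule.

```python
def diagnose(train_error, cv_error, desired_error, tol=0.05):
    """Rough bias/variance diagnosis from three error estimates (sketch)."""
    if cv_error > train_error * (1 + tol):
        # CV error noticeably above training error: overfitting
        return "high variance: decrease complexity or gather more data"
    if cv_error > desired_error * (1 + tol):
        # CV and training errors are close, but both exceed the desired error
        return "high bias: increase complexity or gather more relevant features"
    return "no clear bias or variance problem"
```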
9. K-Fold CV in sklearn on the Auto Dataset
Let's now see how we can perform K-fold cross-validation using scikit-learn on the Auto dataset, which is already loaded.
In addition to the usual imports, you should also import the function cross_val_score() from sklearn.model_selection.
First, split the dataset into 70%-train and 30%-test using train_test_split().
Then, instantiate a DecisionTreeRegressor() named dt with the parameter max_depth set to 4 and min_samples_leaf set to 0.14.
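Put together, the setup looks like this. The variable names X and y and the random_state value are assumptions, since the lesson only describes the steps verbally.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE

# Split the already-loaded Auto dataset into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123)

# Decision tree regressor with limited depth and a fractional minimum leaf size
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.14, random_state=123)
```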
10. K-Fold CV in sklearn on the Auto Dataset
Next, call cross_val_score(), passing dt, X_train, and y_train; set the parameter cv to 10 for 10-fold cross-validation and scoring to neg_mean_squared_error to compute the negative mean squared errors. The scoring parameter is set this way because cross_val_score() does not allow computing the mean squared errors directly. Finally, set n_jobs to -1 to exploit all available CPUs in the computation.
The result is a numpy array of the 10 negative mean squared errors achieved on the 10 folds. You can multiply the result by -1 to obtain an array of CV MSEs.
After that, fit dt to the training set and predict the labels of the training and test sets.
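Continuing the sketch above, the call and the follow-up fit and predict steps would be:

```python
# 10-fold CV; scoring='neg_mean_squared_error' returns negative MSEs,
# so flip the sign to get the 10 CV mean squared errors
MSE_CV = - cross_val_score(dt, X_train, y_train,
                           cv=10,
                           scoring='neg_mean_squared_error',
                           n_jobs=-1)

# Fit dt to the training set and predict labels for both sets
dt.fit(X_train, y_train)
y_pred_train = dt.predict(X_train)
y_pred_test = dt.predict(X_test)
```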
11. K-Fold CV in sklearn on the Auto Dataset
The CV mean squared error can be determined as the mean of the MSE_CV array.
Finally, you can use the function MSE to evaluate the training and test set mean squared errors.
Given that the training set error is smaller than the CV error, we can deduce that dt overfits the training set and suffers from high variance.
Notice how the CV and test set errors are roughly equal.
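In code, again following the sketch above, the three errors can be printed like so; the actual numbers depend on the data.

```python
print('CV MSE: {:.2f}'.format(MSE_CV.mean()))
print('Train MSE: {:.2f}'.format(MSE(y_train, y_pred_train)))
print('Test MSE: {:.2f}'.format(MSE(y_test, y_pred_test)))
```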
12. Let's practice!
Now it's your turn.