
Model generalization: bootstrapping and cross-validation

1. Model generalization: bootstrapping and cross-validation

Welcome to Chapter 4, where we'll be covering model selection and evaluation techniques!

2. Chapter 4 overview

In this lesson, we'll start by diving a little deeper into bootstrapping and cross-validation methods as they apply to model generalization. Then we'll go over what to do when you have a classification model with imbalanced classes, what to do when you have highly correlated features, and finally, how to choose between ensemble models.

3. Model generalization

When data is split into train and test, we evaluate model performance with the test set only once we're confident in our trained model. What we're really doing is ensuring that our model is able to perform well on data it hasn't yet seen. Ultimately, similar evaluation metrics between the training and test sets are an indicator of model generalizability. Bootstrapping is one of the methods that helps with model generalization, and to better understand it, let's take a look at the decision tree algorithm.
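As a minimal sketch of that idea (the dataset, model, and split size here are just illustrative assumptions), comparing training and test accuracy is a quick generalization check:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the data into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Similar train and test scores suggest the model generalizes well;
# a large gap between them suggests overfitting
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
```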

4. Decision tree

Decision trees are a supervised learning technique used to build predictive ML models for either categorical or continuous target variables. This image is a decision tree plot built from the classic iris dataset.

5. Decision tree nodes

The top of the decision tree is called the root node, where we see a split was made on petal width less than or equal to 0.8. The criterion used to make that split, the gini index, is a measure of impurity. Samples corresponds to the number of observations at that node, and value is the number of observations in each class, split here as 50 each. The observations that evaluate to true are sent to the left, with the remainder going to the right.
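A plot like this can be reproduced directly in scikit-learn. Here's a minimal sketch (figure size and random state are illustrative assumptions, not the course's exact settings):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()

# Fit a decision tree; gini is the default impurity criterion
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(iris.data, iris.target)

# Each node in the plot shows the split rule, gini, samples, and value
plt.figure(figsize=(12, 8))
plot_tree(tree, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()
```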

6. Advantages vs disadvantages

And so it continues, making splits and directing observations until the decision is made that they belong to a particular class, where they end up in a terminal or leaf node. If this were a regression tree instead, the split criterion would be the lowest mean squared error rather than gini. The advantages of using decision trees are that they are easy to understand and plot. The disadvantages are that they easily overfit, they are considered greedy in that they may not return globally optimal trees, and they are biased in cases of class imbalance, which we'll get to in the next lesson.
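Circling back to the regression case, here's a minimal sketch of a regression tree that splits on squared error rather than gini (the dataset and max_depth are illustrative assumptions; note that older scikit-learn versions name this criterion "mse"):

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

# Load a regression dataset (continuous target)
X, y = load_diabetes(return_X_y=True)

# Regression trees split on mean squared error instead of gini
reg_tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3,
                                 random_state=42)
reg_tree.fit(X, y)

print(reg_tree.score(X, y))  # R^2 on the training data
```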

7. Random Forest

A random forest is simply a bootstrapped ensemble of many decision trees! Recall that bootstrapping is a sampling technique where subsets of the data are selected with replacement; each tree is trained on its own bootstrap sample, and averaging the output predictions across trees reduces variance, producing a more accurate model.
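As a minimal sketch (the number of trees and the split size are illustrative assumptions), a random forest can be fit on the same iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample of the training data
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))  # test-set accuracy
```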

8. K-fold cross-validation

K-fold cross-validation is another tool we can use to help models generalize, since it helps prevent model overfitting. The way it works is that the training data is split into k folds. One fold is held out and used as the test set while the remaining folds are used for model training. This continues in an iterative manner until each of the folds has been used as the test set once.
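Here's a minimal sketch of k-fold cross-validation with scikit-learn (using 5 folds, an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Evaluate a decision tree with 5-fold cross-validation
tree = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(tree, X, y, cv=5)

# One accuracy score per fold, plus the average across folds
print(scores)
print(scores.mean())
```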

9. Functions

The functions you'll use in the exercises are the decision tree and random forest classifiers, GridSearchCV for a cross-validated grid search, and accuracy_score for model accuracy. After the grid search is fit, the best model parameters are given by the best_params_ attribute and the best cross-validated accuracy by best_score_.
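Putting those pieces together, here's a minimal sketch of a cross-validated grid search over a random forest (the parameter grid and split size are illustrative assumptions, not the exercise values):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Candidate hyperparameter values to search over (illustrative)
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, 6]}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)  # best parameter combination found
print(grid.best_score_)   # best cross-validated accuracy

# Evaluate the refit best model on the held-out test set
y_pred = grid.predict(X_test)
print(accuracy_score(y_test, y_pred))
```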

10. GridSearchCV vs RandomizedSearchCV

The GridSearchCV function tests only the parameter values you specify on a grid in order to give parameter estimates. Since this doesn't search the entire space, it is also good to be aware of the RandomizedSearchCV function, as it samples the parameter space and is more likely to come up with an optimal parameter estimate. Depending on how many parameter settings are sampled, a random search can have a longer run time, so keep that in mind. Also worth noting, k-fold cross-validation can be used with many ML algorithms, not just the handful you've learned about in this course.
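For comparison, here's a minimal sketch of RandomizedSearchCV (the parameter distributions and number of iterations are illustrative assumptions):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample hyperparameter values from distributions instead of a fixed grid
param_dist = {"n_estimators": randint(50, 200), "max_depth": randint(2, 10)}

rand_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                 param_dist, n_iter=20, cv=5, random_state=42)
rand_search.fit(X, y)

print(rand_search.best_params_)
print(rand_search.best_score_)
```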

11. Let's practice!

Your turn to try model generalization techniques!