Random forests

1. Random forests

Now that we've grown some decision trees, let's plant a whole forest.

2. Bias-variance tradeoff

Random forests were created to reduce the variance of decision trees. Decision trees can fit the training data well but often perform poorly on test data. That makes them a high-variance model, like the polynomial fit to the data shown here: its predictions vary greatly depending on the training set, and it predicts poorly on the test set.

3. High bias

The opposite of a high-variance model is a high-bias model, like this linear fit. It captures the general trend but misses the smaller details. Random forests strike a balance between high-variance and high-bias models.

4. Random forests

Random forests are named as such because they're a collection of decision trees. Here's a 4-tree random forest.

5. Bootstrap aggregating (bagging)

Random forests differ from single decision trees in how each tree is created. First, we sample with replacement from our training set to get a dataset for each tree we fit. This means we take samples from our training set, shown on top, with each data point being drawn from the whole dataset. This is bootstrapping, and the bootstrapped sample is shown on the bottom. Bootstrapping means we can have repeated points in our sample, like the repeated points you can see in the bootstrapped sample. We may also omit some data points, like the missing 2 in the bootstrapped sample.
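
As a rough sketch, bootstrapping one sample with numpy might look like this; the tiny data array here is made up just for illustration:

    import numpy as np

    # Tiny made-up "training set" of numbered data points
    data = np.array([1, 2, 3, 4, 5])

    # Sample with replacement to build one bootstrapped dataset;
    # some points can repeat and others can be left out entirely
    rng = np.random.default_rng(42)
    bootstrap_sample = rng.choice(data, size=data.shape[0], replace=True)
    print(bootstrap_sample)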

6. Feature sampling

So random forests are an ensemble of decision trees and use bootstrapping to get a dataset for each tree, which is called bootstrap aggregating, or bagging. Another difference between ordinary decision trees and random forests is how splits happen. Instead of considering all features at every split, we randomly sample a smaller number of features for each split. This helps reduce the variance of our random forest model.

7. sklearn implementation

We implement random forests in sklearn much like decision trees. After importing the class, we create a new instance. Then we fit it to our features and targets and evaluate performance with the score function. For regression, score returns the R-squared value: 1 means perfect predictions, 0 means the model does no better than predicting the mean of the targets, and negative values are worse still.
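
A minimal sketch of that workflow might look like the following, assuming the train_features, train_targets, test_features, and test_targets arrays already exist from earlier in the course:

    from sklearn.ensemble import RandomForestRegressor

    # Create a new random forest instance
    rfr = RandomForestRegressor(random_state=42)

    # Fit to the training features and targets
    rfr.fit(train_features, train_targets)

    # score() returns the R-squared value on the data we pass in
    print(rfr.score(train_features, train_targets))
    print(rfr.score(test_features, test_targets))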

8. Hyperparameters

Random forests have many hyperparameters we can tune. We often want to tune max_features and max_depth. max_features is the number of features randomly chosen at each split; it's the square root of the total number of features by default. I often search a range from 2 or 3 up to the total number of features for this hyperparameter. max_depth limits how deep each tree can grow, that is, the number of successive splits, and typically ranges from 5 to 20. The n_estimators hyperparameter is the number of trees in the forest. We should set this to a larger number than the default of 10; performance typically flattens out once we're in the hundreds of trees. Lastly, it's always a good idea to set random_state so our results are reproducible.
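
For instance, creating a forest with these hyperparameters set might look like this sketch; the specific values are only placeholders, not tuned results:

    from sklearn.ensemble import RandomForestRegressor

    # Illustrative settings only -- these are not tuned values
    rfr = RandomForestRegressor(n_estimators=200,  # more trees than the default of 10
                                max_depth=5,       # cap how deep each tree can grow
                                max_features=4,    # features sampled at each split
                                random_state=42)   # make results reproducible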

9. ParameterGrid

sklearn has ParameterGrid to help search hyperparameters. We could also use GridSearchCV with TimeSeriesSplit in sklearn, but we don't have time to cover that in this course. ParameterGrid creates every combination of the entries in a dictionary we provide. Each entry in the dictionary should be a list, even if it holds a single value. Here's an example.
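
A sketch of that example might look like the following; the hyperparameter values here are illustrative:

    from sklearn.model_selection import ParameterGrid

    # Every value must be a list, even single values like random_state
    grid = {'n_estimators': [200],
            'max_depth': [3, 5, 10],
            'max_features': [4, 8],
            'random_state': [42]}

    # ParameterGrid yields one dictionary per combination of entries
    for g in ParameterGrid(grid):
        print(g)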

10. ParameterGrid

Once we have the ParameterGrid, we loop through it. To set the hyperparameters, we "unpack" each dictionary into individual keyword arguments with the double asterisk. Then we fit to the train set, evaluate performance on the test set, and append the score to a list. We use numpy's argmax() function to get the index of the best test score, then retrieve the best hyperparameters from our ParameterGrid. This gives a max_depth of 5, max_features of 8, and 200 trees.
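
Putting that together, a sketch of the loop might look like this, assuming the grid dictionary from the previous sketch and the same train/test arrays:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import ParameterGrid

    test_scores = []

    # Try every hyperparameter combination in the grid
    for g in ParameterGrid(grid):
        rfr = RandomForestRegressor(**g)  # ** unpacks the dict as keyword arguments
        rfr.fit(train_features, train_targets)
        test_scores.append(rfr.score(test_features, test_targets))

    # argmax() gives the index of the best test score
    best_idx = np.argmax(test_scores)
    print(test_scores[best_idx], ParameterGrid(grid)[best_idx])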

11. Plant some random forests!

Ok, it's time to see how well random forests perform!