Random forests

1. Random forests

In this lesson, you will learn about random forests, and how to fit a random forest model with the ranger package.

2. Random Forests

Random forests try to resolve the issues with decision tree models by building multiple trees from the training data. Using slightly different data to build each tree adds diversity to the models. Averaging the results of multiple trees reduces the risk of overfitting. Multiple trees also give finer-grained predictions than a single tree.

3. Building a Random Forest Model

Each individual tree is grown from a random sample of the training data. For a single tree, each node is formed by picking a variable to split the data on, and a value to make the split. In a random forest, the set of candidate variables considered at each node is also randomly selected. All this randomization gives diversity to the set of trees. After all the trees are grown, the model makes a prediction on a datum by running it through all the trees and averaging the results, as in the sketch below.
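
To make the averaging idea concrete, here is a minimal hand-rolled sketch that uses rpart for the individual trees. It only randomizes the data each tree sees (bagging); a true random forest, like the one ranger fits, also randomizes the candidate variables at each split. The function names here are illustrative, not from any package.

```r
library(rpart)

# Grow ntrees regression trees, each on a bootstrap sample of the data.
fit_forest <- function(fmla, data, ntrees = 50) {
  lapply(seq_len(ntrees), function(i) {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # resample rows with replacement
    rpart(fmla, data = boot)
  })
}

# Predict by running each row through every tree and averaging the results.
predict_forest <- function(trees, newdata) {
  preds <- sapply(trees, predict, newdata = newdata)  # one column per tree
  rowMeans(preds)
}
```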

4. Example: Bike Rental Data

For our example, we'll return to the bike rental data from a previous lesson, and predict hourly bike rental rates from the time of day, the type of day, and the weather conditions. We'll train a model on data from January, and evaluate the model on data from February.
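
As a sketch of that setup, assuming the hourly data sits in a single data frame with a month column (the object and column names here are hypothetical; the actual names in the exercises may differ):

```r
# Hypothetical names: `bikes` is the full hourly data frame and
# `mnth` is its month-of-year column (1 = January, 2 = February).
bikesJan <- subset(bikes, mnth == 1)
bikesFeb <- subset(bikes, mnth == 2)
```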

5. Random Forests with ranger()

We'll use the ranger package to fit a random forest. The ranger() function takes a formula, the training data, and the number of trees. If the outcome variable is numeric, ranger will automatically do regression rather than classification. By default, ranger builds 500 trees; we recommend using at least 200. The argument respect.unordered.factors tells ranger how to treat categorical variables. We recommend setting it to "order", which causes ranger to safely and meaningfully encode the categorical variables as numbers. This encoding runs faster than converting a categorical variable to indicator variables when the variable has a very large number of possible values.
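
A sketch of the call, assuming the outcome column is cnt and a set of time-of-day, day-type, and weather predictors (the exact column names are assumptions about the data):

```r
library(ranger)

fmla <- cnt ~ hr + holiday + workingday + weathersit +
              temp + atemp + hum + windspeed

bike_model_rf <- ranger(fmla,
                        bikesJan,
                        num.trees = 500,                      # the default; use at least 200
                        respect.unordered.factors = "order",  # safe numeric encoding of categoricals
                        seed = 42)                            # for reproducibility
```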

6. Random Forests with ranger()

Printing a ranger model displays what are called out-of-bag estimates of the model's R-squared and mean squared error. These are the algorithm's estimates of how the model will perform on future data. When possible, you should still evaluate the model directly on test data.
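
Continuing the sketch above:

```r
# For a regression forest, the printed summary includes the lines
# "OOB prediction error (MSE)" and "R squared (OOB)".
bike_model_rf
```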

7. Predicting with a ranger() model

The predict function for a ranger model takes the model and a new dataset. It returns an object with a field called predictions, containing the predicted outcomes. Here we get the model's predictions for the February data.
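
Continuing with the hypothetical object names from above:

```r
# predict() on a ranger model returns an object whose $predictions
# field holds the predicted values; attach them to the test frame.
bikesFeb$pred <- predict(bike_model_rf, bikesFeb)$predictions
```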

8. Evaluating the model

On the February data, our random forest model has a root mean squared error of 67.15, a slight improvement on the RMSE of 69.3 for the quasipoisson model from Chapter 3.
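
The RMSE calculation itself is straightforward; the exact value depends on the data and the random seed.

```r
# Root mean squared error: the root of the mean squared residual.
rmse <- sqrt(mean((bikesFeb$cnt - bikesFeb$pred)^2))
rmse
```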

9. Evaluating the model

We can also compare the predictions to the actual hourly bike rentals via a scatterplot.
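
One way to draw it, continuing with the assumed column names:

```r
# Predictions versus actual rentals; points near the line y = x
# correspond to accurate predictions.
library(ggplot2)

ggplot(bikesFeb, aes(x = pred, y = cnt)) +
  geom_point() +
  geom_abline(color = "blue")
```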

10. Evaluating the model

We can also plot the predictions and the actual rentals as a function of time.
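
A sketch of that plot, assuming the data frame has a time-index column called instant:

```r
library(ggplot2)
library(tidyr)
library(dplyr)

# Stack the actual (cnt) and predicted (pred) values into one column
# so both series can be drawn against time in a single plot.
bikesFeb %>%
  pivot_longer(c(cnt, pred), names_to = "valuetype", values_to = "value") %>%
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
  geom_point() +
  geom_line() +
  ggtitle("Predicted and actual hourly bike rentals, February")
```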

11. Let's practice!

Now let's practice building and fitting random forest models.