1. Random forest
Welcome back!
Now you know that bagged trees are a big improvement over single decision trees.
You might have heard of random forests, which are an improvement upon bagged trees.
2. Random forest
Random forests are particularly suited for high-dimensional data.
Because of their ease of use and out-of-the-box performance, random forests are a very popular machine learning algorithm and are implemented in a variety of packages, like ranger or randomForest.
Tidymodels has you covered: the rand_forest() function in the parsnip package provides an interface to these implementations.
3. Idea
The basic idea behind random forest is identical to bagging - both are ensembles of trees trained on bootstrapped samples of the training data.
However, in the random forest algorithm, there is a slight tweak to the way the decision trees are built that leads to better performance.
The key difference is that when training the trees that make up the ensemble, we add a bit of extra randomness to the model - hence the name, random forest.
At each split in the tree, rather than considering all features, or input variables, for the split, we sample a subset of these features, or predictors, and consider only these few variables as candidates for the split.
4. Intuition
Let's sharpen our intuition using this picture.
In this case, there are four trees, and at each split, a different subset of predictor variables, or features, is considered for making that split.
Each tree then gives a vote for a class, and the majority vote is the final class prediction.
How can using fewer predictors lead to better performance?
Well, adding this extra bit of randomness leads to a collection of trees that are further de-correlated (or more different) from one another.
So, random forest improves upon bagging by reducing the correlation between the sampled trees.
5. Coding: Specify a random forest model
To run the random forest algorithm, you will use the rand_forest() function from the parsnip package. Let's take a look at this call.
The hyperparameters here are: mtry, the number of predictors randomly sampled as candidates at each split (by default, the square root of the total number of predictors); trees, the size of your forest; and min_n, which you know from decision trees - the minimum number of data points a node must contain to be split further.
As usual, you set the mode using set_mode() and the engine using set_engine().
You can use the ranger engine or the randomForest engine here.
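As a rough sketch, a call that sets all three hyperparameters explicitly could look like this (the values shown are placeholders for illustration, not tuned recommendations):

    library(tidymodels)

    # Sketch: hyperparameter values below are placeholders, not recommendations
    rand_forest(
      mtry = 4,     # number of predictors sampled at each split
      trees = 500,  # number of trees in the forest
      min_n = 10    # minimal node size
    ) %>%
      set_mode("classification") %>%
      set_engine("ranger")  # or set_engine("randomForest")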
6. Coding: Specify a random forest model
A complete sample specification looks like this:
Call the rand_forest() function with trees, the size of the forest, set to 100. Use the classification mode and the ranger engine.
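Written out, that specification looks like this (the object name rf_spec is just illustrative):

    # 100 trees, classification mode, ranger engine
    rf_spec <- rand_forest(trees = 100) %>%
      set_mode("classification") %>%
      set_engine("ranger")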
You can always add more trees to the ensemble - in a random forest, more trees almost always mean better performance.
7. Training a forest
The syntax for training a random forest model follows standard conventions for parsnip models.
We take our model specification and, using the familiar formula interface, declare still_customer as the outcome variable and all other columns as the input variables.
We choose the data to be customers_train, the training data of credit card customers.
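A minimal sketch of this training step, assuming the specification above is stored in rf_spec:

    # Fit the forest: still_customer as outcome, all other columns as predictors
    rf_fit <- rf_spec %>%
      fit(still_customer ~ ., data = customers_train)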
8. Variable importance
When specifying the engine, you can also specify how variable importance is measured, using the importance argument.
Possible values for the ranger engine are impurity or permutation.
We'll use impurity here and train a model on the customers_train training data using all predictors.
We pass the result to the vip() function from the vip package, which computes and plots the variable importance of our predictors.
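A sketch of that workflow, assuming the vip package is installed and customers_train is available:

    library(vip)

    # Request impurity-based variable importance from the ranger engine
    rf_vi_fit <- rand_forest(trees = 100) %>%
      set_mode("classification") %>%
      set_engine("ranger", importance = "impurity") %>%
      fit(still_customer ~ ., data = customers_train)

    # Compute and plot variable importance scores
    vip(rf_vi_fit)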
This way, you get an intuition for which predictors are more important in your dataset. Here we see that the total amount of transactions is the most helpful predictor.
9. Let's plant a random forest!
Now it's your turn to grow your first random forest!