
Random forest models

1. Random forest models

Now that we've seen how to use lasso regression for dimensionality reduction, let's take a look at how we can reduce dimensionality using a random forest model.

2. Random Forest

A random forest classifier is an ensemble model. We can think of ensemble models as a way to draw upon the wisdom of the crowd: they aggregate the predictions of many models to make a final prediction. Random forest models aggregate across uncorrelated trees built from random subsets of features. This aggregation helps mitigate the error from any one tree, avoid overfitting, and produce an accurate final prediction. A random forest classifier also naturally performs feature selection: features that contain more information are used more often, and features that contain less information are used less often. After the model is trained, we can retrieve these feature importances. Thus, as with lasso regression, the weight of less important features is reduced toward zero.

3. Random Forest

When we fit a random forest to the credit dataset, it may look like this. Notice that the first tree on the left uses outstanding debt at the root of the tree and the number of credit cards at the second-level node to classify the individual as having standard credit. Features closer to the root of the tree tend to be more important. By averaging how often features are used to make decisions and how close they are to the root, the model establishes feature importances.

4. Train a Random Forest

We can train a random forest model using tidymodels. We call the rand_forest() function, setting mode to classification and trees to two hundred to train two hundred trees. We explicitly call the set_engine() function to set the importance parameter to impurity; this is needed so we can extract feature importances later. The engine argument is required by set_engine(), so we explicitly specify ranger, the default engine for rand_forest(). We then fit the model to the training data. Notice that we train a full model, using all the available features. Finally, we predict the test data using predict() and bind the predictions to the test data frame.
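A minimal sketch of this step, assuming the splits live in data frames called train and test with a target column named credit_score (these names are illustrative, not taken from the course code):

library(tidymodels)

# Specify a random forest classifier with 200 trees; importance = "impurity"
# tells the ranger engine to record feature importances for later extraction
rf_spec <- rand_forest(mode = "classification", trees = 200) %>%
  set_engine("ranger", importance = "impurity")

# Fit the full model, using all available features
rf_fit <- rf_spec %>%
  fit(credit_score ~ ., data = train)

# Predict the test data and bind the predictions to the test data frame
predictions_df <- predict(rf_fit, test) %>%
  bind_cols(test)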

5. Evaluate the Model

We call the f_meas() function to evaluate the model performance. We pass f_meas() the prediction data frame and specify the columns that contain the truth values and predicted values. With the full model we achieve an F1 score of approximately sixty-nine percent.
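In code, that evaluation might look like this, assuming the predictions_df data frame from the previous sketch:

# Compute the F1 score; truth is the actual class, estimate the predicted class
predictions_df %>%
  f_meas(truth = credit_score, estimate = .pred_class)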

6. Variable Importance

We can view the feature importances using the vip() function from the vip package. The rf_fit object is passed to vip(). Here we can see that outstanding debt, interest rate, and delay from due date are the most important features.
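As a sketch, assuming the rf_fit object from earlier:

library(vip)

# Plot the feature importances extracted from the fitted random forest
vip(rf_fit)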

7. Feature Mask

We can extract the most important features using the vi() function. We pass vi() the fit object and set the rank parameter to TRUE so that vi() will rank-order the features by importance. We then use filter() to keep the top ten features. vi() produces a data frame with Variable and Importance columns. The Variable column contains the feature names, so we pull those names out into a vector.
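A sketch of this step; the exact ranking semantics of rank = TRUE are assumed here (rank 1 = most important), so treat the filter condition as illustrative:

# Rank features by importance and keep the ten highest-ranked
top_features <- vi(rf_fit, rank = TRUE) %>%
  filter(Importance <= 10) %>%
  pull(Variable)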

8. Reduce the data

We can then use the top features list to reduce the dimensionality of the train and test data.
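For example, assuming the illustrative names used above:

# Keep only the target and the top ten features in both splits
train_reduced <- train %>%
  select(credit_score, all_of(top_features))

test_reduced <- test %>%
  select(credit_score, all_of(top_features))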

9. Performance

When we refit the random forest model to the reduced data, the F1 score drops from sixty-nine percent to sixty-seven-point-four percent. Depending on the problem requirements, this drop in performance may not be acceptable, for example, when accuracy is the top priority. However, if speed is more important than accuracy, the simpler model would be preferable.
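Refitting follows the same pattern as before, again with the assumed names:

# Refit the same specification to the reduced training data
rf_fit_reduced <- rf_spec %>%
  fit(credit_score ~ ., data = train_reduced)

# Evaluate the reduced model on the reduced test data
predict(rf_fit_reduced, test_reduced) %>%
  bind_cols(test_reduced) %>%
  f_meas(truth = credit_score, estimate = .pred_class)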

10. Let's practice!

Now, it's your turn to practice.