
Reintroducing random forest

1. Reintroducing random forest

Next, let's try a random forest model on the churn dataset. After glmnet, random forest is always the second model I try on any new predictive modeling problem.

2. Random forest review

Random forests are slower than glmnet models and a bit more of a black box in terms of interpretability, but in many situations they yield much more accurate models with little parameter tuning. Another important aspect of random forests is that they require little preprocessing: there's no need to log transform or otherwise normalize your predictors, and they handle the missing-not-at-random case pretty well, even with median imputation. They also capture threshold effects and variable interactions by default, both of which occur often in real-world data. These features make random forests typically (though not always) more accurate than glmnet models, and they're also easier to tune, though slower to run.

3. Random forest on churn data

This model is even easier to fit than glmnet. The default caret values for the tuning parameters are great, so we don't need a custom tuning grid. Let's use our custom trainControl object from the last video, and fit a random forest model to the churn data using the ranger package.
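
Here's a minimal sketch of that call, assuming the predictors and outcome are stored in churn_x and churn_y, and that myControl is the custom trainControl object from the last video (these names are assumptions following the course exercises):

# Fit a random forest to the churn data with caret's default tuning grid.
# Assumes churn_x (predictors), churn_y (outcome), and myControl, the
# custom trainControl object from the last video.
library(caret)
library(ranger)

model <- train(
  x = churn_x,
  y = churn_y,
  method = "ranger",    # fast random forest implementation
  metric = "ROC",       # select mtry by cross-validated AUC
  trControl = myControl
)

model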

4. Random forest on churn data

As with the glmnet model, we can plot the results from the cross-validation and see how mtry relates to AUC. Again, caret automatically chooses the best value of mtry, so we don't need to do anything after viewing this plot, but it's a useful way to understand the model.
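
For example, assuming model is the fit from the previous step:

# Plot cross-validated AUC against mtry, the number of predictors
# randomly sampled at each split.
plot(model)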

5. Let’s practice!