1. Model comparison
Congratulations! You've come a long way and made great progress so far.
The most important task after building several models is comparing them to choose the one that is most useful for you.
2. Motivation
The remaining exercises will reveal exactly how to perform a model comparison across all models of this course: Decision Trees, Bagged Trees, Random Forest, and Gradient Boosting.
We'll use the test set predictions from each of the models to compute the out-of-sample AUC. The model with the highest AUC is considered to be the best performing model.
Finally, we’ll visualize the ROC curves of all models and plot them all on the same graph.
Let's jump right into it!
3. Combine predictions
To keep the programming overhead low, we are going to leverage the tidy tibble format of the predictions.
Use the bind_cols() function to combine the decision tree predictions, preds_tree,
4. Combine predictions
with the bagged tree predictions, preds_bagging,
5. Combine predictions
the random forest predictions, preds_forest,
6. Combine predictions
the boosted ensemble predictions, preds_boosting,
7. Combine predictions
and the column still_customer, which we extract by selecting it from the test data.
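As a rough sketch, the combined call could look like the following. The names preds_tree, preds_bagging, preds_forest, preds_boosting, and still_customer come from the narration; customers_test is a stand-in name for the test data, and each preds_* object is assumed to be a one-column tibble of predicted probabilities.

```r
library(dplyr)

# Combine the per-model prediction tibbles column-wise and attach
# the true labels from the test data (customers_test is a placeholder name).
preds_combined <- bind_cols(
  preds_tree,      # decision tree probabilities
  preds_bagging,   # bagged tree probabilities
  preds_forest,    # random forest probabilities
  preds_boosting,  # gradient boosting probabilities
  customers_test %>% select(still_customer)
)
```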
8. Calculate decision tree AUC
This way, calculating the AUC is really easy:
Just call the roc_auc() function with the combined predictions tibble, giving the truth column still_customer and the estimate column containing the predicted probabilities. For the decision tree, this is the preds_tree column, and we see an AUC of 91-point-1 percent.
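A minimal sketch of this call using the yardstick package; depending on how the factor levels of still_customer are ordered, you may need to add event_level = "second".

```r
library(yardstick)

# AUC of the decision tree: true labels vs. predicted probabilities.
roc_auc(preds_combined, truth = still_customer, preds_tree)
```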
9. Calculate bagged tree AUC
For the bagged ensemble, this is the preds_bagging column, where we see an AUC of 93-point-6 percent.
10. Calculate random forest AUC
For the random forest, it's the preds_forest column, where we see an AUC of 97-point-4 percent.
11. Calculate boosted AUC
And for the boosted ensemble, it's the preds_boosting column, which yields an AUC of 98-point-4 percent.
12. Combine all AUCs
If you combine these calls using the bind_rows() function, the AUCs are combined into one single tibble.
13. Combine all AUCs
To state which AUC belongs to which model, it's advisable to name all the arguments in the bind_rows() call and provide the dot-id argument to create a result column that contains the model names.
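Sketched out, the combined call could look as follows; the model labels on the left-hand side are illustrative, not prescribed by the course.

```r
library(dplyr)
library(yardstick)

# Name each result so that .id = "model" records which model it belongs to.
bind_rows(
  decision_tree = roc_auc(preds_combined, truth = still_customer, preds_tree),
  bagged_trees  = roc_auc(preds_combined, truth = still_customer, preds_bagging),
  random_forest = roc_auc(preds_combined, truth = still_customer, preds_forest),
  boosting      = roc_auc(preds_combined, truth = still_customer, preds_boosting),
  .id = "model"
)
```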
The result is a beautiful tibble giving all the AUCs in tabular form. We see that the boosted tree beats all other models.
Still, this can be done in a tidier way.
14. Reformat the results
Let's tidy that up while creating the ROC curves. Right now, the model predictions are stored in one column per model. It would be cleaner if all predictions were in one single numeric column.
The pivot_longer() function from the tidyr package can do that.
First, provide the tibble to be reshaped, preds_combined.
Then, in the cols argument, specify the columns to be transformed into rows, which is all columns that start with "preds_". The argument names_to specifies the name of the new column that stores our identifiers and values_to is the name of the column where all the numeric values go.
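A sketch of the reshaping step, matching the column names mentioned in the narration:

```r
library(tidyr)

# Turn the four preds_* columns into a "model" identifier column
# and a single numeric "predictions" column.
preds_long <- pivot_longer(
  preds_combined,
  cols = starts_with("preds_"),
  names_to = "model",
  values_to = "predictions"
)
```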
The result is a tibble with new columns model and predictions containing the same information as before, but in a different shape. We have 4044 rows now, exactly four times as many as before, because each of the four prediction columns now contributes one row per observation.
15. Calculate cutoff values
The rest is really easy. Group the predictions by model so that one curve per model is calculated.
Then, for every possible cutoff, calculate sensitivity and specificity using the roc_curve() function that takes the truth column "still_customer" and the estimate column "predictions".
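A minimal sketch of both steps, assuming the reshaped tibble is called preds_long as above:

```r
library(dplyr)
library(yardstick)

# One ROC curve per model: sensitivity and specificity at every cutoff.
cutoffs <- preds_long %>%
  group_by(model) %>%
  roc_curve(truth = still_customer, predictions)
```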
16. Plot ROC curves
As a last step, we call the autoplot() function on the result to see all the ROC curves on one single plot.
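The plotting step is a one-liner; ggplot2 provides the plotting backend for autoplot() here.

```r
library(ggplot2)

# Draw all ROC curves in a single plot, one color per model.
autoplot(cutoffs)
```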
See the order in which your models improved? Violet, the decision tree, performs worst, and green, the boosted tree, performs best.
17. Let's compare!
Now it's your turn to compare your models!