1. Building decision trees using the rpart() package
You've now seen how a perfect split can be found using the Gini measure. In the previous exercise, you only looked at one possible split.
2. Imagine...
Imagine having to manually compare all possible splits to find the best! That would be quite a challenge, right?
3. rpart() package! But...
Luckily, there is a package in R that builds decision trees for you, the rpart package.
While this package is generally very useful for building decision trees, it is important to warn you upfront that building decision trees in the credit risk context can be quite challenging. The main reason is that credit risk data are generally very unbalanced, due to the very low percentage of defaults. If you use the default settings on the training set of loan_data, include all variables, and specify the method "class" (because the response variable loan_status is categorical), you get a warning message stating that only a root is created. The reason is that, similar to what we observed when setting a cutoff for a logistic regression model, the highest accuracy is achieved by simply predicting all cases to be non-defaults.
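To make this concrete, here is a minimal sketch of the call described above. The object name training_set and the use of all remaining columns as predictors are assumptions based on this course's setup, not fixed names from the rpart package.

```r
library(rpart)

# Sketch: fit a classification tree with default settings on the (assumed)
# training_set data frame, using loan_status as the categorical response.
tree_default <- rpart(loan_status ~ ., method = "class", data = training_set)

# With strongly unbalanced data, printing the fit typically shows only a
# root node: predicting "non-default" for everyone maximizes accuracy.
tree_default
```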
4. Three techniques to overcome class imbalance
There are three main things we can do to overcome the imbalance. A first option is to either oversample the underrepresented group (in this case, the defaults) or undersample the overrepresented group (the non-defaults). Balancing the training set mitigates the accuracy issue discussed before and generally leads to better results. Note that over- or undersampling should only be applied to the training set, never to the test set!
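As an illustration of this first option, the sketch below undersamples the non-defaults; the column name loan_status and its 0/1 coding (1 = default) are assumptions about the training set used here.

```r
# Hypothetical undersampling sketch: keep all defaults (loan_status == 1) and a
# random sample of non-defaults (loan_status == 0) of the same size.
set.seed(567)
defaults     <- training_set[training_set$loan_status == 1, ]
non_defaults <- training_set[training_set$loan_status == 0, ]

undersampled_training_set <- rbind(
  defaults,
  non_defaults[sample(nrow(non_defaults), nrow(defaults)), ]
)
```

Oversampling works analogously, sampling the defaults with replacement until the two classes are roughly balanced; either way, the test set is left untouched.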
A second option is changing the prior probabilities in the rpart() function. By default, the prior probabilities of default versus non-default are set equal to their proportions in the training set. By making the prior probabilities for default bigger, you kind of trick R into attaching more importance to defaults, leading to a better decision tree.
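A hedged sketch of this second option follows. The prior values are purely illustrative; the vector must sum to one and follow the order of the factor levels of loan_status, assumed here to be non-default first, default second.

```r
# Sketch: attach a higher prior probability to defaults via the parms argument.
tree_prior <- rpart(loan_status ~ ., method = "class", data = training_set,
                    parms = list(prior = c(0.7, 0.3)))  # 30% prior on default (illustrative)
```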
As a third option, the rpart() function allows the specification of a loss matrix. In this loss matrix, different costs can be associated with the misclassification of a default as a non-default versus the misclassification of a non-default as a default. By increasing the misclassification cost of the former, again, more attention is drawn to the correct classification of defaults, improving the quality of the decision tree.
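The third option can be sketched as follows. The cost of 10 is an illustrative assumption, and the row/column order of the loss matrix follows the factor levels of loan_status (assumed: non-default, then default), with zeros on the diagonal.

```r
# Loss matrix: rows = true class, columns = predicted class, ordered as the
# factor levels of loan_status (assumed: non-default first, default second).
# Misclassifying a true default as a non-default costs 10 (illustrative value);
# the reverse error costs 1; correct classifications cost 0.
loss_matrix <- matrix(c(0,  1,
                        10, 0), ncol = 2, byrow = TRUE)

tree_loss <- rpart(loan_status ~ ., method = "class", data = training_set,
                   parms = list(loss = loss_matrix))
```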
As all three methods are intended to overcome the problem of class imbalance, validation is very important. While under- or oversampling may work very well for some data sets, it might perform poorly for others, and the same holds for the other two methods. Which method is best for the data at hand will only become clear through proper model validation.
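A brief sketch of such a validation step, assuming a hold-out data frame test_set and the trees fitted in the sketches above:

```r
# Compare the fitted trees on the (assumed) hold-out test_set via confusion matrices.
pred_prior <- predict(tree_prior, newdata = test_set, type = "class")
table(truth = test_set$loan_status, predicted = pred_prior)

pred_loss <- predict(tree_loss, newdata = test_set, type = "class")
table(truth = test_set$loan_status, predicted = pred_loss)
# Inspect accuracy and, especially, how many true defaults each tree detects.
```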
5. Let's practice!
Now, let's explore how we can change these settings appropriately in the rpart() function!