Other tree options and the construction of confusion matrices

1. Other tree options and the construction of confusion matrices

We've covered several options that facilitate the construction of a decision tree, especially on unbalanced data. Of course, the range of possibilities is vast, and there are other arguments you can pass to the rpart() function to shape your loan default decision tree. I will go over some of the most important ones in this video.

2. Other interesting rpart() - arguments

One interesting argument in rpart() is weights, which allows you to include a case weight for each observation in the training set. Increasing the weights of the default cases would have been another plausible strategy to counter the imbalance in the decision tree. The other arguments I will mention here are all part of the rpart.control() function, which can be passed to the control argument; we used this function before to change the complexity parameter cp. One interesting argument here is minsplit, the minimum number of observations that must exist in a node for a split to be attempted. Its default value is 20, but for unbalanced data it can be useful to lower it. The argument minbucket specifies the minimum number of observations in any leaf node. The default here is one-third of the value specified in minsplit, but it may be worth trying other values. Be aware that lowering these values too much could lead to overfitting. The last thing we need to do is make sure we evaluate our decision tree.
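
As a minimal sketch of how these arguments fit together (the names training_set and loan_status, the weight of 3, and the control values below are illustrative assumptions, not taken from the video):

    library(rpart)

    # Hypothetical case weights: give the rare default cases (loan_status == 1)
    # a larger weight to counter the class imbalance
    case_weights <- ifelse(training_set$loan_status == 1, 3, 1)

    tree_weights <- rpart(loan_status ~ ., method = "class",
                          data = training_set,
                          weights = case_weights,
                          control = rpart.control(minsplit = 5,   # default is 20
                                                  minbucket = 2,  # default is minsplit / 3
                                                  cp = 0.001))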

3. Making predictions using the decision tree

Until now, we have only used the training set, but we want to use our test set to evaluate the result. This can be done using the predict() function. The code for making predictions using the model built on the undersampled data set is shown here; note that we use the pruned version of the tree. By including the argument type = "class", you get a vector of class predictions straight away, without having to apply a cutoff value, as we did with logistic regression. In some cases, however, you may want a non-binary prediction so you can choose a cutoff yourself. In that case, simply leave off the type argument in predict(), and you will get probabilities instead of binary predictions. These probabilities are derived from the class proportions in the leaf node where a particular case ends up given its covariate values.
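
A sketch of both prediction styles, assuming a pruned tree ptree_undersample and a test data frame test_set (both names are placeholders for the objects used in the video):

    # Binary class predictions: no cutoff needed
    pred_class <- predict(ptree_undersample, newdata = test_set, type = "class")

    # Leaving off the type argument returns a matrix of class probabilities,
    # one column per level of the response
    pred_prob <- predict(ptree_undersample, newdata = test_set)
    prob_default <- pred_prob[, 2]  # assuming the second column is the default class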

4. Constructing a confusion matrix

Here is the confusion matrix for the decision tree that was constructed using the undersampled data and binary predictions. Even though we used undersampling, very few defaults are predicted for our test set. It could be useful here to switch to probability predictions and lower the cutoff.
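
Continuing the sketch above (the 0.3 cutoff is purely illustrative; pick a value that suits your cost trade-off), the confusion matrices could be built like this:

    # Confusion matrix for the binary predictions
    conf_mat <- table(test_set$loan_status, pred_class)
    conf_mat

    # With probability predictions, choose a lower cutoff yourself
    pred_cutoff <- ifelse(prob_default > 0.3, 1, 0)
    table(test_set$loan_status, pred_cutoff)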

5. Let's practice!

Let's explore!