Pruning the decision tree

1. Pruning the decision tree

As you saw in the previous sequence of exercises, the resulting decision trees were very large. Constructing such enormous decision trees is generally not advisable.

2. Problems with large decision trees

Not only are they harder to interpret, but overfitting may occur, leading to inferior results when applying the model to the test set. The rpart package provides tools that make pruning of decision trees fairly easy. Useful functions are plotcp() and printcp(). Let's have a look at what these functions do when applying them to the tree we constructed using the undersampled training set.

3. Printcp and tree_undersample

Let's have a look at printcp() first. Applying this function to our object tree_undersample, you get an overview of how the tree grows using more splits (given in column nsplit). In fact, you can change the complexity parameter cp, which was fixed to point-001 before, to obtain the right complexity level. Now, what level should be chosen? You would want to minimize the so-called cross-validated error of the decision tree. These results are given in column xerror. As the name suggests, cross-validation inside the training set is used to obtain this error measure. Since there is a random element to the cross-validation process, it is necessary to set a seed if you want your results to be reproducible.
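
A minimal sketch of these steps, assuming the undersampled training data sits in a data frame called undersampled_training_set with loan_status as the response (both names, the formula, and the seed value are assumptions; tree_undersample and cp = 0.001 come from the slides):

    library(rpart)

    # Any fixed seed makes the cross-validation behind xerror reproducible
    set.seed(345)

    # Grow a large classification tree with a small complexity parameter
    tree_undersample <- rpart(loan_status ~ ., method = "class",
                              data = undersampled_training_set,
                              control = rpart.control(cp = 0.001))

    # One row per tree size (column nsplit), cross-validated error in column xerror
    printcp(tree_undersample)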

4. Plotcp and tree_undersample

You can also plot the cross-validation error as a function of the complexity parameter cp and the size of the tree using the function plotcp().
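
For example, continuing with the tree_undersample object from before:

    # Cross-validated error plotted against cp and tree size
    plotcp(tree_undersample)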

5. Plotcp and tree_undersample

The minimum cross-validated error can be found right here, where the complexity parameter cp equals point-003653. For this exact cp value, you can have a look at printcp() again.
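
Rather than reading the value off the plot, you can also pull it out of the tree's cp table; a small sketch (the cptable columns "CP" and "xerror" are standard rpart output, the object names are illustrative):

    # Row of the cp table with the lowest cross-validated error
    index <- which.min(tree_undersample$cptable[, "xerror"])

    # Corresponding cp value (roughly 0.003653 in this example)
    cp_min <- tree_undersample$cptable[index, "CP"]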

6. Plot the pruned tree

Now, to prune the tree, the function prune() can be used, with the initial unpruned tree as the first argument and the cp value that minimized the cross-validated error as the second argument. Plotting the pruned tree, you get a smaller decision tree. Note that this tree does not give you information on the true number of defaults versus non-defaults that are present in each leaf.
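
A sketch of this step, plugging in the cp value found above (the object name ptree_undersample is just an illustrative choice):

    # Prune the tree back to the complexity level with the minimal xerror
    ptree_undersample <- prune(tree_undersample, cp = 0.003653)

    # Plot the pruned tree and add the split labels
    plot(ptree_undersample, uniform = TRUE)
    text(ptree_undersample)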

7. Plot the pruned tree

If you want this information, you can include the argument use-dot-n equals TRUE in the text() function. Note that these counts refer to the training set only! Keep in mind that an answer of YES to the test statement in a node leads you down the left-hand branch, and an answer of NO leads you down the right-hand branch. This is not very intuitive in the standard plot from the rpart package. Additionally, the split label "home_ownership equals cd" is not very informative, because factor levels are abbreviated to single letters in this plot.
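
Concretely, using the same pruned tree object as above:

    # use.n = TRUE adds the training-set counts of non-defaults versus defaults per node
    plot(ptree_undersample, uniform = TRUE)
    text(ptree_undersample, use.n = TRUE)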

8. prp() in the rpart.plot-package

A more intuitive plot is obtained using the function prp() in the rpart.plot package. When plotting using prp(), you receive guidance on which branch to take depending on the answer to the split question. You also get the actual factor level names for factor variables, as you can see for the own and rent categories of home_ownership.
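
A minimal call, again applied to the pruned tree from before:

    # prp() labels the branches and prints factor level names in the split labels
    library(rpart.plot)
    prp(ptree_undersample)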

9. prp() in the rpart.plot-package

If you want to have the number of default versus non-default cases, you can simply include the argument extra equals 1.
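
For example:

    # extra = 1 shows the number of cases per class in each node
    prp(ptree_undersample, extra = 1)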

10. Let's practice!

In the exercises that follow, you will prune the trees you constructed in the previous exercises!