
Tending to classification trees

1. Tending to classification trees

In the previous video, you learned that decision trees have a tendency to grow overly large and complex very quickly. If this were to happen to trees in your yard, you'd be outside with clippers, looking to trim away some of the excess greenery. Grooming healthy classification trees likewise requires this kind of attention. In this lesson, you'll learn about pruning strategies, which help ensure the trees are just right: not too large and not too small.

2. Pre-pruning

One method of preventing a tree from becoming too large involves stopping the growing process early. This is known as pre-pruning. Perhaps the simplest approach to pre-pruning stops divide-and-conquer once the tree reaches a predefined size. The figure here shows a tree that has been stopped early because it reached a maximum depth of three levels. Another pre-pruning method requires a minimum number of observations at a node in order for a split to occur. For example, this figure shows a tree prevented from splitting any node containing fewer than 10 observations. Both of these pre-pruning strategies prevent the tree from growing too large. However, a tree stopped too early may miss subtle but important patterns it would have discovered had it been allowed to grow further.
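As a preview of the rpart syntax covered later in this lesson, the following minimal sketch shows how these two stopping rules might be expressed; the loans_train data frame and the outcome column are placeholder names, not the lesson's actual dataset.

library(rpart)

# Pre-pruning: stop growing once the tree reaches a maximum depth of 3 levels
tree_maxdepth <- rpart(outcome ~ ., data = loans_train, method = "class",
                       control = rpart.control(maxdepth = 3))

# Pre-pruning: only attempt a split when a node contains at least 10 observations
tree_minsplit <- rpart(outcome ~ ., data = loans_train, method = "class",
                       control = rpart.control(minsplit = 10))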

3. Post-pruning

To address this concern, it is also possible to grow a very large tree, knowing that it will be overly complex, but then prune it back to reduce the size. This is known as post-pruning. In post-pruning, nodes and branches with only a minor impact on the tree's overall accuracy are removed after the fact. This figure illustrates a tree that grew to four levels deep, but had a branch pruned away because its presence did not substantially improve the classification accuracy. The relationship between the tree's complexity and the accuracy can be depicted visually as illustrated here. As the tree becomes increasingly complex, the model makes fewer errors; however, performance improves rapidly at first and then only slightly with each additional increase in complexity. This trend provides insight into the optimal point at which to prune the tree; simply look for the point at which the curve flattens. The horizontal dotted line identifies the point at which the error rate becomes statistically similar to the most complex model. Typically, you should prune the tree at the complexity level that results in a classification error rate just under this line.
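Sketched with the same placeholder names, a typical post-pruning workflow grows an intentionally large tree, inspects the error-versus-complexity plot, and then cuts the tree back at the chosen complexity value; the cp value used for pruning here is purely illustrative.

# Post-pruning: first grow a deliberately large tree by relaxing the complexity penalty
tree_full <- rpart(outcome ~ ., data = loans_train, method = "class",
                   control = rpart.control(cp = 0))

# Visualize cross-validated error versus model complexity
plotcp(tree_full)

# Prune back at the complexity value chosen from the plot
tree_pruned <- prune(tree_full, cp = 0.01)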

4. Pre- and post-pruning with R

The rpart decision tree package provides functions for creating this visualization, as well as for performing pre- and post-pruning. Pre-pruning is performed when building the decision tree model. The rpart.control function can be supplied with a maxdepth parameter that controls the maximum depth of the decision tree, or a minsplit parameter that dictates the minimum number of observations a node must contain in order for a split to be attempted. Then, simply supply the resulting control object to the rpart function when building the tree. Post-pruning is applied to a decision tree model that has already been built. The plotcp function generates a visualization of the error rate versus model complexity, which provides insight into the optimal cutpoint for pruning. Once this value has been identified, it can be supplied to the prune function's complexity parameter, cp, to create a simpler, pruned tree.
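Since the upcoming exercises also look at test set accuracy, here is one more small sketch showing how the overgrown and pruned trees from above might be compared on held-out data; loans_test and its outcome column are again placeholder names.

# Predict class labels on a held-out test set with each tree
pred_full <- predict(tree_full, loans_test, type = "class")
pred_pruned <- predict(tree_pruned, loans_test, type = "class")

# Compare accuracy: the pruned tree is often nearly as accurate despite being far simpler
mean(pred_full == loans_test$outcome)
mean(pred_pruned == loans_test$outcome)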

5. Let's practice!

In the next several exercises, you will have a chance to apply both pre- and post-pruning methods to the Lending Club data to examine the impact on the tree complexity and test set accuracy. Let's see what happens!