
Growing larger classification trees

1. Growing larger classification trees

Starting from a seed, real-world trees need the proper combination of soil, water, air, and light to grow. Just like understanding these principles helps you become a better gardener, knowing a bit about the growing conditions of decision trees will help you produce more robust classification models. In this lesson, you'll learn more about how trees grow, branch out, and sometimes even outgrow their environment.

2. Choosing where to split

Earlier, you learned that classification trees use divide-and-conquer to identify splits that create the most "pure", or homogeneous, partitions. To see how this works in practice, let's consider a tree being built with data on loan applicants' credit scores and requested loan amounts. For each of these predictors, the algorithm attempts a split on the feature values and calculates the purity of the resulting partitions. The split that produces the purest partitions is used first. Here, split A divides the data into partitions with high and low credit scores, while split B divides the data into large and small loan amounts. Split B results in one very homogeneous partition, but its other partition is very mixed. In comparison, split A results in two partitions that are both relatively pure. As a result, the tree will choose split A first, then continue to divide and conquer within the resulting partitions.
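The lesson doesn't specify which purity measure the tree uses, but a common choice is the Gini impurity. The sketch below is a minimal Python illustration (not code from this course, and the data is made up) that scores two hypothetical candidate splits, mirroring splits A and B above, by the weighted impurity of the partitions they create.

```python
# Minimal sketch: comparing candidate splits by Gini impurity.
# The outcome lists and split labels (A, B) are hypothetical illustrations.

def gini(labels):
    """Gini impurity of a partition: 1 - sum(p_k^2) over classes k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(left, right):
    """Weighted average impurity of the two partitions a split creates."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Loan outcomes ("repaid" vs. "default") on each side of a candidate split.
# Split A (high vs. low credit score): both partitions fairly pure.
split_a = (["repaid"] * 9 + ["default"] * 1,    # high credit score
           ["default"] * 8 + ["repaid"] * 2)    # low credit score

# Split B (small vs. large loan amount): one pure, one very mixed partition.
split_b = (["repaid"] * 5,                      # small loan amount (pure)
           ["repaid"] * 6 + ["default"] * 9)    # large loan amount (mixed)

print("Split A impurity:", round(split_impurity(*split_a), 3))  # lower
print("Split B impurity:", round(split_impurity(*split_b), 3))  # higher
# The split with the lower weighted impurity (here, A) is chosen first.
```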

3. Axis-parallel splits

As the tree continues to grow, it creates smaller and more homogeneous partitions, as shown here. You may have noticed, however, that there is an easier way to create a set of perfectly pure partitions: simply use a diagonal line to divide the outcomes. Unfortunately, a decision tree cannot discover this on its own, because a diagonal line requires it to consider two features at once, which is not possible in the divide-and-conquer process. Instead, a decision tree always creates so-called axis-parallel splits. This limitation is a potential weakness of decision trees; they can become overly complex when modeling certain patterns in the data.
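To see this limitation concretely, here is a minimal sketch using scikit-learn (an assumed tool, not necessarily what this course uses) on purely synthetic data: a class boundary that a single diagonal line would separate perfectly forces the tree to stack up many axis-parallel splits.

```python
# Minimal sketch: a diagonal class boundary forces a decision tree to
# approximate it with many axis-parallel splits. Synthetic data only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # class depends on a diagonal line

tree = DecisionTreeClassifier().fit(X, y)
print("Leaves needed to fit a diagonal boundary:", tree.get_n_leaves())
# One diagonal cut would separate the classes perfectly, but the tree can
# only split on one feature at a time, so it builds many small rectangles.
```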

4. The problem of overfitting

Generally speaking, decision trees tend to become very complex very quickly. A tree will happily divide and conquer until it classifies every example correctly, or until it runs out of feature values to split on. When a tree has grown overly large and complex, it may suffer from overfitting. Rather than modeling the most important trends in the data, an overfitted tree tends to model the noise: it focuses on extremely subtle patterns that may not apply more generally. More so than many other machine learning algorithms, classification trees have a tendency to overfit the data they are trained on.
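As a rough illustration, the sketch below (scikit-learn on synthetic data with deliberately noisy labels; both are assumptions for the example) grows an unconstrained tree and shows it classifying every training example correctly, noise included.

```python
# Minimal sketch: an unconstrained tree keeps splitting until it classifies
# every training example correctly, even when labels contain random noise.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)
y[rng.random(300) < 0.15] ^= 1             # flip 15% of labels: pure noise

tree = DecisionTreeClassifier().fit(X, y)  # no depth limit, no pruning
print("Training accuracy:", tree.score(X, y))   # 1.0 -- noise memorized too
print("Tree depth:", tree.get_depth())          # deep, overly complex tree
```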

5. Evaluating model performance

When a machine learning model has been overfitted to its training dataset, you must take care not to overestimate how well it will perform in the future. Just because it classifies every training example correctly does not mean it will do so on unseen data. Thus, it is important to simulate unseen future data by constructing a test dataset that the algorithm cannot use when growing the tree. A simple method for constructing a test set is to hold out a small random portion of the full dataset. Performance on this held-out data gives a fairer estimate of the tree's future performance; if the tree performs much more poorly on the test set than on the training set, the model has likely been overfitted.
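Here is a minimal sketch of this holdout approach, again using scikit-learn and made-up stand-in data rather than the course's loan dataset: train the tree on one portion, evaluate it on the held-out portion, and compare the two accuracies.

```python
# Minimal sketch: hold out a random portion of the data as a test set and
# compare accuracies. A test accuracy far below the training accuracy
# suggests the tree has been overfitted. Illustrative synthetic data only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 5))
y = ((X[:, 0] + rng.normal(size=1000)) > 0).astype(int)  # noisy outcome

# Hold out 25% of the examples; the tree never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print("Training accuracy:", tree.score(X_train, y_train))  # near-perfect
print("Test accuracy:    ", tree.score(X_test, y_test))    # noticeably lower
```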

6. Let's practice!

You'll get a chance to construct random test sets in the next exercises. Good luck!