

1. Seeing the forest from the trees

Consider the ways that decision trees parallel trees in the natural environment: a root node that grows into branches, and leaf nodes that sometimes need pruning. You might think that by now we would have exhausted the tree metaphors, but in fact, there is one more. Just as living trees can be grouped as a forest, a number of classification trees can be combined into a collection known as a decision tree forest. For reasons that you will soon see, these forests are among the most powerful machine learning classifiers, yet they remain remarkably efficient and easy to use.

2. Understanding random forests

Because of their combined versatility and power, decision tree forests have become one of the most popular approaches for classification. This power does not come from a single tree that has grown large and complex, but rather from a collection of smaller, simpler trees that together reflect the data's complexity. Each of the forest's trees is diverse and may reflect some subtle pattern in the outcome to be modeled. Generating this diversity is the key to building powerful decision tree forests. However, if you were to grow 100 trees on the same set of data, you'd have 100 copies of the same tree. Growing diverse trees requires the growing conditions to be varied from tree to tree. This is done by allocating each tree a random subset of data; one tree may receive a vastly different training set than another. The term random forests refers to a specific growing algorithm in which both the features and the examples may differ from tree to tree.
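
To make the idea concrete, here is a toy sketch of how one tree's growing conditions might be varied by randomly sampling both examples and features. This only illustrates the principle, not the internals of the randomForest package, and the credit_data data frame is a hypothetical placeholder.

# credit_data is a hypothetical data frame whose last column is the outcome
set.seed(123)
n_rows     <- nrow(credit_data)
n_features <- ncol(credit_data) - 1   # all columns except the outcome

# Bootstrap sample of examples: some rows repeat, others are left out
row_ids <- sample(n_rows, size = n_rows, replace = TRUE)

# Random subset of features, roughly the square root of the total
feature_ids <- sample(n_features, size = floor(sqrt(n_features)))

# One tree would be grown on its own random slice of the data
tree_data <- credit_data[row_ids, c(feature_ids, n_features + 1)]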

3. Making decisions as an ensemble

It seems somewhat counterintuitive to think that a group of trees built on small, random subsets of the data could perform any better than a single, very complex tree that had the benefit of learning from the entire dataset. But the forest's power is based on the same principles that govern successful teamwork in business or on the athletic field. In these settings, it is certainly advantageous to have team members who are extremely good at some tasks. However, these people typically have weaknesses in other areas. For this reason, it is even better for the team to have members with complementary skills. Even if none of the members is especially strong, good teamwork usually wins. Machine learning methods like random forests that apply this principle are called ensemble methods. All ensemble methods are based on the principle that weaker learners become stronger with teamwork. In a random forest, each tree is asked to make a prediction, and the group's overall prediction is determined by a majority vote. Though each tree may reflect only a narrow portion of the data, the overall consensus is strengthened by these diverse perspectives.
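
As a toy illustration of majority voting, suppose three trees each classify the same loan applicant; the votes below are made up for illustration, not course data.

# Hypothetical votes from three trees for a single applicant
tree_votes <- c("default", "repaid", "default")

# The forest's overall prediction is the most common vote
vote_counts <- table(tree_votes)
names(which.max(vote_counts))   # "default" wins two votes to one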

4. Random forests in R

The R package randomForest implements the random forest algorithm. The randomForest() function offers two parameters of note. The first, ntree, dictates the number of trees to include in the forest. Setting this sufficiently large ensures good representation of the complete set of data. Don't worry: even with a large number of trees, the model typically trains relatively quickly, because each tree uses only a portion of the full dataset. The second parameter, mtry, is the number of features selected at random for each tree. By default, it uses the square root of the total number of predictors, and it is generally fine to leave as is. As usual, the predict() function uses the model to make predictions.
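
A minimal sketch of this workflow might look as follows; the loans data frame, its factor outcome column, and the future_applicants data are hypothetical placeholders rather than objects from the course.

# Load the randomForest package
library(randomForest)

# Grow a forest of 500 trees; mtry is left at its default,
# the square root of the number of predictors
loan_model <- randomForest(outcome ~ ., data = loans, ntree = 500)

# Classify new cases using the forest's majority vote
loan_pred <- predict(loan_model, newdata = future_applicants)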

5. Let's practice!

By now I'm sure you're excited to see a random forest in action. In the next exercises you'll have a chance to grow one and compare its performance to the best decision tree.