Continuous outcomes
1. Continuous outcomes
In this section, we'll talk about another type of decision tree: the regression tree. In regression, the goal is to predict a numeric or quantitative outcome. Examples of continuous numeric outcomes are house prices or the number of online store visitors - the underlying value is continuous and can take any positive value.
2. The dataset
These examples and the exercises use data from a survey of chocolate tastings. There is information about the amount of cocoa (in percent), the origin of the bean, the company location, and more. The outcome or response variable is final_grade, a numeric double value ranging from 1 to 5.
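As a quick check of what the data look like, a sketch like the following lists the columns and their types; it assumes the training split is already loaded as chocolate_train, as used in the fitting code below.

```r
# Inspect the columns and types of the training data
# (chocolate_train is assumed to be loaded).
library(dplyr)

glimpse(chocolate_train)
```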
3. Construct the regression tree
You construct a regression tree in almost the same way as a classification tree: define the model class with decision_tree(), set the mode to "regression" and the engine to "rpart". Afterward, you can train the model using the fit() function as usual. Provide the formula and the data to train on: the dataset chocolate_train.
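Putting these steps together might look like the following sketch. The outcome column final_grade comes from the text above; the formula final_grade ~ . (all remaining columns as predictors) is an assumption.

```r
library(tidymodels)

# Define the model class, set the mode and the engine.
tree_spec <- decision_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart")

# Train the model; the formula is assumed, the data is chocolate_train.
tree_model <- tree_spec %>%
  fit(final_grade ~ ., data = chocolate_train)
```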
4. Predictions using a regression tree
Predicting with a regression tree is again the same as predicting with a classification tree. Simply call the predict() function and provide the model and new data. In this example, we supply our testing data as the new_data argument. The predicted numbers are chocolate grade scores.
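As a sketch, assuming the testing data is available as chocolate_test (the name is an assumption; the text only says "testing data") and tree_model is the fit from above:

```r
# predict() on a parsnip regression fit returns a tibble with a
# single .pred column holding the predicted grades.
predict(tree_model, new_data = chocolate_test)
```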
5. Divide & conquer
What actually happens during the training of a decision tree? Beginning at the root node, the data is divided into groups according to the feature that will result in the greatest increase in homogeneity in the groups after a split.
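To make the split criterion more concrete, here is a purely illustrative sketch that scores one hypothetical split on cocoa_percent by comparing the overall variance of final_grade with the weighted within-group variance after the split; the threshold 0.7 is made up for illustration.

```r
library(dplyr)

threshold <- 0.7  # hypothetical split point on cocoa_percent

chocolate_train %>%
  mutate(group = if_else(cocoa_percent >= threshold, "right", "left")) %>%
  group_by(group) %>%
  summarize(n = n(), within_var = var(final_grade), .groups = "drop") %>%
  summarize(
    overall_var  = var(chocolate_train$final_grade),
    weighted_var = sum(n * within_var) / sum(n)  # smaller means a better split
  )
```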
6. Hyperparameters
For regression trees, the aim is to minimize the variance, that is, the deviation from the mean, within each group. There are several "knobs" we can turn that affect how the tree is grown, and often, turning these knobs - or model hyperparameters - results in a better-performing model. Some of these design decisions are: min_n, the minimum number of data points a node must contain to be split further; tree_depth, the maximum depth of the tree; and cost_complexity, a penalty parameter (a smaller value makes more complex trees).

So far, we've trained our trees using the default values. These defaults are chosen to provide a decent starting point on most datasets. Finding the optimal hyperparameters is called "tuning", and you will dive into this in Chapter 3.

You choose hyperparameters in the very first decision_tree() step. In this example, tree_depth is four and cost_complexity is 0.05.
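A sketch of that specification, using the values just mentioned:

```r
# Hyperparameters are set directly in the model specification.
spec_tuned <- decision_tree(tree_depth = 4, cost_complexity = 0.05) %>%
  set_mode("regression") %>%
  set_engine("rpart")

spec_tuned
```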
7. Understanding model output
Suppose you create a decision tree with tree_depth 1 and fit it to the chocolate_train data. When you print your model to the console, you'll see the following output. First, some information about the fit time and the number of samples, then, at the bottom, more details about the decision tree. The first column is the node number, here one to three. The second column is the split criterion, here the root node or cocoa_percent greater than or equal to 0.905. The third column is the number of samples in that node, here 1000 at the root node, 16 at node two, and 984 at node three. The last column is the mean outcome of all samples in that node. For leaf nodes, that is, those marked with an asterisk, the model will predict that value.
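A sketch of the depth-one model described here; as before, the formula final_grade ~ . is an assumption.

```r
# Fit a tree that is allowed only one level of splits, then print it
# to see the rpart node summary described above.
stump_model <- decision_tree(tree_depth = 1) %>%
  set_mode("regression") %>%
  set_engine("rpart") %>%
  fit(final_grade ~ ., data = chocolate_train)

stump_model
```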
8. Let's do regression!
That was a lot to digest. But since many things are the same as in classification trees, you'll recognize the patterns. Off to the exercises.