Decision trees

1. Decision Trees

A go-to model for data scientists is the random forest, and the building block for a random forest is a decision tree. Let's learn how to use decision trees to predict future price changes of securities.

2. Decision trees

Decision trees have nodes where they split the data into groups based on the features. Trees start at a root node and end with leaf nodes.

3. Decision trees

Trees split the data based on features to get the best possible predictions. In the case of binary classification, we would try to group all the 0s on one side and all the 1s on the other side. The tree uses the "purity" of the leaf nodes to choose the best feature for making splits at each node. Purity is how uniform the targets are in a leaf node.
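As a rough sketch of that idea (not code from this course), one common purity measure for a binary target is Gini impurity; the arrays below are made up for illustration:

    import numpy as np

    def gini_impurity(targets):
        # Gini impurity for binary targets: 0 means a perfectly pure leaf, 0.5 is maximally mixed
        p = np.mean(targets)  # fraction of 1s in the leaf
        return 1 - (p ** 2 + (1 - p) ** 2)

    print(gini_impurity(np.array([1, 1, 1, 1])))  # 0.0 -- pure leaf
    print(gini_impurity(np.array([0, 1, 0, 1])))  # 0.5 -- maximally mixed leaf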

4. Decision tree splits

For categorical features, decision trees split on whether a value belongs to a particular category. For example, if weekday is one of our features, the tree could split the data based on whether or not the weekday is Friday.
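A minimal sketch of such a split, using a hypothetical weekday column and made-up targets:

    import pandas as pd

    # hypothetical data: a categorical weekday feature and a binary target
    df = pd.DataFrame({'weekday': ['Mon', 'Fri', 'Fri', 'Wed', 'Fri'],
                       'target':  [0, 1, 1, 0, 1]})

    # a split on "weekday is Friday" sends each row to the left or right child node
    is_friday = df['weekday'] == 'Fri'
    left, right = df[is_friday], df[~is_friday]
    print(left['target'].values)   # [1 1 1] -- a pure group
    print(right['target'].values)  # [0 0] -- a pure group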

5. Decision tree splits

For numeric features, trees try splitting at each unique value of the feature in the data. For example, we could split on the previous 10-day price change. In this hypothetical dataset, the target is 1 whenever the previous 10-day price change is greater than 10%.
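Here is a small sketch of that search over thresholds, with a made-up feature and targets matching the hypothetical dataset described above:

    import numpy as np

    # hypothetical feature: previous 10-day price change (%); target is 1 when it exceeds 10%
    change_10d = np.array([-5.0, 2.0, 8.0, 12.0, 15.0, 20.0])
    target = (change_10d > 10).astype(int)

    # the tree tries each unique feature value as a candidate threshold
    for threshold in np.unique(change_10d):
        left = target[change_10d <= threshold]
        right = target[change_10d > threshold]
        print(threshold, left, right)  # splitting at 8.0 separates the 0s and 1s perfectly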

6. Bad tree

Regression trees use a reduction in variance, or spread of the data, to decide on the best splits. If a split ends up with spread-out target values in the leaves, it's a bad split.

7. Good tree

If a split yields leaf nodes with tightly clustered target values, this is a good split. We can measure spread of the targets in a leaf with variance or standard deviation.
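As a sketch of how variance reduction compares a good split to a bad one (made-up regression targets, not course data):

    import numpy as np

    def variance_reduction(targets, left_mask):
        # drop in (size-weighted) target variance from splitting one node into two leaves
        left, right = targets[left_mask], targets[~left_mask]
        n = len(targets)
        child_var = (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)
        return np.var(targets) - child_var

    # made-up regression targets: 10-day-ahead price changes (%)
    targets = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])

    # a good split clusters similar targets together and removes most of the variance...
    print(variance_reduction(targets, np.array([True, True, True, False, False, False])))
    # ...a bad split mixes them, leaving the leaves almost as spread out as the parent
    print(variance_reduction(targets, np.array([True, False, True, False, True, False])))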

8. Decision tree regression

Once our decision tree is created, we make predictions by averaging the training set targets that ended up in each leaf node.
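In other words, a leaf's prediction is just the mean of the training targets it holds; a tiny illustration with made-up numbers:

    import numpy as np

    # made-up training targets that ended up in a single leaf node
    leaf_targets = np.array([0.02, 0.03, 0.025, 0.035])

    # any new sample routed to this leaf gets the mean of those targets as its prediction
    print(leaf_targets.mean())  # 0.0275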

9. Regression trees

Creating and fitting decision trees is easy with sklearn. We import the DecisionTreeRegressor, create the model, then simply use the fit method, giving it the train_features and train_targets as arguments. We also have a huge number of settings we can choose when creating the regressor, but we will only look at max_depth here.
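A minimal sketch of what that looks like; the train_features and train_targets arrays here are random placeholders standing in for the course's prepared data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # placeholder arrays standing in for the course's prepared train_features/train_targets
    train_features = np.random.rand(250, 4)
    train_targets = np.random.rand(250)

    # create the regressor and fit it to the training data
    decision_tree = DecisionTreeRegressor()
    decision_tree.fit(train_features, train_targets)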

10. Decision tree hyperparameters

max_depth is a hyperparameter, which is a setting we choose for our model. It controls how deep our tree can grow. Trees with no limit on max_depth turn out like this huge tree, which will exactly fit the training set and do poorly on new data, so we definitely want to limit the depth.

11. Max depth of 3

Trees with a limit to max_depth can turn out like this one, which has a max_depth of 3.
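Continuing the sketch above, limiting the depth is just a keyword argument when creating the regressor (same placeholder data as before):

    from sklearn.tree import DecisionTreeRegressor

    # max_depth=3 caps the tree at three levels of splits below the root node
    shallow_tree = DecisionTreeRegressor(max_depth=3)
    shallow_tree.fit(train_features, train_targets)  # same placeholder data as above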

12. Evaluate model

Once the model is fit, we want to check its performance. We'll look at the training and test scores first. The built-in score method in sklearn regressor models calculates the R-squared score from features and targets. Essentially, an R-squared of 1 means perfect predictions, 0 means useless predictions, and negative values mean our predictions are horrible. The value of 0-dot-66 on the train set means we did alright there, but the -0-dot-089 R-squared means we aren't doing well on the test set. We'll also scatter the predictions versus the actual values to check performance. First, we get predictions from our model with predict and the features, then we scatter predictions on the x-axis and actual targets on the y-axis. Finally, we show the legend so we know which points are which, and show the plot.
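A self-contained sketch of those evaluation steps; the train/test arrays here are random placeholders, so the scores will not match the 0-dot-66 and -0-dot-089 values mentioned above:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeRegressor

    # placeholder data standing in for the course's prepared train/test split
    rng = np.random.default_rng(0)
    train_features, train_targets = rng.normal(size=(100, 4)), rng.normal(size=100)
    test_features, test_targets = rng.normal(size=(50, 4)), rng.normal(size=50)

    decision_tree = DecisionTreeRegressor(max_depth=3)
    decision_tree.fit(train_features, train_targets)

    # score() returns R-squared: 1 is perfect, 0 is useless, negative is worse than useless
    print(decision_tree.score(train_features, train_targets))
    print(decision_tree.score(test_features, test_targets))

    # scatter predictions (x-axis) against actual targets (y-axis) for both sets
    train_predictions = decision_tree.predict(train_features)
    test_predictions = decision_tree.predict(test_features)
    plt.scatter(train_predictions, train_targets, label='train')
    plt.scatter(test_predictions, test_targets, label='test')
    plt.legend()
    plt.show()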

13. Decision tree predictions

Here we have great predictions on the training set, and poor predictions on the test set. This is an example of overfitting, which we'll learn how to fix in chapter 3.

14. Grow some trees!

Ok, I think you're ready. Let's grow some trees!