
Predict churn with decision trees

1. Predict churn with decision trees

Great work! Now we will learn about the decision tree model, another simple and popular machine learning model that can be used for both classification and regression.

2. Introduction to decision trees

Here, we have an example decision tree that was built on the famous Titanic survival dataset. The decision tree outlines the if-else rules that were inferred from the survival dataset, and we can see that the survival probabilities differ for each of the leaves depending on the rules leading to them.

3. Modeling steps

To cement our knowledge, we'll go through the supervised learning modeling steps again: split the data into training and testing sets, initialize the model, fit the model on the training dataset, then predict the values on the testing data, and finally evaluate the model performance by comparing the predicted values with the actual ones in the testing data. Since we have already learned and practiced how to split the data into training and testing sets, we will now move on to the second step with decision trees.
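As a quick reminder of the first step, here is a minimal sketch of the split, assuming the input features are stored in X and the binary churn target in y (hypothetical names used only for illustration):

from sklearn.model_selection import train_test_split

# Assumed names: X holds the input features, y the binary churn labels
train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=0.25, random_state=42)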

4. Fitting the model

Now, we will fit the model. First, we import the classifier from the scikit-learn library. Then, we initialize the decision tree instance. Finally, we fit the model on the training data by first providing the input features and then the target variable.
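A minimal sketch of these three steps, assuming the training features and target are stored in the hypothetical train_X and train_Y objects from the split above:

from sklearn.tree import DecisionTreeClassifier

# Initialize the decision tree instance
mytree = DecisionTreeClassifier()

# Fit the model on the training features and target
mytree.fit(train_X, train_Y)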

5. Measuring model accuracy

Now, let's move to calculating the model accuracy on both the training and testing datasets. As with logistic regression, the steps are the same. First, we import accuracy_score from the sklearn.metrics module. Then we predict the labels by calling the predict method on the fitted tree instance. Once completed, we call accuracy_score and feed in the actual labels first and the predicted ones afterwards. We store the accuracy scores in separate objects. Finally, we print the rounded accuracy and can see that the training accuracy is around 99.7%, while the testing accuracy is only 72%. This is different from logistic regression, where both numbers were similar, at around 80%. This indicates that the tree memorized the patterns and rules of the training data almost perfectly, but failed to generalize them to the testing data. We will learn how to reduce the size of the tree to manage this in the next slides.
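A hedged sketch of these steps, reusing the fitted mytree instance and the hypothetical train/test objects from before:

from sklearn.metrics import accuracy_score

# Predict labels on both the training and the testing data
pred_train_Y = mytree.predict(train_X)
pred_test_Y = mytree.predict(test_X)

# accuracy_score takes the actual labels first, then the predicted ones
train_accuracy = accuracy_score(train_Y, pred_train_Y)
test_accuracy = accuracy_score(test_Y, pred_test_Y)

print('Training accuracy:', round(train_accuracy, 4))
print('Testing accuracy:', round(test_accuracy, 4))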

6. Measuring precision and recall

Now, let's calculate the precision and recall. The process is identical to the one we used with logistic regression. First, we import the functions to calculate precision and recall. Then, we calculate the precision score for both training and testing data and round it to 4 decimals, then do the same for the recall score. Finally, we print the numbers. One thing that stands out is the low value for testing recall, while the other scores are over 99%. Remembering that recall measures the share of actually churned customers correctly captured by the model, we can see that the model is very precise in its predictions, but fails to identify more than half of the customers who actually churned.
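A minimal sketch of this step, again assuming the predicted labels pred_train_Y and pred_test_Y from the previous slide:

from sklearn.metrics import precision_score, recall_score

# Precision: how many of the predicted churners actually churned
train_precision = round(precision_score(train_Y, pred_train_Y), 4)
test_precision = round(precision_score(test_Y, pred_test_Y), 4)

# Recall: how many of the actual churners the model captured
train_recall = round(recall_score(train_Y, pred_train_Y), 4)
test_recall = round(recall_score(test_Y, pred_test_Y), 4)

print('Precision (train / test):', train_precision, test_precision)
print('Recall (train / test):', train_recall, test_recall)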

7. Tree depth parameter tuning

The decision tree is very prone to overfitting, as it will build rules that memorize the patterns down to the level of individual observations. To manage this, we need to prune the tree, which means limiting the number of if-else rules. To do this, we provide the max_depth parameter. We will tune it in the same way we tuned the C value for logistic regression. First, we create a list of max_depth candidates between 2 and 14, then create a numpy array of zeros and store the depth candidates in the first column. Then we iterate through the depth values and fit a decision tree for each. Afterwards, we calculate the accuracy, precision and recall scores on the testing data and store them in the numpy array. Finally, we print the results as a pandas DataFrame for better formatting.
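Here is a hedged sketch of that tuning loop, using the same hypothetical train/test objects as before:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Candidate max_depth values between 2 and 14
depth_list = list(range(2, 15))

# Array with one row per depth: depth, accuracy, precision, recall
depth_tuning = np.zeros((len(depth_list), 4))
depth_tuning[:, 0] = depth_list

# Fit a tree for each depth and score it on the testing data
for index, depth in enumerate(depth_list):
    mytree = DecisionTreeClassifier(max_depth=depth)
    mytree.fit(train_X, train_Y)
    pred_test_Y = mytree.predict(test_X)
    depth_tuning[index, 1] = accuracy_score(test_Y, pred_test_Y)
    depth_tuning[index, 2] = precision_score(test_Y, pred_test_Y)
    depth_tuning[index, 3] = recall_score(test_Y, pred_test_Y)

# Print the results as a pandas DataFrame for better formatting
col_names = ['Max_Depth', 'Accuracy', 'Precision', 'Recall']
print(pd.DataFrame(depth_tuning, columns=col_names))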

8. Choosing optimal depth

As we can see, the testing accuracy first increases with more depth and then starts to decline. Precision declines with more depth, while recall first increases and then starts falling.

9. Choosing optimal depth

We can see that at a max_depth of 5, the tree produces good overall scores and a fairly high recall before the metrics start declining. This makes a depth of 5 a sensible starting point for the first version of the model.
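As a sketch, the chosen depth would then be passed when initializing the final model; the value 5 here is simply the one suggested by this example:

from sklearn.tree import DecisionTreeClassifier

# Refit the pruned tree with the chosen maximum depth
mytree = DecisionTreeClassifier(max_depth=5)
mytree.fit(train_X, train_Y)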

10. Let's build a decision tree!

Great work! Let's test our knowledge of decision trees by doing some exercises!