Tuning employee turnover classifier

1. Tuning employee turnover classifier

In chapter 2 we briefly touched on the concept of overfitting. As mentioned there, the train/test split helps us learn whether we have an overfitting problem, but it does not really provide a solution to it. In this chapter, we will concentrate on tuning our classifier to get better results, and some of these methods will be related to fighting overfitting.

2. Overfitting

For that reason, let's remember what overfitting was about. Once we develop a model on the training component, it may work perfectly on that data but fail outside of it. This is why we use the test component to understand whether our model is useful beyond the training data. As you can see, the accuracy score is perfect on the training set but noticeably lower on the test set. This points to an overfitting problem. The reason we have it is that, currently, our tree grows as much as it can, and in the end becomes very large and very specific to the training data only. To solve this issue, we have two options: either we limit the maximum depth of the tree, say we do not let the tree grow more than 5 levels, OR we limit the sample size in each leaf and, say, do not allow the tree to grow further once only 100 employees are left in the node/leaf. Let's go on and apply both separately.
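A minimal sketch of what this unconstrained baseline might look like, assuming the data has already been split into `features_train`, `features_test`, `target_train`, and `target_test` (these variable names are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# Initialize the classifier with no limits, so the tree keeps growing
# until it becomes very large and very specific to the training data
model = DecisionTreeClassifier(random_state=42)

# Fit the features to the target on the training component only
model.fit(features_train, target_train)

# Accuracy is (near) perfect on the training set...
print(model.score(features_train, target_train) * 100)

# ...but lower on the test set, which signals overfitting
print(model.score(features_test, target_test) * 100)
```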

3. Pruning the tree

In the upper block we limit the tree depth. As you can see, this can easily be done by setting an additional parameter `max_depth=5` in the `DecisionTreeClassifier` during initialization. It keeps everything else the same but limits the tree to growing at most 5 levels deep. Thus, let's call this model `model_depth_5`. Afterwards, the fitting and scoring processes are still the same, with only one tiny but important difference: we fit features to the target and calculate the accuracy for `model_depth_5` instead of the general model without any limitation. As a result, the accuracy decreases on both sets, but the difference between them is negligible, which means we reduced overfitting and the current model is more realistic. In the lower block we implement everything in exactly the same way, apart from the model initialization step: this time we set `min_samples_leaf=100` to limit the sample size inside a leaf. After fitting and scoring this new model we receive a test accuracy of 96.13%, which is again lower but, again, more realistic than the old one.
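A hedged sketch of the two blocks described above, reusing the same illustrative train/test variable names; only the initialization step differs between the two models:

```python
from sklearn.tree import DecisionTreeClassifier

# Upper block: limit the tree to at most 5 levels of depth
model_depth_5 = DecisionTreeClassifier(max_depth=5, random_state=42)
model_depth_5.fit(features_train, target_train)
print(model_depth_5.score(features_train, target_train) * 100)
print(model_depth_5.score(features_test, target_test) * 100)

# Lower block: require at least 100 employees in every leaf,
# so splits that would produce smaller leaves are not made
model_sample_100 = DecisionTreeClassifier(min_samples_leaf=100, random_state=42)
model_sample_100.fit(features_train, target_train)
print(model_sample_100.score(features_train, target_train) * 100)
print(model_sample_100.score(features_test, target_test) * 100)
```

In both cases the training accuracy drops compared to the unconstrained tree, while the gap between training and test accuracy shrinks, which is exactly the overfitting reduction we were after.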

4. Let's practice!

Later on, we will learn about more realistic metrics for evaluating the model. Until then, let's practice!