1. Class imbalance
Overall accuracy is a good metric only when the classes in the dataset are balanced. However, as discussed in this chapter, class imbalance can inflate the accuracy score even when our model is failing to correctly predict churn. That is why we covered evaluation metrics other than accuracy. While those other metrics are more robust and informative, they only partially solve the class imbalance problem. To address it directly, we can change the prior probabilities.
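To see why accuracy alone can mislead, here is a minimal sketch (assuming scikit-learn is available; the 76/24 split mirrors the proportions mentioned in this chapter, and the labels below are synthetic, not the chapter's dataset). A "model" that always predicts the majority class still scores high accuracy while catching no churners at all:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 76% stayers (0), 24% churners (1).
y_true = np.array([0] * 76 + [1] * 24)

# A naive "model" that always predicts the majority class "stayer".
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)  # looks decent despite a useless model
rec = recall_score(y_true, y_pred)    # zero: no churner is ever caught
print(acc, rec)
```

Accuracy comes out at 0.76 while recall for the churn class is 0.0, which is exactly the failure mode described above.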
2. Prior probabilities
As you remember, the Gini index was the objective our decision tree minimized, and it is calculated from the probabilities of an observation being 1 or 0. With no other information about those probabilities, when the tree just starts to grow it plugs the proportions of 0s and 1s in the data into the Gini formula as probabilities. As a result, class 0, the stayers, becomes more influential, since stayers make up 76% of the observations in our dataset. This is why our algorithm was able to correctly predict 0s but not 1s. To fix this, we simply tell Python to balance the class weights, which makes the prior probabilities of being 0 and being 1 equal to 50% each. This will likely hurt overall accuracy as a result of the increased Gini impurity, but AUC, and especially recall, should improve, since both classes are now equally important.
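In scikit-learn, balancing the class weights is done with the `class_weight='balanced'` option of `DecisionTreeClassifier`. The sketch below uses a small synthetic imbalanced set, since the chapter's `X`/`y` arrays are not reproduced here; the 76/24 split is the only detail taken from the text:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 76% stayers (0), 24% churners (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 76 + [1] * 24)

# class_weight='balanced' reweights each class inversely to its frequency,
# so 0s and 1s contribute equally to the Gini impurity during splitting.
clf = DecisionTreeClassifier(class_weight='balanced', max_depth=3,
                             random_state=0)
clf.fit(X, y)
```

Without `class_weight`, each observation counts equally and the majority class dominates the impurity calculation; with `'balanced'`, the minority churn class gets proportionally larger weight, which is the prior-probability change described above.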
3. Let's practice!
Let's now implement this change and see what happens.