1. Evaluating the model
So far, we have been using only the overall accuracy score to evaluate the performance of our model. However, it turns out that accuracy alone is not enough to claim that a model is a good one.
2. Prediction errors
To understand what the other evaluation metrics measure, let me first introduce prediction errors. In reality there are two possible outcomes, which means that in general we have four possible situations, presented in this so-called confusion matrix. When the prediction is 0, we call it negative, and when it is 1, it is widely accepted to call it positive. Similarly, when a prediction is correct, we say it is True; otherwise, it is False. Thus, if someone actually left the company but was predicted to be a stayer, we have a False Negative, as the prediction was both False and Negative.
Based on these four possibilities, many different metrics have been developed in analytics to measure the performance of a model, and computing the matrix itself takes only a line of code, as sketched below.
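Here is a minimal sketch of how such a matrix can be computed with scikit-learn. The label arrays are hypothetical, made up purely for illustration, not our actual employee data:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = left the company, 0 = stayed
y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # what actually happened
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # what the model predicted

# Rows are reality, columns are predictions:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[3 1]
                                         #  [1 3]]
```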
3. Evaluation metrics (1)
If the target of your predictions is mostly to focus on those who are churning, then you probably want fewer False Negatives: people who leave in reality but whom your algorithm fails to predict. For that reason, the Recall score can be useful. Higher values of recall correspond to lower values of False Negatives.
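As a quick sketch, recall is available directly in scikit-learn; the labels below are the same made-up ones as before, where 1 marks a leaver:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = left the company, 0 = stayed
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Recall = TP / (TP + FN): the share of actual leavers the model catches
print(recall_score(y_true, y_pred))  # 0.75 -- one leaver was missed
```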
On the other hand, if you want to keep your attention on those who stay, fewer False Positives will be your target, which can be achieved with a higher Specificity score.
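scikit-learn does not ship a dedicated specificity function, but the score can be read off the confusion matrix. Again, the labels here are hypothetical:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = left the company, 0 = stayed
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Specificity = TN / (TN + FP): the share of actual stayers
# correctly predicted to stay
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))  # 0.75 -- one stayer was flagged as a leaver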
4. Evaluation metrics (2)
There are other metrics that can be derived from the same confusion matrix. For example, if one is interested in the percentage of people who truly left the company among those who were predicted to leave, then the Precision score comes in handy.
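A short sketch with scikit-learn's precision_score, once more on the same hypothetical labels:

```python
from sklearn.metrics import precision_score

# Hypothetical labels: 1 = left the company, 0 = stayed
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Precision = TP / (TP + FP): of everyone predicted to leave,
# how many actually left?
print(precision_score(y_true, y_pred))  # 0.75
```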
The reason these scores are important is that overall accuracy provides no information about the separate classes. For example, in our dataset around 76% of employees are stayers. So if we simply say "everybody is staying", we will have a 76% accurate prediction. But in terms of recall, we will get a very low value, as everybody who churned will be wrongly classified.
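To make this concrete, here is a small simulation of that "everybody is staying" baseline. The 76/24 split mirrors the proportion mentioned above, but the arrays themselves are made up:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels with a 76% / 24% stayer/leaver split
y_true = [0] * 76 + [1] * 24
y_pred = [0] * 100  # the naive baseline: "everybody is staying"

print(accuracy_score(y_true, y_pred))  # 0.76 -- looks decent...
print(recall_score(y_true, y_pred))    # 0.0  -- but no leaver is caught
```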
5. Let's practice!
My experience shows that these scores can sound very similar and be difficult to tell apart. If that is how you feel, do not worry: take your time and go over the confusion matrix again to understand the intuition behind each of them.
For now, let's calculate some of these measures for our employee dataset.