
Evaluating classification models

1. Evaluating Classification Models

Now that you've prepared the train-test-validate splits and built your logistic regression models, you need to learn how to evaluate their performance.

2. Ingredients for Performance Measurement

The ingredients needed to measure performance are the same as before. First, you need the actual classes of your observations. Second, you need the predicted classes of these observations. Finally, you need a metric relevant to your problem to compare the two and measure performance.

3. 1) Prepare Actual Classes

To prepare the vector of actual classes, you need to convert the attrition vector from character to binary. If you look at one validate data frame from the cross-validation folds, you can see that the values are all either Yes or No. To convert these to a binary vector, you simply use the equality operator (==) to convert all Yes values to TRUE and all No values to FALSE.
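As a minimal sketch, assuming one validate fold is stored in a data frame called validate and the outcome column is named Attrition (both hypothetical names), the conversion looks like this:

# Inspect the outcome column of one validate fold; values are "Yes" or "No"
head(validate$Attrition)

# Convert the character values to a binary (logical) vector
actual <- validate$Attrition == "Yes"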

4. 2) Prepare Predicted Classes

To prepare the predicted classes, you first need to prepare the probability vector. To do this for a logistic regression model, you use the predict() function with the argument type set to "response". This generates the predicted probability of attrition for each observation. Next, you need to convert these probability values into a binary vector. Here you can assume that any probability greater than 0.5 corresponds to TRUE and any probability less than or equal to 0.5 corresponds to FALSE.
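A minimal sketch, assuming the fitted logistic regression model is stored in model and the validate fold in validate (hypothetical names):

# Predicted probability of attrition for each observation in the validate fold
predicted_prob <- predict(model, newdata = validate, type = "response")

# Convert probabilities to a binary vector using a 0.5 cutoff
predicted <- predicted_prob > 0.5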

5. 3) A metric to compare 1) & 2)

Now that you have the actual and predicted binary vectors, you can think about which metric is appropriate for the problem you are trying to solve. Here I will introduce you to three popular metrics: accuracy, precision, and recall, all three of which are available in the Metrics package you've previously used. To understand these metrics, let's start with the contingency table that compares the actual and predicted values. In R you can generate this using the table() function.
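Continuing the sketch above, the contingency table takes one line of R:

# Contingency table of actual vs. predicted classes
table(actual, predicted)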

6. 3) Metric: Accuracy

The first metric we will consider is accuracy. Accuracy measures how well your model predicts both the TRUE and FALSE classes. This metric can be useful when it is equally important for you to predict the employees that quit and those that don't. You can calculate accuracy using the function of the same name from the Metrics package. Here you have an accuracy of 90%, which, when looking at the contingency table, you can see is primarily driven by the model's ability to correctly classify cases where attrition is FALSE.
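With the actual and predicted vectors from the sketch above, the call is:

library(Metrics)

# Proportion of all predictions (both TRUE and FALSE) that are correct
accuracy(actual, predicted)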

7. 3) Metric: Precision

The next metric we will consider is precision. This metric measures how often the model is correct when it predicts the TRUE class. You calculate it using the precision() function. The resulting value tells you that of the employees the model classified as having quit, 78% of them did indeed leave the company. This metric is appropriate when you want to minimize how often the model incorrectly predicts an observation to be in the positive class.
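Continuing the same sketch:

# Of the observations predicted TRUE, the share that are actually TRUE
precision(actual, predicted)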

8. 3) Metric: Recall

Finally, there is recall. This metric compares the number of observations the model correctly identified as TRUE to the total number of TRUE observations. In other words, it measures the rate at which the model captures the TRUE class. If you are interested in building a model that captures as many at-risk employees as possible, you should consider this metric. You can calculate it using the recall() function. The resulting value tells you that of the employees who quit, the model was able to capture 51% of them. For the attrition model, let's assume that you need to identify as many employees at risk of leaving as possible; as such, the best-performing model will be selected using the recall metric.
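Again continuing the sketch:

# Of the observations that are actually TRUE, the share the model captured
recall(actual, predicted)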

9. Let's practice!

Now you're ready to evaluate your classification models.