Validating logistic regression results
1. Validating logistic regression results
Welcome back! In this lesson, you will answer two questions: first, "Will an employee leave the organization?", and second, "How well does the model perform?". You will answer these questions by converting predicted probabilities to binary responses and creating a confusion matrix.
2. Turnover probability distribution of test cases
In the previous chapter, you plotted the distribution of turnover probability for all observations in the testing dataset using the hist() function.
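As a quick refresher, here is a minimal sketch of that step. The names turnover_model and test_set are placeholders for the fitted glm object and the testing data; your names may differ.

# "turnover_model" and "test_set" are placeholder names for the fitted
# glm object and the testing data used in this course
prediction <- predict(turnover_model, newdata = test_set,
                      type = "response")  # probabilities, not log-odds
hist(prediction)  # distribution of predicted turnover probabilities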
3. Turn probabilities into categories by using a cut-off
Now, to test the accuracy of the model, you need to turn these probabilities into binary responses, i.e., 1 or 0, meaning "Inactive" or "Active". To do this, you need to decide on a cut-off point, i.e., a probability above which you are willing to consider that an employee will leave the organization. Let's assume a cut-off point of 0.5. This means we will assume that any employee with a turnover probability of 0.5 or more will leave the organization.
4. Turn probabilities into categories by using a cut-off
You can use the ifelse() function along with the predicted probabilities to classify employees as either 1 or 0, meaning Inactive or Active, respectively. Now that you have converted the probabilities to binary responses, you are ready to create a confusion matrix.
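As a sketch, assuming the predicted probabilities are stored in the prediction vector from above, the conversion could look like this:

# 1 ("Inactive") if the turnover probability is 0.5 or more, else 0 ("Active")
prediction_cat <- ifelse(prediction >= 0.5, 1, 0)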
5. What is a confusion matrix?
A confusion matrix is used to evaluate the performance of a classification model. It tabulates how many observations were correctly and incorrectly classified by your model.
6. Creating a confusion matrix
You can create a confusion matrix using the table() function. Pass in the vector of predicted values and the turnover column from the test set, and here is your confusion matrix.
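A minimal sketch, again assuming the test set is called test_set and its response column is turnover (placeholder names):

# Rows: predicted status; columns: actual status from the test set
conf_matrix <- table(prediction_cat, test_set$turnover)
conf_matrix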
7. Understanding the confusion matrix
Let's talk about the four numbers in the confusion matrix. The model is trying to predict the probability of turnover, so let's consider Inactive as the positive class and Active as the negative class. Columns correspond to the actual employment status as given in the dataset, and rows show the employment status as predicted by the model. True negatives are the cases where the model correctly identified active employees. True positives are the cases where the model correctly identified inactive employees. False positives are the cases where the model predicted employees to be inactive, but they are actually active. Finally, false negatives are the cases where the model predicted employees to be active, but they are actually inactive.
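With the table built as above (rows predicted, columns actual, and the classes labeled 0 and 1), the four cells can be pulled out by name, for example:

TN <- conf_matrix["0", "0"]  # predicted active,   actually active
TP <- conf_matrix["1", "1"]  # predicted inactive, actually inactive
FP <- conf_matrix["1", "0"]  # predicted inactive, actually active
FN <- conf_matrix["0", "1"]  # predicted active,   actually inactive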
8. Confusion matrix: accuracy
To know how often your model is correct in classifying employees as active or inactive, you can calculate the accuracy of the model. Accuracy is defined as the sum of true positives and true negatives divided by the total number of cases. In this case, the model classifies 93% of observations accurately. In practice, you don't have to calculate the accuracy manually.
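Using the four counts extracted above, the manual calculation is a one-liner:

# Accuracy: correctly classified cases over all cases
accuracy <- (TP + TN) / (TP + TN + FP + FN)
accuracy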
9. Creating a confusion matrix
You can use the confusionMatrix() function from the caret package, which does this for you. Pass the output of table() to confusionMatrix(), and this is what the output looks like.
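A sketch of that call, reusing the conf_matrix table built earlier; the positive argument names the class treated as positive (here "1", i.e. Inactive):

library(caret)
confusionMatrix(conf_matrix, positive = "1")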
10. Output of the confusion matrix
As you can see, several other metrics are reported here along with the accuracy.
11. Resources for advanced methods
There are advanced methods, such as ROC curves and AUC, for determining the best cut-off, but they are beyond the scope of this course, as are the other metrics you saw on the previous slide. You can learn about them in other courses on DataCamp.
12. Let's practice!
Go ahead and build some confusion matrices!