
How good is your model?

1. How good is your model?

Thinking back to classification problems,

2. Classification metrics

recall that we can use accuracy, the fraction of correctly classified labels, to measure model performance. However, accuracy is not always a useful metric.

3. Class imbalance

Consider a model for predicting whether a bank transaction is fraudulent, where only 1% of transactions are actually fraudulent. We could build a model that classifies every transaction as legitimate; this model would have an accuracy of 99%! However, it does a terrible job of actually predicting fraud, so it fails at its original purpose. The situation where one class is more frequent is called class imbalance. Here, the class of legitimate transactions contains way more instances than the class of fraudulent transactions. This is a common situation in practice and requires a different approach to assessing the model's performance.
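To make this concrete, below is a minimal sketch of such a majority-class baseline, using scikit-learn's DummyClassifier on synthetic data with roughly the 1% fraud rate described above. The data, the random seed, and the feature shape are illustrative assumptions, not part of the course dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic, imbalanced data: roughly 1% of labels are fraudulent (the positive class)
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 2))             # placeholder features
y = (rng.random(10_000) < 0.01).astype(int)  # 1 = fraudulent, 0 = legitimate

# A "classifier" that always predicts the majority class (legitimate)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Accuracy is ~0.99 even though this model never flags a single fraudulent transaction
print(accuracy_score(y, baseline.predict(X)))
```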

4. Confusion matrix for assessing classification performance

Given a binary classifier, such as our fraudulent transactions example, we can create a 2-by-2 matrix that summarizes performance called a confusion matrix.

5. Assessing classification performance

Across the top are the predicted labels,

6. Assessing classification performance

and down the side are the actual labels.

7. Assessing classification performance

Given any model, we can fill in the confusion matrix according to its predictions.

8. Assessing classification performance

The true positives are the number of fraudulent transactions correctly labeled;

9. Assessing classification performance

The true negatives are the number of legitimate transactions correctly labeled;

10. Assessing classification performance

The false negatives are the number of fraudulent transactions incorrectly labeled as legitimate;

11. Assessing classification performance

And the false positives are the number of legitimate transactions incorrectly labeled as fraudulent.

12. Assessing classification performance

Usually, the class of interest is called the positive class. As we aim to detect fraud, the positive class is a fraudulent transaction. So why is the confusion matrix important? Firstly, we can retrieve accuracy: it's the sum of true positives and true negatives divided by the total sum of the matrix.
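As a quick illustration, here is that calculation on hypothetical confusion matrix counts chosen to match the 1% fraud rate from earlier; the numbers are made up for this sketch, not taken from any course data.

```python
# Hypothetical counts for a fraud classifier on 10,000 transactions (100 of them fraudulent)
tp, tn, fp, fn = 30, 9_860, 40, 70

# accuracy = correct predictions (TP + TN) divided by the total sum of the matrix
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.989
```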

13. Precision

Secondly, there are other important metrics we can calculate from the confusion matrix. Precision is the number of true positives divided by the sum of all positive predictions. It is also called the positive predictive value. In our case, this is the number of correctly labeled fraudulent transactions divided by the total number of transactions classified as fraudulent. High precision means having a lower false positive rate. For our classifier, this translates to fewer legitimate transactions being classified as fraudulent.
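Using the same hypothetical counts as in the accuracy sketch above, precision works out as follows.

```python
# precision = TP / (TP + FP)
tp, fp = 30, 40
precision = tp / (tp + fp)
print(round(precision, 2))  # 0.43: of all transactions flagged as fraud, about 43% really were fraudulent
```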

14. Recall

Recall is the number of true positives divided by the sum of true positives and false negatives. This is also called sensitivity. High recall reflects a lower false negative rate. For our classifier, it means predicting most fraudulent transactions correctly.
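Again with the same hypothetical counts, recall is computed like this.

```python
# recall (sensitivity) = TP / (TP + FN)
tp, fn = 30, 70
recall = tp / (tp + fn)
print(recall)  # 0.3: the model catches only 30% of the actual fraudulent transactions
```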

15. F1 score

The F1 score is the harmonic mean of precision and recall. This metric gives equal weight to precision and recall, therefore it factors in both the number of errors made by the model and the type of errors. The F1 score favors models with similar precision and recall, and is a useful metric if we are seeking a model which performs reasonably well across both metrics.
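Continuing the hypothetical example, the F1 score combines the precision and recall values computed above as their harmonic mean.

```python
# F1 = 2 * (precision * recall) / (precision + recall), the harmonic mean
precision, recall = 0.43, 0.30
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 2))  # 0.35, pulled toward the lower of the two metrics
```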

16. Confusion matrix in scikit-learn

Using our churn dataset, we compute the confusion matrix along with these metrics. We import classification_report and confusion_matrix from sklearn-dot-metrics, instantiate our classifier, split the data, fit the model on the training data, and predict the labels of the test set.
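A sketch of this workflow is shown below. The slide does not specify the classifier, file name, or column names, so the KNeighborsClassifier, the "churn.csv" file, and the "churn" target column are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier   # assumed classifier for this sketch
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical churn dataset with a binary "churn" target column
churn_df = pd.read_csv("churn.csv")
X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split the data, fit on the training set, and predict the labels of the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```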

17. Confusion matrix in scikit-learn

We pass the test set labels and the predicted labels to the confusion matrix function. We can see 1106 true negatives in the top left.
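Continuing the sketch above, the call looks like this; scikit-learn places the actual labels on the rows and the predicted labels on the columns, which is why the true negatives appear in the top-left cell.

```python
from sklearn.metrics import confusion_matrix

# Rows are actual labels, columns are predicted labels; y_test and y_pred come from the sketch above
print(confusion_matrix(y_test, y_pred))
```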

18. Classification report in scikit-learn

Passing the same arguments to the classification report function outputs all the relevant metrics. It includes precision and recall by class, point-seven-six and point-one-six for the churn class respectively, which highlights how poor the model's recall is on the churn class. Support represents the number of instances for each class within the true labels.
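Continuing the same sketch, the report is produced with the same two arguments.

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, f1-score, and support; y_test and y_pred come from the sketch above
print(classification_report(y_test, y_pred))
```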

19. Let's practice!

Now let's evaluate a classification model using our diabetes dataset!