1. How good is your model?
Thinking back to classification problems,
2. Classification metrics
recall that we can use accuracy, the fraction of correctly classified labels, to measure model performance.
However, accuracy is not always a useful metric.
3. Class imbalance
Consider a model for predicting whether a bank transaction is fraudulent, where only 1% of transactions are actually fraudulent.
We could build a model that classifies every transaction as legitimate; this model would have an accuracy of 99%!
However, it does a terrible job of actually predicting fraud, so it fails at its original purpose.
The situation where one class is more frequent is called class imbalance. Here, the class of legitimate transactions contains way more instances than the class of fraudulent transactions.
This is a common situation in practice and requires a different approach to assessing the model's performance.
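As a quick sketch of this accuracy trap, the snippet below uses made-up numbers (not the course data): a "model" that labels every one of 10,000 hypothetical transactions as legitimate still scores 99% accuracy while catching no fraud at all.

```python
import numpy as np

# Made-up labels: 1% of 10,000 hypothetical transactions are fraudulent (1),
# the rest are legitimate (0).
y_true = np.array([1] * 100 + [0] * 9900)

# A "model" that labels every transaction as legitimate.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(accuracy)  # 0.99 - yet not a single fraudulent transaction is caught
```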
4. Confusion matrix for assessing classification performance
Given a binary classifier, such as our fraudulent transactions example, we can create a 2-by-2 matrix that summarizes performance called a confusion matrix.
5. Assessing classification performance
Across the top are the predicted labels,
6. Assessing classification performance
and down the side are the actual labels.
7. Assessing classification performance
Given any model, we can fill in the confusion matrix according to its predictions.
8. Assessing classification performance
The true positives are the number of fraudulent transactions correctly labeled;
9. Assessing classification performance
The true negatives are the number of legitimate transactions correctly labeled;
10. Assessing classification performance
The false negatives are the number of fraudulent transactions incorrectly labeled as legitimate;
11. Assessing classification performance
And the false positives are the number of transactions incorrectly labeled as fraudulent.
12. Assessing classification performance
Usually, the class of interest is called the positive class. As we aim to detect fraud, the positive class is a fraudulent transaction.
So why is the confusion matrix important?
Firstly, we can retrieve accuracy: it's the sum of true positives and true negatives divided by the total sum of the matrix.
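As an illustration, here is how accuracy falls out of the four cells, using hypothetical counts rather than the course data; the layout follows the slide, with actual labels down the rows and predicted labels across the columns.

```python
# Hypothetical confusion-matrix counts (not the course data).
tn, fp = 9890, 10   # actual legitimate: correctly labeled vs. flagged as fraudulent
fn, tp = 60, 40     # actual fraudulent: missed vs. correctly labeled

accuracy = (tp + tn) / (tn + fp + fn + tp)
print(accuracy)  # 0.993 - correct predictions divided by the total sum of the matrix
```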
13. Precision
Secondly, there are other important metrics we can calculate from the confusion matrix.
Precision is the number of true positives divided by the total number of positive predictions, that is, the sum of true positives and false positives. It is also called the positive predictive value. In our case, this is the number of correctly labeled fraudulent transactions divided by the total number of transactions classified as fraudulent.
High precision means a lower rate of false positives among the positive predictions.
For our classifier, this translates to fewer legitimate transactions being classified as fraudulent.
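Using the hypothetical counts from the sketch above, precision works out like this:

```python
# Precision: true positives divided by all positive predictions
# (true positives plus false positives), with hypothetical counts.
tp, fp = 40, 10
precision = tp / (tp + fp)
print(precision)  # 0.8
```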
14. Recall
Recall is the number of true positives divided by the sum of true positives and false negatives. This is also called sensitivity.
High recall reflects a lower false negative rate.
For our classifier, it means predicting most fraudulent transactions correctly.
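With the same hypothetical counts, recall is:

```python
# Recall (sensitivity): true positives divided by all actual positives
# (true positives plus false negatives), with hypothetical counts.
tp, fn = 40, 60
recall = tp / (tp + fn)
print(recall)  # 0.4
```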
15. F1 score
The F1-score is the harmonic mean of precision and recall. This metric gives equal weight to precision and recall, therefore it factors in both the number of errors made by the model and the type of errors. The F1 score favors models with similar precision and recall, and is a useful metric if we are seeking a model which performs reasonably well across both metrics.
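Continuing with the hypothetical precision and recall values above, the F1 score is their harmonic mean:

```python
# F1 score: the harmonic mean of the hypothetical precision and recall above.
precision, recall = 0.8, 0.4
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # approximately 0.53
```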
16. Confusion matrix in scikit-learn
Using our churn dataset, to compute the confusion matrix along with these metrics, we import classification_report and confusion_matrix from sklearn-dot-metrics.
We instantiate our classifier, split the data, fit the training data, and predict the labels of the test set.
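The sketch below follows these steps; the choice of a k-nearest neighbors classifier and the synthetic stand-in for the churn features and labels are assumptions for illustration, not the course's exact code.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn features X and target y (not the course dataset).
X, y = make_classification(n_samples=1000, weights=[0.85], random_state=42)

# Instantiate the classifier, split the data, fit on the training set,
# and predict the labels of the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```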
17. Confusion matrix in scikit-learn
We pass the test set labels and the predicted labels to the confusion matrix function.
We can see 1106 true negatives in the top left.
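Continuing the sketch above (its counts will differ from the churn numbers shown here), the call looks like this:

```python
# Rows are the actual labels and columns are the predicted labels,
# so the top-left cell holds the true negatives.
print(confusion_matrix(y_test, y_pred))
```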
18. Classification report in scikit-learn
Passing the same arguments to classification report outputs all the relevant metrics.
It includes precision and recall by class, point-seven-six and point-one-six for the churn class respectively, which highlights how poor the model's recall is on the churn class. Support represents the number of instances for each class within the true labels.
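Continuing the same sketch, the call is simply:

```python
# Per-class precision, recall, F1 score, and support in one table.
print(classification_report(y_test, y_pred))
```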
19. Let's practice!
Now let's evaluate a classification model using our diabetes dataset!