
Performance evaluation

1. Performance evaluation

In this video, you'll learn more about performance metrics for fraud detection models.

2. Accuracy isn't everything

As you can see in these two images, accuracy is not a reliable performance metric when working with highly imbalanced data, as is the case in fraud detection. By doing nothing, that is, predicting everything to be the majority class as in the picture on the right, you often obtain a higher accuracy than by actually building a predictive model, as in the picture on the left. So let's discuss other performance metrics that are actually informative and reliable.

3. False positives, false negatives, and actual fraud caught

First of all, you need to understand the concepts of false positives, false negatives, and so on really well for fraud detection, so let's refresh them for a moment. The true positives and true negatives are the cases you predict correctly, in our case, fraud and non-fraud. The images on the top left and bottom right are true negatives and true positives, respectively. The false negatives, as seen in the bottom left, are the cases where you predict the person is not pregnant, but she actually is. These are the cases of fraud you are not catching with your model. The false positives, in the top right, are the cases that we predict to be pregnant but actually aren't. These are "false alarm" cases, and can result in a burden of work while there is actually nothing going on. Depending on the business case, one might care more about false negatives than false positives, or vice versa. A credit card company might want to catch as much fraud as possible and reduce false negatives, as fraudulent transactions can be incredibly costly, whereas a false alarm just means that someone's transaction is blocked. On the other hand, an insurance company cannot handle many false alarms, as it means getting a team of investigators involved for each positive prediction.

4. Precision-recall tradeoff

The credit card company, therefore, wants to optimize for recall, whereas the insurance company cares more about precision. Precision is the fraction of actual fraud cases out of all predicted fraud cases, i.e., the true positives relative to the true positives plus false positives. Recall, conversely, is the fraction of predicted fraud cases out of all the actual fraud cases, i.e., the true positives relative to the true positives plus false negatives. Typically, precision and recall are inversely related: as precision increases, recall falls, and vice versa. You can plot the tradeoff between the two in the precision-recall curve, as seen here on the left. The F-score weighs both precision and recall into one measure, so if you want a performance metric that takes into account the balance between precision and recall, the F-score is the one to use.
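As a quick illustration (this code is not shown in the video), precision, recall, and the F1-score can be computed with scikit-learn; the y_test and y_predicted arrays below are made up for the example:

```python
# Made-up example: y_test are the true labels, y_predicted the model's predictions
from sklearn.metrics import precision_score, recall_score, f1_score

y_test      = [0, 0, 1, 1, 0, 1, 0, 0]  # 1 = fraud, 0 = non-fraud
y_predicted = [0, 1, 1, 0, 0, 1, 0, 0]

# Precision = TP / (TP + FP), recall = TP / (TP + FN)
print(precision_score(y_test, y_predicted))  # 0.67: 2 true positives, 1 false positive
print(recall_score(y_test, y_predicted))     # 0.67: 2 true positives, 1 false negative
print(f1_score(y_test, y_predicted))         # harmonic mean of precision and recall
```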

5. Obtaining performance metrics

Obtaining precision and recall from scikit-learn is relatively straightforward. These are the packages you need. The average precision is calculated with average_precision_score, which you run on the actual labels y_test and your predictions. The curve is obtained in a similar way, and you can then plot it to look at the trade-off between the two.
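A minimal sketch of what this could look like (the y_test labels and y_pred_proba probabilities below are made up; in practice the probabilities would come from your fitted model, for example via model.predict_proba(X_test)[:, 1]):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical true labels and predicted fraud probabilities
y_test = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.05, 0.3]

# Average precision summarizes the precision-recall curve in a single number
average_precision = average_precision_score(y_test, y_pred_proba)

# Precision and recall at every threshold; plotting them shows the tradeoff
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
plt.show()
```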

6. Precision-recall Curve

This returns the following graph.

7. ROC curve to compare algorithms

Another useful tool in the performance toolbox is the ROC curve. ROC stands for receiver operating characteristic curve, and is created by plotting the true positive rate against the false positive rate at various threshold settings. The ROC curve is very useful for comparing performance of different algorithms for your fraud detection problem. The "area under the ROC curve"-metric is easily obtained by getting the model probabilities like this, and then comparing those with the actual labels.
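As a sketch (reusing the made-up labels and probabilities from above rather than a real model), the ROC AUC and ROC curve could be obtained like this:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted fraud probabilities;
# with a real model these would come from model.predict_proba(X_test)[:, 1]
y_test = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.05, 0.3]

# Area under the ROC curve: compare the probabilities with the actual labels
print(roc_auc_score(y_test, y_pred_proba))

# True positive rate against false positive rate at various thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()
```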

8. Confusion matrix and classification report

The confusion matrix and classification report are an absolute must-have for fraud detection performance. You can obtain both from the scikit-learn metrics package. You need the model predictions for these, not the probabilities. The classification report gives you precision, recall, and F1-score for each label, while the confusion matrix plots the false negatives, false positives, etc. for you.
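A minimal sketch, again with made-up labels; note that both functions take hard class predictions (for example from model.predict(X_test)), not probabilities:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and hard predictions (1 = fraud, 0 = non-fraud)
y_test      = [0, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [0, 1, 1, 0, 0, 1, 0, 0]

# Precision, recall, and F1-score per class
print(classification_report(y_test, y_predicted))

# Rows are true classes, columns predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_predicted))
```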

9. Let's practice!

Let's practice!