
Loss functions Part II

1. Loss functions Part II

In the previous lesson, we saw that some users might care more about false positives than false negatives, or vice versa. The good news is that you can actually tune your classifier to get this balance right. Let's see how.

2. Probability scores

Take the example of Gaussian Naive Bayes. Rather than using the method .predict(), try calling the method .predict_proba(), which stands for "predict probability". This outputs the probabilities of the negative and the positive class label for each example. These two probabilities sum to 1, and can be converted to a label by thresholding the probability of the positive class, which is the second element of each pair of scores, at 0.5. A list comprehension is an easy way to do this.
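
As a rough sketch of how this might look in scikit-learn, with a hypothetical synthetic dataset standing in for the course's data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic data in place of the course's dataset
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GaussianNB().fit(X_train, y_train)

# Each row of scores is [P(negative class), P(positive class)]; the two sum to 1
scores = clf.predict_proba(X_test)

# Threshold the positive-class probability (second element) at 0.5
preds = [int(s[1] > 0.5) for s in scores]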

3. Probability scores

You can see here the effect the threshold has on false positives and false negatives. At one extreme, the classifier labels every example positive, producing too many false positives. As the threshold increases, it labels more and more examples as negative, eliminating false positives but increasing the number of false negatives, until it reaches the other extreme.
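
To see this trade-off for yourself, you could sweep the threshold from 0 to 1 and count false positives and false negatives at each value, continuing the sketch above:

import numpy as np

# Positive-class probabilities from the previous sketch
probs = scores[:, 1]

for t in np.linspace(0, 1, 5):
    preds_t = (probs >= t).astype(int)
    fp = np.sum((preds_t == 1) & (y_test == 0))
    fn = np.sum((preds_t == 0) & (y_test == 1))
    print(f"threshold={t:.2f}  FP={fp}  FN={fn}")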

4. ROC curves

This trade-off between false positives and false negatives is captured by the so-called Receiver Operating Characteristic curve, or ROC curve. It uses two metrics you have seen before: the false positive rate, or FPR, and recall, also known as the true positive rate, or TPR. The function roc_curve() returns the values of FPR and TPR for a range of threshold values.
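
Continuing the sketch, roc_curve() from sklearn.metrics takes the true labels and the positive-class probabilities:

from sklearn.metrics import roc_curve

# Returns FPR, TPR and the threshold values at which they were computed
fpr, tpr, thresholds = roc_curve(y_test, probs)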

5. ROC curves

The ROC curve plots the TPR against the FPR for all possible values of the threshold. At the bottom left there are no false positives, and at the top right there are no false negatives. The closer the ROC curve is to the top-left corner, the better the classifier, whereas a random classifier performs along the diagonal of this chart.
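
A plot along these lines could be produced with matplotlib, reusing the fpr and tpr arrays from the previous sketch; the dashed diagonal marks the random classifier:

import matplotlib.pyplot as plt

# Plot TPR against FPR for all thresholds
plt.plot(fpr, tpr, label="Gaussian Naive Bayes")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()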

6. ROC curves

Comparing classifiers in ROC space is much more informative than comparing them on the basis of a single metric. If one ROC curve lies strictly above another, the former classifier is superior to the latter for all possible misclassification costs. For example, AdaBoost here outperforms Gaussian Naive Bayes everywhere.
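
To reproduce this kind of comparison, one could fit an AdaBoostClassifier on the same hypothetical split and overlay the two curves (the exact curves on the slide will differ):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# Fit AdaBoost on the same training data and get its positive-class probabilities
ada = AdaBoostClassifier(random_state=42).fit(X_train, y_train)
ada_probs = ada.predict_proba(X_test)[:, 1]

# Overlay both ROC curves on one set of axes
for name, p in [("Gaussian Naive Bayes", probs), ("AdaBoost", ada_probs)]:
    f, t, _ = roc_curve(y_test, p)
    plt.plot(f, t, label=name)

plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()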

7. ROC curves

If the two curves cross instead, the choice between classifiers should depend on the relative misclassification costs or other domain-specific conditions.

8. AUC

A metric that captures this graphical way of comparing classifiers in a single number is the Area Under the Curve, or AUC, computed by the roc_auc_score() function. A value of 0.5 indicates that the classifier is no better than chance, and a value of 1.0 indicates perfect performance. So AdaBoost here is nearly perfect.
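
Continuing the sketch, roc_auc_score() takes the same inputs as roc_curve():

from sklearn.metrics import roc_auc_score

# 0.5 means no better than chance, 1.0 means perfect separation
print("Gaussian Naive Bayes AUC:", roc_auc_score(y_test, probs))
print("AdaBoost AUC:", roc_auc_score(y_test, ada_probs))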

9. Cost minimisation

AUC is a great way to identify algorithms that do well in a variety of settings. However, if you have domain knowledge about costs, it is always better to use it. For example, if you know that false positives are ten times more costly than false negatives, you can create a custom metric that computes the total cost of false positives and false negatives. Here you can see the total cost for several possible thresholds. The fourth value is the smallest, which corresponds to a threshold of 0.75. This makes sense: the default choice of 0.5 places equal weight on false positives and false negatives, but here false positives are costlier, pushing the optimal threshold a bit to the right!
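
A rough version of such a cost calculation, reusing the probabilities from the earlier sketch; the thresholds below are illustrative, so the numbers will differ from the slide's:

import numpy as np

# False positives cost ten times as much as false negatives
cost_fp, cost_fn = 10, 1

for t in [0.25, 0.5, 0.75, 0.9]:
    preds_t = (probs >= t).astype(int)
    fp = np.sum((preds_t == 1) & (y_test == 0))
    fn = np.sum((preds_t == 0) & (y_test == 1))
    print(f"threshold={t:.2f}  total cost={cost_fp * fp + cost_fn * fn}")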

10. Each use case is different!

In this chapter, you learned to make use of as much domain knowledge as you can to optimize your pipeline. You are nearly ready to productize and maintain your models, which is the topic of the next chapter. But before that, let's practice!