Loss functions Part I
1. Loss functions Part I
So far in this course you have used the percentage of examples that the classifier gets right, known as accuracy, to measure classification performance. Accuracy might not always be the right choice, and there are many interesting alternatives.

2. The KDD '99 cup dataset
Let's introduce a new cyber dataset from the KDD 1999 competition. The entity analyzed is again a flow, but this time a cyber analyst has extracted a very large number of additional features from the raw data. Don't worry, you don't need to understand what these mean. Just note that some of them, like `protocol_type`, are categorical, so you would have to encode them numerically using the techniques of Chapter 1.
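
The snippet below is a minimal sketch of how such a dataset might be prepared. The file name and the list of categorical columns are illustrative placeholders, not part of the original exercise.

```python
import pandas as pd

# Assume the KDD '99 flow features have been exported to a CSV file
# (the file name below is a placeholder).
flows = pd.read_csv("kddcup99_sample.csv")

# Categorical columns such as protocol_type must be encoded numerically,
# for example with one-hot encoding, before most classifiers can use them.
# The column list here is an assumption for illustration.
categorical_cols = ["protocol_type", "service", "flag"]
flows_encoded = pd.get_dummies(flows, columns=categorical_cols)

print(flows_encoded.shape)
```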

3. False positives vs false negatives
It is good practice to explicitly encode the label using a Boolean flag, so that you are sure of the meaning of "True" vs "False". Usually "True" is used for the detection of the event of interest, so "bad" events map to "True". We then fit a naive Bayes classifier and store its predictions alongside the ground truth in a DataFrame. Let us inspect some examples.
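
A hedged sketch of that workflow follows. The variable names (`flows_encoded`, `labels`), the choice of GaussianNB as the naive Bayes variant, and the assumption that raw labels are strings such as "normal." are illustrative, not prescribed by the lesson.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assume `flows_encoded` holds only numeric features and `labels` holds the
# raw string labels, with "normal." marking benign traffic (an assumption).
y = labels != "normal."   # True flags the event of interest: "bad" traffic
X = flows_encoded

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a naive Bayes classifier (Gaussian variant chosen for illustration).
clf = GaussianNB().fit(X_train, y_train)

# Store predictions alongside the ground truth for inspection.
results = pd.DataFrame({
    "ground_truth": y_test.values,
    "predicted": clf.predict(X_test),
})
print(results.head())
```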

4. False positives vs false negatives
The last example was a case of normal traffic that was mislabelled as bad: a false alarm. We can also refer to this as a false positive.

5. False positives vs false negatives
The converse mistake involves classifying a case of bad traffic as normal, which is known as a miss, or a false negative.

6. False positives vs false negatives
We can similarly split the correct classifications into true positives and true negatives.
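
One simple way to separate the four outcome types, assuming the `results` DataFrame with Boolean `ground_truth` and `predicted` columns from the earlier sketch, is with Boolean masks:

```python
# Each mask combines the ground truth and the prediction.
true_positives  = results[ results.ground_truth &  results.predicted]
true_negatives  = results[~results.ground_truth & ~results.predicted]
false_positives = results[~results.ground_truth &  results.predicted]  # false alarms
false_negatives = results[ results.ground_truth & ~results.predicted]  # misses

print(len(false_positives), len(false_negatives))
```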

7. The confusion matrix
False positives, false negatives, true positives and true negatives together form the so-called confusion matrix. You can use the ravel() method to flatten the matrix and unpack the four counts into separate variables.
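
For Boolean labels, scikit-learn orders the binary confusion matrix as [[tn, fp], [fn, tp]], so ravel() unpacks the counts directly (reusing the `results` DataFrame from before):

```python
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels, sorted False then True.
tn, fp, fn, tp = confusion_matrix(results.ground_truth, results.predicted).ravel()
print(tn, fp, fn, tp)
```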

8. Scalar performance metrics
Many performance metrics can be expressed as simple ratios of entries of the confusion matrix. For example, accuracy is the sum of true positives and true negatives divided by the total number of examples. Another example is recall, also known as the true positive rate, which is the proportion of true positives over all positive examples. The false positive rate is the proportion of false positives over all negative examples. Precision is the proportion of true positives over all examples classified as positive by the algorithm. Finally, F1 is the harmonic mean of precision and recall. Most of these metrics are available from the metrics module in scikit-learn.
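
A short sketch computing these metrics with scikit-learn, reusing the `results` DataFrame and the counts unpacked above (there is no dedicated false positive rate function, so it is derived from the confusion matrix entries):

```python
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

y_true = results.ground_truth
y_pred = results.predicted

print("accuracy :", accuracy_score(y_true, y_pred))    # (tp + tn) / total
print("recall   :", recall_score(y_true, y_pred))      # tp / (tp + fn)
print("precision:", precision_score(y_true, y_pred))   # tp / (tp + fp)
print("f1       :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall

# False positive rate, computed from the confusion matrix counts.
print("fpr      :", fp / (fp + tn))
```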

9. False positives vs false negatives
The confusion matrix is preferred over the scalar metrics just discussed, because it makes it clear that classification performance is not one-dimensional. Consider a classifier producing 3 errors of each type, versus one that produces no false positives but 26 false negatives. You might think that the first classifier is better because it makes fewer errors in total. But what if the real-life cost of a false positive is much higher than that of a false negative? For example, in forensic science, a false conviction due to a false positive DNA test is a grave error. Ideally, you should assign a specific value to this relative cost. For example, assume that a false positive costs ten times as much as a false negative and compute the total cost of each classifier: the first incurs 3 × 10 + 3 × 1 = 33, while the second incurs 0 × 10 + 26 × 1 = 26. According to this relative misclassification cost, the second classifier achieves the lower cost of 26, rather than 33. Determining the cost is yet another thing you have to do in close collaboration with the domain expert; it cannot be ascertained from the data itself.
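
The arithmetic can be made explicit in a few lines; the 10-to-1 cost ratio is the illustrative assumption from the slide:

```python
# Relative misclassification costs: a false positive is assumed to cost
# ten times as much as a false negative.
COST_FP, COST_FN = 10, 1

def total_cost(n_fp, n_fn):
    """Total cost of a classifier's errors under the assumed cost ratio."""
    return COST_FP * n_fp + COST_FN * n_fn

print(total_cost(3, 3))    # classifier with 3 errors of each type -> 33
print(total_cost(0, 26))   # classifier with 0 FP and 26 FN        -> 26
```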

10. Which classifier is better?
You will now revisit some of the classifiers you built in previous lessons and reevaluate them using different metrics. Is the winner still the same?
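
As a rough illustration of what this comparison might look like (the specific classifiers and data splits from previous lessons will differ), one could loop over candidate models and report more than one metric:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical pair of candidates; the ranking can change with the metric.
for clf in (GaussianNB(), DecisionTreeClassifier(random_state=0)):
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    name = clf.__class__.__name__
    print(name,
          "accuracy:", accuracy_score(y_test, y_pred),
          "f1:", f1_score(y_test, y_pred))
```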