Loss functions Part I
1. Loss functions Part I
So far in this course you have used the percentage of examples that the classifier gets right, known as accuracy, to measure classification performance. Accuracy might not always be the right choice, and there are many interesting alternatives.

2. The KDD '99 cup dataset
Let's introduce a new cyber dataset from the KDD 1999 competition. The entity analyzed is again a flow, but this time a cyber analyst has extracted a very large number of additional features from the raw data. Don't worry, you don't need to understand what these mean. Just note that some of them, like `protocol_type`, are categorical, so you would have to encode them numerically using the techniques of Chapter 1.
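
The snippet below is a minimal sketch of how such a dataset might be prepared. The file name and the list of categorical columns are illustrative placeholders, not part of the original exercise.

```python
import pandas as pd

# Assume the KDD '99 flow features have been exported to a CSV file
# (the file name below is a placeholder).
flows = pd.read_csv("kddcup99_sample.csv")

# Categorical columns such as protocol_type must be encoded numerically,
# for example with one-hot encoding, before most classifiers can use them.
# The column list here is an assumption for illustration.
categorical_cols = ["protocol_type", "service", "flag"]
flows_encoded = pd.get_dummies(flows, columns=categorical_cols)

print(flows_encoded.shape)
```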

3. False positives vs false negatives
It is good practice to explicitly encode the label using a Boolean flag, so that you are sure of the meaning of "True" vs "False". Usually "True" is used for the detection of the event of interest, so "bad" events map to "True". We then fit a naive Bayes classifier and store its predictions alongside the ground truth in a DataFrame. Let us inspect some examples.
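
A hedged sketch of that workflow follows. The variable names (`flows_encoded`, `labels`), the choice of GaussianNB as the naive Bayes variant, and the assumption that raw labels are strings such as "normal." are illustrative, not prescribed by the lesson.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assume `flows_encoded` holds only numeric features and `labels` holds the
# raw string labels, with "normal." marking benign traffic (an assumption).
y = labels != "normal."   # True flags the event of interest: "bad" traffic
X = flows_encoded

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a naive Bayes classifier (Gaussian variant chosen for illustration).
clf = GaussianNB().fit(X_train, y_train)

# Store predictions alongside the ground truth for inspection.
results = pd.DataFrame({
    "ground_truth": y_test.values,
    "predicted": clf.predict(X_test),
})
print(results.head())
```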

4. False positives vs false negatives
The last example was a case of normal traffic that was mislabelled as bad: a false alarm. We can also refer to this as a false positive.

5. False positives vs false negatives
The converse mistake involves classifying a case of bad traffic as normal, which is known as a miss, or a false negative.

6. False positives vs false negatives
We can similarly split the correct classifications into true positives and true negatives.
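
One simple way to separate the four outcome types, assuming the `results` DataFrame with Boolean `ground_truth` and `predicted` columns from the earlier sketch, is with Boolean masks:

```python
# Each mask combines the ground truth and the prediction.
true_positives  = results[ results.ground_truth &  results.predicted]
true_negatives  = results[~results.ground_truth & ~results.predicted]
false_positives = results[~results.ground_truth &  results.predicted]  # false alarms
false_negatives = results[ results.ground_truth & ~results.predicted]  # misses

print(len(false_positives), len(false_negatives))
```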

7. The confusion matrix
False positives, false negatives, true positives and true negatives together form the so-called confusion matrix. You can use the ravel() method to flatten the matrix and unpack the four counts into separate variables.
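
For Boolean labels, scikit-learn orders the binary confusion matrix as [[tn, fp], [fn, tp]], so ravel() unpacks the counts directly (reusing the `results` DataFrame from before):

```python
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels, sorted False then True.
tn, fp, fn, tp = confusion_matrix(results.ground_truth, results.predicted).ravel()
print(tn, fp, fn, tp)
```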

8. Scalar performance metrics
Many performance metrics can be expressed as simple ratios of entries of the confusion matrix. For example, accuracy is the sum of true positives and true negatives divided by the total number of examples. Another example is recall, also known as the true positive rate, which is the proportion of true positives over all positive examples. The false positive rate is the proportion of false positives over all negative examples. Precision is the proportion of true positives over all examples classified as positive by the algorithm. Finally, F1 is the harmonic mean of precision and recall. Most of these metrics are available from the metrics module in scikit-learn.
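
A short sketch computing these metrics with scikit-learn, reusing the `results` DataFrame and the counts unpacked above (there is no dedicated false positive rate function, so it is derived from the confusion matrix entries):

```python
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

y_true = results.ground_truth
y_pred = results.predicted

print("accuracy :", accuracy_score(y_true, y_pred))    # (tp + tn) / total
print("recall   :", recall_score(y_true, y_pred))      # tp / (tp + fn)
print("precision:", precision_score(y_true, y_pred))   # tp / (tp + fp)
print("f1       :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall

# False positive rate, computed from the confusion matrix counts.
print("fpr      :", fp / (fp + tn))
```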

9. False positives vs false negatives
The confusion matrix is preferred over the scalar metrics just discussed, because it makes it clear that classification performance is not one-dimensional. Consider a classifier producing 3 errors of each type, versus one that produces no false positives but 26 false negatives. You might think that the first classifier is better because it makes fewer errors in total. But what if the real-life cost of a false positive is much higher than that of a false negative? For example, in forensic science, a false conviction due to a false positive DNA test is a grave error. Ideally, you should assign a specific value to this relative cost. For example, assume that a false positive costs ten times as much as a false negative and compute the total cost of each classifier: the first incurs 3 × 10 + 3 × 1 = 33, while the second incurs 0 × 10 + 26 × 1 = 26. According to this relative misclassification cost, the second classifier achieves the lower cost of 26, rather than 33. Determining the cost is yet another thing you have to do in close collaboration with the domain expert; it cannot be ascertained from the data itself.
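
The arithmetic can be made explicit in a few lines; the 10-to-1 cost ratio is the illustrative assumption from the slide:

```python
# Relative misclassification costs: a false positive is assumed to cost
# ten times as much as a false negative.
COST_FP, COST_FN = 10, 1

def total_cost(n_fp, n_fn):
    """Total cost of a classifier's errors under the assumed cost ratio."""
    return COST_FP * n_fp + COST_FN * n_fn

print(total_cost(3, 3))    # classifier with 3 errors of each type -> 33
print(total_cost(0, 26))   # classifier with 0 FP and 26 FN        -> 26
```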

10. Which classifier is better?
You will now revisit some of the classifiers you built in previous lessons and reevaluate them using different metrics. Is the winner still the same?
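
As a rough illustration of what this comparison might look like (the specific classifiers and data splits from previous lessons will differ), one could loop over candidate models and report more than one metric:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical pair of candidates; the ranking can change with the metric.
for clf in (GaussianNB(), DecisionTreeClassifier(random_state=0)):
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    name = clf.__class__.__name__
    print(name,
          "accuracy:", accuracy_score(y_test, y_pred),
          "f1:", f1_score(y_test, y_pred))
```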