Anomaly detection
1. Anomaly detection
Welcome back! In this last chapter, we will examine cases where supervised learning workflows fail because of a lack of labels, and what we can do to overcome this problem.
2. Anomalies and outliers
Anomaly detection is about detecting data points that seem out of place in a big dataset, like the red dots in the picture on the left, but without using labeled training examples, as in the picture on the right. Anomaly detection falls under the broader category of unsupervised learning.
3. Anomalies and outliers
Learning without supervision seems harder, but is sometimes the only practical approach. For example, if one of the two classes is very rare, then supervised learning does not work very well, and alternatives are needed. Moreover, if rapid dataset shift is present, past examples are unlikely to be representative of future ones, so looking for "abnormal" patterns in an unsupervised manner is a better strategy. Domains posing such challenges include cybersecurity, fraud detection, anti-money laundering, and fault detection.
4. Unsupervised workflows
The lack of labels poses two challenges: first, how do we fit an estimator to unlabeled examples? And, second, how do we know whether it is any good, without access to labeled examples for performance assessment? The first challenge can be overcome in several ways, but there is no way to escape the second: in practice, anomaly detection still needs a few labeled examples. To avoid overfitting, resist the temptation to switch to supervised learning, and use these labels for model selection only. Even keeping labels aside for validation is sometimes a luxury you cannot afford.
5. Local outliers
The most important class of anomaly detectors is outlier-based techniques. An outlier is a data point that lies far away from most of the data. This can be because it exceeds the minimum or maximum value of the data in some dimension, or, in the case of a local outlier, because it lies in the empty space between clusters.
6. Local outlier factor (LoF)
This concept of a local outlier forms the basis of one of the most popular anomaly detection algorithms, known as local outlier factor, or LoF. LoF compares the density around a given data point with the density around its nearest neighbours to detect outliers. For example, the red point is very isolated compared to its nearest neighbour, shown in blue.
7. Local outlier factor (LoF)
The LoF algorithm is available from the neighbors module in scikit-learn. It only supports a single method, fit_predict, reflecting the fact that we fit and predict on the same data. Only X is needed for training: labels will only be used for validation. You can see from the picture and the confusion matrix that the algorithm detected all outliers, but also falsely flagged some normal data at the edge of the two clusters as outliers. LoF labels normal data with 1 and anomalies with -1. These predictions are obtained by thresholding the scores available in the negative_outlier_factor_ attribute: the higher the value, the more normal the example.
8. Local outlier factor (LoF)
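The workflow just described can be sketched as follows. The two-cluster dataset here is synthetic and only illustrative: the cluster locations, sizes, and the number of outliers are assumptions, not the data used on the slide.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative synthetic data: two tight clusters plus a few scattered points
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),  # cluster 1
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),  # cluster 2
    rng.uniform(low=-2.0, high=7.0, size=(5, 2)),   # potential outliers
])

# Fit and predict on the same data; no labels are involved
lof = LocalOutlierFactor()
preds = lof.fit_predict(X)  # 1 = normal, -1 = anomaly

# Outlier scores: the higher (closer to -1), the more normal the example
scores = lof.negative_outlier_factor_
```

Note that any labels you do have play no role in fit_predict; they would only enter the picture later, when comparing predictions against ground truth with, say, a confusion matrix.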
Given what you know about classification scores, you might suspect we can improve our false positive rate by tuning some threshold. You are right! The LoF algorithm produces scores, but rather than thresholding with a fixed value, it asks the user to express a belief about what percentage of the data is likely to be anomalous. It then ranks all the data points by their outlier score and flags the highest-scoring ones as anomalous. This parameter is called contamination, and its default value is 0.1 (recent scikit-learn versions default to 'auto' instead). Setting it to 0.02 gets us to perfect performance, without using any labels for training!
9. Who needs labels anyway!
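In code, the contamination parameter is simply passed to the constructor. The dataset below is again a synthetic stand-in (its shape and contents are assumptions for illustration); the point is that the fraction of points flagged as -1 tracks the contamination value you supply.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Same kind of illustrative data: two clusters plus a few scattered points
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
    rng.uniform(low=-2.0, high=7.0, size=(5, 2)),
])

# Tell LoF we believe about 2% of the data is anomalous
lof = LocalOutlierFactor(contamination=0.02)
preds = lof.fit_predict(X)

# Roughly 2% of the 205 points should be flagged with -1
n_flagged = int((preds == -1).sum())
```

With labels held aside purely for validation, you could now sweep contamination over a small grid and pick the value with the best confusion matrix, which is model selection, not training.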
Time for you to build your first unsupervised workflow!