Anomaly detection
1. Anomaly detection
Welcome back! In this last chapter, we will examine cases where supervised learning workflows fail because of a lack of labels, and what we can do to overcome this problem.
2. Anomalies and outliers
Anomaly detection is about detecting data points that seem out of place in a big dataset, like the red dots in the picture on the left, but without using labeled training examples, as in the picture on the right. Anomaly detection falls under the broader category of unsupervised learning.
3. Anomalies and outliers
Learning without supervision seems harder, but is sometimes the only practical approach. For example, if one of the two classes is very rare, then supervised learning does not work very well, and alternatives are needed. Moreover, if rapid dataset shift is present, past examples are unlikely to be representative of future ones, so looking for "abnormal" patterns in an unsupervised manner is a better strategy. Domains posing such challenges include cybersecurity, fraud detection, anti-money laundering, and fault detection.
4. Unsupervised workflows
The lack of labels poses two challenges: first, how do we fit an estimator to unlabeled examples? And, second, how do we know whether it is any good, without access to labeled examples for performance assessment? The first challenge can be overcome in several ways, but there is no way to escape the second: in practice, anomaly detection still needs a few labeled examples. To avoid overfitting, resist the temptation to switch to supervised learning, and use these labels for model selection only. Even keeping labels aside for validation is sometimes a luxury you cannot afford.
5. Local outliers
The most important class of anomaly detectors is outlier-based techniques. An outlier is a data point that lies far away from most of the data. This can be because it exceeds the minimum or maximum value of the data in some dimension, or, in the case of a local outlier, because it lies in the empty space between clusters.
6. Local outlier factor (LoF)
This concept of a local outlier forms the basis of one of the most popular anomaly detection algorithms, known as local outlier factor, or LoF. LoF compares the density around a given data point with the density around its nearest neighbours to detect outliers. For example, the red point is very isolated compared to its nearest neighbour, shown in blue.
7. Local outlier factor (LoF)
The LoF algorithm is available from the neighbors module in scikit-learn. It only supports a single method, fit_predict, reflecting the fact that we fit and predict on the same data. Only X is needed for training: labels will only be used for validation. You can see from the picture and the confusion matrix that the algorithm detected all outliers, but also falsely flagged some normal data at the edge of the two clusters as outliers. LoF labels normal data with 1 and anomalies with -1. These predictions are obtained by thresholding the scores available in the negative_outlier_factor_ attribute: the higher the value, the more normal the example.
8. Local outlier factor (LoF)
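The workflow just described can be sketched as follows. The two-cluster dataset here is synthetic and only illustrative: the cluster locations, sizes, and the number of outliers are assumptions, not the data used on the slide.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative synthetic data: two tight clusters plus a few scattered points
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),  # cluster 1
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),  # cluster 2
    rng.uniform(low=-2.0, high=7.0, size=(5, 2)),   # potential outliers
])

# Fit and predict on the same data; no labels are involved
lof = LocalOutlierFactor()
preds = lof.fit_predict(X)  # 1 = normal, -1 = anomaly

# Outlier scores: the higher (closer to -1), the more normal the example
scores = lof.negative_outlier_factor_
```

Note that any labels you do have play no role in fit_predict; they would only enter the picture later, when comparing predictions against ground truth with, say, a confusion matrix.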
Given what you know about classification scores, you might suspect we can improve our false positive rate by tuning some threshold. You are right! The LoF algorithm produces scores, but rather than thresholding with a fixed value, it asks the user to express a belief about what percentage of the data is likely to be anomalous. It then ranks all the data points by their outlier score and flags the highest-scoring ones as anomalous. This parameter is called contamination, and its default value is 0.1 (recent scikit-learn versions default to 'auto' instead). Setting it to 0.02 gets us to perfect performance, without using any labels for training!
9. Who needs labels anyway!
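In code, the contamination parameter is simply passed to the constructor. The dataset below is again a synthetic stand-in (its shape and contents are assumptions for illustration); the point is that the fraction of points flagged as -1 tracks the contamination value you supply.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Same kind of illustrative data: two clusters plus a few scattered points
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
    rng.uniform(low=-2.0, high=7.0, size=(5, 2)),
])

# Tell LoF we believe about 2% of the data is anomalous
lof = LocalOutlierFactor(contamination=0.02)
preds = lof.fit_predict(X)

# Roughly 2% of the 205 points should be flagged with -1
n_flagged = int((preds == -1).sum())
```

With labels held aside purely for validation, you could now sweep contamination over a small grid and pick the value with the best confusion matrix, which is model selection, not training.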
Time for you to build your first unsupervised workflow!