1. Novelty detection
Outlier detection flags anomalies in the training data itself. To train an anomaly detector on some data and then deploy it to detect anomalies in future, unseen examples, you need to switch to novelty detection instead.
2. One-class classification
The key feature of novelty detection is that anomalies appear only as "novel" patterns in future data and must be absent during training. It is sometimes known as one-class classification because it resembles supervised learning in which only one class, the normal class, is represented in the training data.
3. Novelty LoF
There is a very simple workaround that converts any outlier detector into a novelty detector.
First, concatenate the test data with the training data and run the outlier detector on the combined set. Then, filter the output so that it only contains the outliers found among the test data.
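The workaround can be sketched as follows; the data and variable names here are hypothetical, with LoF standing in for any outlier detector:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data for illustration.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))                    # assumed anomaly-free
X_test = np.vstack([rng.normal(size=(20, 2)),          # normal test points
                    rng.normal(loc=6, size=(5, 2))])   # novel patterns

# Workaround: run the outlier detector on train + test combined ...
X_all = np.vstack([X_train, X_test])
labels_all = LocalOutlierFactor().fit_predict(X_all)   # -1 = outlier, 1 = inlier

# ... then keep only the labels that correspond to the test rows.
labels_test = labels_all[len(X_train):]
```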
The results are good, but the code quality is low. Recall that our production script must contain a single command: a predict call on a model object.
LoF supports this mode by setting its novelty parameter to True.
You can then fit the model to the training data, and predict the labels of the test data, just like previous workflows. This also gives you direct control of the percentage of data points that are flagged as anomalies in the test data.
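A minimal sketch of this workflow, with hypothetical data and an illustrative contamination value:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 2))                    # assumed anomaly-free
X_test = np.vstack([rng.normal(size=(20, 2)),
                    rng.normal(loc=6, size=(5, 2))])

# novelty=True turns LoF into a novelty detector with fit/predict;
# contamination controls how aggressively points are flagged.
lof = LocalOutlierFactor(novelty=True, contamination=0.2)
lof.fit(X_train)
labels = lof.predict(X_test)    # -1 = anomaly, 1 = normal
```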
4. One-class Support Vector Machine
Another popular novelty detector is the one-class Support Vector Machine, or one-class SVM. All you need to know for this course is that it follows exactly the same interface as novelty LoF.
First, you fit the estimator to an unlabelled training set which is assumed free of anomalies. Then, you can produce predictions on any test data.
Just like LoF, one-class SVM uses 1 to label normal data points and -1 for anomalies.
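A sketch of the same fit-then-predict pattern with one-class SVM, again on hypothetical data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 2))                    # assumed anomaly-free
X_test = np.vstack([rng.normal(size=(20, 2)),
                    rng.normal(loc=6, size=(5, 2))])

# Same interface as novelty LoF: fit on training data, predict on test data.
svm = OneClassSVM().fit(X_train)
labels = svm.predict(X_test)    # 1 = normal, -1 = anomaly
```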
However, a quick glance at the results reveals a problem: the percentage of outliers is far too large. One-class SVM does not have a contamination parameter like LoF does, but you can still control that percentage indirectly by accessing the raw classification scores, just as we did for supervised classification.
5. One-class Support Vector Machine
Raw scores are available via the method .score_samples(). The lower the score, the more anomalous the data point.
So to label 10% of the test data as anomalous, you need to look at data points that score in the bottom 10% tail of the score distribution. You can use the numpy function quantile to calculate this threshold: setting q to 0.1 will give you the value that separates the lowest-scoring 10% of the scores from the rest.
You then threshold the scores by this value. In a production workflow, you might want to recalculate this threshold at regular intervals to protect yourself against dataset shift.
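The quantile-based thresholding can be sketched like this, on hypothetical data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 2))       # assumed anomaly-free
X_test = rng.normal(size=(100, 2))

svm = OneClassSVM().fit(X_train)
scores = svm.score_samples(X_test)        # lower score = more anomalous

# Flag the bottom 10% of scores as anomalies.
threshold = np.quantile(scores, 0.1)
labels = np.where(scores <= threshold, -1, 1)
```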
6. Isolation Forests
Another popular model is the Isolation Forest, which adapts the idea behind Random Forests to novelty detection. It is available from the ensemble module in scikit-learn and follows exactly the same interface. The .score_samples() method is implemented by other novelty detectors, too, such as LoF.
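Here is the same pattern once more with Isolation Forest, on hypothetical data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X_train = rng.normal(size=(100, 2))                    # assumed anomaly-free
X_test = np.vstack([rng.normal(size=(20, 2)),
                    rng.normal(loc=6, size=(5, 2))])

# Identical interface: fit, predict, and score_samples.
forest = IsolationForest(random_state=0).fit(X_train)
labels = forest.predict(X_test)         # 1 = normal, -1 = anomaly
scores = forest.score_samples(X_test)   # lower = more anomalous
```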
7. Model selection
The sensitivity of novelty detection algorithms to the choice of threshold can be a problem when comparing several detectors.
It is therefore best to use AUC on the raw scores for model selection. For example, SVM performs best with respect to AUC.
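A sketch of threshold-free model selection with AUC; the test labels here are hypothetical ground truth that you would need for evaluation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X_train = rng.normal(size=(200, 2))                    # assumed anomaly-free
X_test = np.vstack([rng.normal(size=(80, 2)),          # normal points
                    rng.normal(loc=6, size=(20, 2))])  # true anomalies
y_test = np.array([0] * 80 + [1] * 20)                 # hypothetical labels: 1 = anomaly

aucs = {}
for model in [OneClassSVM(), LocalOutlierFactor(novelty=True)]:
    model.fit(X_train)
    # score_samples: lower = more anomalous, so negate the scores
    # so that larger values indicate the positive (anomalous) class.
    aucs[type(model).__name__] = roc_auc_score(y_test, -model.score_samples(X_test))
```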
8. Model selection
However, it appears useless when evaluated with accuracy. This is because it uses a very different default threshold, which happens to be inappropriate for this dataset. The way it ranks the data points, however, remains superior to the other algorithms, which is exactly what AUC reveals.
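The mismatch between accuracy and AUC can be illustrated on hypothetical data: with its default threshold, one-class SVM flags far too many points and scores poorly on accuracy, even though its raw scores rank anomalies almost perfectly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_train = rng.normal(size=(200, 2))                    # assumed anomaly-free
X_test = np.vstack([rng.normal(size=(90, 2)),
                    rng.normal(loc=6, size=(10, 2))])
y_test = np.array([1] * 90 + [-1] * 10)                # hypothetical labels: -1 = anomaly

svm = OneClassSVM().fit(X_train)
acc = accuracy_score(y_test, svm.predict(X_test))      # depends on the default threshold
auc = roc_auc_score(y_test == -1, -svm.score_samples(X_test))  # threshold-free
```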
9. What's new?
Identifying novel patterns is a very useful skill to have. Let's practice it!