
Outlier classifier ensembles

1. Outlier classifier ensembles

Since we have very limited ways of measuring the accuracy of outlier classifiers, it is risky to trust the predictions of a single classifier.

2. Back to Airbnb

We've seen an example of this already, where we printed the probabilities and saw that IForest marked datapoints as outliers even though the confidence was very low. Here are the probabilities again for the Airbnb dataset: almost all ten datapoints are marked as outliers even though some of their outlier probabilities are below 50%.
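A minimal sketch of how such probabilities can be printed with PyOD (the airbnb_df DataFrame name is an assumption, not the course's exact code):

```python
from pyod.models.iforest import IForest

# Fit an Isolation Forest to the Airbnb features
# (airbnb_df is a hypothetical name for the dataset)
iforest = IForest(random_state=1)
iforest.fit(airbnb_df)

# predict_proba returns two columns: P(inlier) and P(outlier)
probs = iforest.predict_proba(airbnb_df)
print(probs[:10])
```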

3. Probability threshold best practice

In practice, requiring a probability of at least 75% before flagging a datapoint as an anomaly is more reasonable, especially in lower-risk scenarios. In high-cost cases such as medicine or cybersecurity, we can consider raising the probability threshold to 90% in our predictions.
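Continuing the sketch above, applying such a threshold could look like this (the threshold value and variable names are illustrative):

```python
# Second column of probs holds the outlier probabilities
outlier_probs = probs[:, 1]

# Flag a datapoint only when the classifier is at least 75% confident;
# raise the threshold to 0.90 in high-cost domains
threshold = 0.75
confident_outliers = airbnb_df[outlier_probs >= threshold]
```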

4. What is an ensemble?

Even when the confidence is high, there is still the risk of receiving a high probability by chance. This instability in predictions in anomaly detection brings us to the use of ensembles. An outlier ensemble is a collection of two or more outlier classifiers, combined to make their predictions more stable and robust.

5. Look at the data

We can build an ensemble manually for the Google stocks dataset.

6. Scaling numeric features

First, let's scale the first five columns with QuantileTransformer. We won't touch the remaining three since they are categorical features. After importing QuantileTransformer, we create a list of the column names to be scaled. We initialize a QuantileTransformer that casts features to a normal distribution and then use it to scale those columns. We store the result back into the dataset using the dot-loc accessor of pandas.
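A sketch of this scaling step, assuming the DataFrame is named google and the numeric column names are as shown (both are assumptions):

```python
from sklearn.preprocessing import QuantileTransformer

# Hypothetical names for the five numeric columns to scale
cols_to_scale = ["open", "high", "low", "close", "volume"]

# Cast the numeric features to a normal distribution
qt = QuantileTransformer(output_distribution="normal")

# Store the scaled values back with the .loc accessor
google.loc[:, cols_to_scale] = qt.fit_transform(google[cols_to_scale])
```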

7. Creating arrays

Now, we create a list of three outlier classifiers - KNN, LOF and IForest. Then, we create an empty numpy array that has the same number of rows as the Google stocks dataset and the same number of columns as the length of the estimators list. This empty array will later store the outlier probabilities of each estimator in the estimators list.
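As a sketch, assuming the categorical columns have already been numerically encoded so the classifiers can consume them:

```python
import numpy as np
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.iforest import IForest

# The three outlier classifiers that form the ensemble
estimators = [KNN(), LOF(), IForest(random_state=1)]

# One row per sample, one column per estimator's outlier probability
probability_scores = np.empty((len(google), len(estimators)))
```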

8. Inside the loop

To do so, we loop over each estimator and its index with the enumerate function. Inside the loop, we first fit the estimator to the dataset. Then, we extract the outlier probabilities into probs. Finally, we store the second column of probs, which contains the probability of each sample being an outlier, into the corresponding column of the probability_scores array. The final probability_scores array contains three probability scores for each row in the Google stocks dataset; the columns correspond to the scores returned by KNN, LOF and IForest, in the order they appear in the estimators list.
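A sketch of the loop, continuing from the arrays defined above:

```python
for idx, estimator in enumerate(estimators):
    # Fit the current classifier on the scaled, encoded data
    estimator.fit(google)

    # Two-column array: P(inlier), P(outlier)
    probs = estimator.predict_proba(google)

    # Keep only the outlier probabilities, one column per estimator
    probability_scores[:, idx] = probs[:, 1]
```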

9. Aggregating - mean

Now, we can average these scores in two ways. First, let's use the arithmetic mean. Setting axis to one ensures the mean is taken across each row, which results in the 1D NumPy array shown.
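For example:

```python
# One averaged outlier probability per row of the Google stocks data
mean_scores = probability_scores.mean(axis=1)
```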

10. Aggregating - median

The same is true with the median function.
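The median equivalent of the same aggregation:

```python
# Median aggregation: also one score per row, but less sensitive
# to a single extreme probability from one classifier
median_scores = np.median(probability_scores, axis=1)
```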

11. Probability filter

Now, we can use either of the two arrays to create a probability filter. We use a threshold of 75% to create a boolean mask with median_scores. Finally, we use the mask to filter the outliers in the Google stocks dataset. We find only three outliers, and this time we have much more confidence in the output because the result comes from multiple classifiers.
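A sketch of the filtering step with the median scores:

```python
# Boolean mask: rows the ensemble flags with more than 75% confidence
is_outlier = median_scores > 0.75

# Filter the Google stocks rows down to the ensemble's outliers
outliers = google[is_outlier]
print(len(outliers))
```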

12. Summary of the steps

Here is a summary of the steps we took. We first created a list of classifiers and an empty array to store outlier probabilities. Then, we looped over each classifier, fitting it to the data, generating outlier probabilities and storing them inside the empty scores array.

13. Summary of the steps

Outside the loop, we took the mean (or median) of probabilities and used it to filter the outliers with a 75% threshold.

14. Let's practice!

Now, let's practice!