1. Interpreting the output of IForest
Even though we have a best-performing classifier after hyperparameter tuning, there is still no guarantee that it is as accurate as it looks, mainly because of the uncertainty that comes with contamination.
2. An alternative
So, as a last step, we have to manually explore the output of outlier classifiers to ensure the detected outliers are truly outliers.
First, let's see a better way of fitting IForest and generating outlier labels on the training data. This time, we will only use the fit function of IForest on the Airbnb dataset.
Then, to get the labels on the training set, we access the labels_ attribute of IForest, which gives the same output as calling fit_predict on the data.
pyod documentation recommends using this method of generating outlier labels on datasets over the fit_predict function.
Note the trailing underscore in labels_ - it is a naming convention in sklearn and pyod that tells the user the attribute is only available once the estimator has been fit to a dataset.
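Here is a minimal sketch of that pattern, assuming the Airbnb features live in a DataFrame called airbnb_df (a hypothetical name) and keeping the 20% contamination from before:

from pyod.models.iforest import IForest

# Fit IForest on the training data; fit alone returns no labels
iforest = IForest(contamination=0.2, random_state=42)
iforest.fit(airbnb_df)

# labels_ holds the 0/1 outlier labels for the training data,
# equivalent to calling fit_predict on airbnb_df
train_labels = iforest.labels_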
3. Predictions on new data
Now, if we want to generate labels on unseen data, we can call predict.
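For example, assuming new_df is a hypothetical DataFrame with the same columns as the training data:

# predict assigns 1 to outliers and 0 to inliers in the new data
new_labels = iforest.predict(new_df)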
4. Probability scores
Coming back to exploring the output of IForest, the first thing we are interested in is how confident the model is in its predictions. For that, we turn to another feature of pyod estimators: predict_proba.
predict_proba is a method that outputs probability scores as a measure of confidence and is available once the estimator is fit. Let's try it on the Airbnb data.
It returns a numpy array of shape (n_samples, 2) where the first column is the probability of the sample being an inlier while the second is the probability of it being an outlier.
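A short sketch of how this looks on the fitted estimator, reusing the hypothetical airbnb_df from above:

# Probability scores for the training data; shape is (n_samples, 2)
probs = iforest.predict_proba(airbnb_df)

probs[:, 0]  # probability of each sample being an inlier
probs[:, 1]  # probability of each sample being an outlier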
5. Outlier probability scores
Let's look at the probability scores of only the outliers. Using the labels_ attribute we just introduced, we isolate the outlier probabilities into outlier_probs and print the first ten.
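A sketch of that step, reusing probs and the fitted iforest from the earlier snippets:

# Keep the probability rows of the datapoints IForest labeled as outliers
outlier_probs = probs[iforest.labels_ == 1]

print(outlier_probs[:10])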
In the first and ninth lines, we see that the inlier and outlier probabilities are almost equal. This means our model was on the fence about whether those datapoints were outliers. In the end, it decided on outliers even though the differences were marginal. In contrast, in line four and the last line, we see that the model was very confident the datapoints were outliers, with 65% and 79% probabilities.
In lines 5-8, the model marked the datapoints as outliers even though their inlier probabilities were greater. This suggests we chose too high a value for contamination, which led to classification mistakes.
6. Abandoning contamination
To account for these mistakes, we can drop the use of contamination and mark datapoints as outliers based only on their probability scores. Here is how that would work.
After fitting IForest to the Airbnb data, we extract the probability scores for all datapoints into probs.
Then, we select the second column, which contains the probabilities of samples being outliers, into outlier_probs.
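Under those assumptions, the first two steps might look like this:

# Refit IForest and get probability scores for every datapoint
iforest = IForest(random_state=42).fit(airbnb_df)
probs = iforest.predict_proba(airbnb_df)

# Second column: probability of each datapoint being an outlier
outlier_probs = probs[:, 1]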
7. Abandoning contamination
Then, we mark datapoints as outliers only when their outlier probability is greater than 65%. In code, we can achieve this with pandas subsetting, using a condition on outlier_probs.
We get only 193 outliers, which is far fewer than the 2000 we would have found with 20% contamination. Now, we can be much more confident that the majority of these 193 outliers are truly outliers.
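A sketch of this final step, using the 65% threshold and the hypothetical airbnb_df from before:

# Subset the data to the rows whose outlier probability exceeds 65%
outliers = airbnb_df[outlier_probs > 0.65]

print(len(outliers))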
8. Let's practice!
Now, let's practice all these new concepts.