Overview of Isolation Forest hyperparameters

1. Overview of Isolation Forest hyperparameters

In the last video, we learned the foundations of Isolation Forest and its iTrees.

2. Most important hyperparameters

In this video, we will take an even closer look at Isolation Forest by learning about its most important hyperparameters, which are contamination, n_estimators, max_samples, and max_features.

3. What is contamination?

Let's start with the contamination parameter. After training, IForest generates a raw anomaly score for each datapoint. At this stage, we do not know which points are inliers and which are outliers. To make the classification, we choose a threshold above which raw anomaly scores are labeled as outliers. The proportion of points flagged this way is called contamination. For example, a 10% contamination means we are choosing the observations with the top 10% of anomaly scores as outliers.
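As a toy illustration, here is how a 10% contamination turns raw scores into labels. The scores below are made up for the example:

```python
# Hypothetical raw anomaly scores for 10 datapoints
scores = [0.12, 0.30, 0.08, 0.95, 0.22, 0.41, 0.17, 0.88, 0.10, 0.25]

contamination = 0.10  # flag the top 10% of scores as outliers
n_outliers = int(len(scores) * contamination)  # 1 point

# The threshold is the lowest score that still falls in the top 10%
threshold = sorted(scores, reverse=True)[n_outliers - 1]
labels = [1 if s >= threshold else 0 for s in scores]  # 1 = outlier
print(threshold, labels)
```

Only the single highest-scoring point crosses the threshold here, matching the 10% contamination we asked for.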

4. Setting contamination

Contamination isn't special to IForest; it exists in all pyod estimators except for MAD. It accepts a value between 0 and 0-point-5, with 0-point-1 being the default. Setting the right contamination is critical to trusting the predictions of a multivariate outlier detection algorithm. Too low a contamination results in undetected anomalies, while too high a contamination can lead to inliers being marked as anomalies. We will discuss a couple of tuning techniques for contamination in the next video.

5. What is n_estimators?

When we want to specify the exact number of iTrees in IForest, we use the n_estimators parameter. It defaults to 100, which is usually enough for small datasets. We use more trees for high-dimensional datasets to have enough predictive power to learn all the relevant patterns in the data.

6. max_samples and max_features

Each of these trees trains on a sub-sample of the dataset and a sub-sample of the features, controlled by the max_samples and max_features parameters, which accept values between zero and one. For example, an IForest with 0-point-6 max_samples and 0-point-9 max_features trains its iTrees on 60% of the rows and 90% of the features of the dataset. For every iTree, a different 60% of the rows and a different 90% of the features are selected. This frequent sub-sampling reduces the risk of overfitting.
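To make the arithmetic concrete, here is how those fractions translate into per-tree counts for a hypothetical dataset of 1,000 rows and 10 features:

```python
# Hypothetical dataset dimensions
n_rows, n_features = 1_000, 10
max_samples, max_features = 0.6, 0.9

rows_per_tree = int(n_rows * max_samples)           # 600 rows
features_per_tree = int(n_features * max_features)  # 9 features
print(rows_per_tree, features_per_tree)
```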

7. Tree growth

Then, iTrees grow in a randomized way: at each node, a feature is selected and a random split value is chosen between that feature's minimum and maximum, until all samples are fully isolated into leaf nodes or the tree reaches its maximum depth.
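A minimal pure-Python sketch of one such randomized split, with toy values for a single selected feature:

```python
import random

random.seed(0)
column = [3.1, 4.7, 2.2, 9.8, 5.5]  # values of one selected feature

# Pick a split value uniformly between the feature's min and max
split = random.uniform(min(column), max(column))
left = [v for v in column if v < split]
right = [v for v in column if v >= split]

# Each side would be split again until every point is isolated
# or the tree hits its maximum depth
print(len(left), len(right))
```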

8. Max tree depth

Max depth is equal to the base-2 logarithm of max_samples, rounded up. Then, each datapoint is assigned an anomaly score depending on the depth at which it was isolated. Datapoints isolated close to the root score high, while those that travel deep into the tree score low. In the end, IForest averages the anomaly scores across all trees and flags the proportion of most anomalous samples controlled by contamination.
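For example, the depth cap for a couple of max_samples values, computed as the ceiling of the base-2 logarithm:

```python
import math

# ceil(log2(max_samples)) caps tree depth: roughly the depth needed
# to isolate that many points in a balanced binary tree
depths = {n: math.ceil(math.log2(n)) for n in (256, 1_000)}
print(depths)  # {256: 8, 1000: 10}
```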

9. IForest advantages

This way of building iTrees makes IForest very efficient on large datasets. Because of frequent sub-sampling, IForest does not require all normal instances to isolate the abnormal ones. Only a fraction of the inliers suffices to differentiate the outliers, which drastically reduces computation time. Another advantage of IForest is that it makes almost no statistical assumptions about the distribution of the features. Although it isn't a silver bullet for all anomaly detection problems, it performs well out-of-the-box on many real-world datasets.

10. Challenges of outlier detection

Supervised-learning algorithms rely on metrics like RMSE or log loss to check whether the chosen hyperparameters are effective. Outlier classifiers do not have this luxury because outlier detection is an unsupervised learning problem. We do not have inlier/outlier labels beforehand to measure the effectiveness of outlier classifiers. There is no easy way of knowing whether 7% contamination is better than 15%, or whether increasing n_estimators will lead to better results. The only way to check whether a chosen set of hyperparameters is effective is to combine the outlier classifier with a supervised-learning model and measure the final performance with metrics like RMSE, log loss, or accuracy. We will see how to do this in the next video.

11. Let's practice!

Now, let's practice!