Get Started

Hyperparameter tuning of Isolation Forest

1. Hyperparameter tuning of Isolation Forest

In this video, we will cover techniques to tune the parameters of the IForest estimator.

2. Tuning contamination

Let's start with contamination. As mentioned in the previous video, contamination is challenging to tune because we don't know the exact number of outliers beforehand. There is no single prescribed way of choosing it; we have to rely on our intuition, insights gained from exploratory data analysis, domain knowledge, and the business expectations of the problem we are trying to solve.

3. Survey example

For example, let's say we are working on survey data related to household incomes. We could research similar surveys and use their results to set a rough contamination level. Or we could leverage complementary data sources to find the percentage of the poorest and wealthiest households in the survey area. Setting a contamination level based on research rather than blind guesswork is generally advised.

4. Big Mart sales data

When this is not possible, we can combine any outlier classifier with a supervised-learning model so that we can quantify its performance with supervised-learning metrics. Let's combine IForest with Linear Regression on the full version of the Big Mart sales dataset to predict sales. The effectiveness of IForest will be measured by the performance of Linear Regression, which we will quantify using the Root Mean Squared Error (RMSE) metric.

5. Encode categoricals

The dataset has two numeric and two categorical features. We will convert categorical data to numeric with the pd-dot-get_dummies function.
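A minimal sketch of this step might look as follows; the file name is a placeholder, not the path used in the video.

```python
import pandas as pd

# Load the dataset; the file name here is a placeholder
big_mart = pd.read_csv("big_mart.csv")

# One-hot encode the categorical columns so every feature is numeric
big_mart = pd.get_dummies(big_mart)
```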

6. evaluate_outlier_classifier

Then, we will create an evaluate_outlier_classifier function that fits any pyod model to the given data and returns the inliers.
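A sketch of this helper, relying on pyod's convention that the fitted labels_ attribute marks inliers with 0 and outliers with 1:

```python
def evaluate_outlier_classifier(model, data):
    # Fit the pyod model to the data
    model.fit(data)

    # In pyod, labels_ is 0 for inliers and 1 for outliers
    outliers = model.labels_ == 1

    # Keep only the rows flagged as inliers
    return data[~outliers]
```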

7. evaluate_regressor

Next, we create another function called evaluate_regressor that fits and evaluates a Linear Regression model on any given dataset. Before writing the body, we import the LinearRegression class, and the train_test_split and mean_squared_error functions from their relevant modules.

8. evaluate_regressor

In the body of evaluate_regressor, we first extract feature and target arrays into X and y. Here, the target is "sales". Next, we split the data and fit LinearRegression to the training data. We generate predictions and return the rounded Root Mean Squared Error (RMSE).
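Putting the imports and the body together, a sketch of evaluate_regressor could look like this; the test size, random state, and rounding precision are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def evaluate_regressor(inliers):
    # Extract the feature and target arrays; "sales" is the target
    X = inliers.drop("sales", axis=1)
    y = inliers["sales"]

    # Split the data and fit Linear Regression to the training set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1
    )
    lr = LinearRegression()
    lr.fit(X_train, y_train)

    # Generate predictions and return the rounded RMSE
    preds = lr.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    return round(rmse, 3)
```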

9. Tuning contamination

Armed with these functions, let's create a list of possible values for contamination and an empty dictionary to store the RMSE scores of LinearRegression for each. We will loop through contaminations. Inside the loop, we pass an instance of IForest and the data to evaluate_outlier_classifier to find the inliers. Then, we use the evaluate_regressor function on them, storing the current contamination as the key and the RMSE as the value in the scores dictionary.
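A sketch of the loop; the candidate contamination values below are assumptions, not the exact grid shown in the video:

```python
from pyod.models.iforest import IForest

contaminations = [0.05, 0.1, 0.2, 0.3]
scores = dict()

for c in contaminations:
    # Drop the outliers found at this contamination level
    inliers = evaluate_outlier_classifier(
        IForest(contamination=c, random_state=1), big_mart
    )

    # Store the RMSE of Linear Regression on the remaining inliers
    scores[c] = evaluate_regressor(inliers)
```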

10. Look at the output

In the end, we print scores and see that 30% contamination gives the lowest RMSE. Other hyperparameters of IForest can be tuned in the same way: we replace the list of contaminations with possible values for n_estimators, max_features, or max_samples.

11. Tuning multiple hyperparameters

Let's tune max_samples and n_estimators simultaneously as a more complex example. First, we create two lists of possible values and an empty dictionary for scores.
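For instance, the setup might look like this; the candidate values are assumptions:

```python
# Candidate grids for the two hyperparameters (assumed values)
n_estimators_list = [100, 200, 300]
max_samples_list = [0.6, 0.8, 1.0]
scores = dict()
```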

12. Cartesian product

Then, instead of using nested for loops, we will use the product function from the itertools library which will return all possible combinations of values of two or more lists.
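Here is a small illustration of what product returns:

```python
from itertools import product

# Every (n_estimators, max_samples) pair, without nested for loops
list(product([100, 200], [0.6, 0.8]))
# [(100, 0.6), (100, 0.8), (200, 0.6), (200, 0.8)]
```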

13. Inside the loop

Next, we loop through the output of the product function with two variables, e and m, for the n_estimators and max_samples parameters. We fix contamination at 30% while passing the current e and m variables to n_estimators and max_samples. The rest is the same as above. The only difference is that we will use both e and m inside a tuple as a key for the scores dictionary.
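A sketch of the loop, reusing the helpers and lists defined above:

```python
for e, m in product(n_estimators_list, max_samples_list):
    # Fix contamination at 30% while varying n_estimators and max_samples
    iforest = IForest(
        contamination=0.3, n_estimators=e, max_samples=m, random_state=1
    )
    inliers = evaluate_outlier_classifier(iforest, big_mart)

    # Use the (e, m) tuple as the key in the scores dictionary
    scores[(e, m)] = evaluate_regressor(inliers)
```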

14. Looking at the output

Finally, we print scores and find that 200 iTrees with 80% max_samples provide the lowest RMSE when contamination is set to 30%.

15. Parallel execution

One last thing before we finish this lesson is parallel execution with the n_jobs parameter. Setting it to -1 uses all available CPU cores.
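For example:

```python
# n_jobs=-1 uses all available CPU cores to build the isolation trees
iforest = IForest(contamination=0.3, n_estimators=200, n_jobs=-1)
```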

16. Let's practice!

Now, let's practice!