
Getting started with Isolation Forests

1. Getting started with Isolation Forests

This chapter is focused on multivariate outliers, which are more common in real-world data than univariate outliers.

2. Survey data

Consider that we are analyzing healthcare data gathered from people aged 10 to 20, and we find a respondent who is 12 years old, 160 cm tall, and weighs 190 pounds. When we look at the age, height, and weight individually, each seems to be within the range of typical human characteristics. We only realize that 12-year-old children who are 160 cm tall don't usually weigh 190 pounds when we consider all three characteristics simultaneously. This particular 12-year-old is a multivariate outlier.

3. Multivariate anomalies

So, we define multivariate outliers as data points with two or more attributes which, when examined individually, are not necessarily anomalous, but which differ from the rest of the data when all attributes are considered at the same time. We will explore algorithms for detecting multivariate outliers, starting with Isolation Forest.

4. Decision trees

Isolation Forest uses an ensemble of decision trees called "isolation trees" to "isolate" the anomalies. To understand how they work, let's look at a tree that checks whether 5 is prime. In the root node, we ask whether five is divisible by two. If the answer is "Yes", five is not prime, and we don't have to check further.

5. Decision trees

If the answer is "No", we continue asking yes-or-no questions for the remaining numbers below five. The nodes in which no further branching or splitting happens are called leaves. This decision tree has three levels, or a depth of three. Every time a new split happens, a new level of depth is added to the tree.
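
To make the toy tree concrete, here is a minimal Python sketch of the same three-level tree, with each if statement corresponding to one split:

```python
def is_five_prime():
    # Root node: is five divisible by two?
    if 5 % 2 == 0:
        return False        # "Yes" branch: no need to check further
    # Second level: is five divisible by three?
    if 5 % 3 == 0:
        return False
    # Third level: is five divisible by four?
    if 5 % 4 == 0:
        return False
    return True             # leaf: no divisor found, so five is prime

print(is_five_prime())      # True
```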

6. Isolation Trees

Isolation Trees, or iTrees, are randomized versions of decision trees. Instead of asking specific questions, splitting happens randomly. In other words, to classify a multi-dimensional data point as an inlier or an outlier, an iTree selects a random feature of the data and, at each depth level, selects a random split value between the minimum and maximum values of that feature. Since outliers leave a large "gap" between themselves and the inliers, the random split is more likely to land within that gap, which results in isolating the outliers early in the tree-building process.
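
Here is a minimal NumPy sketch of a single random split step, using hypothetical toy data where the last point sits far from the rest:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the last point is far from the others, leaving a gap.
X = np.array([[0.5, 1.0], [0.8, 1.2], [1.0, 0.9], [5.0, 6.0]])

# One iTree split: pick a random feature, then a random split value
# between that feature's minimum and maximum values.
feature = rng.integers(X.shape[1])
split = rng.uniform(X[:, feature].min(), X[:, feature].max())

left = X[X[:, feature] < split]    # points below the split value
right = X[X[:, feature] >= split]  # points at or above the split value
# Because of the gap, the split often lands between the outlier and
# the cluster, isolating the outlier in a single step.
```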

7. Example 2D data

To illustrate this, let's look at some example data. Points A through G are inliers, while H and I are clearly outliers.

8. Fitting an iTree

Let's fit a single iTree to this data. In the first split, we randomly select feature y and randomly choose a split value of 1-point-5. This single split already isolates H as an outlier.

9. Fitting an iTree

Another random split for x with a value of 1-point-8 isolates I as another outlier.

10. Fitting an iTree

To isolate the rest, we need more splits - here are four more, which separate points A, B, F, and G. To separate C, D, and E, we would need even more splits.

11. How points are classified

So, points that require fewer splits - that is, points isolated close to the root node - will be classified as outliers. Isolation Forest uses a collection of such iTrees and averages their results. The exact number of trees is determined by us, which we will look at in the next video.
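
The pyod library handles all of this for us, but the core idea can be sketched in a few lines of NumPy. This toy function (the name and data are hypothetical) isolates a point with random splits and reports the depth at which isolation happens; averaging over many such random trees yields noticeably shallower depths for the outlier:

```python
import numpy as np

rng = np.random.default_rng(42)

def isolation_depth(X, point, depth=0, max_depth=8):
    # Depth at which `point` is isolated by random axis-aligned splits.
    if len(X) <= 1 or depth >= max_depth:
        return depth
    feature = rng.integers(X.shape[1])              # random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                                    # all values equal: stop
        return depth
    split = rng.uniform(lo, hi)                     # random split value
    mask = X[:, feature] < split
    subset = X[mask] if point[feature] < split else X[~mask]
    return isolation_depth(subset, point, depth + 1, max_depth)

# Average the depth over many random trees: the last point (an outlier)
# is isolated at a much shallower depth than the clustered inliers.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.2, 1.1], [8.0, 9.0]])
for p in X:
    depths = [isolation_depth(X, p) for _ in range(200)]
    print(p, round(float(np.mean(depths)), 2))
```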

12. US Airbnb data

The Isolation Forest algorithm is implemented as the IForest estimator in pyod. Let's test it on the full version of the US Airbnb listings data we have been using in the exercises.

13. US Airbnb data

There are 10,000 listings with five attributes and a single target, which is price.
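
Assuming the listings are stored in a CSV file (the file name here is hypothetical), loading the data and checking its shape might look like this:

```python
import pandas as pd

# Hypothetical file name for the full US Airbnb listings data.
airbnb_df = pd.read_csv("airbnb.csv")

print(airbnb_df.shape)  # (10000, 6): five attributes plus the price target
```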

14. fit_predict

We will import IForest from pyod-dot-models-dot-iforest, initialize it with default parameters, and use its fit_predict method to generate inlier/outlier labels for the data. IForest marks outliers with one, whereas inliers are marked with zero.
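
Put together, the calls from this slide look like the sketch below; the DataFrame name airbnb_df and the dropping of the price target are assumptions carried over from the previous sketch:

```python
from pyod.models.iforest import IForest

# Assumed: airbnb_df holds the listings, with price as the target column.
X = airbnb_df.drop("price", axis=1)

iforest = IForest()              # default parameters
labels = iforest.fit_predict(X)  # 1 = outlier, 0 = inlier
```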

15. Filter outliers

We'll use pandas subsetting to filter the outliers.
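
Continuing the sketch above, the labels array aligns with the rows of the DataFrame, so a simple boolean comparison works as a pandas mask:

```python
# Keep only the rows that IForest labeled as outliers (label == 1).
outliers = airbnb_df[labels == 1]

print(len(outliers))  # number of listings flagged as outliers
```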

16. Let's practice!

Let's practice!