1. Getting started with Isolation Forests
This chapter focuses on multivariate outliers, which are more common in real-world data than univariate outliers.
2. Survey data
Consider that we are analyzing healthcare data gathered from people aged 10 to 20, and we find a respondent who is 12 years old, 160 cm tall, and weighs 190 pounds.
When we look at his age, weight, and height individually, these seem to be within the range of typical human characteristics. We only realize that 12-year-old children who are 160cm tall don't usually weigh 190 pounds when we consider all three characteristics simultaneously. This particular 12-year-old is a multivariate outlier.
3. Multivariate anomalies
So, we define multivariate outliers as data points with two or more attributes that are not necessarily anomalous when examined individually, but that differ from the rest of the data when all attributes are considered at the same time.
We will be exploring algorithms to detect multivariate outliers, starting with Isolation Forest.
4. Decision trees
Isolation Forest uses an ensemble of decision trees called "isolation trees" to "isolate" the anomalies. To understand how they work, let's look at a tree that checks if 5 is prime.
In the root node, we ask whether five is divisible by two. If "Yes", we don't have to check further.
5. Decision trees
If "No", we continue asking yes-or-no questions about divisibility by the other numbers below five. The nodes in which no further branching or splitting happens are called leaves. This decision tree has three levels, or a depth of three. Every time a new split happens, a new level of depth is added to the tree.
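If it helps, here is a minimal Python sketch of the same tree written as nested yes-or-no questions. The exact divisibility checks are our guess at the slide's tree, and this is a sketch, not a general primality test:

```python
# The slide's decision tree as nested yes/no questions: each "if" is a
# node, each "return" is a leaf. A sketch, not a general primality test.
def is_five_prime(n=5):
    if n % 2 == 0:       # root node (level 1): divisible by 2?
        return False     # leaf: not prime
    if n % 3 == 0:       # level 2: divisible by 3?
        return False     # leaf: not prime
    return True          # leaf (level 3): no divisor below 5, so prime

print(is_five_prime())   # True
```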
6. Isolation Trees
Isolation Trees, or iTrees, are randomized versions of decision trees. Instead of asking specific questions, an iTree splits the data randomly. In other words, to classify a multi-dimensional data point as an inlier or an outlier, at each depth level an iTree selects a random feature of the data and a random split value between the minimum and maximum values of that feature.
Since outliers leave a large "gap" between themselves and the inliers, a random split is more likely to fall within that gap, which results in isolating the outliers early in the tree-building process.
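To make the mechanics concrete, here is a minimal sketch of a single random split, assuming the data is a two-dimensional NumPy array. This is an illustration of the idea, not pyod's implementation:

```python
import numpy as np

# A minimal sketch of one random iTree split, assuming X is a 2D NumPy
# array (rows = data points, columns = features).
rng = np.random.default_rng(42)

def random_split(X):
    feature = rng.integers(X.shape[1])                # pick a random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    split_value = rng.uniform(lo, hi)                 # random split point
    left = X[X[:, feature] < split_value]             # points below the split
    right = X[X[:, feature] >= split_value]           # points at or above it
    return feature, split_value, left, right
```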
7. Example 2D data
To illustrate this, let's look at some example 2D data.
Points A through G are inliers, while H and I are clearly outliers.
8. Fitting an iTree
Let's fit a single iTree to this data. In the first split, we randomly select feature y and randomly choose a split value of 1-point-5. This single split already isolates H as an outlier.
9. Fitting an iTree
Another random split for x with a value of 1-point-8 isolates I as another outlier.
10. Fitting an iTree
To isolate the rest, we need more splits - here are four more, which separate points A, B, F, and G. To separate C, D, and E, we would need even more splits.
11. How points are classified
So, points that require fewer splits, that is, points isolated close to the root node, are classified as outliers. Isolation Forest uses a collection of such iTrees and averages their results. The exact number of trees is determined by us, which we will look at in the next video.
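As a rough illustration of this idea (not the library's implementation), we can grow many small random trees, record the depth at which each point becomes isolated, and average those depths. The isolation_depth helper below is hypothetical:

```python
import numpy as np

# Hypothetical helper: returns the depth at which each row of X is
# isolated by one random tree. Lower average depth = more outlier-like.
rng = np.random.default_rng(0)

def isolation_depth(X, depth=0, max_depth=10):
    n = len(X)
    if n <= 1 or depth >= max_depth:      # point isolated or depth cap hit
        return np.full(n, depth)
    feature = rng.integers(X.shape[1])    # random feature, as before
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                          # constant feature: cannot split
        return np.full(n, depth)
    split = rng.uniform(lo, hi)           # random split between min and max
    mask = X[:, feature] < split
    depths = np.empty(n)
    depths[mask] = isolation_depth(X[mask], depth + 1, max_depth)
    depths[~mask] = isolation_depth(X[~mask], depth + 1, max_depth)
    return depths

# Average over an ensemble of trees on toy data (made up for illustration).
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [5.0, 5.0]])
avg_depth = np.mean([isolation_depth(X) for _ in range(100)], axis=0)
print(avg_depth)   # the isolated point (5, 5) should get the shortest path
```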
12. US Airbnb data
The Isolation Forest algorithm is implemented as the IForest estimator in pyod. Let's test it on the full version of the US Airbnb listings data we have been using in the exercises.
13. US Airbnb data
There are 10,000 listings with five attributes and a single target: price.
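As a quick sketch, loading and inspecting the data with pandas might look like this; the filename is hypothetical, since the exercises provide the actual dataset:

```python
import pandas as pd

# Hypothetical filename for illustration only.
airbnb_df = pd.read_csv("airbnb.csv")
print(airbnb_df.shape)   # (10000, 6): five attributes plus the price target
```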
14. fit_predict
We will import IForest from pyod-dot-models-dot-iforest, initialize it with default parameters, and use its fit_predict method to generate inlier/outlier labels for the data.
IForest marks outliers with one, whereas inliers are marked with zero.
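Assuming the listings live in a DataFrame called airbnb_df, as sketched above, and that the five attributes are numeric, the code looks like this:

```python
from pyod.models.iforest import IForest

# Separate the features from the price target before fitting
# (assumes airbnb_df from the previous step, with numeric attributes).
X = airbnb_df.drop(columns=["price"])

# Initialize IForest with default parameters and generate labels:
# fit_predict returns 1 for outliers and 0 for inliers.
iforest = IForest()
labels = iforest.fit_predict(X)
```

Recent pyod versions also let us call fit and then read the labels_ attribute for the same labels.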
15. Filter outliers
We'll use pandas subsetting to filter the outliers.
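Assuming the labels array from the previous step, this is a one-liner:

```python
# Boolean mask: keep only the rows IForest labeled as outliers (label == 1).
outliers = airbnb_df[labels == 1]
print(len(outliers))
```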
16. Let's practice!
Let's practice!