Isolation trees

1. Isolation trees

In the previous chapter, we saw how kNN and LOF use distance and relative density to tell whether a point is unusual. In this chapter, we introduce isolation trees, a tree-based approach to producing anomaly scores.

2. Isolation tree

Animals like deer gather into herds for safety. A wolf pack attacks by targeting members on the periphery that can be easily separated from the safety of the herd; a single deer is therefore more vulnerable if it can be easily isolated. An isolation tree works the same way as wolves choosing a target: points that are more easily separated from the other points are more anomalous. Let's see how this works!

3. Isolation tree plots

An isolation tree attempts to separate all of the points by randomly splitting the region into smaller and smaller pieces. The tree chooses a data feature and splits the data at a random value of that feature. The left plot shows the furniture data split at a Width value near 80. The tree continues to make random splits within the regions defined by the previous splits. The middle plot shows the progress after three random splits. Random splitting continues until each point lies inside its own subregion, or each subregion contains no more than some maximum number of points. The final result of continued splitting of the furniture data is shown in the right plot, and a small sketch of one such split follows below.
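To make the splitting rule concrete, here is a minimal sketch in R of how a single random split could be chosen. This is illustrative code only, not the isofor package's internals, and it assumes furniture is the lesson's data frame of numeric features.

# Illustrative sketch of one random split (not isofor internals):
# choose a feature at random, then a split value uniformly within its range
set.seed(42)
feature <- sample(names(furniture), 1)
rng <- range(furniture[[feature]])
split_value <- runif(1, min = rng[1], max = rng[2])

# The split divides the current region into two subregions; an isolation
# tree repeats this recursively within each subregion
left  <- furniture[furniture[[feature]] <  split_value, ]
right <- furniture[furniture[[feature]] >= split_value, ]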

4. Fit an isolation tree

Fitting an isolation tree uses the iForest function from the isofor package. We pass iForest two arguments: data and nt. The data argument expects a data frame containing the features, which is the furniture data in the example shown here. nt is a positive integer specifying the number of isolation trees to grow, which is just one here. The nt argument implies that many trees could be fitted at once; this point is revisited in the next lesson, but for now we'll stick to building a single isolation tree. The resulting isolation tree is assigned to the new object furniture_tree, as in the sketch below.
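As a minimal sketch, assuming the isofor package is installed (it is distributed on GitHub as Zelazny7/isofor) and that furniture is the lesson's data frame, the fitting step looks like this:

# Load the isofor package
library(isofor)

# Grow a single isolation tree (nt = 1) on the furniture features
furniture_tree <- iForest(furniture, nt = 1)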

5. Generate an isolation score

To generate an anomaly score from an isolation tree, use the predict function. The first argument to predict must be the isolation tree object, here the furniture_tree object we just created. The second argument is furniture, the data we'd like anomaly scores for. The result returned by predict is assigned to the new object furniture_score.
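Continuing the sketch from the previous step, the scoring call could look like this:

# Generate one anomaly score per row of the furniture data
furniture_score <- predict(furniture_tree, furniture)

# Peek at the first 10 scores
head(furniture_score, 10)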

6. Interpreting the isolation score

The result of the predict function is a set of numeric scores with as many elements as there are rows in the furniture data; the first 10 are printed here, but what do these values mean? An isolation tree measures the isolation of a point by how quickly it can be separated by a sequence of random splits. The number of random splits needed to separate a point is a positive integer called the path length. The maximum possible path length partly depends on the size of the data used to build the tree and doesn't have an intuitive interpretation on its own. The values returned by predict are the path lengths standardized to vary between 0 and 1. If the score is close to 1, the path length is very small, meaning the point was easily isolated by random splits and is therefore more likely to be an anomaly. If the score is close to 0, the path length is large, implying that the point was hard to isolate and is unlikely to be an anomaly.
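For reference, the standardization described here is typically the one from Liu et al.'s original isolation forest paper; assuming isofor follows that convention, the score for a point x in a tree grown on n points is

s(x, n) = 2^{-E[h(x)] / c(n)}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n}

where h(x) is the path length, c(n) is the average path length of an unsuccessful search in a binary search tree of n points, and H(i) is the i-th harmonic number, approximately ln(i) + 0.5772. A small expected path length pushes the exponent toward 0 and the score toward 1; a large path length pushes the score toward 0.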

7. Let's practice!

Let's practice fitting isolation trees!
