1. Isolation forest
In this lesson, we'll explore how growing many isolation trees can be used together to produce better measures of how anomalous a data point is.
2. Sampling to build trees
The iForest function always uses a subsample of the data to grow an isolation tree. The number of points to sample can be changed by specifying the phi argument of the iForest function. For example, the code shown here will grow a single tree for the furniture data using a random sample of 100 points.
Each time iForest grows an isolation tree, a different random sample is used, so the resulting tree is also different. The three scatterplots each show the random splits produced by repeatedly running the code at the top. Can you see that, although the splits differ each time, they tend to appear in approximately the same regions?
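The call described above can be sketched as follows. This is a minimal sketch assuming the isofor package; the lesson's furniture data is not reproduced here, so a small synthetic data frame stands in for it.

```r
# Sketch, assuming the isofor package is installed. A synthetic
# data frame stands in for the lesson's furniture data.
library(isofor)

set.seed(42)
furniture <- data.frame(height = rnorm(500, 80, 10),
                        width  = rnorm(500, 60, 8))

# Grow a single isolation tree (nt = 1) from a random subsample
# of 100 points (phi = 100).
furniture_tree <- iForest(furniture, nt = 1, phi = 100)

# Re-running the iForest call draws a fresh subsample, so the
# random splits differ from run to run.
```

Because the subsample is drawn at random, calling iForest repeatedly with the same arguments yields a different tree each time, which is exactly what the three scatterplots illustrate.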
3. A forest of many trees
An individual isolation tree measures how anomalous a point is based on how many random splits are needed to isolate it.
An isolation forest is composed of many isolation trees, each grown using a different random sample of the original data. Each tree produces a score for every point, and these scores are averaged across all the trees in the forest. The code to grow 100 trees for the furniture data is shown. It is identical to the code for a single tree, except that the nt argument has been set to 100.
There are two benefits to growing a forest and averaging the scores. First, while an individual tree might be sensitive to chance patterns in its data sample, the average score is very stable. Second, because the full data are never used to build any single tree, anomaly scores can be obtained much more quickly.
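The forest described above can be sketched as follows, again assuming the isofor package and using a synthetic stand-in for the furniture data.

```r
# Sketch, assuming the isofor package is installed. A synthetic
# data frame stands in for the lesson's furniture data.
library(isofor)

set.seed(42)
furniture <- data.frame(height = rnorm(500, 80, 10),
                        width  = rnorm(500, 60, 8))

# Grow a forest of 100 trees (nt = 100), each built from its own
# random subsample of 100 points (phi = 100).
furniture_forest <- iForest(furniture, nt = 100, phi = 100)

# predict() returns an anomaly score for each point, averaged
# over all trees; higher scores indicate easier-to-isolate points.
scores <- predict(furniture_forest, furniture)
summary(scores)
```

The only change from the single-tree call is nt = 100; the subsampling controlled by phi works exactly as before, tree by tree.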
4. How many trees?
How do we know if we've grown enough trees?
The anomaly score for each point in the data usually converges to a fixed value after a sufficient number of trees have been grown. 100 trees is a good rule of thumb to begin with.
To be extra careful, we should fit several isolation forests with different numbers of trees, and compare whether the score changes when more trees are added.
The furniture_scores data frame contains the anomaly scores for the furniture data using forests with different numbers of trees. For example, the column trees_1000 contains the anomaly scores for a forest of 1000 trees.
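A data frame like furniture_scores could be built as sketched below. This is a hypothetical reconstruction, assuming the isofor package and a synthetic stand-in for the furniture data; the lesson's actual furniture_scores was provided ready-made.

```r
# Hypothetical sketch, assuming the isofor package is installed.
# A synthetic data frame stands in for the lesson's furniture data.
library(isofor)

set.seed(42)
furniture <- data.frame(height = rnorm(500, 80, 10),
                        width  = rnorm(500, 60, 8))

# Fit forests of increasing size and store each set of anomaly
# scores as a column named trees_<nt>.
sizes <- c(100, 500, 1000)
furniture_scores <- as.data.frame(lapply(sizes, function(nt) {
  forest <- iForest(furniture, nt = nt, phi = 100)
  predict(forest, furniture)
}))
names(furniture_scores) <- paste0("trees_", sizes)

head(furniture_scores)
```

Once enough trees are grown, the columns should be nearly identical, which is what the convergence check on the next slide looks for.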
5. Score convergence
A simple way to check whether it is worth growing a larger forest is to compare the anomaly scores from a pair of isolation forests with different numbers of trees. If the scores are very similar for both forests, we can be confident that the larger forest isn't required.
A scatterplot is a great way to compare the anomaly scores. An example is shown in which the scores for forests with 1000 and 500 trees are plotted against each other. The formula contains the column names trees_500 and trees_1000 from the furniture_scores data frame introduced in the previous slide.
The abline function adds straight lines to plots and has two arguments, a and b, which specify the intercept and slope of the line. Here, a = 0 and b = 1 specify a line of equality, shown as a visual reference for comparing the similarity of the scores. The points lie very close to this reference line, suggesting the scores are very similar and that an isolation forest of 1000 trees offers little advantage over 500 trees.
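The comparison plot described above can be sketched with base R alone. Since the lesson's furniture_scores is not reproduced here, simulated score columns stand in for trees_500 and trees_1000.

```r
# Sketch using simulated anomaly scores in place of the lesson's
# furniture_scores data frame (base R only).
set.seed(7)
base_score <- runif(500, 0.3, 0.7)
furniture_scores <- data.frame(
  trees_500  = base_score + rnorm(500, 0, 0.01),
  trees_1000 = base_score + rnorm(500, 0, 0.01)
)

# Formula interface: trees_1000 on the y-axis, trees_500 on the x.
plot(trees_1000 ~ trees_500, data = furniture_scores)

# Line of equality (intercept 0, slope 1) as a visual reference:
# points near this line indicate the two forests agree closely.
abline(a = 0, b = 1)
```

If the point cloud hugs the line of equality, growing the larger forest changes the scores very little, so the smaller forest suffices.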
6. Let's practice!
Let's practice building isolation forests!