1. The local outlier factor (LOF)
In this lesson, we'll explore an algorithm called the local outlier factor, usually abbreviated LOF, which uses density instead of distance to construct anomaly scores for each point.
2. Postmortem of kNN distance
kNN distance seems to be good at detecting points that are really far from their neighbors, sometimes called global anomalies, like the point at the top left of the scatterplot. However, this doesn't capture all of the points that might be considered anomalous.
For example, the point indicated in the top right of the scatterplot is relatively close to its nearest neighbors and therefore has a low kNN distance score. However, it sits apart from those neighbors, which are themselves densely clustered together.
Points like these are sometimes called local anomalies and are more easily identified using the LOF. LOF at a point is defined as the average density around the k nearest neighbors of the point divided by the density around the point itself.
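Written as a formula, a simplified sketch of this definition, with density_k(p) standing in for the local density around point p and N_k(p) for the set of its k nearest neighbors (the original LOF paper uses a quantity called local reachability density here), is:

```latex
\mathrm{LOF}_k(p) =
  \frac{\frac{1}{k} \sum_{q \in N_k(p)} \mathrm{density}_k(q)}
       {\mathrm{density}_k(p)}
```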
3. Calculating LOF
Here we've calculated the LOF for the furniture data using the lof function from the dbscan package. The first argument is the input data furniture, and the second argument, k, is the number of neighbors, which has been set to 5. Notice that the furniture data have again been scaled, because under the bonnet, LOF still uses kNN distance.
The lof function returns a numeric vector of scores whose length equals the number of rows in the input data. The first ten values of the LOF score for the furniture data are shown. Notice that the scores are centered around 1: because each LOF score is a ratio of densities, a value near 1 means a point is about as densely surrounded as its neighbors. Next, we'll consider how these scores should be interpreted.
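A minimal sketch of this calculation is shown below, assuming the furniture dataframe from the course is already loaded; note that recent versions of the dbscan package name the neighborhood argument minPts rather than k.

```r
library(dbscan)

# Scale the columns first: LOF builds on kNN distances,
# which are sensitive to the units of each variable
furniture_scaled <- scale(furniture)

# LOF using each point's 5 nearest neighbors
# (newer dbscan releases call this argument minPts instead of k)
furniture_lof <- lof(furniture_scaled, k = 5)

# One score per row, centered around 1
head(furniture_lof, n = 10)
```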
4. Interpreting LOF
The LOF can be thought of as the density around each of a point's nearest neighbors divided by the density around the point itself. This means that if the LOF is much larger than 1, the regions around the neighbors are much more densely packed than the region around the point itself. This is more likely to indicate an isolated, and therefore unusual, observation.
Conversely, if the LOF is less than 1, then the regions around the neighbors are less densely packed than the region around the point. This behavior usually happens when the point is in the middle of a cluster, and is much less likely to indicate an anomaly.
Therefore, to find anomalous points we should be looking out for the largest values of the LOF! Next, let's consider how to visualize the LOF scores.
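For instance, a short sketch of pulling out the highest-scoring rows, reusing the furniture_lof vector from above:

```r
# Inspect the largest LOF scores
head(sort(furniture_lof, decreasing = TRUE))

# Rows of the original data with the 5 largest LOF scores
furniture[order(furniture_lof, decreasing = TRUE)[1:5], ]
```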
5. Visualizing LOF
Before doing any plotting, the LOF score is appended as a new column to the unscaled furniture dataframe. In this case, notice that we've called the new column score_lof.
Next, the furniture data are plotted so that the points are scaled in size according to the LOF score, by setting cex = score_lof. The rest of the code is similar to what we previously used to visualize kNN distance scores.
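Putting those two steps together might look like the sketch below; width and height are hypothetical column names, since the transcript doesn't list the variables in the furniture data.

```r
# Append the LOF score to the unscaled dataframe
furniture$score_lof <- furniture_lof

# Size each point by its LOF score
# (width and height are placeholder column names)
with(furniture, plot(width, height, cex = score_lof, pch = 20))
```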
You can see from the plot that the point at the top left is no longer the highest scoring point, while several points near the dense cluster are far more prominent. Although these points are relatively close to the dense cluster, they sit in regions of sufficiently low density that they appear unusual compared to their neighbors. The LOF is particularly good at picking up local outliers like these because it relies on density instead of distance.
6. Let's practice!
Now let's get some practice using the local outlier factor!