1. k-nearest neighbors distance score
Now you'll learn techniques for finding anomalous data points when there are several features.
2. Furniture dimensions
First, we introduce a new data set to illustrate the ideas. The scatterplot shows the heights and widths of items of furniture.
Notice how the points mostly fall into two clusters, but there are a small number of points that don't seem to belong to either. We might consider these points as possible anomalies, but because each point is defined by two features, we cannot use univariate techniques like Grubbs' test that we saw in the previous chapter.
Remember that Grubbs' test looks for anomalies among the points that lie farthest from the overall mean. A similar idea can be applied when there are multiple features, but first, we need a different way to measure how far away a point is.
3. k-nearest neighbors (kNN) distance
The k-nearest neighbors, or kNN, distance of a point is the average distance from that point to its k closest neighbors.
The pair of plots illustrates the distances to the 5 closest neighbors for two different points, each shown in blue. Each neighbor is shown as a yellow point, and the individual distances are represented by the lengths of the red lines. The kNN distance, which is the mean length of the red lines, is shown in the plot title.
Notice that the point in the left plot is further away from its neighbors than the point in the right plot, and has a correspondingly larger kNN distance. The kNN distance provides an intuitive measure of how isolated a point is from neighboring points, where larger values are more likely to indicate anomalies.
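To make the definition concrete, here is a minimal sketch in base R that computes the kNN distance by hand. The toy points, the helper name knn_distance, and the choice of k equals 3 are all assumptions made for illustration; they are not part of the furniture data.

```r
# Hypothetical toy data: six 2-D points, the last one far from the rest.
pts <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1), c(0.5, 0.5), c(10, 10))

# Full Euclidean distance matrix between all pairs of points.
d <- as.matrix(dist(pts))

# kNN distance for point i: mean distance to its k closest other points.
# (knn_distance is an illustrative helper, not a library function.)
knn_distance <- function(d, i, k) {
  others <- d[i, -i]          # drop the zero distance from the point to itself
  mean(sort(others)[1:k])     # average of the k smallest distances
}

k <- 3
scores <- sapply(seq_len(nrow(pts)), function(i) knn_distance(d, i, k))
round(scores, 3)
```

The isolated sixth point ends up with by far the largest value, which matches the intuition from the plots.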
4. Inputs for distance matrix calculation
To calculate the kNN distance for each point in the furniture data, we use the get.knn() function from the FNN package.
The get.knn() function has two main input arguments: data, which expects a matrix containing the input features as columns, and k, which specifies the number of neighbors to consider when calculating distances from each point.
Here, we've found the 5 nearest neighbors for each point in the furniture data using the arguments data = furniture and k = 5, and saved the output to the object furniture_knn.
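The call described above might look like the sketch below. The real furniture data is not reproduced here, so the matrix is a synthetic stand-in, and the column names Height and Width are assumptions.

```r
library(FNN)  # provides get.knn()

# Synthetic stand-in for the furniture data (assumed column names, made-up values).
set.seed(1)
furniture <- cbind(
  Height = c(rnorm(20, mean = 40, sd = 3), rnorm(20, mean = 90, sd = 3)),
  Width  = c(rnorm(20, mean = 60, sd = 3), rnorm(20, mean = 30, sd = 3))
)

# Find the 5 nearest neighbors of every row.
furniture_knn <- get.knn(data = furniture, k = 5)

# The result is a list; its element names show what was returned.
names(furniture_knn)
```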
5. Distance matrix output
The get.knn() function returns a list containing two matrices, each with k columns and as many rows as there are data points.
The nn.dist matrix contains the distances from each point to its nearest neighbors. For example, the first element of the first column of nn.dist is 5.1283, which is the distance from the first data point in the furniture data to its closest neighbor. The first element of the second column is the distance to the next closest neighbor, and so on.
The other matrix, nn.index, isn't shown, but it contains the row numbers of the 5 nearest neighbors of every data point. This is helpful for identifying which points are nearest to any given point.
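As a rough illustration of how the two matrices relate, the sketch below looks up the nearest neighbor of the first point in nn.index and recomputes its distance by hand. The data is synthetic (an assumption, not the furniture data), so the printed values will differ from those on the slide.

```r
library(FNN)

# Synthetic stand-in data: 30 points with two features.
set.seed(1)
furniture <- cbind(Height = runif(30, 20, 100), Width = runif(30, 20, 100))
furniture_knn <- get.knn(data = furniture, k = 5)

# First rows of each matrix: distances, and the row numbers of the neighbors.
head(furniture_knn$nn.dist, 3)
head(furniture_knn$nn.index, 3)

# Recompute the distance from point 1 to its closest neighbor by hand.
nearest <- furniture_knn$nn.index[1, 1]
manual <- sqrt(sum((furniture[1, ] - furniture[nearest, ])^2))
all.equal(manual, furniture_knn$nn.dist[1, 1])
```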
6. kNN distance score
get.knn() calculates the distance between each point and its 5 nearest neighbors. To build the kNN score for each point, we take the average value in each row of the distance matrix nn.dist. The rowMeans() function returns the mean of each row of an input matrix, and here the result is saved as furniture_score.
Remember that higher scores are more likely to correspond to anomalous points, so the which.max() function is useful for finding the point with the largest score. In this case, it's row 29.
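Putting the steps together, a minimal end-to-end sketch could look like this. The data is synthetic, with two clusters and one deliberately planted outlier in the last row, so the row found here is not the row 29 from the real furniture data.

```r
library(FNN)

# Synthetic data (assumption): two tight clusters plus one planted outlier.
set.seed(1)
furniture <- rbind(
  cbind(rnorm(25, 40, 2), rnorm(25, 60, 2)),
  cbind(rnorm(25, 90, 2), rnorm(25, 30, 2)),
  c(150, 150)   # planted outlier, row 51
)

# Distances to the 5 nearest neighbors of each point.
furniture_knn <- get.knn(data = furniture, k = 5)

# kNN score: mean of each row of the distance matrix.
furniture_score <- rowMeans(furniture_knn$nn.dist)

# Row with the largest score, i.e. the most isolated point.
which.max(furniture_score)
```

Sorting the scores, for example with order(furniture_score, decreasing = TRUE), is one way to inspect more than just the single most isolated point.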
7. Let's practice!
Let's practice calculating kNN scores!