1. Visualizing kNN distance score
In this video, we'll use visual summaries to explore the kNN distance score.
2. Standardizing feature scales
The definition of kNN distance means that the size of the score depends on the scale of the individual features used to calculate the distance matrix.
For example, if one input feature varies on a scale that is 100 times larger than any other feature, then the final score may be dominated by variation in that feature. Consequently, we need to take extra care to ensure that the inputs to the kNN distance calculation are on comparable scales.
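To make the scale problem concrete, here is a minimal sketch using made-up numbers. The data frame `toy` and its columns `small` and `large` are hypothetical and are not part of the furniture data. It compares Euclidean distances computed from both features with distances computed from the large-scale feature alone.

```r
# Hypothetical toy data: two features on very different scales.
toy <- data.frame(
  small = c(1, 2, 3),
  large = c(1000, 3000, 1100)
)

# Euclidean distances between the three rows, using both features.
d_raw <- as.matrix(dist(toy))

# Distances using only the large-scale feature.
d_large_only <- as.matrix(dist(toy["large"]))

# The two matrices are almost the same: 'small' barely contributes.
round(d_raw, 2)
round(d_large_only, 2)
```

Because the two distance matrices barely differ, any score built on these distances would mostly reflect the `large` feature.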
This slide shows the same scatterplot of the furniture data from the previous video. Notice that in this case, height and width appear to vary on similar scales, but we should still standardize both features to avoid sensitivity to the scale.
3. Standardizing features
In R, the scale function is a simple way to standardize one or more variables. Scale accepts a numeric vector or data frame as an input, like the furniture data shown here. For each column in the furniture data, the mean is subtracted and the result is divided by the standard deviation. The resulting object, furniture_scaled, is a set of transformed numeric features that each have a mean of zero and a standard deviation of one.
The scatterplot shows the effect of scaling the height and width features in the furniture data. Notice that the scales of the x- and y-axes have changed, but the pattern in the points is unaffected by standardizing.
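As a rough illustration of what standardizing does, the sketch below applies scale to a small stand-in data frame. The values in `furniture` here are made up for the example and are not the course data. The name `manual_height` is also just illustrative.

```r
# Hypothetical stand-in for the furniture data (values are made up).
furniture <- data.frame(
  Height = c(30, 45, 60, 75, 90),
  Width  = c(40, 55, 50, 80, 120)
)

# Subtract each column's mean, then divide by its standard deviation.
furniture_scaled <- scale(furniture)

# Column means are numerically zero and standard deviations are one.
colMeans(furniture_scaled)
apply(furniture_scaled, 2, sd)

# The same transformation done by hand for one column.
manual_height <- (furniture$Height - mean(furniture$Height)) / sd(furniture$Height)
all.equal(as.numeric(furniture_scaled[, "Height"]), manual_height)
```

Note that scale returns a numeric matrix rather than a data frame, which is fine as an input for distance calculations.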
4. Create and append distance score
Having standardized the input features, we should now recalculate the kNN distance matrix and scores, using the furniture_scaled data instead of the unstandardized version, furniture. As before, use the get.knn function to obtain the distance matrix for the 5 nearest neighbors of each point.
Next, the kNN score is calculated by taking the average of each row of the distance matrix using the rowMeans function. As a final step, the distance score is appended to the original furniture data set as a new column called score. Adding the score to the data frame makes it much easier to visualize in the next step.
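The steps just described can be sketched as follows. This assumes the FNN package is installed, since it provides get.knn. The `furniture` data here is simulated for the example (two made-up clusters) and is not the course data, and the name `furniture_knn` is chosen for illustration.

```r
library(FNN)  # provides get.knn()

# Hypothetical stand-in for the furniture data: two made-up clusters.
set.seed(1)
furniture <- data.frame(
  Height = c(rnorm(25, mean = 50, sd = 5), rnorm(25, mean = 80, sd = 5)),
  Width  = c(rnorm(25, mean = 40, sd = 4), rnorm(25, mean = 70, sd = 4))
)

# Standardize the features before computing distances.
furniture_scaled <- scale(furniture)

# Distances from each point to its 5 nearest neighbors.
furniture_knn <- get.knn(furniture_scaled, k = 5)

# nn.dist has one row per point and one column per neighbor.
dim(furniture_knn$nn.dist)

# The kNN distance score is the average of each row.
furniture$score <- rowMeans(furniture_knn$nn.dist)
head(furniture)
```

The score is computed from the standardized features but stored alongside the original Height and Width values, so later plots keep the original units.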
5. Visualizing distance score
Once the Height, Width and anomaly score are held in the same data frame, it's straightforward to visualize them together using the plot function. To do this, we add two new arguments, cex = sqrt(score) and pch = 20, to the usual scatterplot call.
The cex argument accepts positive values, which are used to scale the size of the plotting character. Notice that the distance score has been used to scale the plotting character, which causes points with distant neighbors to appear larger and points with closer neighbors to appear smaller. The score has also been transformed using the square root function before being passed to the cex argument, which helps visual interpretation by limiting the range of point sizes shown.
The pch argument specifies how the plotting character should look. When pch is set to 20, the points will appear as solid bullets, rather than the default circles. This makes it easier to distinguish points of different sizes.
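Putting the cex and pch arguments together, a plotting call might look like the sketch below. It assumes the FNN package is installed and uses simulated data in place of the course's furniture data. The axis labels and the exact form of the plot call are illustrative choices rather than the call shown on the slide.

```r
library(FNN)  # provides get.knn()

# Hypothetical stand-in for the furniture data: two made-up clusters.
set.seed(1)
furniture <- data.frame(
  Height = c(rnorm(25, mean = 50, sd = 5), rnorm(25, mean = 80, sd = 5)),
  Width  = c(rnorm(25, mean = 40, sd = 4), rnorm(25, mean = 70, sd = 4))
)

# kNN distance score on standardized features, appended as a new column.
furniture$score <- rowMeans(get.knn(scale(furniture), k = 5)$nn.dist)

# Point size follows the square root of the score; pch = 20 draws solid bullets.
plot(
  furniture$Height, furniture$Width,
  cex = sqrt(furniture$score),
  pch = 20,
  xlab = "Height", ylab = "Width"
)
```

Points with larger bullets are those whose 5 nearest neighbors are, on average, further away.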
Notice that the largest anomaly scores are at the top of the region and are associated with the points near to the left cluster. In general, the points nearest to the center of both clusters appear smaller. This is because these points have many neighbors within close proximity, and therefore have a low score.
6. Let's practice!
Let's put this into practice.