1. KNN for outlier detection
We will now learn distance- and density-based algorithms for multivariate anomaly detection, beginning with the k-Nearest-Neighbors algorithm.
2. Applications of KNN
KNN is a popular ML algorithm with applications in both supervised and unsupervised learning. It has proven effective in regression, classification, clustering and, of course, outlier detection.
3. Simplicity of KNN
Perhaps its popularity in outlier detection is due to its simplicity. While Isolation Forest combines tree depth, sub-sample size and other components to calculate an anomaly score, KNN uses only the distances between instances as a measure of outlierness.
4. Ansur Male Dataset
Let's try the new algorithm on the Ansur Male body measurements dataset, which we load as males.
The dataset records 95 different physical measurements of more than 4000 US Army males.
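As a point of reference, loading the dataset might look like the following minimal sketch; the file name ANSUR_II_MALE.csv and the restriction to numeric columns are assumptions rather than details stated here.

```python
import pandas as pd

# Hypothetical file name; point this at wherever the ANSUR male data lives
males = pd.read_csv("ANSUR_II_MALE.csv")

# Keep only the numeric measurement columns for distance-based detection
males = males.select_dtypes(include="number")

print(males.shape)  # on the order of 4000+ rows and ~95 measurement columns
```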
5. KNN in action
We instantiate a KNN estimator after we import it from the pyod-dot-models-dot-knn module.
Here, we make an informed choice of 1% contamination for this dataset. Since people in the army are usually fit, we expect only a handful of soldiers to be outliers in their body measurements. We also use parallel execution to speed up the computation.
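A minimal sketch of this step, assuming males contains only numeric measurement columns:

```python
from pyod.models.knn import KNN

# contamination=0.01: we expect roughly 1 in 100 soldiers to be flagged.
# n_jobs=-1 runs the neighbor searches on all available CPU cores.
knn = KNN(contamination=0.01, n_jobs=-1)
knn.fit(males)

# labels_ marks each training sample as 0 (inlier) or 1 (outlier)
print(knn.labels_.sum(), "outliers flagged")
```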
6. KNN with outlier probabilities
Let's filter the outliers using outlier probabilities instead of contamination. We generate the probabilities with predict_proba.
We find 13 outliers, which aligns well with our assumption when choosing the contamination level. However, we cannot fully trust this output because we left the other hyperparameters at their defaults.
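A sketch of the probability-based filtering; the 50% threshold below is an illustrative choice rather than one prescribed above:

```python
# predict_proba returns two columns: P(inlier) and P(outlier) per row
probs = knn.predict_proba(males)

# Flag rows whose outlier probability exceeds the chosen threshold
is_outlier = probs[:, 1] > 0.5
outliers = males[is_outlier]

print(len(outliers), "outliers found")
```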
7. The number of neighbors
The most important hyperparameter of KNN is n_neighbors, which determines how many neighbors are considered when calculating a sample's anomaly score. The optimal value for n_neighbors, usually denoted as k, is highly dataset-dependent.
A rule of thumb that works well in practice is to choose 20 for n_neighbors if we set contamination below 10%. We should increase n_neighbors accordingly for higher contamination levels. Higher values of n_neighbors will also mitigate the effect of noise in the dataset.
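Applying that rule of thumb to our 1% contamination setting could look like this sketch:

```python
# contamination is below 10%, so the rule of thumb suggests n_neighbors=20
knn = KNN(n_neighbors=20, contamination=0.01, n_jobs=-1)
knn.fit(males)

# Re-check how many samples cross the probability threshold with the tuned k
probs = knn.predict_proba(males)
print((probs[:, 1] > 0.5).sum(), "outliers with n_neighbors=20")
```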
8. Features of KNN
KNN is a non-parametric model, which means it makes no statistical assumptions about the distribution of features in the data. In contrast, models like Linear Regression assume a linear relationship between the features and the target, which limits their use cases.
KNN trains super-fast because it has no internal algorithm that learns patterns within the data. It memorizes all datapoints and relies on distances between them to make predictions. Because of this, KNN is also called a non-generalizing model.
9. Drawbacks of KNN
Though KNN is superior to Isolation Forest in terms of training speed, it has a few shortcomings.
The fact that KNN memorizes all instances of a dataset makes it memory-inefficient: the model size can quickly get out of hand for large datasets.
Even though training is fast, prediction is slow because most distance calculations happen during that stage. KNN is also sensitive to feature scales. Its performance greatly suffers when features have disproportionate scales relative to their importance. We will see how to solve this last issue in the next video.
10. Let's practice!
For now, let's practice what we've learned.