1. Distance-based learning
Novelty detection is all about flagging data that is unlike anything you have seen before. But what is the best way to measure the similarity, or equivalently, the dissimilarity, between two examples?
2. Distance and similarity
The most common way to establish whether two numeric vectors are similar is to see if they are near each other in the Euclidean sense. This distance metric, along with many others, is available from the neighbors module in scikit-learn.
All metrics implement a method called pairwise, which takes a list of vectors and produces a matrix containing all pairwise distances.
For example, the element in the first row and second column is the distance between the first and second vectors in the list, namely, [0,1] and [2,3].
You can check by explicitly computing the Euclidean distance using numpy. This also explains why the diagonal of the matrix is all zeros: the distance between an object and itself is always 0.
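A minimal sketch of this workflow, using the vectors just mentioned (note that the import path depends on your scikit-learn version: older releases expose DistanceMetric from sklearn.neighbors, newer ones from sklearn.metrics):

```python
import numpy as np
from sklearn.neighbors import DistanceMetric  # sklearn.metrics.DistanceMetric in newer versions

# Two 2-dimensional vectors
X = [[0, 1], [2, 3]]

# Build the Euclidean metric and compute the pairwise distance matrix
dist = DistanceMetric.get_metric('euclidean')
print(dist.pairwise(X))
# [[0.         2.82842712]
#  [2.82842712 0.        ]]

# Check the off-diagonal entry by hand: sqrt((0-2)^2 + (1-3)^2) = sqrt(8)
print(np.sqrt(np.sum((np.array([0, 1]) - np.array([2, 3])) ** 2)))  # 2.8284...
```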
3. Non-Euclidean Local Outlier Factor
Algorithms like local outlier factor, which rely on the concept of a nearest neighbor, crucially depend on the choice of distance metric. Euclidean distance is the default, but the optional parameter metric allows you to try out different choices.
Here, we try the Chebyshev distance, which subtracts the two vectors element-wise and then takes the maximum absolute difference.
We see that the results are very similar to the Euclidean ones, which makes sense: the two distance metrics are closely related.
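Here is a small sketch of the idea on made-up 2D data with one obvious outlier (the data and the choice of n_neighbors are illustrative, not from the course):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: a tight cluster near the origin plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [0.0, 0.3], [0.3, 0.0], [10.0, 10.0]])

# The default metric is Euclidean; the `metric` parameter swaps it out
lof = LocalOutlierFactor(n_neighbors=3, metric='chebyshev')
labels = lof.fit_predict(X)  # +1 for inliers, -1 for outliers
print(labels)  # the last point is flagged as -1
```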
4. Are all metrics similar?
Many distance metrics behave similarly to Euclidean distance, but some are completely different.
For example, the Hamming distance between two vectors counts the proportion of elements on which the two vectors disagree.
So the first and third elements in this example, which are furthest apart in the Euclidean sense, are actually the most similar in the Hamming sense. Hamming distance would be appropriate if, for example, the vectors numerically encoded product IDs for purchases made by different customers, so that two customers' profiles are similar if they have made many common purchases.
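A small illustration with made-up vectors, chosen so that a single large numeric gap drives the Euclidean distance while the Hamming distance only sees one disagreeing element:

```python
import numpy as np
from sklearn.neighbors import DistanceMetric  # sklearn.metrics.DistanceMetric in newer versions

# Hypothetical encoded vectors: the first and third disagree on only one
# element, but that element differs by a large numeric amount
X = np.array([[1, 2, 3],
              [4, 5, 3],
              [1, 2, 100]])

# Euclidean: the first and third vectors are furthest apart
print(DistanceMetric.get_metric('euclidean').pairwise(X))

# Hamming: the first and third disagree on only 1 of 3 elements,
# so they come out as the most similar pair
print(DistanceMetric.get_metric('hamming').pairwise(X))
```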
5. Are all metrics similar?
In addition to the neighbors module, more metrics are offered by the scipy spatial module, accessed through a similar function called pdist.
The only difference is that pdist outputs a condensed form of the distance matrix, with its unique elements in sequence.
To recover the square form you can use the function squareform from the same module.
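For example, a short sketch using made-up vectors and the Manhattan ('cityblock') metric, one of the metrics scipy offers:

```python
from scipy.spatial.distance import pdist, squareform

X = [[0, 1], [2, 3], [4, 0]]

# pdist returns the condensed form: the unique pairwise distances in sequence,
# here [d(0,1), d(0,2), d(1,2)]
condensed = pdist(X, metric='cityblock')
print(condensed)  # [4. 5. 5.]

# squareform recovers the familiar square distance matrix
print(squareform(condensed))
```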
6. A real-world example
Let us consider a real-world dataset on hepatitis patients. The objective is to predict whether the patient will die from hepatitis on the basis of a number of features. Some are boolean; for example, whether the patient is on steroid medication. Others are numeric, such as age.
Consider the three examples shown here. We intuitively believe the first two examples should be nearby because they have the same class.
7. A real-world example
However, computing the Euclidean distance matrix reveals that the first and third examples are nearest neighbors, even though their classes differ. This is probably because numeric features that take large integer values, like age, have a disproportionately large effect on Euclidean distance compared to the boolean features.
The Hamming distance, which instead looks at per-feature agreement rather than numeric difference, more appropriately assigns the second example as the nearest neighbor of the first one.
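A toy reconstruction of this effect (the feature values below are made up, loosely modeled on the hepatitis features just mentioned, with columns [age, steroid, fatigue]):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical patients: the first two share the boolean features
# (and the class), while the third shares the age of the first
X = np.array([[30, 1, 0],
              [50, 1, 0],
              [31, 0, 1]])

# Euclidean: the 20-year age gap dominates, so the first and third
# patients come out as nearest neighbors
print(squareform(pdist(X, metric='euclidean')))

# Hamming: per-feature agreement wins; the first two patients differ
# only on age, so they are nearest neighbors, matching our intuition
print(squareform(pdist(X, metric='hamming')))
```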
8. A bigger toolbox
An anomaly detector with an appropriate distance metric can be a very powerful tool, and a very important one to have in your toolbox. Time to see this for yourself!