1. Unstructured data
So far, we have focused on structured, numeric data. We now briefly outline ways of working with unstructured data via distance-based learning.
2. Structured versus unstructured
As we have seen, structured data is tabular and numeric -- or can be easily made numeric using encodings. An example is the hepatitis dataset we saw previously.
Unstructured data, in contrast, consists of object descriptions from which no numeric features have been extracted. Examples include images, text, audio and video, or simply sequences of symbols, as in the case of the amino acid sequences that form proteins, shown here. We will focus on this latter type, using a dataset containing two kinds of proteins: 805 immune system and 199 virus proteins.
By the end of the lesson, we will build an anomaly detector that can flag the virus proteins as anomalous without any form of supervision. Let's go!
3. Distance is all that matters
Extracting features from a raw string requires extensive domain expertise, so we will see whether we can deal with this problem using distance-based learning instead.
A popular choice of distance metric in bioinformatics is the so-called edit or Levenshtein distance, which counts the number of inserts, deletes or substitutions that would be necessary in order to convert one string to another. It is available from the stringdist module.
For example, the Levenshtein distance between 'abc' and 'acc' is 1, whereas the distance between 'acc' and 'cce' is 2.
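As a sketch, the same counts can be reproduced with a short dynamic-programming routine; the function below is a stand-in for stringdist.levenshtein (the helper name and the toy strings are our own, not part of the course code):

```python
# Minimal edit-distance sketch; a stand-in for stringdist.levenshtein
def levenshtein(s, t):
    prev = list(range(len(t) + 1))  # distances from '' to each prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein('abc', 'acc'))  # 1: one substitution
print(levenshtein('acc', 'cce'))  # 2: two substitutions
```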
It is not unreasonable to expect that many proteins performing similar functions will have evolved from one another through random substitutions, which would make them neighbors in terms of the Levenshtein distance. For example, the two immune system proteins shown here differ in only a single letter, so their Levenshtein distance is 1.
4. Some debugging
The pdist() function allows you to use non-scipy metrics like Levenshtein. However, a little bit of debugging is required to make it work!
First, applying pdist directly on a pandas series throws an error because it is expecting a two-dimensional numpy array.
5. Some debugging
This is easy to fix by casting the series as a numpy array, and then reshaping it into a matrix. Easy!
This fixed the previous error, but raised another one. Don't worry, this is progress! This time, it is the metric that is complaining: it was expecting a string as the value of each of its two arguments, but instead it got an array containing a string.
6. Some debugging
This is also easy to fix. Write a simple function that takes as input two arrays and, for each array, extracts its one and only element.
Then pass this on to the Levenshtein metric.
Success!
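Putting both fixes together, a minimal runnable sketch might look like this. The toy sequences and the pure-Python levenshtein helper are hypothetical stand-ins for the protein data and stringdist.levenshtein:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Stand-in for stringdist.levenshtein (classic dynamic programming)
def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

# Toy stand-ins for the protein sequences (hypothetical data)
sequences = np.array(['MKV', 'MKI', 'QQQ'], dtype=object)

# Fix 1: pdist expects a 2D array, so reshape the 1D array into a column
X = sequences.reshape(-1, 1)

# Fix 2: pdist hands each row (a 1-element array) to the metric,
# so unpack the single string before computing the distance
def my_levenshtein(x, y):
    return levenshtein(x[0], y[0])

dists = pdist(X, metric=my_levenshtein)
print(dists)  # condensed distances for pairs (0,1), (0,2), (1,2)
```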
7. Protein outliers with precomputed matrices
It is important to remember that the amount of computation performed by pdist scales with the square of the number of examples. So it is slow! For example, computing the Levenshtein distance matrix of approximately 6000 immune and virus proteins takes 43 seconds.
Thankfully, you only have to compute this once. You can then fit the local outlier factor on this precomputed matrix in just 3 seconds! Just remember to apply squareform first: the local outlier factor expects a square distance matrix, whereas pdist returns a condensed one.
We can finally see how well the Levenshtein distance metric allows us to tell viruses apart from human immune system proteins. Viruses should be outliers in this data, and so be labeled -1 by the LoF algorithm. Well, the AUC is 0.64, which is clearly better than random guessing. So it works!
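The whole pipeline can be sketched end to end on toy data. Everything below — the sequences, the labels, and the n_neighbors and contamination settings — is an illustrative stand-in, not the course dataset, with a pure-Python edit distance in place of stringdist.levenshtein:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

# Stand-in for stringdist.levenshtein
def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

# Hypothetical toy data: five similar "immune" strings, one "virus" outlier
sequences = np.array(
    ['MKVL', 'MKVI', 'MKVV', 'MKIL', 'MKLL', 'QQQQQQ'], dtype=object)
labels = np.array([0, 0, 0, 0, 0, 1])  # 1 = virus (outlier)

# Compute the pairwise distance matrix once (the slow part)
X = sequences.reshape(-1, 1)
dists = pdist(X, metric=lambda x, y: levenshtein(x[0], y[0]))

# LoF with metric='precomputed' needs the full square matrix,
# while pdist returns the condensed form: convert with squareform
D = squareform(dists)

lof = LocalOutlierFactor(n_neighbors=3, metric='precomputed',
                         contamination=0.2)
preds = lof.fit_predict(D)              # -1 marks outliers
scores = -lof.negative_outlier_factor_  # higher = more anomalous

print(preds)
print(roc_auc_score(labels, scores))
```

On this tiny example the distant string is flagged as -1 without any supervision, mirroring what the lesson does with the virus proteins.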
8. Pick your distance
This course is all about making full use of the flexibility of machine learning interfaces in Python. Time to practice writing your own distance metric to tell viruses apart from immune proteins.