Exercise

Restricted Levenshtein

You notice that the stringdist package also implements a variation of the Levenshtein distance called the restricted Damerau-Levenshtein distance, and you want to try it out. You will follow the logic from the lesson: wrap the distance in a custom function, then precompute the distance matrix before fitting a local outlier factor anomaly detector. You will measure performance with accuracy_score(), which is available to you as accuracy(). You also have access to the packages stringdist and numpy (as np), to pdist() and squareform() from scipy.spatial.distance, and to LocalOutlierFactor as lof. The data has been preloaded as a pandas DataFrame named proteins with two columns, label and sequence, and has two classes: IMMUNE SYSTEM and VIRUS.
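To make the distance itself concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein distance (also known as optimal string alignment). It is not the stringdist implementation; the function name osa_distance is hypothetical and used only for illustration. Compared with plain Levenshtein, swapping two adjacent characters costs one edit instead of two, and "restricted" means a swapped pair cannot be edited again afterwards.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Illustrative sketch only; in the exercise you would call
    stringdist.rdlevenshtein() instead.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters counts as one edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ACDE", "ADCE"))  # 1: a single adjacent swap of C and D
print(osa_distance("CA", "ABC"))     # 3: the restriction forbids editing a swapped pair
```

The second call shows the effect of the restriction: the unrestricted Damerau-Levenshtein distance between "CA" and "ABC" is 2 (swap to "AC", then insert "B" in between), but that second edit touches the swapped pair, so the restricted variant needs 3 edits.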

Instructions
100 XP
  • Write a function with inputs u and v, each of which is an array containing a single string, that applies the rdlevenshtein() function to the two strings.
  • Reshape the sequence column from proteins by first casting it to a NumPy array and then calling .reshape().
  • Compute a square distance matrix for sequences using my_rdlevenshtein(), and fit lof on it.
  • Compute accuracy by converting preds and proteins['label'] into booleans indicating whether a protein is a virus.
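
The four steps above can be sketched end to end as follows. This is one possible reading of the instructions, not the official solution. Several things are assumptions: the toy sequences and labels are made up purely for illustration, a plain dict of NumPy arrays stands in for the preloaded proteins DataFrame, the reshape target (-1, 1) is chosen because pdist() expects a 2-D input, n_neighbors=3 is picked only to suit the tiny sample, and the fallback rdlevenshtein() is a pure-Python stand-in used only when stringdist is not installed.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import accuracy_score as accuracy
from sklearn.neighbors import LocalOutlierFactor as lof

try:
    from stringdist import rdlevenshtein
except ImportError:
    # Assumption: stand-in used only if stringdist is unavailable.
    def rdlevenshtein(a, b):
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
        return d[-1][-1]

# Hypothetical toy data standing in for the preloaded `proteins` DataFrame.
proteins = {
    "label": np.array(["IMMUNE SYSTEM"] * 6 + ["VIRUS"] * 2),
    "sequence": np.array([
        "MKVLAAGIV", "MKVLAAGIL", "MKVLASGIV",
        "MKVLAAGVI", "MKVLAGGIV", "MKLVAAGIV",
        "QWERTYPPPPSDFGH", "TTTTTTTT",
    ]),
}


def my_rdlevenshtein(u, v):
    # pdist() passes rows of the 2-D array, so each input is a
    # one-element array wrapping a string.
    return rdlevenshtein(u[0], v[0])


# One row per sequence, one column holding the string.
sequences = np.array(proteins["sequence"]).reshape(-1, 1)

# pdist() returns the condensed pairwise distances; squareform()
# expands them into a symmetric square matrix.
M = squareform(pdist(sequences, my_rdlevenshtein))

# metric='precomputed' tells the detector that M already holds distances.
detector = lof(metric="precomputed", n_neighbors=3)
preds = detector.fit_predict(M)  # -1 marks outliers, 1 marks inliers

# Compare "flagged as outlier" against "labelled as virus".
acc = accuracy(proteins["label"] == "VIRUS", preds == -1)
print(acc)
```

Treating the VIRUS class as the anomaly is what makes the comparison in the last step meaningful: fit_predict() returns -1 for outliers, so preds == -1 and proteins['label'] == 'VIRUS' are both boolean arrays that answer the same question.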