
Restricted Levenshtein

You notice that the stringdist package also implements a variation of the Levenshtein distance called the Restricted Damerau-Levenshtein distance, and you want to try it out. You will follow the logic from the lesson: wrap the metric in a custom function and precompute the distance matrix before fitting a local outlier factor anomaly detector. You will measure performance with accuracy_score(), which is available to you as accuracy(). You also have access to the stringdist package, numpy as np, pdist() and squareform() from scipy.spatial.distance, and LocalOutlierFactor as lof. The data has been preloaded as a pandas DataFrame, proteins, with two columns, label and sequence, and two classes: IMMUNE SYSTEM and VIRUS.
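To make the difference between the two metrics concrete, here is a minimal pure-Python sketch. It assumes that "restricted Damerau-Levenshtein" means the optimal string alignment variant, in which swapping two adjacent characters counts as a single edit but no substring is edited more than once. The function names are hypothetical and this is not the stringdist implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def restricted_damerau_levenshtein(a: str, b: str) -> int:
    """Optimal string alignment distance (assumed meaning of 'restricted')."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # Extra case: the last two characters are swapped in the other string.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# A swap of adjacent characters costs two plain edits but one restricted edit.
print(levenshtein("ab", "ba"), restricted_damerau_levenshtein("ab", "ba"))
```

On sequences with no adjacent swaps the two functions agree, so any change in the detector's accuracy would come only from pairs that differ by transpositions.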

This exercise is part of the course

Designing Machine Learning Workflows in Python


Exercise instructions

  • Write a function with inputs u and v, each of which is an array containing a single string, that applies the rdlevenshtein() function to the two strings.
  • Reshape the sequence column from proteins by first casting it into a NumPy array, and then using .reshape().
  • Compute a square distance matrix for sequences using my_rdlevenshtein(), and fit lof on it.
  • Compute accuracy by converting preds and proteins['label'] into booleans indicating whether a protein is a virus.
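The reshape and distance-matrix steps can be sketched on toy data as follows. Because pdist() expects a two-dimensional array, each string sits in its own one-element row, and the custom metric has to unwrap it with [0]. The metric toy_string_distance is a hypothetical stand-in (position-wise mismatches plus the length difference), used here only so the sketch runs without stringdist; the three strings are made up.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def toy_string_distance(u, v):
    """Stand-in metric (not rdlevenshtein): mismatches plus length difference."""
    a, b = u[0], v[0]  # each argument is a length-1 array holding one string
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


# One string per row, so the array has shape (n_samples, 1).
sequences = np.array(["abc", "abd", "xyz"]).reshape(-1, 1)

# pdist returns the condensed upper triangle; squareform expands it into a
# symmetric matrix with zeros on the diagonal.
condensed = pdist(sequences, toy_string_distance)
M = squareform(condensed)
print(M)
```

In the exercise itself, the wrapper around rdlevenshtein() would take the place of the stand-in metric.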

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Wrap the RD-Levenshtein metric in a custom function
def my_rdlevenshtein(u, v):
    return ____.rdlevenshtein(____, ____)

# Reshape the array into a numpy matrix
sequences = ____(proteins['seq']).____(-1, 1)

# Compute the pairwise distance matrix in square form
M = ____

# Run a LoF algorithm on the precomputed distance matrix
preds = lof(metric=____).____(M)

# Compute the accuracy of the outlier predictions
print(accuracy(____, ____))
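For the last two steps, the sketch below shows how a local outlier factor detector can consume a precomputed distance matrix and how its output can be scored against boolean labels. The matrix and labels are made up for illustration, and n_neighbors=2 is an assumption chosen to suit five samples; nothing here comes from the proteins data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import LocalOutlierFactor

# Toy distance matrix: four points sit at distance 1 from each other,
# and the fifth is at distance 10 from all of them.
M = np.full((5, 5), 1.0)
M[4, :] = 10.0
M[:, 4] = 10.0
np.fill_diagonal(M, 0.0)

# Toy labels using the two class names from the exercise.
labels = np.array(["IMMUNE SYSTEM"] * 4 + ["VIRUS"])

# metric="precomputed" tells the detector that M already holds distances.
detector = LocalOutlierFactor(n_neighbors=2, metric="precomputed")
preds = detector.fit_predict(M)  # 1 for inliers, -1 for outliers

# Convert both sides to booleans before scoring.
is_virus = labels == "VIRUS"
is_outlier = preds == -1
print(accuracy_score(is_virus, is_outlier))
```

Comparing booleans rather than raw values matters because the detector emits 1 and -1 while the labels are strings, so a direct comparison would never match.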