Restricted Levenshtein
You notice that the stringdist package also implements a variation of Levenshtein distance called the Restricted Damerau-Levenshtein distance, and want to try it out. You will follow the logic from the lesson, wrapping it inside a custom function and precomputing the distance matrix before fitting a local outlier factor anomaly detector. You will measure performance with accuracy_score(), which is available to you as accuracy(). You also have access to the packages stringdist and numpy (as np), pdist() and squareform() from scipy.spatial.distance, and LocalOutlierFactor as lof. The data has been preloaded as a pandas dataframe with two columns, label and sequence, and has two classes: IMMUNE SYSTEM and VIRUS.
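To make the metric concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein distance, assuming the usual definition (also known as optimal string alignment): plain Levenshtein edits plus swapping two adjacent characters, where a swapped pair may not be edited again. The function name osa_distance is hypothetical and is not part of stringdist; in the exercise you would call stringdist.rdlevenshtein() instead.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Illustrative implementation only; assumes the standard definition.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))   # one adjacent swap
print(osa_distance("ca", "abc"))  # the restriction makes this cost more than a swap plus an insert
```

Under this definition, swapping "ab" into "ba" costs 1, whereas plain Levenshtein distance would charge 2 for the same change.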
This exercise is part of the course Designing Machine Learning Workflows in Python.
Exercise instructions
- Write a function with inputs u and v, each of which is an array containing a string, that applies the rdlevenshtein() function to the two strings.
- Reshape the sequence column from proteins by first casting it into a numpy array, and then using .reshape().
- Compute a square distance matrix for sequences using my_rdlevenshtein(), and fit lof on it.
- Compute accuracy by converting preds and proteins['label'] into booleans indicating whether a protein is a virus.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Wrap the RD-Levenshtein metric in a custom function
def my_rdlevenshtein(u, v):
    return ____.rdlevenshtein(____, ____)
# Reshape the array into a numpy matrix
sequences = ____(proteins['seq']).____(-1, 1)
# Compute the pairwise distance matrix in square form
M = ____
# Run a LoF algorithm on the precomputed distance matrix
preds = lof(metric=____).____(M)
# Compute the accuracy of the outlier predictions
print(accuracy(____, ____))
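
The same workflow can be sketched end to end on toy data. Everything below is an assumption made for illustration: the sequences and labels are invented, edit_distance() is a plain Levenshtein stand-in for stringdist.rdlevenshtein(), and n_neighbors=2 is chosen only because the toy set is tiny. Unlike the exercise, where u[0] and v[0] are the strings themselves, this sketch hands pdist() a column of row indices and looks the strings up inside the metric, which keeps it independent of how pdist() treats string arrays.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import accuracy_score as accuracy
from sklearn.neighbors import LocalOutlierFactor as lof


def edit_distance(a, b):
    """Plain Levenshtein distance (stand-in for stringdist.rdlevenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


# Invented toy data: five similar sequences and one very different one
seqs = ["MKTAYIAK", "MKTAYIAR", "MKTAYLAK", "MKTTYIAK", "MKAAYIAK",
        "GGGGSSSSGGGGSSSSPPPP"]
labels = np.array(["IMMUNE SYSTEM"] * 5 + ["VIRUS"])

# pdist expects a 2-D array, so pass one row index per sequence
idx = np.arange(len(seqs)).reshape(-1, 1)


def index_metric(u, v):
    # u and v are length-1 arrays holding a row index into seqs
    return edit_distance(seqs[int(u[0])], seqs[int(v[0])])


# Condensed pairwise distances, expanded into a square symmetric matrix
M = squareform(pdist(idx, index_metric))

# metric='precomputed' tells LoF that M already contains distances
preds = lof(n_neighbors=2, metric="precomputed").fit_predict(M)

# fit_predict marks outliers as -1; compare against the VIRUS label
print(accuracy(labels == "VIRUS", preds == -1))
```

Precomputing M means the string metric is evaluated once per pair, and the detector only ever sees numeric distances.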