Restricted Levenshtein
You notice that the stringdist package also implements a variation of the Levenshtein distance called the Restricted Damerau-Levenshtein distance, and you want to try it out. You will follow the logic from the lesson: wrap the distance in a custom function, then precompute the distance matrix before fitting a local outlier factor anomaly detector. You will measure performance with accuracy_score(), which is available to you as accuracy(). You also have access to the stringdist package, numpy as np, pdist() and squareform() from scipy.spatial.distance, and LocalOutlierFactor as lof. The data has been preloaded as a pandas DataFrame, proteins, with two columns, label and sequence, and has two classes: IMMUNE SYSTEM and VIRUS.
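To make the metric concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein distance (also known as optimal string alignment). It is not the stringdist implementation; the function name restricted_damerau_levenshtein is a hypothetical one chosen for illustration. On top of the insertions, deletions and substitutions counted by plain Levenshtein, it counts a swap of two adjacent characters as a single edit, with the restriction that no substring is edited more than once.

```python
def restricted_damerau_levenshtein(a: str, b: str) -> int:
    """Illustrative sketch of the restricted Damerau-Levenshtein distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters counts as one edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(restricted_damerau_levenshtein("ab", "ba"))   # 1 (plain Levenshtein gives 2)
print(restricted_damerau_levenshtein("ca", "abc"))  # 3 (the restriction rules out the 2-edit path)
```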
This exercise is part of the course
Designing Machine Learning Workflows in Python
Exercise instructions
- Write a function with inputs u and v, each of which is an array containing a string, that applies the rdlevenshtein() function to the two strings.
- Reshape the sequence column from proteins by first casting it into a numpy array and then using .reshape().
- Compute a square distance matrix for sequences using my_rdlevenshtein(), and fit lof on it.
- Compute accuracy by converting preds and proteins['label'] into booleans indicating whether a protein is a virus.
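The steps above can be sketched end to end on toy data. This is a hedged illustration, not the exercise solution: the sequences and labels are made up, the rdlevenshtein() function below is a pure-Python stand-in assumed to behave like stringdist.rdlevenshtein(), n_neighbors=3 is an arbitrary choice for such a small sample, and treating VIRUS as the outlier class is an assumption about what the exercise intends.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import accuracy_score
from sklearn.neighbors import LocalOutlierFactor


def rdlevenshtein(a, b):
    """Pure-Python stand-in, assumed equivalent to stringdist.rdlevenshtein."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition counts as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[-1]


def my_rdlevenshtein(u, v):
    # pdist passes one row per argument; each row is a length-1 array holding a string
    return rdlevenshtein(u[0], v[0])


# Hypothetical toy data standing in for the preloaded proteins DataFrame
labels = np.array(["IMMUNE SYSTEM"] * 5 + ["VIRUS"])
seqs = ["MKVLAA", "MKVLAG", "MKVALA", "MKILAA", "MKVLAS", "QQWWEERRTTYY"]

# One string per row, so pdist sees a 2-D array of shape (n_samples, 1)
sequences = np.array(seqs, dtype=object).reshape(-1, 1)

# Condensed pairwise distances, expanded into a square symmetric matrix
M = squareform(pdist(sequences, my_rdlevenshtein))

# LoF on the precomputed matrix; fit_predict returns -1 for outliers, 1 for inliers
preds = LocalOutlierFactor(n_neighbors=3, metric="precomputed").fit_predict(M)

# Compare boolean "is a virus" labels with boolean "is an outlier" predictions
acc = accuracy_score(labels == "VIRUS", preds == -1)
print(M.shape, preds, acc)
```

Precomputing M keeps the string metric out of the detector: LocalOutlierFactor only ever sees numeric distances, so any pairwise function that pdist can call on the rows will work.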
Interactive exercise
Complete the sample code to finish this exercise successfully.
# Wrap the RD-Levenshtein metric in a custom function
def my_rdlevenshtein(u, v):
    return ____.rdlevenshtein(____, ____)
# Reshape the array into a numpy matrix
sequences = ____(proteins['seq']).____(-1, 1)
# Compute the pairwise distance matrix in square form
M = ____
# Run a LoF algorithm on the precomputed distance matrix
preds = lof(metric=____).____(M)
# Compute the accuracy of the outlier predictions
print(accuracy(____, ____))