Exercise

# Restricted Levenshtein

You notice that the `stringdist` package also implements a variation of Levenshtein distance called the Restricted Damerau-Levenshtein distance, and want to try it out. You will follow the logic from the lesson, wrapping it inside a custom function and precomputing the distance matrix before fitting a local outlier factor anomaly detector. You will measure performance with `accuracy_score()`, which is available to you as `accuracy()`. You also have access to the packages `stringdist` and `numpy` (as `np`), to `pdist()` and `squareform()` from `scipy.spatial.distance`, and to `LocalOutlierFactor` as `lof`. The data has been preloaded as a `pandas` dataframe with two columns, `label` and `sequence`, and has two classes: `IMMUNE SYSTEM` and `VIRUS`.
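For intuition about what the restricted variant measures, here is a minimal pure-Python sketch of the optimal string alignment distance, which is the usual definition of Restricted Damerau-Levenshtein: ordinary Levenshtein edits plus swaps of two adjacent characters, with no substring edited more than once. The name `osa_distance` is hypothetical, and this is an illustration rather than the `stringdist` implementation.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


swap_cost = osa_distance("ab", "ba")         # one adjacent swap
classic_cost = osa_distance("kitten", "sitting")
restricted_cost = osa_distance("ca", "abc")  # the restriction matters here
```

Plain Levenshtein would charge two edits for `"ab"` versus `"ba"`, while this variant charges one. The `"ca"` versus `"abc"` pair shows the restriction: the unrestricted Damerau-Levenshtein distance is 2, but the restricted one is 3 because the swapped pair cannot be edited again.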

Instructions

- Write a function with inputs `u` and `v`, each of which is an array containing a string, that applies the `rdlevenshtein()` function to the two strings.
- Reshape the `sequence` column from `proteins` by first casting it into a `numpy` array, and then using `.reshape()`.
- Compute a square distance matrix for `sequences` using `my_rdlevenshtein()`, and fit `lof` on it.
- Compute accuracy by converting `preds` and `proteins['label']` into booleans indicating whether a protein is a virus.
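
The four steps above can be sketched end to end as follows. This is one possible approach under stated assumptions, not the reference solution: the toy `proteins` dataframe and its sequences are made up for illustration, `n_neighbors=3` is chosen only because the toy data is tiny, and a pure-Python stand-in for `rdlevenshtein()` is used when the `stringdist` package is not installed.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import accuracy_score as accuracy
from sklearn.neighbors import LocalOutlierFactor as lof

try:
    from stringdist import rdlevenshtein
except ImportError:
    # Assumption: stand-in used only when stringdist is unavailable.
    def rdlevenshtein(a, b):
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
        return d[-1][-1]

# Hypothetical toy data standing in for the preloaded dataframe.
proteins = pd.DataFrame({
    "label": ["IMMUNE SYSTEM"] * 5 + ["VIRUS"],
    "sequence": ["ACDEFGHIK", "ACDEFGHIR", "ACDEGFHIK",
                 "ACDEFGHLK", "ACEDFGHIK", "WWYYPPQQTTNNMM"],
})

# pdist passes each row as a length-1 array, so unwrap the string first.
def my_rdlevenshtein(u, v):
    return rdlevenshtein(u[0], v[0])

# One string per row, as a 2-D array with a single column.
sequences = np.array(proteins["sequence"]).reshape(-1, 1)

# Condensed pairwise distances, expanded into a square symmetric matrix.
M = squareform(pdist(sequences, my_rdlevenshtein))

# metric="precomputed" makes LOF treat M as distances, not features.
preds = lof(metric="precomputed", n_neighbors=3).fit_predict(M)

# LOF marks outliers with -1; compare against the VIRUS label as booleans.
score = accuracy(proteins["label"] == "VIRUS", preds == -1)
```

Precomputing `M` keeps the string metric out of the detector: `pdist()` evaluates each pair once, and `squareform()` produces the full matrix that `metric="precomputed"` expects.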