Get startedGet started for free

Bringing it all together

In addition to the distance-based learning anomaly detection pipeline you created in the last exercise, you want to also support a feature-based learning one with one-class SVM. You decide to extract two features: first, the length of the string, and, second, a numerical encoding of the first letter of the string, obtained using the function LabelEncoder() described in Chapter 1. To ensure a fair comparison, you will input the outlier scores into an AUC calculation. The following have been imported: LabelEncoder(), roc_auc_score() as auc() and OneClassSVM. The data is available as a pandas data frame called proteins with two columns, label and seq, and two classes, IMMUNE SYSTEM and VIRUS. A fitted LoF detector is available as lof_detector.

This exercise is part of the course

Designing Machine Learning Workflows in Python

View Course

Exercise instructions

  • For a string s, len(s) returns its length. Apply that to the seq column to obtain a new column len.
  • For a string s, list(s) returns a list of its characters. Use this to extract the first letter of each sequence, and encode it using LabelEncoder().
  • LoF scores are in the negative_outlier_factor_ attribute. Compute their AUC.
  • Fit a 1-class SVM to a data frame with only len and first as columns. Extract the scores and assess both the LoF scores and the SVM scores using AUC.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a feature that contains the length of the string
proteins['len'] = proteins['seq'].apply(____)

# Create a feature encoding the first letter of the string
proteins['first'] =  ____.____(
  proteins['seq'].apply(____))

# Extract scores from the fitted LoF object, compute its AUC
scores_lof = lof_detector.____
print(____(proteins['label']==____, scores_lof))

# Fit a 1-class SVM, extract its scores, and compute its AUC
svm = ____.____(proteins[['len', 'first']])
scores_svm = svm.____(proteins[['len', 'first']])
print(____(proteins['label']==____, scores_svm))
Edit and Run Code