Bringing it all together
In addition to the distance-based learning anomaly detection pipeline you created in the last exercise, you also want to support a feature-based learning one using a one-class SVM. You decide to extract two features: the length of the string, and a numerical encoding of the first letter of the string, obtained using LabelEncoder() as described in Chapter 1. To ensure a fair comparison, you will feed the outlier scores from both detectors into an AUC calculation. The following have been imported: LabelEncoder(), roc_auc_score() as auc(), and OneClassSVM. The data is available as a pandas data frame called proteins with two columns, label and seq, and two classes, IMMUNE SYSTEM and VIRUS. A fitted LoF detector is available as lof_detector.
This exercise is part of the course
Designing Machine Learning Workflows in Python
Exercise instructions
- For a string s, len(s) returns its length. Apply that to the seq column to obtain a new column len.
- For a string s, list(s) returns a list of its characters. Use this to extract the first letter of each sequence, and encode it using LabelEncoder().
- LoF scores are in the negative_outlier_factor_ attribute. Compute their AUC.
- Fit a 1-class SVM to a data frame with only len and first as columns. Extract the scores and assess both the LoF scores and the SVM scores using AUC.
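The two feature-extraction steps above can be tried out on a small stand-in data frame. This is a minimal sketch only: the data frame toy and its four sequences are made up for illustration and are not the course's proteins data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy stand-in for the proteins data frame (not the real data)
toy = pd.DataFrame({
    "label": ["VIRUS", "IMMUNE SYSTEM", "VIRUS", "IMMUNE SYSTEM"],
    "seq": ["MKT", "AVLLG", "MA", "QVQLQE"],
})

# Feature 1: length of each sequence
toy["len"] = toy["seq"].apply(len)

# Feature 2: first letter of each sequence, encoded as an integer.
# list(s) splits the string into characters; [0] picks the first one.
first_letters = toy["seq"].apply(lambda s: list(s)[0])
toy["first"] = LabelEncoder().fit_transform(first_letters)

print(toy[["seq", "len", "first"]])
```

LabelEncoder() assigns integer codes in sorted order of the distinct values it sees, so the codes depend on which first letters appear in the data.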
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a feature that contains the length of the string
proteins['len'] = proteins['seq'].apply(____)
# Create a feature encoding the first letter of the string
proteins['first'] = ____.____(
proteins['seq'].apply(____))
# Extract scores from the fitted LoF object, compute its AUC
scores_lof = lof_detector.____
print(____(proteins['label']==____, scores_lof))
# Fit a 1-class SVM, extract its scores, and compute its AUC
svm = ____.____(proteins[['len', 'first']])
scores_svm = svm.____(proteins[['len', 'first']])
print(____(proteins['label']==____, scores_svm))
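For orientation, the scoring-and-comparison pattern might look as follows on synthetic data. This is a sketch under stated assumptions, not the exercise solution: the points, the is_inlier labels, and the settings n_neighbors=10 and gamma="scale" are all invented for illustration. For both detectors a higher score means "more normal", so the inlier class is treated as the positive class in the AUC.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic two-feature data (assumption): 40 inliers near the origin
# and 5 outliers far away. Column names mirror the exercise's features.
rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(40, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
X = pd.DataFrame(np.vstack([inliers, outliers]), columns=["len", "first"])
is_inlier = np.array([True] * 40 + [False] * 5)

# LoF: scores for the training data live in negative_outlier_factor_
# (values near -1 are normal, strongly negative values are outlying)
lof = LocalOutlierFactor(n_neighbors=10).fit(X)
scores_lof = lof.negative_outlier_factor_

# One-class SVM: score_samples() returns higher values for more normal points
svm = OneClassSVM(gamma="scale").fit(X)
scores_svm = svm.score_samples(X)

# Same labels and same metric for both detectors, so the AUCs are comparable
auc_lof = roc_auc_score(is_inlier, scores_lof)
auc_svm = roc_auc_score(is_inlier, scores_svm)
print(auc_lof, auc_svm)
```

Because AUC depends only on how the scores rank the observations, the two detectors can be compared directly even though their scores are on different scales.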