Bringing it all together
In addition to the distance-based learning anomaly detection pipeline you created in the last exercise, you want to also support a feature-based learning one with one-class SVM. You decide to extract two features: first, the length of the string, and, second, a numerical encoding of the first letter of the string, obtained using the function LabelEncoder() described in Chapter 1. To ensure a fair comparison, you will input the outlier scores into an AUC calculation. The following have been imported: LabelEncoder(), roc_auc_score() as auc(), and OneClassSVM. The data is available as a pandas data frame called proteins with two columns, label and seq, and two classes, IMMUNE SYSTEM and VIRUS. A fitted LoF detector is available as lof_detector.
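As a minimal sketch of the two features described above, the snippet below builds them on a small made-up data frame. The name toy and its sequences are hypothetical stand-ins for proteins, not the course data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy sequences standing in for the real proteins data frame
toy = pd.DataFrame({'seq': ['MKV', 'ACDEFG', 'MK', 'GAVL']})

# Feature 1: the length of each string
toy['len'] = toy['seq'].apply(len)

# Feature 2: an integer code for the first letter of each string.
# LabelEncoder assigns codes in sorted order of the distinct letters.
first_letters = toy['seq'].apply(lambda s: list(s)[0])
toy['first'] = LabelEncoder().fit_transform(first_letters)

print(toy)
```

Here the distinct first letters are A, G and M, so they are encoded as 0, 1 and 2 respectively. The codes are arbitrary labels, so their ordering carries no biological meaning.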
This exercise is part of the course
Designing Machine Learning Workflows in Python
Exercise instructions
- For a string s, len(s) returns its length. Apply that to the seq column to obtain a new column len.
- For a string s, list(s) returns a list of its characters. Use this to extract the first letter of each sequence, and encode it using LabelEncoder().
- LoF scores are in the negative_outlier_factor_ attribute. Compute their AUC.
- Fit a 1-class SVM to a data frame with only len and first as columns. Extract the scores and assess both the LoF scores and the SVM scores using AUC.
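The scoring and comparison steps can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the exercise solution: the array X, the mask is_inlier and the cluster positions are invented here, and both detectors use scikit-learn defaults. Both score types are larger for more normal points, so the AUC is computed against a boolean that is True for the normal class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical data: 100 "normal" points near the origin, 10 anomalies far away
X = np.vstack([
    rng.normal(0, 1, size=(100, 2)),
    rng.normal(8, 1, size=(10, 2)),
])
is_inlier = np.r_[np.ones(100, dtype=bool), np.zeros(10, dtype=bool)]

# LoF: scores for the training data live in an attribute after fitting
lof = LocalOutlierFactor().fit(X)
scores_lof = lof.negative_outlier_factor_

# One-class SVM: scores come from score_samples() on the data of interest
svm = OneClassSVM().fit(X)
scores_svm = svm.score_samples(X)

# Higher score = more normal, so the positive class is the inlier mask
auc_lof = roc_auc_score(is_inlier, scores_lof)
auc_svm = roc_auc_score(is_inlier, scores_svm)
print(auc_lof, auc_svm)
```

Because AUC depends only on how the scores rank the examples, it allows the two detectors to be compared even though their raw scores are on different scales.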
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a feature that contains the length of the string
proteins['len'] = proteins['seq'].apply(____)
# Create a feature encoding the first letter of the string
proteins['first'] = ____.____(
    proteins['seq'].apply(____))
# Extract scores from the fitted LoF object, compute its AUC
scores_lof = lof_detector.____
print(____(proteins['label']==____, scores_lof))
# Fit a 1-class SVM, extract its scores, and compute its AUC
svm = ____.____(proteins[['len', 'first']])
scores_svm = svm.____(proteins[['len', 'first']])
print(____(proteins['label']==____, scores_svm))