Bringing it all together
In addition to the distance-based learning anomaly detection pipeline you created in the last exercise, you also want to support a feature-based learning one using a one-class SVM. You decide to extract two features: first, the length of the string, and second, a numerical encoding of the first letter of the string, obtained using LabelEncoder(), described in Chapter 1. To ensure a fair comparison, you will input the outlier scores into an AUC calculation. The following have been imported: LabelEncoder(), roc_auc_score() as auc(), and OneClassSVM. The data is available as a pandas data frame called proteins with two columns, label and seq, and two classes, IMMUNE SYSTEM and VIRUS. A fitted LoF detector is available as lof_detector.
This exercise is part of the course
Designing Machine Learning Workflows in Python
Exercise instructions
- For a string s, len(s) returns its length. Apply that to the seq column to obtain a new column len.
- For a string s, list(s) returns a list of its characters. Use this to extract the first letter of each sequence, and encode it using LabelEncoder().
- LoF scores are in the negative_outlier_factor_ attribute. Compute their AUC.
- Fit a 1-class SVM to a data frame with only len and first as columns. Extract the scores and assess both the LoF scores and the SVM scores using AUC.
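The first two steps above can be sketched on a tiny made-up data frame. The frame toy and its three short sequences are hypothetical stand-ins for proteins, used only to show how len and LabelEncoder() behave on strings; this is an illustration under those assumptions, not the exercise's reference solution.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the `proteins` data frame
toy = pd.DataFrame({
    "label": ["IMMUNE SYSTEM", "IMMUNE SYSTEM", "VIRUS"],
    "seq": ["MKV", "QVQLQ", "MA"],
})

# Feature 1: length of each sequence string
toy["len"] = toy["seq"].apply(len)

# Feature 2: first character of each sequence, mapped to an integer code.
# LabelEncoder assigns codes in sorted order of the distinct letters.
first_letters = toy["seq"].apply(lambda s: list(s)[0])
toy["first"] = LabelEncoder().fit_transform(first_letters)

print(toy[["seq", "len", "first"]])
```

Note that the integer codes carry no real ordering between letters; they are simply a compact way to hand a categorical feature to a model that expects numbers.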
Hands-on interactive exercise
Try this exercise by completing the sample code.
# Create a feature that contains the length of the string
proteins['len'] = proteins['seq'].apply(____)
# Create a feature encoding the first letter of the string
proteins['first'] = ____.____(
    proteins['seq'].apply(____))
# Extract scores from the fitted LoF object, compute its AUC
scores_lof = lof_detector.____
print(____(proteins['label']==____, scores_lof))
# Fit a 1-class SVM, extract its scores, and compute its AUC
svm = ____.____(proteins[['len', 'first']])
scores_svm = svm.____(proteins[['len', 'first']])
print(____(proteins['label']==____, scores_svm))
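As a rough illustration of the scoring and comparison steps, the sketch below runs both detectors on synthetic numeric data rather than on proteins. The array X, the mask is_inlier, and the random seed are assumptions made up for this example. In scikit-learn, both negative_outlier_factor_ and score_samples() give higher values to more normal points, so the AUC is computed against a mask that is True for the normal class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score as auc
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical 2-D features: 95 inliers near the origin, 5 outliers far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(95, 2))
outliers = rng.normal(loc=8.0, scale=1.0, size=(5, 2))
X = np.vstack([inliers, outliers])
is_inlier = np.r_[np.ones(95, dtype=bool), np.zeros(5, dtype=bool)]

# LoF: scores for the training data live in a fitted attribute
lof = LocalOutlierFactor().fit(X)
scores_lof = lof.negative_outlier_factor_

# One-class SVM: scores come from a method call on the fitted model
svm = OneClassSVM().fit(X)
scores_svm = svm.score_samples(X)

# Higher score = more normal, so the positive class is "inlier"
auc_lof = auc(is_inlier, scores_lof)
auc_svm = auc(is_inlier, scores_svm)
print(auc_lof, auc_svm)
```

Because AUC depends only on how the scores rank the points, it allows the two detectors to be compared even though their raw scores are on different scales.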