Bringing it all together

In addition to the distance-based anomaly detection pipeline you created in the last exercise, you also want to support a feature-based one using a one-class SVM. You decide to extract two features: first, the length of the string, and second, a numerical encoding of the first letter of the string, obtained using LabelEncoder(), described in Chapter 1. To ensure a fair comparison, you will feed the outlier scores of both detectors into an AUC calculation. The following have been imported: LabelEncoder(), roc_auc_score() as auc(), and OneClassSVM. The data is available as a pandas data frame called proteins with two columns, label and seq, and two classes, IMMUNE SYSTEM and VIRUS. A fitted LoF detector is available as lof_detector.
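
To make the two features concrete, here is a minimal sketch on a hypothetical four-row frame called toy (the sequences and labels are invented for illustration and are not the course's proteins data). It assumes only that pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the real `proteins` frame is provided by the course.
toy = pd.DataFrame({
    "label": ["VIRUS", "VIRUS", "IMMUNE SYSTEM", "VIRUS"],
    "seq": ["MKT", "AVLLG", "MQ", "KVAAGT"],
})

# Feature 1: length of each sequence string.
toy["len"] = toy["seq"].apply(len)

# Feature 2: integer code for the first character of each sequence.
# LabelEncoder assigns codes following the sorted order of the distinct values.
first_letters = toy["seq"].apply(lambda s: list(s)[0])
toy["first"] = LabelEncoder().fit_transform(first_letters)

print(toy[["seq", "len", "first"]])
```

Note that the integer codes impose an arbitrary ordering on the letters, so they are a convenience for this exercise rather than a meaningful numeric scale.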

This exercise is part of the course

Designing Machine Learning Workflows in Python

Instructions

  • For a string s, len(s) returns its length. Apply that to the seq column to obtain a new column len.
  • For a string s, list(s) returns a list of its characters. Use this to extract the first letter of each sequence, and encode it using LabelEncoder().
  • LoF scores are in the negative_outlier_factor_ attribute. Compute their AUC.
  • Fit a one-class SVM to a data frame with only len and first as columns. Extract its scores, then assess both the LoF scores and the SVM scores using AUC.
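
The scoring steps above can be sketched on synthetic data. This is an illustration under stated assumptions, not the course solution: the points in X are randomly generated, the LoF detector is fitted here from scratch rather than taken from lof_detector, and treating the majority cluster as the positive class is a choice made for this example. Both detectors give higher scores to more "normal" points, so the same label vector can be passed to roc_auc_score() for each.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Hypothetical data: 95 inliers around the origin and 5 outliers far away.
X = np.vstack([
    rng.normal(0, 1, size=(95, 2)),
    rng.normal(8, 1, size=(5, 2)),
])
is_inlier = np.array([True] * 95 + [False] * 5)

# LoF: scores for the training points are stored on the fitted object.
lof = LocalOutlierFactor().fit(X)
scores_lof = lof.negative_outlier_factor_

# One-class SVM: fit on the features, then score the same points.
svm = OneClassSVM().fit(X)
scores_svm = svm.score_samples(X)

# Higher scores mean "more normal", so inliers are the positive class here.
auc_lof = roc_auc_score(is_inlier, scores_lof)
auc_svm = roc_auc_score(is_inlier, scores_svm)
print(auc_lof, auc_svm)
```

Because AUC depends only on how the scores rank the points, it allows the two detectors to be compared even though their scores are on different scales.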

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Create a feature that contains the length of the string
proteins['len'] = proteins['seq'].apply(____)

# Create a feature encoding the first letter of the string
proteins['first'] = ____.____(
  proteins['seq'].apply(____))

# Extract scores from the fitted LoF object, compute its AUC
scores_lof = lof_detector.____
print(____(proteins['label']==____, scores_lof))

# Fit a 1-class SVM, extract its scores, and compute its AUC
svm = ____.____(proteins[['len', 'first']])
scores_svm = svm.____(proteins[['len', 'first']])
print(____(proteins['label']==____, scores_svm))