Contamination revisited

You notice that one-class SVM does not have a contamination parameter. But you know well by now that you really need a way to control the proportion of examples that are labeled as novelties in order to control your false positive rate. So you decide to experiment with thresholding the scores. The detector has been imported as onesvm, you also have available the data as X_train, X_test, y_train, y_test, numpy as np, and confusion_matrix().

Questo esercizio fa parte del corso

Designing Machine Learning Workflows in Python

Visualizza il corso

Istruzioni dell'esercizio

Fit the 1-class SVM and score the test data.
Compute the observed proportion of outliers in the test data.
Use np.quantile() to find where to threshold the scores to achieve that proportion.
Use that threshold to label the test data. Print the confusion matrix.

Esercizio pratico interattivo

Prova a risolvere questo esercizio completando il codice di esempio.

# Fit a one-class SVM detector and score the test data
nov_det = ____(X_train)
scores = ____(X_test)

# Find the observed proportion of outliers in the test data
prop = np.____(y_test==____)

# Compute the appropriate threshold
threshold = np.____(____, ____)

# Print the confusion matrix for the thresholded scores
print(confusion_matrix(y_test, ____ > ____))

Modifica ed esegui il codice

Designing Machine Learning Workflows in Python

AvançadoNível de habilidade

4.8+

74 reviews

In the previous chapters you established a solid foundation in supervised learning, complete with knowledge of deploying models in production but always assumed you a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data without any, or with very few, labels. This takes you into a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will clearly stand out from the crowd of data scientists in confidently knowing what tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection Exercise 2: A simple outlier Exercise 3: LoF contamination Exercise 4: Novelty detection Exercise 5: A simple novelty Exercise 6: Three novelty detectors Exercise 7: Contamination revisited

Esercizio in corso

Exercise 8: Distance-based learning Exercise 9: Find the neighbor Exercise 10: Not all metrics agree Exercise 11: Unstructured data Exercise 12: Restricted Levenshtein Exercise 13: Bringing it all together Exercise 14: Concluding remarks