Default thresholding

You would like to confirm that the DecisionTreeClassifier() uses the same default classification threshold as mentioned in the previous lesson, namely 0.5. It seems strange to you that all classifiers should use the same threshold. Let's check! A fitted decision tree classifier clf has been preloaded for you, as have the training and test data with their usual names: X_train, X_test, y_train and y_test. You will have to extract probability scores from the classifier using the .predict_proba() method.

Cet exercice fait partie du cours

Designing Machine Learning Workflows in Python

Afficher le cours

Instructions

Produce scores for the test examples, using the preloaded classifier clf.
Now extract labels from the scores. Remember that you have a pair of scores for each example, not a single score, and the second element is the probability of the positive class.
Now label the test data using the standard .predict() method
Finally, compare with the predictions you got before. Are they identical?

Exercice interactif pratique

Essayez cet exercice en complétant cet exemple de code.

# Score the test data using the given classifier
scores = clf.____(____)

# Get labels from the scores using the default threshold
preds = [s[____] > ____ for s in scores]

# Use the predict method to label the test data again
preds_default = clf.____(____)

# Compare the two sets of predictions
____(preds == preds_default)

Modifier et exécuter le code

Designing Machine Learning Workflows in Python

AvancéNiveau de compétence

4.8+

65 reviews

In the previous chapters you established a solid foundation in supervised learning, complete with knowledge of deploying models in production but always assumed you a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data without any, or with very few, labels. This takes you into a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will clearly stand out from the crowd of data scientists in confidently knowing what tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection Exercise 2: A simple outlier Exercise 3: LoF contamination Exercise 4: Novelty detection Exercise 5: A simple novelty Exercise 6: Three novelty detectors Exercise 7: Contamination revisited Exercise 8: Distance-based learning Exercise 9: Find the neighbor Exercise 10: Not all metrics agree Exercise 11: Unstructured data Exercise 12: Restricted Levenshtein Exercise 13: Bringing it all together Exercise 14: Concluding remarks