Turning a heuristic into a classifier

You are surprised by how useful heuristics can be. So you decide to treat the heuristic that "too many unique ports is suspicious" as a classifier in its own right. To do so, you threshold the number of unique ports per source at the average observed among bad source computers, that is, those whose label is True. The dataset is preloaded and split into training and test sets, so the objects X_train, X_test, y_train and y_test are available in memory. Your imports include accuracy_score() and numpy as np. To clarify: you won't be fitting a scikit-learn classifier in this exercise; instead, you will define your own classification rule explicitly!

This exercise is part of the course

<cours>Designing Machine Learning Workflows in Python</cours>

Exercise instructions

Select only the bad hosts from X_train to form a new dataset, X_train_bad. Note that y_train is a Boolean array.
Compute the mean of the unique_ports column for the bad hosts and store it in avg_bad_ports.
Now consider a classifier that predicts positive for any example whose unique_ports exceeds avg_bad_ports. Save this classifier's predictions on the test data in a new variable, pred_port.
Compute the accuracy of this classifier on the test data using accuracy_score().

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Create a new dataset X_train_bad by subselecting bad hosts
X_train_bad = ____[____]

# Calculate the average of unique_ports in bad examples
avg_bad_ports = np.____(____['unique_ports'])

# Label as positive sources that use more ports than that
pred_port = ____['unique_ports'] > ____

# Print the accuracy of the heuristic
print(____(y_test, ____))
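The four steps can be sketched end to end on a small made-up dataset. This is only an illustration, not the course's reference solution. The toy values below are assumptions standing in for the preloaded X_train, X_test, y_train and y_test, which in the exercise come from the course environment.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical toy data standing in for the preloaded course objects.
X_train = pd.DataFrame({"unique_ports": [1, 2, 3, 10, 14]})
y_train = np.array([False, False, False, True, True])
X_test = pd.DataFrame({"unique_ports": [2, 13, 4, 11]})
y_test = np.array([False, True, False, True])

# Keep only the training rows labelled as bad hosts (Boolean mask).
X_train_bad = X_train[y_train]

# Mean number of unique ports among bad hosts: (10 + 14) / 2 = 12.0
avg_bad_ports = np.mean(X_train_bad["unique_ports"])

# Flag as positive every test source using more ports than the threshold.
pred_port = X_test["unique_ports"] > avg_bad_ports

# Fraction of test examples where the heuristic agrees with the label.
accuracy = accuracy_score(y_test, pred_port)
print(avg_bad_ports, accuracy)
```

In this toy data the last test host is bad but uses only 11 ports, so the rule misses it. A threshold set at the mean of the bad class will tend to miss bad hosts that fall below that mean.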


In this chapter, you will be reminded of the basics of a supervised learning workflow, complete with model fitting, tuning and selection, feature engineering and selection, and data splitting techniques. You will understand how these steps in a workflow depend on each other, and recognize how they can all contribute to, or fight against, overfitting: the data scientist's worst enemy. By the end of the chapter, you will be fluent in supervised learning and ready to dive into more advanced material in later chapters.

Exercise 1: Supervised learning pipelines Exercise 2: Feature engineering Exercise 3: Your first pipeline Exercise 4: Model complexity and overfitting Exercise 5: Grid search CV for model complexity Exercise 6: Number of trees and estimators Exercise 7: Feature engineering and overfitting Exercise 8: Categorical encodings Exercise 9: Feature transformations Exercise 10: Bringing it all together

In the previous chapter, you perfected your knowledge of the standard supervised learning workflows. In this chapter, you will critically examine the ways in which expert knowledge is incorporated in supervised learning. This is done through the identification of the appropriate unit of analysis which might require feature engineering across multiple data sources, through the sometimes imperfect process of labeling examples, and through the specification of a loss function that captures the true business value of errors made by your machine learning model.

Exercise 1: Data fusion Exercise 2: Is the source or the destination bad? Exercise 3: Feature engineering on grouped data Exercise 4: Imperfect labels Exercise 5: Turning a heuristic into a classifier (current exercise) Exercise 6: Combining heuristics Exercise 7: Dealing with label noise Exercise 8: Loss functions Part I Exercise 9: Reminder of performance metrics Exercise 10: Real-world cost analysis Exercise 11: Confusion matrix calculations Exercise 12: Loss functions Part II Exercise 13: Default thresholding Exercise 14: Optimizing the threshold Exercise 15: Bringing it all together

In the previous chapter, you employed different ways of incorporating feedback from experts in your workflow, and evaluating it in ways that are aligned with business value. Now it is time for you to practice the skills needed to productize your model and ensure it continues to perform well thereafter by iteratively improving it. You will also learn to diagnose dataset shift and mitigate the effect that a changing environment can have on your model's accuracy.

Exercise 1: From workflows to pipelines Exercise 2: Your first pipeline - again! Exercise 3: Custom scorers in pipelines Exercise 4: Model deployment Exercise 5: Pickles Exercise 6: Custom function transformers in pipelines Exercise 7: Iterating without overfitting Exercise 8: Challenge the champion Exercise 9: Cross-validation statistics Exercise 10: Dataset shift Exercise 11: Tuning the window size Exercise 12: Bringing it all together

In the previous chapters, you established a solid foundation in supervised learning, complete with knowledge of deploying models in production, but you always assumed a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data without any, or with very few, labels. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will stand out from the crowd of data scientists by confidently knowing which tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection Exercise 2: A simple outlier Exercise 3: LoF contamination Exercise 4: Novelty detection Exercise 5: A simple novelty Exercise 6: Three novelty detectors Exercise 7: Contamination revisited Exercise 8: Distance-based learning Exercise 9: Find the neighbor Exercise 10: Not all metrics agree Exercise 11: Unstructured data Exercise 12: Restricted Levenshtein Exercise 13: Bringing it all together Exercise 14: Concluding remarks