Bringing it all together

You have two concerns about your pipeline at the arrhythmia detection startup:

The app was trained on patients of all ages, but it is mainly used by fitness enthusiasts, who tend to be young. You suspect a case of domain shift, so you want to keep only the examples from patients under 50.
You are still worried about overfitting, so you want to check whether making the random forest classifier less complex and selecting a subset of features helps.
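
The age filter described in the first concern can be sketched on a toy table. The DataFrame `toy` and its `heartrate` column are hypothetical stand-ins for `arrh`; only the `age` and `class` column names come from the exercise. The sketch shows the `np.where` plus `iloc` route next to a plain boolean mask, which should select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the course's `arrh` DataFrame.
toy = pd.DataFrame({
    "age": [23, 67, 45, 51, 50, 38],
    "heartrate": [62, 75, 70, 80, 72, 66],
    "class": [0, 1, 0, 1, 1, 0],
})

# np.where on a boolean Series returns a tuple of index arrays;
# the first element holds the row positions where the condition is true.
young_positions = np.where(toy["age"] < 50)[0]
under_50 = toy.iloc[young_positions]

# Equivalent selection with a boolean mask.
under_50_mask = toy[toy["age"] < 50]

print(under_50["age"].tolist())  # [23, 45, 38]
```

Note that a strict `< 50` comparison also drops patients who are exactly 50.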

You will create a pipeline with a SelectKBest() feature selection step and a RandomForestClassifier, both of which have already been imported. You also have access to GridSearchCV(), Pipeline, numpy as np, and pickle. The data is available as arrh.

This exercise is part of the course

<cours>Designing Machine Learning Workflows in Python</cours>

Exercise instructions

Create a pipeline with SelectKBest() as step ft and RandomForestClassifier() as step clf.
Create a parameter grid to tune k in SelectKBest() and max_depth in RandomForestClassifier().
Use GridSearchCV() to optimize your pipeline against that grid, using only the data from patients under 50.
Save the optimized pipeline to a pickle file for production.

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Create a pipeline 
pipe = Pipeline([
  ('ft', ____), ('clf', ____(random_state=2))])

# Create a parameter grid
grid = {'ft__k':[5, 10], '____':[10, 20]}

# Execute grid search CV on a dataset containing under 50s
grid_search = ____(pipe, param_grid=grid)
arrh = arrh.____[____(arrh['age'] < 50)]
____.____(arrh.drop('class', axis=1), arrh['class'])

# Push the fitted pipeline to production
with ____('pipe.pkl', ____) as file:
    pickle.dump(____, file)
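
For reference, here is a minimal sketch of how the four steps might fit together end to end. It is not the course's official solution: the data is synthetic (generated with `make_classification`, with a random `age` column added), and the temporary file path, `cv=3`, and `n_estimators=20` are assumptions chosen only to keep the example small and fast. Grid keys follow scikit-learn's `<step name>__<parameter>` convention.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for `arrh` (assumption): 20 features, age, and a label.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)
data = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])
data["age"] = rng.integers(18, 80, size=len(data))
data["class"] = y

# Feature selection followed by a random forest; n_estimators is
# reduced here only to keep the sketch quick to run.
pipe = Pipeline([
    ("ft", SelectKBest()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=2)),
])

# Keys are "<step name>__<parameter name>".
grid = {"ft__k": [5, 10], "clf__max_depth": [10, 20]}

# Fit the grid search on patients under 50 only.
under_50 = data[data["age"] < 50]
features = under_50.drop(columns="class")
grid_search = GridSearchCV(pipe, param_grid=grid, cv=3)
grid_search.fit(features, under_50["class"])

# Persist the fitted search object, then reload it as production would.
path = os.path.join(tempfile.mkdtemp(), "pipe.pkl")
with open(path, "wb") as file:
    pickle.dump(grid_search, file)
with open(path, "rb") as file:
    restored = pickle.load(file)

print(grid_search.best_params_)
```

Because `GridSearchCV` refits the best pipeline on all the data it was given by default, the reloaded object can call `predict` directly. Pickling `grid_search.best_estimator_` instead would store only the winning pipeline without the search results.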



In this chapter, you will be reminded of the basics of a supervised learning workflow, complete with model fitting, tuning and selection, feature engineering and selection, and data splitting techniques. You will understand how these steps in a workflow depend on each other, and recognize how they can all contribute to, or fight against overfitting: the data scientist's worst enemy. By the end of the chapter, you will already be fluent in supervised learning, and ready to take the dive towards more advanced material in later chapters.

Exercise 1: Supervised learning pipelines
Exercise 2: Feature engineering
Exercise 3: Your first pipeline
Exercise 4: Model complexity and overfitting
Exercise 5: Grid search CV for model complexity
Exercise 6: Number of trees and estimators
Exercise 7: Feature engineering and overfitting
Exercise 8: Categorical encodings
Exercise 9: Feature transformations
Exercise 10: Bringing it all together

In the previous chapter, you perfected your knowledge of the standard supervised learning workflows. In this chapter, you will critically examine the ways in which expert knowledge is incorporated in supervised learning. This is done through the identification of the appropriate unit of analysis which might require feature engineering across multiple data sources, through the sometimes imperfect process of labeling examples, and through the specification of a loss function that captures the true business value of errors made by your machine learning model.

Exercise 1: Data fusion
Exercise 2: Is the source or the destination bad?
Exercise 3: Feature engineering on grouped data
Exercise 4: Imperfect labels
Exercise 5: Turning a heuristic into a classifier
Exercise 6: Combining heuristics
Exercise 7: Dealing with label noise
Exercise 8: Loss functions Part I
Exercise 9: Reminder of performance metrics
Exercise 10: Real-world cost analysis
Exercise 11: Confusion matrix calculations
Exercise 12: Loss functions Part II
Exercise 13: Default thresholding
Exercise 14: Optimizing the threshold
Exercise 15: Bringing it all together

In the previous chapter, you employed different ways of incorporating feedback from experts in your workflow, and evaluating it in ways that are aligned with business value. Now it is time for you to practice the skills needed to productize your model and ensure it continues to perform well thereafter by iteratively improving it. You will also learn to diagnose dataset shift and mitigate the effect that a changing environment can have on your model's accuracy.

Exercise 1: From workflows to pipelines
Exercise 2: Your first pipeline - again!
Exercise 3: Custom scorers in pipelines
Exercise 4: Model deployment
Exercise 5: Pickles
Exercise 6: Custom function transformers in pipelines
Exercise 7: Iterating without overfitting
Exercise 8: Challenge the champion
Exercise 9: Cross-validation statistics
Exercise 10: Dataset shift
Exercise 11: Tuning the window size
Exercise 12: Bringing it all together


In the previous chapters, you established a solid foundation in supervised learning, complete with knowledge of deploying models in production, but you always assumed a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data without any, or with very few, labels. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will clearly stand out from the crowd of data scientists in confidently knowing what tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks