Feature transformations

You are discussing the credit dataset with the bank manager. She suspects that the safest loan applications tend to request medium-sized credit amounts, while values that are either very low or very high point to high risk. This suggests that the relationship between this variable and the class may be nonlinear. You want to test this hypothesis. To do so, you construct a nonlinear transformation of the feature. You then assess which of the two features predicts the class better, using SelectKBest() and the chi2() metric, both of which have been preloaded.

The data is available as a pandas DataFrame named credit, with the class stored in the column class. In addition, pandas is preloaded as pd and numpy as np.
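
To see why a deviation-from-the-mean feature can capture the manager's hypothesis, here is a minimal sketch on synthetic data. The amounts, the 3,000 threshold, and the labelling rule are assumptions invented for illustration; none of them come from the credit dataset. The sketch compares how strongly the raw amount and its absolute deviation from the mean correlate with a binary risk label.

```python
import numpy as np

# Synthetic stand-in for credit amounts (assumption, not the course data)
rng = np.random.default_rng(0)
amount = rng.uniform(0, 10_000, size=1_000)

# Hypothetical labelling rule: an application is risky (1) when its amount
# lies far from the average, in either direction
risky = (np.abs(amount - amount.mean()) > 3_000).astype(int)

def abs_diff(x):
    """Absolute deviation of each value from the mean of the vector."""
    return np.abs(x - np.mean(x))

# Pearson correlation of each candidate feature with the risk label
corr_raw = np.corrcoef(amount, risky)[0, 1]
corr_diff = np.corrcoef(abs_diff(amount), risky)[0, 1]

print(f"raw amount vs. risk:         {corr_raw:.3f}")
print(f"absolute deviation vs. risk: {corr_diff:.3f}")
```

Because the risky cases sit on both sides of the mean, their effects on the raw amount largely cancel out in a linear measure, whereas the transformed feature increases with risk in both directions.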

This exercise is part of the course

<Kurs>Designing Machine Learning Workflows in Python</Kurs>

Exercise instructions

Define a function that transforms a numeric vector by computing the absolute deviation of each value from the mean of the vector.
Apply this transformation to the credit_amount column of the dataset and store the result in a new column named diff.
Create a SelectKBest() feature selector that uses the chi2() metric to pick one of the two columns credit_amount and diff.
Inspect the results.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Function computing absolute difference from column mean
def abs_diff(x):
    return ____(x-____)

# Apply it to the credit amount and store to new column
credit['diff'] = ____

# Create a feature selector with chi2 that picks one feature
sk = ____(chi2, ____)

# Use the selector to pick between credit_amount and diff
sk.fit(____, credit['class'])

# Inspect the results
sk.____()
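
One possible completion of the template above is sketched here. Since the course dataset is not reproduced on this page, the code builds a synthetic stand-in for credit; the amount range, the 5,000 threshold, and the "good"/"bad" labels are assumptions for illustration only. The blanks are filled with np.abs, np.mean, SelectKBest(chi2, k=1), and get_support(), which is one reasonable reading of the instructions rather than the confirmed reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the preloaded `credit` DataFrame (assumption)
rng = np.random.default_rng(42)
amount = rng.uniform(250, 18_000, size=1_000)
credit = pd.DataFrame({"credit_amount": amount})
# Hypothetical labels: amounts far from the average are marked "bad"
credit["class"] = np.where(
    np.abs(amount - amount.mean()) > 5_000, "bad", "good"
)

# Function computing absolute difference from column mean
def abs_diff(x):
    return np.abs(x - np.mean(x))

# Apply it to the credit amount and store to new column
credit["diff"] = abs_diff(credit["credit_amount"])

# Create a feature selector with chi2 that picks one feature
sk = SelectKBest(chi2, k=1)

# Use the selector to pick between credit_amount and diff
sk.fit(credit[["credit_amount", "diff"]], credit["class"])

# Inspect the results: boolean mask over the two input columns,
# plus the chi2 score of each column
print(sk.get_support())
print(sk.scores_)
```

get_support() returns a boolean mask in the order the columns were passed to fit(), so the second entry refers to diff. On this synthetic data, which was constructed to follow the manager's hypothesis, diff should receive the higher score; the real credit data may behave differently. Note that chi2 in scikit-learn requires non-negative feature values, a condition both columns satisfy here.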

In this chapter, you will be reminded of the basics of a supervised learning workflow, complete with model fitting, tuning and selection, feature engineering and selection, and data splitting techniques. You will understand how these steps in a workflow depend on each other, and recognize how they can all contribute to, or fight against overfitting: the data scientist's worst enemy. By the end of the chapter, you will already be fluent in supervised learning, and ready to take the dive towards more advanced material in later chapters.

Exercise 1: Supervised learning pipelines Exercise 2: Feature engineering Exercise 3: Your first pipeline Exercise 4: Model complexity and overfitting Exercise 5: Grid search CV for model complexity Exercise 6: Number of trees and estimators Exercise 7: Feature engineering and overfitting Exercise 8: Categorical encodings Exercise 9: Feature transformations (current exercise) Exercise 10: Bringing it all together

In the previous chapter, you perfected your knowledge of the standard supervised learning workflows. In this chapter, you will critically examine the ways in which expert knowledge is incorporated in supervised learning. This is done through the identification of the appropriate unit of analysis which might require feature engineering across multiple data sources, through the sometimes imperfect process of labeling examples, and through the specification of a loss function that captures the true business value of errors made by your machine learning model.

Exercise 1: Data fusion Exercise 2: Is the source or the destination bad? Exercise 3: Feature engineering on grouped data Exercise 4: Imperfect labels Exercise 5: Turning a heuristic into a classifier Exercise 6: Combining heuristics Exercise 7: Dealing with label noise Exercise 8: Loss functions Part I Exercise 9: Reminder of performance metrics Exercise 10: Real-world cost analysis Exercise 11: Confusion matrix calculations Exercise 12: Loss functions Part II Exercise 13: Default thresholding Exercise 14: Optimizing the threshold Exercise 15: Bringing it all together

In the previous chapter, you employed different ways of incorporating feedback from experts in your workflow, and evaluating it in ways that are aligned with business value. Now it is time for you to practice the skills needed to productize your model and ensure it continues to perform well thereafter by iteratively improving it. You will also learn to diagnose dataset shift and mitigate the effect that a changing environment can have on your model's accuracy.

Exercise 1: From workflows to pipelines Exercise 2: Your first pipeline - again! Exercise 3: Custom scorers in pipelines Exercise 4: Model deployment Exercise 5: Pickles Exercise 6: Custom function transformers in pipelines Exercise 7: Iterating without overfitting Exercise 8: Challenge the champion Exercise 9: Cross-validation statistics Exercise 10: Dataset shift Exercise 11: Tuning the window size Exercise 12: Bringing it all together

In the previous chapters, you established a solid foundation in supervised learning, complete with knowledge of deploying models in production, but always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data without any, or with very few, labels. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will clearly stand out from the crowd of data scientists in confidently knowing what tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection Exercise 2: A simple outlier Exercise 3: LoF contamination Exercise 4: Novelty detection Exercise 5: A simple novelty Exercise 6: Three novelty detectors Exercise 7: Contamination revisited Exercise 8: Distance-based learning Exercise 9: Find the neighbor Exercise 10: Not all metrics agree Exercise 11: Unstructured data Exercise 12: Restricted Levenshtein Exercise 13: Bringing it all together Exercise 14: Concluding remarks