
From workflows to pipelines

1. From workflows to pipelines

In the first chapter, you learned what a typical machine learning workflow looks like. A properly structured workflow is very similar to the software concept of a pipeline: a series of data processing elements processed in sequence. In this lesson you will learn to turn your workflows into solid, super-efficient pipelines.

2. Revisiting our workflow

Consider the following script. First, we import a random forest and split the data into training and test. We then use grid search cross-validation on the training set to tune the maximum depth of the random forest. Using the best depth, we select the best three features and estimate accuracy on the test data. But what if we also wanted to tune the number of estimators used by the random forest, as well as the number of features selected? Should we split the data further? Does the order in which we optimize the parameters matter?
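The script itself is not reproduced here, so the following is a minimal reconstruction of the workflow just described, using synthetic data in place of the course's dataset (variable names and candidate values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Tune the maximum depth of the random forest on the training set.
grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid={"max_depth": [5, 10, 20]}, cv=3)
grid.fit(X_train, y_train)
best_depth = grid.best_params_["max_depth"]

# Using the best depth, select the best three features and
# estimate accuracy on the test data.
selector = SelectKBest(f_classif, k=3).fit(X_train, y_train)
clf = RandomForestClassifier(max_depth=best_depth, random_state=1)
clf.fit(selector.transform(X_train), y_train)
acc = accuracy_score(y_test, clf.predict(selector.transform(X_test)))
```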

3. The power of grid search

Well, you could start by optimizing tree depth. As shown in the figure on the right, you would check three possible values for depth and find that 10 works best, while keeping the number of estimators fixed at its default value, namely 10.

4. The power of grid search

To tune the number of estimators, you can now hold depth fixed at 10 and repeat the search. This determines that 30 estimators perform best. As shown in the figure, this sequential workflow has checked only 5 out of the 9 possible combinations of values for depth and number of estimators.
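This sequential strategy can be sketched as two separate grid searches (a toy example with hypothetical candidate values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=1)

# Step 1: tune depth with the number of estimators fixed at 10.
step1 = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=1),
                     param_grid={"max_depth": [5, 10, 20]}, cv=3)
step1.fit(X, y)
best_depth = step1.best_params_["max_depth"]

# Step 2: tune the number of estimators with depth fixed at step 1's winner.
step2 = GridSearchCV(RandomForestClassifier(max_depth=best_depth, random_state=1),
                     param_grid={"n_estimators": [10, 20, 30]}, cv=3)
step2.fit(X, y)

# Step 1 covers 3 combinations, step 2 covers 3 more, but one of them
# (best depth, 10 estimators) was already checked: 5 of 9 combinations.
```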

5. The power of grid search

Instead, by simply expanding the grid dictionary to include both parameters, you can optimize across all possible combinations.
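A sketch of the joint grid, again with hypothetical candidate values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=1)

# One dictionary, two parameters: all 3 x 3 = 9 combinations are searched.
param_grid = {"max_depth": [5, 10, 20], "n_estimators": [10, 20, 30]}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
grid.fit(X, y)
```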

6. Pipelines

Grid search can jointly optimize any classifier parameters, but what about other parameters in your workflow, like the number of features, k? This is not as easy because these parameters live in different objects.

7. Pipelines

The Pipeline module solves this problem by wrapping an entire workflow in a single object. This means you can now use GridSearchCV() to optimize the hyperparameters from all the steps in your pipeline in one go!

8. Pipelines

Start by creating a pipeline object with named steps: in this case, feature selection comes first, followed by the classifier. Next, you have to create a parameter grid just like when tuning a single classifier. You will have to name the keys to match the way a pipeline object internally represents its parameters: use the name of the step in question, followed by the name of the parameter in question, separated by a double underscore. For example, classifier__max_depth refers to the parameter max_depth in the classifier step of the pipeline. The value of each key is a list of candidate values for that parameter. You can then input the pipeline object and this parameter grid to GridSearchCV() as usual.
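Putting these steps together in a sketch (the step names and candidate values below are my own choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Feature selection first, then the classifier, as named steps.
pipe = Pipeline([
    ("feature_selection", SelectKBest(f_classif)),
    ("classifier", RandomForestClassifier(random_state=1)),
])

# Keys follow the step-name__parameter-name convention.
params = {
    "feature_selection__k": [3, 5, 10],
    "classifier__max_depth": [5, 10, 20],
}

grid = GridSearchCV(pipe, params, cv=3)
grid.fit(X, y)
```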

9. Customizing your pipeline

A pipeline tuned with GridSearchCV() is an extremely flexible and far less error-prone workflow. You can also use performance metrics other than accuracy, by overriding the default value of the optional parameter "scoring" in GridSearchCV(). Let's say you want to use AUC. You first need to wrap it inside a scorer object, so that it accepts a classifier and data as its input, rather than the predicted labels. You can do this using make_scorer(). You then pass the resulting object as the scoring argument to GridSearchCV().
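A minimal sketch of this, using a bare classifier for brevity rather than a full pipeline. Note that make_scorer(roc_auc_score) as written computes AUC from the classifier's hard predictions; to score on predicted probabilities instead, you would additionally configure the scorer (the exact argument depends on your scikit-learn version).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=1)

# Wrap the metric so GridSearchCV can apply it to a fitted
# classifier and data, rather than to predicted labels.
auc_scorer = make_scorer(roc_auc_score)

grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid={"max_depth": [5, 10]},
                    scoring=auc_scorer, cv=3)
grid.fit(X, y)
```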

10. Don't overdo it

Just remember that grid search fits one classifier for every combination of values in every cross-validation fold. Checking 3 values for each of 3 hyperparameters using 10-fold cross-validation will fit the classifier 270 times!
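You can verify the count with scikit-learn's ParameterGrid helper (the parameter names below are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# 3 candidate values for each of 3 hyperparameters.
grid = {"classifier__max_depth": [5, 10, 20],
        "classifier__n_estimators": [10, 20, 30],
        "feature_selection__k": [3, 5, 10]}

n_fits = len(ParameterGrid(grid)) * 10  # 27 combinations x 10 folds
print(n_fits)  # 270
```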

11. Supercharged workflows

Time to practice supercharging your workflows with pipelines.
