Bringing it all together

You have two concerns about your pipeline at the arrhythmia detection startup:

The app was trained on patients of all ages, but it is primarily used by fitness users, who tend to be young. You suspect this might be a case of domain shift, and hence want to disregard all examples from patients aged 50 or older.
You are still concerned about overfitting, so you want to see whether making the random forest classifier less complex and selecting a subset of the features might help with that.
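
As a rough illustration of the first concern, the sketch below restricts a small made-up DataFrame to rows with age under 50. Only the column names `age` and `class` come from the exercise; the `toy` frame, the `heart_rate` column, and all values are hypothetical stand-ins for `arrh`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for arrh: only the 'age' and 'class' column
# names come from the exercise; every value here is made up.
toy = pd.DataFrame({
    "age": [23, 35, 49, 50, 67, 72],
    "heart_rate": [61, 70, 66, 80, 75, 90],
    "class": [0, 0, 1, 0, 1, 1],
})

# np.where on a boolean Series returns a tuple of position arrays;
# take the first element and select those rows by position with iloc.
under_50_idx = np.where(toy["age"] < 50)[0]
under_50 = toy.iloc[under_50_idx]

# An equivalent and more common idiom is a plain boolean mask.
under_50_mask = toy[toy["age"] < 50]
```

Both selections keep the same three rows. Note that the strict `<` comparison also drops anyone aged exactly 50.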

You will create a pipeline with a feature selection SelectKBest() step and a RandomForestClassifier, both of which have been imported. You also have access to GridSearchCV(), Pipeline, numpy as np and pickle. The data is available as arrh.

This exercise is part of the course

Designing Machine Learning Workflows in Python


Exercise instructions

Create a pipeline with SelectKBest() as step ft and RandomForestClassifier() as step clf.
Create a parameter grid to tune k in SelectKBest() and max_depth in RandomForestClassifier().
Use GridSearchCV() to optimize your pipeline against that grid and data containing only those aged under 50.
Save the optimized pipeline to a pickle for production.
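
The first three steps above could look roughly like the following sketch. It runs on synthetic data rather than `arrh`, so the feature names, the generated `age` column, `n_estimators=20` and `cv=3` are all assumptions made to keep the example small and self-contained. The step names `ft` and `clf`, the grid values, and `random_state=2` follow the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for arrh (assumption: the real data has an 'age'
# column and a 'class' label column, as the exercise text implies).
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)
data = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])
data["class"] = y
rng = np.random.default_rng(0)
data["age"] = rng.integers(18, 90, size=len(data))

# Step 'ft' selects the k best features; step 'clf' is the classifier.
pipe = Pipeline([
    ("ft", SelectKBest()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=2)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
grid = {"ft__k": [5, 10], "clf__max_depth": [10, 20]}

# Keep only the under-50 rows, then search the grid on that subset.
young = data[data["age"] < 50]
grid_search = GridSearchCV(pipe, param_grid=grid, cv=3)
grid_search.fit(young.drop(columns="class"), young["class"])

# With the default refit=True, the best pipeline is refit on all of 'young'.
best_pipe = grid_search.best_estimator_
```

Because feature selection sits inside the pipeline, it is refit on each training fold during cross-validation, which avoids leaking information from the validation fold into the choice of features.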

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a pipeline 
pipe = Pipeline([
  ('ft', ____), ('clf', ____(random_state=2))])

# Create a parameter grid
grid = {'ft__k':[5, 10], '____':[10, 20]}

# Execute grid search CV on a dataset containing under 50s
grid_search = ____(pipe, param_grid=grid)
arrh = arrh.____[____(arrh['age'] < 50)]
____.____(arrh.drop('class', axis=1), arrh['class'])

# Push the fitted pipeline to production
with ____('pipe.pkl', ____) as file:
    pickle.dump(____, file)
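
The final step, saving a fitted pipeline, can be sketched on its own as a pickle round trip. This is a minimal sketch on synthetic data: the temporary directory, the fixed `k=5` and `max_depth=10`, and `n_estimators=10` are assumptions for illustration, not the tuned values from the exercise.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

# Fit a small pipeline on synthetic data (stand-in for the tuned one).
X, y = make_classification(n_samples=100, n_features=12, random_state=0)
pipe = Pipeline([
    ("ft", SelectKBest(k=5)),
    ("clf", RandomForestClassifier(n_estimators=10, max_depth=10,
                                   random_state=2)),
]).fit(X, y)

# pickle writes bytes, so the file must be opened in binary mode ('wb').
path = os.path.join(tempfile.mkdtemp(), "pipe.pkl")
with open(path, "wb") as file:
    pickle.dump(pipe, file)

# Reload in binary read mode ('rb') and check the predictions match.
with open(path, "rb") as file:
    restored = pickle.load(file)

same_predictions = bool((restored.predict(X) == pipe.predict(X)).all())
```

A fitted `GridSearchCV` object can be pickled the same way. With the default `refit=True` it exposes `predict()` through its best estimator, so either the search object or its `best_estimator_` could be shipped.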


In the previous chapters you established a solid foundation in supervised learning, complete with knowledge of deploying models in production, but you always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data with very few labels, or none at all. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will stand out from the crowd of data scientists by confidently knowing which tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks