Designing Machine Learning Workflows in Python

Exercise

Bringing it all together

You have two concerns about your pipeline at the arrhythmia detection startup:

  • The app was trained on patients of all ages, but it is primarily used by fitness users, who tend to be young. You suspect this is a case of domain shift, and hence want to disregard all examples from patients aged 50 or over.
  • You are still concerned about overfitting, so you want to see if making the random forest classifier less complex and selecting some features might help with that.
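
The age restriction in the first concern can be sketched as a simple row filter. This is only an illustration: `arrh_demo` is a synthetic stand-in for the course's `arrh` data, and the column names `age`, `heart_rate` and `class` are assumptions, not taken from the exercise.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `arrh`; the column names are assumed for illustration.
rng = np.random.default_rng(0)
arrh_demo = pd.DataFrame({
    "age": rng.integers(18, 90, size=200),
    "heart_rate": rng.normal(70, 10, size=200),
    "class": rng.integers(0, 2, size=200),
})

# Keep only the rows for patients younger than 50.
under_50 = arrh_demo[arrh_demo["age"] < 50]

print(len(arrh_demo), "rows before filtering,", len(under_50), "rows after")
```

Filtering before fitting means that cross-validation scores are computed only on the population the app is expected to serve.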

You will create a pipeline with a SelectKBest() feature selection step followed by a RandomForestClassifier(), both of which have been imported. You also have access to GridSearchCV(), Pipeline, numpy as np and pickle. The data is available as arrh.

Instructions

  • Create a pipeline with SelectKBest() as step ft and RandomForestClassifier() as step clf.
  • Create a parameter grid to tune k in SelectKBest() and max_depth in RandomForestClassifier().
  • Use GridSearchCV() to tune your pipeline over that grid, fitting on data that contains only patients aged under 50.
  • Save the optimized pipeline to a pickle for production.
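
One way the four instructions above might fit together is sketched below. It is not the course's reference solution. The synthetic `arrh_demo` frame, the column names `age` and `class`, the grid values, `cv=3`, `n_estimators=50` and the file name `pipe.pkl` are all assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for `arrh`: 20 numeric features, an age and a binary label.
rng = np.random.default_rng(0)
n = 300
arrh_demo = pd.DataFrame(rng.normal(size=(n, 20)),
                         columns=[f"f{i}" for i in range(20)])
arrh_demo["age"] = rng.integers(18, 90, size=n)
arrh_demo["class"] = (arrh_demo["f0"] + arrh_demo["f1"] > 0).astype(int)

# Step names `ft` and `clf` follow the exercise instructions.
pipe = Pipeline([
    ("ft", SelectKBest()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
# The candidate values here are illustrative only.
grid = {"ft__k": [5, 10], "clf__max_depth": [3, 5]}

# Restrict the data to patients aged under 50 before fitting.
young = arrh_demo[arrh_demo["age"] < 50]
X, y = young.drop(columns="class"), young["class"]

grid_search = GridSearchCV(pipe, param_grid=grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)

# Persist the refitted best pipeline (hypothetical file name and location).
path = os.path.join(tempfile.gettempdir(), "pipe.pkl")
with open(path, "wb") as f:
    pickle.dump(grid_search.best_estimator_, f)
```

Here `best_estimator_` is pickled, which stores only the pipeline refitted with the best parameters. Pickling the whole `grid_search` object would also work and would keep the cross-validation results alongside it.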