Bringing it all together
You have two concerns about your pipeline at the arrhythmia detection startup:
- The app was trained on patients of all ages, but it is primarily used by fitness users, who tend to be young. You suspect this might be a case of domain shift, and hence want to discard all examples from patients aged 50 or older.
- You are still concerned about overfitting, so you want to see if making the random forest classifier less complex and selecting some features might help with that.
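The age filter in the first concern can be sketched on a toy frame. This is only an illustration: the `toy` DataFrame and its values are made up here and stand in for the real `arrh` data, which is assumed to have an `age` column. The sketch shows two equivalent ways to keep the rows strictly under 50.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for arrh: a tiny frame with an age column.
toy = pd.DataFrame({
    "age": [23, 67, 45, 50, 31],
    "class": [0, 1, 0, 1, 0],
})

# Option 1: boolean mask. Rows with age exactly 50 are dropped as well.
under_50_mask = toy[toy["age"] < 50]

# Option 2: positional selection. np.where returns a tuple of index arrays,
# so the first element holds the row positions that satisfy the condition.
under_50_where = toy.iloc[np.where(toy["age"] < 50)[0]]

# Both selections contain the same rows in the same order.
same_rows = under_50_mask.equals(under_50_where)
```

Either form works for this exercise; the boolean mask is usually the shorter one to read.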
You will create a pipeline with a feature selection SelectKBest() step and a RandomForestClassifier, both of which have been imported. You also have access to GridSearchCV(), Pipeline, numpy as np and pickle. The data is available as arrh.
This exercise is part of the course
Designing Machine Learning Workflows in Python
Instructions
- Create a pipeline with SelectKBest() as step ft and RandomForestClassifier() as step clf.
- Create a parameter grid to tune k in SelectKBest() and max_depth in RandomForestClassifier().
- Use GridSearchCV() to optimize your pipeline against that grid and data containing only those aged under 50.
- Save the optimized pipeline to a pickle for production.
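The way a parameter grid addresses steps inside a pipeline can be sketched as follows. The data here is synthetic (generated with `make_classification`) and only stands in for the arrhythmia features, and `n_estimators=20` and `cv=3` are arbitrary choices to keep the sketch fast; none of these come from the exercise itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data standing in for the real features and labels (an assumption).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Each step gets a name; the name is what the grid keys refer to.
pipe = Pipeline([
    ("ft", SelectKBest()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=2)),
])

# Grid keys follow the pattern <step name>__<parameter name>.
grid = {"ft__k": [5, 10], "clf__max_depth": [10, 20]}

# Cross-validated search over all four combinations in the grid.
grid_search = GridSearchCV(pipe, param_grid=grid, cv=3)
grid_search.fit(X, y)

best = grid_search.best_params_
```

After fitting, `best` is a dictionary with one chosen value per grid key, and `pipe.get_params()` lists every key that can be tuned this way.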
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Create a pipeline
pipe = Pipeline([
    ('ft', ____), ('clf', ____(random_state=2))])
# Create a parameter grid
grid = {'ft__k':[5, 10], '____':[10, 20]}
# Execute grid search CV on a dataset containing under 50s
grid_search = ____(pipe, param_grid=grid)
arrh = arrh.____[____(arrh['age'] < 50)]
____.____(arrh.drop('class', axis=1), arrh['class'])
# Push the fitted pipeline to production
with ____('pipe.pkl', ____) as file:
    pickle.dump(____, file)
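The save step can be checked with a pickle round trip. The sketch below is a minimal illustration under stated assumptions: the data is synthetic, the pipeline is fitted directly rather than through a grid search, and the file is written to a temporary directory instead of the working directory.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

# Synthetic data and a small fitted pipeline, both stand-ins for the exercise objects.
X, y = make_classification(
    n_samples=100, n_features=12, n_informative=4, random_state=0
)
model = Pipeline([
    ("ft", SelectKBest(k=5)),
    ("clf", RandomForestClassifier(n_estimators=10, max_depth=10, random_state=2)),
]).fit(X, y)

# Write to a temporary directory so the sketch leaves no file behind in the project.
path = os.path.join(tempfile.mkdtemp(), "pipe.pkl")

# Pickle is a binary format: open with "wb" to write and "rb" to read.
with open(path, "wb") as file:
    pickle.dump(model, file)

with open(path, "rb") as file:
    restored = pickle.load(file)

# The restored pipeline should reproduce the original predictions.
predictions_match = (model.predict(X) == restored.predict(X)).all()
```

When the pipeline comes out of a fitted GridSearchCV, the refitted best pipeline is exposed as its `best_estimator_` attribute, which is typically the object worth persisting.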