Bringing it all together
You have two concerns about your pipeline at the arrhythmia detection startup:
- The app was trained on patients of all ages, but it is primarily used by fitness users, who tend to be young. You suspect this might be a case of domain shift, and hence want to keep only examples from patients under 50 years old.
- You are still concerned about overfitting, so you want to see if making the random forest classifier less complex and selecting some features might help with that.
You will create a pipeline with a feature selection SelectKBest() step and a RandomForestClassifier, both of which have been imported. You also have access to GridSearchCV(), Pipeline, numpy as np and pickle. The data is available as arrh.
This exercise is part of the course
Designing Machine Learning Workflows in Python
Exercise instructions
- Create a pipeline with SelectKBest() as step ft and RandomForestClassifier() as step clf.
- Create a parameter grid to tune k in SelectKBest() and max_depth in RandomForestClassifier().
- Use GridSearchCV() to optimize your pipeline against that grid and data containing only those aged under 50.
- Save the optimized pipeline to a pickle for production.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a pipeline
pipe = Pipeline([
    ('ft', ____), ('clf', ____(random_state=2))])
# Create a parameter grid
grid = {'ft__k':[5, 10], '____':[10, 20]}
# Execute grid search CV on a dataset containing under 50s
grid_search = ____(pipe, param_grid=grid)
arrh = arrh.____[____(arrh['age'] < 50)]
____.____(arrh.drop('class', axis=1), arrh['class'])
# Push the fitted pipeline to production
with ____('pipe.pkl', ____) as file:
    pickle.dump(____, file)
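One possible way to fill in the blanks above is sketched below. It is not the official solution. Because the real arrh data is not available here, the sketch builds a small synthetic stand-in called arrh_demo (a hypothetical name, with made-up feature columns f0 to f11), and it writes the pickle to a temporary directory rather than to pipe.pkl. Pickling the whole fitted grid search object is an assumption; pickling grid_search.best_estimator_ instead would also be reasonable.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for `arrh`: 12 numeric features, an age column
# and a binary class label. The real dataset will look different.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(200, 12)),
                        columns=[f"f{i}" for i in range(12)])
arrh_demo = features.assign(
    age=rng.integers(18, 80, size=200),
    **{"class": (features["f0"] + features["f1"] > 0).astype(int)},
)

# Create a pipeline: feature selection step 'ft', classifier step 'clf'
pipe = Pipeline([
    ("ft", SelectKBest()),
    ("clf", RandomForestClassifier(random_state=2)),
])

# Grid keys follow the '<step name>__<parameter name>' convention
grid = {"ft__k": [5, 10], "clf__max_depth": [10, 20]}

# Keep only rows with age < 50; np.where returns a tuple, so take element 0
under_50 = arrh_demo.iloc[np.where(arrh_demo["age"] < 50)[0]]
X = under_50.drop("class", axis=1)
y = under_50["class"]

# Execute grid search CV (cv=3 is an arbitrary choice for this small demo)
grid_search = GridSearchCV(pipe, param_grid=grid, cv=3)
grid_search.fit(X, y)

# Save the fitted search object, then load it back to check it still predicts
pkl_path = os.path.join(tempfile.mkdtemp(), "pipe_demo.pkl")
with open(pkl_path, "wb") as file:
    pickle.dump(grid_search, file)
with open(pkl_path, "rb") as file:
    restored = pickle.load(file)
preds = restored.predict(X)
```

After fitting, grid_search.best_params_ holds the selected values of ft__k and clf__max_depth. Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.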