Your first pipeline - again!
Back at the arrhythmia startup, your monthly review is coming up, and as part of it an expert Python programmer will review your code. You decide to tidy up by following best practices and replacing your script for feature selection and random forest classification with a pipeline. You are using a training dataset available as X_train and y_train, and a number of modules: RandomForestClassifier, SelectKBest() and f_classif() for feature selection, as well as GridSearchCV and Pipeline.
This exercise is part of the course
Designing Machine Learning Workflows in Python
Exercise instructions
- Create a pipeline with the feature selector given by the sample code, and a random forest classifier. Name the first step feature_selection.
- Add two key-value pairs in params, one for the number of features k in the selector with values 10 and 20, and one for n_estimators in the forest with possible values 2 and 5.
- Initialize a GridSearchCV object with the given pipeline and parameter grid.
- Fit the object to the data and print the best-performing parameter combination.
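The parameter grid keys follow scikit-learn's convention of joining a pipeline step name and a parameter name with a double underscore. The minimal sketch below lists the matching names for a small pipeline. The step names selector and forest and the variable demo_pipe are hypothetical, chosen only for this illustration, and are not the names the exercise asks for.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# Hypothetical step names, used only for this illustration
demo_pipe = Pipeline([
    ("selector", SelectKBest(f_classif)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# get_params() exposes every tunable setting as <step name>__<parameter name>
tunable = sorted(demo_pipe.get_params().keys())

# Show only the two kinds of parameter this exercise tunes
print([name for name in tunable
       if name.endswith("__k") or name.endswith("__n_estimators")])
```

Any key printed by get_params() can be used as a key in the grid passed to GridSearchCV.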
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create pipeline with feature selector and classifier
pipe = ___([
    (___, SelectKBest(f_classif)),
    ('clf', ___(random_state=2))])
# Create a parameter grid
params = {
    'feature_selection__k': ___,
    ___: [2, 5]}
# Initialize the grid search object
grid_search = ___(___, ___=params)
# Fit it to the data and print the best value combination
print(grid_search.fit(___, ___).___)
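One possible way to fill in the blanks is sketched below. It is not the course's official solution. Because the course's X_train and y_train are not available here, the sketch assumes a synthetic stand-in built with make_classification, so the parameter combination it prints says nothing about the arrhythmia data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Assumption: synthetic stand-in for the course's X_train / y_train.
# 30 features so that both k=10 and k=20 are valid choices.
X_train, y_train = make_classification(
    n_samples=200, n_features=30, n_informative=8, random_state=0)

# Create pipeline with feature selector and classifier
pipe = Pipeline([
    ('feature_selection', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier(random_state=2))])

# Create a parameter grid; keys are <step name>__<parameter name>
params = {
    'feature_selection__k': [10, 20],
    'clf__n_estimators': [2, 5]}

# Initialize the grid search object (default cross-validation settings)
grid_search = GridSearchCV(pipe, param_grid=params)

# Fit it to the data and print the best value combination
best = grid_search.fit(X_train, y_train).best_params_
print(best)
```

Wrapping the selector and the classifier in one pipeline means the feature selection is refitted inside each cross-validation fold, so the held-out fold does not influence which features are kept.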