Bringing it all together
You just joined an arrhythmia detection startup and want to train a model on the arrhythmias dataset arrh. You noticed that random forests tend to win quite a few Kaggle competitions, so you want to try that out with a maximum depth of 2, 5, or 10, using grid search. You also observe that the dimension of the dataset is quite high, so you wish to consider the effect of a feature selection method.

To make sure you don't overfit by mistake, you have already split your data. You will use X_train and y_train for the grid search, and X_test and y_test to decide if feature selection helps. All four dataset folds are preloaded in your environment. You also have access to GridSearchCV(), train_test_split(), SelectKBest(), chi2() and RandomForestClassifier as rfc.
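Before filling in the template below, it may help to recall the shape of the object GridSearchCV expects for its param_grid argument: a dict mapping each hyperparameter name to the list of candidate values. A minimal sketch matching this exercise's search space:

# Candidate depths for the random forest, in the dict form GridSearchCV expects
param_grid = {'max_depth': [2, 5, 10]}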
Exercise instructions
- Use grid search to experiment with a maximum depth of 2, 5, and 10 for RandomForestClassifier and store the best-performing parameter setting.
- Now refit the estimator using the best-performing value of max_depth found above.
- Apply the SelectKBest feature selector with the chi2 scoring function and refit the classifier.
Have a go at this exercise by completing this sample code.
# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(____(random_state=1), param_grid=____)
best_value = grid_search.____(____, ____).best_params_['max_depth']

# Using the best value from above, fit a random forest
clf = rfc(random_state=1, ____=best_value).____(X_train, y_train)

# Apply SelectKBest with chi2 and pick top 100 features
vt = SelectKBest(____, k=____).____(X_train, y_train)

# Create a new dataset only containing the selected features
X_train_reduced = ____.transform(____)
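For reference, here is one way the completed template could look. This is a sketch under the exercise's stated assumptions only: X_train and y_train are preloaded, rfc aliases RandomForestClassifier, and k=100 follows the comment in the template.

# One possible completion of the template above
# Grid search over max_depth values 2, 5 and 10
grid_search = GridSearchCV(rfc(random_state=1),
                           param_grid={'max_depth': [2, 5, 10]})
best_value = grid_search.fit(X_train, y_train).best_params_['max_depth']

# Refit a random forest using the best max_depth found above
clf = rfc(random_state=1, max_depth=best_value).fit(X_train, y_train)

# Keep the 100 features that score highest under the chi2 test
vt = SelectKBest(chi2, k=100).fit(X_train, y_train)

# Build a reduced training set containing only the selected features
X_train_reduced = vt.transform(X_train)

One caveat worth knowing: scikit-learn's chi2 scorer only accepts non-negative feature values, so if arrh contained negative entries the features would need rescaling (for instance with MinMaxScaler) before SelectKBest could be fitted.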