
Exercise

Bringing it all together

You just joined an arrhythmia detection startup and want to train a model on the arrhythmias dataset arrh. You have noticed that random forests tend to win quite a few Kaggle competitions, so you want to try one out with a maximum depth of 2, 5, or 10, using grid search. You also observe that the dimensionality of the dataset is quite high, so you wish to consider the effect of a feature selection method.

To make sure you don't overfit by mistake, you have already split your data. You will use X_train and y_train for the grid search, and X_test and y_test to decide whether feature selection helps. All four splits are preloaded in your environment. You also have access to GridSearchCV(), train_test_split(), SelectKBest(), chi2(), and RandomForestClassifier (as rfc).

Instructions

  • Use grid search to experiment with maximum depths of 2, 5, and 10 for RandomForestClassifier and store the best-performing parameter setting.
  • Refit the estimator using the best-performing maximum depth found above.
  • Apply the SelectKBest feature selector with the chi2 scoring function and refit the classifier (see the sketch after this list).
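A minimal sketch of one way to put these steps together is shown below. It assumes the splits X_train, X_test, y_train, and y_test are preloaded as described above; the random_state value and the k passed to SelectKBest are illustrative choices, not part of the exercise statement.

```python
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV

# Step 1: grid search over the maximum depth and store the best setting.
param_grid = {'max_depth': [2, 5, 10]}
grid_search = GridSearchCV(rfc(random_state=1), param_grid=param_grid)
best_depth = grid_search.fit(X_train, y_train).best_params_['max_depth']

# Step 2: refit the classifier on the training set with the best depth.
clf = rfc(random_state=1, max_depth=best_depth).fit(X_train, y_train)
print('Accuracy without feature selection:', clf.score(X_test, y_test))

# Step 3: select the strongest features with chi2 (which assumes
# non-negative feature values), refit, and compare on the test set.
# k=50 is an illustrative choice.
selector = SelectKBest(chi2, k=50).fit(X_train, y_train)
clf_fs = rfc(random_state=1, max_depth=best_depth).fit(
    selector.transform(X_train), y_train)
print('Accuracy with feature selection:',
      clf_fs.score(selector.transform(X_test), y_test))
```

Comparing the two test-set scores at the end is what lets you decide whether feature selection actually helps on this dataset, rather than judging it on the training data used for the grid search.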