
Bringing it all together

You just joined an arrhythmia detection startup and want to train a model on the arrhythmias dataset arrh. You noticed that random forests tend to win quite a few Kaggle competitions, so you want to try one out with a maximum depth of 2, 5, or 10, using grid search. You also observe that the dimensionality of the dataset is quite high, so you wish to consider the effect of a feature selection method.

To make sure you don't overfit by mistake, you have already split your data. You will use X_train and y_train for the grid search, and X_test and y_test to decide if feature selection helps. All four dataset folds are preloaded in your environment. You also have access to GridSearchCV(), train_test_split(), SelectKBest(), chi2() and RandomForestClassifier as rfc.

This exercise is part of the course

Designing Machine Learning Workflows in Python


Exercise instructions

  • Use grid search to experiment with a maximum depth of 2, 5, and 10 for RandomForestClassifier and store the best performing parameter setting.
  • Now refit the estimator using the best-performing value of max_depth as deduced above.
  • Apply the SelectKBest feature selector with the chi2 scoring function and refit the classifier.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(
  ____(random_state=1), param_grid=____)
best_value = grid_search.____(
  ____, ____).best_params_['max_depth']

# Using the best value from above, fit a random forest
clf = rfc(
  random_state=1, ____=best_value).____(X_train, y_train)

# Apply SelectKBest with chi2 and pick top 100 features
vt = SelectKBest(____, k=____).____(X_train, y_train)

# Create a new dataset only containing the selected features
X_train_reduced = ____.transform(____)
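One possible completion of the sample code, sketched on synthetic data since the real arrh dataset and the preloaded folds are only available inside the course environment. The stand-in data is generated with make_classification and made non-negative because chi2 only accepts non-negative feature values; everything else mirrors the blanks above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier as rfc

# Synthetic stand-in for the arrhythmia data: 300 samples, 200 features.
X, y = make_classification(n_samples=300, n_features=200, random_state=1)
X = np.abs(X)  # chi2 requires non-negative feature values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(
    rfc(random_state=1), param_grid={'max_depth': [2, 5, 10]})
best_value = grid_search.fit(
    X_train, y_train).best_params_['max_depth']

# Using the best value from above, fit a random forest
clf = rfc(
    random_state=1, max_depth=best_value).fit(X_train, y_train)

# Apply SelectKBest with chi2 and pick top 100 features
vt = SelectKBest(chi2, k=100).fit(X_train, y_train)

# Create a new dataset only containing the selected features
X_train_reduced = vt.transform(X_train)
```

With the default 75/25 split, X_train_reduced has shape (225, 100). To judge whether feature selection helps, you would compare clf.score(X_test, y_test) against a classifier refit on X_train_reduced and scored on vt.transform(X_test).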