Bringing it all together
You just joined an arrhythmia detection startup and want to train a model on the arrhythmias dataset arrh. You noticed that random forests tend to win quite a few Kaggle competitions, so you want to try that out with a maximum depth of 2, 5, or 10, using grid search. You also observe that the dimension of the dataset is quite high so you wish to consider the effect of a feature selection method.
To make sure you don't overfit by mistake, you have already split your data. You will use X_train and y_train for the grid search, and X_test and y_test to decide if feature selection helps. All four dataset folds are preloaded in your environment. You also have access to GridSearchCV(), train_test_split(), SelectKBest(), chi2() and RandomForestClassifier as rfc.
Diese Übung ist Teil des Kurses
Designing Machine Learning Workflows in Python
Anleitung zur Übung
- Use grid search to experiment with a maximum depth of 2, 5, and 10 for
RandomForestClassifierand store the best performing parameter setting. - Now refit the estimator using the best-performing number of estimators as deduced above.
- Apply the
SelectKBestfeature selector with thechi2scoring function and refit the classifier.
Interaktive Übung
Vervollständige den Beispielcode, um diese Übung erfolgreich abzuschließen.
# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(
____(random_state=1), param_grid=____)
best_value = grid_search.____(
____, ____).best_params_['max_depth']
# Using the best value from above, fit a random forest
clf = rfc(
random_state=1, ____=best_value).____(X_train, y_train)
# Apply SelectKBest with chi2 and pick top 100 features
vt = SelectKBest(____, k=____).____(X_train, y_train)
# Create a new dataset only containing the selected features
X_train_reduced = ____.transform(____)