Cross-validation statistics
You used grid search cross-validation to tune your random forest classifier, and now want to inspect the cross-validation results to check that you did not overfit. In particular, you would like to subtract the mean training score from the mean test score for each candidate parameter combination. The dataset is available as X_train and y_train, the pipeline as pipe, and a number of modules are pre-loaded, including pandas as pd and GridSearchCV().
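For context, here is a minimal sketch of the kind of objects this exercise assumes are already in place. The exact pipeline steps and the params grid are not shown in the exercise, so the step names and values below are illustrative assumptions rather than the course's actual setup.

# Illustrative setup (assumed, not part of the exercise): a random forest
# pipeline and a small hyperparameter grid for GridSearchCV to search over.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0))
])

# Parameter names are prefixed with the pipeline step name ('clf__').
params = {'clf__n_estimators': [50, 100], 'clf__max_depth': [5, 10]}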
This exercise is part of the course Designing Machine Learning Workflows in Python.
Exercise instructions
- Create a grid search object with three cross-validation folds and ensure it returns training as well as test statistics.
- Fit the grid search object to the training data.
- Store the results of the cross-validation, available in the cv_results_ attribute of the fitted CV object, into a dataframe.
- Print the difference between the column containing the average test score and that containing the average training score.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Fit your pipeline using GridSearchCV with three folds
grid_search = GridSearchCV(
    pipe, params, ____=3, return_train_score=____)

# Fit the grid search
gs = grid_search.____(____, ____)

# Store the results of CV into a pandas dataframe
results = pd.____(gs.____)

# Print the difference between mean test and training scores
print(
    results[____] - results['mean_train_score'])
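For reference, here is one way the blanks might be filled in, assuming params is the pre-defined hyperparameter grid for pipe. This is a sketch of a plausible completion, not the official course solution.

# Fit your pipeline using GridSearchCV with three folds
grid_search = GridSearchCV(
    pipe, params, cv=3, return_train_score=True)

# Fit the grid search; fit() returns the fitted CV object
gs = grid_search.fit(X_train, y_train)

# Store the results of CV into a pandas dataframe
results = pd.DataFrame(gs.cv_results_)

# Print the difference between mean test and training scores
print(
    results['mean_test_score'] - results['mean_train_score'])

Strongly negative differences (test score well below training score) flag candidates where the model fits the training folds much better than the held-out folds, which is the overfitting signal this exercise asks you to look for.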