1. Understanding a grid search output
Now that you know how to run a grid search, let's focus on its output.
2. Analyzing the output
Let us now analyze each of the properties of the GridSearchCV output and learn how to access and use them.
The properties of the object can be categorized into three different groups:
a results log,
the best results,
and 'extra information'.
3. Accessing object properties
Properties are accessed using dot notation, that is,
grid_search_object.property,
where property is the actual property you want to retrieve.
Let's review each of the key properties now.
4. The .cv_results_ property
Firstly there is the cv_results_ property.
This is a dictionary that we can read into a pandas DataFrame to explore.
Notice there are 12 rows because there are 12 squares in our grid. Each row tells you about what happened when testing that square.
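As a rough sketch of this step, the snippet below fits a small grid search and reads cv_results_ into a DataFrame. The synthetic data, the grid values, the scoring choice and the names grid_rf and cv_results_df are all assumptions made for illustration; the grid is simply chosen so that it has 2 x 3 x 2 = 12 squares.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for the course dataset (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hypothetical grid: 2 x 3 x 2 = 12 combinations ("squares").
param_grid = {
    "max_depth": [2, 10],
    "min_samples_leaf": [1, 2, 4],
    "n_estimators": [10, 20],
}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
    refit=True,
    return_train_score=True,
)
grid_rf.fit(X, y)

# cv_results_ is a dictionary of arrays; pandas turns it into one row per square.
cv_results_df = pd.DataFrame(grid_rf.cv_results_)
print(cv_results_df.shape)
```

Each key of the dictionary becomes a column, so the number of rows matches the number of squares in the grid.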
5. The .cv_results_ 'time' columns
The 'time' columns refer to the time it took to fit and score the model.
We used 5-fold cross-validation, so each model was fit and scored 5 times; the columns store the mean and standard deviation of those times, in seconds.
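A minimal sketch of pulling out just the time columns is shown here. The data, the small grid and the variable names are assumptions for illustration; the four column names are the ones scikit-learn writes into cv_results_.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small hypothetical grid (assumptions).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
param_grid = {"max_depth": [2, 10], "min_samples_leaf": [1, 2]}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid=param_grid,
    cv=5,
)
grid_rf.fit(X, y)

cv_results_df = pd.DataFrame(grid_rf.cv_results_)

# Mean and standard deviation, across the 5 folds, of fit and score times (seconds).
time_cols = ["mean_fit_time", "std_fit_time", "mean_score_time", "std_score_time"]
print(cv_results_df.loc[:, time_cols])
```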
6. The .cv_results_ 'param_' columns
The param_ columns contain information on the different parameters that were used in the model. Remember, each row in this DataFrame is about one model.
So we can see that row 3, for example, tested the hyperparameter combination of max_depth 10, min_samples_leaf 2 and n_estimators 100 for our random forest estimator.
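One way to view only the param_ columns is to filter the column names by their prefix, as in this sketch. The data, grid values and variable names are assumptions, so the rows printed will not match the ones described above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a hypothetical 12-square grid (assumptions).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
param_grid = {
    "max_depth": [2, 10],
    "min_samples_leaf": [1, 2, 4],
    "n_estimators": [10, 20],
}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
)
grid_rf.fit(X, y)

cv_results_df = pd.DataFrame(grid_rf.cv_results_)

# Keep one column per hyperparameter; the "params" column lacks the trailing
# underscore, so this prefix test leaves it out.
param_cols = [col for col in cv_results_df.columns if col.startswith("param_")]
print(cv_results_df.loc[:, param_cols])
```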
7. The .cv_results_ 'params' column
The params column holds, for each row, a dictionary of all the parameters from the previous 'param_' columns.
We need to use pd.set_option here to ensure we don't truncate the results we are printing.
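A sketch of printing the params column without truncation follows. The option used here, display.max_colwidth set to None, is an assumption about which setting was meant; the data, grid and variable names are also illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small hypothetical grid (assumptions).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
param_grid = {"max_depth": [2, 10], "min_samples_leaf": [1, 2]}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid=param_grid,
    cv=5,
)
grid_rf.fit(X, y)

cv_results_df = pd.DataFrame(grid_rf.cv_results_)

# Remove the column width limit so each dictionary prints in full.
pd.set_option("display.max_colwidth", None)
print(cv_results_df.loc[:, "params"])
```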
8. The .cv_results_ 'test_score' columns
The next 5 columns are the test scores for each of the 5 cross-validation folds, or splits, followed by the mean and standard deviation across those folds.
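To make the relationship between these columns concrete, this sketch selects the per-split scores and recomputes their mean by hand. The data, grid, scoring choice and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small hypothetical grid (assumptions).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
param_grid = {"max_depth": [2, 10], "min_samples_leaf": [1, 2]}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
)
grid_rf.fit(X, y)

cv_results_df = pd.DataFrame(grid_rf.cv_results_)

# One column per fold: split0_test_score ... split4_test_score.
split_cols = [f"split{i}_test_score" for i in range(5)]
print(cv_results_df.loc[:, split_cols + ["mean_test_score", "std_test_score"]])

# Recompute the per-row mean to compare with the stored mean_test_score.
manual_mean = cv_results_df[split_cols].mean(axis=1)
print(np.allclose(manual_mean, cv_results_df["mean_test_score"]))
```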
9. The .cv_results_ 'rank_test_score' column
The rank column conveniently ranks the rows by the mean_test_score.
We can see that the model in our third row had the best mean_test_score.
10. Extracting the best row
Using the rank_test_score column we can easily select the best grid search square for analysis.
This table is the row from cv_results_ that corresponds to the best model created.
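A possible way to extract that row is to filter on rank 1, as sketched here. The data, grid and variable names are assumptions; note that if several squares tie for the best mean score, the filter returns more than one row.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a hypothetical 12-square grid (assumptions).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
param_grid = {
    "max_depth": [2, 10],
    "min_samples_leaf": [1, 2, 4],
    "n_estimators": [10, 20],
}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
)
grid_rf.fit(X, y)

cv_results_df = pd.DataFrame(grid_rf.cv_results_)

# Rank 1 marks the highest mean_test_score.
best_row = cv_results_df[cv_results_df["rank_test_score"] == 1]
print(best_row)
```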
11. The .cv_results_ 'train_score' columns
The test_score columns are then repeated for the training scores.
Note that if we had not set return_train_score to True this would not include the training scores.
There is also no ranking column for the training scores, as we only care about performance on the test set in each fold.
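The effect of return_train_score can be checked by fitting the same search twice, once with each setting. This is a sketch on assumed synthetic data with a deliberately tiny grid; the names with_train and without_train are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny hypothetical grid (assumptions).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
param_grid = {"max_depth": [2, 10]}


def run_search(return_train_score):
    """Fit a grid search and return its cv_results_ as a DataFrame."""
    search = GridSearchCV(
        estimator=RandomForestClassifier(n_estimators=10, random_state=42),
        param_grid=param_grid,
        cv=5,
        return_train_score=return_train_score,
    )
    search.fit(X, y)
    return pd.DataFrame(search.cv_results_)


with_train = run_search(True)
without_train = run_search(False)

# Training score columns appear only when they were requested.
print([col for col in with_train.columns if "train_score" in col])
print([col for col in without_train.columns if "train_score" in col])
```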
12. The best grid square
Information on the best grid square is found in three different properties:
best_params_, which is the dictionary of the parameters that gave the best score,
best_score_, the actual best score,
and best_index_, the row in our cv_results_ that was the best. This is the same as the index of the row with rank 1 in cv_results_ that we extracted just before.
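These three properties can be read directly from the fitted object, and cross-checked against cv_results_, as in this sketch. The data, grid, scoring choice and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small hypothetical grid (assumptions).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
param_grid = {"max_depth": [2, 10], "min_samples_leaf": [1, 2]}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
)
grid_rf.fit(X, y)

best_params = grid_rf.best_params_  # dictionary of the winning hyperparameters
best_score = grid_rf.best_score_    # mean cross-validated score of that square
best_index = grid_rf.best_index_    # position of that square in cv_results_

print(best_params)
print(best_score)
print(best_index)

# The same information, looked up in the results log.
print(grid_rf.cv_results_["params"][best_index])
print(grid_rf.cv_results_["mean_test_score"][best_index])
```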
13. The best_estimator_ property
GridSearchCV stores an estimator built with the best hyperparameters in the best_estimator_ property. Since it is an estimator, we can use this to predict on our test set.
We can demonstrate this by using Python's type function and see it is a RandomForestClassifier estimator.
We can also use the GridSearchCV object itself directly as an estimator.
14. The best_estimator_ property
We can print out and see the estimator itself.
This is why we set refit=True when creating the grid search; otherwise we would need to refit with the best parameters ourselves before using the best estimator.
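The points from the last two slides can be sketched as follows: checking the type of best_estimator_, printing it, and predicting with either it or the grid search object. The train/test split, data, grid and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with a held-out test set (assumptions).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

param_grid = {"max_depth": [2, 10], "min_samples_leaf": [1, 2]}
grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid=param_grid,
    cv=5,
    refit=True,  # refit the best combination on all of the training data
)
grid_rf.fit(X_train, y_train)

# The stored best model is an ordinary estimator.
print(type(grid_rf.best_estimator_))
print(grid_rf.best_estimator_)

# Predicting through the best estimator or through the grid search object.
preds_from_estimator = grid_rf.best_estimator_.predict(X_test)
preds_from_grid = grid_rf.predict(X_test)
print(np.array_equal(preds_from_estimator, preds_from_grid))
```

With refit=False, scikit-learn does not create best_estimator_, so both prediction routes shown here would be unavailable until a model is refit by hand.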
15. Extra information
Some extra information can be obtained with the following properties. These are not very useful properties but may be important if you construct your grid search differently.
These include scorer_, the scorer function that was used, and n_splits_, the number of cross-validation splits (both of which we set ourselves),
and refit_time_, which is the number of seconds used for refitting the best model on the whole dataset. This may be of interest in analyzing efficiencies in your work, but not for our use case here.
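For completeness, here is a short sketch that reads these three properties from a fitted search. The data, grid, scoring choice and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny hypothetical grid (assumptions).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
param_grid = {"max_depth": [2, 10]}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
    refit=True,
)
grid_rf.fit(X, y)

print(grid_rf.scorer_)      # scorer built from the scoring argument
print(grid_rf.n_splits_)    # number of cross-validation splits
print(grid_rf.refit_time_)  # seconds spent refitting the best model
```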
16. Let's practice!
Let's practice analyzing the output of a scikit-learn GridSearchCV object!