
1. Understanding a grid search output

Now that you know how to run a grid search, let's focus on its output.

2. Analyzing the output

Let's now analyze each of the properties of the GridSearchCV output and learn how to access and use them. The properties of the object can be categorized into three groups: a results log, the best results, and extra information.

3. Accessing object properties

Properties are accessed using dot notation, that is, grid_search_object-dot-property, where property is the name of the property you want to retrieve. Let's review each of the key properties now.
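
As a rough, self-contained sketch of what this might look like in code (the dataset, estimator, grid values, and variable names here are illustrative assumptions, not the exact ones from the course):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative data and grid: 2 x 2 x 3 = 12 hyperparameter combinations
X, y = make_classification(n_samples=500, random_state=42)
param_grid = {
    "max_depth": [5, 10],
    "min_samples_leaf": [1, 2],
    "n_estimators": [100, 200, 300],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                     # 5-fold cross-validation
    return_train_score=True,  # needed later for the train_score columns
)
grid_search.fit(X, y)

# Properties are then read with dot notation, e.g. the results log:
results_log = grid_search.cv_results_
```

The later snippets in this section assume this grid_search object (and the cv_results_df DataFrame built from it in the next step).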

4. The .cv_results_ property

First, there is the cv_results_ property. This is a dictionary that we can read into a pandas DataFrame to explore. Notice there are 12 rows, because there are 12 squares in our grid. Each row tells you what happened when that square was tested.
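
Continuing from the sketch above, reading cv_results_ into a DataFrame might look like this:

```python
# cv_results_ is a dictionary; pandas turns it into one row per grid square
cv_results_df = pd.DataFrame(grid_search.cv_results_)
print(len(cv_results_df))      # 12 rows, one per square in the 2 x 2 x 3 grid
print(cv_results_df.columns)   # the column groups described below
```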

5. The .cv_results_ 'time' columns

The 'time' columns refer to the time it took to fit and score the model. Since we used 5-fold cross-validation, each model was fit and scored five times, and the mean and standard deviation of those times, in seconds, are stored here.
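
Continuing from the DataFrame built above, these are real cv_results_ column names and can be selected directly:

```python
# Mean and standard deviation of the fit and score times across the
# 5 cross-validation folds, in seconds
time_columns = [
    "mean_fit_time", "std_fit_time",
    "mean_score_time", "std_score_time",
]
print(cv_results_df[time_columns])
```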

6. The .cv_results_ 'param_' columns

The param_ columns contain information on the different hyperparameter values used in each model. Remember, each row in this DataFrame is about one model, so we can see that row 3, for example, tested the hyperparameter combination of max_depth 10, min_samples_leaf 2, and n_estimators 100 for our random forest estimator.
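
Continuing from the sketch, these columns can be selected by name (the exact param_ column names depend on the hyperparameters in your grid):

```python
# One 'param_<name>' column per hyperparameter in the grid
param_columns = ["param_max_depth", "param_min_samples_leaf", "param_n_estimators"]
print(cv_results_df[param_columns])

# Inspect the combination tested in a single row (row 3, for example)
print(cv_results_df.loc[3, param_columns])
```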

7. The .cv_results_ 'params' column

The params column contains a dictionary of all the parameters from the individual 'param_' columns. We need to use pd.set_option here to ensure the printed results are not truncated.
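
A possible way to do that, continuing from the sketch:

```python
# Widen the column display so the params dictionaries are not cut off
pd.set_option("display.max_colwidth", None)

# Each entry combines the individual 'param_' columns into one dictionary
print(cv_results_df["params"])
```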

8. The .cv_results_ 'test_score' columns

The next five columns are the test scores for each of the five cross-validation folds, or splits, we made, followed by the mean and standard deviation across those folds.
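
A sketch of selecting those columns from the DataFrame built earlier (assuming five folds, as in cv=5):

```python
# Per-fold test scores, plus their mean and standard deviation
test_score_columns = [f"split{i}_test_score" for i in range(5)]
test_score_columns += ["mean_test_score", "std_test_score"]
print(cv_results_df[test_score_columns])
```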

9. The .cv_results_ 'rank_test_score' column

The rank column conveniently ranks the rows by the mean_test_score. We can see that the model in our third row had the best mean_test_score.
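
Continuing from the same DataFrame, the ranking can be inspected like this:

```python
# rank_test_score orders the grid squares by mean_test_score (1 = best)
print(cv_results_df[["mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))
```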

10. Extracting the best row

Using the rank_test_score column, we can easily select the best grid square for analysis. The table shown here is the row from the cv_results_ object corresponding to the best model.
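
One way to pull out that row, continuing from the sketch above:

```python
# Keep the row(s) whose test-score rank is 1, i.e. the best grid square
best_row = cv_results_df[cv_results_df["rank_test_score"] == 1]
print(best_row)
```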

11. The .cv_results_ 'train_score' columns

The test_score columns are then repeated for the training scores. Note that if we had not set return_train_score to True, these training-score columns would not be included. There is also no ranking column for the training scores, since we only care about performance on the test set in each fold.
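
Continuing from the sketch, where return_train_score=True was passed, the training-score columns can be selected the same way:

```python
# Present only because return_train_score=True was set on the GridSearchCV
train_score_columns = [f"split{i}_train_score" for i in range(5)]
train_score_columns += ["mean_train_score", "std_train_score"]
print(cv_results_df[train_score_columns])
```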

12. The best grid square

Information on the best grid square is found in three different properties: best_params_, the dictionary of parameters that gave the best score; best_score_, the actual best score; and best_index_, the row in our cv_results_ that was the best. This is the same as the index of the row with rank 1 in cv_results_ that we extracted just before.
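
Continuing from the sketch above, these properties are read with the same dot notation:

```python
print(grid_search.best_params_)  # hyperparameters of the best grid square
print(grid_search.best_score_)   # its mean cross-validated test score
print(grid_search.best_index_)   # its row number in cv_results_

# The same row we extracted above using rank_test_score == 1
print(cv_results_df.loc[grid_search.best_index_, "rank_test_score"])  # 1
```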

13. The best_estimator_ property

GridSearchCV stores an estimator built with the best hyperparameters in the best_estimator_ property. Since it is an estimator, we can use it to predict on our test set. We can confirm this by using Python's type function and seeing that it is a random forest classifier estimator. We can also use the GridSearchCV object itself directly as an estimator.
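
A sketch of this, continuing from the fitted grid_search above (here predicting back on X for illustration; in practice you would use a held-out test set):

```python
# The refitted estimator built with the best hyperparameters
print(type(grid_search.best_estimator_))   # RandomForestClassifier

# Predict with the best estimator, or with the GridSearchCV object itself;
# the grid search object delegates predict() to best_estimator_
preds_from_best = grid_search.best_estimator_.predict(X)
preds_from_grid = grid_search.predict(X)
```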

14. The best_estimator_ property

We can print out the estimator itself and inspect it. This is why we set refit=True when creating the grid search; otherwise we would need to refit the model with the best parameters ourselves before using the best estimator.
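
Continuing from the sketch (note that refit=True is also scikit-learn's default):

```python
# Printing the refitted best estimator; it exists because refit=True means
# GridSearchCV refit this model on the whole dataset after the search finished
print(grid_search.best_estimator_)
```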

15. Extra information

Some extra information can be obtained from the following properties. These are not the most useful properties, but they may matter if you construct your grid search differently. They include the scorer function that was used and the number of cross-validation splits (both of which we set ourselves), and refit_time_, which is the number of seconds spent refitting the best model on the whole dataset. This may be of interest when analyzing efficiency in your work, but not for our use case here.
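
Continuing from the fitted grid_search above:

```python
print(grid_search.scorer_)      # the scoring function used to evaluate models
print(grid_search.n_splits_)    # number of cross-validation splits (5 here)
print(grid_search.refit_time_)  # seconds spent refitting the best model
```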

16. Let's practice!

Let's practice analyzing the output of a scikit-learn GridSearchCV object!