1. Selecting your final model
In this lesson, we will explore the output of a random search implementation and then select and reuse our final model.
2. Random search output
To start, we will assume the variable rs is an implementation of RandomizedSearchCV() that has already been fit on data. Let's explore some of the key attributes of rs.
Of course, the first attributes we eager beavers have to check are the ones that focus on the best model, and they are aptly named: best_score_, best_params_, and best_estimator_. These attributes provide the results from the best model found in the random search, and you will use them often.
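As a quick sketch, assuming rs is the fitted RandomizedSearchCV object described above, these attributes can be accessed directly:

print(rs.best_score_)      # best mean cross-validation score found
print(rs.best_params_)     # hyperparameter values that produced that score
print(rs.best_estimator_)  # estimator refit with those hyperparameters (when refit=True, the default)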
3. Other attributes
Of course, there are other attributes to explore. Perhaps the most useful will be rs.cv_results_, which contains a dictionary full of the cross-validation results.
This dictionary includes keys such as "mean_test_score", which gives the average cross-validation test score for each run. The dictionary also contains the key "params", which holds the selected hyperparameters for each model run.
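Here is a minimal sketch of pulling those two keys out of the fitted search object rs:

results = rs.cv_results_
print(results["mean_test_score"])  # average CV test score for each run
print(results["params"])           # list of hyperparameter settings, one dictionary per run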
4. Using .cv_results_
We can use these attributes to create visuals of the output or make inferences on which hyperparameters are having the biggest impact on the model.
For example, let's look at the mean test scores grouped by the maximum depth of the model. Here we grabbed the max depth from each of the 10 models, as well as the mean test score. We then created a pandas DataFrame and grouped the scores by the maximum depth.
If we look at the output, max depths of 2, 4, and even 6 all produced really low scores. However, max depths of 8 and 10 almost achieved 90% accuracy.
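The grouping just described might look something like the sketch below, assuming the search included a max_depth hyperparameter (the variable names are illustrative, not the lesson's exact code):

import pandas as pd

# Pull the max depth and mean test score for each model run
max_depths = [p["max_depth"] for p in rs.cv_results_["params"]]
scores = rs.cv_results_["mean_test_score"]

# Build a DataFrame and average the scores by max depth
df = pd.DataFrame({"max_depth": max_depths, "mean_test_score": scores})
print(df.groupby("max_depth")["mean_test_score"].mean())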
5. Other attributes continued
There are a ton of ways to use the cv_results_ output to visualize the effect of each hyperparameter. In the case we just explored, it's probably best to use a larger max depth when running your models.
These results might inspire you to rerun the random search with a slightly different hyperparameter space.
Right now, we just want to select the best model from our random search.
6. Selecting the best model
However you perform hyperparameter tuning, in the end you'll need to select one final model. This may be the model with the best accuracy, or the model with the highest precision or recall.
For now, let's assume we are going for the best mean squared error. The model with the lowest error from the cross-validation is our guess for the model that will perform the best on future data.
The best_estimator_ attribute contains the model that performed the best during cross-validation.
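One way to set this up, shown only as a sketch with an illustrative dataset and hyperparameter space, is to pass scoring="neg_mean_squared_error" to the search. Scikit-learn maximizes scores, so the sign is flipped and the best model is the one with the lowest error:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative data and hyperparameter space (not the lesson's exact setup)
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
param_dist = {"max_depth": [2, 4, 6, 8, 10]}

rs = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=5,
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=0)
rs.fit(X, y)

print(-rs.best_score_)           # lowest cross-validated MSE (sign flipped back)
best_model = rs.best_estimator_  # the model we will reuse going forward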
7. Comparing types of models
As an aside, if you built different types of models, say a random forest and a gradient boosting model, you can compare the performance of your final models on the test set that you held out. This gives an unbiased estimate and can help you make your final overall decision. In the case above, you would select the gradient boosting model as the final model because it had a lower mean squared error on the test data.
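A sketch of that comparison, assuming rf_best and gbm_best are the two tuned models and X_test and y_test are the held-out test set (illustrative names only), might look like this:

from sklearn.metrics import mean_squared_error

# Evaluate both final models on the same held-out test set
rf_mse = mean_squared_error(y_test, rf_best.predict(X_test))
gbm_mse = mean_squared_error(y_test, gbm_best.predict(X_test))

print("Random forest test MSE:    ", rf_mse)
print("Gradient boosting test MSE:", gbm_mse)
# The model with the lower test MSE becomes the final model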
8. Using .best_estimator_
Let's use best_estimator_ from a random forest model. You can call the predict() method on this estimator with new data, just like any other scikit-learn model. You can check all of the parameters that were used by calling the get_params() method, or you can save the estimator as a pickle file using the joblib module for reuse later. This will allow you to load your model at a later date or share it with a colleague.
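Here is a short sketch of those steps, assuming rs is the fitted search object and X_new is some new data to score (an illustrative name):

from joblib import dump, load

best_model = rs.best_estimator_

predictions = best_model.predict(X_new)  # predict on new data like any scikit-learn model
print(best_model.get_params())           # inspect every hyperparameter the model used

dump(best_model, "final_model.joblib")       # save the fitted model to disk
reloaded_model = load("final_model.joblib")  # reload it later, or on a colleague's machine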
9. Let's practice!
Let's work through a couple of examples.