1. Grid Search with Scikit Learn
In this lesson we will move beyond our manual code and leverage Scikit Learn to assist our grid search.
2. GridSearchCV Object
In this lesson we will be introduced to Scikit Learn's GridSearchCV. It will help us create a grid search more efficiently and get some performance analytics.
This is an example of a GridSearchCV object. Don't worry, we will break it down!
3. Steps in a Grid Search
Firstly, let us conceptualize the steps needed to do a proper grid search. Some of these will be familiar from our manual work before.
One: Select an algorithm (or 'estimator') to tune.
Two: Define which hyperparameters we will tune.
Three: Define a range of values for each hyperparameter.
Four: Decide a cross-validation scheme.
Five: Define a scoring function to determine which model was the best.
Six: Include extra useful information or functions.
The only one of these we did not do much work with previously is step (4), but we will cover each now.
4. GridSearchCV Object Inputs
A GridSearchCV object takes several important arguments.
estimator.
param_grid.
cv.
scoring.
refit.
n_jobs.
return_train_score.
5. GridSearchCV 'estimator'
The estimator is our algorithm.
Examples include KNN, Random Forest, GBM or Logistic Regression.
We only pick one algorithm for each grid search.
6. GridSearchCV 'param_grid'
param_grid is how we tell GridSearchCV which hyperparameters and which values to test.
We were previously using lists, but param_grid needs a dictionary.
The dictionary keys must be the hyperparameter names, and each value must be a list of the settings to test for that hyperparameter.
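As a rough sketch of what such a dictionary can look like: the hyperparameter names below are real tree-model hyperparameters, but the particular values are made up for illustration. ParameterGrid is used only to count how many combinations the grid describes.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative values only; in the manual approach these were separate lists.
max_depth_values = [2, 4, 6, 8]
min_samples_leaf_values = [1, 2, 4, 6]

# Keys are hyperparameter names; each value is the list of settings to try.
param_grid = {
    "max_depth": max_depth_values,
    "min_samples_leaf": min_samples_leaf_values,
}

# A grid search tries every combination of the listed values.
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)
```

With four values for each of two hyperparameters, the grid describes 4 x 4 = 16 candidate models.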
7. GridSearchCV 'param_grid'
The keys in the param_grid dictionary must be valid hyperparameters of the chosen estimator; otherwise the grid search will fail.
In the example here, 'best_choice' is not a hyperparameter of Scikit Learn's Logistic Regression estimator,
and so this will fail.
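A small sketch of that failure, assuming a synthetic dataset from make_classification and made-up values for the invalid key. In the Scikit Learn versions I am aware of, the error appears when fit is called rather than when the object is created, and the valid names can be listed with get_params.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only so the example can run.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# 'best_choice' is not a LogisticRegression hyperparameter.
bad_grid = {"best_choice": [10, 20, 50]}
search = GridSearchCV(LogisticRegression(), param_grid=bad_grid, cv=3)

# The valid hyperparameter names can be read from the estimator itself.
valid_names = sorted(LogisticRegression().get_params().keys())

try:
    search.fit(X, y)
    failed = False
except ValueError as err:
    failed = True
    print("Grid search failed:", str(err)[:80])
```

Checking the keys against valid_names before fitting is a cheap way to catch a typo in a hyperparameter name.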
8. GridSearchCV 'cv'
The cv input allows you to undertake cross-validation.
You could specify different cross-validation types here.
But simply providing an integer will create a k-fold scheme with that many folds (stratified when the estimator is a classifier). You are likely familiar with standard 5-fold and 10-fold cross-validation.
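To make the two options concrete, here is a minimal sketch that passes cv once as an integer and once as a splitter object. The dataset, the Logistic Regression estimator and the values of C are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data, used only so the example can run.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0]}

# An integer asks for that many folds.
search_int = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search_int.fit(X, y)

# A splitter object gives more control, for example shuffling before splitting.
splitter = KFold(n_splits=10, shuffle=True, random_state=0)
search_obj = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=splitter)
search_obj.fit(X, y)

# n_splits_ records how many folds each search actually used.
print(search_int.n_splits_, search_obj.n_splits_)
```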
9. GridSearchCV 'scoring'
`scoring` is a scoring function used to evaluate your model's performance. You did this manually previously using accuracy.
You can use your own custom metric, or one from the available metrics from Scikit Learn's metrics module.
You can check all available metrics using this command.
10. GridSearchCV 'refit'
Setting refit to True means the best hyperparameter combination is used to refit the estimator on the full training data.
The GridSearchCV object can then be used as an estimator directly.
This is very handy as you don't need to save out the best hyperparameters and train another model.
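A short sketch of what refit leaves behind after fitting, assuming a KNN estimator, a synthetic dataset and an arbitrary set of n_neighbors values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only so the example can run.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=5,
    refit=True,
)
search.fit(X, y)

# best_params_ holds the winning combination, and best_estimator_ is a model
# already refitted on all of X and y using those values.
print(search.best_params_)
print(type(search.best_estimator_).__name__)
```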
11. GridSearchCV 'n_jobs'
n_jobs assists with parallel execution.
You can effectively 'split up' your work and have many models being created at the same time.
This is possible because the results of one model do not affect the next one.
You can check how many cores you have available, which determines how many models you can run in parallel, using this handy code.
Be careful using all cores for a task though as this may mean you can't do other work on your computer while your models run.
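One way to check the core count is with Python's standard os module, as sketched below. Leaving one core free is just one possible convention for keeping the machine responsive, not a rule.

```python
import os

# Number of logical cores reported by the operating system
# (this can be None if it cannot be determined).
n_cores = os.cpu_count()
print(n_cores)

# Keep one core free for other work, but always use at least one.
n_jobs = max(1, (n_cores or 1) - 1)
print(n_jobs)
```

The resulting n_jobs value can then be passed to GridSearchCV.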
12. GridSearchCV 'return_train_score'
Finally `return_train_score` logs the scores achieved on the training folds, alongside the usual scores on the held-out folds.
This can be useful for plotting and understanding test vs training set performance (and hence bias-variance tradeoff).
While informative, this is computationally expensive and will not assist in finding the best model.
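As an illustration, the sketch below turns return_train_score on and reads both sets of scores from cv_results_. The decision tree estimator, the synthetic data and the depth values are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example can run.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 3, 10]},
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)
search.fit(X, y)

# With return_train_score=True, cv_results_ gains train-score columns.
results = search.cv_results_
for depth, train, test in zip(
    results["param_max_depth"],
    results["mean_train_score"],
    results["mean_test_score"],
):
    print(depth, round(train, 3), round(test, 3))
```

A large gap between the training and test scores for a given setting is the usual sign of overfitting.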
13. Building a GridSearchCV object
Now we have all the components to build a grid search object.
Firstly we create our parameter grid for the hyperparameters and values we want to input.
Then we create the base classifier, setting some default values at the time of creation.
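These two steps might look something like the following. The choice of a Random Forest, the hyperparameter values and the names param_grid and rf_class are assumptions for illustration, not necessarily what appears on the slide.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters to search over, with the values to test for each.
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 4, 6],
}

# Base classifier; settings given here stay fixed during the search.
rf_class = RandomForestClassifier(criterion="entropy", random_state=42)
```

Anything set on the classifier at creation, such as criterion here, is held constant, while the keys in param_grid are varied by the search.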
14. Building a GridSearchCV Object
We can now put the pieces together to create the GridSearchCV object.
You can see all the elements you learned about previously including the estimator and parameter grid we just created.
If this seems like a lot of code, review the couple of previous slides to see what each element means.
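A sketch of such an object, using every argument covered so far. The Random Forest estimator, the grid values, the roc_auc scoring choice and the variable names are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 2, 4, 6]}
rf_class = RandomForestClassifier(criterion="entropy", random_state=42)

grid_rf_class = GridSearchCV(
    estimator=rf_class,        # the algorithm to tune
    param_grid=param_grid,     # hyperparameters and values to test
    scoring="roc_auc",         # how candidate models are compared
    n_jobs=2,                  # how many models to build in parallel
    cv=5,                      # 5-fold cross-validation
    refit=True,                # refit the best model on all training data
    return_train_score=True,   # also record training-fold scores
)
```

Nothing is trained at this point; the search only runs once fit is called.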
15. Using a GridSearchCV Object
With 'refit' set to True, we can directly use the GridSearchCV object as an estimator.
That means we can fit onto our data and make predictions, just like any other Scikit Learn estimator!
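A compact end-to-end sketch of that workflow, assuming synthetic data, a held-out test split and a deliberately small grid so it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only so the example can run.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

grid_rf_class = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [2, 4], "min_samples_leaf": [1, 4]},
    scoring="accuracy",
    cv=3,
    refit=True,
)

# Fit runs the whole search, then refits the best model on X_train.
grid_rf_class.fit(X_train, y_train)

# Predictions come from the refitted best model.
predictions = grid_rf_class.predict(X_test)
print(predictions[:5])
print(grid_rf_class.score(X_test, y_test))
```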
16. Let's practice!
Let's undertake our own Grid Search with Scikit Learn's GridSearchCV class!