1. Tuning a CART's hyperparameters
To obtain better performance, the hyperparameters of a machine learning model should be tuned.
2. Hyperparameters
Machine learning models are characterized by parameters and hyperparameters.
Parameters are learned from data through training; examples of parameters include the split-feature and the split-point of a node in a CART.
Hyperparameters are not learned from data; they should be set prior to training. Examples of hyperparameters include the maximum-depth and the splitting-criterion of a CART.
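To make this distinction concrete, here is a minimal sketch in scikit-learn; the dataset and hyperparameter values are illustrative choices, not part of the lesson.

```python
# Minimal sketch: hyperparameters are set before training,
# while parameters (the tree's splits) are learned during fit().
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

# max_depth and criterion are hyperparameters: set by us, not learned.
dt = DecisionTreeClassifier(max_depth=4, criterion='gini', random_state=1)
dt.fit(X, y)

# The split-features and split-points are parameters: learned from the data.
print(dt.tree_.feature[:5])    # split-feature of the first few nodes
print(dt.tree_.threshold[:5])  # split-point of the first few nodes
```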
3. What is hyperparameter tuning?
Hyperparameter tuning consists of searching for the set of optimal hyperparameters for the learning algorithm.
In other words, the goal is to find the set of hyperparameter values that yields the optimal model, i-dot-e-dot, the model achieving the best score.
The score function measures the agreement between true labels and a model's predictions.
In sklearn, it defaults to accuracy for classifiers and r-squared for regressors.
A model's generalization performance is evaluated using cross-validation.
4. Why tune hyperparameters?
A legitimate question that you may ask is: why bother tuning hyperparameters?
Well, in scikit-learn, a model's default hyperparameters are not optimal for all problems.
Hyperparameters should be tuned to obtain the best model performance.
5. Approaches to hyperparameter tuning
Now, there are many approaches to hyperparameter tuning, including grid-search, random-search, and so on.
In this course, we'll only be exploring the method of grid-search.
6. Grid search cross validation
In grid-search cross-validation, first you manually set a grid of discrete hyperparameter values.
Then, you pick a metric for scoring model performance and you search exhaustively through the grid.
For each set of hyperparameters, you evaluate each model's score.
The optimal hyperparameters are those for which the model achieves the best cross-validation score.
Note that grid-search suffers from the curse of dimensionality, i-dot-e-dot, the bigger the grid, the longer it takes to find the solution.
7. Grid search cross validation: example
Let's walk through a concrete example to understand this procedure.
Consider the case of a CART where you search through the two-dimensional hyperparameter grid shown here.
The dimensions correspond to the CART's maximum-depth and the minimum-percentage of samples per leaf.
For each combination of hyperparameters, the cross-validation score is evaluated using k-fold CV for example.
Finally, the optimal hyperparameters correspond to the model achieving the best cross-validation score.
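Here is a minimal sketch of this procedure done by hand with scikit-learn's cross_val_score; the grid values are hypothetical, and the upcoming slides show the built-in GridSearchCV that automates the same loop.

```python
# Manual grid-search CV sketch: exhaustively score every combination
# of max_depth and min_samples_leaf (fraction of samples per leaf).
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

max_depths = [2, 3, 4]                  # hypothetical grid values
min_samples_leafs = [0.04, 0.06, 0.08]  # hypothetical grid values

best_score, best_combo = -1.0, None
for depth, leaf in product(max_depths, min_samples_leafs):
    dt = DecisionTreeClassifier(max_depth=depth,
                                min_samples_leaf=leaf,
                                random_state=1)
    # Mean k-fold CV accuracy for this hyperparameter combination
    score = cross_val_score(dt, X, y, cv=5, scoring='accuracy').mean()
    if score > best_score:
        best_score, best_combo = score, (depth, leaf)

print('Best combination:', best_combo)
print('Best CV accuracy:', best_score)
```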
8. Inspecting the hyperparameters of a CART in sklearn
Let's now see how we can inspect the hyperparameters of a CART in scikit-learn.
You can first instantiate a DecisionTreeClassifier dt as shown here.
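For instance, the instantiation could look like this; the random_state value is an assumption added here for reproducibility.

```python
from sklearn.tree import DecisionTreeClassifier

# Instantiate a classification tree dt with default hyperparameters
dt = DecisionTreeClassifier(random_state=1)
```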
9. Inspecting the hyperparameters of a CART in sklearn
Then, call dt's -dot-get_params() method. This returns a dictionary where the keys are the hyperparameter names.
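In code:

```python
# Inspect dt's hyperparameters; the output below is abbreviated and
# the exact keys vary with the scikit-learn version.
print(dt.get_params())
# {'criterion': 'gini', 'max_depth': None, 'max_features': None, ...,
#  'min_samples_leaf': 1, ...}
```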
In the following, we'll only be optimizing max_depth, max_features and min_samples_leaf.
Note that max_features is the number of features to consider when looking for the best split. When it's a float, it is interpreted as the fraction of features to consider.
You can learn more about these hyperparameters by consulting scikit-learn's documentation.
10. Grid search CV in sklearn (Breast Cancer dataset)
Let's now tune dt on the Wisconsin Breast Cancer dataset, which is already loaded and split into 80%-train and 20%-test.
First, import GridSearchCV from sklearn-dot-model_selection.
Then, define a dictionary called params_dt containing the names of the hyperparameters to tune as keys and lists of hyperparameter-values as values.
Once done, instantiate a GridSearchCV object grid_dt by passing dt as an estimator and params_dt as param_grid. Also set scoring to accuracy and cv to 10 in order to use 10-fold stratified cross-validation for model evaluation.
Finally, fit grid_dt to the training set.
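Putting these steps together, the code could look as follows; the grid values are illustrative, and X_train and y_train are assumed to hold the 80%-train split.

```python
from sklearn.model_selection import GridSearchCV

# Grid of hyperparameter values to search (illustrative values)
params_dt = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8],
}

# cv=10 performs 10-fold stratified CV for a classifier
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='accuracy',
                       cv=10)

# Fit grid_dt to the training set; X_train and y_train are assumed defined
grid_dt.fit(X_train, y_train)
```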
11. Extracting the best hyperparameters
After training grid_dt, the best set of hyperparameter-values can be extracted from the attribute -dot-best_params_ of grid_dt.
Also, the best cross validation accuracy can be accessed through grid_dt's -dot-best_score_ attribute.
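In code:

```python
# Extract the best set of hyperparameter-values found by the search
best_hyperparams = grid_dt.best_params_
print('Best hyperparameters:\n', best_hyperparams)

# Extract the best cross-validation accuracy
best_CV_score = grid_dt.best_score_
print('Best CV accuracy:', best_CV_score)
```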
12. Extracting the best estimator
Similarly, the best model can be extracted using the -dot-best_estimator_ attribute. Note that this model is fitted on the whole training set because the refit parameter of GridSearchCV is set to True by default.
Finally, you can evaluate this model's test set accuracy using the score method. The result is about 94-dot-7%, while the score of an untuned CART is about 93%.
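In code, assuming X_test and y_test hold the 20%-test split:

```python
# Extract the best model; it was refit on the whole training set
# because refit=True by default in GridSearchCV
best_model = grid_dt.best_estimator_

# Evaluate the best model's accuracy on the test set
test_acc = best_model.score(X_test, y_test)
print('Test set accuracy of best model: {:.3f}'.format(test_acc))
```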
13. Let's practice!
Now it's your turn to practice.