1. Ensembles and hyperparameter tuning
In this lesson, you will build upon the concepts of the previous lessons by learning about ensemble methods and hyperparameter tuning, two more ways of potentially boosting model performance.
2. Ensemble methods
Ensemble methods combine many individual classifiers to improve the quality of a model. The goal is to lower the variance of the overall model without substantially increasing its bias, a balance known as the bias-variance trade-off. Although the mathematical details of why these methods work are beyond the scope of the course, one concept is important to know: bootstrap aggregation (or bagging), in which a different random sample of the training data is drawn for each model. Each model is then trained individually on its own sample, and the final model combines the predictions of all of those separate models, as seen in the picture.
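As a rough illustration of bagging, here is a minimal sketch using sklearn's BaggingClassifier; the synthetic dataset and the parameter values are assumptions made purely for illustration, not part of the lesson's exercises.

```python
# Minimal bagging sketch: each tree is trained on its own bootstrap sample,
# and the ensemble combines their predictions by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The base estimator defaults to a decision tree; 50 of them are bagged here
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))
```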
3. Random forests
Random forests are an ensemble method that uses bagging to create individual decision trees, whose predictions are then aggregated. Sklearn has a readily available implementation called RandomForestClassifier. There are many parameters, as seen in the output, but two main ones are max_depth, which, like its decision tree counterpart, controls how deep each tree in the ensemble can grow, and n_estimators, which is the number of trees used in the random forest. Random forests follow the same model implementation workflow as all other model types.
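A minimal sketch of that workflow with RandomForestClassifier follows; the synthetic dataset and the specific values of n_estimators and max_depth are illustrative assumptions, not values from the lesson.

```python
# Random forest following the usual workflow: instantiate, fit, score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators sets the number of trees; max_depth limits how deep each tree grows
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```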
4. Hyperparameter tuning
The two parameters mentioned on the previous slide, max_depth and n_estimators, are actually known as hyperparameters. Put simply, hyperparameters are values external to the model that a data scientist configures before training. For example, linear regression has a slope coefficient, which is not a hyperparameter because it is learned from the data during training. In contrast, max_depth and n_estimators are set before training and tweaked manually through trial and error. Since hyperparameters can affect a model's outcomes, we always want to find the configuration of hyperparameters that boosts performance the most.
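To make the distinction concrete, the sketch below contrasts a learned model parameter with hyperparameters set beforehand; the tiny dataset and the chosen values are made up for illustration.

```python
# A learned parameter vs. hyperparameters set before training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Tiny made-up regression data
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

lin = LinearRegression().fit(X, y)
print(lin.coef_)  # slope coefficient: learned from the data, not a hyperparameter

# max_depth and n_estimators: hyperparameters chosen by the data scientist before fitting
rf = RandomForestClassifier(n_estimators=50, max_depth=3)
```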
5. Grid search
Grid search is one method of testing hyperparameters by trying every combination we specify, such as those on the previous slide. Sklearn has an implementation available called GridSearchCV. In sklearn, we use a param_grid dictionary to specify the combinations to test. First, we create a list of candidate values for each hyperparameter, as done on the previous slide with n_estimators and max_depth. Then we place both lists in a dictionary called param_grid, which we pass into the GridSearchCV object. The arguments to GridSearchCV are as follows: the first is the model we want to use (such as a random forest), the second is the param_grid dictionary we just created, and the third is the scoring-function string, similar to the strings we saw earlier during the k-fold cross-validation process. Afterwards, we can examine many attributes, each ending with an underscore, such as best_score_, which returns the best score according to the scoring function we specified, and best_estimator_, which returns the model configuration that yielded the best results.
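Putting those pieces together, here is a minimal GridSearchCV sketch; the candidate values in param_grid and the synthetic dataset are illustrative assumptions rather than the ones from the slides.

```python
# Grid search over n_estimators and max_depth, scored with a scoring-function string.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, random_state=42)

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [2, 4, 8]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, scoring="accuracy")
grid.fit(X, y)

print(grid.best_score_)      # best cross-validated score under the chosen scoring function
print(grid.best_estimator_)  # the model configuration that yielded it
```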
6. Let's practice!
Now that you've learned about ensembles and hyperparameter tuning, let's jump right in!