1. Model selection: ensemble models
Welcome! I'm glad to see that you've made it to the final lesson in this course! In this video lesson, we're going to talk about how to choose from among different ensemble models. Ready to get started?
2. Bootstrapping
Recall from lesson 2.4 and earlier in this chapter, when we discussed ensemble methods, that bootstrapping is a sampling technique where a subset of the data is selected with replacement, meaning that the same row of data may be chosen more than once in a given subset. A model is built with each bootstrapped sample, and the output predictions are then averaged. This has the effect of reducing variance, ultimately resulting in a more accurate model.
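Here's a minimal sketch of that idea in plain NumPy, outside of any particular algorithm: we repeatedly sample the rows with replacement, fit a very simple "model" (just the sample mean) on each bootstrapped sample, and average the results. The toy dataset, the number of bootstraps, and the choice of the mean as the model are all made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=200)   # toy dataset for illustration

n_bootstraps = 1000
estimates = []
for _ in range(n_bootstraps):
    # Sample with replacement: the same row may be chosen more than once
    sample = rng.choice(data, size=len(data), replace=True)
    estimates.append(sample.mean())               # "fit" a model on this sample

# Averaging across the bootstrapped models reduces variance
print(np.mean(estimates), np.std(estimates))
```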
3. Random forest
The random forest machine learning algorithm is essentially a bootstrapped ensemble of many, many decision trees, which can, and often do, number in the thousands, combined to come up with a highly accurate model.
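To make that concrete, here's a rough sketch showing that a random forest is set up much like a hand-built bagging ensemble of decision trees. BaggingClassifier uses a decision tree as its base estimator by default, and the n_estimators values below are simply illustrative.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Bootstrapped decision trees built "by hand": BaggingClassifier defaults to a
# decision tree base estimator and samples rows with replacement
bagged_trees = BaggingClassifier(n_estimators=500, bootstrap=True, random_state=123)

# Random forest: the same bootstrapping idea, with extra randomness at each split
rf = RandomForestClassifier(n_estimators=500, random_state=123)

# Both are fit the same way, e.g. bagged_trees.fit(X, y) and rf.fit(X, y)
```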
4. Gradient Boosting
Recall that boosting also builds multiple individual models, but does so in sequential order: each model learns to reduce the predictive error of the previous models by modifying the original dataset with weights for incorrectly predicted instances, which results in a model with decreased bias. Gradient boosting builds an additive model in a forward stage-wise fashion, optimizing with each iteration.
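As a brief illustration of that forward stage-wise behavior, the sketch below fits a GradientBoostingClassifier on a toy dataset (standing in for the loan data) and uses staged_predict to watch accuracy change as boosting stages are added; the sizes and random states here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the loan dataset
X, y = make_classification(n_samples=1000, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

gb = GradientBoostingClassifier(n_estimators=100, random_state=123)
gb.fit(X_train, y_train)

# staged_predict yields predictions after each additional boosting stage
for stage, y_pred in enumerate(gb.staged_predict(X_test), start=1):
    if stage in (1, 50, 100):
        print(stage, accuracy_score(y_test, y_pred))
```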
5. RF vs GB
Although the random forest and gradient boosting classifier algorithms have several parameters in common, it is good to be aware of the differences not only in their default values, but in their behavior as well.
n_estimators for the random forest indicates the number of trees the algorithm should build. It currently defaults to 10, but will soon change to 100 with the next sklearn update.
For gradient boosting, this is the number of boosting stages to perform. Gradient boosting is fairly robust to overfitting, so a large number usually results in better performance, and it already defaults to 100.
The criterion parameter tells the algorithm what to base the splits on. As we've discussed previously, the random forest uses the Gini index by default, which is a measure of impurity. The other option is entropy for information gain, or how much information is gained by a particular split.
For gradient boosting, however, the default is something called friedman_mse, which is the mean squared error with an improvement score by Friedman. There are other MSE-based options as well.
max_depth for the random forest defaults to None, which allows the nodes to expand until all leaves are pure or until all leaves contain fewer than min_samples_split samples; min_samples_split is another parameter, whose default is 2.
For gradient boosting, this is the maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree, and the default is set at 3.
Another parameter, not used in the random forest but important to take note of for gradient boosting, is the learning rate.
The learning rate shrinks the contribution of each tree by that amount. There is a trade-off between the learning_rate and n_estimators parameters.
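To tie the comparison together, here's a sketch that spells out those parameters explicitly when instantiating each classifier. The values simply restate the defaults discussed above, with n_estimators set to 100 for the random forest to match the newer sklearn default.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,          # number of trees (older sklearn versions defaulted to 10)
    criterion='gini',          # impurity measure; 'entropy' uses information gain
    max_depth=None,            # expand until leaves are pure or below min_samples_split
    min_samples_split=2)

gb = GradientBoostingClassifier(
    n_estimators=100,          # number of boosting stages
    criterion='friedman_mse',  # mean squared error with Friedman's improvement score
    max_depth=3,               # depth of each individual regression estimator
    learning_rate=0.1)         # shrinks each tree's contribution; trades off with n_estimators
```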
6. Functions
There are just a few functions to review that you'll come across in the exercises. Both from sklearn.ensemble, the RandomForestClassifier and GradientBoostingClassifier functions are used when the dataset has a categorical target variable, and they'll give you a great last look at the loan dataset we've used so many times before in this course. From sklearn.metrics, you'll use accuracy_score, confusion_matrix, precision_score, recall_score, and f1_score, which return their respective performance metrics. And remember to always check for class imbalance, as we know that this dataset definitely has it!
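Here's a rough sketch of how those pieces fit together in the exercises; an imbalanced toy dataset stands in for the loan data, and the exact sizes and random states are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)
from sklearn.model_selection import train_test_split

# Imbalanced toy data, similar in spirit to the loan dataset's target
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

for model in (RandomForestClassifier(n_estimators=100, random_state=123),
              GradientBoostingClassifier(random_state=123)):
    preds = model.fit(X_train, y_train).predict(X_test)
    print(type(model).__name__)
    print(confusion_matrix(y_test, preds))          # always check for class imbalance!
    print("accuracy :", accuracy_score(y_test, preds))
    print("precision:", precision_score(y_test, preds))
    print("recall   :", recall_score(y_test, preds))
    print("f1       :", f1_score(y_test, preds))
```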
7. Let's practice!
And now, it's your turn to compare a few ensemble models and see which one performs best!