Bagging parameters: tips and tricks
1. Bagging parameters: tips and tricks
Welcome to the final lesson of Chapter 2! Here you'll learn some tips and tricks to improve your bagging ensembles.
2. Basic parameters for bagging
Let's review some of the parameters of bagging ensemble models you've already seen. One of the most important is base_estimator, the "weak" model that will be built for each sample. The n_estimators parameter specifies the number of estimators to use. This is ten by default, but in practice we'll use more; larger values tend to work better, and between 100 and 500 estimators are usually enough. You also learned how to calculate the out-of-bag score by setting the parameter oob_score to True.
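As an illustration, here is a minimal sketch of these basic parameters in scikit-learn. It assumes that X_train and y_train already exist, and a scikit-learn version where the parameter is still called base_estimator (newer releases rename it to estimator).

# Bagging with decision trees as the "weak" base estimator
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

clf_bag = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),  # model built for each sample
    n_estimators=100,                         # more than the default of 10
    oob_score=True,                           # compute the out-of-bag score
    random_state=42)
clf_bag.fit(X_train, y_train)
print(clf_bag.oob_score_)                     # accuracy on out-of-bag instances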
3. Additional parameters for bagging
Let's take a look at some additional parameters you can use to further improve your bagging models. First we have max_samples, which specifies the number of instances to draw for each estimator. The default is 1.0, equivalent to 100%. Another important parameter is max_features. This is the number of features to draw randomly for each estimator. It is also 1.0 by default. Using lower values provides more diversity for the individual models and reduces the correlation among them, as each will get a different sample of both features and instances. For classification, the optimal value usually lies around the square root of the number of features. For regression, it is usually close to one third of the number of features. There's also the parameter bootstrap, a boolean indicating whether samples are drawn with replacement. The default is True, as that is the nature of bagging. If bootstrap is True, it is recommended to keep max_samples at 100%. If False, max_samples should be lower than 100%, because otherwise all the samples would be identical.
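To see how these parameters fit together, here is a hedged sketch combining them. The specific values (200 estimators, 50% of the features) are illustrative assumptions rather than recommendations from the lesson, and X_train and y_train are again assumed to exist.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

clf_bag = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),
    n_estimators=200,
    max_samples=1.0,     # draw 100% of the instances for each estimator
    max_features=0.5,    # each estimator sees a random 50% of the features
    bootstrap=True,      # sample instances with replacement (the default)
    oob_score=True,
    random_state=42)
clf_bag.fit(X_train, y_train)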
4. Random forest
Random forests, which you may have seen before, are a special case of bagging where the base estimators are decision trees. If you want to use decision trees as base estimators, it is recommended to use the random forest classes instead, as these are specifically designed for trees. The scikit-learn implementation of random forests combines the models using averaging instead of voting, so there is no need to use an odd number of estimators. These classes are also part of the sklearn.ensemble module. For classification, we have the RandomForestClassifier, and for regression, there's the RandomForestRegressor. Let's look at some of the most important parameters. First we have the parameters shared with bagging: n_estimators, max_features, and oob_score. Then we have the tree-specific parameters: the maximum depth, the minimum number of samples required to split a node, the minimum number of samples required in a leaf node, and class_weight. class_weight is a useful parameter which allows you to specify the weight for each class using a dictionary. Alternatively, we can pass the string value "balanced" and the model will use the class distribution to calculate balanced weights. This means random forests are able to deal with imbalanced targets.
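A comparable sketch for a random forest could look like the following. The parameter values are illustrative assumptions, and X_train and y_train are assumed to be a (possibly imbalanced) classification dataset.

from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier(
    n_estimators=300,         # shared with bagging
    max_features='sqrt',      # roughly the square root of the number of features
    oob_score=True,
    max_depth=10,             # tree-specific parameters
    min_samples_split=10,
    min_samples_leaf=5,
    class_weight='balanced',  # weights computed from the class distribution
    random_state=42)
clf_rf.fit(X_train, y_train)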
5. Bias-variance tradeoff
Let's remind ourselves of the bias-variance tradeoff. A simple model has low variance but high bias. Adding more complexity to the model may reduce the bias but increase the variance of the predictions. That's why it's important to tune the ensemble's parameters to minimize the total error and find the balance between bias and variance.
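One common way to search for that balance, not covered in this lesson, is a cross-validated grid search over the tree-specific parameters. The following is a minimal sketch assuming X_train and y_train exist; the parameter grids are illustrative.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],   # deeper trees: lower bias, higher variance
    'min_samples_leaf': [1, 5, 10]}  # larger leaves: smoother, higher-bias trees

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=42),
    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)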
6. Let's practice!
Let's round out Chapter 2 now with some interactive exercises!