
BaggingClassifier: nuts and bolts

1. BaggingClassifier: nuts and bolts

In this lesson, you'll learn how to use scikit-learn's BaggingClassifier and BaggingRegressor classes to build ensemble models.

2. Heterogeneous vs Homogeneous Functions

First, let's see a key difference between the ensemble functions for heterogeneous and homogeneous methods. To build a heterogeneous ensemble model, you call the corresponding function with the estimators parameter, which is a list of (name, estimator) tuples, plus some additional parameters. Each of those estimators must already be instantiated by you. To build a homogeneous ensemble model, however, instead of a list of estimators you pass the base_estimator parameter, which is the single instantiated "weak" model you have chosen for the ensemble. Then, you pass in the number of estimators you want for the ensemble, along with the corresponding additional parameters.
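
To make the contrast concrete, here is a minimal sketch. Using VotingClassifier as the heterogeneous example is our own choice for illustration, and note that scikit-learn 1.2 renamed base_estimator to estimator (the old name was removed in 1.4), so adjust to your version:

```python
from sklearn.ensemble import VotingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Heterogeneous: a list of (name, estimator) tuples,
# each estimator instantiated by you beforehand
clf_voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('dt', DecisionTreeClassifier()),
        ('knn', KNeighborsClassifier()),
    ]
)

# Homogeneous: one instantiated "weak" model plus how many
# copies of it to train (parameter named estimator in scikit-learn 1.2+)
clf_bagging = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),
    n_estimators=5,
)
```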

3. BaggingClassifier

Let's see an example. The first step is to instantiate the base estimator; here we'll use a decision tree limited to a maximum depth of three. Remember that the base estimator should be a "weak" model. The next step is to build the BaggingClassifier, passing the decision tree as the base estimator and specifying that we want five estimators. Then, we can fit the bagging classifier to the training set as with any other scikit-learn model. After that, you can use the bagging ensemble to make predictions on the test set or on new data.
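
In code, the steps look like this. This is a sketch: the breast cancer dataset and the train-test split are stand-ins we've assumed, since the lesson doesn't specify the data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 1: instantiate the "weak" base estimator
clf_dt = DecisionTreeClassifier(max_depth=3)

# Step 2: build the bagging classifier with five estimators
# (base_estimator is named estimator in scikit-learn 1.2+)
clf_bag = BaggingClassifier(base_estimator=clf_dt, n_estimators=5)

# Step 3: fit to the training set like any other scikit-learn model
clf_bag.fit(X_train, y_train)

# Step 4: predict on the test set or new data
y_pred = clf_bag.predict(X_test)
```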

4. BaggingRegressor

To solve regression problems, you can use the BaggingRegressor class. As an example, let's use linear regression as the base estimator. We can then build the BaggingRegressor. Since we are not specifying the number of estimators and the default value is ten, this will be a bagging ensemble of ten estimators. Then, we can train this ensemble model and use it to make predictions as usual.
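
A corresponding sketch for regression, again with an assumed toy dataset (load_diabetes) in place of the lesson's unspecified data:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import BaggingRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Base estimator: a linear regression model
reg_lm = LinearRegression()

# n_estimators is not specified, so the default of ten is used
# (base_estimator is named estimator in scikit-learn 1.2+)
reg_bag = BaggingRegressor(base_estimator=reg_lm)

reg_bag.fit(X_train, y_train)
y_pred = reg_bag.predict(X_test)
```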

5. Out-of-bag score

Let's end this video with a new concept that is useful in bagging ensembles: the out-of-bag score. Recall that in a bagging ensemble, each estimator is trained on a bootstrap sample. Each bootstrap sample therefore leaves out some of the instances, and those held-out instances can be used to evaluate the estimator, much like a train-test split. To compute the out-of-bag prediction for an instance, we gather the predictions from all the estimators whose bootstrap sample did not include it, and then combine those individual predictions. Finally, we evaluate the desired metric on these out-of-bag predictions. For classification, the default metric is accuracy, and for regression it's R squared, also known as the coefficient of determination. The out-of-bag score avoids the need for an independent test set, although it's often lower than the actual performance. To get the out-of-bag score from a bagging ensemble, we need to set the parameter oob_score to True. After training the model, we can access it using the attribute oob_score_, with an underscore at the end. It's good practice to compare this value to the actual metric - in this case, accuracy. If the two values are close, that's a good indicator of the model's ability to generalize to new data.
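
Continuing the classification sketch from earlier, enabling and reading the out-of-bag score looks like this:

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Build the bagging classifier with oob_score=True
# (base_estimator is named estimator in scikit-learn 1.2+; with only
# five estimators, a few instances may never be left out of a bootstrap
# sample, and scikit-learn will emit a warning about it)
clf_bag = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=5,
    oob_score=True,
)
clf_bag.fit(X_train, y_train)

# Out-of-bag accuracy, computed from the held-out instances
print('OOB accuracy:', clf_bag.oob_score_)

# Test-set accuracy for comparison; close values suggest the model
# generalizes well to new data
y_pred = clf_bag.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))
```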

6. Now it's your turn!

Now it's your turn to build Bagging ensemble models using the scikit-learn framework!
