Gradient boosting
1. Gradient boosting
In this lesson you'll be learning about another popular and powerful gradual learning ensemble method: Gradient Boosting.
2. Intro to gradient boosting machine
To understand the intuition behind the Gradient Boosting Machine, consider the following. Suppose that you want to estimate an objective function, say y as a function of X. On the first iteration, the initial model is a weak estimator fit to the dataset; let's call it f sub-one of X. Then, on each subsequent iteration, a new model is built and fitted to the residual error from the previous iteration. This error is calculated as y minus f sub-one of X. After each individual estimator is built, the result is a new additive model, which is an improvement on the previous estimate. We repeat this process n times, or until the error is small enough that the difference in performance is negligible. When the algorithm is finished, the result is a final, improved additive model. This is a peculiarity of Gradient Boosting: the individual estimators are not combined through voting or averaging, but by addition, because only the first model is fitted to the target variable, and the rest are estimates of the residual errors. A minimal sketch of this procedure is shown below.
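Here is a minimal sketch of the additive idea just described, fitting each new tree to the residuals of the current model. The synthetic dataset, the choice of DecisionTreeRegressor as the weak estimator, its depth, and the number of iterations are illustrative assumptions, not the lesson's own code.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

n_iterations = 5
estimators = []
prediction = np.zeros_like(y)   # the additive model starts at zero

for _ in range(n_iterations):
    residuals = y - prediction            # error of the current additive model
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                # new estimator is fit to the residuals
    prediction += tree.predict(X)         # estimators are combined by addition
    estimators.append(tree)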
You may be wondering why this method is called Gradient Boosting. That's because it is equivalent to applying gradient descent as the optimization algorithm. To better understand this, we're going to go over a bit of math now, but don't worry if this seems too advanced, as scikit-learn abstracts all of this away. The residuals are defined as y minus F sub-i of X. This represents the error that the model has at iteration i. Gradient Descent, for its part, is an iterative optimization algorithm that attempts to minimize the loss of an estimator. The loss, in this case the square loss, is defined as the square of the residuals divided by two. On every iteration, steps are taken in the direction of the negative gradient, which points toward the minimum. The gradient is the derivative of the loss with respect to the approximate function, which works out to F sub-i of X minus y. This expression looks familiar; in fact, it is exactly the opposite of the residuals. From the resulting gradient we can see the equivalence: the residuals are equal to the negative gradient. Therefore, on each iteration we are actually improving the model using Gradient Descent.
4. Gradient boosting classifier
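Written out, the quantities described above look like this, assuming the square loss mentioned in the lesson:

% Residuals and square loss for one example at iteration i,
% and the gradient of the loss with respect to the current model F_i(X).
\[
  r_i = y - F_i(X), \qquad
  L\bigl(y, F_i(X)\bigr) = \tfrac{1}{2}\bigl(y - F_i(X)\bigr)^2
\]
\[
  \frac{\partial L}{\partial F_i(X)} = F_i(X) - y
  \quad\Longrightarrow\quad
  r_i = -\,\frac{\partial L}{\partial F_i(X)}
\]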
To build a Gradient Boosting classifier, we first import the GradientBoostingClassifier class from the sklearn ensemble module. This allows you to instantiate the Gradient Boosting classifier. Unlike with other ensemble methods, here we don't specify the base_estimator, as Gradient Boosting is implemented and optimized with regression trees as the individual estimators. In classification, the trees are fitted to the class probabilities. The first parameter is n_estimators, which you already know; here it is 100 by default. Then we also specify the learning rate, a parameter you already learned about; it is 0.1 by default. In addition, we have the tree-specific parameters: the maximum depth, which is three by default, the minimum number of samples required to split a node, the minimum number of samples required in a leaf node, and the maximum number of features. In Gradient Boosting, it is recommended to use all the features.
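A sketch of the instantiation with the defaults mentioned above; the seed value and the X_train and y_train training set are assumed placeholders.

from sklearn.ensemble import GradientBoostingClassifier

clf_gbm = GradientBoostingClassifier(
    n_estimators=100,       # number of boosting iterations (default)
    learning_rate=0.1,      # shrinks each tree's contribution (default)
    max_depth=3,            # depth of the individual regression trees (default)
    min_samples_split=2,    # minimum samples required to split a node
    min_samples_leaf=1,     # minimum samples required in a leaf node
    max_features=None,      # None means all features are used
    random_state=500        # arbitrary seed for reproducibility
)
# clf_gbm.fit(X_train, y_train)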
In a similar way, we can build a Gradient Boosting regressor. The GradientBoostingRegressor class is also found in the scikit-learn ensemble module. To instantiate the Gradient Boosting regression model, you call it with the same parameters as before.
5. Gradient boosting regressor
6. Time to boost!
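The same pattern for regression, again assuming X_train and y_train as placeholders for an already-prepared training set:

from sklearn.ensemble import GradientBoostingRegressor

reg_gbm = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3
)
# reg_gbm.fit(X_train, y_train)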
It's time to boost some models using gradient descent!