
Gradient boosting

1. Gradient boosting

Welcome back! Let's recap and strengthen what we know about boosting so far.

2. Recap: boosting

In boosting, we use weak learners, such as decision trees with only one split, which perform only slightly better than random chance. Boosting adds these weak learners sequentially, and at every step it filters out the observations that the current learner already gets right. In other words, each new weak learner concentrates on the observations that are still difficult. One of the very first boosting algorithms developed was AdaBoost. Gradient boosting improved on several features of AdaBoost to create a stronger and more efficient algorithm.

3. Comparison

AdaBoost uses decision stumps as weak learners. Decision stumps are simply decision trees with a single split. It also attaches weights to observations, putting more weight on 'difficult-to-classify' observations and less on those that are easy to classify. Gradient boosting uses short, less complex decision trees instead of decision stumps. Instead of weighting the observations, it uses an error function called a loss function to measure how far off its predictions are. Because the loss function is optimized using gradient descent, the method is called "gradient boosting".
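
To make the gradient descent idea a bit more concrete, here is a standard sketch of one boosting step; the notation is not from the slide, just the usual formulation, with L the loss function, F_{m-1} the current ensemble, h_m the small tree added at step m, and \nu the learning rate:

r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}}

F_m(x) = F_{m-1}(x) + \nu \, h_m(x)

Each new tree h_m is fitted to the pseudo-residuals r_{im}, so it points in the direction that reduces the loss the most, and the learning rate \nu controls how large a step the ensemble takes.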

4. Pros & cons of boosting

One of the reasons boosting is so popular is that, if tuned properly, its performance is often better than that of any other algorithm in your toolbox. An optimized boosted model can outperform even state-of-the-art deep learning models on many datasets. Boosting is also a good option for imbalanced datasets. In applications like forgery or fraud detection, the classes are almost certainly imbalanced: the number of authentic transactions is huge compared with the number of fraudulent ones. One problem we may encounter with gradient-boosted decision trees, but not with random forests, is overfitting caused by adding too many trees. In random forests, adding more trees simply stops improving accuracy after a certain point, without causing overfitting. Depending on how you set the learning rate hyperparameter, training boosted ensembles can be slow, especially because boosting is an iterative rather than a parallel algorithm. Additionally, there are a few more tuning hyperparameters than in the other models you already know.
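
To see the speed trade-off concretely, here is a minimal sketch in tidymodels using the learn_rate and trees arguments discussed on the next slides. It assumes the xgboost engine is installed, and the values are purely illustrative, not recommendations.

library(tidymodels)

# Small learning rate: each tree contributes only a little, so many more
# boosting iterations (trees) are needed and training takes longer.
careful_spec <- boost_tree(trees = 2000, learn_rate = 0.01) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Larger learning rate: fewer trees are needed and training is faster,
# but the ensemble is more prone to overfitting.
greedy_spec <- boost_tree(trees = 200, learn_rate = 0.3) %>%
  set_engine("xgboost") %>%
  set_mode("classification")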

5. Hyperparameters for gradient boosting

Speaking of hyperparameters, let's discuss the hyperparameters that boosted trees have. Some you already know from previous models, and some we need to introduce. min_n is the minimum number of data points in a node that is required for the node to be split further; single decision trees have the same parameter. tree_depth is the maximum depth of the tree, that is, the number of splits; again, you already know this from simple decision trees. sample_size is the amount of data exposed to the fitting routine, similar to bagged trees or random forests. trees is the number of trees contained in the ensemble, which is again similar to random forests and bagged trees.

6. Hyperparameters for gradient boosting

mtry is the number of predictors that are randomly sampled at each split when creating the tree models; you already know this from random forests. learn_rate is the rate at which the boosting algorithm adapts from iteration to iteration. loss_reduction is the reduction in the loss function required for a node to be split further. And finally, stop_iter is the number of iterations without improvement before the algorithm stops.
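
Putting all of these together, here is a hedged sketch of a model specification with parsnip's boost_tree(). The values are placeholders, and the validation engine argument is an assumption about the xgboost engine, used so that stop_iter has a hold-out set to monitor.

library(tidymodels)

boost_spec <- boost_tree(
  trees = 500,           # number of trees in the ensemble
  min_n = 10,            # minimum data points in a node to allow a further split
  tree_depth = 4,        # maximum depth of each tree
  sample_size = 0.8,     # amount of data exposed to each fitting routine
  mtry = 5,              # predictors randomly sampled at each split
  learn_rate = 0.1,      # how strongly the algorithm adapts each iteration
  loss_reduction = 0.01, # loss reduction required to split further
  stop_iter = 20         # iterations without improvement before stopping
) %>%
  set_engine("xgboost", validation = 0.2) %>%  # assumed engine argument: hold out 20% for early stopping
  set_mode("classification")

boost_spec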

7. Let's practice!

Alright, it's your turn now to train a boosted model using the credit card customers dataset.