1. Feature importances and gradient boosting
We're going to wrap up with a few more things for the tree-based methods: feature importances, and gradient boosting.
2. Feature importances
All tree-based methods allow us to get "feature importances". These are scores representing how much each feature contributes to the model's predictions. For regression, a feature's importance reflects how much its splits reduce the variance of the target. The days-of-the-week variables are very weak predictors, so they don't reduce the variance much. This means their feature importance is low, and we may want to throw out those features.
3. Good features
Other features, like the 200-day simple moving average, are better at reducing the variance in the data -- so these will have higher scores on the feature importance scale.
4. Extracting feature importances
Once we fit a model like a random forest, we can extract the feature importances. These are stored in an attribute unsurprisingly called feature_importances_. The values align with the feature columns in the training data we gave the model.
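Here is a minimal sketch of that step, using synthetic stand-in data and assumed variable names (feature_names, train_features, train_targets) rather than the course's actual dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for the course data (assumed names, not the real dataset)
rng = np.random.RandomState(42)
feature_names = ['ma200', 'rsi14', 'weekday_mon', 'weekday_tue']
train_features = rng.randn(500, len(feature_names))
train_targets = 0.5 * train_features[:, 0] + 0.1 * rng.randn(500)

rf = RandomForestRegressor(n_estimators=200, max_depth=3, random_state=42)
rf.fit(train_features, train_targets)

# One score per feature, aligned with the columns of train_features
importances = rf.feature_importances_
print(dict(zip(feature_names, importances.round(3))))
```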
5. Sorting and plotting
To plot the feature importances in a clean way, we want to sort them from greatest to least. To do this, we use numpy's argsort(), which returns the indices that would sort an array from least to greatest.
Then we use Python's slice notation to reverse that order. A slice takes a start, a stop, and a step inside square brackets, separated by colons; to reverse an array, we leave the start and stop blank and use a step of -1.
We then plot the feature importances as a bar plot. We create the x positions from the length of importances, and create tick labels from feature_names using the sorted index. Finally, we draw the bar plot with the sorted importances and the feature names as tick labels. Because the feature names are long, we rotate them to be vertical.
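Continuing the sketch above (reusing the assumed importances and feature_names variables), the sort-and-plot step might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# argsort gives ascending order; a slice with step -1 reverses it
sorted_index = np.argsort(importances)[::-1]

x = range(len(importances))                     # x positions for the bars
labels = np.array(feature_names)[sorted_index]  # tick labels in sorted order

plt.bar(x, importances[sorted_index], tick_label=labels)
plt.xticks(rotation='vertical')                 # long names read better vertically
plt.tight_layout()
plt.show()
```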
6. Plotting feature importances
Now that we have the feature importances plotted, we can learn which features really matter. In this case, the days of the week are very weak predictors, and we may want to throw those out. But the 200-day indicators seem to be pretty useful.
7. Linear models vs gradient boosting
On to gradient boosting. My favorite graphic for gradient boosting is this, which comes from a Kaggle-dot-com blog. If linear models are a Toyota Camry, then gradient boosting is a Black Hawk helicopter. Linear models are great, simple models that are easy to use and understand. Boosted models have the potential to work much better, but are also much more difficult to use and interpret.
8. Boosted models
Boosted models are a general class of machine learning models. They work by iteratively fitting models, such as decision trees, to data. Gradient boosting fits one tree, then fits another tree to the residual errors of the first. Then it fits another tree to the errors that remain after the most recent one, and so on.
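To make the residual-fitting idea concrete, here is a toy sketch on made-up data; it illustrates the loop for squared-error loss, not sklearn's actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy sine wave
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

learning_rate = 0.1
prediction = np.zeros_like(y)              # start from a prediction of zero
for _ in range(50):
    residuals = y - prediction             # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                 # fit the next tree to those errors
    prediction += learning_rate * tree.predict(X)

print('training MSE:', np.mean((y - prediction) ** 2))
```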
9. Boosted models
Another boosting method you could use is AdaBoost. In this course, we'll only look at gradient boosting, not AdaBoost. All boosted models work similarly, fitting models iteratively in order to improve the predictions.
10. Fitting a gradient boosting model
The sklearn library has an implementation of gradient boosting. We create and fit it in much the same way as any other sklearn model. Gradient boosted models have lots of hyperparameters, and we won't go through them all here. However, I've set them to decent values in the coding exercise, so you can get a feel for what reasonable settings look like.
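A minimal sketch of that fit, reusing the assumed train_features and train_targets from the random forest example above; the hyperparameter values here are illustrative, not the ones used in the exercise:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative hyperparameter values (assumed, not the exercise's settings)
gbr = GradientBoostingRegressor(learning_rate=0.01,
                                n_estimators=200,
                                subsample=0.6,
                                max_features=4,
                                random_state=42)
gbr.fit(train_features, train_targets)

# R-squared on the training set, just to confirm the model fit something
print(gbr.score(train_features, train_targets))
```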
11. Get boosted!
Now that we know how, let's look at feature importances and fit a gradient boosting model.