Gradient boosted trees: modeling

Gradient boosting is a technique for improving the performance of other models. The idea is that you fit a weak but easy-to-calculate model, then replace the response values with the residuals from that model and fit another model to them. By adding the original response prediction and the new residual prediction together, you get a more accurate model. You can repeat this process over and over, fitting each new model to the residuals of the combined model so far and adding its predictions in. With each iteration, the combined model typically fits the data more closely.
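The loop described above can be sketched in a few lines of base R. This is only an illustration, not how Spark implements it: the synthetic data, the one-split "stump" weak learner (fit_stump() and predict_stump() are hypothetical helpers), and the learning_rate shrinkage factor are all assumptions that do not come from the text. Setting learning_rate to 1 gives the plain "add the residual model" scheme.

```r
# Minimal sketch of gradient boosting for regression (squared-error loss).
# Base R only; data, helpers, and learning_rate are illustrative assumptions.

set.seed(42)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

# Weak learner: a one-split tree ("stump") fitted by least squares.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between x values
  best <- list(sse = Inf)
  for (s in candidates) {
    left <- r[x <= s]
    right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s, left = mean(left), right = mean(right))
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

n_rounds <- 50
learning_rate <- 0.1  # assumption: shrink each new model's contribution

prediction <- rep(mean(y), length(y))  # round 0: a constant model
initial_mse <- mean((y - prediction)^2)
mse <- numeric(n_rounds)

for (i in seq_len(n_rounds)) {
  current_residuals <- y - prediction          # what the model so far gets wrong
  stump <- fit_stump(x, current_residuals)     # fit the weak learner to residuals
  prediction <- prediction + learning_rate * predict_stump(stump, x)
  mse[i] <- mean((y - prediction)^2)           # training error after this round
}

print(c(initial = initial_mse, after_1 = mse[1], after_50 = mse[n_rounds]))
```

The training error recorded in mse should shrink from round to round, which is the "stronger with each iteration" behavior described above. Note that this only measures fit on the training data; it says nothing about performance on new data.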

To give a more concrete example, sparklyr uses gradient boosted trees, which means gradient boosting with decision trees as the weak-but-easy-to-calculate model. These can be used for both classification problems (where the response variable is categorical) and regression problems (where the response variable is continuous). In the regression case, which you'll be using here, the measure of how badly a point was fitted is its residual.

Decision trees are covered in more depth in the courses Supervised Learning in R: Classification and Supervised Learning in R: Regression. The latter course also covers gradient boosting.

To run a gradient boosted trees model in sparklyr, call ml_gradient_boosted_trees(). Usage for this function was discussed in the first exercise of this chapter.

A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_to_model_tbl.

Get the columns containing the string "timbre" to use as features.
- Use colnames() to get the column names of track_data_to_model_tbl. Note that names() won't give you what you want.
- Use str_subset() to filter the columns.
- The pattern argument to that function should be fixed("timbre").
- Assign the result to feature_colnames.
Create the formula for the model using reformulate().
- The termlabels argument (inputs of the formula) should be feature_colnames.
- The response argument (output of the formula) should be "year".
- Assign the result to year_formula.
- Using reformulate() this way combines all the variables in feature_colnames with a + sign to form the right-hand side of the formula. This results in a formula year ~ timbre1 + timbre2 + ... + timbre12, which defines the relationship between the variables to be included in the model.
Run the gradient boosting model.
- Call ml_gradient_boosted_trees() with the year_formula you created as its only argument.
- Assign the result to gradient_boosted_trees_model.
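The steps above can be tried locally before touching Spark. The sketch below assumes the stringr package is installed and uses a made-up vector of column names (all_colnames is hypothetical, standing in for the result of colnames(track_data_to_model_tbl)); the real tibble has different and more numerous columns. The final Spark call is left as a comment because it needs the live spark_conn connection and the exercise's data.

```r
library(stringr)

# Hypothetical column names standing in for colnames(track_data_to_model_tbl).
all_colnames <- c("artist_name", "year", "timbre1", "timbre2", "timbre3")

# Keep only the names containing the literal string "timbre".
# fixed() makes the pattern a plain string rather than a regular expression.
feature_colnames <- str_subset(all_colnames, fixed("timbre"))

# Join the features with "+" and put "year" on the left-hand side,
# giving year ~ timbre1 + timbre2 + timbre3.
year_formula <- reformulate(termlabels = feature_colnames, response = "year")
print(year_formula)

# With sparklyr loaded and the exercise's tibble available, the fit would be:
# gradient_boosted_trees_model <-
#   ml_gradient_boosted_trees(track_data_to_model_tbl, year_formula)
```

Building the formula with reformulate() rather than typing it by hand means the model picks up whichever timbre columns exist in the data, however many there are.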
