Exercise

Gradient boosted trees: modeling

Gradient boosting is a technique for improving the performance of other models. The idea is that you first fit a weak but easy-to-calculate model. Then you replace the response values with the residuals from that model and fit another model to them. By adding the prediction of the original model to the prediction of the residual model, you get a more accurate model. You can repeat this process over and over, fitting each new model to the residuals left by the models before it and adding its predictions to the running total. With each iteration, the combined model fits the training data more closely.
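To make the loop described above concrete, here is a minimal sketch in base R. It is an illustration only, not how sparklyr or Spark implements gradient boosted trees. The simulated data, the one-split "stump" used as the weak model, the helper names fit_stump() and predict_stump(), and the learning_rate shrinkage factor are all assumptions introduced for this example. Setting learning_rate to 1 gives the plain "add the residual model" version described in the text.

```r
# Minimal gradient boosting for regression, using one-split stumps (base R only).
set.seed(42)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

# Weak model: a single split on x, chosen to minimise the squared error.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between sorted x values
  sse <- sapply(candidates, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- candidates[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

n_rounds <- 50
learning_rate <- 0.1                      # assumed shrinkage factor
prediction <- rep(mean(y), length(y))     # start from a constant model
initial_mse <- mean((y - prediction)^2)
training_mse <- numeric(n_rounds)

for (i in seq_len(n_rounds)) {
  residuals <- y - prediction             # what the current model still gets wrong
  stump <- fit_stump(x, residuals)        # fit the weak model to the residuals
  prediction <- prediction + learning_rate * predict_stump(stump, x)
  training_mse[i] <- mean((y - prediction)^2)
}
```

Comparing initial_mse with the values in training_mse shows the training error shrinking as more residual models are added.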

To give a more concrete example, sparklyr uses gradient boosted trees, which means gradient boosting with decision trees as the weak-but-easy-to-calculate model. These can be used for both classification problems (where the response variable is categorical) and regression problems (where the response variable is continuous). In the regression case, as you'll be using here, the measure of how badly a point was fitted is the residual.

Decision trees are covered in more depth in the Supervised Learning in R: Classification and Supervised Learning in R: Regression courses. The latter course also covers gradient boosting.

To run a gradient boosted trees model in sparklyr, call ml_gradient_boosted_trees(). Usage for this function was discussed in the first exercise of this chapter.

Instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_to_model_tbl.

  • Get the columns containing the string "timbre" to use as features.
    • Use colnames() to get the column names of track_data_to_model_tbl. Note that names() won't work here: on a Spark-backed tibble it returns the names of the R object's internal components, not the column names.
    • Use str_subset() to filter the columns.
    • The pattern argument to that function should be fixed("timbre").
    • Assign the result to feature_colnames.
  • Run the gradient boosting model.
    • Call ml_gradient_boosted_trees().
    • The output (response) column is "year".
    • The input columns are feature_colnames.
    • Assign the result to gradient_boosted_trees_model.
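
The steps above might look roughly like the following sketch. The names in example_colnames are made up for illustration; in the exercise they come from colnames(track_data_to_model_tbl). The helper fit_year_model() is hypothetical and only wraps the two steps so that nothing touches Spark until it is called. The response and features arguments are assumed from the sparklyr interface this course appears to use; newer sparklyr releases may prefer a formula argument, so check ?ml_gradient_boosted_trees for your installed version.

```r
library(stringr)

# Hypothetical column names, standing in for colnames(track_data_to_model_tbl).
example_colnames <- c("artist_name", "year", "timbre_means1", "timbre_covariances1")

# fixed() makes str_subset() match "timbre" literally rather than as a regex.
example_features <- str_subset(example_colnames, pattern = fixed("timbre"))

# Hypothetical helper wrapping both exercise steps for a Spark-backed tibble.
fit_year_model <- function(tbl) {
  feature_colnames <- str_subset(colnames(tbl), pattern = fixed("timbre"))
  # Assumption: response/features arguments as in the sparklyr version
  # used by this course; newer releases may prefer a formula instead.
  sparklyr::ml_gradient_boosted_trees(
    tbl,
    response = "year",
    features = feature_colnames
  )
}
```

In the exercise session, gradient_boosted_trees_model <- fit_year_model(track_data_to_model_tbl) would then run both steps. Because the exercise asks for feature_colnames as its own variable, you would assign it at the top level rather than inside a helper.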