
Gradient boosted trees: modeling

Gradient boosting is a technique for improving the performance of other models. The idea is that you first fit a weak but easy-to-calculate model. Then you replace the response values with the residuals from that model and fit another model. By adding the predictions of the original model and the residual model, you get a more accurate model. You can repeat this process over and over, fitting new models to the residuals of the previous ones and adding the results in. With each iteration, the combined model fits the training data more closely.
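
To make the residual-fitting loop described above concrete, here is a minimal sketch in base R. It is a toy illustration, not how sparklyr or Spark implements boosting: the simulated data, the one-split "stump" weak learner, and the helper names fit_stump() and predict_stump() are all assumptions made for this example.

```r
# Toy illustration of boosting on residuals, using base R only.
# The data and helper functions are hypothetical, made up for this sketch.
set.seed(42)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# Weak learner: a one-split "stump" that predicts the mean response
# on each side of a threshold chosen to minimize squared error.
fit_stump <- function(x, r) {
  # Candidate thresholds; drop the largest value so both sides are non-empty
  candidates <- head(sort(unique(x)), -1)
  sse <- sapply(candidates, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- candidates[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

n_iter <- 20
prediction <- rep(mean(y), length(y))  # start from a constant model
mse <- numeric(n_iter)

for (i in seq_len(n_iter)) {
  # The residuals of the current combined model become the new response
  current_residuals <- y - prediction
  stump <- fit_stump(x, current_residuals)
  # "Add" the residual model to the combined prediction
  prediction <- prediction + predict_stump(stump, x)
  mse[i] <- mean((y - prediction)^2)
}

# Training error after the first and the last iteration
mse[c(1, n_iter)]
```

The training error shrinks as stumps are added, which is the sense in which each iteration improves the fit. Error on the training data says nothing about performance on new data, so in practice the number of iterations is something to tune.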

To give a more concrete example, sparklyr provides gradient boosted trees, that is, gradient boosting with decision trees as the weak-but-easy-to-calculate model. These can be used for both classification problems (where the response variable is categorical) and regression problems (where the response variable is continuous). In the regression case, which you'll be using here, the measure of how badly a point was fitted is the residual.

Decision trees are covered in more depth in the courses Supervised Learning in R: Classification and Supervised Learning in R: Regression. The latter course also covers gradient boosting.

To run a gradient boosted trees model in sparklyr, call ml_gradient_boosted_trees(). Usage for this function was discussed in the first exercise of this chapter.

This exercise is part of the course

Introduction to Spark with sparklyr in R


Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_to_model_tbl.

  • Get the columns containing the string "timbre" to use as features.
    • Use colnames() to get the column names of track_data_to_model_tbl. Note that names() won't give you what you want.
    • Use str_subset() to filter the columns.
    • The pattern argument to that function should be fixed("timbre").
    • Assign the result to feature_colnames.
  • Create the formula for the model using reformulate().
    • The termlabels argument (inputs of the formula) should be feature_colnames.
    • The response argument (output of the formula) should be "year".
    • Assign the result to year_formula.
    • Using reformulate() this way combines all the variables in feature_colnames with a + sign to form the right-hand side of the formula. This results in a formula year ~ timbre1 + timbre2 + ... + timbre12, which defines the relationship between the variables to be included in the model.
  • Run the gradient boosting model.
    • Pipe track_data_to_model_tbl into ml_gradient_boosted_trees(), passing the year_formula you created as the only other argument.
    • Assign the result to gradient_boosted_trees_model.
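
The first two steps above can be tried without a Spark connection. The sketch below is a local illustration only: track_data_local is a hypothetical data frame with made-up values standing in for track_data_to_model_tbl, and base R's grep() with fixed = TRUE is used as a stand-in for str_subset() with fixed("timbre"), so that the example needs no extra packages.

```r
# Hypothetical local stand-in for track_data_to_model_tbl (made-up values).
track_data_local <- data.frame(
  year = c(1990, 2001, 2005),
  artist_name = c("a", "b", "c"),
  timbre1 = c(0.1, 0.2, 0.3),
  timbre2 = c(1.1, 1.2, 1.3),
  timbre3 = c(2.1, 2.2, 2.3)
)

# Keep the column names that contain the literal string "timbre".
# grep(..., fixed = TRUE, value = TRUE) is a base R stand-in for
# str_subset(fixed("timbre")).
feature_colnames <- grep(
  "timbre", colnames(track_data_local),
  fixed = TRUE, value = TRUE
)

# Join the feature names with + on the right-hand side, with year as the response.
year_formula <- reformulate(feature_colnames, response = "year")
print(year_formula)
```

In the exercise itself, the same reformulate() call applies unchanged; only the source of the column names differs, because there they come from the Spark tibble.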

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_data_to_model_tbl has been pre-defined
track_data_to_model_tbl

feature_colnames <- track_data_to_model_tbl %>%
  # Get the column names
  ___ %>%
  # Limit to the timbre columns
  ___(___(___))

feature_colnames

# Create the formula for the model
year_formula <- ___

gradient_boosted_trees_model <- track_data_to_model_tbl %>%
  # Run the gradient boosted trees model
  ___