Exercise

# Gradient boosted trees: modeling

Gradient boosting is a technique for improving the performance of other models. The idea is that you first fit a weak but easy-to-calculate model. Then you replace the response values with the residuals from that model and fit another model. By adding the predictions of the original model and the new residual model, you get a more accurate model. You can repeat this process over and over, fitting new models to predict the residuals of the previous models and adding the results in. With each iteration, the combined model fits the data more closely.
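The residual-fitting loop described above can be sketched in base R. This is a minimal illustration, not how `sparklyr` or Spark implements it: the weak model is a one-split regression stump, the data are simulated, and the helper names (`fit_stump()`, `predict_stump()`) and the `learning_rate` shrinkage factor are assumptions introduced for the example.

```r
# Fit a one-split regression stump to (x, r) by minimising squared error.
# Hypothetical helper for illustration only.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct x values.
  splits <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- vapply(splits, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  best <- splits[which.min(sse)]
  list(split = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

# Predict with a fitted stump: one constant on each side of the split.
predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

# Simulated regression data (an assumption, not the track data).
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.1)

learning_rate <- 0.5  # assumed shrinkage factor; 1 would add each stump in full
n_rounds <- 50

# Start from the simplest possible model: the mean of the response.
prediction <- rep(mean(y), length(y))
initial_mse <- mean((y - prediction)^2)
mse_by_round <- numeric(n_rounds)

for (i in seq_len(n_rounds)) {
  residuals <- y - prediction            # what the current model gets wrong
  stump <- fit_stump(x, residuals)       # weak model fitted to the residuals
  prediction <- prediction + learning_rate * predict_stump(stump, x)
  mse_by_round[i] <- mean((y - prediction)^2)
}
```

Plotting `mse_by_round` shows the training error falling as stumps are added. The training error says nothing about performance on new data, which is why predictions are usually evaluated on a held-out set.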

To give a more concrete example, `sparklyr` uses gradient boosted trees, which means gradient boosting with decision trees as the weak-but-easy-to-calculate model. These can be used for both classification problems (where the response variable is categorical) and regression problems (where the response variable is continuous). In the regression case, which you'll be using here, the measure of how badly a point was fitted is the residual.

Decision trees are covered in more depth in the *Supervised Learning in R: Classification* and *Supervised Learning in R: Regression* courses. The latter course also covers gradient boosting.

To run a gradient boosted trees model in `sparklyr`, call `ml_gradient_boosted_trees()`. Usage for this function was discussed in the first exercise of this chapter.

Instructions


A Spark connection has been created for you as `spark_conn`. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as `track_data_to_model_tbl`.

- Get the columns containing the string `"timbre"` to use as features.
  - Use `colnames()` to get the column names of `track_data_to_model_tbl`. *Note that* `names()` won't give you what you want.
  - Use `str_subset()` to filter the columns.
  - The `pattern` argument to that function should be `fixed("timbre")`.
  - Assign the result to `feature_colnames`.
- Run the gradient boosting model.
  - Call `ml_gradient_boosted_trees()`.
  - The output (response) column is `"year"`.
  - The input columns are `feature_colnames`.
  - Assign the result to `gradient_boosted_trees_model`.
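As a rough sketch of how these steps fit together, the code below first shows the column filter on a plain character vector, then wraps the Spark steps in a function so that nothing runs against Spark until it is called. The column names in `example_colnames` and the wrapper `fit_gbt()` are hypothetical, introduced only for illustration. The named `response` and `features` arguments are the legacy interface of `ml_gradient_boosted_trees()`; depending on your `sparklyr` version, a formula may be the preferred way to specify the model.

```r
library(stringr)

# Hypothetical column names standing in for colnames(track_data_to_model_tbl).
example_colnames <- c("artist_name", "year", "timbre_means1", "timbre_covariances1")

# fixed() matches "timbre" as a literal string rather than a regular expression.
example_features <- str_subset(example_colnames, fixed("timbre"))

# Hypothetical wrapper: takes a Spark tibble such as track_data_to_model_tbl.
fit_gbt <- function(track_tbl) {
  # Keep only the column names that contain "timbre".
  feature_colnames <- str_subset(colnames(track_tbl), fixed("timbre"))
  # Fit gradient boosted trees with "year" as the response.
  sparklyr::ml_gradient_boosted_trees(
    track_tbl,
    response = "year",
    features = feature_colnames
  )
}
```

With the exercise's Spark connection available, `gradient_boosted_trees_model <- fit_gbt(track_data_to_model_tbl)` would carry out both steps.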