Gradient boosted trees: modeling
Gradient boosting is a technique for improving the performance of other models. You start by fitting a weak but easy-to-calculate model. Then you replace the response values with the residuals from that model and fit another model to them. By adding the predictions of the original model and the residual model, you get a more accurate model. You can repeat this process over and over, fitting each new model to the residuals left by the models so far and adding its predictions in. With each iteration, the combined model fits the training data more closely.
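The loop described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not how Spark implements it: the data is simulated, the weak model is a hand-written one-split tree (a "stump"), and the names fit_stump, predict_stump, and learning_rate are made up for this sketch. The learning_rate shrinkage factor is a common addition that is not part of the description above.

```r
# Simulated regression data (an assumption for illustration only)
set.seed(42)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# Weak learner: a single split on x, predicting the mean on each side
fit_stump <- function(x, r) {
  candidates <- sort(unique(x))
  candidates <- candidates[-length(candidates)]  # drop the max so both sides are non-empty
  sse <- vapply(candidates, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  s <- candidates[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

learning_rate <- 0.1  # hypothetical shrinkage factor; 1 would add each model in full
n_rounds <- 100

prediction <- rep(mean(y), length(y))  # start from a constant model
mse_by_round <- numeric(n_rounds)

for (i in seq_len(n_rounds)) {
  current_residuals <- y - prediction            # what the models so far got wrong
  stump <- fit_stump(x, current_residuals)       # fit the weak model to the residuals
  prediction <- prediction + learning_rate * predict_stump(stump, x)  # add it in
  mse_by_round[i] <- mean((y - prediction)^2)
}

mse_start <- mean((y - mean(y))^2)  # error of the constant model
mse_end <- mse_by_round[n_rounds]   # error after boosting
```

Comparing mse_start with mse_end, or plotting mse_by_round, shows the training error shrinking as more weak models are added.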
To give a more concrete example, sparklyr uses gradient boosted trees, which means gradient boosting with decision trees as the weak-but-easy-to-calculate model. These can be used for both classification problems (where the response variable is categorical) and regression problems (where the response variable is continuous). In the regression case, as you'll be using here, the measure of how badly a point was fitted is the residual.
Decision trees are covered in more depth in the Supervised Learning in R: Classification and Supervised Learning in R: Regression courses. The latter course also covers gradient boosting.
To run a gradient boosted trees model in sparklyr, call ml_gradient_boosted_trees(). Usage for this function was discussed in the first exercise of this chapter.
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_to_model_tbl.
- Get the columns containing the string "timbre" to use as features.
  - Use colnames() to get the column names of track_data_to_model_tbl. Note that names() won't give you what you want.
  - Use str_subset() to filter the columns.
  - The pattern argument to that function should be fixed("timbre").
  - Assign the result to feature_colnames.
- Create the formula for the model using reformulate().
  - The termlabels argument (inputs of the formula) should be feature_colnames.
  - The response argument (output of the formula) should be "year".
  - Assign the result to year_formula.
  - Using reformulate() this way combines all the variables in feature_colnames with a + sign to form the right-hand side of the formula. This results in a formula year ~ timbre1 + timbre2 + ... + timbre12, which defines the relationship between the variables to be included in the model.
- Run the gradient boosting model.
  - Call ml_gradient_boosted_trees() with the year_formula you created as its only argument.
  - Assign the result to gradient_boosted_trees_model.
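To see what the first two steps produce without a Spark connection, here is a minimal sketch on a plain character vector. The vector example_colnames is a hypothetical stand-in for the result of colnames(track_data_to_model_tbl). Its non-timbre entries are invented for illustration, and it has only three timbre columns rather than twelve.

```r
library(stringr)

# Hypothetical column names standing in for colnames(track_data_to_model_tbl)
example_colnames <- c("title", "artist_name", "year", "timbre1", "timbre2", "timbre3")

# Keep only the names that contain the literal string "timbre"
example_features <- str_subset(example_colnames, fixed("timbre"))
example_features
# [1] "timbre1" "timbre2" "timbre3"

# Build a formula of the form response ~ feature1 + feature2 + ...
example_formula <- reformulate(termlabels = example_features, response = "year")
example_formula
# year ~ timbre1 + timbre2 + timbre3
```

Wrapping the pattern in fixed() makes str_subset() match the text literally instead of treating it as a regular expression. For a plain word like "timbre" both give the same matches, but the literal form states the intent more clearly.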
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_data_to_model_tbl has been pre-defined
track_data_to_model_tbl

feature_colnames <- track_data_to_model_tbl %>%
  # Get the column names
  ___ %>%
  # Limit to the timbre columns
  ___(___(___))

feature_colnames

# Create the formula for the model
year_formula <- ___

gradient_boosted_trees_model <- track_data_to_model_tbl %>%
  # Run the gradient boosted trees model
  ___