Find the right number of trees for a gradient boosting machine
In this exercise, you will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July.
The July data has been pre-loaded. Remember that bikesJuly.treat no longer has the outcome column, so you must get it from the untreated data: bikesJuly$cnt.
You will use the xgboost package to fit the gradient boosting model. The function xgb.cv() (docs) uses cross-validation to estimate the out-of-sample learning error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.
For this exercise, the key arguments to the xgb.cv() call are:
- data: a numeric matrix.
- label: vector of outcomes (also numeric).
- nrounds: the maximum number of rounds (trees to build).
- nfold: the number of folds for the cross-validation. 5 is a good number.
- objective: "reg:squarederror" for continuous outcomes.
- eta: the learning rate.
- max_depth: the maximum depth of trees.
- early_stopping_rounds: after this many rounds without improvement, stop.
- verbose: FALSE to stay silent.
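To see how these arguments fit together before tackling the exercise, here is a minimal sketch on synthetic data. The variables x and y and the parameter values here are made up for illustration; they are not the exercise data or settings.

library(xgboost)

# Toy data: 100 rows, 2 numeric inputs, a numeric outcome
x <- matrix(runif(200), ncol = 2)
y <- x[, 1] + 2 * x[, 2] + rnorm(100, sd = 0.1)

cv <- xgb.cv(data = x,
             label = y,
             nrounds = 50,                    # build at most 50 trees
             nfold = 5,                       # 5-fold cross-validation
             objective = "reg:squarederror",  # continuous outcome
             eta = 0.3,                       # learning rate
             max_depth = 3,                   # maximum tree depth
             early_stopping_rounds = 5,       # stop after 5 rounds without improvement
             verbose = FALSE)                 # stay silent

# cv$evaluation_log has one row per tree, with columns including
# train_rmse_mean and test_rmse_mean
head(cv$evaluation_log)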
This exercise is part of the course Supervised Learning in R: Regression.
Exercise instructions
- Fill in the blanks to run xgb.cv() on the treated training data; assign the output to the variable cv.
  - Use as.matrix() to convert the treated data frame (bikesJuly.treat) to a matrix.
  - Use 50 rounds, and 5-fold cross-validation.
  - Set early_stopping_rounds to 5.
  - Set eta to 0.75 and max_depth to 5.
- Get the data frame evaluation_log from cv and assign it to the variable elog. Each row of the evaluation_log corresponds to an additional tree, so the row number tells you the number of trees in the model.
- Fill in the blanks to get the number of trees with the minimum value of the columns train_rmse_mean and test_rmse_mean. which.min() (docs) returns the index of the minimum value in a vector; a short demonstration follows these instructions.
- How many trees do you need?
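For reference, a tiny demonstration of which.min() on a made-up vector (the values here are arbitrary):

# which.min() returns the position of the smallest value in a vector
rmse <- c(0.9, 0.4, 0.6)
which.min(rmse)   # returns 2: the second value is the smallest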
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Run xgb.cv
cv <- xgb.cv(data = ____,
             label = ___,
             nrounds = ___,
             nfold = ___,
             objective = "reg:squarederror",
             eta = ___,
             max_depth = ___,
             early_stopping_rounds = ___,
             verbose = FALSE    # silent
)

# Get the evaluation log
elog <- ___

# Determine and print how many trees minimize training and test error
elog %>%
  summarize(ntrees.train = ___,   # find the index of min(train_rmse_mean)
            ntrees.test = ___)    # find the index of min(test_rmse_mean)
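For comparison, here is one possible completed version, a sketch assuming bikesJuly.treat and bikesJuly are pre-loaded as described above and that the xgboost and dplyr packages are attached. The parameter values are the ones specified in the instructions.

library(xgboost)
library(dplyr)

# Cross-validate gradient boosting on the treated July data
cv <- xgb.cv(data = as.matrix(bikesJuly.treat),  # inputs as a numeric matrix
             label = bikesJuly$cnt,              # outcome from the untreated data
             nrounds = 50,                       # at most 50 trees
             nfold = 5,                          # 5-fold cross-validation
             objective = "reg:squarederror",
             eta = 0.75,
             max_depth = 5,
             early_stopping_rounds = 5,
             verbose = FALSE)

# Get the evaluation log
elog <- cv$evaluation_log

# Number of trees that minimize training and test RMSE
elog %>%
  summarize(ntrees.train = which.min(train_rmse_mean),
            ntrees.test = which.min(test_rmse_mean))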