
Find the right number of trees for a gradient boosting machine

In this exercise, you will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. You will train the model on data from the month of July.

The July data has been pre-loaded. Remember that bikesJuly.treat no longer has the outcome column, so you must get it from the untreated data: bikesJuly$cnt.

You will use the xgboost package to fit the gradient boosting model. The function xgb.cv() (docs) uses cross-validation to estimate the out-of-sample error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.

For this exercise, the key arguments to the xgb.cv() call are:

  • data: a numeric matrix.
  • label: vector of outcomes (also numeric).
  • nrounds: the maximum number of rounds (trees to build).
  • nfold: the number of folds for the cross-validation. 5 is a good number.
  • objective: "reg:squarederror" for continuous outcomes.
  • eta: the learning rate.
  • max_depth: maximum depth of trees.
  • early_stopping_rounds: after this many rounds without improvement, stop.
  • verbose: FALSE to stay silent.
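Since the data argument must be a numeric matrix rather than a data frame, the as.matrix() conversion is worth seeing in isolation. A minimal sketch with invented column names (temp and hum are placeholders, not the course's variables):

```r
# A data frame whose columns are all numeric converts cleanly to
# the numeric matrix that xgb.cv() expects in its data argument.
df <- data.frame(temp = c(0.5, 0.7, 0.9),
                 hum  = c(0.8, 0.6, 0.4))
m <- as.matrix(df)

is.matrix(m)    # TRUE
is.numeric(m)   # TRUE: safe to pass as data = m
```

If the frame held any character or factor columns, as.matrix() would instead produce a character matrix, which is why the treated (all-numeric) version of the data is the one to convert.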

This exercise is part of the course

Supervised Learning in R: Regression


Exercise instructions

  • Fill in the blanks to run xgb.cv() on the treated training data; assign the output to the variable cv.
    • Use as.matrix() to convert the treated data frame to a matrix.
    • Use 50 rounds, and 5-fold cross-validation.
    • Set early_stopping_rounds to 5.
    • Set eta to 0.75, max_depth to 5.
  • Get the data frame evaluation_log from cv and assign it to the variable elog. Each row of the evaluation_log corresponds to an additional tree, so the row number tells you the number of trees in the model.
  • Fill in the blanks to get the number of trees with the minimum value of the columns train_rmse_mean and test_rmse_mean.
    • which.min() (docs) returns the index of the minimum value in a vector.
    • How many trees do you need?
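As a quick illustration of that last step, here is which.min() on a toy RMSE vector (the numbers are invented):

```r
# which.min() returns the index of the (first) minimum of a vector.
# If each entry is the holdout RMSE after adding one more tree,
# that index is the number of trees that minimizes the error.
rmse <- c(90.1, 85.3, 82.7, 83.0, 84.2)
which.min(rmse)    # 3 -> a 3-tree model has the lowest RMSE here
```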

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Run xgb.cv
cv <- xgb.cv(data = ____,
             label = ___,
             nrounds = ___,
             nfold = ___,
             objective = "reg:squarederror",
             eta = ___,
             max_depth = ___,
             early_stopping_rounds = ___,
             verbose = FALSE   # silent
)

# Get the evaluation log 
elog <- ___

# Determine and print how many trees minimize training and test error
elog %>% 
   summarize(ntrees.train = ___,   # find the index of min(train_rmse_mean)
             ntrees.test  = ___)   # find the index of min(test_rmse_mean)
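For comparison, here is one way the completed call can look. The pre-loaded bikesJuly.treat and bikesJuly objects only exist in the exercise environment, so this sketch substitutes a small synthetic data set; the hyperparameter values follow the instructions above.

```r
library(xgboost)

set.seed(1)
# Synthetic stand-ins for the pre-loaded course data (hypothetical)
treated <- data.frame(hr = runif(200), temp = runif(200))
cnt     <- 100 * treated$temp + rnorm(200)

# Run xgb.cv with the settings from the instructions
cv <- xgb.cv(data = as.matrix(treated),
             label = cnt,
             nrounds = 50,
             nfold = 5,
             objective = "reg:squarederror",
             eta = 0.75,
             max_depth = 5,
             early_stopping_rounds = 5,
             verbose = FALSE)

# Get the evaluation log
elog <- cv$evaluation_log

# The row index of the minimum mean RMSE is the best tree count
ntrees.train <- which.min(elog$train_rmse_mean)
ntrees.test  <- which.min(elog$test_rmse_mean)
```

On the real July data the two counts will typically differ; the holdout (test) value is the one to use for the final model, since training RMSE keeps improving as trees are added.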