Tuning colsample_bytree
Now it's time to tune "colsample_bytree". You've already seen this idea if you've ever worked with scikit-learn's RandomForestClassifier or RandomForestRegressor, where it is called max_features. In both xgboost and sklearn, this parameter (although named differently) limits the fraction of features a given tree can use; note that xgboost samples the columns once per tree, whereas sklearn's max_features draws a fresh subset at every split. In xgboost, colsample_bytree must be specified as a float greater than 0 and at most 1.
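As a quick side-by-side, here is how the two parameters line up (a minimal sketch; the 0.5 value is arbitrary and purely illustrative):

from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

# sklearn: consider half of the features at each split
rf = RandomForestRegressor(max_features=0.5)

# xgboost: sample half of the columns for each tree
params = {"objective": "reg:squarederror", "colsample_bytree": 0.5}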
This exercise is part of the course Extreme Gradient Boosting with XGBoost.
Exercise instructions
- Create a list called colsample_bytree_vals to store the values 0.1, 0.5, 0.8, and 1.
- Systematically vary "colsample_bytree" and perform cross-validation, exactly as you did with max_depth and eta previously.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary
params = {"objective": "reg:squarederror", "max_depth": 3}

# Create list of hyperparameter values: colsample_bytree_vals
____ = ____
best_rmse = []

# Systematically vary the hyperparameter value
for curr_val in ____:
    ____ = ____

    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)),
                   columns=["colsample_bytree", "best_rmse"]))
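For reference, here is one way to fill in the blanks, following the instructions above (a sketch that assumes the exercise environment has already imported xgboost as xgb and pandas as pd, and loaded the housing features X and target y, as earlier exercises in the course do):

# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary
params = {"objective": "reg:squarederror", "max_depth": 3}

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []

# Systematically vary the hyperparameter value
for curr_val in colsample_bytree_vals:
    params["colsample_bytree"] = curr_val

    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)),
                   columns=["colsample_bytree", "best_rmse"]))

The printed table lets you compare the final cross-validated RMSE for each candidate value and pick the one that minimizes it.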