Explore overfitting XGBoost
Having trained 3 XGBoost models with different maximum depths, you will now evaluate their quality. For this purpose, you will measure the quality of each model on both the train data and the test data. As you know by now, the train data is the data the models have been trained on. The test data is the next month's sales data that the models have never seen before.
The goal of this exercise is to determine whether any of the models trained is overfitting. To measure the quality of the models you will use Mean Squared Error (MSE). It is available in sklearn.metrics as the mean_squared_error() function, which takes two arguments: the true values and the predicted values.
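As a quick illustration of the metric itself, here is a minimal, self-contained example of calling mean_squared_error() on hand-made values (the toy numbers are ours, not from the exercise data):

```python
from sklearn.metrics import mean_squared_error

# Toy values: squared errors are 0.5**2, 0.0**2 and 1.5**2
y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 4.0]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # (0.25 + 0.0 + 2.25) / 3 = 0.8333...
```

The lower the MSE, the closer the predictions are to the true values on average.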
The train and test DataFrames, together with the 3 trained models (xg_depth_2, xg_depth_8, xg_depth_15), are available in your workspace.
This exercise is part of the course Winning a Kaggle Competition in Python.
Exercise instructions
- Make predictions for each model on both the train and test data.
- Calculate the MSE between the true values and your predictions for both the train and test data.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from sklearn.metrics import mean_squared_error
import xgboost as xgb

# Wrap the features in DMatrix, XGBoost's internal data structure
dtrain = xgb.DMatrix(data=train[['store', 'item']])
dtest = xgb.DMatrix(data=test[['store', 'item']])

# For each of the 3 trained models
for model in [xg_depth_2, xg_depth_8, xg_depth_15]:
    # Make predictions
    train_pred = model.predict(dtrain)
    test_pred = model.predict(dtest)

    # Calculate metrics
    mse_train = mean_squared_error(train['sales'], train_pred)
    mse_test = mean_squared_error(test['sales'], test_pred)
    print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))
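The overfitting pattern you are looking for can be reproduced end to end on synthetic data. The sketch below uses sklearn's DecisionTreeRegressor as a stand-in for XGBoost (an assumption made so the snippet runs without the xgboost package and without the workspace variables); the depth effect it shows is the same idea: train error keeps dropping as depth grows, while the train/test gap widens.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data: a smooth signal plus noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Simple holdout split: first 150 rows for training, last 50 for testing
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

results = {}
for depth in [2, 8, 15]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    results[depth] = (mse_train, mse_test)
    print('Depth {:>2}: MSE Train: {:.3f}. MSE Test: {:.3f}'
          .format(depth, mse_train, mse_test))
```

The shallow tree (depth 2) underfits, so its train and test MSE are both high but close together. The deep tree (depth 15) nearly memorizes the training data: its train MSE collapses toward zero while its test MSE stays up near the noise level, which is exactly the diverging-gap signature of overfitting you are asked to spot in the exercise.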