
Explore overfitting XGBoost

Having trained 3 XGBoost models with different maximum depths, you will now evaluate their quality. For this purpose, you will measure the quality of each model on both the train data and the test data. As you know by now, the train data is the data the models have been trained on. The test data is the next month's sales data, which the models have never seen before.

The goal of this exercise is to determine whether any of the trained models is overfitting. To measure the quality of the models you will use Mean Squared Error (MSE). It's available in sklearn.metrics as the mean_squared_error() function, which takes two arguments: the true values and the predicted values.
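
For reference, here is a minimal sketch of how mean_squared_error() is called; the numbers are made up purely for illustration.

from sklearn.metrics import mean_squared_error

# Toy true values and predictions, for illustration only
y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 4.0]

# MSE is the mean of the squared differences:
# ((3.0 - 2.5)**2 + (5.0 - 5.0)**2 + (2.5 - 4.0)**2) / 3 = 0.8333...
print(mean_squared_error(y_true, y_pred))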

The train and test DataFrames, together with the 3 trained models (xg_depth_2, xg_depth_8, xg_depth_15), are available in your workspace.

This exercise is part of the course Winning a Kaggle Competition in Python.

Exercise instructions

  • Make predictions for each model on both the train and test data.
  • Calculate the MSE between the true values and your predictions for both the train and test data.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Wrap the feature columns in DMatrix objects for prediction
dtrain = xgb.DMatrix(data=train[['store', 'item']])
dtest = xgb.DMatrix(data=test[['store', 'item']])

# For each of the 3 trained models
for model in [xg_depth_2, xg_depth_8, xg_depth_15]:
    # Make predictions
    train_pred = model.____(dtrain)     
    test_pred = model.____(dtest)          
    
    # Calculate metrics
    mse_train = ____(train['sales'], train_pred)                  
    mse_test = ____(test['sales'], test_pred)
    print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))
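
If you want to reproduce the exercise outside the course workspace, the sketch below shows what the completed loop could look like. The train and test DataFrames and the three trained models are not available there, so they are rebuilt from synthetic sales data with assumed training settings (a squared-error objective and 50 boosting rounds); only the maximum depths of 2, 8 and 15 match the exercise.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the course's store-item sales data
rng = np.random.default_rng(0)

def make_sales(n):
    df = pd.DataFrame({'store': rng.integers(1, 11, size=n),
                       'item': rng.integers(1, 51, size=n)})
    df['sales'] = 3 * df['store'] + df['item'] + rng.normal(0, 5, size=n)
    return df

train, test = make_sales(1000), make_sales(1000)

dtrain = xgb.DMatrix(data=train[['store', 'item']], label=train['sales'])
dtest = xgb.DMatrix(data=test[['store', 'item']])

# Train one model per maximum depth (assumed parameters)
models = []
for depth in [2, 8, 15]:
    params = {'objective': 'reg:squarederror', 'max_depth': depth}
    models.append(xgb.train(params=params, dtrain=dtrain, num_boost_round=50))
xg_depth_2, xg_depth_8, xg_depth_15 = models

# For each of the 3 trained models
for model in [xg_depth_2, xg_depth_8, xg_depth_15]:
    # Make predictions on both splits
    train_pred = model.predict(dtrain)
    test_pred = model.predict(dtest)

    # Calculate metrics
    mse_train = mean_squared_error(train['sales'], train_pred)
    mse_test = mean_squared_error(test['sales'], test_pred)
    print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))

A small train MSE paired with a much larger test MSE for the deeper models is the overfitting signal this exercise asks you to look for.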