Get startedGet started for free

Error due to under/over-fitting

The candy dataset is prime for overfitting. With only 85 observations, if you use 20% for the testing dataset, you are losing a lot of vital data that could be used for modeling. Imagine the scenario where most of the chocolate candies ended up in the training data and very few in the holdout sample. Our model might only see that chocolate is a vital factor, but fail to find that other attributes are also important. In this exercise, you'll explore how using too many features (columns) in a random forest model can lead to overfitting.

A feature represents which columns of the data are used in a decision tree. The parameter max_features limits the number of features available.

This exercise is part of the course

Model Validation in Python

View Course

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Update the rfr model
rfr = RandomForestRegressor(____=25,
                            ____=1111,
                            ____=2)
rfr.fit(X_train, y_train)

# Print the training and testing accuracies 
print('The training error is {0:.2f}'.format(
  mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
  mae(y_test, rfr.predict(X_test))))
Edit and Run Code