Error due to under/over-fitting
The candy dataset is prone to overfitting. With only 85 observations, holding out 20% for the testing dataset removes a lot of vital data that could otherwise be used for modeling. Imagine a scenario where most of the chocolate candies end up in the training data and very few in the holdout sample. The model might learn that chocolate is a vital factor but fail to find that other attributes are also important. In this exercise, you'll explore how using too many features (columns) in a random forest model can lead to overfitting.
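To make the cost of the holdout concrete, here is a minimal sketch using placeholder arrays of the same size as the candy data (the feature count of 11 is an assumption used only for illustration). With test_size=0.2, just 17 of the 85 rows remain for evaluating the model:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the 85-row candy data (column count is hypothetical)
X = np.zeros((85, 11))
y = np.zeros(85)

# An 80/20 split leaves only 17 candies in the holdout sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)
print(len(X_train), len(X_test))  # 68 17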
A feature is a column of the data that a decision tree can split on. The parameter max_features
limits the number of features the model may consider at each split.
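To illustrate what the parameter controls before you fill in the exercise, here is a minimal, self-contained sketch on synthetic data of the same size (the data, feature count, and hyperparameter values are illustrative assumptions, not the course's setup). It compares a forest allowed to consider every feature at each split with one limited to two, and prints the training/testing gap that typically signals overfitting:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the candy data: 85 rows, 11 features, mostly noise
rng = np.random.RandomState(1111)
X = rng.normal(size=(85, 11))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=85)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)

# None lets every split consider all 11 features; 2 restricts each split
for mf in (None, 2):
    rfr = RandomForestRegressor(n_estimators=25, random_state=1111,
                                max_features=mf)
    rfr.fit(X_train, y_train)
    print('max_features={}: train MAE {:.2f}, test MAE {:.2f}'.format(
        mf,
        mean_absolute_error(y_train, rfr.predict(X_train)),
        mean_absolute_error(y_test, rfr.predict(X_test))))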
This exercise is part of the course Model Validation in Python.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Update the rfr model
rfr = RandomForestRegressor(____=25,
                            ____=1111,
                            ____=2)
rfr.fit(X_train, y_train)
# Print the training and testing errors
print('The training error is {0:.2f}'.format(
    mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
    mae(y_test, rfr.predict(X_test))))
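For reference, one plausible completion of the blanks, assuming the hints map to 25 trees, a random seed of 1111, and the max_features limit of 2 described above, and that mae is scikit-learn's mean_absolute_error with X_train, X_test, y_train, and y_test already defined by the exercise environment:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

# Assumed completion: the parameter names below are inferred from the hints, not given
rfr = RandomForestRegressor(n_estimators=25,
                            random_state=1111,
                            max_features=2)
rfr.fit(X_train, y_train)

# Print the training and testing errors
print('The training error is {0:.2f}'.format(
    mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
    mae(y_test, rfr.predict(X_test))))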