Error due to under/over-fitting
The candy dataset is a prime candidate for overfitting. With only 85 observations, setting aside 20% for the testing dataset removes a lot of vital data that could be used for modeling. Imagine a scenario where most of the chocolate candies end up in the training data and very few in the holdout sample: the model might learn that chocolate is a vital factor, but fail to find that other attributes are also important. In this exercise, you'll explore how using too many features (columns) in a random forest model can lead to overfitting.
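To see why such a small holdout is risky, here is a minimal sketch of that scenario. It uses synthetic data standing in for the candy dataset (the 85 rows match the real data; the 44% chocolate share is an assumption for illustration) and shows how much the chocolate share of a 20% holdout drifts between random splits:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1111)
# Hypothetical stand-in: 85 candies, roughly 44% of them chocolate
chocolate = rng.binomial(1, 0.44, size=85)
X = np.arange(85).reshape(-1, 1)  # placeholder feature matrix

for seed in range(5):
    # Keep only the holdout labels and check their chocolate share
    _, _, _, choc_test = train_test_split(
        X, chocolate, test_size=0.2, random_state=seed)
    print('split {}: {:.0%} chocolate in the holdout'.format(
        seed, choc_test.mean()))

Whenever a split leaves the holdout unusually heavy or light on chocolate, training and testing errors stop telling the same story.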
A feature is a column of the data that a decision tree can split on. The parameter max_features
limits the number of features available when looking for the best split.
This exercise is part of the course
Model Validation in Python
Hands-on interactive exercise
Try this exercise by completing the sample code.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

# Update the rfr model (completed: 25 trees, a fixed seed, and only
# two features available per split)
rfr = RandomForestRegressor(n_estimators=25,
                            random_state=1111,
                            max_features=2)
# X_train, y_train, X_test, y_test come from the exercise environment
rfr.fit(X_train, y_train)

# Print the training and testing errors
print('The training error is {0:.2f}'.format(
    mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
    mae(y_test, rfr.predict(X_test))))
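To connect this back to under/over-fitting, the sketch below sweeps max_features and prints the train/test gap. It uses make_regression as a synthetic stand-in for the candy features (the 85 rows and 11 predictors mirror the real dataset's shape, which is an assumption here), so the exact numbers will differ from the exercise:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the candy data: 85 rows, 11 candidate features
X, y = make_regression(n_samples=85, n_features=11, noise=10.0,
                       random_state=1111)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=1111)

for mf in [2, 4, 8, 11]:
    model = RandomForestRegressor(n_estimators=25, random_state=1111,
                                  max_features=mf)
    model.fit(X_tr, y_tr)
    print('max_features={:>2}: train MAE {:.2f}, test MAE {:.2f}'.format(
        mf, mae(y_tr, model.predict(X_tr)),
        mae(y_te, model.predict(X_te))))

A training error that keeps shrinking while the testing error stalls or grows as max_features increases is the overfitting signal this exercise is after.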