Error due to under/over-fitting
The candy dataset is prone to overfitting. With only 85 observations, holding out 20% for the testing dataset removes a lot of vital data that could otherwise be used for modeling. Imagine a scenario where most of the chocolate candies end up in the training data and very few in the holdout sample. The model might learn that chocolate is a vital factor but fail to find that other attributes are also important. In this exercise, you'll explore how using too many features (columns) in a random forest model can lead to overfitting.
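To make the cost of the holdout concrete, here is a minimal sketch using placeholder arrays of the same size as the candy data (the feature count of 11 is an assumption used only for illustration). With test_size=0.2, just 17 of the 85 rows remain for evaluating the model:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the 85-row candy data (column count is hypothetical)
X = np.zeros((85, 11))
y = np.zeros(85)

# An 80/20 split leaves only 17 candies in the holdout sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)
print(len(X_train), len(X_test))  # 68 17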
A feature is a column of the data that a decision tree can split on. The parameter max_features
limits the number of features the model may consider at each split.
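To illustrate what the parameter controls before you fill in the exercise, here is a minimal, self-contained sketch on synthetic data of the same size (the data, feature count, and hyperparameter values are illustrative assumptions, not the course's setup). It compares a forest allowed to consider every feature at each split with one limited to two, and prints the training/testing gap that typically signals overfitting:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the candy data: 85 rows, 11 features, mostly noise
rng = np.random.RandomState(1111)
X = rng.normal(size=(85, 11))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=85)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)

# None lets every split consider all 11 features; 2 restricts each split
for mf in (None, 2):
    rfr = RandomForestRegressor(n_estimators=25, random_state=1111,
                                max_features=mf)
    rfr.fit(X_train, y_train)
    print('max_features={}: train MAE {:.2f}, test MAE {:.2f}'.format(
        mf,
        mean_absolute_error(y_train, rfr.predict(X_train)),
        mean_absolute_error(y_test, rfr.predict(X_test))))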
This exercise is part of the course Model Validation in Python.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Update the rfr model
rfr = RandomForestRegressor(____=25,
                            ____=1111,
                            ____=2)
rfr.fit(X_train, y_train)
# Print the training and testing errors
print('The training error is {0:.2f}'.format(
    mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
    mae(y_test, rfr.predict(X_test))))
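For reference, one plausible completion of the blanks, assuming the hints map to 25 trees, a random seed of 1111, and the max_features limit of 2 described above, and that mae is scikit-learn's mean_absolute_error with X_train, X_test, y_train, and y_test already defined by the exercise environment:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

# Assumed completion: the parameter names below are inferred from the hints, not given
rfr = RandomForestRegressor(n_estimators=25,
                            random_state=1111,
                            max_features=2)
rfr.fit(X_train, y_train)

# Print the training and testing errors
print('The training error is {0:.2f}'.format(
    mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
    mae(y_test, rfr.predict(X_test))))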