Tune random forest hyperparameters

As with all models, we want to optimize performance by tuning hyperparameters. Random forests have many hyperparameters, but the most important is often the number of features sampled at each split: max_features in RandomForestRegressor from the sklearn library. Because random forests have randomness built in, we also set random_state so that our results are reproducible.
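To make the reproducibility point concrete, here is a minimal sketch. The data is synthetic and the names X, y, rfr_a, and rfr_b are made up for illustration; they are not the course's variables. It fits two forests with the same max_features and random_state and checks that their predictions agree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration; not the course data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Two forests with identical settings, including random_state
rfr_a = RandomForestRegressor(n_estimators=20, max_features=4, random_state=42)
rfr_b = RandomForestRegressor(n_estimators=20, max_features=4, random_state=42)

preds_a = rfr_a.fit(X, y).predict(X)
preds_b = rfr_b.fit(X, y).predict(X)

# With the same random_state, both forests should give the same predictions
same_predictions = bool(np.allclose(preds_a, preds_b))
print(same_predictions)
```

Leaving random_state unset would let each fit draw different bootstrap samples and feature subsets, so scores could change slightly from run to run.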

Usually, we could use sklearn's GridSearchCV class to search hyperparameters, but with a financial time series we don't want standard cross-validation: k-fold splits train on folds that come later in time than the validation fold, which mixes past and future data. Instead, we want to fit our models on the oldest data and evaluate on the newest data. So we'll use sklearn's ParameterGrid to create the combinations of hyperparameters to search.
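As a rough illustration of both ideas, the sketch below expands a small grid with ParameterGrid and then splits a toy array chronologically. The grid values, the 85/15 split ratio, and the names example_grid, features, train_part, and test_part are assumptions chosen for this example only.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, used only for illustration
example_grid = {'n_estimators': [100, 200], 'max_features': [4, 8]}

# ParameterGrid yields one dictionary per combination of values
combos = list(ParameterGrid(example_grid))
for params in combos:
    print(params)

# Toy feature matrix; rows are assumed to be sorted oldest to newest
features = np.arange(100).reshape(50, 2)

# Chronological split: oldest rows for training, newest rows for testing
# (the 85% ratio is an assumption, not taken from the exercise)
train_size = int(0.85 * len(features))
train_part = features[:train_size]
test_part = features[train_size:]
print(len(train_part), len(test_part))
```

Because the split is a simple slice, every training row precedes every test row in time, which is the property that shuffled or k-fold splits do not guarantee.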

This exercise is part of the course

Machine Learning for Finance in Python

Exercise instructions

  • Set the n_estimators hyperparameter to be a list with one value (200) in the grid dictionary.
  • Set the max_features hyperparameter to be a list containing 4 and 8 in the grid dictionary.
  • Fit the random forest regressor model (rfr, already created for you) to the train_features and train_targets with each combination of hyperparameters, g, in the loop.
  • Calculate \(R^2\) by using rfr.score() on test_features and append the result to the test_scores list.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

import numpy as np  # needed for np.argmax below
from sklearn.model_selection import ParameterGrid

# Create a dictionary of hyperparameters to search
grid = {____, 'max_depth': [3], 'max_features': ____, 'random_state': [42]}
test_scores = []

# Loop through the parameter grid, set the hyperparameters, and save the scores
for g in ParameterGrid(grid):
    rfr.set_params(**g)  # ** is "unpacking" the dictionary
    rfr.fit(____, ____)
    test_scores.append(rfr.score(____, ____))

# Find best hyperparameters from the test score and print
best_idx = np.argmax(test_scores)
print(test_scores[best_idx], ParameterGrid(grid)[best_idx])
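For reference, here is one way the loop above might look when run end to end. This is a sketch under stated assumptions, not the course's solution: the data is synthetic, the 85/15 chronological split is an arbitrary choice, and test_targets is an assumed name for the held-out targets. The grid values follow the exercise instructions (n_estimators of 200, max_features of 4 and 8).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in data (an assumption; not the course's stock features)
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 8))
targets = features[:, 0] - 0.5 * features[:, 1] + rng.normal(scale=0.1, size=300)

# Chronological split: rows are assumed to be ordered oldest to newest
train_size = int(0.85 * len(features))
train_features, test_features = features[:train_size], features[train_size:]
train_targets, test_targets = targets[:train_size], targets[train_size:]

# In the exercise, rfr is already created; here we create it ourselves
rfr = RandomForestRegressor()

# Grid values taken from the exercise instructions
grid = {'n_estimators': [200], 'max_depth': [3],
        'max_features': [4, 8], 'random_state': [42]}
param_list = list(ParameterGrid(grid))
test_scores = []

# Fit on the older data and score (R^2) on the newer data for each combination
for g in param_list:
    rfr.set_params(**g)
    rfr.fit(train_features, train_targets)
    test_scores.append(rfr.score(test_features, test_targets))

# Report the best test score and the hyperparameters that produced it
best_idx = int(np.argmax(test_scores))
print(test_scores[best_idx], param_list[best_idx])
```

Note that choosing hyperparameters by test score, as this exercise does, means the reported score is somewhat optimistic; a separate validation period would give a cleaner estimate.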