1. Explore a wider model space
One of the big differences between a random forest and the linear regression models we've been exploring up to now is that random forests require "tuning".
2. Random forests require tuning
In other words, random forests have "hyperparameters" that control how the model is fit. Unlike the "parameters" of a model (for example, the split points in random forests or the coefficients in linear regression), hyperparameters must be selected by hand before fitting the model. The most important of these hyperparameters is "mtry": the number of randomly selected variables used at each split point in the individual decision trees that make up the random forest. This number is tunable: you could look at as few as 2 or as many as 100 variables per split. Forests that use 2 variables tend to be more random, while forests that use 100 variables tend to be less random. Unfortunately, it's hard to know the best value of these hyperparameters without trying them out on your training data. For some datasets, 2-variable random forests are best, while on others, 100-variable random forests are best.
3. Example: sonar data
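As a sketch of the idea (not from the course), here are two forests fit with the ranger package that differ only in mtry. The dataset is a placeholder: iris has four predictors, so mtry = 4 means every variable is considered at every split.

```r
library(ranger)

# iris is a built-in dataset, used here purely for illustration
fit_random <- ranger(Species ~ ., data = iris, mtry = 2)  # more random splits
fit_greedy <- ranger(Species ~ ., data = iris, mtry = 4)  # less random splits

# Out-of-bag prediction error for each choice of mtry
fit_random$prediction.error
fit_greedy$prediction.error
```

Comparing the out-of-bag errors gives a rough sense of which mtry suits this particular dataset.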
Once again, caret saves us a lot of boring manual work and automates this process of hyperparameter selection. Not only does caret do cross-validation to tell us our model's out-of-sample error, it also automates a process called "grid search" for selecting hyperparameters based on out-of-sample error. To start, we can play with the tuneLength argument to the train function, which tells train to explore more models along its default tuning grid. First, we load the Sonar dataset from the mlbench package, and then we fit a random forest with a very fine tuning grid by specifying tuneLength = 10. This takes longer than the default model, which uses a tuneLength of 3: we get a potentially more accurate model, but at the expense of waiting much longer for it to run. Also note that we're using the method = "ranger" argument to the train function. This uses the ranger package in R to fit the random forest, which is much faster than the more widely known randomForest package. I highly recommend ranger for any random forest modeling: it's a lot faster and yields very similar results. After the model is fit, we can then plot the results
4. Plot the results
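The workflow described above might look like the following sketch. The cross-validation settings in trainControl are assumptions, not prescribed by the course.

```r
library(caret)
library(mlbench)

data(Sonar)

set.seed(42)
model <- train(
  Class ~ .,
  data = Sonar,
  method = "ranger",       # fast random forest backend
  tuneLength = 10,         # explore 10 points on the default tuning grid
  trControl = trainControl(method = "cv", number = 5)
)

# Plot out-of-sample accuracy against the candidate values of mtry
plot(model)
```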
and visually inspect the model's accuracy for different values of mtry. In this case, it looks like mtry = 14 yields the highest out-of-sample accuracy.
5. Let's practice!
Let's explore the tuneLength argument on some other models.