1. Building and tuning a random forest model
Let's briefly review what you've done so far to evaluate the cross validation performance of the regression model.
2. Cross Validation Performance
Using cross validation, you split the training data into multiple train-validate pairs.
3. Cross Validation Performance
The train section for each of these cross validation folds was used to build a corresponding model.
4. Cross Validation Performance
Each model was then used alongside its held-out validate set.
5. Cross Validation Performance
To calculate the mean absolute error for each cross validation fold.
6. Linear Regression Model
Once you've taken the average MAE across the cross validation folds, you've measured the performance of the model on held-out data.
For your linear regression model, the mean absolute error is 1.5 years, meaning that you can expect the model predictions will be off, on average, by 1.5 years.
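As a quick reminder of how that workflow comes together, here is a minimal sketch, assuming a cross validation data frame cv_data with illustrative list columns validate_actual and validate_predicted built during those steps:

```r
library(dplyr)
library(purrr)
library(Metrics)

cv_data %>%
  # Score each fold with its mean absolute error
  mutate(validate_mae = map2_dbl(validate_actual, validate_predicted,
                                 ~ mae(actual = .x, predicted = .y))) %>%
  # Average the fold-level MAEs to measure performance on held-out data
  summarise(mean_mae = mean(validate_mae))
```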
Is this the best model that we can build?
7. Another Model
You can determine this by repeating these steps with a different model. Because the same data are used across models, you can directly compare their validation performance and select the best-performing model.
You can use this machine learning workflow to compare virtually any model. So let's try it out with a random forest model to see if it achieves better performance.
8. Random Forest Benefits
The random forest is a very popular model in the machine learning community. The details of how this algorithm works are outside the scope of this course, but they can be found in other great DataCamp courses on machine learning.
In Chapter 2, we learned that there might be a non-linear relationship between the gapminder features and life expectancy. We also saw that the country feature interacts with the other features.
Random forest models natively handle both non-linear relationships and feature interactions, so we can be optimistic about trying this model.
9. Basic Random Forest Tools
You will use the random forest implementation from the ranger package.
To build a random forest model with default hyperparameters, you use the following syntax.
You need to provide the formula and data, just like for the linear regression model. Because a random forest has a random element, I recommend using the seed argument to ensure that your results are reproducible.
The syntax for generating predictions for new data is also similar to that of a linear model. The only difference is that you need to use the dollar sign to explicitly extract the prediction vector from the ranger prediction object.
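Here is a minimal sketch of that syntax, assuming a train data frame with a life_expectancy outcome column and a validate data frame (these names are illustrative):

```r
library(ranger)

# Build a random forest with default hyperparameters;
# the seed argument keeps the random element reproducible.
rf_model <- ranger(formula = life_expectancy ~ .,
                   data = train,
                   seed = 42)

# Predict on new data and explicitly extract the prediction
# vector from the ranger prediction object with the dollar sign.
validate_predicted <- predict(rf_model, validate)$predictions
```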
10. Build Basic Random Forest Models
You can apply this as before by mapping over the train data to build a model for each fold.
Then use map2() to generate the predictions for each fold.
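A sketch of that fold-wise workflow, assuming the cross validation data frame cv_data with train and validate list columns built earlier (the formula and column names are illustrative):

```r
library(dplyr)
library(purrr)
library(ranger)

cv_models_rf <- cv_data %>%
  # Build one ranger model per fold from its train split
  mutate(model = map(train,
                     ~ ranger(formula = life_expectancy ~ .,
                              data = .x, seed = 42))) %>%
  # Use map2() to predict on the matching validate split of each fold
  mutate(validate_predicted = map2(model, validate,
                                   ~ predict(.x, .y)$predictions))
```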
11. ranger Hyper-Parameters
You can further improve a model by fine-tuning its hyperparameters. ranger has two main hyperparameters that can be tuned: mtry and num.trees.
We will focus on tuning the mtry parameter, which can range from one to the total number of features available.
12. Tune The Hyper-Parameters
To tune the hyperparameters in a tidyverse fashion, you can leverage the crossing() function to expand the cross validation data frame for each value of the hyperparameter you're interested in trying.
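For instance, a small sketch of this expansion (the grid of mtry values is just an illustrative choice):

```r
library(tidyr)

# Pair every cross validation fold with every candidate mtry value
cv_tune <- cv_data %>%
  crossing(mtry = 1:5)
```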
13. Tune The Hyper-Parameters
Then you can use map2() to iterate over the folds and the mtry values, building a new ranger model for each fold-mtry combination. You can then proceed as usual to calculate the mean absolute error for each combination to determine which parameterization has the best validation performance.
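Putting those steps together, a hedged sketch, again assuming the illustrative column names and the Metrics::mae() helper used in the earlier sketches:

```r
library(dplyr)
library(purrr)
library(ranger)
library(Metrics)

cv_models_tune <- cv_tune %>%
  # Build a ranger model for every fold-mtry combination
  mutate(model = map2(train, mtry,
                      ~ ranger(formula = life_expectancy ~ .,
                               data = .x, mtry = .y, seed = 42))) %>%
  # Score each combination on its held-out validate split
  mutate(validate_actual = map(validate, ~ .x$life_expectancy),
         validate_predicted = map2(model, validate,
                                   ~ predict(.x, .y)$predictions),
         validate_mae = map2_dbl(validate_actual, validate_predicted,
                                 ~ mae(actual = .x, predicted = .y)))

# Average the MAE across folds for each mtry value to find the best one
cv_models_tune %>%
  group_by(mtry) %>%
  summarise(mean_mae = mean(validate_mae))
```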
14. Let's practice!
Now let's use what you've learned so far to see if the random forest model provides better validation performance than the linear regression model.