1. Random forests and wine
Now that we've explored simple linear models for classification and regression, let's move on to something more interesting.
2. Random forests
Random forests are a very popular type of machine learning model. They are particularly useful for beginners because they are quite robust against overfitting.
Random forests typically yield accurate, non-linear models with no extra work on the part of the data scientist, which makes them useful on many real-world problems.
3. Random forests
The drawback to random forests is that, unlike linear models, they have "hyperparameters" to tune. Unlike regular parameters, such as the slope or intercept in a linear model, hyperparameters cannot be estimated directly from the training data: they must be specified manually by the data scientist as inputs to the predictive model.
These hyperparameters can have a large impact on how the model fits the data, and their optimal values vary from dataset to dataset. In practice, the default hyperparameter values for random forests are often fine, but occasionally they aren't and will need adjustment.
Fortunately, we have the caret package to help us.
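As a concrete illustration, here is a sketch of a hyperparameter grid for caret's "ranger" method. The parameter names mtry, splitrule, and min.node.size assume a recent version of caret, and the values are purely illustrative — the point is that every number here is chosen by the data scientist, not estimated from the data:

```r
# Hypothetical tuning grid for caret's "ranger" method.
grid <- expand.grid(
  mtry = c(2, 5, 10),       # number of columns sampled at each split
  splitrule = "gini",       # criterion used to evaluate splits
  min.node.size = c(1, 5)   # minimum observations in a terminal node
)
nrow(grid)  # 6 candidate hyperparameter combinations to try
```

Passing a grid like this to train's tuneGrid argument asks caret to fit and compare a model for each row, which is exactly the kind of bookkeeping caret automates for us.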
4. Random forests
Random forests start with a simple decision tree model, which is fast, but usually not very accurate.
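For example, a single decision tree can be fit in one line with the rpart package (a minimal sketch on the built-in iris data; rpart ships with standard R distributions):

```r
library(rpart)  # recursive partitioning: fits a single decision tree

tree <- rpart(Species ~ ., data = iris)

# Training accuracy of the lone tree -- quick to fit, but a single
# tree like this is usually less accurate than an ensemble of them.
mean(predict(tree, iris, type = "class") == iris$Species)
```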
5. Random forests
Random forests improve the accuracy of a single model by fitting many decision trees, each fit to a different bootstrap sample of the original dataset.
This is called bootstrap aggregation or bagging, and is a well-known technique for improving the performance of predictive models.
Random forests take bagging one step further by randomly sampling a subset of the dataset's columns to consider at each split. This additional level of sampling often helps yield even more accurate models.
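The bagging idea can be illustrated in base R without any packages. This sketch uses small linear models rather than decision trees, purely to keep it self-contained; a real random forest fits a decision tree to each bootstrap sample and additionally samples columns at each split:

```r
set.seed(42)
n <- nrow(mtcars)

# Bagging by hand: fit 100 models, each to a different bootstrap
# sample (rows drawn with replacement) of the original dataset.
bagged <- replicate(100, {
  idx <- sample(n, replace = TRUE)               # bootstrap sample of rows
  fit <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
  predict(fit, newdata = mtcars)                 # each model predicts on the full data
})

# Aggregation: the ensemble prediction is the average across models.
ensemble <- rowMeans(bagged)
```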
6. Running a random forest
Let's fit a random forest using caret.
First, we load the Sonar dataset, then set the random seed so our results are reproducible.
Next, we fit a model using the train function, passing "ranger" as the method argument to fit a random forest. ranger is a great package for fitting random forests in R, and is often much faster than the original randomForest package.
Finally, we plot the result, to see which hyperparameters for the random forest give the best results.
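The steps just described might look like the following sketch, assuming the caret, mlbench, and ranger packages are installed (Sonar comes from mlbench):

```r
library(caret)
library(mlbench)
data(Sonar)          # load the Sonar dataset

set.seed(42)         # make the resampling reproducible

# Fit a random forest with the ranger package via caret's train().
model <- train(
  Class ~ .,
  data = Sonar,
  method = "ranger"
)

# Plot accuracy against the hyperparameter values caret tried.
plot(model)
```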
7. Plotting the results
In this case it looks like smaller values yield higher accuracy.
8. Let's practice!
Let's practice fitting some random forests.