Comparing model performance
Plotting gives you a nice feel for where the model performs well and where it doesn't. Sometimes, though, it is useful to have a single statistic that scores the model, so you can quantify how good it is and compare lots of models at once. A common statistic is the root mean square error (often abbreviated "RMSE"), which squares the residuals, takes the mean, then takes the square root. A smaller RMSE on a given dataset implies better predictions. (In general, you can't compare RMSE across different datasets, only across different models on the same dataset. Sometimes it is possible to normalize the datasets to allow a comparison between them.)
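The RMSE calculation can be sketched in a few lines of base R. The numbers below are made up for illustration; they are not from the course data.

```r
# Toy predicted and actual values (illustrative only)
predicted <- c(2001, 1999, 2005)
actual    <- c(2000, 2000, 2003)

# Residual = predicted minus actual
residual <- predicted - actual

# RMSE: square the residuals, take the mean, then the square root
rmse <- sqrt(mean(residual ^ 2))
rmse  # sqrt(2), about 1.414
```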
Here you'll compare the gradient boosted trees and random forest models.
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions

both_responses, containing the predicted and actual year of the track from both models, has been pre-defined as a local tibble.

Create a residual sum of squares dataset.
- Add a residual column, equal to the predicted response minus the actual response.
- Group the data by model.
- Calculate a summary statistic, rmse, equal to the square root of the mean of the residuals squared.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# both_responses has been pre-defined
both_responses
# Create a residual sum of squares dataset
___
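One possible completion of the blank, written with dplyr verbs, is sketched below. The both_responses tibble here is a made-up stand-in (its column names, actual and predicted, are assumptions; in the exercise the tibble is pre-defined and the real names may differ).

```r
library(dplyr)

# Hypothetical stand-in for the pre-defined both_responses tibble
both_responses <- tibble::tibble(
  model     = c("gbt", "gbt", "rf", "rf"),
  actual    = c(2000, 1995, 2000, 1995),
  predicted = c(2001, 1993, 1999, 1996)
)

# Create a residual sum of squares dataset:
# add a residual column, group by model, then summarize to rmse
residuals_by_model <- both_responses %>%
  mutate(residual = predicted - actual) %>%
  group_by(model) %>%
  summarize(rmse = sqrt(mean(residual ^ 2)))

residuals_by_model
```

The result has one row per model, so the gradient boosted trees and random forest scores can be compared side by side; the model with the smaller rmse predicts the year of the track more accurately on this dataset.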