
Exercise

Comparing model performance

Plotting gives you a good feel for where a model performs well and where it doesn't. Sometimes, though, it is useful to have a single statistic that scores the model, so you can quantify how good it is and compare lots of models. A common choice is the root mean square error (usually abbreviated "RMSE"), which squares the residuals, takes their mean, and then takes the square root. A lower RMSE on a given dataset implies better predictions. (In general, you can't compare RMSE values across different datasets, only across models fit to the same dataset. Sometimes it is possible to normalize the datasets to allow a comparison between them.)
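
To make the definition concrete, here is a minimal base-R illustration of the calculation, using made-up predicted and actual values rather than the exercise data:

```r
# Hypothetical values, purely to illustrate the RMSE formula
predicted <- c(2001, 1998, 2010, 1985)
actual    <- c(2000, 1995, 2012, 1990)

residual <- predicted - actual       # residuals
rmse <- sqrt(mean(residual ^ 2))     # square, average, then take the square root
rmse
```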

Here you'll compare the gradient boosted trees and random forest models.

Instructions

100 XP

both_responses, containing the predicted and actual year of the track from both models, has been pre-defined as a local tibble.

  • Create a dataset of RMSE scores, one per model (a sketch of one possible pipeline follows this list).
    • Add a residual column, equal to the predicted response minus the actual response.
    • Group the data by model.
    • Calculate a summary statistic, rmse, equal to the square root of the mean of the squared residuals.
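
As a rough sketch of how these steps might fit together, assuming both_responses has columns named model, predicted, and actual (the exercise data may use different column names):

```r
library(dplyr)

both_responses %>%
  # Residual = predicted response minus actual response
  mutate(residual = predicted - actual) %>%
  # One score per model
  group_by(model) %>%
  # RMSE: square the residuals, take the mean, then the square root
  summarize(rmse = sqrt(mean(residual ^ 2)))
```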