Comparing model performance
Plotting gives you a nice feel for where the model performs well and where it doesn't. Sometimes, though, it is useful to have a single statistic that scores the model, so you can quantify how good it is and compare lots of models at once. A common statistic is the root mean square error (often abbreviated "RMSE"), which squares the residuals, takes the mean, then takes the square root. A smaller RMSE on a given dataset implies better predictions. (In general, you can't compare RMSE across different datasets, only across different models on the same dataset. Sometimes it is possible to normalize the datasets to allow a comparison between them.)
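The RMSE calculation can be sketched in a few lines of base R. The numbers below are made up for illustration; they are not from the course data.

```r
# Toy predicted and actual values (illustrative only)
predicted <- c(2001, 1999, 2005)
actual    <- c(2000, 2000, 2003)

# Residual = predicted minus actual
residual <- predicted - actual

# RMSE: square the residuals, take the mean, then the square root
rmse <- sqrt(mean(residual ^ 2))
rmse  # sqrt(2), about 1.414
```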
Here you'll compare the gradient boosted trees and random forest models.
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions

both_responses, containing the predicted and actual year of the track from both models, has been pre-defined as a local tibble.

Create a residual sum of squares dataset.
- Add a residual column, equal to the predicted response minus the actual response.
- Group the data by model.
- Calculate a summary statistic, rmse, equal to the square root of the mean of the residuals squared.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# both_responses has been pre-defined
both_responses
# Create a residual sum of squares dataset
___
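One possible completion of the blank, written with dplyr verbs, is sketched below. The both_responses tibble here is a made-up stand-in (its column names, actual and predicted, are assumptions; in the exercise the tibble is pre-defined and the real names may differ).

```r
library(dplyr)

# Hypothetical stand-in for the pre-defined both_responses tibble
both_responses <- tibble::tibble(
  model     = c("gbt", "gbt", "rf", "rf"),
  actual    = c(2000, 1995, 2000, 1995),
  predicted = c(2001, 1993, 1999, 1996)
)

# Create a residual sum of squares dataset:
# add a residual column, group by model, then summarize to rmse
residuals_by_model <- both_responses %>%
  mutate(residual = predicted - actual) %>%
  group_by(model) %>%
  summarize(rmse = sqrt(mean(residual ^ 2)))

residuals_by_model
```

The result has one row per model, so the gradient boosted trees and random forest scores can be compared side by side; the model with the smaller rmse predicts the year of the track more accurately on this dataset.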