Exercise

Gradient boosted trees: visualization

Now that you have your model predictions, you might wonder, "are they any good?" There are many plots you can draw to diagnose the accuracy of your predictions; here you'll take a look at two common ones. First, it's useful to draw a scatterplot of the predicted response against the actual response, to see how they compare. Second, the residuals ought to be approximately normally distributed, so it's useful to draw a density plot of the residuals. The plots will look something like these:

(Example plots: a scatterplot of the predicted response vs. the actual response, and a density plot of the distribution of the residuals.)

One slightly tricky thing here is that sparklyr doesn't yet support the residuals() function for all of its machine learning models. Consequently, you have to calculate the residuals yourself: predicted responses minus actual responses.
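As a concrete illustration, the residuals can be computed with a single transmute() call. This is just a sketch: it assumes a local tibble named responses with columns predicted and actual, matching the pre-defined data in this exercise, and the example values are made up.

```r
library(dplyr)

# A made-up stand-in for the pre-defined `responses` tibble
responses <- tibble(
  actual    = c(1999, 2005, 2010),
  predicted = c(2001.5, 2004.0, 2008.2)
)

# Residual = predicted response minus actual response
residuals <- responses %>%
  transmute(residual = predicted - actual)

residuals
```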

Instructions


A local tibble responses, containing predicted and actual years, has been pre-defined.

  • Draw a scatterplot of predicted vs. actual responses.
    • Call ggplot().
    • The first argument is the dataset, responses.
    • The second argument should contain the unquoted column names for the x and y axes (actual and predicted respectively), wrapped in aes().
    • Add points by adding a call to geom_point().
    • Make the points partially transparent by setting alpha = 0.1.
    • Add a reference line by adding a call to geom_abline() with intercept = 0 and slope = 1.
  • Create a tibble of residuals, named residuals.
    • Call transmute() on the responses.
    • The new column should be called residual.
    • residual should be equal to the predicted response minus the actual response.
  • Draw a density plot of residuals.
    • Pipe the transmuted tibble to ggplot().
    • ggplot() needs a single aesthetic, residual wrapped in aes().
    • Add a probability density curve by calling geom_density().
    • Add a vertical reference line through zero by calling geom_vline() with xintercept = 0.
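Putting the steps above together, a possible solution sketch looks like the following. Since the pre-defined responses tibble isn't shown here, a small made-up stand-in with actual and predicted columns is used in its place.

```r
library(dplyr)
library(ggplot2)

# A small made-up stand-in for the pre-defined `responses` tibble
responses <- tibble(
  actual    = c(1999, 2005, 2010, 2002),
  predicted = c(2001.5, 2004.0, 2008.2, 2003.1)
)

# Scatterplot of predicted vs. actual responses, with semi-transparent
# points and a y = x reference line
scatter <- ggplot(responses, aes(actual, predicted)) +
  geom_point(alpha = 0.1) +
  geom_abline(intercept = 0, slope = 1)

# Residuals: predicted response minus actual response
residuals <- responses %>%
  transmute(residual = predicted - actual)

# Density plot of the residuals, with a vertical reference line at zero
density_plot <- residuals %>%
  ggplot(aes(residual)) +
  geom_density() +
  geom_vline(xintercept = 0)
```

Assigning the plots to variables is optional; in the exercise console you would typically let each ggplot() pipeline print directly.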