Exercise

Gradient boosted trees: visualization

Now that you have your model predictions, you might wonder, "are they any good?" There are many plots you can draw to diagnose the accuracy of your predictions; here you'll take a look at two common ones. First, it's useful to draw a scatterplot of the predicted response against the actual response, to see how they compare. Second, the residuals ought to be approximately normally distributed, so it's useful to draw a density plot of the residuals. The plots will look something like these:

(Example plots: a scatterplot of the predicted response vs. the actual response, and a density plot of the distribution of the residuals.)

One slightly tricky thing here is that sparklyr doesn't yet support the residuals() function for all of its machine learning models. Consequently, you have to calculate the residuals yourself: predicted responses minus actual responses.
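As a concrete illustration, the residuals can be computed with a single transmute() call. This is just a sketch: it assumes a local tibble named responses with columns predicted and actual, matching the pre-defined data in this exercise, and the example values are made up.

```r
library(dplyr)

# A made-up stand-in for the pre-defined `responses` tibble
responses <- tibble(
  actual    = c(1999, 2005, 2010),
  predicted = c(2001.5, 2004.0, 2008.2)
)

# Residual = predicted response minus actual response
residuals <- responses %>%
  transmute(residual = predicted - actual)

residuals
```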

Instructions


A local tibble responses, containing predicted and actual years, has been pre-defined.

  • Draw a scatterplot of predicted vs. actual responses.
    • Call ggplot().
    • The first argument is the dataset, responses.
    • The second argument should contain the unquoted column names for the x and y axes (actual and predicted respectively), wrapped in aes().
    • Add points by adding a call to geom_point().
    • Make the points partially transparent by setting alpha = 0.1.
    • Add a reference line by adding a call to geom_abline() with intercept = 0 and slope = 1.
  • Create a tibble of residuals, named residuals.
    • Call transmute() on the responses.
    • The new column should be called residual.
    • residual should be equal to the predicted response minus the actual response.
  • Draw a density plot of residuals.
    • Pipe the transmuted tibble to ggplot().
    • ggplot() needs a single aesthetic, residual wrapped in aes().
    • Add a probability density curve by calling geom_density().
    • Add a vertical reference line through zero by calling geom_vline() with xintercept = 0.
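Putting the steps above together, a possible solution sketch looks like the following. Since the pre-defined responses tibble isn't shown here, a small made-up stand-in with actual and predicted columns is used in its place.

```r
library(dplyr)
library(ggplot2)

# A small made-up stand-in for the pre-defined `responses` tibble
responses <- tibble(
  actual    = c(1999, 2005, 2010, 2002),
  predicted = c(2001.5, 2004.0, 2008.2, 2003.1)
)

# Scatterplot of predicted vs. actual responses, with semi-transparent
# points and a y = x reference line
scatter <- ggplot(responses, aes(actual, predicted)) +
  geom_point(alpha = 0.1) +
  geom_abline(intercept = 0, slope = 1)

# Residuals: predicted response minus actual response
residuals <- responses %>%
  transmute(residual = predicted - actual)

# Density plot of the residuals, with a vertical reference line at zero
density_plot <- residuals %>%
  ggplot(aes(residual)) +
  geom_density() +
  geom_vline(xintercept = 0)
```

Assigning the plots to variables is optional; in the exercise console you would typically let each ggplot() pipeline print directly.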