Exercise

Gradient boosted trees: prediction

Once you've trained your model, the next step is to make predictions with it. sparklyr provides methods for base R's predict() function, so you can make predictions from Spark models with the same syntax you would use for a linear regression model. predict() takes two arguments: a model and some testing data.

predict(a_model, testing_data)
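
The same calling convention can be tried locally with an ordinary linear regression. The sketch below is an illustration only: it uses base R's lm() on the built-in mtcars dataset rather than a Spark model, and the row split and the choice of columns are assumptions made for the example.

```r
# Local, base-R illustration of the predict(model, data) calling convention.
# mtcars ships with R; the train/test split below is arbitrary.
training_data <- mtcars[1:24, ]
testing_data  <- mtcars[25:32, ]

# Fit a linear regression on the training rows.
a_model <- lm(mpg ~ wt + hp, data = training_data)

# First argument: the model. Second argument: the data to predict on.
predicted_mpg <- predict(a_model, testing_data)

# predict() returns one prediction per row of the testing data.
length(predicted_mpg) == nrow(testing_data)
```

With a Spark model, the call has the same shape; only the classes of the model and the data differ.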

A common use case is to compare the predicted responses with the actual responses, for example by plotting them in R. The code pattern for preparing this data is as follows. Note that, currently, the prediction column has to be added locally, so you must collect the results first.

predicted_vs_actual <- testing_data %>%
  select(response) %>%
  collect() %>%
  mutate(predicted_response = predict(a_model, testing_data))
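
Once the actual and predicted values sit side by side in a local data frame, comparing them is ordinary R. The following is a rough sketch of that comparison, using a local lm() model on mtcars as a stand-in for the Spark model and the collected data; the residual column and the RMSE summary are illustrative additions, not part of the exercise.

```r
# Stand-in for the Spark workflow: in practice, `response` would come from
# collect() and `predicted_response` from predict() on the Spark model.
training_data <- mtcars[1:24, ]
testing_data  <- mtcars[25:32, ]
a_model <- lm(mpg ~ wt + hp, data = training_data)

# Actual and predicted responses, side by side.
predicted_vs_actual <- data.frame(
  response = testing_data$mpg,
  predicted_response = unname(predict(a_model, testing_data))
)

# Residual: how far each prediction is from the actual value.
predicted_vs_actual$residual <-
  predicted_vs_actual$predicted_response - predicted_vs_actual$response

# Root mean squared error as a single-number summary of the fit.
rmse <- sqrt(mean(predicted_vs_actual$residual^2))
```

A scatter plot of response against predicted_response, with a reference line of slope 1, is one way to see at a glance how close the predictions are to the actual values.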

Instructions

A Spark connection has been created for you as spark_conn. Tibbles attached to the training and testing datasets stored in Spark have been pre-defined as track_data_to_model_tbl and track_data_to_predict_tbl respectively. The gradient boosted trees model has been pre-defined as gradient_boosted_trees_model.

  • Select the year column.
  • Collect the results.
  • Add a column containing the predictions.
    • Use mutate() to add a field named predicted_year.
    • This field should be created by calling predict().
    • Pass the model and the testing data to predict().