Exercise

Modeling an interaction (2)

In this exercise, you will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects-only model. Because this dataset is small, you will use cross-validation to simulate making predictions on out-of-sample data.

You will begin to use the dplyr package to do calculations.

  • mutate() adds new columns to a tbl (a type of data frame)
  • group_by() specifies how rows are grouped in a tbl
  • summarize() computes summary statistics of a column

You will also use tidyr's gather() which takes multiple columns and collapses them into key-value pairs. The alcohol data frame and the formulas fmla_add and fmla_interaction have been pre-loaded.
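
As a rough illustration of how these four verbs fit together, here is a minimal sketch on a made-up data frame. The data frame toy and its columns y, pred_a, and pred_b are hypothetical and are not part of the exercise data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the exercise data): one outcome column
# and two columns of predictions from two imaginary models.
toy <- data.frame(
  y      = c(1, 2, 3, 4),
  pred_a = c(1.5, 2.0, 2.5, 4.5),
  pred_b = c(1.0, 2.5, 3.0, 3.0)
)

toy_rmse <- toy %>%
  # collapse the two prediction columns into key-value pairs:
  # modeltype holds the old column name, pred holds the value
  gather(key = modeltype, value = pred, pred_a, pred_b) %>%
  # add a new column of residuals
  mutate(residual = y - pred) %>%
  # treat the rows of each model as a separate group
  group_by(modeltype) %>%
  # compute one summary statistic (the RMSE) per group
  summarize(rmse = sqrt(mean(residual^2)))

print(toy_rmse)
```

After gather(), each original row appears once per prediction column, so the grouped summary yields one RMSE per model.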

Instructions

  • Use kWayCrossValidation() to create a splitting plan for a 3-fold cross-validation.
    • The first argument is the number of rows to be split.
    • The second argument is the number of folds for the cross-validation.
    • You can set the 3rd and 4th arguments of the function to NULL.
  • Examine and run the sample code to get the 3-fold cross-validation predictions of a model with no interactions and assign them to the column pred_add.
  • Get the 3-fold cross-validation predictions of the model with interactions. Assign the predictions to the column pred_interaction.
    • The sample code shows you the procedure.
    • Use the same splitPlan that you already created.
  • Fill in the blanks to:
    • gather the predictions into a single column pred.
    • add a column of residuals (actual outcome - predicted outcome).
    • get the RMSE of the cross-validation predictions for each model type.
  • Compare the RMSEs. Based on these results, which model should you use?
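
The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the exercise's sample code: it assumes kWayCrossValidation() comes from the vtreat package, and the simulated data frame dat, the formulas f_add and f_int, and the helper cv_predict() are hypothetical stand-ins for alcohol, fmla_add, fmla_interaction, and the fill-in-the-blank code.

```r
library(vtreat)  # assumed source of kWayCrossValidation()
library(dplyr)
library(tidyr)

set.seed(1)

# Hypothetical stand-in for the pre-loaded data: a numeric input x,
# a two-level factor g, and an outcome y whose slope depends on g.
n <- 30
dat <- data.frame(
  x = runif(n, 0, 5),
  g = factor(rep(c("a", "b"), length.out = n))
)
dat$y <- 1 + 0.5 * dat$x + ifelse(dat$g == "b", 1.5 * dat$x, 0) +
  rnorm(n, sd = 0.5)

# Hypothetical formulas: main effects only vs. an interaction term
f_add <- y ~ x + g
f_int <- y ~ x + x:g

# 3-fold splitting plan; the 3rd and 4th arguments are set to NULL
splitPlan <- kWayCrossValidation(nrow(dat), 3, NULL, NULL)

# Hypothetical helper: fit on each fold's training rows (split$train)
# and predict on that fold's held-out rows (split$app)
cv_predict <- function(fmla, data, plan) {
  pred <- numeric(nrow(data))
  for (split in plan) {
    model <- lm(fmla, data = data[split$train, ])
    pred[split$app] <- predict(model, newdata = data[split$app, ])
  }
  pred
}

# Use the same splitting plan for both models
dat$pred_add <- cv_predict(f_add, dat, splitPlan)
dat$pred_interaction <- cv_predict(f_int, dat, splitPlan)

# Gather the predictions, add residuals, and get the RMSE per model type
cv_rmse <- dat %>%
  gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = y - pred) %>%
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))

print(cv_rmse)
```

Reusing one splitting plan for both models means each row is predicted by a model that never saw it, and both models are scored on identical folds, so the two RMSEs are directly comparable.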