Modeling an interaction (2)

In this exercise, you will compare the performance of the interaction model you fit in the previous exercise to that of a main-effects-only model. Because this dataset is small, you will use cross-validation to simulate making predictions on out-of-sample data.

You will begin to use the dplyr package to do calculations.

  • mutate() adds new columns to a tbl (a type of data frame)
  • group_by() specifies how rows are grouped in a tbl
  • summarize() computes summary statistics of a column

You will also use tidyr's pivot_longer(), which takes multiple columns and collapses them into key-value pairs. The alcohol data frame and the formulas fmla_add and fmla_interaction have been pre-loaded.
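As a rough illustration of how these verbs chain together, here is a minimal sketch on a made-up data frame. The names toy, y, pred_a, pred_b, and rmse_by_model are hypothetical and are not part of the exercise data; the sketch assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: an outcome y and two columns of predictions
toy <- data.frame(
  y      = c(1, 2, 3, 4),
  pred_a = c(1.5, 2.5, 2.5, 3.5),
  pred_b = c(1, 2, 3, 6)
)

rmse_by_model <- toy %>%
  # Collapse the two prediction columns into (modeltype, pred) pairs
  pivot_longer(cols = c("pred_a", "pred_b"),
               names_to = "modeltype", values_to = "pred") %>%
  # Add a residual column: actual outcome minus predicted outcome
  mutate(residuals = y - pred) %>%
  # Compute one summary row per model type
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))

print(rmse_by_model)
```

After pivot_longer(), each original row appears once per prediction column, so grouping by modeltype gives one RMSE per set of predictions.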

This exercise is part of the course

Supervised Learning in R: Regression

Exercise instructions

  • Use kWayCrossValidation() to create a splitting plan for a 3-fold cross-validation.
    • The first argument is the number of rows to be split.
    • The second argument is the number of folds for the cross-validation.
    • You can set the 3rd and 4th arguments of the function to NULL.
  • Examine and run the sample code to get the 3-fold cross-validation predictions of a model with no interactions and assign them to the column pred_add.
  • Get the 3-fold cross-validation predictions of the model with interactions. Assign the predictions to the column pred_interaction.
    • The sample code shows you the procedure.
    • Use the same splitPlan that you already created.
  • Fill in the blanks to:
    • pivot_longer the predictions into a single column pred.
    • add a column of residuals (actual outcome - predicted outcome).
    • get the RMSE of the cross-validation predictions for each model type.
  • Compare the RMSEs. Based on these results, which model should you use?
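To make the idea of a splitting plan more concrete, the following is a minimal base-R sketch of the same train/apply pattern on the built-in mtcars data. The helper make_split_plan() is hypothetical and is not the implementation of kWayCrossValidation(); it only assumes, as the sample code does, that a plan is a list whose elements each carry train and app row indices. The formula mpg ~ wt is likewise just a stand-in.

```r
# Hypothetical helper (not kWayCrossValidation itself): build a k-fold plan
# as a list of list(train = <row indices>, app = <row indices>)
make_split_plan <- function(n_rows, n_folds) {
  # Randomly assign each row to one of the folds, keeping fold sizes balanced
  fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))
  lapply(seq_len(n_folds), function(k) {
    list(train = which(fold_id != k), app = which(fold_id == k))
  })
}

set.seed(34245)  # set the seed for reproducibility
plan <- make_split_plan(nrow(mtcars), 3)

mtcars_cv <- mtcars
mtcars_cv$pred <- 0  # initialize the prediction vector
for (i in seq_along(plan)) {
  split <- plan[[i]]
  # Fit on the training rows only
  model <- lm(mpg ~ wt, data = mtcars_cv[split$train, ])
  # Predict on the held-out rows, which the model never saw
  mtcars_cv$pred[split$app] <- predict(model, newdata = mtcars_cv[split$app, ])
}

# RMSE of the out-of-sample predictions
cv_rmse <- sqrt(mean((mtcars_cv$mpg - mtcars_cv$pred)^2))
print(cv_rmse)
```

Because every row lands in exactly one app set, each prediction comes from a model that was fit without that row, which is what lets the resulting RMSE approximate out-of-sample error.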

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# alcohol is available
summary(alcohol)

# Both formulas are available
fmla_add
fmla_interaction

# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- ___(___(___), ___, ___, ___)

# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i ___ ___) {
  split <- ___
  model_interaction <- lm(___, data = alcohol[split$train, ])
  alcohol$___[split$app] <- predict(___, newdata = alcohol[split$app, ])
}

# Get RMSE
alcohol %>%
  pivot_longer(cols = c('pred_add', 'pred_interaction'), names_to = 'modeltype', values_to = 'pred') %>%
  mutate(residuals = ___) %>%
  group_by(modeltype) %>%
  summarize(rmse = ___(___(___)))