Modeling an interaction (2)

In this exercise, you will compare the performance of the interaction model you fit in the previous exercise with that of a main-effects-only model. Because the dataset is small, you will use cross-validation to simulate making predictions on out-of-sample data.

You will begin to use the dplyr package to do calculations.

- mutate() adds new columns to a tbl (a type of data frame).
- group_by() specifies how rows are grouped in a tbl.
- summarize() computes summary statistics of a column.
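
As a rough illustration of how these three verbs chain together, here is a minimal sketch on a made-up data frame. The `scores` data and its column names are hypothetical and are not part of the exercise.

```r
library(dplyr)

# Hypothetical toy data: two groups, each with observed and predicted values
scores <- data.frame(
  group     = c("a", "a", "b", "b"),
  actual    = c(1, 3, 2, 6),
  predicted = c(2, 3, 4, 4)
)

result <- scores %>%
  mutate(error = actual - predicted) %>%         # add a new column
  group_by(group) %>%                            # group rows by `group`
  summarize(mean_abs_error = mean(abs(error)))   # one summary row per group

print(result)
```

The result has one row per group, which is the same shape the exercise asks for when it summarizes by model type.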

You will also use tidyr's pivot_longer(), which takes multiple columns and collapses them into key-value pairs. The alcohol data frame and the formulas fmla_add and fmla_interaction have been pre-loaded.
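
To make the key-value idea concrete, the sketch below reshapes a small made-up data frame. The `wide` data and the column names `pred_a` and `pred_b` are assumptions chosen for illustration only.

```r
library(tidyr)

# Hypothetical wide data: one row per observation, one column per model
wide <- data.frame(
  id     = 1:2,
  pred_a = c(10, 20),
  pred_b = c(11, 19)
)

# Collapse the two prediction columns into a key column and a value column
long <- pivot_longer(
  wide,
  cols      = c("pred_a", "pred_b"),
  names_to  = "modeltype",   # holds the original column names
  values_to = "pred"         # holds the original cell values
)

print(long)
```

Each original row becomes two rows, one per collapsed column, so the output here has four rows.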

This exercise is part of the course

Supervised Learning in R: Regression

Exercise instructions

Use kWayCrossValidation() to create a splitting plan for 3-fold cross-validation.
- The first argument is the number of rows to be split.
- The second argument is the number of folds for the cross-validation.
- You can set the 3rd and 4th arguments of the function to NULL.
Examine and run the sample code to get the 3-fold cross-validation predictions of a model with no interactions and assign them to the column pred_add.
Get the 3-fold cross-validation predictions of the model with interactions. Assign the predictions to the column pred_interaction.
- The sample code shows you the procedure.
- Use the same splitPlan that you already created.
Fill in the blanks to:
- pivot the predictions into a single column pred with pivot_longer().
- add a column of residuals (actual outcome - predicted outcome).
- get the RMSE of the cross-validation predictions for each model type.
Compare the RMSEs. Based on these results, which model should you use?
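
The overall procedure can be sketched on a different dataset. The example below is an illustration only, assuming the vtreat package (which provides kWayCrossValidation()) is installed. It uses R's built-in mtcars data, and the formula mpg ~ wt + hp and the seed are arbitrary choices, not part of the exercise.

```r
library(vtreat)

set.seed(1)  # arbitrary seed so the random split is repeatable
nfolds <- 3

# Arguments: number of rows, number of folds, then two arguments left as NULL
plan <- kWayCrossValidation(nrow(mtcars), nfolds, NULL, NULL)

# Each element of the plan holds row indices:
# `train` for fitting the model, `app` for applying it
cars <- mtcars
cars$pred <- 0  # initialize the prediction vector
for (i in seq_len(nfolds)) {
  split <- plan[[i]]
  model <- lm(mpg ~ wt + hp, data = cars[split$train, ])
  cars$pred[split$app] <- predict(model, newdata = cars[split$app, ])
}

# RMSE: square root of the mean squared residual
residuals <- cars$mpg - cars$pred
rmse <- sqrt(mean(residuals^2))
print(rmse)
```

Because every row is predicted by a model that never saw it during fitting, the resulting RMSE approximates out-of-sample error. Reusing one splitting plan for both models keeps their RMSEs directly comparable.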

Hands-on interactive exercise

Try this exercise by completing the sample code.

# alcohol is available
summary(alcohol)

# Both formulas are available
fmla_add
fmla_interaction

# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- ___(___(___), ___, ___, ___)

# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i ___ ___) {
  split <- ___
  model_interaction <- lm(___, data = alcohol[split$train, ])
  alcohol$___[split$app] <- predict(___, newdata = alcohol[split$app, ])
}

# Get RMSE
alcohol %>% 
  pivot_longer(cols=c('pred_add', 'pred_interaction'), names_to='modeltype', values_to='pred') %>%
  mutate(residuals = ___) %>%
  group_by(modeltype) %>%
  summarize(rmse = ___(___(___)))
