
Input transforms: the "hockey stick" (2)

In the last exercise, you saw that a quadratic model seems to fit the houseprice data better than a linear model. In this exercise, you will check whether the quadratic model also performs better on out-of-sample data. Because this dataset is small, you will estimate out-of-sample performance with cross-validation rather than a single train/test split. The quadratic formula fmla_sqr that you created in the last exercise and the houseprice data frame are available for you to use.

For comparison, the sample code will calculate cross-validation predictions from a linear model price ~ size.

This exercise is part of the course

Supervised Learning in R: Regression


Exercise instructions

  • Use kWayCrossValidation() to create a splitting plan for 3-fold cross-validation.
    • You can set the 3rd and 4th arguments of the function (dframe and y) to NULL.
  • Examine and run the sample code to get the 3-fold cross-validation predictions of the model price ~ size and add them to the column pred_lin.
  • Get the cross-validation predictions for price as a function of squared size. Assign them to the column pred_sqr.
    • The sample code gives you the procedure.
    • You can use the splitting plan you already created.
  • Fill in the blanks to pivot the predictions and calculate the residuals.
  • Fill in the blanks to compare the RMSE for the two models. Which one has the lower cross-validated RMSE?
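
The steps above can be sketched end to end on made-up data. This is a minimal illustration, not the exercise solution: the data frame toy, the helper functions make_split_plan(), cv_predict() and rmse(), and the formula price ~ I(size^2) are all assumptions introduced here. The helper only mimics the list-of-splits structure that the sample code relies on (each split has a train and an app element); it is not the vtreat implementation. It uses base R only, so it runs without extra packages.

```r
# Synthetic stand-in for houseprice (made-up numbers, NOT the course data)
set.seed(1)
n <- 40
toy <- data.frame(size = runif(n, 50, 250))
toy$price <- 100 + 0.01 * toy$size^2 + rnorm(n, sd = 5)

# Hypothetical helper: builds a list of splits, each with
#   $train = row indices used to fit the model
#   $app   = row indices the fitted model is applied to
make_split_plan <- function(n_rows, n_splits) {
  fold <- sample(rep(seq_len(n_splits), length.out = n_rows))
  lapply(seq_len(n_splits), function(k) {
    list(train = which(fold != k), app = which(fold == k))
  })
}

plan <- make_split_plan(nrow(toy), 3)

# Hypothetical helper: every row is predicted by a model
# that did not see that row during fitting
cv_predict <- function(fmla, data, plan) {
  pred <- numeric(nrow(data))
  for (split in plan) {
    model <- lm(fmla, data = data[split$train, ])
    pred[split$app] <- predict(model, newdata = data[split$app, ])
  }
  pred
}

toy$pred_lin <- cv_predict(price ~ size, toy, plan)
toy$pred_sqr <- cv_predict(price ~ I(size^2), toy, plan)  # assumed quadratic formula

# Root mean squared error of the cross-validated predictions
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
rmse_lin <- rmse(toy$price, toy$pred_lin)
rmse_sqr <- rmse(toy$price, toy$pred_sqr)
print(c(linear = rmse_lin, quadratic = rmse_sqr))
```

Because the synthetic prices are generated from a quadratic relationship, the quadratic model would be expected to show the lower RMSE here; on real data the comparison is the point of the exercise.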

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

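# Packages used below (assumed to be preloaded in the exercise environment)
library(vtreat)  # kWayCrossValidation()
library(dplyr)   # %>%, mutate(), group_by(), summarize()
library(tidyr)   # pivot_longer()

# houseprice is available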
summary(houseprice)

# fmla_sqr is available
fmla_sqr

# Create a splitting plan for 3-fold cross-validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- ___

# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_lin <- lm(price ~ size, data = houseprice[split$train,])
  houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app,])
}

# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- ___
  model_sqr <- lm(___, data = houseprice[split$train, ])
  houseprice$___[split$app] <- predict(___, newdata = houseprice[split$app, ])
}

# Pivot the predictions and calculate the residuals
houseprice_long <- houseprice %>%
  pivot_longer(cols = c('pred_lin', 'pred_sqr'), names_to = 'modeltype', values_to = 'pred') %>%
  mutate(residuals = ___)

# Compare the cross-validated RMSE for the two models
houseprice_long %>% 
  group_by(modeltype) %>% # group by modeltype
  summarize(rmse = ___)
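
For reference, the quantity compared in the last step is the root mean squared error of the cross-validated predictions, computed separately for each model type. The symbols below are notation introduced here, not taken from the exercise: y_i is the observed price, \hat{y}_i is the cross-validated prediction for the same row, and n is the number of rows.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

Since the residuals are squared, the sign convention chosen for them (observed minus predicted, or the reverse) does not change the RMSE.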