Exercise

Typical values of data

Let's revisit the model of life insurance cost as a function of age, sex, and coverage that you saw in the first chapter. The model, shown in the plot, is log(Cost) ~ Age + Sex + Coverage.

The plot shows cost rising along an upward curve with age, in a roughly exponential pattern. This suggests that it may be appropriate to model the logarithm of cost rather than cost itself.
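To see why an exponential pattern points toward a log response, here is a short worked equation. The symbols A and b are illustrative placeholders, not values from the course data, and the single-variable form is a simplification of the full model.

```latex
\mathrm{Cost} = A\, e^{\,b \cdot \mathrm{Age}}
\quad\Longrightarrow\quad
\log(\mathrm{Cost}) = \log A + b \cdot \mathrm{Age}
```

If cost grows roughly exponentially with age, log(Cost) is roughly a straight-line function of Age, which is the shape a linear model can capture.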

Instructions
  • Build a linear model, mod_1, with log(Cost) as the response and Age + Sex + Coverage on the explanatory side. Graph the model to see how well it matches the data.
  • Build mod_2 the same as mod_1, but with an interaction between Age and Sex.
  • Build mod_3 the same as mod_2, but replace Coverage with log(Coverage).
  • Finally, build mod_4 the same as mod_3, but with interactions among Age, Sex, and log(Coverage).
  • Use fmodel() to look at each of the four models and see which one seems to fit the data best. (Adding + ggplot2::geom_point() to the fmodel() call displays the data points.)
  • Use cross-validation to demonstrate that mod_4 has a smaller mean squared error (MSE) on testing data than mod_1. (mod_4 is the best of the four models. Feel free to compare all four if you want to confirm this.)
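The steps above can be sketched in base R as follows. This is a minimal illustration, not the course solution: the data frame sim_data is simulated with made-up coefficients (only the column names follow the exercise), the helper cv_mse() is hypothetical, and it uses a hand-rolled k-fold split instead of the course's own cross-validation helper. Results on the simulated data say nothing about which model wins on the real data.

```r
# Simulated stand-in for the course data. Column names follow the exercise,
# but the values and coefficients are invented for illustration only.
set.seed(1)
n <- 400
sim_data <- data.frame(
  Age      = sample(25:75, n, replace = TRUE),
  Sex      = factor(sample(c("F", "M"), n, replace = TRUE)),
  Coverage = sample(c(10, 20, 50, 100), n, replace = TRUE)
)
sim_data$Cost <- with(sim_data, exp(
  -2 + 0.08 * Age + 0.3 * (Sex == "M") + 0.9 * log(Coverage) +
    rnorm(n, sd = 0.1)
))

# The four model formulas described in the instructions.
# In R formula syntax, A * B expands to A + B + A:B (main effects plus interaction).
formulas <- list(
  mod_1 = log(Cost) ~ Age + Sex + Coverage,
  mod_2 = log(Cost) ~ Age * Sex + Coverage,
  mod_3 = log(Cost) ~ Age * Sex + log(Coverage),
  mod_4 = log(Cost) ~ Age * Sex * log(Coverage)
)

# Hypothetical helper: k-fold cross-validated MSE on the log(Cost) scale.
# Each call draws its own random fold assignment.
cv_mse <- function(formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_mse <- sapply(seq_len(k), function(i) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- lm(formula, data = train)
    mean((log(test$Cost) - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)
}

# Cross-validated MSE for each of the four formulas.
cv_results <- sapply(formulas, cv_mse, data = sim_data)
print(cv_results)
```

A smaller cross-validated MSE indicates better predictions on held-out data. Because the folds are random, repeating the comparison over several trials gives a more reliable picture than a single run.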