Models for each category

1. Models for each category

The parallel slopes model enforced a common slope across every category. That's not always the best option.

2. 4 categories

Recall that the fish dataset had four different species of fish. One way to give each species a different slope is to run a separate model for each of these.

3. Splitting the dataset

There are many smart ways of splitting a dataset into parts and computing on each part. In base R, you can use split and lapply. In dplyr, you can use nest_by and mutate. We aren't going to do that. Instead, let's filter for each species one at a time and assign the result to individual variables. I'm choosing this approach partly because I don't want fancy code to get in the way of reasoning about models, and partly because running regression models is such a fundamental task that you need to be able to write the code without thinking, and that takes practice. With this approach, you'll be writing the modeling code for every category in the dataset.
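The filtering step might look like this. This is a sketch assuming the data frame is called fish and its species column contains "Bream", "Perch", "Pike", and "Roach"; the names in your copy of the data may differ.

```r
library(dplyr)

# One filtered data frame per species, assigned to its own variable
bream <- fish %>% filter(species == "Bream")
perch <- fish %>% filter(species == "Perch")
pike  <- fish %>% filter(species == "Pike")
roach <- fish %>% filter(species == "Roach")
```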

4. 4 models

Now that we have four datasets, we can run four models. Again, there's no fancy looping, just the same model four times. Observe that each model gives a different intercept and a different slope.
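As a sketch, assuming a response column mass_g and an explanatory column length_cm (the exact column names are an assumption), the four models are just four calls to lm:

```r
# Same formula each time; only the data argument changes
mdl_bream <- lm(mass_g ~ length_cm, data = bream)
mdl_perch <- lm(mass_g ~ length_cm, data = perch)
mdl_pike  <- lm(mass_g ~ length_cm, data = pike)
mdl_roach <- lm(mass_g ~ length_cm, data = roach)

# Each model has its own intercept and slope
coefficients(mdl_bream)
coefficients(mdl_perch)
```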

5. Explanatory data

To make predictions with these models, we first have to create a data frame of explanatory variables. The good news is that since each model has the same explanatory variable, you only have to write this code once. Give your copy and paste fingers a rest.
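A minimal sketch of that single data frame, again assuming the explanatory variable is length_cm and that a grid of lengths like this is a reasonable range for the data:

```r
library(tibble)

# One set of explanatory values, shared by all four models
explanatory_data <- tibble(length_cm = seq(5, 60, 5))
```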

6. Making predictions

Predicting follows the now familiar flow. Add a column with mutate, name it after the response variable, call predict with the model and the explanatory data. The only difference in each case is the model variable. It isn't necessary for calculating the predictions, but to make the plotting code you are about to see easier, I've also included the species in each prediction dataset.
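Here is that flow for one species; the other three are the same with the model variable and species label swapped. The column names and model names follow the assumptions from the earlier sketches:

```r
library(dplyr)

prediction_data_bream <- explanatory_data %>%
  mutate(
    # Name the prediction column after the response variable
    mass_g = predict(mdl_bream, explanatory_data),
    # Not needed for prediction, but handy for plotting by color later
    species = "Bream"
  )
# Repeat with mdl_perch / "Perch", mdl_pike / "Pike", mdl_roach / "Roach"
```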

7. Visualizing predictions

Here's the standard ggplot for showing linear regression predictions. geom_point makes it a scatter plot, and geom_smooth with method equals "lm" provides prediction lines. Unlike the parallel slopes case, each line has its own slope. This is achieved by setting the color aesthetic.
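A sketch of that plot, under the same assumed column names. Mapping species to the color aesthetic means geom_smooth fits one line per color group, so each species gets its own slope:

```r
library(ggplot2)

ggplot(fish, aes(length_cm, mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```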

8. Adding in your predictions

To sanity check our predictions, we add them to the plot to see if they align with what ggplot2 calculated. The size and shape are changed to help them stand out. Thankfully, each line of squares follows ggplot's trend lines.
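As a sketch, assuming the four per-species prediction datasets have been combined into one data frame (for example with bind_rows), the predictions can be layered on with a second geom_point:

```r
library(dplyr)
library(ggplot2)

# Combine the per-species prediction datasets into one data frame
prediction_data <- bind_rows(
  prediction_data_bream, prediction_data_perch,
  prediction_data_pike, prediction_data_roach
)

ggplot(fish, aes(length_cm, mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Larger squares so the predictions stand out from the data points
  geom_point(data = prediction_data, size = 3, shape = 15)
```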

9. Coefficient of determination

An important question here is: are these models better? The coefficient of determination for a model on the whole fish dataset is point-nine-two. Now here's the coefficient of determination for each of the individual models. The pike number is higher, indicating a better fit, though the numbers for the other models are lower.
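One way to pull out those numbers is with glance from the broom package, assuming a whole-dataset model named mdl_fish alongside the per-species models from the earlier sketches:

```r
library(broom)

# Whole-dataset model for comparison
mdl_fish <- lm(mass_g ~ length_cm, data = fish)

# Coefficient of determination for each model
glance(mdl_fish)$r.squared
glance(mdl_bream)$r.squared
glance(mdl_perch)$r.squared
glance(mdl_pike)$r.squared
glance(mdl_roach)$r.squared
```

summary(mdl_fish)$r.squared gives the same number without broom.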

10. Residual standard error

Here's the residual standard error for the whole dataset model, one hundred and three. For the individual models, this time the pike residual standard error is higher, indicating larger differences between actual and predicted values, but the other models show an improvement over the whole dataset model. This mixed performance result is quite common: the whole dataset model benefits from the increased power of more rows of data, whereas the individual models benefit from not having to fit differing trends within the data.
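The residual standard error is also available from glance, under the same model-name assumptions as before:

```r
library(broom)

# Residual standard error for the whole-dataset and per-species models
glance(mdl_fish)$sigma
glance(mdl_bream)$sigma
glance(mdl_perch)$sigma
glance(mdl_pike)$sigma
glance(mdl_roach)$sigma
```

summary(mdl_fish)$sigma is the base-R equivalent.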

11. Let's practice!

Let's try this on the housing data.