1. Predicting parallel slopes
Predicting responses is perhaps the most useful feature of regression models. With two explanatory variables, the code for prediction has one subtle difference from the case with a single explanatory variable.
2. The prediction workflow 1
The prediction workflow starts with choosing values for explanatory variables. You pick any values you want, and store them in a data frame or tibble.
For a single explanatory variable, the data frame has one column. Here, it's a sequence of lengths from 5cm to 60cm, in steps of 5cm.
For multiple explanatory variables, it's the same process, but there's a useful trick. expand_grid from the tidyr package returns a data frame of all combinations of its inputs.
Here, you have 5cm and each fish species, 10cm and each fish species, through to 60cm and each fish species.
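As a rough sketch of both cases, something like the following would work. The variable names are illustrative, and the species other than Bream are assumed, since only the number of categories is given here.

```r
library(tidyr)

# Single explanatory variable: one column of lengths, 5cm to 60cm in steps of 5cm
explanatory_data_one <- tibble(length_cm = seq(5, 60, 5))

# Two explanatory variables: every combination of length and species.
# Species names other than "Bream" are assumptions for illustration.
explanatory_data <- expand_grid(
  length_cm = seq(5, 60, 5),
  species = c("Bream", "Perch", "Pike", "Roach")
)
```

With 12 lengths and 4 species, the second data frame has 48 rows, and the first input varies slowest, so the 5cm rows come first.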
3. The prediction workflow 2
Next you add a column of predictions to the data frame. To calculate the predictions, call predict, passing the model and the explanatory data. Here's the code for one explanatory variable.
With two or more explanatory variables, the code is exactly the same apart from the name of the model variable!
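Here's a minimal sketch of that step. The toy fish data is made up so the example runs on its own, and the names mdl_mass_vs_both, explanatory_data and prediction_data, as well as the "+ 0" in the model formula, are assumptions rather than anything fixed by the workflow.

```r
library(dplyr)
library(tidyr)

# Made-up toy data standing in for the fish dataset (not real measurements)
fish <- data.frame(
  species = rep(c("Bream", "Perch", "Pike", "Roach"), each = 3),
  length_cm = c(25, 30, 35, 15, 25, 35, 35, 45, 55, 15, 20, 25),
  mass_g = c(300, 500, 700, 60, 250, 600, 350, 700, 1100, 60, 150, 280)
)

# Parallel slopes model; "+ 0" is assumed, to get one intercept per species
mdl_mass_vs_both <- lm(mass_g ~ length_cm + species + 0, data = fish)

# All combinations of the explanatory variables
explanatory_data <- expand_grid(
  length_cm = seq(5, 60, 5),
  species = unique(fish$species)
)

# Add a column of predictions, named after the response variable
prediction_data <- explanatory_data %>%
  mutate(mass_g = predict(mdl_mass_vs_both, explanatory_data))
```

Swapping in a model with a single explanatory variable would only change the model name and the columns of explanatory_data.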
4. Visualizing the predictions
Just as in the single explanatory variable case, we can visualize the predictions from the model by adding another geom_point layer and setting the data argument to prediction_data.
I set the size and shape arguments to make the predictions big square points. A good sign that this worked is that the prediction points lie along the lines calculated by ggplot.
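One way this plot could be put together is sketched below. The toy data is made up, and as an assumption the trend lines are drawn from the model's fitted values with geom_line rather than by whichever layer the original plot used.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up toy data standing in for the fish dataset (not real measurements)
fish <- data.frame(
  species = rep(c("Bream", "Perch", "Pike", "Roach"), each = 3),
  length_cm = c(25, 30, 35, 15, 25, 35, 35, 45, 55, 15, 20, 25),
  mass_g = c(300, 500, 700, 60, 250, 600, 350, 700, 1100, 60, 150, 280)
)
mdl_mass_vs_both <- lm(mass_g ~ length_cm + species + 0, data = fish)

# Fitted values on the observed data, used to draw the parallel lines
fish$fitted_g <- fitted(mdl_mass_vs_both)

explanatory_data <- expand_grid(
  length_cm = seq(5, 60, 5),
  species = unique(fish$species)
)
prediction_data <- explanatory_data %>%
  mutate(mass_g = predict(mdl_mass_vs_both, explanatory_data))

plt <- ggplot(fish, aes(length_cm, mass_g, color = species)) +
  geom_point() +
  geom_line(aes(y = fitted_g)) +
  # Predictions as big square points (shape 15 is a filled square)
  geom_point(data = prediction_data, size = 5, shape = 15)
```

Printing plt draws the figure. The square points for each species should sit on the extension of that species' line.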
5. Manually calculating predictions
In the previous course, you saw how to manually calculate the predictions. The coefficients function extracts the coefficients from the model.
The intercept is the first coefficient, and the slope is the second coefficient.
Then the response value is the intercept plus the slope times the explanatory variable.
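For the single explanatory variable case, that calculation might look like this sketch. The toy data is made up, and the model and variable names are assumptions.

```r
library(dplyr)

# Made-up toy data standing in for the fish dataset (not real measurements)
fish <- data.frame(
  length_cm = c(25, 30, 35, 15, 25, 35, 35, 45, 55, 15, 20, 25),
  mass_g = c(300, 500, 700, 60, 250, 600, 350, 700, 1100, 60, 150, 280)
)
mdl_mass_vs_length <- lm(mass_g ~ length_cm, data = fish)

# Extract the coefficients: intercept first, slope second
coeffs <- coefficients(mdl_mass_vs_length)
intercept <- coeffs[[1]]
slope <- coeffs[[2]]

# Response = intercept + slope * explanatory variable
explanatory_data <- data.frame(length_cm = seq(5, 60, 5))
prediction_data <- explanatory_data %>%
  mutate(mass_g = intercept + slope * length_cm)
```

Using double square brackets drops the coefficient names, so intercept and slope are plain numbers.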
6. Coefficients for parallel slopes
For the parallel slopes model, there is an added complication. Each category of the categorical variable has a different intercept.
Due to the way we specified the model, the slope coefficient is the first one, followed by one intercept coefficient for each category.
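To see that ordering, here's a small sketch. It assumes the model was specified with "+ 0" after the explanatory variables, and it uses made-up toy data and illustrative variable names.

```r
# Made-up toy data standing in for the fish dataset (not real measurements)
fish <- data.frame(
  species = rep(c("Bream", "Perch", "Pike", "Roach"), each = 3),
  length_cm = c(25, 30, 35, 15, 25, 35, 35, 45, 55, 15, 20, 25),
  mass_g = c(300, 500, 700, 60, 250, 600, 350, 700, 1100, 60, 150, 280)
)

# Assumed specification: numeric variable first, then the category, no global intercept
mdl_mass_vs_both <- lm(mass_g ~ length_cm + species + 0, data = fish)

coeffs <- coefficients(mdl_mass_vs_both)

# The slope comes first, then one intercept per species
slope <- coeffs[[1]]
intercept_bream <- coeffs[[2]]
intercept_perch <- coeffs[[3]]
intercept_pike <- coeffs[[4]]
intercept_roach <- coeffs[[5]]
```

Printing names(coeffs) is a quick way to confirm which position holds which coefficient before you rely on the order.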
7. Choosing an intercept with ifelse()
You can choose the intercept for each row using nested calls to ifelse, but this becomes clunky when you have lots of categories.
Even with just four categories, these nested calls to ifelse are hard to write and hard to read. It's a recipe for buggy code.
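A sketch of what the nested version could look like is below. The toy data is made up, and the variable names and the "+ 0" model specification are assumptions.

```r
library(dplyr)
library(tidyr)

# Made-up toy data standing in for the fish dataset (not real measurements)
fish <- data.frame(
  species = rep(c("Bream", "Perch", "Pike", "Roach"), each = 3),
  length_cm = c(25, 30, 35, 15, 25, 35, 35, 45, 55, 15, 20, 25),
  mass_g = c(300, 500, 700, 60, 250, 600, 350, 700, 1100, 60, 150, 280)
)
coeffs <- coefficients(lm(mass_g ~ length_cm + species + 0, data = fish))
intercept_bream <- coeffs[[2]]
intercept_perch <- coeffs[[3]]
intercept_pike <- coeffs[[4]]
intercept_roach <- coeffs[[5]]

explanatory_data <- expand_grid(
  length_cm = seq(5, 60, 5),
  species = unique(fish$species)
)

# Three nested ifelse calls to cover four categories
with_intercepts <- explanatory_data %>%
  mutate(
    intercept = ifelse(
      species == "Bream", intercept_bream,
      ifelse(
        species == "Perch", intercept_perch,
        ifelse(species == "Pike", intercept_pike, intercept_roach)
      )
    )
  )
```

Note that the last category is handled implicitly by the final "no" branch, which is one of the places a bug can hide.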
8. case_when()
dplyr has a function called case_when that simplifies the code. Each argument to case_when is a formula, just like you use when specifying a model. On the left-hand side, you have a logical condition. On the right-hand side, you have the value to give to those rows where the condition is met.
This is very abstract, so let's look at how we use it for predictions.
9. Choosing an intercept with case_when()
The first argument to case_when has a logical condition to check for rows where the species is Bream. On the right-hand side of the formula, we give those rows the value of the bream intercept.
Then we repeat this for the other species. This code does the same thing as the ifelse code, but I find it easier to write and to read.
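The case_when version might be sketched like this. As before, the toy data is made up, and the variable names, the species other than Bream, and the "+ 0" model specification are assumptions.

```r
library(dplyr)
library(tidyr)

# Made-up toy data standing in for the fish dataset (not real measurements)
fish <- data.frame(
  species = rep(c("Bream", "Perch", "Pike", "Roach"), each = 3),
  length_cm = c(25, 30, 35, 15, 25, 35, 35, 45, 55, 15, 20, 25),
  mass_g = c(300, 500, 700, 60, 250, 600, 350, 700, 1100, 60, 150, 280)
)
coeffs <- coefficients(lm(mass_g ~ length_cm + species + 0, data = fish))
intercept_bream <- coeffs[[2]]
intercept_perch <- coeffs[[3]]
intercept_pike <- coeffs[[4]]
intercept_roach <- coeffs[[5]]

explanatory_data <- expand_grid(
  length_cm = seq(5, 60, 5),
  species = unique(fish$species)
)

# One formula per category: logical condition ~ value for matching rows
with_intercepts <- explanatory_data %>%
  mutate(
    intercept = case_when(
      species == "Bream" ~ intercept_bream,
      species == "Perch" ~ intercept_perch,
      species == "Pike" ~ intercept_pike,
      species == "Roach" ~ intercept_roach
    )
  )
```

Rows that match none of the conditions get a missing value, so a typo in a species name shows up as NA rather than silently taking the last branch.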
10. The final prediction step
The final step is to calculate the response. As before, the response is the intercept plus the slope times the numeric explanatory variable. This time, the intercept is different for different rows.
11. Compare to predict()
The model predicts some negative masses, which isn't a good sign. Let's check that we got the right answer by calling predict.
You can see that the predictions are the same numbers as the mass column that we calculated, so our calculations are correct. It's just that this model performs poorly for small fish lengths.
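Putting the manual steps together with the check against predict, a sketch could look like this. The toy data is made up, so the numbers it produces are not the ones discussed above, and the variable names and the "+ 0" model specification are assumptions.

```r
library(dplyr)
library(tidyr)

# Made-up toy data standing in for the fish dataset (not real measurements)
fish <- data.frame(
  species = rep(c("Bream", "Perch", "Pike", "Roach"), each = 3),
  length_cm = c(25, 30, 35, 15, 25, 35, 35, 45, 55, 15, 20, 25),
  mass_g = c(300, 500, 700, 60, 250, 600, 350, 700, 1100, 60, 150, 280)
)
mdl_mass_vs_both <- lm(mass_g ~ length_cm + species + 0, data = fish)
coeffs <- coefficients(mdl_mass_vs_both)
slope <- coeffs[[1]]

explanatory_data <- expand_grid(
  length_cm = seq(5, 60, 5),
  species = unique(fish$species)
)

prediction_data <- explanatory_data %>%
  mutate(
    # Pick the intercept for each row's species
    intercept = case_when(
      species == "Bream" ~ coeffs[[2]],
      species == "Perch" ~ coeffs[[3]],
      species == "Pike" ~ coeffs[[4]],
      species == "Roach" ~ coeffs[[5]]
    ),
    # Manual prediction: intercept + slope * numeric explanatory variable
    mass_g = intercept + slope * length_cm,
    # The same prediction from predict(), for comparison
    mass_g_predict = unname(predict(mdl_mass_vs_both, explanatory_data))
  )

# TRUE when the two columns agree up to floating point tolerance
predictions_match <- isTRUE(
  all.equal(prediction_data$mass_g, prediction_data$mass_g_predict)
)
```

Comparing with all.equal rather than == avoids false alarms from tiny floating point differences between the two calculations.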
12. Let's practice!
Time for you to make predictions.