1. Making predictions
The big benefit of running models rather than simply calculating descriptive statistics is that models let you make predictions.
2. The fish dataset: bream
Here's the fish dataset again. This time, we'll look only at the bream data. There's a new explanatory variable too: the length of each fish, which we'll use to predict the mass of the fish.
3. Plotting mass vs. length
Here's a scatter plot of mass versus length for the bream data, with a linear trend line.
4. Running the model
Before we can make predictions, we need a model. As before, we call lm with a formula and the dataset. The response, mass in grams, goes on the left-hand side of the formula, and the explanatory variable, length in centimeters, goes on the right. We need to assign the result to a variable to reuse later on.
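As a rough sketch of this step: the data frame bream below is a small made-up stand-in for the course dataset (its values are illustrative only), and the name mdl_mass_vs_length is an assumption chosen for this example.

```r
# Hypothetical stand-in for the bream data (made-up values, not the course dataset)
bream <- data.frame(
  length_cm = c(23.2, 25.4, 26.5, 29.0, 30.0, 31.5, 33.5, 35.0, 38.0),
  mass_g    = c(242, 340, 390, 450, 500, 620, 700, 925, 975)
)

# Response on the left of the formula, explanatory variable on the right
mdl_mass_vs_length <- lm(mass_g ~ length_cm, data = bream)

# Printing the model shows the fitted intercept and slope
print(mdl_mass_vs_length)
```

Assigning the result to a variable is what lets the later steps pass the same fitted model to predict.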
5. Data on explanatory values to predict
The principle behind predicting is to ask questions of the form "if I set the explanatory variables to these values, what value would the response variable have?".
That means that the next step is to choose some values for the explanatory variables. For this model, the only explanatory variable is the length of the fish. I've chosen a vector of lengths from twenty centimeters to forty centimeters.
The explanatory variables need to be stored inside a data frame. Here, I'm using a tibble, which is a data frame variant that's easier to work with. I could also have used a standard data.frame.
6. Call predict()
The next step is to call predict, passing the model object and the data frame of explanatory variables.
predict returns a vector of predictions, one for each row of the explanatory data.
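Here's a minimal sketch of those two steps together, building the explanatory data and calling predict. It assumes the tibble package is installed. The bream values are made up for illustration, and the names explanatory_data and predictions are assumptions for this example.

```r
library(tibble)

# Hypothetical stand-in for the bream data (made-up values, not the course dataset)
bream <- data.frame(
  length_cm = c(23.2, 25.4, 26.5, 29.0, 30.0, 31.5, 33.5, 35.0, 38.0),
  mass_g    = c(242, 340, 390, 450, 500, 620, 700, 925, 975)
)
mdl_mass_vs_length <- lm(mass_g ~ length_cm, data = bream)

# Lengths from 20 cm to 40 cm; the column name must match the one in the formula
explanatory_data <- tibble(length_cm = 20:40)

# predict() returns one value per row of explanatory_data
predictions <- predict(mdl_mass_vs_length, explanatory_data)
print(predictions)
```

The column in the explanatory data has to be called length_cm, because predict looks up the variables by the names used in the model formula.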
7. Predicting inside a data frame
A bare vector of predictions isn't very convenient to program with. It's easier to work with the predictions when they sit in a data frame alongside the explanatory variables.
I've started with the explanatory data, then used mutate to add a new column, named after the response variable, mass_g, and calculated it with the same predict code from the previous slide. The prediction data frame contains the explanatory variable and the predicted response.
Now we can answer questions like "how heavy would we expect a bream with a length of twenty-five centimeters to be?", even though the original dataset didn't include a bream of that exact length. Looking at the prediction data, you can see that the predicted mass is three hundred and twenty-eight grams.
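The mutate approach could be sketched like this, assuming tibble and dplyr are installed. The bream values are made up for illustration, so the numbers this produces will differ from the ones quoted in the course. The name prediction_data is an assumption for this example.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for the bream data (made-up values, not the course dataset)
bream <- data.frame(
  length_cm = c(23.2, 25.4, 26.5, 29.0, 30.0, 31.5, 33.5, 35.0, 38.0),
  mass_g    = c(242, 340, 390, 450, 500, 620, 700, 925, 975)
)
mdl_mass_vs_length <- lm(mass_g ~ length_cm, data = bream)
explanatory_data <- tibble(length_cm = 20:40)

# Add the predictions as a column named after the response variable
prediction_data <- explanatory_data %>%
  mutate(mass_g = predict(mdl_mass_vs_length, explanatory_data))

# Look up the predicted mass for a 25 cm bream
print(filter(prediction_data, length_cm == 25))
```

Naming the new column mass_g keeps the prediction data in the same shape as the original data, which makes it easy to reuse in plots.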
8. Showing predictions
Let's add the predictions we just made to the scatter plot. We add another geom_point layer, and set the data argument to the prediction data frame we just created.
I've colored the points blue to distinguish them from the data points.
Notice that the predictions lie exactly on the trend line.
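A plot along these lines might be built as follows, assuming ggplot2 is installed. The bream values are made up for illustration, and plt and prediction_data are names chosen for this sketch. The prediction data is built with base R here to keep the example short.

```r
library(ggplot2)

# Hypothetical stand-in for the bream data (made-up values, not the course dataset)
bream <- data.frame(
  length_cm = c(23.2, 25.4, 26.5, 29.0, 30.0, 31.5, 33.5, 35.0, 38.0),
  mass_g    = c(242, 340, 390, 450, 500, 620, 700, 925, 975)
)
mdl_mass_vs_length <- lm(mass_g ~ length_cm, data = bream)

# Prediction data: explanatory values plus the predicted response
prediction_data <- data.frame(length_cm = 20:40)
prediction_data$mass_g <- predict(mdl_mass_vs_length, prediction_data)

plt <- ggplot(bream, aes(length_cm, mass_g)) +
  geom_point() +                                       # observed fish
  geom_smooth(method = "lm", se = FALSE) +             # linear trend line
  geom_point(data = prediction_data, color = "blue")   # predictions in blue
print(plt)
```

By default geom_smooth only draws the line across the range of the observed lengths, so predictions beyond that range sit on the straight-line continuation of the trend rather than on the drawn segment.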
9. Extrapolating
All the fish were between twenty-three and thirty-eight centimeters long, but the linear model allows us to make predictions outside that range. This is called extrapolating.
Let's see what prediction we get for a ten-centimeter bream. The code is the same as before, but with a length of ten in the explanatory data frame.
Wow. The predicted mass is almost minus five hundred grams! This is obviously not physically possible, so the model performs poorly here. Extrapolation is sometimes appropriate, but can lead to misleading or ridiculous results. You need to understand the context of your data in order to determine whether it is sensible to extrapolate.
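To see the same effect in code, here's a small sketch using made-up bream values, so the exact prediction will differ from the one in the course. The in_range check is an extra idea added for this example rather than something from the slides: it flags whether the requested length falls inside the lengths the model was fitted on.

```r
# Hypothetical stand-in for the bream data (made-up values, not the course dataset)
bream <- data.frame(
  length_cm = c(23.2, 25.4, 26.5, 29.0, 30.0, 31.5, 33.5, 35.0, 38.0),
  mass_g    = c(242, 340, 390, 450, 500, 620, 700, 925, 975)
)
mdl_mass_vs_length <- lm(mass_g ~ length_cm, data = bream)

# A single length well below anything in the data
little_bream <- data.frame(length_cm = 10)
little_bream$mass_g <- predict(mdl_mass_vs_length, little_bream)

# Flag whether the length lies within the observed range (FALSE means extrapolation)
little_bream$in_range <- little_bream$length_cm >= min(bream$length_cm) &
  little_bream$length_cm <= max(bream$length_cm)

# The predicted mass is negative, which is physically impossible
print(little_bream)
```

A check like this won't tell you whether extrapolating is sensible, but it does make it obvious when a prediction relies on it.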
10. Let's practice!
I predict that you are about to make some predictions.