1. Models for each category
The parallel slopes model enforced the same slope for every category. That's not always the best option.
2. Four categories
Recall that the fish dataset contains four different species of fish.
One way to give each species a different slope is to run four separate models, one per species.
3. Splitting the dataset
First, we split up the fish dataset into four subsets. Let's filter for each species one at a time and assign the result to individual variables.
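Here's a minimal sketch of that step, assuming a pandas DataFrame named fish whose species column contains "Bream", "Perch", "Pike", and "Roach" (the variable and column names are assumptions):

```python
# Filter the assumed `fish` DataFrame once per species.
bream = fish[fish["species"] == "Bream"]
perch = fish[fish["species"] == "Perch"]
pike = fish[fish["species"] == "Pike"]
roach = fish[fish["species"] == "Roach"]
```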
4. Four models
Now that we have four datasets, we can run four models, each predicting mass from length for one species.
Observe that each model gives a different intercept and a different slope.
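Here's a sketch using statsmodels' formula API, assuming mass_g and length_cm as the column names (hypothetical, like the subset variables above); printing each model's params shows the differing coefficients:

```python
from statsmodels.formula.api import ols

# One simple linear regression per species subset (names assumed).
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()
mdl_perch = ols("mass_g ~ length_cm", data=perch).fit()
mdl_pike = ols("mass_g ~ length_cm", data=pike).fit()
mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()

# Each fitted model carries its own intercept and slope.
print(mdl_bream.params)
print(mdl_perch.params)
print(mdl_pike.params)
print(mdl_roach.params)
```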
5. Explanatory data
To make predictions with these models, we first have to create a DataFrame of explanatory variables. The good news is that since each model has the same explanatory variable, you only have to write this code once. Give your copy and paste fingers a rest.
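For example (the grid of lengths below is purely an illustrative choice):

```python
import numpy as np
import pandas as pd

# A single explanatory DataFrame, reused for all four models.
# The range of lengths is an illustrative assumption.
explanatory_data = pd.DataFrame({"length_cm": np.arange(5, 61, 5)})
```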
6. Making predictions
Predicting follows the now familiar flow: add a column with the assign method, name it after the response variable, and call predict on the model with explanatory_data as the argument. The only difference in each case is the model variable, since every species now has its own model coefficients.
It isn't necessary for calculating the predictions, but to make the plotting code you are about to see easier, I've also included the species in each prediction dataset.
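A sketch of this step, continuing with the assumed names from above:

```python
# One prediction DataFrame per model. The species column is not
# needed for the predictions themselves, only for plotting later.
prediction_data_bream = explanatory_data.assign(
    mass_g=mdl_bream.predict(explanatory_data), species="Bream")
prediction_data_perch = explanatory_data.assign(
    mass_g=mdl_perch.predict(explanatory_data), species="Perch")
prediction_data_pike = explanatory_data.assign(
    mass_g=mdl_pike.predict(explanatory_data), species="Pike")
prediction_data_roach = explanatory_data.assign(
    mass_g=mdl_roach.predict(explanatory_data), species="Roach")
```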
7. Concatenating predictions
Working with all these separate prediction DataFrames for each fish species isn't very convenient. Therefore, we combine all four DataFrames in one prediction DataFrame, using the concat function. This basically sticks a list of separate DataFrames back together into one.
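A sketch, assuming the four prediction DataFrames from the previous step:

```python
import pandas as pd

# Stack the four per-species prediction DataFrames into one.
prediction_data = pd.concat([
    prediction_data_bream,
    prediction_data_perch,
    prediction_data_pike,
    prediction_data_roach,
])
```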
The result is a single DataFrame containing the predictions for all four species.
8. Visualizing predictions
To visualize regression models across subsets of a dataset, you can't use the regplot function anymore. Instead, you use seaborn's lmplot function. It takes the usual x, y, and data arguments, plus a hue argument that specifies the variable to subset the data on.
Unlike in the parallel slopes case, setting hue gives each line its own slope.
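A sketch of the call, with the same assumed column names:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# hue="species" fits and draws one trend line per species.
sns.lmplot(x="length_cm", y="mass_g", data=fish,
           hue="species", ci=None)
plt.show()
```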
9. Adding in your predictions
To sanity check our concatenated predictions, we add them to the plot to see if they align with what seaborn's lmplot calculated. As predicted, each line of prediction points follows seaborn's trend lines.
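One way to do that overlay is to call scatterplot after lmplot, so the points land on the same figure; this is a sketch, reusing prediction_data from above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Trend lines per species, then the predicted points on top.
sns.lmplot(x="length_cm", y="mass_g", data=fish,
           hue="species", ci=None)
sns.scatterplot(x="length_cm", y="mass_g", data=prediction_data,
                hue="species", legend=False)
plt.show()
```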
10. Coefficient of determination
An important question here is: are these models better?
The coefficient of determination for a model on the whole fish dataset is 0.92.
Now here's the coefficient of determination for each of the individual models. The pike number is higher, indicating a better fit, though the numbers for the other models are lower.
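These numbers come from the rsquared attribute of each fitted model; here's a sketch, assuming a whole-dataset model named mdl_fish fit with the same formula:

```python
from statsmodels.formula.api import ols

# Whole-dataset model for comparison (assumed names as above).
mdl_fish = ols("mass_g ~ length_cm", data=fish).fit()

print(mdl_fish.rsquared)   # whole dataset
print(mdl_bream.rsquared)
print(mdl_perch.rsquared)
print(mdl_pike.rsquared)
print(mdl_roach.rsquared)
```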
11. Residual standard error
Here's the residual standard error for the whole dataset model, 103.
For the individual models, this time the pike residual standard error is higher, indicating larger differences between actual and predicted values, but the other models show an improvement over the whole dataset model.
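statsmodels has no direct residual standard error attribute, but you can compute it as the square root of mse_resid; a sketch using the models from above:

```python
import numpy as np

# RSE is the square root of the residual mean squared error.
print(np.sqrt(mdl_fish.mse_resid))   # whole dataset
print(np.sqrt(mdl_bream.mse_resid))
print(np.sqrt(mdl_perch.mse_resid))
print(np.sqrt(mdl_pike.mse_resid))
print(np.sqrt(mdl_roach.mse_resid))
```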
This mixed performance is quite common: the whole dataset model benefits from the increased power of more rows of data, whereas each individual model benefits from not having to compromise across groups of data that behave differently.
12. Let's practice!
Let's try this on the housing data.