1. One model with an interaction
Messing about with different models for different bits of your dataset is a pain. A better solution is to specify a single model that contains an intercept and a slope for each category. This is achieved by specifying interactions between explanatory variables.
2. What is an interaction?
To understand the idea of interactions between explanatory variables, consider what we know about the fish dataset. Different fish species have different mass to length ratios.
In statistical terms, we can say that the effect that length has on the expected mass of the fish varies between species. That means that length and species interact.
More generally, if the effect of one explanatory variable on the expected response depends on the value of another explanatory variable, then those two explanatory variables interact.
3. Specifying interactions
You've seen how to include multiple explanatory variables in a formula using plus, for example, length plus species.
To include an interaction between those variables, you simply swap the plus for a times. I've called this syntax implicit because you don't write down which interactions are needed; statsmodels figures that out itself. Usually this concise syntax is best, but occasionally you may wish to explicitly document which interactions are included in the model.
The explicit syntax is to add each explanatory variable separated by plus, then add a third term with both explanatory variables separated by a colon. The result is exactly the same, so choosing a syntax depends on personal preference: do you like brevity or detail?
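As a rough sketch of the two syntaxes, the code below fits both formulas and checks that the coefficients agree. The real fish data isn't reproduced here, so the `fish` DataFrame is a synthetic stand-in with made-up numbers; the column names `mass_g`, `length_cm` and `species` are assumed to match the course dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the fish data: every number here is made up.
rng = np.random.default_rng(0)
species = np.repeat(["Bream", "Perch", "Pike", "Roach"], 30)
length_cm = rng.uniform(20, 50, size=species.size)
slope = pd.Series(species).map(
    {"Bream": 55, "Perch": 38, "Pike": 53, "Roach": 24}
).to_numpy()
mass_g = -200 + slope * length_cm + rng.normal(0, 20, size=species.size)
fish = pd.DataFrame({"species": species, "length_cm": length_cm, "mass_g": mass_g})

# Implicit syntax: "times" expands to both main effects plus their interaction.
mdl_implicit = ols("mass_g ~ length_cm * species", data=fish).fit()

# Explicit syntax: each main effect, then the interaction written with a colon.
mdl_explicit = ols("mass_g ~ length_cm + species + length_cm:species", data=fish).fit()

# Align by coefficient name before comparing, so term order cannot matter.
same_fit = np.allclose(
    mdl_implicit.params,
    mdl_explicit.params.loc[mdl_implicit.params.index],
)
print(same_fit)
```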
4. Running the model
Here's the formula in a model, resulting in eight coefficients. As you saw in the models with a categorical explanatory variable, the coefficients are tricky to understand.
The Intercept coefficient is the intercept for the first species, namely bream. The length_cm coefficient is the slope for bream. Then the Intercept coefficient plus the species[T.Perch] coefficient is the intercept for perch. And the length_cm coefficient plus the length_cm:species[T.Perch] coefficient is the slope for perch. It's a mess.
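To make that arithmetic concrete, here is a small sketch that recovers the perch intercept and slope by adding coefficients. It again uses a synthetic stand-in for the fish data (all values are made up), and it assumes bream is the reference level because species names are sorted alphabetically.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the fish data: every number here is made up.
rng = np.random.default_rng(0)
species = np.repeat(["Bream", "Perch", "Pike", "Roach"], 30)
length_cm = rng.uniform(20, 50, size=species.size)
slope = pd.Series(species).map(
    {"Bream": 55, "Perch": 38, "Pike": 53, "Roach": 24}
).to_numpy()
mass_g = -200 + slope * length_cm + rng.normal(0, 20, size=species.size)
fish = pd.DataFrame({"species": species, "length_cm": length_cm, "mass_g": mass_g})

params = ols("mass_g ~ length_cm * species", data=fish).fit().params

# Bream is the reference level, so its values are read off directly.
bream_intercept = params["Intercept"]
bream_slope = params["length_cm"]

# Every other species is expressed as an offset from bream.
perch_intercept = params["Intercept"] + params["species[T.Perch]"]
perch_slope = params["length_cm"] + params["length_cm:species[T.Perch]"]

print(bream_intercept, bream_slope)
print(perch_intercept, perch_slope)
```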
5. Easier to understand coefficients
Ironically, to get easier to understand coefficients, we need to make the formula harder to read.
This is the same model specified differently. On the right-hand side of the formula, you can see the categorical explanatory variable, species, then an interaction between the two explanatory variables, then zero to remove the global intercept.
Now we get an intercept coefficient for each species and a slope coefficient for each species.
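A minimal sketch of that formula follows, once more on a synthetic stand-in for the fish data with made-up numbers. The coefficient names in the comments are what the default formula engine in statsmodels typically produces; treat them as an assumption if your version differs.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the fish data: every number here is made up.
rng = np.random.default_rng(0)
species = np.repeat(["Bream", "Perch", "Pike", "Roach"], 30)
length_cm = rng.uniform(20, 50, size=species.size)
slope = pd.Series(species).map(
    {"Bream": 55, "Perch": 38, "Pike": 53, "Roach": 24}
).to_numpy()
mass_g = -200 + slope * length_cm + rng.normal(0, 20, size=species.size)
fish = pd.DataFrame({"species": species, "length_cm": length_cm, "mass_g": mass_g})

# species            -> one intercept per species
# species:length_cm  -> one slope per species
# + 0                -> drop the global intercept
mdl = ols("mass_g ~ species + species:length_cm + 0", data=fish).fit()

# Expect names like species[Bream] (intercept) and species[Bream]:length_cm (slope).
print(mdl.params)
```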
6. Familiar numbers
You've seen all these coefficient values before. If we examine the coefficients from the model on the bream data from the previous video, you can see that the intercept and slope are the same as in the model we just made.
The same is true for the other three species. In fact, the model with the interaction is effectively the same as fitting separate models for each category, only you get the convenience of not having to manage four sets of code.
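One way to check this for yourself is sketched below: fit one model per species, fit the single interaction model, and put the coefficients side by side. The data is a synthetic stand-in with made-up numbers, so the values won't match the ones on the slides, but the two sets of columns should agree with each other.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the fish data: every number here is made up.
rng = np.random.default_rng(0)
species = np.repeat(["Bream", "Perch", "Pike", "Roach"], 30)
length_cm = rng.uniform(20, 50, size=species.size)
slope = pd.Series(species).map(
    {"Bream": 55, "Perch": 38, "Pike": 53, "Roach": 24}
).to_numpy()
mass_g = -200 + slope * length_cm + rng.normal(0, 20, size=species.size)
fish = pd.DataFrame({"species": species, "length_cm": length_cm, "mass_g": mass_g})

# One model with an interaction, giving an intercept and slope per species.
joint = ols("mass_g ~ species + species:length_cm + 0", data=fish).fit().params

rows = []
for name, group in fish.groupby("species"):
    # A separate simple regression for this species only.
    separate = ols("mass_g ~ length_cm", data=group).fit().params
    rows.append({
        "species": name,
        "intercept_separate": separate["Intercept"],
        "intercept_joint": joint[f"species[{name}]"],
        "slope_separate": separate["length_cm"],
        "slope_joint": joint[f"species[{name}]:length_cm"],
    })

comparison = pd.DataFrame(rows)
print(comparison)
```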
7. Let's practice!
Let's interact with some exercises!