Model formula
1. Model formula
In chapter one we touched on the notion of formulas when fitting a GLM, and throughout the course you learned how to use the formula for continuous explanatory variables. In practice, however, you will often want to include categorical, interaction, or transformation terms. Let's see how to go about doing this.
2. Formula and model matrix
The model formula is the heart of the modeling function and is directly related to the model matrix. Let's see how. Starting with the y and X data,
3. Formula and model matrix
we can write the model formula directly in the glm function.
4. Formula and model matrix
Under the hood, the patsy package is used to construct the model matrix
5. Formula and model matrix
which is actually used in the fit call. Notice how the intercept is added silently.
6. Model matrix
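A minimal sketch of inspecting a model matrix with patsy's dmatrix function, passing only the right-hand side of the formula. The small x1 and x2 arrays are made-up values for illustration.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical explanatory variables.
data = {"x1": np.array([1.0, 2.0, 3.0]), "x2": np.array([4.0, 5.0, 6.0])}

# Only the right-hand side of y ~ x1 + x2 is needed to build the model matrix.
X = dmatrix("x1 + x2", data)

column_names = X.design_info.column_names
print(column_names)   # ['Intercept', 'x1', 'x2']
print(np.asarray(X))  # first column is all ones: the silently added intercept
```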
The model matrix, denoted by a bold capital X, contains all the variables to be included in the model fit. For example, if we write the formula y ~ x1 + x2, the model matrix includes the intercept term and the two variables. To see the model matrix, we use the dmatrix function from the patsy package with only the right-hand side (RHS) of the formula. While we are predominantly concerned with the formula specification, viewing the model matrix is always advisable, especially for more complex model structures.
7. Variable transformation
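A minimal sketch of a log transformation written inside the formula, assuming a made-up positive variable x1. numpy is imported as np so that the name np.log can be resolved when the formula is evaluated.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical positive-valued variable: 1, e, e^2.
data = {"x1": np.array([1.0, np.e, np.e ** 2])}

# np.log is applied to x1 while the model matrix is being built.
X_log = dmatrix("np.log(x1)", data)

print(X_log.design_info.column_names)  # ['Intercept', 'np.log(x1)']
print(np.asarray(X_log))               # second column holds 0, 1, 2
```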
Oftentimes it is beneficial to transform a variable. We can take the logarithm of a continuous variable inside the formula argument. Note that numpy needs to be imported.
8. Centering and standardization
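A minimal sketch of patsy's built-in center and standardize transforms, and of what "stateful" means in practice. The training and new values are invented for this example; build_design_matrices is used here as one way to reapply the stored state to new data.

```python
import numpy as np
from patsy import dmatrix, build_design_matrices

# Hypothetical training data with mean 2.5.
train = {"x1": np.array([1.0, 2.0, 3.0, 4.0])}
X_train = dmatrix("center(x1) + standardize(x1)", train)
print(X_train.design_info.column_names)
print(np.asarray(X_train))

# Reuse the stored design information on new data: the training mean (2.5)
# is subtracted, not the mean of the new values.
new = {"x1": np.array([2.5, 5.0])}
(X_new,) = build_design_matrices([X_train.design_info], new)
print(np.asarray(X_new))
```

The centered column for the new data is 0 and 2.5, which shows that the mean remembered from the original data was applied.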
Centering or standardizing variables can also be done directly in the formula argument. Centering subtracts the mean, and standardizing subtracts the mean and divides by the sample standard deviation. These are stateful transforms: they remember the state of the original data, which can then be applied to new data.
9. Build your own transformation
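A minimal sketch of a user-defined transformation called from inside the formula. The function name my_transformation and the choice of squaring are hypothetical; any function visible where dmatrix is called could be used the same way.

```python
import numpy as np
from patsy import dmatrix

def my_transformation(x):
    """Hypothetical transformation: square each value."""
    return x ** 2

# Made-up data for illustration.
data = {"x1": np.array([1.0, 2.0, 3.0])}

# patsy looks up my_transformation in the calling environment.
X_custom = dmatrix("my_transformation(x1)", data)
print(X_custom.design_info.column_names)
print(np.asarray(X_custom))
```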
You can also define your own transformation and apply it directly.
10. Arithmetic operations
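A minimal sketch contrasting x1 + x2 as formula terms with I(x1 + x2) as arithmetic, using made-up arrays. The last lines show the list pitfall in plain Python, outside of patsy.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical data stored as numpy arrays.
data = {"x1": np.array([1.0, 2.0, 3.0]), "x2": np.array([10.0, 20.0, 30.0])}

# Without I(), + is a formula operator: two separate columns, x1 and x2.
X_separate = dmatrix("x1 + x2", data)

# Inside I(), + is ordinary arithmetic: one column holding x1 + x2.
X_sum = dmatrix("I(x1 + x2)", data)
print(X_sum.design_info.column_names)
print(np.asarray(X_sum))

# With plain Python lists, + concatenates instead of adding element-wise.
list_result = [1.0, 2.0, 3.0] + [10.0, 20.0, 30.0]
print(list_result)
```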
Direct arithmetic operations are also possible, but we need to wrap them. To model the sum of x1 and x2 instead of the individual values, we need to enclose them in the I function. From the output, we can see that we obtain the intercept and only one explanatory variable. Note, however, that if the variables are standard Python lists, the plus sign performs concatenation and not element-wise addition.
11. Coding the categorical data
Categorical data can be a bit trickier than what we have covered so far, since they usually contain two or more groups, which are represented by encoding them. For example, given 3 groups for color
12. Coding the categorical data
with the measured values green, red, red, etc.,
13. Coding the categorical data
the values are encoded into indicator variables, each taking the value 0 or 1. One group is chosen as the reference group, and the model estimates are then measured by how much they differ from it, i.e. the coefficients give the difference in means between each group and the reference group.
14. Patsy coding
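A minimal sketch of indicator coding for a string variable with three groups. The third color, blue, and the particular sequence of values are assumptions for illustration; patsy orders the levels alphabetically here, so blue becomes the reference group.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical measurements of a three-level color variable.
data = {"color": ["green", "red", "red", "blue", "green", "blue"]}

# Strings are coded automatically, without wrapping them in C().
X_color = dmatrix("color", data)

print(X_color.design_info.column_names)  # ['Intercept', 'color[T.green]', 'color[T.red]']
print(np.asarray(X_color))               # rows for blue have 0 in both indicator columns
```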
Strings and booleans are coded automatically, whereas for other categorical data we use the C function. By default, the first group is used as the reference, which can be changed using the Treatment or levels argument.
15. The C() function
The color variable in the crab dataset is defined as an integer, so the model matrix would treat it as a number. However, using the value_counts function, we see there are 4 levels.
16. The C() function
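A minimal sketch of the difference between an integer column used as is and the same column wrapped in C(). The small data frame is a hypothetical stand-in for the crab dataset, with color values 1 to 4 chosen for illustration.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Hypothetical stand-in for the crab data: color stored as integers.
crab = pd.DataFrame({"color": [1, 2, 3, 4, 2, 3]})
print(crab["color"].value_counts())

# As an integer, color enters the model matrix as a single numeric column.
X_numeric = dmatrix("color", crab)
print(X_numeric.design_info.column_names)

# Wrapped in C(), the 4 levels become 3 indicator columns plus the intercept.
X_coded = dmatrix("C(color)", crab)
print(X_coded.design_info.column_names)
print(np.asarray(X_coded))
```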
Applying the C function codes the color variable. Since color has 4 different groups, it is represented with 3 columns and a reference group. The mean behavior of the reference group is given by the intercept. The 3 columns have values of 0 or 1.
17. Changing the reference group
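A minimal sketch of choosing the reference group with Treatment, again on a hypothetical stand-in for the crab data. Level 4 is picked as the reference purely as an example.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Hypothetical stand-in for the crab data.
crab = pd.DataFrame({"color": [1, 2, 3, 4, 2, 3]})

# Treatment(reference=4) makes level 4 the reference group instead of level 1.
X_ref = dmatrix("C(color, Treatment(reference=4))", crab)
print(X_ref.design_info.column_names)
print(np.asarray(X_ref))  # the row with color 4 has zeros in all indicator columns
```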
Using Treatment inside the C function and declaring the level changes the reference group.
18. Changing the reference group
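A minimal sketch of the levels argument, assuming the same hypothetical stand-in for the crab data. Listing level 4 first makes it the reference group, since the first level is the default reference.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Hypothetical stand-in for the crab data.
crab = pd.DataFrame({"color": [1, 2, 3, 4, 2, 3]})

# The order given in levels decides which group comes first, i.e. the reference.
X_levels = dmatrix("C(color, levels=[4, 1, 2, 3])", crab)
print(X_levels.design_info.column_names)
print(np.asarray(X_levels))  # the row with color 4 has zeros in all indicator columns
```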
Similarly, you can use levels inside the C function for a different coding of the reference group.
19. Multiple intercepts
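A minimal sketch of removing the common intercept with -1, using the same hypothetical stand-in for the crab data. Each of the 4 color groups then gets its own indicator column.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Hypothetical stand-in for the crab data.
crab = pd.DataFrame({"color": [1, 2, 3, 4, 2, 3]})

# - 1 drops the Intercept column, so all 4 levels are kept as indicators.
X_groups = dmatrix("C(color) - 1", crab)
group_columns = X_groups.design_info.column_names
print(group_columns)
print(np.asarray(X_groups))  # exactly one 1 per row
```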
To estimate an intercept for each group, we add -1 to the formula, which removes the common intercept.
20. Let's practice!
Time for some practice problems.