1. How to fit a GLM in Python?
Now that you understand the building blocks of GLMs, it is time to learn how to fit a GLM in Python.
2. statsmodels
The starting point is the statsmodels library, which is used for statistical and econometric analysis. We import the library as statsmodels.api. Since version 0.5.0, a formula-based interface is also supported, which we can import as statsmodels.formula.api, or we can import the glm function directly from statsmodels.formula.api.
3. Process of model fit
To fit a model we first need to describe the model using the model class glm. Then the method fit is used to fit the model. Very detailed results of the model fit can be analyzed via the summary method, and finally, we can compute predictions using the predict method.
4. Describing the model
There are two ways to describe the model, using formulas or arrays. If you are familiar with the R language, then you will appreciate the ability to fit a GLM using R-style formulas. statsmodels uses the patsy package to convert formulas and data to the matrices which are then used in model fitting. Note that if you are using the array-based method, the intercept is not included by default. You can add it using the add_constant function. For this course, we will use the formula-based method. The main arguments are formula, data, and family.
5. Formula Argument
The formula is at the heart of the modeling function, where the response or output is modeled as a function of the explanatory variables or the inputs. Each explanatory variable is specified and separated with a plus sign. Note that the formula needs to be enclosed in quotation marks. There are different ways we can represent explanatory variables in the model. Categorical variables are wrapped in capital C, as in C(x). The intercept is removed with minus one. Interaction terms can be written in two ways depending on the need: the colon adds only the interaction term, whereas the multiplication symbol adds the interaction term together with the individual variables. We will see how this works in chapter 4. Lastly, we can also add transformations of the variables directly in the formula.
6. Family Argument
Family distributions are in the families namespace. Here we list only 3, which we will use in this course. The default link function is denoted in parentheses, but you could choose other link functions available for each distribution. However, if you choose to use a non-default link function, you have to specify it explicitly.
7. Summarizing the model
To view the results of the model fit we use the summary method.
8. Summarizing the model
The summary provides the main information on the model fit: the model description, model statistics such as log-likelihood and deviance, and the estimated model parameters with their corresponding statistics. The estimated parameters are given in the coef column, along with their standard errors, z scores, p-values and 95% confidence intervals.
9. Regression coefficients
To view only the regression coefficients, we can use the params attribute of the fitted model. Similarly, the confidence intervals for the parameter estimates can be obtained by calling conf_int. The default is 95%, which you can change using the alpha argument. With the cols argument, you can specify which confidence intervals to return.
10. Predictions
When doing predictive modeling, your final goal is to compute and assess predictions given the fitted model and test data. The first step is to specify the test data, which should contain all the variables you have included in the final model. Note that if you don't specify test data, predict uses the data with which the model was fit. Final predictions are computed with predict.
11. Let's practice!
Now let's try these functions out!