Multivariable logistic regression
1. Multivariable logistic regression
Hello and welcome to the final chapter on GLMs. In previous chapters, you learned about logistic and Poisson regression models, where we used the univariable setting to develop theoretical and practical concepts. While this suffices for those goals, in practice we are often presented with a number of potential explanatory variables and have to choose the ones which will provide the best model fit. Enter multivariable logistic regression.

2. Multivariable setting
Recall a GLM form with one explanatory variable, x1,

3. Multivariable setting
for which we estimated the beta1 coefficient along with the coefficient beta0 for the intercept.

4. Multivariable setting
Now consider that you have additional variables x2 to xp. Extending the initial model to p variables would be given as follows, adding the new variables to our model formula.
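For reference, writing p for the probability that y equals 1, the extended model on the logit scale described above is:

\text{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p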
5. Multivariable setting

As in the univariable setting, each new variable has an associated estimated parameter, so we extend the betas to p corresponding terms as well. As before, each beta provides information on the effect of its x on the log odds that y equals 1, while controlling for the other xs. Similarly, the exponential of beta is the multiplicative effect on the odds of a 1-unit increase in x, assuming fixed levels of the other xs. In Python, using the glm function, we add the new variables to the formula in additive order. The data and family arguments remain the same.
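As a minimal sketch of that call (the data frame name df and the variable names y, x1, x2 and x3 are placeholders, not from the course data):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Explanatory variables are added to the formula with plus signs
model = smf.glm(formula='y ~ x1 + x2 + x3',
                data=df,
                family=sm.families.Binomial()).fit()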
6. Example - well switching

Let's revisit the well-switching example, where we now model the probability of a well switch with both distance and arsenic level as explanatory variables. The model summary output provides information for all model variables.
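A sketch of this fit, assuming the data sit in a pandas DataFrame called wells with columns switch, distance and arsenic (the actual names in the course data may differ):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Logistic regression of switch on distance and arsenic level
wells_fit = smf.glm(formula='switch ~ distance + arsenic',
                    data=wells,
                    family=sm.families.Binomial()).fit()

# Coefficients, standard errors and p-values for all model variables
print(wells_fit.summary())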
7. Example - well switching

We can see that both coefficients are statistically significant, with p-values less than 5 percent. Furthermore, the signs of the coefficients are logical: switching the well is less likely the farther away the nearest safe well is, and a household whose current well is high in arsenic should be more inclined to switch. The coefficient values tell us that a one-unit change in distance to the nearest safe well corresponds to a negative difference of 0.89 in the logit, while a one-unit change in arsenic level corresponds to a positive difference of 0.46 in the logit.
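Since the exponential of a coefficient is its multiplicative effect on the odds, a quick follow-up to the hypothetical sketch above is to exponentiate the fitted parameters:

import numpy as np

# Multiplicative effect on the odds of a 1-unit increase in each variable,
# holding the other variable fixed
print(np.exp(wells_fit.params))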
8. Impact of adding a variable

So what is the impact on the coefficients of adding another variable to the original model, which had only distance as an explanatory variable? Comparing the coefficient of the distance variable, we notice a change from negative 0.62 to negative 0.89. This is because the further the current well is from the nearest safe well, the higher the level of arsenic in the current well is likely to be.
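One way to see this shift, continuing the hypothetical wells sketch, is to fit the model with and without arsenic and compare the distance coefficients:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Univariable model: distance only
fit_dist = smf.glm(formula='switch ~ distance',
                   data=wells,
                   family=sm.families.Binomial()).fit()

# Multivariable model: distance and arsenic
fit_both = smf.glm(formula='switch ~ distance + arsenic',
                   data=wells,
                   family=sm.families.Binomial()).fit()

# Compare the estimated distance coefficients from the two models
print(fit_dist.params['distance'], fit_both.params['distance'])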
9. Multicollinearity

Another important concept is multicollinearity, which occurs when explanatory variables are correlated with each other. Visually, we can see that the higher the correlation, the more structure is present. Including highly correlated variables leads to inflation of the standard errors, which can result in coefficients not being statistically significant.

10. Presence of multicollinearity?
We can check for multicollinearity by analyzing each coefficient's p-value and standard error, checking whether adding or removing a variable significantly changes the other coefficients, checking whether the sign of each coefficient is logical, and checking whether there is significant correlation between the model variables.
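For the last check, a one-liner on the hypothetical wells data shows the pairwise correlation between the explanatory variables:

# Correlation between the explanatory variables in the model
print(wells[['distance', 'arsenic']].corr())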
11. Variance inflation factor (VIF)

The most widely used diagnostic for multicollinearity is the variance inflation factor (VIF) of each explanatory variable. It describes how inflated the variance of a coefficient is compared to what it would be if the variable were not correlated with any other variable in the model. A general threshold can be set at a value of 2.5. In Python, we can compute the VIF directly using the variance_inflation_factor function from the statsmodels library, as you will see in the following exercises.
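A sketch of that computation, again on the hypothetical wells data (variance_inflation_factor takes the full design matrix, including the intercept column, and the index of the column to evaluate):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix with an intercept column, matching the fitted model
X = sm.add_constant(wells[['distance', 'arsenic']])

# VIF for each column of the design matrix
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)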
12. Let's practice!

In the following exercises, you will test for multicollinearity in the multivariable setting.