1. The Problem of Overdispersion
In the first video of this chapter, you learned about the Poisson distribution and its parameter lambda, which represents both the mean and the variance. Using this assumption, we fitted Poisson regression models. In this video, you will learn the effects of violating this assumption and what measures you can take to remedy the problem.
2. Understanding the data
So, what if the variance is not equal to the mean? Since counts can range from 0 to infinity, their variance need not equal their mean. Consider the crab data we used previously. Visually, we can already see that the mean will be much smaller than the variance. Using the mean and var functions in Python, we find that the variance is more than 3 times larger than the mean.
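As a minimal sketch of this check, the snippet below uses synthetic overdispersed counts drawn from a negative binomial distribution as a stand-in for the crab satellite counts, since the crab dataset itself is not bundled here:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic overdispersed counts (stand-in for the crab satellite counts):
# a negative binomial draw has variance larger than its mean.
counts = rng.negative_binomial(n=2, p=0.4, size=200)

# Compare the sample mean and variance
print(counts.mean())
print(counts.var())
```

On real data you would apply the same mean and var calls to the count column of interest.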
3. Mean not equal to variance
This effect is called overdispersion. Note that overdispersion is not an issue in linear models with a normally distributed response variable, since a separate parameter describes the variability in the data. The opposite effect, underdispersion, occurs when the variance is less than the mean, but this is rare in actual data. Overdispersion usually results in incorrectly small standard errors and p-values for the model coefficients, so we have to be careful when interpreting model results.
4. How to check for overdispersion?
We can check for the presence of overdispersion using the Pearson statistic and the degrees of freedom of the fitted model. The model summary reports both the Pearson chi-squared statistic and the degrees of freedom of the residuals.
5. Compute estimated overdispersion
To estimate the overdispersion, we compute the ratio of the Pearson statistic to the residual degrees of freedom. We can obtain the Pearson statistic from the pearson underscore chi-squared attribute of the model fit, and similarly, the df underscore resid attribute provides the degrees of freedom of the residuals. The decision is based on whether the ratio is greater than 1. Namely, if the ratio is close to 1, it is assumed the data is drawn from the Poisson distribution; if it is smaller than 1, it implies underdispersion; and if it is greater than 1, it implies overdispersion. Note that this is an approximation, and there is no fixed threshold above which overdispersion is formally declared.
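The ratio check can be sketched as follows, again on synthetic overdispersed data in place of the crab dataset; the predictor name width mirrors the crab data but the values here are simulated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Synthetic stand-in for the crab data: a predictor and overdispersed counts
df = pd.DataFrame({"width": rng.normal(26, 2, size=200)})
df["sat"] = rng.negative_binomial(n=2, p=0.4, size=200)

# Fit a Poisson GLM
model = smf.glm("sat ~ width", data=df, family=sm.families.Poisson()).fit()

# Estimated overdispersion: Pearson chi-squared over residual degrees of freedom;
# a ratio well above 1 suggests overdispersion.
ratio = model.pearson_chi2 / model.df_resid
print(ratio)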
6. Negative Binomial Regression
To account for overdispersion in the data, we can fit a negative Binomial regression model, where the negative Binomial is a generalization of the Poisson distribution. The negative Binomial uses an additional parameter, alpha, the dispersion parameter, which specifies the degree to which the distribution's variance exceeds its mean. Notice that as alpha goes to zero, the variance approaches the mean, recovering the Poisson case.
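In the parameterization used by statsmodels, the negative binomial variance is mean plus alpha times mean squared, which a tiny helper can illustrate; the function name nb_variance is just for illustration:

```python
# Negative binomial variance: Var(Y) = mu + alpha * mu**2
def nb_variance(mu, alpha):
    return mu + alpha * mu**2

print(nb_variance(3.0, 0.0))  # 3.0  -> equals the mean, the Poisson case
print(nb_variance(3.0, 1.0))  # 12.0 -> variance exceeds the mean
```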
7. GLM negative Binomial in Python
Using the already familiar glm function from the statsmodels library, you can fit a negative Binomial regression by changing the family from Poisson to negative Binomial. We now have an additional parameter, alpha, which we can specify; its default value is 1. As we noted on the previous slide, if you specify a value that is too low, the dispersion is under-accounted for and the estimates will still not be properly computed. Always check by computing the estimated overdispersion after fitting.
8. Let's practice!
Now let's work through some exercises to see how this applies to practical data problems.