1. Assumptions of multiple logistic regression
So far, you've learned about multiple logistic regression. Like any statistical method, these models have important assumptions, which you'll learn about in this section. You'll also see how some of these assumptions apply to multiple Poisson regression.
2. Assumptions
Although this section focuses on logistic regression, these assumptions also apply to Poisson regression and other GLMs.
First, Simpson's paradox occurs when an important predictor is omitted during model construction.
Second, GLMs assume that the predictors have linear and monotonic effects.
Third, both the predictors and the response observations should be independent.
Fourth, overdispersion can occur, where the data show more variability than the model allows.
3. Example Simpson's paradox
Graphically, Simpson's paradox is easiest to show with a linear example.
In this case, there are two groups: A and B. If you ignored the groups and fit a single trendline, you would conclude that x causes y to decrease. However, this is not true.
If you fit a trendline with a separate intercept for each of groups A and B, you can see that x causes y to increase, as the sketch below demonstrates.
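To make this concrete, here is a minimal sketch in R using simulated (hypothetical) data: ignoring the group gives a negative slope, while including a per-group intercept recovers the positive one.

```r
# Simpson's paradox with simulated data: group A sits higher and to the
# left of group B, but within each group y increases with x.
set.seed(42)
n <- 50
group <- rep(c("A", "B"), each = n)
x <- c(rnorm(n, mean = 2), rnorm(n, mean = 8))
y <- ifelse(group == "A", 10, 0) + x + rnorm(2 * n, sd = 0.5)

coef(lm(y ~ x))          # pooled fit: the slope on x is negative
coef(lm(y ~ x + group))  # per-group intercepts: the slope on x is positive
```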
4. Simpson's paradox
Simpson's paradox can also occur with GLMs. The main point of Simpson's paradox is that an important predictor was missed. Furthermore, the inclusion of this predictor would have changed the results of the model. Although this is easy to see with linear data, it is more difficult to plot with GLMs.
5. Simpson's paradox and admission data
However, it is important for GLMs, and R comes with a built-in dataset, UCBAdmissions, to demonstrate it. In the 1970s, the University of California, Berkeley was concerned that it was rejecting graduate school candidates disproportionately based upon gender. However, admission rates varied greatly by department. In the exercises, you'll get to explore Simpson's paradox with this data. This assumption is difficult to test without understanding the system being modeled, especially if you have not collected the missing variable.
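As a minimal sketch of what you'll do in the exercises, here is one way to fit the UCBAdmissions data with and without department; expanding the table to a data frame and using Freq as weights is one possible approach, not the only one.

```r
# UCBAdmissions is a built-in 3-way table: Admit x Gender x Dept.
ucb <- as.data.frame(UCBAdmissions)

# Ignoring department, gender appears to matter.
fit_gender <- glm(Admit == "Admitted" ~ Gender,
                  weights = Freq, family = binomial, data = ucb)

# Including department shrinks (and flips the sign of) the gender effect.
fit_dept <- glm(Admit == "Admitted" ~ Gender + Dept,
                weights = Freq, family = binomial, data = ucb)

coef(fit_gender)
coef(fit_dept)
```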
6. Linear and monotonic
Another important assumption is that the data is both linear and monotonic.
Linear means the response changes with the predictor at a constant rate, like a straight line.
Monotonic means the data always increases or always decreases.
The top plot is both linear and monotonic.
The middle plot is monotonic, but not linear.
The bottom plot is neither linear nor monotonic.
This assumption can often be checked with careful plots of data.
During the exercise, you'll get to preview how to fit non-linear models in R.
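As a preview, one simple way to relax the linearity assumption is to add a polynomial term. This sketch uses simulated (hypothetical) Poisson counts whose log-mean is curved in x.

```r
set.seed(1)
x <- runif(200, 0, 3)
y <- rpois(200, lambda = exp(0.5 + 1.2 * x - 0.3 * x^2))  # curved log-mean

fit_linear <- glm(y ~ x, family = poisson)
fit_quad   <- glm(y ~ x + I(x^2), family = poisson)

# A likelihood-ratio test: does the quadratic term improve the fit?
anova(fit_linear, fit_quad, test = "Chisq")
```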
7. Independence
Another important assumption is that the predictors and response variables are independent.
If predictors are not independent of each other, the coefficient estimates become unstable and hard to interpret: adding or dropping one predictor can change the estimates for the others.
If the response observations are not independent, you need to think about the focus of your model: are you looking at individuals or groups?
For example, if modeling test scores, are you looking at individual-level results, school-level results, or district-level results?
The DataCamp course on Hierarchical models covers this topic.
The assumption of independence requires insight into the system being studied and careful examination of the variables.
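For the predictors, one quick (if incomplete) diagnostic is a correlation matrix. This sketch uses simulated variables, where x1 and x2 are deliberately made dependent.

```r
set.seed(7)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.2)  # nearly a copy of x1
x3 <- rnorm(100)

# x1 and x2 show a correlation near 1; they are far from independent.
round(cor(cbind(x1, x2, x3)), 2)
```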
8. Overdispersion
Last, data can be overdispersed.
A binomial model can have too many zeros or ones.
A Poisson model can have too many zeros or a variance that is larger than its mean.
Both types of GLM can also have a variance that changes across observations.
Violations of these assumptions can often be found by examining plots of the data.
However, addressing these violations is beyond the scope of this course.
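Detecting overdispersion is within reach, though: for a Poisson model, a residual deviance much larger than the residual degrees of freedom is a warning sign. This sketch simulates (hypothetical) negative binomial counts, which are more variable than a Poisson allows.

```r
set.seed(11)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(1 + 0.5 * x), size = 1)  # overdispersed counts

fit <- glm(y ~ x, family = poisson)

# A rough check: this ratio should be near 1 for a well-specified Poisson.
deviance(fit) / df.residual(fit)
```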
9. Let's practice!
Now that you've learned about the assumptions of GLMs, see them in action for yourself in the exercises!