Simpson's Paradox

1. Simpson's Paradox

This chapter looked at the difference between fitting one model to a whole dataset and fitting individual models to each category. For some datasets, this can lead to a counterintuitive result known as Simpson's paradox.

2. A most ingenious paradox!

Simpson's paradox occurs when the trend shown by a model fitted to the whole dataset is very different from the trends shown by models fitted to subsets of the data. That's pretty abstract, so let's try an example.

3. Synthetic Simpson data

Here's a synthetic (that is, made-up) dataset designed to demonstrate the paradox. Each row has an x and y coordinate, and the dataset is split into five groups, labeled A to E.
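The course provides this dataset ready-made, but if you want to experiment on your own, here's a minimal sketch of how data like this could be generated. The DataFrame name simpsons_paradox, the group centers, and the noise levels are illustrative assumptions, not the course's actual recipe.

```python
# Sketch of a Simpson's-paradox dataset: group centers drift up and to
# the right, but within each group y falls as x rises.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

frames = []
for i, group in enumerate(["A", "B", "C", "D", "E"]):
    # Each group's point cloud sits further up and to the right...
    x = rng.normal(loc=i, scale=0.5, size=50)
    # ...but within a group, y decreases as x increases (slope -1 plus noise).
    y = 2 * i - x + rng.normal(scale=0.25, size=50)
    frames.append(pd.DataFrame({"x": x, "y": y, "group": group}))

simpsons_paradox = pd.concat(frames, ignore_index=True)
print(simpsons_paradox.head())
```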

4. Linear regressions

Fitting a linear regression of y versus x to the whole dataset shows a positive slope of one point seven five. However, fitting a model that includes the group and an interaction shows something completely different. The bottom row of coefficients contains the slope for each group: every group has a negative slope, apparently contradicting the positive slope for the whole dataset.
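In code, the two models might look something like this with statsmodels' formula interface. It assumes the data sits in a DataFrame with columns x, y, and group; the name simpsons_paradox carries over from the sketch above.

```python
# Sketch of the two regressions, assuming a DataFrame named
# simpsons_paradox with columns x, y, and group.
from statsmodels.formula.api import ols

# Whole-dataset model: a single intercept and a single slope for x.
mdl_whole = ols("y ~ x", data=simpsons_paradox).fit()
print(mdl_whole.params)

# Grouped model: "group + x:group + 0" fits one intercept per group and
# one slope per group; the x:group coefficients, printed last, are the
# group-wise slopes.
mdl_by_group = ols("y ~ group + x:group + 0", data=simpsons_paradox).fit()
print(mdl_by_group.params)
```

Let's visualize the dataset to try to reconcile these conflicting coefficients.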

5. Plotting the whole dataset

This is the now-standard scatter plot with a linear regression trend line, drawn by regplot. As x increases, so does y, resulting in a positive slope over the whole dataset.
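A call along these lines would reproduce the plot; the DataFrame name is the same illustrative one as before.

```python
# Scatter plot plus a single trend line for the whole dataset.
import matplotlib.pyplot as plt
import seaborn as sns

sns.regplot(x="x", y="y", data=simpsons_paradox, ci=None)
plt.show()
```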

6. Plotting by group

Amending the plot to color the lines by group shows that within each group, y decreases as x increases.
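One way to get a separate colored trend line per group, though not necessarily how the course draws its slide, is to switch from regplot to lmplot and pass a hue argument.

```python
# lmplot with hue fits and colors a separate regression line per group.
import matplotlib.pyplot as plt
import seaborn as sns

sns.lmplot(x="x", y="y", data=simpsons_paradox, hue="group", ci=None)
plt.show()
```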

7. Reconciling the difference

One moral of this story is that it helps to visualize your dataset, especially when different models give conflicting results. The common advice on how to choose the best model is correct but annoying: it depends on the dataset and on what question you are trying to answer. A useful corollary is that you should decide on a question to answer before you start fitting models.

8. Test score example

Thinking up examples where the grouped model is best is fairly easy. Here's the same synthetic dataset as before, with different axis labels. If x is the number of hours spent playing games each month, and y is the score on a test, modeling the whole dataset suggests that playing more games is related to a higher test score. Revealing that each group represents the age of the child taking the test changes the interpretation: older children score higher on the test, and playing lots of games is related to a lower score.

9. Infectious disease example

Coming up with examples where the model of the whole dataset is more useful than the model split by group is harder. One example: for an infectious disease, the infection rate tends to be higher where the population density is higher. In this plot, each point represents a neighborhood in a city. Splitting by city reveals that the highest-density areas of each city have lower infection rates, though this may be due to factors not included in the model, such as the wealth and demographics of the residents. That's an interesting insight, but "increasing population density is related to increased infection rate" is arguably the more important message.

10. Reconciling the difference

Unfortunately, resolving these model disagreements is messy. Often, the models that include the groups contain insight you'd otherwise miss. The disagreements between the models may reveal that you need even more explanatory variables to understand why they differ. Finally, I'm going to repeat the correct but annoying advice: to choose the best model, you need contextual information about what your dataset means and what question you are trying to answer.

11. Simpson's paradox in real datasets

A case of Simpson's paradox as clear as this one is very rare in real datasets. Subtler differences between models are more common: a slope may shrink to zero rather than changing direction, or the effect may appear in some groups but not in others.

12. Let's practice!

Time to play with the paradox.