1. Combining evidence from p-values
Meta-analysis is a popular statistical technique in which results from many studies are combined. Key to this is understanding how to make inferences based on multiple p-values coming from different samples but testing the same hypothesis.
2. p-values
Suppose we're investigating the effect of lifting weights once a week on muscle mass gain. This is a common idea, and so it's likely many researchers have investigated it, all with the common null hypothesis that lifting weights has no impact on muscle mass, and the alternative hypothesis that it increases muscle mass. Likely, each study also produced its own p-value.
3. Repeated experiments
It's reasonable to expect that different studies would get different p-values. But is it also reasonable to expect that different studies would come to completely different conclusions? In other words, what should we make of it if one study concluded lifting weights doesn't increase muscle mass, and another claimed it did?
4. Different effect sizes
Surprisingly enough, this is a reasonable outcome! The reason is related to samples. Recall that the outcome of any hypothesis test is completely dependent on the sample collected. Perhaps one study had a sample of people who just happened to respond poorly to lifting weights, and another did not. Another way to phrase this is that the effect size for one sample was small, while the effect size was large for another sample.
5. Testing p-values
What Fisher's method does is let us take many different studies, each studying the same null hypothesis, and test if at least one of the studies should have rejected the null hypothesis. It does so by looking at all of the p-values together as a single piece of evidence.
This differs from the studies themselves, in which the researchers considered only their own data. However, Fisher's method considers the results from all of these studies to see if there is broad evidence that at least one study should have rejected the null hypothesis.
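As a sketch of how this works under the hood, Fisher's statistic is minus two times the sum of the natural logs of the p-values, and under the shared null hypothesis it follows a chi-squared distribution with twice as many degrees of freedom as there are studies. The p-values below are hypothetical, just to illustrate the calculation:

```python
import numpy as np
from scipy import stats

# Hypothetical p-values from three studies testing the same null hypothesis
p_values = [0.04, 0.10, 0.07]

# Fisher's statistic: -2 * sum(ln(p_i)); under the null it follows a
# chi-squared distribution with 2k degrees of freedom (k = number of studies)
statistic = -2 * np.sum(np.log(p_values))
combined_p = stats.chi2.sf(statistic, df=2 * len(p_values))
print(statistic, combined_p)
```

This hand-rolled version should agree with what SciPy's combine_pvalues function computes for us in the next slide.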
6. Fisher's method in SciPy
To use Fisher's method in SciPy, we'll use the stats-dot-combine_pvalues function. It takes in a list of p-values and returns the test statistic and p-value from Fisher's method. Here we see a list of p-values from studies all testing the same null hypothesis. Individually, none of the tests had a p-value below the five percent level, but they're all extremely close to five percent. What inference would be valid here?
By using Fisher's method in this case we conclude that at least one of the studies should indeed have rejected the null hypothesis. Thus we can conclude that, while no test individually showed statistical significance, the combination of evidence from all of the tests suggests there is indeed a significant effect present.
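A minimal sketch of this scenario, using made-up p-values that each sit just above the five percent level:

```python
from scipy import stats

# Hypothetical p-values: none individually below 0.05, but all close to it
p_values = [0.055, 0.052, 0.051, 0.060]

# Fisher's method combines them into one test of the shared null hypothesis
statistic, p_value = stats.combine_pvalues(p_values, method='fisher')
print(p_value)  # well below 0.05: reject at the combined level
```

Even though each study on its own falls just short of significance, the combined p-value comes out comfortably below five percent.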
7. Fisher's method in SciPy
Contrast that with the case where one study had a low p-value of zero-point-zero-one, and yet all of the others had substantially larger p-values. In this case, Fisher's method suggests that none of the studies should have rejected the null, and perhaps the one study that did reject was merely a fluke.
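Sketching this contrasting case with the zero-point-zero-one p-value from the example alongside some assumed larger p-values for the other studies:

```python
from scipy import stats

# One low p-value, as in the example, plus assumed larger ones from other studies
p_values = [0.01, 0.45, 0.60, 0.70]

statistic, p_value = stats.combine_pvalues(p_values, method='fisher')
print(p_value)  # above 0.05: no combined evidence against the null
```

Here the combined p-value lands above five percent, so taken together the studies don't give us grounds to reject the null hypothesis.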
8. Let's practice!
Now that we've seen Fisher's method in action, let's jump into some data and practice!