
Multiple testing

1. Multiple testing

To finish up this chapter on statistical experiments and hypothesis testing, let's discuss a special case that comes up quite often in practice: multiple comparisons.

2. Multiple comparisons problem

When you run a typical hypothesis test with the significance level set to point 05, there's a 5 percent chance that you'll make a type I error and detect an effect that doesn't exist. This is a risk that we are normally willing to take. The multiple comparisons problem arises when you run several sequential hypothesis tests. Some quick math explains this phenomenon quite easily. Since each test is independent, you can multiply the probability of making no type I error across all of the tests, then subtract that product from one to get the combined probability of at least one error. For instance, if we tested linkage of 20 different colors of jelly beans to acne, each at 5 percent significance, there's around a 64 percent chance of at least one error; in this case, it was the green jelly beans that were linked with acne.
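The calculation above can be sketched in a couple of lines; the values here come straight from the jelly bean example.

```python
# Probability of at least one type I error across n independent tests,
# each run at significance level alpha.
alpha = 0.05
n_tests = 20

# P(no error on one test) = 1 - alpha; multiply across all independent
# tests, then subtract from 1 to get P(at least one error).
p_at_least_one_error = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one_error, 4))  # → 0.6415
```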

3. Correcting for multiple comparisons

When you run multiple tests, the p-values have to be adjusted for the number of hypothesis tests you are running to control the type I error rate discussed earlier. There isn't a universally-accepted way to control for the problem of multiple testing, but there are a few common ones that we'll discuss.

4. Common approaches

Common approaches include the Bonferroni correction, the Šidák correction, several step-based techniques, Tukey's procedure, and Dunnett's correction. In the interest of simplicity, we'll stick to the Bonferroni correction, as it's unlikely that you'll have to dive into the details of any other approaches during an interview, but it's always good to know what's out there.

5. Bonferroni correction

The most conservative of corrections, the Bonferroni correction is also perhaps the most straightforward in its approach. As you can see, you simply divide your significance level, alpha, by the number of tests, denoted here as n. For instance, if we had a significance level of point 05 and wanted to run 10 tests, our corrected significance level would come out to point 005 for each test.
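In code, the correction is a single division; the alpha and test count here are the ones from the example above.

```python
# Bonferroni correction: divide the overall significance level alpha
# by the number of tests to get the per-test significance level.
alpha = 0.05
n_tests = 10

corrected_alpha = alpha / n_tests
print(corrected_alpha)  # → 0.005
```

Each individual test is then judged against this stricter threshold of point 005 rather than point 05.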

6. Example

This can be accomplished pretty easily in Python using the multipletests function from statsmodels. We pass a list of test result p-values along with the alpha and our method, and we get back a list of test results, along with the specific corrected p-values in the second index if necessary.
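Here's a minimal sketch of that call using statsmodels; the five p-values are made up purely for illustration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from five separate hypothesis tests
pvals = [0.01, 0.02, 0.03, 0.04, 0.05]

# multipletests returns (reject, pvals_corrected, alphacSidak, alphacBonf);
# with method='bonferroni', each corrected p-value is min(p * n_tests, 1)
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05,
                                              method='bonferroni')
print(reject)           # which null hypotheses can be rejected
print(pvals_corrected)  # corrected p-values (second element of the tuple)
```

With five tests, only the smallest p-value (point 01, corrected to point 05) survives the correction at alpha equal to point 05.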

7. Side effects

Of course, there are side effects to using a conservative adjustment like Bonferroni. With many tests, the corrected significance level will become very, very small. This reduces power, which means that you are increasingly unlikely to detect a true effect when it occurs.

8. Summary

To summarize, we covered the multiple comparisons problem, what it means to correct for it, and then outlined a common method for doing so: the Bonferroni correction.

9. Let's prepare for the interview!

Let's go ahead and practice with some exercises!