
Multiple comparisons tests

1. Multiple comparisons tests

In this video we will learn about multiple comparisons tests.

2. Introduction to the multiple comparisons problem

So far, we have talked about setting up parameters for testing one metric: a single comparison between a control 'A' and a treatment 'B', with no subcategories. But what happens in a multiple comparisons scenario, where we want to compare more than two versions, analyze differences in more than one metric, or dive deeper into various subcategories of our data? Now we will learn how to adjust our test setup to accommodate such scenarios and understand why we need to do so.

3. Family-wise error rate

We know that when we set the significance level at 5%, we effectively accept that we will mistakenly reject a true null hypothesis 5 out of 100 times. If the probability of making a type I error is alpha, then the probability of not making an error is one minus alpha, and for 'm' independent tests this joint probability is the product of the individual probabilities, (1 - alpha) to the power of m. Since the probability of an event occurring at least once is one minus the probability of the event never occurring, the probability of making at least one type I error in m tests is one minus the probability of not making that error in any of the tests: 1 - (1 - alpha)^m. This probability is called the family-wise error rate, and for a single test it's the same as alpha. But what if we perform more than one test?
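As a quick numerical illustration of that relationship, here is a minimal sketch, assuming independent tests and a 5% significance level, that evaluates the family-wise error rate for a few values of m:

```python
# Family-wise error rate for m independent tests at significance level alpha
alpha = 0.05

for m in (1, 3, 10):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:2d} tests -> FWER = {fwer:.3f}")

# m =  1 tests -> FWER = 0.050  (same as alpha)
# m =  3 tests -> FWER = 0.143
# m = 10 tests -> FWER = 0.401
```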

4. Family-wise error rate

The following script plots this rate as a function of the number of tests and alpha. We first do our imports, define our parameters, assign 'x' equally spaced points from 0 to 20 in increments of 1, representing the number of tests, and 'y' as our family-wise error rate. Looking at the plot, we can see the inflation in the false positive rate: what we set at 5% initially turns into roughly 40% if we run 10 tests, make 10 comparisons, or analyze 10 metrics simultaneously.
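The transcript describes the script rather than showing it; a minimal sketch along those lines (the variable names are assumptions) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Per-test significance level
alpha = 0.05

# Number of tests: equally spaced points from 0 to 20 in increments of 1
x = np.arange(0, 21, 1)

# Family-wise error rate for each number of tests
y = 1 - (1 - alpha) ** x

plt.plot(x, y)
plt.xlabel('Number of tests')
plt.ylabel('Family-wise error rate')
plt.title('FWER inflation as the number of tests grows')
plt.show()
```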

5. Correction methods

So how can we ensure this error rate is still reasonably controlled across multiple tests? One simple, yet conservative, method is the Bonferroni correction. This technique divides the desired overall alpha by the number of tests performed and uses the result as the significance level for each individual comparison. A less stringent approach is the Sidak correction, obtained by equating the family-wise error rate with the desired alpha and solving for alpha Sidak.
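Both corrections fit in a couple of lines; here is a small sketch, assuming a desired family-wise error rate of 5% and 10 tests:

```python
alpha = 0.05  # desired family-wise error rate
m = 10        # number of tests / comparisons

# Bonferroni: divide alpha by the number of tests
alpha_bonferroni = alpha / m

# Sidak: solve 1 - (1 - alpha_sidak)**m = alpha for alpha_sidak
alpha_sidak = 1 - (1 - alpha) ** (1 / m)

print(alpha_bonferroni)  # 0.005
print(alpha_sidak)       # ~0.0051
```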

6. Bonferroni correction example

As an example, we ran three different tests to compare the average sales per user between a control version of a checkout page and three other versions. These are the p-values calculated. Without correcting for multiple comparisons, although all three tests would be considered significant, the probability of making at least one false positive becomes inflated to about 14%. If we use the Bonferroni correction, the comparison A versus D, which has a p-value higher than the adjusted significance level, would no longer be significant. However, we can rest assured that our family-wise error rate remains at 5%.
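The slide's actual p-values are not reproduced in the transcript, so the values below are hypothetical placeholders; the inflation and the adjusted threshold, however, follow directly from the formulas above:

```python
alpha = 0.05
m = 3  # three comparisons against the control

# Family-wise error rate without any correction
fwer = 1 - (1 - alpha) ** m
print(round(fwer, 3))            # 0.143 -> roughly the 14% mentioned above

# Bonferroni-adjusted significance level for each comparison
alpha_adjusted = alpha / m
print(round(alpha_adjusted, 4))  # 0.0167

# Hypothetical p-values: all below 0.05, but one above the adjusted threshold
p_values = [0.03, 0.01, 0.004]
print([p < alpha_adjusted for p in p_values])  # [False, True, True]
```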

7. statsmodels multipletests method

Let's demonstrate this using Python. From statsmodels we import the multipletests function, which corrects the p-values based on the number of tests and the chosen correction method. We then provide it with the list of p-values, specify our error rate, and select the desired correction method. We can see how the boolean list of rejected hypotheses, returned at index 0, did not reject the hypothesis for the first p-value, which is lower than the specified alpha but still higher than the Bonferroni-corrected alpha that the function returns at index 3.
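A minimal sketch of that call, reusing the hypothetical p-values from the previous example (the real values live on the slide, not in the transcript):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from three pairwise comparisons against the control
p_values = [0.03, 0.01, 0.004]

# Correct for multiple comparisons with the Bonferroni method
reject, pvals_corrected, alpha_sidak, alpha_bonf = multipletests(
    p_values, alpha=0.05, method='bonferroni'
)

print(reject)           # [False  True  True] -> first test no longer significant
print(pvals_corrected)  # p-values multiplied by the number of tests (capped at 1)
print(alpha_bonf)       # 0.05 / 3 ~= 0.0167, the adjusted significance level
```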

8. Let's practice!

Let's practice more in the following set of exercises.