1. Type I errors
In this chapter, we will learn how much we should trust our test results.
2. Ways of being wrong
When we run a statistical test, we either reject or fail to reject the null hypothesis, depending on whether or not we find an effect. Finding an effect can be thought of as a positive result, while not finding one can be thought of as a negative result, and our conclusion will be either correct or incorrect.
If we are incorrect, we can have either a false positive or a false negative. A false positive, or finding a difference where none exists, is a type I error. A false negative, or failing to find a difference that is there, is a type II error.
3. Avoiding type I errors
Of course, we want to do our best to avoid errors. When we talk about avoiding type I errors, we need to keep in mind that statistical tests are probabilistic. They quantify how likely it is that we would see results at least as extreme as ours if the null hypothesis were true. When this probability is sufficiently low, meaning that the p-value is small, we reject the null hypothesis. However, there is always a non-zero chance that our results came about by chance.
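As a rough illustration (a sketch, not from the slides), we can simulate this risk: if we draw two samples from the same distribution many times and t-test each pair, roughly five percent of the tests will still come out significant at an alpha of 0.05, purely by chance.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_tests = 1000
false_positives = 0

# The null hypothesis is true by construction: both samples come from
# the same normal distribution, yet some tests still look significant.
for _ in range(n_tests):
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(false_positives / n_tests)  # roughly 0.05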
4. Picking a single result can be misleading
Let's consider flipping a coin and obtaining 18 heads and 2 tails. Sounds significant, right? But, this assumes that we are not conducting many repeated tests. If that result was one of many coins that we were flipping and we cherry-picked it, it's much less impressive.
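To put a number on this (an illustrative sketch, not part of the slides), a binomial test says that 18 heads out of 20 flips is very unlikely for a single fair coin, but the picture changes once we allow ourselves to cherry-pick from many coins.

from scipy import stats

# Probability of a result at least this extreme (18 or more heads, or
# 18 or more tails, out of 20 flips) if the coin is actually fair.
result = stats.binomtest(k=18, n=20, p=0.5, alternative='two-sided')
print(result.pvalue)  # roughly 0.0004: looks very significant in isolation

# But if we flipped, say, 100 fair coins and cherry-picked the most extreme,
# the chance that at least one looks this extreme is much larger.
p_single = result.pvalue
p_at_least_one = 1 - (1 - p_single) ** 100
print(p_at_least_one)  # roughly 0.04: far less impressive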
5. Accounting for multiple tests
We can account for multiple tests in a few ways. The first is in the experimental design. We should avoid p-value fishing, or running many tests without justification for each test.
When running multiple hypothesis tests is justified, we can correct our p-values to account for the distorting effect of multiple tests. These correction methods include the Bonferroni and Šídák corrections. The correction method is chosen based on whether or not each test is independent of the others, not on the type of statistical test used.
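As a quick sketch of what these corrections do to the per-test significance threshold (assuming an overall alpha of 0.05 across three tests; the specific numbers are illustrative):

alpha = 0.05
m = 3  # number of tests

# Bonferroni: divide alpha evenly across the tests (simple, conservative).
bonferroni_threshold = alpha / m               # about 0.0167
# Šídák: assumes independent tests, so it is slightly less strict.
sidak_threshold = 1 - (1 - alpha) ** (1 / m)   # about 0.0170

print(bonferroni_threshold, sidak_threshold)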
6. Bonferroni correction
The Bonferroni correction is a conservative and simple method that is appropriate when tests are not independent of each other.
After performing the tests, we extract the p-values in an array-like format, then pass them to the multipletests function, along with our desired significance level as alpha and the method b.
7. Bonferroni correction example
Let's consider an example from the Olympic dataset, where we are performing multiple non-independent t-tests that are pairwise comparisons of the heights of athletes from three disciplines.
8. Bonferroni correction code
We start by importing scipy dot stats and statsmodels. Then, we perform the tests, extract the p-values from index 1, and pass them to the multipletests function, along with our alpha value and the method b. The output tells us whether or not to reject the null hypothesis and yields the adjusted p-values.
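A sketch of what that code might look like; the real example uses the Olympic athletes data, so the synthetic heights below are stand-ins, and the exact import path and variable names are assumptions.

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Made-up height samples standing in for three Olympic disciplines.
rng = np.random.default_rng(0)
basketball = rng.normal(loc=192, scale=8, size=50)
volleyball = rng.normal(loc=188, scale=8, size=50)
gymnastics = rng.normal(loc=160, scale=7, size=50)

# Three pairwise, non-independent t-tests; the p-value sits at index 1.
pvals = [
    stats.ttest_ind(basketball, volleyball)[1],
    stats.ttest_ind(basketball, gymnastics)[1],
    stats.ttest_ind(volleyball, gymnastics)[1],
]

# Bonferroni correction: 'b' is accepted as shorthand for 'bonferroni'.
reject, adjusted_pvals, _, _ = multipletests(pvals, alpha=0.05, method='b')
print(reject)          # whether to reject the null for each comparison
print(adjusted_pvals)  # Bonferroni-adjusted p-values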
9. Šídák correction
An alternative method is the Šídák correction, which is slightly less conservative and can be used when tests are independent of each other. The implementation of the correction using the multipletests function is identical, except that we specify the method as s.
10. Šídák correction example
Let's look at an example using the Olympic dataset of independent comparisons of heights of athletes between events.
11. Šídák correction code
As before, we first import scipy dot stats and statsmodels. Then, we perform the tests, noting that each test is independent. We extract the p-values and pass them to multipletests, along with an alpha value and method s. The output is similar to the previous example and displays the adjusted p-values.
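A minimal sketch of the same pattern with the Šídák method; the p-values below are made up, standing in for the independent comparisons between events.

from statsmodels.stats.multitest import multipletests

# Illustrative p-values from three independent comparisons (made-up numbers).
pvals = [0.01, 0.04, 0.20]

# Šídák correction: 's' is accepted as shorthand for 'sidak'.
reject, adjusted_pvals, _, _ = multipletests(pvals, alpha=0.05, method='s')
print(reject)          # e.g. [ True False False] at alpha = 0.05
print(adjusted_pvals)  # Šídák-adjusted p-values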
12. Let's practice!
Now, you try it out.