1. ANOVA tests
You've seen how to compare two groups in the unpaired and paired cases. What if there are more than two groups?
2. Job satisfaction: 5 categories
The Stack Overflow survey includes a job satisfaction variable, with five categories from "Very dissatisfied" to "Very satisfied".
3. Visualizing multiple distributions
Suppose we want to know if mean annual compensation is different for each of the levels of job satisfaction.
The first thing to do is visualize the distributions with box plots. I've used coord_flip to swap the x and y axes, making the category labels easier to read.
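As a sketch, assuming the survey data sits in a data frame called stack_overflow with columns converted_comp and job_sat (both names are my assumption), the plot could be built like this:

    library(ggplot2)
    # Box plot of compensation for each job satisfaction level; coord_flip()
    # swaps the axes so the long category labels are easy to read.
    ggplot(stack_overflow, aes(x = job_sat, y = converted_comp)) +
      geom_boxplot() +
      coord_flip()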
"Very satisfied" looks slightly higher than the others, but to see if they are significantly different, we'll need to use hypothesis tests.
4. Analysis of variance (ANOVA)
ANOVA tests determine whether there are differences between the groups. First, you fit a linear regression. You call lm, specifying the numeric variable as the response on the left-hand side of the formula, and the categories as the explanatory variable on the right-hand side.
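A minimal sketch, reusing the assumed stack_overflow data frame from before:

    # Regress compensation on the job satisfaction categories.
    mdl_comp_vs_job_sat <- lm(converted_comp ~ job_sat, data = stack_overflow)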
Then you call anova to perform an analysis of variance test. In the job_sat row, the right-hand column contains a p-value, which is point-zero-zero five. The two stars next to it tell you that this p-value is significant at the point-zero-one level.
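Continuing the sketch, the call that produces that table:

    # ANOVA table for the fitted model; the Pr(>F) entry in the job_sat
    # row is the p-value described above.
    anova(mdl_comp_vs_job_sat)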
That means that at least two of the categories of job satisfaction have significant differences between their compensation levels.
The problem is that ANOVA doesn't tell you which pairs of categories differ. For this reason, ANOVA is often less useful than pairwise t-tests.
5. Pairwise tests
To compare all five categories of job satisfaction using hypothesis tests, we can test each pair. There are ten ways of choosing two items from a set of five, so we have ten tests to perform.
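That count is five choose two, which you can verify in R:

    # 5 * 4 / 2 = 10 pairs of groups
    choose(5, 2)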
We'll set the significance level to point-two.
6. pairwise.t.test()
To run all these hypothesis tests in one go, you can use pairwise-dot-t-dot-test. The first argument is the numeric variable whose sample means you are interested in. The second argument is the categorical variable defining the groups. We'll discuss p-dot-adjust-dot-method shortly.
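A sketch with the same assumed columns, setting p.adjust.method to "none" so the raw, unadjusted p-values are shown:

    # All ten pairwise t-tests between the job satisfaction categories.
    pairwise.t.test(
      stack_overflow$converted_comp,
      stack_overflow$job_sat,
      p.adjust.method = "none"
    )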
The result shows a matrix of ten p-values.
Three of these are less than our significance level of point-two.
7. As the no. of groups increases...
In this case we have five groups, resulting in ten pairs. As the number of groups increases, the number of pairs - and hence the number of hypothesis tests - increases quadratically.
The more tests you run, the higher the chance that at least one of them will give a false positive significant result.
With a significance level of point-two, if you run one test, the chance of a false positive result is point-two. With five groups and ten tests, the probability of at least one false positive is almost point-nine. With twenty groups, it's almost guaranteed that you'll get at least one false positive.
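Those numbers assume the tests are independent, so the chance of at least one false positive is one minus the chance that every test avoids a false positive:

    alpha <- 0.2
    1 - (1 - alpha) ^ 1     # one test: 0.2
    1 - (1 - alpha) ^ 10    # five groups, ten tests: about 0.89
    1 - (1 - alpha) ^ 190   # twenty groups, choose(20, 2) = 190 tests: about 1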
8. Bonferroni correction
The solution to this is to apply an adjustment to increase the p-values, reducing the chance of getting a false positive. One common adjustment is the Bonferroni correction.
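Rerunning the earlier sketch with the Bonferroni adjustment:

    # Each p-value is multiplied by the number of tests (here, ten).
    pairwise.t.test(
      stack_overflow$converted_comp,
      stack_overflow$job_sat,
      p.adjust.method = "bonferroni"
    )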
Now only two of the pairs appear to have significant differences.
9. More methods
R provides several methods for adjusting the p-values. You can list their names with p-dot-adjust-dot-methods.
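For reference, the call and the names it returns:

    # Returns: "holm" "hochberg" "hommel" "bonferroni" "BH" "BY" "fdr" "none"
    p.adjust.methods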
Holm adjustment is the default. It's less strict than Bonferroni, but works well in most situations.
10. Bonferroni and Holm adjustments
If you have ten tests, Bonferroni simply multiplies each p-value by ten, capping the result at one.
Holm will multiply the smallest by ten, then the second smallest by nine, and so on. There's a possible extra correction to make sure that the order of p-values from smallest to largest is preserved, but that's the gist of it.
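You can see both behaviors on a toy vector of ten p-values (values invented purely for illustration):

    p_values <- c(0.001, 0.01, 0.02, 0.04, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9)
    # Bonferroni: multiply every p-value by 10, capping the result at 1.
    p.adjust(p_values, method = "bonferroni")
    # Holm: smallest * 10, second smallest * 9, and so on, nudged upward
    # where needed so the sorted order is preserved.
    p.adjust(p_values, method = "holm")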
11. Let's practice!
Let's run lots of tests.