1. Randomized distributions
The idea behind statistical inference is to understand samples from a hypothetical population where the null hypothesis is true. In the soda example, that means samples of people from the East and West Coasts, drawn from a population where cola preference is the same on both coasts.
2. Logic of inference
As a way of summarizing each of the null samples,
3. Logic of inference
we calculate one statistic from each sample.
4. Logic of inference
Here, the statistic is the difference in
5. Logic of inference
the proportion of West Coast people who prefer cola as compared with the proportion of East Coast people who prefer cola,
6. Logic of inference
where each of the sample proportions is denoted "p-hat". The difference in p-hats changes with each sample. First it is 0,
7. Logic of inference
then it is negative one third, and it will keep changing.
8. Understanding the null distribution
We can build a distribution of differences in proportions assuming the null hypothesis, that there is no link between location and soda preference, is true. That is, the null samples consist of randomly shuffled soda variables so that the samples don't have any dependency between location and soda preference.
9. Understanding the null distribution
The original sample proportions are p-hat East of (point) 82 and p-hat West of (point) 73, which gives a difference (West minus East) of negative (point) 09.
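As a rough sketch, the observed statistic can be computed as follows. The data frame name `soda`, the label "other" for the non-cola drink, and the group sizes (34 East, 26 West) are assumptions for illustration; the counts were chosen only so that the proportions round to the (point) 82 and (point) 73 quoted above.

```r
library(dplyr)

# Hypothetical data: names and group sizes are assumed, not taken from the video.
soda <- data.frame(
  location = rep(c("East", "West"), times = c(34, 26)),
  drink = c(rep(c("cola", "other"), times = c(28, 6)),
            rep(c("cola", "other"), times = c(19, 7)))
)

# One sample proportion (p-hat) per coast.
p_hats <- soda %>%
  group_by(location) %>%
  summarize(prop_cola = mean(drink == "cola"))

# "East" sorts before "West", so diff() returns West minus East.
obs_diff <- diff(p_hats$prop_cola)

round(p_hats$prop_cola, 2)
round(obs_diff, 2)
```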
10. Understanding the null distribution
The first shuffle of the drink variable gives the exact same summaries as the observed data!
11. Understanding the null distribution
The second shuffle, on the other hand, gives 27 people on the East Coast who prefer cola as compared with 20 on the West Coast who prefer cola. The difference in sample proportions for the second shuffle of the data is negative (point) 02, which is less extreme than the original data. Note that the original data (the red line) and the first two shuffled differences in proportions (black dots) can be plotted together.
12. Understanding the null distribution
The next few shuffles give differences in proportions
13. Understanding the null distribution
centered around zero.
14. Understanding the null distribution
Note that the 5th difference is negative (point) 16, which is farther from zero than the original data.
15. Understanding the null distribution
That is, the fifth shuffle gives more evidence of a difference in soda preference than the original data does. And we know that the fifth shuffle was created by randomly permuting the labels, so a difference of negative (point) 16 is plausible under the null hypothesis!
16. Understanding the null distribution
Generally, the null differences are between
17. Understanding the null distribution
negative (point) 2
18. Understanding the null distribution
and positive (point) 2,
19. Understanding the null distribution
and about one third of the differences
20. Understanding the null distribution
are as or more extreme than the observed difference of negative (point) 09.
21. Understanding the null distribution
Now that we have seen a visual representation of the null distribution, let's see how a null sample can be generated in R.
22. One random permutation
The mutate and sample functions mix up, or permute, the vector of soda preferences, so that whether someone is on the East or West Coast can't possibly be causing any difference in proportions. However, due to natural variability, there is also no expectation that the two permuted proportions will be exactly equal in any given sample.
After grouping by the location variable, summarize calculates the proportion of each coast that prefers cola.
Note that drink equals "cola" produces a vector of TRUEs and FALSEs, which R then coerces to ones and zeros when the mean function is applied. Since a one represents an individual who prefers cola, the average of these ones and zeros represents the proportion of individuals who prefer cola.
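A tiny illustration of that coercion, using a made-up four-person vector (the values and the label "other" are assumptions for the example):

```r
# Hypothetical drink preferences for four people.
drink <- c("cola", "other", "cola", "cola")

# The comparison yields a logical vector.
prefers_cola <- drink == "cola"
prefers_cola        # TRUE FALSE TRUE TRUE

# mean() treats TRUE as 1 and FALSE as 0, giving the proportion of cola drinkers.
mean(prefers_cola)  # 0.75
```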
summarize is used a second time to find the difference in the proportion of cola preference across the two coastal groups. The diff function can be applied here because the first summarize reduced the data to one row per location, so there are exactly two proportions to subtract.
Notice that the output gives a permuted difference of negative (point) 02 as compared to the observed difference of negative (point) 09. However, the permuted difference of negative (point) 02 represents only one instance of the variability of soda preference under the null model. To get a sense of the degree of variability under the null model, it is necessary to permute the drink variable many times.
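The steps just described might look roughly like the sketch below. It is not the exact code from the slide: the data frame `soda`, the column `drink_perm`, the label "other", the group sizes, and the seed are all assumptions for illustration.

```r
library(dplyr)

# Hypothetical data: 34 East and 26 West respondents (assumed sizes).
soda <- data.frame(
  location = rep(c("East", "West"), times = c(34, 26)),
  drink = c(rep(c("cola", "other"), times = c(28, 6)),
            rep(c("cola", "other"), times = c(19, 7)))
)

set.seed(1)  # arbitrary seed so the shuffle is reproducible

perm_diff <- soda %>%
  # Shuffle the drink labels, breaking any link with location.
  mutate(drink_perm = sample(drink)) %>%
  group_by(location) %>%
  # Proportion preferring cola on each coast, under the shuffle.
  summarize(prop_cola = mean(drink_perm == "cola")) %>%
  # One row per coast remains, so diff() gives West minus East.
  summarize(diff_perm = diff(prop_cola))

perm_diff
```

Because sample only reorders the labels, the total number of cola drinkers stays the same; only their split between the two coasts changes.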
23. Many random permutations
By repeating the permuting and difference calculations five times, the permuted differences are seen to be sometimes positive, sometimes negative, sometimes close to zero, sometimes far from zero. However, five times isn't quite enough to capture all of the variability in the null differences.
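One way to repeat the calculation is to wrap it in a small function and call it with base R's replicate. The helper name `permuted_diff`, the data frame `soda`, and its counts are assumptions for this sketch; the video may use a different mechanism for repetition.

```r
library(dplyr)

# Hypothetical data with assumed group sizes.
soda <- data.frame(
  location = rep(c("East", "West"), times = c(34, 26)),
  drink = c(rep(c("cola", "other"), times = c(28, 6)),
            rep(c("cola", "other"), times = c(19, 7)))
)

# Hypothetical helper: one shuffle, one difference in proportions.
permuted_diff <- function(data) {
  data %>%
    mutate(drink_perm = sample(drink)) %>%
    group_by(location) %>%
    summarize(prop_cola = mean(drink_perm == "cola")) %>%
    summarize(diff_perm = diff(prop_cola)) %>%
    pull(diff_perm)
}

set.seed(2)  # arbitrary seed
five_diffs <- replicate(5, permuted_diff(soda))
round(five_diffs, 2)
```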
24. Random distribution
By repeating the permutation process 100 times, the null differences are seen to range from approximately negative (point) 3 to positive (point) 3, although the majority of the differences are between negative (point) 1 and positive (point) 1. The observed data difference of negative (point) 09 doesn't seem too extreme compared to this collection of null differences.
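A sketch of the 100-permutation version, again using assumed names and counts (`soda`, `permuted_diff`, 34 East and 26 West respondents). It also computes the share of null differences that are at least as negative as the observed one; treating "as or more extreme" as one-sided is an assumption made here.

```r
library(dplyr)

# Hypothetical data with assumed group sizes.
soda <- data.frame(
  location = rep(c("East", "West"), times = c(34, 26)),
  drink = c(rep(c("cola", "other"), times = c(28, 6)),
            rep(c("cola", "other"), times = c(19, 7)))
)

# Difference in proportions (West minus East) for a given drink column.
diff_in_props <- function(data, drink_col) {
  data %>%
    group_by(location) %>%
    summarize(prop_cola = mean(.data[[drink_col]] == "cola")) %>%
    summarize(d = diff(prop_cola)) %>%
    pull(d)
}

# Hypothetical helper: shuffle the labels, then compute the difference.
permuted_diff <- function(data) {
  data %>%
    mutate(drink_perm = sample(drink)) %>%
    diff_in_props("drink_perm")
}

obs_diff <- diff_in_props(soda, "drink")

set.seed(3)  # arbitrary seed
null_diffs <- replicate(100, permuted_diff(soda))

range(null_diffs)             # spread of the null differences
mean(null_diffs <= obs_diff)  # share as or more extreme (one-sided, assumed)
```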
25. Let's practice!
OK, now it's your turn to practice what you've learned.