1. Introduction to bootstrapping
So far, we've mostly focused on the idea of sampling without replacement.
2. With or without
Sampling without replacement is like dealing a pack of cards. When you deal the ace of spades to one player you can't then deal the ace of spades to another player.
Sampling with replacement is like rolling dice. If you roll a six, you can still get six on the next roll.
Sampling with replacement is sometimes called resampling. We'll use the terms interchangeably.
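To make the cards-versus-dice contrast concrete, here's a minimal sketch using base R's sample(). The card labels and the seed are made up purely for illustration.

```r
# Hypothetical example: a tiny "deck" of five cards.
cards <- c("A", "K", "Q", "J", "10")

set.seed(2024)  # arbitrary seed, only so the draws are repeatable

# Without replacement: each card can be dealt at most once.
dealt <- sample(cards, size = 5, replace = FALSE)

# With replacement: the same face can come up again, as with dice.
# Twenty rolls of a six-sided die must repeat at least one value.
rolls <- sample(1:6, size = 20, replace = TRUE)

print(dealt)
print(rolls)
```

Calling anyDuplicated(dealt) returns 0 because no card is dealt twice, while anyDuplicated(rolls) returns a positive index because at least one face repeats.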
3. Simple random sampling without replacement
If you take a simple random sample without replacement, each row of the dataset, or each type of coffee, can appear at most once in the sample.
4. Simple random sampling with replacement
If you sample with replacement, it means that each row of the dataset, or each coffee type, can appear multiple times in the sample.
5. Why sample with replacement?
We've been treating the coffee_ratings dataset as the population of coffees. Of course, it doesn't include every coffee in the world. Instead, we could treat the coffee dataset as just a big sample of coffees.
To imagine what the whole population is like, we need to approximate the other coffees that aren't in our dataset.
Each of the coffees that we do have in the sample dataset will have properties that are representative of those other coffees that we don't have. Resampling lets us use the existing coffees to approximate the other theoretical coffees.
6. Coffee data preparation
To keep it simple, let's focus on three columns of the coffee dataset. To make it easier to see which rows ended up in the sample, we'll add a row ID column.
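The preparation step might look something like the sketch below. The real coffee_ratings data isn't reproduced here, so coffee_toy is a made-up stand-in, and the choice of variety, country_of_origin, and flavor as the three columns is an assumption. rowid_to_column() comes from the tibble package.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for coffee_ratings; all values are made up.
coffee_toy <- data.frame(
  variety = c("Bourbon", "Caturra", "Typica", "Bourbon", "Catuai", "Typica"),
  country_of_origin = c("Brazil", "Colombia", "Mexico",
                        "Guatemala", "Brazil", "Ethiopia"),
  flavor = c(7.5, 7.8, 7.3, 7.6, 7.4, 8.1),
  aroma = c(7.6, 7.7, 7.2, 7.5, 7.5, 8.0)
)

coffee_focus <- coffee_toy %>%
  # Keep three columns (assumed names, see the note above).
  select(variety, country_of_origin, flavor) %>%
  # Add a leading rowid column numbered 1, 2, 3, ...
  rowid_to_column()

print(coffee_focus)
```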
7. Resampling with slice_sample()
To sample with replacement, you call slice_sample as usual, but set the replace argument to TRUE.
Setting prop to 1 gives a sample with the same size as the original dataset.
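Here's a minimal sketch of that call. The coffee_focus data frame below is a small made-up stand-in for the real dataset, and the seed is arbitrary.

```r
library(dplyr)

# Hypothetical toy data: 10 coffees with made-up flavor scores.
coffee_focus <- data.frame(
  rowid = 1:10,
  flavor = c(7.5, 7.8, 7.3, 7.6, 7.4, 8.1, 7.7, 7.2, 7.9, 7.0)
)

set.seed(123)  # arbitrary seed, only for repeatability

# prop = 1 keeps the resample the same size as the original;
# replace = TRUE lets a row be drawn more than once.
coffee_resamp <- coffee_focus %>%
  slice_sample(prop = 1, replace = TRUE)

print(coffee_resamp)
nrow(coffee_resamp) == nrow(coffee_focus)  # TRUE: same number of rows
```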
8. Repeated coffees
Counting the row IDs shows how many times each coffee ended up in the resampled dataset. Some coffees are present five times in the new dataset.
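One way to do that counting is with dplyr's count(), sketched here on a small made-up dataset. With only twelve rows you won't necessarily see a coffee five times, but the pattern of repeats is the same.

```r
library(dplyr)

# Hypothetical toy data: 12 coffees with made-up flavor scores.
coffee_focus <- data.frame(
  rowid = 1:12,
  flavor = c(7.5, 7.8, 7.3, 7.6, 7.4, 8.1, 7.7, 7.2, 7.9, 7.0, 7.6, 7.5)
)

set.seed(123)  # arbitrary seed, only for repeatability
coffee_resamp <- coffee_focus %>%
  slice_sample(prop = 1, replace = TRUE)

# One row per rowid that was drawn, with n = how many times it was drawn;
# sort = TRUE puts the most-repeated coffees first.
rowid_counts <- coffee_resamp %>%
  count(rowid, sort = TRUE)

print(rowid_counts)
```

The counts in the n column always add up to the number of rows in the resample.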
9. Missing coffees
The resample has the same number of rows as the original dataset, so if some coffees appear more than once, other coffees can't have ended up in the resampled dataset at all. By taking the number of distinct row IDs in the resampled dataset using dplyr's n_distinct, you can see that 844 different coffees were included, and 494 coffees weren't included.
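Here's a sketch of that calculation on a made-up dataset of 1000 row IDs, so the exact counts will differ from the ones quoted for the coffee data.

```r
library(dplyr)

# Hypothetical toy data: 1000 rows identified only by rowid.
coffee_focus <- data.frame(rowid = 1:1000)

set.seed(123)  # arbitrary seed, only for repeatability
coffee_resamp <- coffee_focus %>%
  slice_sample(prop = 1, replace = TRUE)

# Number of different original rows that made it into the resample.
n_included <- coffee_resamp %>%
  summarize(coffees_included = n_distinct(rowid)) %>%
  pull(coffees_included)

# Everything else was never drawn.
n_missing <- nrow(coffee_focus) - n_included

print(c(included = n_included, missing = n_missing))
```

For a dataset with n rows, the chance that a given row is never drawn is (1 - 1/n)^n, which is about 0.37 for large n. That is in line with 494 of the 1338 coffees being left out.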
10. Bootstrapping
We're going to use resampling for a technique called bootstrapping. In some sense, bootstrapping is the opposite of sampling from a population.
With sampling, you treat your dataset as the population, and move to a smaller sample.
With bootstrapping, you treat your dataset as a sample and use it to build up a theoretical population.
An important use case of bootstrapping is understanding variability due to sampling. This matters when you can't sample from the population multiple times to create a sampling distribution, as you did earlier.
11. Bootstrapping process
The bootstrapping process has three steps.
First, you do random sampling with replacement, to get a resample the same size as your original dataset.
Then you calculate a statistic, such as the mean of one of the columns. Note that the mean isn't the only choice here: bootstrapping works for more complex statistics too.
Then you replicate this many times to get lots of these statistics.
Earlier in the course, you did something similar. You took a simple random sample, then calculated a summary statistic, then repeated those two steps to form a sampling distribution.
This time, because you've used resampling instead of sampling, you get a bootstrap distribution.
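The three steps can be sketched in base R as below. bootstrap_stat() is a hypothetical helper written for this illustration, not a function from any package, and the flavor scores are made up. It uses the median to show that the statistic doesn't have to be a mean.

```r
# Hypothetical helper (not from any package): bootstrap any statistic
# of a numeric vector x.
bootstrap_stat <- function(x, stat, n_reps = 1000) {
  replicate(n_reps, {
    # Step 1: resample with replacement, same size as the original.
    resample <- x[sample.int(length(x), replace = TRUE)]
    # Step 2: calculate the statistic on the resample.
    stat(resample)
  })  # Step 3: replicate() repeats steps 1 and 2 n_reps times.
}

set.seed(1)  # arbitrary seed, only for repeatability
flavor <- c(7.5, 7.8, 7.3, 7.6, 7.4, 8.1, 7.7, 7.2)  # made-up scores

boot_medians <- bootstrap_stat(flavor, median, n_reps = 500)

length(boot_medians)  # one bootstrap median per replication
summary(boot_medians)
```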
12. Bootstrapping coffee mean flavor
The resampling step uses the code you just saw: calling slice_sample with prop set to one and replace set to TRUE.
Calculating a bootstrap statistic can be done with summarize. In this case, we're calculating the mean flavor score.
To repeat steps one and two, we can wrap the code in a call to replicate.
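Put together, the code might look like the sketch below. The coffee_focus data here is simulated as a stand-in for the real dataset, and the name mean_flavors_1000 and the choice of 1000 replications are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in data: 200 simulated flavor scores.
set.seed(42)  # arbitrary seed, only for repeatability
coffee_focus <- data.frame(
  rowid = 1:200,
  flavor = round(rnorm(200, mean = 7.5, sd = 0.3), 2)
)

mean_flavors_1000 <- replicate(
  n = 1000,  # step 3: repeat steps 1 and 2 this many times
  expr = {
    coffee_focus %>%
      # Step 1: resample with replacement, same size as the original.
      slice_sample(prop = 1, replace = TRUE) %>%
      # Step 2: calculate the bootstrap statistic.
      summarize(mean_flavor = mean(flavor, na.rm = TRUE)) %>%
      # Extract the single number so replicate() returns a numeric vector.
      pull(mean_flavor)
  }
)

length(mean_flavors_1000)  # one mean per replication
```

Using pull() at the end means replicate() collects plain numbers rather than one-row data frames.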
13. Bootstrap distribution histogram
Here's a histogram of the bootstrap distribution of the sample mean. Notice that its shape is close to a normal distribution.
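A plot like this can be sketched with base R's hist(). The plotting function used on the slide isn't stated, so this is just one option, and the flavor scores are simulated rather than taken from the coffee data.

```r
# Hypothetical stand-in data: 200 simulated flavor scores.
set.seed(42)  # arbitrary seed, only for repeatability
flavor <- rnorm(200, mean = 7.5, sd = 0.3)

# Bootstrap distribution of the mean: resample, take the mean, repeat.
boot_means <- replicate(1000, {
  mean(flavor[sample.int(length(flavor), replace = TRUE)])
})

# Draw the histogram; hist() also returns the bin counts invisibly.
h <- hist(
  boot_means,
  breaks = 30,
  main = "Bootstrap distribution of mean flavor",
  xlab = "Mean flavor of resample"
)
```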
14. Let's practice!
Best get bootstrapping.