Comparing sampling and bootstrap distributions

1. Comparing sampling and bootstrap distributions

In the last video,

2. Coffee focused subset

we took a focused subset of the coffee dataset. Here's a five-hundred-row sample from it.
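As a minimal sketch (the video's actual code isn't shown here), assuming the focused subset lives in a data frame called coffee_focus with a flavor column, drawing the sample might look like this:

```r
library(dplyr)

# Assumed setup: coffee_focus is the focused subset of the coffee dataset
# Draw a 500-row sample from it, without replacement
coffee_sample <- coffee_focus %>%
  slice_sample(n = 500)
```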

3. The bootstrap of mean coffee flavors

Here, we generate a bootstrap distribution of the mean coffee flavor scores from that sample. slice_sample generates a resample, summarize calculates the statistic, and replicate repeats these steps to give a distribution of statistics.
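A sketch of those three steps, continuing with the hypothetical coffee_sample from above (the count of 5000 replicates is an assumption, not the video's value):

```r
library(dplyr)

# Bootstrap: resample the 500 rows *with* replacement, compute the mean
# flavor score, and repeat to build a distribution of sample means
mean_flavors_5000 <- replicate(
  n = 5000,
  expr = coffee_sample %>%
    slice_sample(prop = 1, replace = TRUE) %>%  # resample, same size as original
    summarize(mean_flavor = mean(flavor)) %>%   # calculate the statistic
    pull(mean_flavor)                           # extract it as a plain number
)
```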

4. Mean flavor bootstrap distribution

Here's the histogram of the bootstrap distribution, which is close to a normal distribution.
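One way to draw a histogram like that with ggplot2, still using the hypothetical objects from the earlier sketch (the binwidth is an assumption):

```r
library(ggplot2)
library(tibble)

# Wrap the vector of bootstrap means in a data frame for plotting
bootstrap_distn <- tibble(mean_flavor = mean_flavors_5000)

ggplot(bootstrap_distn, aes(mean_flavor)) +
  geom_histogram(binwidth = 0.0025)
```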

5. Sample, bootstrap distribution, population means

Here's the mean flavor score from the original sample. In the bootstrap distribution, each value is an estimate of the mean flavor score. Recall that each of these values corresponds to one potential sample mean from the theoretical population. If we take the mean of those means, we get our best guess of the population mean. These two values, the sample mean and the bootstrap distribution mean, are really close. However, there's a problem: the true population mean is actually slightly different.
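Sketched in code, still assuming the hypothetical names from earlier (coffee_focus plays the role of the population here, since we happen to have all of its rows):

```r
library(dplyr)

# Mean flavor score in the original sample
coffee_sample %>%
  summarize(sample_mean = mean(flavor))

# Mean of the bootstrap distribution: the mean of the resampled means
mean(mean_flavors_5000)

# True population mean, for comparison
coffee_focus %>%
  summarize(pop_mean = mean(flavor))
```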

6. Interpreting the means

The behavior that you just saw is typical. The bootstrap distribution mean is usually almost identical to the original sample mean. However, this is not necessarily a good thing. If the original sample wasn't closely representative of the population, then the bootstrap distribution mean won't be a good estimate of the population mean. Bootstrapping cannot correct any potential biases due to differences between your sample and the population.

7. Sample sd vs bootstrap distribution sd

While bootstrapping has that limitation when estimating the population mean, one great thing about these distributions is that we can also quantify variation. The standard deviation of the sample flavor scores, 0.3525, is one estimate of the population standard deviation. The standard deviation of the bootstrap means, by contrast, estimates the standard error of the sample mean, which is a very different number. So what's going on?
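In code, the two quantities come from different places (same assumed objects as before):

```r
library(dplyr)

# Standard deviation of flavor scores in the original sample:
# an estimate of the population standard deviation (0.3525 in the video)
coffee_sample %>%
  summarize(sd_flavor = sd(flavor))

# Standard deviation of the bootstrap means:
# an estimate of the standard error of the sample mean
sd(mean_flavors_5000)
```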

8. Sample, bootstrap dist'n, pop'n standard deviations

Remember that one goal of bootstrapping is to quantify what variability you might expect in your sample statistic as you go from one sample to another. Recall that this quantity is called the standard error of that statistic. The standard deviation of the bootstrap means can be used to estimate this measure of uncertainty. If you multiply that standard error by the square root of the sample size, you get an estimate of the standard deviation in the original population. Our estimate of the standard deviation is 0.3515. The true standard deviation is 0.3414, so our estimate is pretty close. In fact, it's slightly closer than the standard deviation calculated from the original sample.
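That calculation, as a sketch with the same assumed objects:

```r
library(dplyr)

# Estimated standard error: the standard deviation of the bootstrap means
standard_error <- sd(mean_flavors_5000)

# Standard error times the square root of the sample size
# estimates the population standard deviation (0.3515 in the video)
standard_error * sqrt(500)

# True population standard deviation, for comparison (0.3414 in the video)
coffee_focus %>%
  summarize(sd_flavor = sd(flavor))
```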

9. Interpreting the standard errors

To recap, the estimated standard error is the standard deviation of the bootstrap distribution values for your statistic of interest. This estimated standard error from the bootstrap distribution, times the square root of the sample size, gives a really good estimate of the standard deviation of the population. That is, although bootstrapping can sometimes be poor at estimating the population mean if the sample is biased, it is, in general, great for estimating the population standard deviation.

10. Let's practice!

Let's play with some standard errors.