1. Comparing sampling and bootstrap distributions
In the last video,
2. Coffee focused subset
we took a focused subset of the coffee dataset. Here's a five-hundred-row sample from it.
3. The bootstrap of mean coffee flavors
Here, we generate a bootstrap distribution of the mean coffee flavor scores from that sample.
dot-sample generates a resample, np-dot-mean calculates the statistic, and the for loop with dot-append repeats these steps to produce a distribution of bootstrap statistics.
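As a minimal sketch, assuming the sample is a DataFrame named coffee_sample with a flavors column (both placeholder names, not confirmed by the video), the loop might look like this:

```python
import numpy as np

bootstrap_distn = []
for i in range(1000):
    bootstrap_distn.append(
        # Resample the rows with replacement, then take the mean flavor score
        # (coffee_sample is the assumed name of the five-hundred-row sample)
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavors'])
    )
```

Passing frac=1 with replace=True makes each resample the same size as the original sample, which is what bootstrapping requires.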
4. Mean flavor bootstrap distribution
Here's the histogram of the bootstrap distribution, which is close to a normal distribution.
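A histogram like that one can be reproduced with matplotlib, reusing the bootstrap_distn list from the sketch above:

```python
import matplotlib.pyplot as plt

# Plot the bootstrap distribution of the sample mean
plt.hist(bootstrap_distn, bins=30)
plt.xlabel("Bootstrap mean flavor score")
plt.show()
```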
5. Sample, bootstrap distribution, population means
Here's the mean flavor score from the original sample.
In the bootstrap distribution, each value is an estimate of the mean flavor score. Recall that each of these values corresponds to one potential sample mean from the theoretical population.
If we take the mean of those means, we get our best guess of the population mean. The two values are really close.
However, there's a problem. The true population mean is actually a little different.
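Sketching those comparisons with the placeholder names from above, where coffee_full is a hypothetical DataFrame standing in for the whole population:

```python
# Mean of the original five-hundred-row sample
print(np.mean(coffee_sample['flavors']))

# Mean of the bootstrap distribution: nearly identical to the sample mean
print(np.mean(bootstrap_distn))

# True population mean, via the hypothetical coffee_full DataFrame:
# a little different from the two values above
print(np.mean(coffee_full['flavors']))
```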
6. Interpreting the means
The behavior that you just saw is typical. The bootstrap distribution mean is usually almost identical to the original sample mean.
However, that isn't necessarily a good thing. If the original sample wasn't closely representative of the population, then the bootstrap distribution mean won't be a good estimate of the population mean either.
Bootstrapping cannot correct any potential biases due to differences between the sample and the population.
7. Sample sd vs. bootstrap distribution sd
While we have that limitation in estimating the population mean, one great thing about bootstrap distributions is that we can also quantify variation.
The standard deviation of the sample flavors is around zero-point-three-five-four. Recall that pandas dot-std calculates a sample standard deviation by default.
If we calculate the standard deviation of the bootstrap distribution, specifying a ddof of one, then we get a completely different number. So what's going on here?
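In code, the two calculations might look like this, again using the placeholder names:

```python
# pandas .std() uses ddof=1 (sample standard deviation) by default
print(coffee_sample['flavors'].std())

# np.std() defaults to ddof=0, so pass ddof=1 explicitly
print(np.std(bootstrap_distn, ddof=1))
```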
8. Sample, bootstrap dist'n, pop'n standard deviations
Remember that one goal of bootstrapping is to quantify what variability we might expect in our sample statistic as we go from one sample to another. Recall that this quantity is called the standard error, and it is measured by the standard deviation of the sampling distribution of that statistic. The standard deviation of the bootstrap means gives us a way to estimate this measure of uncertainty.
If we multiply that standard error by the square root of the sample size, we get an estimate of the standard deviation in the original population. Our estimate of the standard deviation is around point-three-five-three.
The true standard deviation is around point-three-four-one, so our estimate is pretty close. In fact, it is closer than just using the sample standard deviation alone.
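A sketch of that estimate, assuming the sample size of five hundred and the placeholder names from earlier:

```python
# Standard error: standard deviation of the bootstrap distribution
std_error = np.std(bootstrap_distn, ddof=1)

# Multiplying by the square root of the sample size (n = 500)
# estimates the population standard deviation
print(std_error * np.sqrt(500))
```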
9. Interpreting the standard errors
To recap, the estimated standard error is the standard deviation of the bootstrap distribution values for our statistic of interest.
This estimated standard error times the square root of the sample size gives a really good estimate of the standard deviation of the population.
That is, while bootstrapping couldn't improve on the sample mean as an estimate of the population mean, it is generally great for estimating the population standard deviation.
10. Let's practice!
Let's play with some standard errors.