Bootstrapping
1. Bootstrapping

2. Hypothesis testing
With hypothesis testing, it is important to understand how samples from a null population vary. We repeatedly sample from a null population, which gives a sense of the variability of the statistic under the random chance model. Note that the statistic at hand is p-hat, which is the proportion of successes in the sample. The parameter, on the other hand, is the proportion of successes in the population.
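
To make this concrete, here is a minimal R sketch of the idea, assuming a 50-50 random chance model and samples of size 30; both choices are illustrative, not taken from the lesson.

    # Repeatedly sample from an assumed 50-50 "random chance" null population
    # and compute p-hat, the proportion of successes, for each sample.
    set.seed(123)
    n <- 30  # sample size used for illustration

    null_p_hats <- replicate(1000, mean(rbinom(n, size = 1, prob = 0.5)))

    summary(null_p_hats)  # the spread shows how p-hat varies under the null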

3. Confidence intervals
In contrast, with confidence intervals, there is no null population. Instead, we need to understand how samples from the population of interest vary. We expect the sample statistic to vary around the parameter, but how far is the statistic from the parameter?

4. Bootstrapping
Bootstrapping is a method that allows us to estimate the distance from a statistic to the parameter. Let's see how it works. Bootstrapping repeatedly samples from the sample in order to estimate the variability of the statistic. Each time we resample, the data are sampled from the original data with replacement. It turns out that the process of resampling from the original sample is an excellent approximation for sampling from a population! We call the bootstrapped statistic p-hat-star, which is the proportion of successes in the resample.
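
As a rough sketch of one resampling step (the object names and seed below are illustrative, not the lesson's code), the resample is drawn from the original sample with replacement, and p-hat-star is the proportion of successes in it.

    # One bootstrap resample from a hypothetical original sample of 30 voters,
    # 17 of whom plan to vote for candidate X (as in the polling example below).
    set.seed(47)
    sample_orig <- c(rep("X", 17), rep("other", 13))

    # Resample the same number of observations, with replacement
    resample <- sample(sample_orig, size = length(sample_orig), replace = TRUE)

    # p-hat-star: the proportion of successes in the resample
    mean(resample == "X")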

13. Polling
The original sample showed 17 of 30 who plan to vote for candidate X, or almost 57%. In the first resample, drawn with replacement from the original sample, 47% of individuals plan to vote for candidate X. The second resample has 60% of individuals voting for candidate X. Note that in this second resample, observation 24 is repeated three times and observation 21 shows up twice. The third resampled proportion is 40%. By repeating the resampling process many times, the resampled proportions give a measure of how p-hat varies. Indeed, the standard error of p-hat-star is around 0.09.

17. Standard error
The standard error, which describes how variable the statistic is around the parameter, is key to building a confidence interval. As we've already said, the bootstrap process is an excellent approximation for estimating how variable the statistic is. To demonstrate how well the bootstrap works, we set up a hypothetical situation where we actually know the number we need. The following R code calculates the standard error of the sample proportions in two ways.

18. Variability of p-hat from the population
First, we use the population information; then we use the bootstrap. To start, consider a totally unrealistic situation on voter preference, where we actually know the true population parameter: 60% of people prefer candidate X. As a way of measuring the variability of p-hat, we set up a scenario, again totally unrealistic, where we take many samples from the same population. For each of the samples, p-hat is calculated. The standard error is exactly the number we need to know and is calculated to be 0.085. But it is unrealistic to think we could take many samples from the population. Typically, the researcher has exactly one sample from the population.
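
A minimal sketch of this unrealistic setup, assuming samples of size 30 and a known population proportion of 0.60 (the object names and number of repetitions are illustrative):

    # Draw many samples directly from the known population and compute p-hat
    # for each; the standard deviation of these p-hats is the standard error.
    set.seed(42)
    n <- 30          # sample size, as in the polling example
    p_true <- 0.60   # assumed known population proportion

    p_hats <- replicate(1000, mean(rbinom(n, size = 1, prob = p_true)))

    sd(p_hats)  # close to sqrt(p_true * (1 - p_true) / n), roughly 0.09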

19. Variability of p-hat from the sample (bootstrapping)
It turns out that the variability of the statistic p-hat can be measured without taking repeated samples from the population. Instead, we take repeated samples from the original observed data. Here, we generate 1000 resamples from one pool of data. An important characteristic of resampling is that it must be done with replacement; otherwise, every resample would be identical. Notice that the resampled p-hat-star values have a standard deviation of 0.0869, which is very close to the 0.0852 from the previous, unrealistic slide.
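
One way to sketch this step in R uses the infer package; the data frame, column name, and seed below are illustrative assumptions, and the lesson's actual code may be organized differently.

    library(dplyr)
    library(infer)

    # Hypothetical data frame holding the single observed sample of 30 voters
    one_poll <- data.frame(vote = factor(c(rep("X", 17), rep("other", 13))))

    set.seed(2018)

    # 1000 bootstrap resamples, each summarized by its proportion of successes
    boot_props <- one_poll %>%
      specify(response = vote, success = "X") %>%
      generate(reps = 1000, type = "bootstrap") %>%
      calculate(stat = "prop")

    # Bootstrap estimate of the standard error of p-hat
    sd(boot_props$stat)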

20. Let's practice!
OK, now it's your turn to practice what you've learned.