Bootstrapping
1. Bootstrapping
Now let's dive deeper into the first, and probably most popular, resampling method: bootstrapping. The name refers to the fact that we use the existing dataset to simulate multiple different datasets. Let's try to understand bootstrapping with an example.
2. Easter eggs
Suppose you've received a large shipment of Easter eggs and are interested in determining the average weight of each egg for quality control. You have access to a small sample of 10 eggs. You weigh these eggs and find 4 that weigh 20g, 3 that weigh 70g and 3 others weighing 50g, 90g and 80g respectively.
3. Easter eggs
From this sample, you can easily calculate a mean of 51g, a standard deviation of 27g (dividing by n rather than n - 1) and a standard error of 8.54g, and then multiply this standard error by 1.96 to get a 95% confidence interval between 34.27 and 67.73. This gives us what we want, doesn't it? We went from the sample to an estimate for the whole population. However, there are hidden assumptions in this calculation. First of all, we assumed that the weights, or at least their sample mean, are normally distributed. In addition, we assumed that the confidence interval is symmetric. Neither of these may be a reasonable assumption. So what do we do?
4. Bootstrapping Easter eggs
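Before resampling, it may help to see the normal-theory calculation from the previous slide written out. This is a minimal sketch that assumes NumPy; the variable names are my own, and it uses the population formula (`ddof=0`) for the standard deviation because that is what reproduces the quoted value of 27.

```python
import numpy as np

# Egg weights in grams: four at 20g, three at 70g, plus 50g, 90g and 80g.
weights = np.array([20, 20, 20, 20, 70, 70, 70, 50, 90, 80])

mean = weights.mean()
# ddof=0 divides by n and gives 27; ddof=1 would divide by n - 1 instead.
sd = weights.std(ddof=0)
se = sd / np.sqrt(len(weights))

# Normal-approximation 95% interval: mean plus or minus 1.96 standard errors.
ci_low = mean - 1.96 * se
ci_high = mean + 1.96 * se

print(f"mean={mean:.2f} sd={sd:.2f} se={se:.2f} CI=({ci_low:.2f}, {ci_high:.2f})")
```

By construction this interval is symmetric around the mean, which is one of the assumptions the bootstrap lets us drop.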
One approach is to take a bootstrapped sample by sampling with replacement from the original sample. In our case, this means that each of the 10 eggs has an equal probability of being picked. And since it is with replacement, each egg has the same probability of being picked on every subsequent draw as well. Here are some bootstrapped samples drawn from the original sample. Notice how some egg weights appear more often than they do in the original sample. After drawing multiple bootstrap samples, we can calculate the mean weight for each of these samples.
5. Bootstrapped distribution
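The resampling procedure just described can be sketched in a few lines. This is an illustration under assumptions rather than the exact code behind the slide: it assumes NumPy, an arbitrary seed of 42, and a percentile interval (the 2.5th and 97.5th percentiles of the bootstrap means), which the transcript does not confirm as the method used.

```python
import numpy as np

# Egg weights in grams from the original sample of 10.
weights = np.array([20, 20, 20, 20, 70, 70, 70, 50, 90, 80])

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
n_iterations = 5000

# Each row is one bootstrap sample: 10 draws with replacement,
# every egg equally likely on every draw.
samples = rng.choice(weights, size=(n_iterations, len(weights)), replace=True)

# One mean per bootstrap sample gives the bootstrapped distribution.
boot_means = samples.mean(axis=1)

# Percentile interval (assumed method): no symmetry is imposed.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"bootstrap mean={boot_means.mean():.2f} CI=({ci_low:.2f}, {ci_high:.2f})")
```

Because the draws are random, the printed numbers should land close to the figures quoted in this lesson without necessarily matching them exactly.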
Using 5000 iterations, I get a mean weight of 50.8g with a 95% confidence interval between 35 and 67.03. Notice that the CI is not symmetric. Although this result isn't hugely different from the original calculation, it does serve to illustrate the power of the bootstrap. One thing to keep in mind is that the reliability of the bootstrap is dependent on the original sample being a reasonable representation of the population.
6. Bootstrap - Good to know
As a rule of thumb, be sure to run at least 5,000 to 10,000 iterations, with each bootstrap sample containing at least as many observations as the original sample. Another thing to keep in mind is that bootstrapping is a random simulation. This means that the answer will be an approximation and will vary slightly every time you run the simulation. One word of caution is that some bootstrapped statistics, especially those concerning the dispersion of the data, like the standard deviation, tend to be inherently biased. But there are procedures like the balanced bootstrap that help correct this bias. I encourage you to look these up as you get more familiar with bootstrapping.
7. Let's practice!
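The two cautions above, run-to-run variation and bias in dispersion statistics, can be seen on the egg data. This is a rough sketch assuming NumPy; `bootstrap_stat` is a hypothetical helper written for this illustration, and the seeds are arbitrary.

```python
import numpy as np

weights = np.array([20, 20, 20, 20, 70, 70, 70, 50, 90, 80])

def bootstrap_stat(stat, seed, n_iterations=10_000):
    """Hypothetical helper: bootstrap distribution of `stat` over the egg sample."""
    rng = np.random.default_rng(seed)
    samples = rng.choice(weights, size=(n_iterations, len(weights)), replace=True)
    return stat(samples, axis=1)

# Same procedure, different seeds: the estimates are close but need not be equal.
for seed in (1, 2):
    boot_means = bootstrap_stat(np.mean, seed)
    print(f"seed={seed} bootstrap mean={boot_means.mean():.3f}")

# Resamples tend to be less spread out than the data they are drawn from,
# so the average bootstrapped standard deviation falls below the original 27.
boot_sds = bootstrap_stat(np.std, seed=3)
print(f"original sd={weights.std():.2f} mean bootstrap sd={boot_sds.mean():.2f}")
```

The sketch only exposes the downward bias; it does not implement the balanced bootstrap or any other correction.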
Now let's work through some examples together.