1. Generating bootstrap replicates
In the prequel to this course we computed summary statistics of measurements, including the mean, median, and standard deviation. But remember, we need to think probabilistically. What if we acquired the data again? Would we get the same mean? The same median? The same standard deviation? Probably not. In inference problems, it is rare that we are interested in the result from a single experiment or data acquisition. We want to say something more general.
2. Michelson's speed of light measurements
Michelson was not interested in what the measured speed of light was in the specific 100 measurements conducted in the summer of 1879. He wanted to know what the speed of light actually is. Statistically speaking, that means he wanted to know what speed of light he would observe if he did the experiment over and over again an infinite number of times. Unfortunately, actually repeating the experiment lots and lots of times is just not possible. But, as hackers, we can simulate getting the data again.
3. Resampling an array
The idea is that we resample the data we have and recompute the summary statistic of interest, say the mean. To resample an array of measurements, we randomly
4. Resampling an array
select one entry and
5. Resampling an array
store it. Importantly, we
6. Resampling an array
replace the entry in the original array, or equivalently, we just don't delete it. This is called sampling with replacement. Then, we randomly
7. Resampling an array
select another
8. Resampling an array
one and store it. We do this n times,
9. Resampling an array
where n is the total number of measurements, five in this case. We then have a resampled array of data. Using this new resampled array, we compute the summary statistic and store the result. Resampling the speed of light data is as if we repeated Michelson's set of measurements.
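To make the procedure concrete, here is a minimal sketch of resampling by hand, one entry at a time. The five values in the array are made-up placeholders, not real measurements, and the variable names are just for illustration.

```python
import numpy as np

np.random.seed(42)  # seeded only so the sketch is reproducible

# Hypothetical array of five measurements (placeholder values)
data = np.array([23.3, 27.1, 24.3, 25.7, 26.0])
n = len(data)

resampled = np.empty(n)
for i in range(n):
    # Pick a random position. Nothing is removed from `data`, so the
    # same entry can be picked again: this is sampling with replacement.
    idx = np.random.randint(0, n)
    resampled[i] = data[idx]

# Compute the summary statistic from the resampled array and store it
resampled_mean = np.mean(resampled)
print(resampled, resampled_mean)
```

Because no entry is ever deleted, the resampled array will usually repeat some values and leave others out.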
10. Mean of resampled Michelson measurements
We do this over and over again to get a large number of summary statistics from resampled data sets. We can use these results to plot an ECDF, for example, to get a picture of the probability distribution describing the summary statistic. This process is an example of
11. Bootstrapping
bootstrapping, which more generally is the use of resampled data to perform statistical inference. To make sure we have our terminology down, each resampled array is called
12. Bootstrap sample
a bootstrap sample. A bootstrap replicate
13. Bootstrap replicate
is the value of the summary statistic computed from the bootstrap sample. The name makes sense; it's a simulated replica of the original data acquired by bootstrapping. Let's look at how we can generate a bootstrap sample and compute a bootstrap replicate from it using Python. We will use Michelson's measurements of the speed of light.
14. Resampling engine: np.random.choice()
First, we need a function to perform the resampling. The NumPy function np.random.choice() provides this functionality. Conveniently, like many of the other functions in the NumPy random module, it has a size keyword argument, which allows us to specify how many samples we want to take out of the array. Notice that it chose the number five three times; the function does not delete an entry when it samples it out of the array. Now, we can draw 100 samples out of the Michelson speed of light data.
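A sketch of these two calls might look like the following. The small array [1, 2, 3, 4, 5] and the variable name michelson_speed_of_light are assumptions, and the 100 values here are synthetic stand-ins drawn from a Normal distribution, not Michelson's actual measurements.

```python
import numpy as np

np.random.seed(42)  # seeded only so the sketch is reproducible

# Draw five entries from a small array. np.random.choice() samples
# with replacement by default, so repeated values are expected.
small_sample = np.random.choice([1, 2, 3, 4, 5], size=5)
print(small_sample)

# Synthetic stand-in for the 100 speed of light measurements (km/s).
# These are NOT the real data; load the real array in its place.
michelson_speed_of_light = np.random.normal(299850, 80, size=100)

# Draw 100 samples, with replacement, out of the 100 data points
bs_sample = np.random.choice(michelson_speed_of_light, size=100)
print(bs_sample[:5])
```

The exact numbers drawn depend on the seed, so your output will generally differ from what is shown on the slide.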
15. Computing a bootstrap replicate
This is a bootstrap sample, since there were 100 data points and we are choosing 100 of them with replacement. Now that we have a bootstrap sample, we can compute a bootstrap replicate. We can pick whatever summary statistic we like. We'll compute the mean, median, and standard deviation. It's as simple as treating the bootstrap sample as though it were a data set.
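Here is a minimal sketch of computing bootstrap replicates, and of repeating the process many times to build up the ECDF of the replicates described earlier. Again, the data array is a synthetic stand-in, and names such as bs_means and n_reps are just illustrative choices.

```python
import numpy as np

np.random.seed(42)  # seeded only so the sketch is reproducible

# Synthetic stand-in for the 100 measurements (NOT the real data)
data = np.random.normal(299850, 80, size=100)

# One bootstrap sample: same size as the data, drawn with replacement
bs_sample = np.random.choice(data, size=len(data))

# Bootstrap replicates: treat the bootstrap sample like a data set
bs_mean = np.mean(bs_sample)
bs_median = np.median(bs_sample)
bs_std = np.std(bs_sample)
print(bs_mean, bs_median, bs_std)

# Do it over and over again to get many replicates of the mean
n_reps = 10000
bs_means = np.empty(n_reps)
for i in range(n_reps):
    bs_means[i] = np.mean(np.random.choice(data, size=len(data)))

# x and y values for an ECDF of the replicates, ready for plotting
x = np.sort(bs_means)
y = np.arange(1, n_reps + 1) / n_reps
```

Plotting y against x, for instance as points with matplotlib, would give the picture of the probability distribution of the mean.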
16. Let's practice!
Now it's time for you to do some bootstrap sampling yourself!