1. Model Uncertainty and Sample Distributions
Previously, to estimate a model parameter, we assumed a shape for the parameter distribution. Least-squares assumes a Gaussian; maximum likelihood estimation requires us to choose a shape, so we chose Gaussian.
But there are situations where the distribution shape is unknown.
2. Population Unavailable
Recall the daily high temperature data seen earlier. Usually we only have a single sample, and no knowledge of the SHAPE of the population.
3. Sample as Population Model
Recall how the shape of the sampled temperature data resembled the population shape?
What if we used the sample as the model of the population?
4. Sample Statistic
If we compute the mean of the single sample, it gives us a guess, but no knowledge of the uncertainty in this guess.
What we really want is a prediction for what happens when we take the NEXT sample of the population.
5. Bootstrap Resampling
What if we RESAMPLE the SAMPLE, many times, to SIMULATE taking samples from the population?
This is called bootstrap resampling.
Here we see 3 resamples. Each time we resample, we compute a statistic, the mean.
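As a rough sketch of what three resamples might look like, assuming NumPy and a small made-up temperature sample (the values and variable names here are illustrative, not the course data):

```python
import numpy as np

# Hypothetical sample of daily high temperatures (illustrative values only)
sample = np.array([95.0, 101.0, 98.0, 104.0, 99.0, 102.0, 97.0, 100.0])

rng = np.random.default_rng(seed=42)  # seeded so the run is repeatable

resample_means = []
for i in range(3):
    # Draw len(sample) values from the sample itself, with replacement
    resample = rng.choice(sample, size=len(sample), replace=True)
    resample_means.append(resample.mean())
    print(f"resample {i + 1}: mean = {resample.mean():.2f}")
```

Each resample gives a slightly different mean, and that spread is what the bootstrap uses.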
6. Resample Distribution
"Sampling the sample" results in a distribution of sample statistic values. In this example, a distribution of mean temperatures.
We can use this to predict that a future sample mean will be about 100 degrees, with a "probability" of about 95% that it falls between 92 and 107, since 95% is the FRACTION of the area, or counts, between those values.
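One way to read such an interval off a distribution of resample means is with percentiles. The sketch below assumes NumPy and uses synthetic numbers as a stand-in for the bootstrap means, so the printed values will not match the 92 and 107 quoted above:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a bootstrap distribution of mean temperatures (synthetic values)
bootstrap_means = rng.normal(loc=100.0, scale=4.0, size=10_000)

# The middle 95% of the values lies between the 2.5th and 97.5th percentiles
lower, upper = np.percentile(bootstrap_means, [2.5, 97.5])

# Fraction of resample means (the "counts") that fall inside that interval
fraction_inside = np.mean((bootstrap_means >= lower) & (bootstrap_means <= upper))

print(f"center: {bootstrap_means.mean():.1f}")
print(f"95% interval: {lower:.1f} to {upper:.1f}")
print(f"fraction inside: {fraction_inside:.3f}")
```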
7. Bootstrap in Code
Let's see it in code.
Assume we have only a sample of daily high temperatures. We want to model the August daily highs for a decade, or predict the next sample draw from that population.
First, we assign the sample as the model for the unmeasured population.
Second, we resample the sample, computing a mean temperature of each resample. The result is many mean values.
This is called a "bootstrap sample distribution".
Note the use of replace=True. More on that soon.
Lastly, we compute the mean of the means. This is our best estimate of the daily high temperature in any August. And the standard deviation of the means is the "standard error" or uncertainty.
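The steps just described might look something like this. It is a minimal sketch assuming NumPy; the sample is synthetic, and names such as sample_data, population_model, and num_resamples are placeholders rather than the course's own code:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical sample: 31 August daily highs (synthetic, for illustration)
sample_data = rng.normal(loc=100.0, scale=5.0, size=31)

# Step 1: use the sample as the model for the unmeasured population
population_model = sample_data

# Step 2: resample the model many times, recording the mean of each resample
num_resamples = 1000
bootstrap_means = np.zeros(num_resamples)
for i in range(num_resamples):
    resample = rng.choice(population_model, size=len(population_model), replace=True)
    bootstrap_means[i] = resample.mean()

# Step 3: the mean of the means is the estimate; their spread is the standard error
estimate = bootstrap_means.mean()
standard_error = bootstrap_means.std()

print(f"bootstrap estimate: {estimate:.2f}")
print(f"standard error:     {standard_error:.2f}")
```

The number of resamples is a choice: more resamples give a smoother bootstrap distribution at the cost of more computation.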
8. Replacement
When resampling the sample, what is "replacement"?
Imagine you want to write a song using a 7-tone scale, but you lack inspiration. However, you do have 7 marbles and a marker. You "label" each marble with a note -- A, B, C, D, E, F, G -- and then you put them in a bag.
You then randomly "draw" 4 marbles, one at a time, writing down each note as you go.
That is a single sample, composed of 4 draws. Like a single song of 4 notes.
But what you do, in between draws, makes all the difference in the kinds of samples (or songs) that are possible.
If, between each draw, you put the marble back in the bag, that's sampling WITH REPLACEMENT. Every draw will be from a bag of 7 marbles.
9. Replacement
If you do NOT put the marble back in before making your next draw, that's sampling WITHOUT replacement. Each successive draw will be from a bag of 6 marbles, then 5, then 4.
Sampling WITHOUT replacement presents two problems:
(1) You never get repeated notes.
(2) Your sampling method is changing the "model" after every draw.
If you want to estimate the shape of a single population model using resampling, you must hold the population model constant, meaning you have to put the marble back in between every draw!
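The marble example can be tried out in a few lines. This is only a sketch, assuming NumPy, with the seven notes standing in for the marbles:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
notes = np.array(["A", "B", "C", "D", "E", "F", "G"])

# With replacement: every draw is from all 7 notes, so repeats are possible
song_with = rng.choice(notes, size=4, replace=True)

# Without replacement: each note can be drawn at most once
song_without = rng.choice(notes, size=4, replace=False)

print("with replacement:   ", song_with)
print("without replacement:", song_without)

# Without replacement, a resample as large as the bag is only a reshuffle:
# it always contains exactly the same notes as the original
reshuffled = rng.choice(notes, size=len(notes), replace=False)
same_notes = sorted(reshuffled) == sorted(notes)
print("reshuffle has the same notes:", same_notes)
```

The last check shows why replacement matters for the bootstrap: a full-size resample drawn without replacement has the same values as the sample, so its mean would never vary.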
10. Let's practice!
Now it's your turn to apply bootstrap resampling to distance data to estimate the distribution of slopes, or "speeds", you might get from future measurements.