1. Resampling as a special type of Monte Carlo simulation
Resampling can be thought of as a special type of Monte Carlo simulation.
2. Resampling as a special type of Monte Carlo simulation
In Monte Carlo simulations, we sample from probability distributions
which are either known or assumed.
Often we rely on historical data or subject matter knowledge to choose the proper distribution.
In resampling, we randomly sample from the existing data,
using the existing data implicitly as a probability distribution.
We assume the existing data is representative of the population of interest.
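To make the contrast concrete, here is a minimal sketch. The `observed` values and the normal distribution's parameters are made up for illustration and do not come from the course data.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical observations (made-up values, for illustration only).
observed = [4.1, 5.3, 4.8, 6.0, 5.1, 4.7, 5.5, 5.0]

# Monte Carlo: draw from a distribution we assume. Here that is a normal
# distribution whose mean and standard deviation are guesses.
monte_carlo_draws = [random.gauss(5.0, 0.5) for _ in range(5)]

# Resampling: draw directly from the observed data, which then acts
# as an empirical probability distribution.
resampled_draws = random.choices(observed, k=5)

print(monte_carlo_draws)
print(resampled_draws)
```

Every resampled value is one of the observed values, while the Monte Carlo draws can be any value the assumed distribution allows.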
3. Resampling methods
In sampling without replacement, we draw a random item and do not put it back. The next draw comes from the items that remain, so no item can be picked twice.
In sampling with replacement, or bootstrapping, we draw a random sample, put it back, and draw from the entirety of the samples again. This method allows the estimation of the sampling distribution of almost any statistic.
With permutation, we shuffle the order of the elements in the data to form a new sample. Permutation is often used to compare two groups.
4. Sampling without replacement
Let's look at sampling without replacement. We'd like to draw two different random New England states.
We define a function two_random_ne_states.
ne_states is a list containing the six states in New England.
We use random-dot-sample to sample two states without replacement.
Calling the function twice, we see two randomly generated lists containing state names.
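The function described here might look roughly like the following sketch. The function and list names follow the narration; the exact body used in the course is an assumption.

```python
import random

def two_random_ne_states():
    # The six states in New England.
    ne_states = ["Maine", "Vermont", "New Hampshire",
                 "Massachusetts", "Connecticut", "Rhode Island"]
    # random.sample draws without replacement, so the two states differ.
    return random.sample(ne_states, k=2)

# Calling the function twice gives two independently drawn pairs.
print(two_random_ne_states())
print(two_random_ne_states())
```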
5. Bootstrapping
Now, we'd like to estimate the 95% confidence interval for the mean height of NBA players.
We use the random-dot-choices function to sample with replacement from a list of player heights, setting k equal to 15 to draw 15 heights per resample, and record the mean of each resample. After resampling 1,000 times, we calculate the 2.5th and 97.5th percentiles of the simulated means
and print the mean as well as its 95% confidence interval.
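A minimal sketch of this bootstrap follows. The heights are placeholder values, not the course data, and `statistics.quantiles` from the standard library stands in for whatever percentile function the course uses.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical heights in cm (placeholder values, not the course data).
nba_heights = [196, 191, 198, 216, 188, 185, 211, 201,
               188, 191, 201, 208, 191, 187, 196]

simulated_means = []
for _ in range(1000):
    # Sample 15 heights with replacement and record the resample's mean.
    resample = random.choices(nba_heights, k=15)
    simulated_means.append(statistics.mean(resample))

# With n=40 the cut points are 2.5% apart, so the first and last
# cut points are the 2.5th and 97.5th percentiles.
cuts = statistics.quantiles(simulated_means, n=40, method="inclusive")
lower, upper = cuts[0], cuts[-1]

print("mean:", statistics.mean(simulated_means))
print("95% interval:", lower, upper)
```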
6. Visualization of bootstrap results
Let's plot the results using Seaborn and Matplotlib, two plotting libraries we'll leverage throughout the course.
We use Seaborn's displot to plot the distribution of the simulated results. Then, we can plot three vertical lines using Matplotlib's axvline function. We plot two red lines to mark the 95% confidence interval boundaries and a green line to mark the mean height, which falls in the middle of the confidence interval.
7. Permutation
Our next example uses permutation to test whether the mean heights of two lists, NBA players and adult American males, really differ. We estimate the 95% interval that the difference in means would fall in if group membership made no difference.
We initially assume there is no difference in heights and begin by merging the us_heights and nba_heights lists into a single list.
Then we use NumPy's random-dot-permutation to shuffle all the values together to create perm_sample. To mimic the lengths of the original lists, we assign the first 15 values as the new simulated NBA heights and the next 20 as the simulated American male heights.
We subtract the means of the two lists perm_nba and perm_adult from each other, record the value, and repeat this process 1000 times.
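The loop can be sketched as below. Both height lists are placeholder values, not the course data. To stay within the standard library, `random.sample` over the full list is swapped in as the shuffle; NumPy's `np.random.permutation` would do the same job.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical heights in cm (placeholder values, not the course data).
nba_heights = [196, 191, 198, 216, 188, 185, 211, 201,
               188, 191, 201, 208, 191, 187, 196]          # 15 values
us_heights = [165, 185, 179, 187, 193, 180, 178, 179, 171, 176,
              169, 160, 140, 199, 176, 185, 175, 196, 190, 176]  # 20 values

# Under the assumption of no difference, the group labels are arbitrary,
# so we pool all heights into one list.
combined = us_heights + nba_heights

perm_diffs = []
for _ in range(1000):
    # Sampling every element without replacement yields a shuffled copy.
    perm_sample = random.sample(combined, k=len(combined))
    perm_nba = perm_sample[:15]      # first 15 values: simulated NBA group
    perm_adult = perm_sample[15:35]  # next 20 values: simulated adult group
    perm_diffs.append(statistics.mean(perm_nba) - statistics.mean(perm_adult))

print(len(perm_diffs), "simulated differences recorded")
```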
8. Permutation results
The difference in the means of the original lists of NBA and adult male heights is 18-point-32 centimeters.
After performing the permutations, 95% of the simulated differences between two randomly assigned lists fall between negative 10-point-03 and positive 10-point-03. Because the observed difference in means of 18-point-32 lies outside this interval, it is unlikely to be the result of chance alone.
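The comparison itself can be sketched with two small helpers. The names `percentile_interval` and `outside_interval` are hypothetical, and the simulated differences below are evenly spaced stand-in values rather than real permutation output. Only the observed difference of 18.32 comes from the example.

```python
import statistics

def percentile_interval(values):
    """Return the 2.5th and 97.5th percentiles of values."""
    # n=40 gives cut points 2.5% apart; take the first and the last.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]

def outside_interval(observed, values):
    """True if observed falls outside the central 95% of values."""
    lower, upper = percentile_interval(values)
    return observed < lower or observed > upper

# Stand-in for the simulated differences: 201 evenly spaced values
# from -10.0 to 10.0 (hypothetical, for illustration only).
simulated = [x / 10 for x in range(-100, 101)]

lower, upper = percentile_interval(simulated)
print("95% interval:", lower, upper)
print("observed difference outside interval:",
      outside_interval(18.32, simulated))
```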
9. Visualizing permutation results
Checking the results graphically confirms this: the observed difference in means, represented by the green line at the far right of the distribution, lies outside the red lines that mark the confidence interval boundaries.
NBA players are taller than the average American male!
10. Let's practice!
Now, let's practice some resampling!