1. Visualizing the bootstrap
In the previous lessons, we've focused on a specific way of expressing uncertainty: the model-based confidence interval. Here, we'll look at another way of estimating uncertainty: bootstrap resampling.
2. Interpretation of confidence interval problems (a)
Confidence intervals are an excellent and easy way of showing uncertainty for your estimates, but they only present an interval where we would expect our estimate to fall under repeated sampling. On its own, the interval does not distinguish between a value that falls at one end and a value directly in the middle.
3. Interpretation of confidence interval problems (b)
A natural desire when looking at uncertainty is to get an idea of how much more or less supported one value within our interval is compared to another. Luckily, there is an easy way to get this using a technique called 'the bootstrap.'
4. The "Bootstrap" (a)
The bootstrap is a technique that provides uncertainty for an estimate by resampling the observed data many times.
Recall the definition of a confidence interval. 'If we sampled from our population many times we would expect our estimate to fall in this range.' Traditional confidence intervals use models to calculate this range from the observed sample.
5. The "Bootstrap" (b)
The bootstrap instead approximates this range by simulation: it repeatedly draws new samples, with replacement, from the observed sample itself.
6. The "Bootstrap" (c)
For each bootstrap resample, the estimate of interest is collected. At the end of this process, we have a collection of resampled estimates, which we can use to make uncertainty statements about our original estimate.
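The resampling loop just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's own helper: the name bootstrap() matches the function mentioned later, but its signature here (n_boots, stat, seed) and the small observed array are assumptions made for the example.

```python
import numpy as np

def bootstrap(data, n_boots=1000, stat=np.mean, seed=None):
    """Return n_boots estimates of `stat`, one per bootstrap resample.

    Hypothetical helper: the signature is an assumption for illustration.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    estimates = np.empty(n_boots)
    for i in range(n_boots):
        # Draw a resample the same size as the data, with replacement
        resample = rng.choice(data, size=len(data), replace=True)
        # Collect the estimate of interest for this resample
        estimates[i] = stat(resample)
    return estimates

# Made-up observations, used only to exercise the function
observed = np.array([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7])
boot_means = bootstrap(observed, n_boots=2000, seed=0)
```

Each entry of boot_means is the mean of one resample, so the spread of the array is what we use to make uncertainty statements about the original mean.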
7. Histograms of bootstrap samples
Since the bootstrap gives us an array of estimates, we can use any of the distribution visualization plots we've already seen to show the results.
The most common form of visualization used for the bootstrap is the histogram.
The following code performs bootstrap resampling using the bootstrap() function and plots the results using Seaborn's histplot() function.
For comparison's sake, a 95% confidence interval is overlaid on the background by finding the boundaries of the middle 95% of the bootstrap estimates.
8. Histograms of bootstrap samples (plot)
Histograms are preferred for bootstrap visualizations due to their ease of interpretation. Since you control the number of samples drawn, you can make it high enough that the issues with histograms, like not having enough data to use a large number of bins, are not a problem.
9. Bootstrapped regressions
Bootstrap resampling also provides a great way of looking at uncertainty in regression estimates: fit a regression line to each resample of your data and plot all the lines together.
Here we take advantage of Seaborn's built-in regression plotting functionality to draw a series of bootstrapped regressions of CO on O3 for Denver in August, by passing a DataFrame with each bootstrap sample concatenated together.
To get separate lines, we tell the plotting function to treat each resampled dataset as a separate class with its own regression line.
One thing to be careful about when doing this is to turn off Seaborn's built-in confidence intervals, as we're already providing uncertainty ourselves.
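One way this could be done is with Seaborn's lmplot(), sketched below. The synthetic CO and O3 values, the column name boot_id, and the choice of 100 resamples are assumptions for the example; the course's own code and data may differ.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)

# Synthetic stand-in for the Denver August readings (not the course data)
n_obs = 40
o3 = rng.uniform(0.01, 0.06, size=n_obs)
co = 0.2 + 5.0 * o3 + rng.normal(scale=0.05, size=n_obs)
denver_aug = pd.DataFrame({"O3": o3, "CO": co})

n_boots = 100
resamples = []
for i in range(n_boots):
    # Resample whole rows with replacement so each (O3, CO) pair stays together
    resample = denver_aug.sample(n=n_obs, replace=True, random_state=i)
    # Tag the rows so the plot can tell the resamples apart
    resamples.append(resample.assign(boot_id=i))

# One long DataFrame holding every bootstrap sample
denver_boot = pd.concat(resamples, ignore_index=True)

g = sns.lmplot(
    data=denver_boot, x="O3", y="CO",
    hue="boot_id",     # treat each resample as its own class, with its own line
    ci=None,           # turn off Seaborn's built-in confidence intervals
    scatter=False,     # draw only the regression lines
    legend=False,      # a legend with one entry per resample is not useful
    line_kws={"color": "steelblue", "alpha": 0.2},
)
```

Drawing every line in the same semi-transparent color lets the overlap show where most of the fits agree.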
10. Bootstrapped regressions (plot)
The plot produced helps give the viewer a nice intuition about how variable the model fit is. For instance, if a few outliers are dragging the regression line up, these points will be left out of many of the resamples, and the lines fit to those resamples will be noticeably different from the others, indicating that skepticism about the fit may be warranted.
11. Displaying lots of bootstraps using beeswarms
Bootstrap samples are not limited to a single estimate either. You can compare multiple distributions using beeswarm plots. This code takes bootstrapped means for multiple cities and concatenates them together for plotting with Seaborn's swarmplot() function.
The only special thing to note about the swarmplot() call itself is that we need to take care to set all the colors to be the same to avoid any potential color-size biases in our comparisons.
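A minimal sketch of this kind of plot follows. The city names other than Denver, the synthetic NO2 readings, and the column names are made up for illustration, so the resulting picture says nothing about real pollution levels.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(7)

# Synthetic NO2 readings per city (hypothetical values, not the course data)
city_readings = {
    "Denver": rng.normal(30, 6, size=31),
    "City B": rng.normal(22, 5, size=31),
    "City C": rng.normal(18, 4, size=31),
}

n_boots = 200
frames = []
for city, readings in city_readings.items():
    # Bootstrapped means for this city
    means = [
        rng.choice(readings, size=readings.size, replace=True).mean()
        for _ in range(n_boots)
    ]
    frames.append(pd.DataFrame({"city": city, "mean_NO2": means}))

# Concatenate every city's bootstrap estimates into one long DataFrame
boot_df = pd.concat(frames, ignore_index=True)

fig, ax = plt.subplots()
# A single color for every point, so only position carries information
sns.swarmplot(data=boot_df, x="mean_NO2", y="city",
              color="coral", size=3, ax=ax)
# Call plt.show() to display the figure when running interactively
```

A small point size helps here: with a few hundred estimates per city, larger markers may not all fit within each swarm.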
12. Displaying lots of bootstraps using beeswarms (plot)
The resulting plot is clear and intuitive. These are bootstrap samples of the NO2 averages for different cities in August. We can be highly confident that the values for Denver on average are much higher than those of the other cities investigated.
13. Let's get (re)sampling
We've just covered a large number of uncertainty visualization techniques in a short time. Let's now make some bootstrap visualizations with our pollution data.