
Bootstrap confidence intervals

1. Bootstrap confidence intervals

2. EDA is the first step

You have now used graphical exploratory data analysis, or EDA, to investigate the active bouts of the zebrafish. Let me remind you of one of my favorite quotes from John Tukey: "Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone--as the first step." In this course, and throughout your data science endeavors in general, it is important to heed Tukey's advice and start with EDA. Now that we have done some EDA, let's start progressing toward the whole story.

3. Active bout length ECDFs

We saw in the previous exercises that the active bout lengths are roughly Exponentially distributed. The Exponential distribution has a single parameter that describes the characteristic time between arrivals of a Poisson process.

4. Optimal parameter value

The value of that parameter that best describes the data is computed from the mean of all of the active bout lengths. Thus, the mean computed from the data is the optimal parameter value.
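One way to see why the mean is the optimal value is a short maximum likelihood calculation. This is a sketch under the assumption that "best describes the data" means "maximizes the likelihood"; the symbols $\tau$ (the characteristic time), $t_i$ (the measured bout lengths), and $n$ (their count) are introduced here for illustration.

```latex
% Exponential density with characteristic time \tau
f(t;\tau) = \frac{1}{\tau}\, e^{-t/\tau}

% Log-likelihood of n independent measurements t_1, \dots, t_n
\ell(\tau) = -n \ln \tau - \frac{1}{\tau} \sum_{i=1}^{n} t_i

% Setting the derivative to zero gives the sample mean
\frac{d\ell}{d\tau} = -\frac{n}{\tau} + \frac{1}{\tau^{2}} \sum_{i=1}^{n} t_i = 0
\quad\Longrightarrow\quad
\hat{\tau} = \frac{1}{n} \sum_{i=1}^{n} t_i
```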

5. Optimal parameter estimation

Let's look at how this is done with the nuclear incident data. We can use the `np.mean()` function to compute the mean of all inter-incident times, which is 87 days, indicated by the vertical gray line on the plot. But how confident are we in this value? What if we could somehow measure a collection of inter-incident times again? What would we get for the mean?
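As a minimal sketch of this step, the snippet below computes the optimal parameter with `np.mean()`. The real inter-incident times are not part of this transcript, so `inter_times` is a synthetic stand-in drawn from an Exponential distribution; the seed, sample size, and scale are assumptions chosen only for illustration.

```python
import numpy as np

# ASSUMPTION: synthetic stand-in for the inter-incident times (in days).
# This is NOT the nuclear incident dataset used in the lesson.
rng = np.random.default_rng(seed=42)
inter_times = rng.exponential(scale=87.0, size=200)

# For an Exponential model, the optimal parameter is the sample mean
mean_time = np.mean(inter_times)

print(f"Estimated characteristic time: {mean_time:.1f} days")
```

With the real data in place of the synthetic array, the same call would give the mean inter-incident time discussed above.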

6. Bootstrap sample

We can simulate this by drawing a bootstrap sample. Specifically, we resample the data with replacement using the `np.random.choice()` function.
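A single bootstrap sample might be drawn as in the sketch below. The array `data` is again a synthetic placeholder (an assumption, not the lesson's dataset). Note that `np.random.choice()` samples with replacement by default, so some measurements appear more than once and others not at all.

```python
import numpy as np

np.random.seed(42)

# ASSUMPTION: synthetic placeholder for the measured inter-incident times
data = np.random.exponential(scale=87.0, size=200)

# Resample with replacement (replace=True is the default),
# keeping the same number of points as the original dataset
bs_sample = np.random.choice(data, size=len(data))

print("mean of original data:   ", np.mean(data))
print("mean of bootstrap sample:", np.mean(bs_sample))
```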

7. Bootstrap replicates

We can plot the ECDF of the resampled data, along with the mean inter-incident time computed from this resampled dataset. We get a slightly different value than we got from the original data.

8. Bootstrap replicates

We can do this procedure again.

9. Bootstrap replicates

and again.

10. Bootstrap replicates

and again and again and again and again.

11. Bootstrap replicates

Each value of the mean inter-incident time is a bootstrap replicate, which is generally a statistic computed from a resampled dataset. In this case, that statistic is the mean.

12. dcst.draw_bs_reps()

The `dc_stat_think` module has a function to draw bootstrap replicates from a dataset. For example, you can use it to draw ten thousand replicates of the mean from a dataset.
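The transcript does not show the function's signature or internals, so the sketch below is a plain-NumPy stand-in rather than the module's actual code. The name `draw_bs_reps_sketch` and its parameters are hypothetical, and the data are synthetic. It illustrates the idea of repeating the resample-and-compute step many times for any statistic.

```python
import numpy as np


def draw_bs_reps_sketch(data, func, size=1):
    """Hypothetical helper: draw `size` bootstrap replicates of `func`.

    Not the dc_stat_think implementation; an illustration only.
    """
    data = np.asarray(data)
    bs_reps = np.empty(size)
    for i in range(size):
        # Resample with replacement, then compute the statistic
        bs_sample = np.random.choice(data, size=len(data))
        bs_reps[i] = func(bs_sample)
    return bs_reps


np.random.seed(42)

# ASSUMPTION: synthetic placeholder for the inter-incident times (days)
inter_times = np.random.exponential(scale=87.0, size=200)

# Ten thousand replicates of the mean, and of the median for comparison
bs_reps_mean = draw_bs_reps_sketch(inter_times, np.mean, size=10_000)
bs_reps_median = draw_bs_reps_sketch(inter_times, np.median, size=10_000)
```

Passing the statistic as a function keeps the helper general: the mean is used in this lesson, but any function that maps an array to a single number would work the same way.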

13. The bootstrap confidence interval

In looking at the plot of the replicates, shown by the vertical gray lines, we see that the replicates lie somewhere between about 70 and 100 days. This is roughly the bootstrap **confidence interval** of the mean inter-incident time.

14. The bootstrap confidence interval

Generally, a p-percent confidence interval can be defined as follows. If we repeated measurements over and over again, p% of the observed values would lie within the p% confidence interval. Because the bootstrap replicates are simulating measurements over and over again, we can simply take percentiles of the bootstrap replicates to compute the confidence interval. For the 95% confidence interval, we compute the 2.5th and 97.5th percentiles.

15. The bootstrap confidence interval

We can do that using NumPy's `percentile()` function. The first argument is an array containing the bootstrap replicates, and the second is a list or tuple with the desired percentiles. We get a 95% confidence interval that spans from 73 to 102 days.
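Put together, the percentile step could look like the sketch below. The data are synthetic (an assumption), so the printed interval will not match the 73 to 102 days quoted for the real dataset; only the pattern of the `np.percentile()` call is the point.

```python
import numpy as np

np.random.seed(1)

# ASSUMPTION: synthetic placeholder for the inter-incident times (days)
data = np.random.exponential(scale=87.0, size=200)

# Bootstrap replicates of the mean, built with np.random.choice()
bs_reps = np.array(
    [np.mean(np.random.choice(data, size=len(data))) for _ in range(10_000)]
)

# First argument: the replicates; second: the desired percentiles
conf_int = np.percentile(bs_reps, [2.5, 97.5])

print("95% confidence interval:", conf_int)
```

For a different confidence level, only the percentile list changes; for example, `[0.5, 99.5]` would give a 99% interval.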

16. Let's practice!

Now that you are refamiliarized with computing optimal parameters and obtaining bootstrap confidence intervals, you can quantify active bout lengths of wild type and mutant fish.