1. Linear regressions and pairs bootstrap
Sometimes variables are related to each other and a linear relationship is appropriate for the data. In these cases, a linear regression is useful to quantify the relationship between two related variables.
2. Bacterial growth
To practice performing linear regressions, we will consider another dataset from the biological sciences at Caltech, this time from the lab of Michael Elowitz.
Here is a movie of two bacteria of the species *Bacillus subtilis* growing and dividing into a small colony. These bacteria were engineered to have fluorescent proteins in them, which is why they glow, enabling us to see them clearly.
3. Bacterial growth
If I plot the total area of bacteria in the image over time, we see this beautiful growth curve. This is clearly not a linear curve, though.
4. Bacterial growth
However, if I instead plot the logarithm of the bacterial area versus time, the curve is linear. This is accomplished using the `plt.semilogy()` function, which works just like `plt.plot()`, but with the y-axis on a logarithmic scale.
The slope of the growth curve on a semilog plot is the growth rate of the bacteria.
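To see why the slope is the growth rate, here is a short derivation under the idealized assumption that the area grows exponentially. The symbols $A_0$ (initial area) and $r$ (growth rate) are introduced here for illustration and are not from the dataset.

$$
A(t) = A_0\, e^{r t}
\quad\Longrightarrow\quad
\ln A(t) = \ln A_0 + r\, t
$$

So the natural logarithm of the area is a straight line in time with slope $r$ and intercept $\ln A_0$. Note that if a base-10 logarithm were used instead, the slope would be $r / \ln 10$.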
5. Linear regression with np.polyfit()
In the second Statistical Thinking course, you learned how to perform a linear regression using `np.polyfit()`. The first two arguments are the x and y values you want to fit with a line. The third argument is the degree of the polynomial, which is always one for a linear regression because a line is a polynomial of degree one. The function returns the coefficients with the highest power first, so for degree one you get the slope followed by the intercept of the best fit line.
You can then generate points to use to plot a theoretical line, and can finally put it all together on a plot.
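The steps just described can be sketched roughly as follows. The data here are made up for illustration (an assumed starting area of 5, an assumed rate of 0.25, and an assumed time grid from 0 to 14); they are not the Elowitz lab measurements.

```python
import numpy as np

# Hypothetical stand-in data: time points and bacterial area.
# These values are assumptions for illustration only.
t = np.linspace(0, 14, 15)
bac_area = 5.0 * np.exp(0.25 * t)

# Degree-one polynomial fit: returns slope first, then intercept.
slope, intercept = np.polyfit(t, bac_area, 1)

# Two points are enough to draw a straight theoretical line.
t_theor = np.array([t.min(), t.max()])
bac_area_theor = slope * t_theor + intercept
```

The data and the two theoretical points could then be drawn on the same axes with `plt.plot()`.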
6. Regression of bacterial growth
The result is... Wait a minute. That doesn't look right. The problem is that we need to perform the regression using the *logarithm* of the bacterial area. Let's try that again.
7. Semilog-linear regression with np.polyfit()
We now use `np.log()` to pass the logarithm of the bacterial area into `np.polyfit()`. When computing the theoretical area, we exponentiate the theoretical curve using `np.exp()` to recover the area in square micrometers. Finally, we make our plots using `plt.semilogy()`.
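A minimal sketch of this log-transformed regression, again on made-up, noise-free data (the starting area of 5 and rate of 0.25 are assumptions, not measured values):

```python
import numpy as np

# Hypothetical stand-in data with exactly exponential growth.
t = np.linspace(0, 14, 15)
bac_area = 5.0 * np.exp(0.25 * t)

# Regress the natural log of the area against time.
log_bac_area = np.log(bac_area)
growth_rate, log_a0 = np.polyfit(t, log_bac_area, 1)

# Exponentiate the fitted line to get back to units of area.
t_theor = np.array([t.min(), t.max()])
bac_area_theor = np.exp(growth_rate * t_theor + log_a0)
```

Because the synthetic data are exactly exponential, the fitted slope recovers the assumed rate; real data will scatter around the line. Both the data and the theoretical curve can then be drawn with `plt.semilogy()`.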
8. Regression of bacterial growth
That is much better.
You will use this technique to get the growth rate for *Bacillus* under these conditions in the exercises.
Now, you might already be thinking of the next question. If we did this experiment again, how might the growth rate we obtain change? Or more specifically, what is the 95% confidence interval of the growth rate?
9. Pairs bootstrap
Pairs bootstrap is an approach to compute confidence intervals for regression parameters. Instead of resampling a single dataset, as we did before, we resample *pairs* of data. In this case, we randomly select a time point together with its corresponding bacterial area, store the pair, then select another pair, and so on, sampling with replacement until we have as many pairs as in the original dataset. We then compute the slope and intercept from the resampled data to get a pairs bootstrap replicate. We do this over and over again and then compute the confidence interval from percentiles of the replicates.
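To make the resampling step concrete, here is a sketch of drawing a single pairs bootstrap replicate. The data are synthetic (assumed rate 0.25, assumed noise level 0.05 on the log scale), and resampling indices is one common way to keep each time point attached to its area.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in data: log area grows linearly in time, plus noise.
t = np.linspace(0, 14, 15)
log_area = np.log(5.0) + 0.25 * t + rng.normal(0, 0.05, size=t.size)

# Resample indices with replacement so each t stays paired with its area.
inds = np.arange(len(t))
bs_inds = rng.choice(inds, size=len(inds), replace=True)
bs_t, bs_log_area = t[bs_inds], log_area[bs_inds]

# One pairs bootstrap replicate of the slope and intercept.
bs_slope, bs_intercept = np.polyfit(bs_t, bs_log_area, 1)
```

Repeating the last two steps many times yields the collection of replicates from which percentiles are taken.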
10. Pairs bootstrap
You wrote a function called `draw_bs_pairs_linreg()` to do this in Statistical Thinking Part II, and it is also implemented in the `dc_stat_think` module. Given x and y data, as well as the number of replicates you want via the `size` keyword argument, it returns pairs bootstrap replicates of the slope and intercept. You can then compute the confidence interval from the replicates using the `np.percentile()` function.
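For readers without the module at hand, the following is a rough sketch of what such a function might look like, followed by the percentile step. The name `pairs_bootstrap_linreg` and its `rng` parameter are hypothetical and this is not the `dc_stat_think` source; the data are synthetic.

```python
import numpy as np

def pairs_bootstrap_linreg(x, y, size=1, rng=None):
    """Hypothetical helper: pairs bootstrap replicates of slope and intercept."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x), np.asarray(y)
    inds = np.arange(len(x))
    slope_reps = np.empty(size)
    intercept_reps = np.empty(size)
    for i in range(size):
        # Resample indices with replacement to keep (x, y) pairs together.
        bs_inds = rng.choice(inds, size=len(inds), replace=True)
        slope_reps[i], intercept_reps[i] = np.polyfit(x[bs_inds], y[bs_inds], 1)
    return slope_reps, intercept_reps

# Synthetic data (assumed rate 0.25, assumed noise 0.05 on the log scale).
rng = np.random.default_rng(0)
t = np.linspace(0, 14, 15)
log_area = np.log(5.0) + 0.25 * t + rng.normal(0, 0.05, size=t.size)

slope_reps, intercept_reps = pairs_bootstrap_linreg(t, log_area, size=1000, rng=rng)

# 95% confidence interval of the growth rate from the 2.5th and 97.5th percentiles.
growth_rate_ci = np.percentile(slope_reps, [2.5, 97.5])
```

Passing a seeded generator makes the replicates reproducible, which is a design choice of this sketch rather than something the course function is known to offer.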
11. Let's practice!
Now it is your turn to quantify the growth rate of *Bacillus subtilis*.