
Blocking and randomization

1. Blocking and randomization

Now that we can recognize confounding, how do we deal with it?

2. Making comparisons

When we talk about drawing conclusions from our data, we usually mean making comparisons of one sort or another. Are these means different or not? Are these proportions different? In order to make a valid comparison, we need to compare like with like: if we are comparing two sets of samples that differ in our variable of interest, we should make sure that our variable of interest is the only difference between those samples. This means we should remove other sources of variation from our studies, making the variation of interest more clearly visible.

3. Random sampling

We also should not cherry-pick samples to assign to one treatment or another, as samples should be representative of the population. This applies both to planning experiments and to working with existing data. We can randomly sample from a DataFrame using pandas' .sample() method. First, we define a random seed, so that the result is reproducible. Then, we call .sample(), providing n, the number of rows to sample, and our seed as the random_state argument. This creates a random subset, which we can then use as needed, like here, where we pass a column from each of two sampled DataFrames to a t-test.
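As a rough sketch of the sampling step described above: the DataFrames, the column name yield_kg, and the generated values are all made up for illustration, and the independent t-test is assumed to be scipy's ttest_ind.

```python
import numpy as np
import pandas as pd
from scipy import stats

seed = 42  # fixed seed so the same rows are drawn on every run

# Hypothetical data: 50 yield measurements for each of two treatments
rng = np.random.default_rng(seed)
treatment_a = pd.DataFrame({"yield_kg": rng.normal(loc=20.0, scale=2.0, size=50)})
treatment_b = pd.DataFrame({"yield_kg": rng.normal(loc=21.0, scale=2.0, size=50)})

# Draw 10 random rows from each DataFrame
sample_a = treatment_a.sample(n=10, random_state=seed)
sample_b = treatment_b.sample(n=10, random_state=seed)

# Independent two-sample t-test on the sampled column
t_stat, p_value = stats.ttest_ind(sample_a["yield_kg"], sample_b["yield_kg"])
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

Because random_state is fixed, rerunning the script selects exactly the same rows, which is what makes the analysis replicable.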

4. Other sources of variation

While this is useful, we run into difficulty when confounding variables are present. Consider this example. We are comparing potato yields under two different fertilizers, A and B, in fields planted with two different potato varieties, Roosters and Records. As the varieties have different yields, simple random sampling could be misleading: we could end up with different proportions of high- and low-yielding varieties in the fertilizer A and B treatments, which would make variety a confounder.

5. Blocking

To deal with this, we can use blocking. This is a strategy where we control for the effect of a confounding variable by ensuring that it is balanced with respect to the other variable. In this case, we deal with the effect of variety by ensuring that equal proportions of each variety are present in the samples treated with each fertilizer.

6. Implementing a blocked design

Implementing this is simple. We randomly sample, as before, but now we create two blocks, corresponding to the two varieties that we are blocking for. To do this, we subset the DataFrame by variety and call .sample() on each subset. Then, we use pandas' concat function to stitch the sampled subsets together, creating a DataFrame of samples for the fertilizer A treatment containing equal numbers from both varieties.
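A minimal sketch of this blocked design might look as follows. The plots DataFrame and its columns are invented for illustration, and assigning the leftover plots to fertilizer B is an assumption added here to complete the example, not something the slide shows.

```python
import pandas as pd

seed = 42

# Hypothetical field plots: 20 of each variety
plots = pd.DataFrame({
    "plot_id": range(40),
    "variety": ["Rooster"] * 20 + ["Record"] * 20,
})

# One block per variety
roosters = plots[plots["variety"] == "Rooster"]
records = plots[plots["variety"] == "Record"]

# Sample the same number of plots from each block for fertilizer A
fert_a = pd.concat([
    roosters.sample(n=10, random_state=seed),
    records.sample(n=10, random_state=seed),
])

# Assumption: every plot not chosen for fertilizer A receives fertilizer B
fert_b = plots.drop(fert_a.index)

print(fert_a["variety"].value_counts())
print(fert_b["variety"].value_counts())
```

Both treatments end up with the same mix of varieties, so a difference in yield between them cannot be explained by variety alone.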

7. Paired samples

A special case of this is when we have a substantial amount of variation present at the individual sample level and want to control for the individual variation. If our samples are linked in this manner, we can make use of a paired test, like a paired t-test. Let's compare potato yields in the same five fields across two years, before and after adding a new fertilizer. As each sample from one year corresponds to one sample from the other year, we can use a paired t-test. With a paired test, we control for variation between fields. By removing this source of noise, we increase statistical power and are better able to detect a difference.

8. Implementing a paired t-test

Implementing this paired t-test is simple and is very similar to the independent t-tests we saw previously. The main difference is that our two arrays need to have the same length and the same order, so that each element of one array lines up with its partner in the other. Having imported stats from scipy, we give the two arrays to the ttest_rel function. The output from this function can be treated as a sequence of length 2, with the test statistic at index 0 and the p-value at index 1.
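To make this concrete, here is a small sketch with made-up yields for five fields. The numbers are hypothetical, and the independent test is included only to illustrate how pairing removes field-to-field variation.

```python
import numpy as np
from scipy import stats

# Hypothetical yields for the same five fields, listed in the same field order
yields_before = np.array([60.2, 55.1, 71.4, 48.9, 66.0])
yields_after = np.array([62.0, 56.8, 73.1, 50.2, 68.1])

# Paired t-test: works on the per-field differences
result = stats.ttest_rel(yields_before, yields_after)
t_stat, p_value = result  # statistic at index 0, p-value at index 1
print(f"paired:      t = {t_stat:.3f}, p = {p_value:.5f}")

# For comparison only: an independent test that ignores the pairing
ind_result = stats.ttest_ind(yields_before, yields_after)
print(f"independent: t = {ind_result[0]:.3f}, p = {ind_result[1]:.5f}")
```

With these numbers the fields differ far more from one another than they change between years, so the paired test yields a much smaller p-value than the independent test on the same data.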

9. Let's practice!

Enough chatting, now let's put it to practice!