1. Experimental data setup
We've seen that randomization is often the best technique for setting up experimental data, but it isn't always the right choice on its own.
2. The problem with randomization
There are several scenarios where pure randomization can lead to undesirable outcomes.
First, pure randomization can produce uneven numbers of subjects in the different groups, a problem that is more common in smaller experiments.
3. The problem with randomization
Covariates are variables that potentially affect experiment results but aren't the primary focus. If covariates are highly variable or not equally distributed among groups, randomization might not produce balanced groups. This imbalance can lead to biased results.
Overall, these issues make it harder to detect an effect from a treatment, since they, rather than the treatment, may be driving an observed change.
4. Block randomization
A solution to our uneven-groups problem is block randomization. This involves first splitting subjects into blocks of size n, then randomly assigning subjects to groups within each block.
This is what it looks like. Subjects are first split into two blocks, then within each block randomly assigned to Treatment (orange) or control (white).
This fixes the uneven issue, and the smaller blocks give us more control over the allocation.
5. Our dataset
Let's give block randomization a go on a dataset of 1000 members from an e-commerce site that contains variables for
their average basket size in dollars, the average time spent on the website each day, and whether they are a power user. Power users spend an average of 40+ minutes on the website each day. There are 100 power users in these 1000 subjects.
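The dataset itself isn't shown here, so as a rough sketch we can simulate one with the same shape. The column names (basket_size, time_on_site, power_user) and the specific distributions are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n, n_power = 1000, 100  # 1000 members, 100 of whom are power users
power_user = np.zeros(n, dtype=bool)
power_user[:n_power] = True

df = pd.DataFrame({
    "user_id": np.arange(n),
    # Power users spend 40+ minutes a day on the site; others spend less
    "time_on_site": np.where(power_user,
                             rng.uniform(40, 90, n),
                             rng.uniform(5, 40, n)),
    # Assume power users also tend to have larger baskets (in dollars)
    "basket_size": np.where(power_user,
                            rng.normal(60, 10, n),
                            rng.normal(35, 10, n)),
    "power_user": power_user,
})

print(df["power_user"].sum())  # 100 power users out of 1000
```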
6. Block randomization in Python
We can use pandas' sample method to randomly assign subjects into two blocks. A block column has also been added to both DataFrames for convenience.
This produces even block sizes, fixing the uneven issue, but let's check for covariates.
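A minimal sketch of this step, assuming a DataFrame df of 1000 subjects (the column names here are placeholders):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"user_id": np.arange(1000),
                   "basket_size": rng.normal(40, 10, 1000)})

# Randomly sample half of the subjects into the first block
block_1 = df.sample(frac=0.5, random_state=0)
block_1["block"] = 1

# Everyone not sampled goes into the second block
block_2 = df.drop(block_1.index)
block_2["block"] = 2

print(len(block_1), len(block_2))  # 500 500
```

Because block_2 is built by dropping block_1's rows, the two blocks are guaranteed to be disjoint and to cover every subject, which is what fixes the uneven-groups issue.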
7. Visualizing splits
A nice way of checking for potential covariate issues is with visualizations. We can use seaborn's displot function to produce a KDE (kernel density estimate) plot to visualize the distribution of the basket size, split by whether the user is a power user.
There is quite a difference in the group distributions.
It seems like the power_user variable could have an effect on basket size. When an effect could be because of a variable rather than the treatment, this is often called confounding.
The covariate issue can be solved with stratified randomization.
8. Stratified randomization
Stratified randomization involves splitting based on a potentially confounding variable first, followed by randomization.
This is what it may look like.
Firstly, we split into two blocks (sometimes called strata) of power users, in green, and non-power users, in yellow.
Then, inside the groups, randomly allocating to treatment or control.
This fixes the covariate imbalance, and can even be done for multiple covariates, though managing more strata does increase complexity.
9. Our first stratum
Let's stratify our power users. We separate them out first and label the block.
We then sample half the power users to be in Treatment. The T_C column notes this status.
We then place the remaining into control by dropping the subjects in the treatment group.
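These three steps can be sketched as follows, again on simulated data; the column names block and T_C follow the transcript, while the rest is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"user_id": np.arange(1000),
                   "power_user": np.repeat([True, False], [100, 900])})

# Separate out the power users and label the block
strata_1 = df[df["power_user"]].copy()
strata_1["block"] = 1

# Sample half of the power users into Treatment
strata_1_treat = strata_1.sample(frac=0.5, random_state=2)
strata_1_treat["T_C"] = "Treatment"

# The remaining power users become the control group
strata_1_control = strata_1.drop(strata_1_treat.index)
strata_1_control["T_C"] = "Control"

print(len(strata_1_treat), len(strata_1_control))  # 50 50
```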
10. The second stratum
For our other stratum, we separate out the non-power users first and label the block differently.
The rest of the code is the same as before. We allocate half to treatment and half to control, using the same column names.
11. Confirming stratification
Let's bring our work together by first concatenating the strata and groups.
We can confirm our work using groupby, chaining on the .size() method. This shows the number of users in each block by their treatment or control status.
We can see two blocks: one with all 100 power users and another with the other 900 users, split evenly into treatment and control groups.
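A compact sketch of the whole confirmation, condensing both strata into a loop over blocks (the toy data and loop structure are assumptions; the groupby/.size() check at the end matches the approach described above):

```python
import numpy as np
import pandas as pd

# Toy stratified setup: block 1 = 100 power users, block 2 = 900 others
df = pd.DataFrame({"power_user": np.repeat([True, False], [100, 900])})
df["block"] = np.where(df["power_user"], 1, 2)

frames = []
for block, strata in df.groupby("block"):
    # Within each stratum, randomly allocate half to Treatment
    treat = strata.sample(frac=0.5, random_state=3).assign(T_C="Treatment")
    # The remaining subjects in the stratum go to Control
    control = strata.drop(treat.index).assign(T_C="Control")
    frames.extend([treat, control])

combined = pd.concat(frames)

# Count users per block and treatment/control status
counts = combined.groupby(["block", "T_C"]).size()
print(counts)
```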
12. Let's practice!
Let's practice using and assessing block and stratified randomization.