1. Sanity checks: Internal validity
Let's talk about sanity checks for our internal test setup.
2. Sample Ratio Mismatch (SRM)
Starting with sample ratio mismatch. This happens when the observed ratio of traffic allocated between variants deviates from the test design. For example, if we design for a 50/50 allocation between the treatment and control groups and instead get a 51/49 split, this could be a sign of a bug in traffic allocation that needs to be examined further.
To statistically confirm that the observed ratio is not something that happened due to chance, we run a chi-square goodness-of-fit test, which follows the formula shown below. The null hypothesis states that the allocated traffic matches the experiment design, which is 50/50 in this case. If the p-value is lower than a strict significance threshold, 1% for instance, then we reject the null hypothesis and conclude that the allocation deviates from the design, so if we continue running the experiment the results may be invalid due to a bug.
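For reference, the statistic used here is the standard chi-square goodness-of-fit formula, where O_i is the observed user count in variant i and E_i is the count expected under the designed allocation:

\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}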
3. SRM python example
To demonstrate this in Python, we use the AdSmart dataset and count the unique ids per experiment group using the nunique method. We then calculate the allocation percentages per group and check whether they match the design allocation.
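A minimal sketch of this step, assuming the AdSmart data sits in a CSV with an 'auction_id' user identifier and an 'experiment' group column (the file name and column names are assumptions and may differ from your copy):

```python
import pandas as pd

# Load the AdSmart data (file name is an assumption)
adsmart = pd.read_csv("AdSmartABdata.csv")

# Count unique users per experiment group
counts = adsmart.groupby("experiment")["auction_id"].nunique()
print(counts)

# Allocation percentages per group, to compare against the 50/50 design
allocation = counts / counts.sum()
print(allocation.round(3))
```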
4. SRM python example
To examine the results statistically, we import the chisquare function from the SciPy library and pass it the observed and expected allocation lists. The p-value comes out above the strict 1% significance threshold, so we fail to reject the null hypothesis and conclude that the allocation is consistent with the design.
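A minimal sketch of the test itself; the observed counts below are placeholders for illustration, not the actual AdSmart numbers:

```python
from scipy.stats import chisquare

# Observed unique-user counts per group (placeholder values)
observed = [4006, 4071]

# Expected counts under the designed 50/50 split
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.3f}, p-value = {p_value:.3f}")

# With a strict 1% threshold, a p-value above 0.01 means we do not
# reject the null hypothesis that the allocation matches the design
alpha = 0.01
print("SRM detected" if p_value < alpha else "Allocation consistent with design")
```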
5. SRM root-causing
Finding the root cause of the bug is more involved, but some of the common causes are: assignment errors, where incorrect randomization functions cause allocation imbalances; execution delays, where variants start at different times or allocation percentages are ramped up unevenly; data logging issues or bot filtering that doesn't affect the variants evenly; or interference from the experimenters themselves, for example pausing enrollment in one variant to fix an issue.
6. A/A tests
Another great way to check the validity of the internal testing setup is to run A/A tests before the actual experiment. These tests are set up the same way as A/B tests, except that we present both groups of users with the same experience. If we have any of the previously mentioned bugs, we can expect to see symptoms in the A/A test results. The first thing we can examine is differences between the metrics of interest across the two groups. Since both groups get the same experience and our tools are assumed to perform proper randomization, any statistically significant difference in metrics can be attributed to a bug in the experiment setup. Keep in mind, however, that there is always a chance of a false positive: 5% of the time for an alpha of 5%. Another thing we can confirm is an even distribution of all attributes, such as devices and browsers, across groups. If, for instance, people using smartphones are more likely to purchase than those on PC and the experiment setup enrolls more smartphone users in group A, then the purchase rate in that group will be biased upward, and that difference would be mistakenly attributed to group A's design.
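As a rough illustration of the metric comparison in an A/A test, the sketch below runs a two-proportion z-test from statsmodels on hypothetical purchase counts; the metric, the numbers, and the variable names are illustrative assumptions, not values from the course datasets:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Both groups saw the same experience, so purchase rates should not differ significantly
purchases = np.array([310, 295])   # purchasers in group A and group B (hypothetical)
users = np.array([5000, 5020])     # enrolled users per group (hypothetical)

stat, p_value = proportions_ztest(count=purchases, nobs=users)
print(f"z = {stat:.3f}, p-value = {p_value:.3f}")

# With alpha = 0.05 we still expect a false positive about 5% of the time,
# so a single significant A/A result is a flag to investigate, not proof of a bug
alpha = 0.05
print("Investigate setup" if p_value < alpha else "No metric difference detected")
```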
7. Distributions balance Python example
Using pandas' groupby and value_counts methods, we can examine the browser distributions in two different tests, as sketched below. Notice how evenly balanced the values are for the checkout dataset, which is a sign of a valid randomization algorithm and a clean test, as opposed to the major imbalances in the AdSmart dataset, which could be a symptom of improper randomization functions or many of the previously discussed bugs.
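A sketch of this distribution check; the file names and the group and 'browser' column names are assumptions and may differ from the actual datasets:

```python
import pandas as pd

checkout = pd.read_csv("checkout.csv")
adsmart = pd.read_csv("AdSmartABdata.csv")

# Browser counts within each experiment group
print(checkout.groupby("checkout_page")["browser"].value_counts())
print(adsmart.groupby("experiment")["browser"].value_counts())

# Normalizing the counts makes imbalances between groups easier to spot
print(adsmart.groupby("experiment")["browser"].value_counts(normalize=True).round(3))
```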
8. Let's practice!
Let's put what we've learned to practice.