
Sanity checks: external validity

1. Sanity checks: external validity

In addition to validating our internal setup, it is important to run external validations to confirm our ability to generalize the results of an AB test across a variety of user segments and over time.

2. Simpson's paradox

Starting with generalizing across user segments, another reason an unbiased randomization function and balanced distributions between the groups matter is to avoid falling into the trap of Simpson's paradox. This phenomenon can occur when the distributions across segments such as devices, countries, or browsers vary between the experiment's variants. Let's examine a dataset where we ran an AB test to compare conversion rates between groups A and B and also wanted to understand how the rates vary across devices, so we can tailor which designs to present on each device. Looking at the conversion rates by variant, it is clear that group A has the higher conversion rate. When we add the device grouping, however, the direction is reversed and group B's rates now seem higher!
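To make the reversal concrete, here is a minimal pandas sketch with hypothetical counts (the column names and numbers are illustrative, not the lesson's dataset), chosen so that group A wins overall while group B wins within every device segment.

```python
import pandas as pd

# Hypothetical enrollment and conversion counts; group A is skewed toward
# phone users, who convert at higher rates than desktop users.
data = pd.DataFrame({
    "group":       ["A", "A", "B", "B"],
    "device":      ["phone", "desktop", "phone", "desktop"],
    "users":       [800, 200, 200, 800],
    "conversions": [400, 20, 110, 120],
})

# Overall conversion rate by variant: A looks better (42% vs 23%)
overall = data.groupby("group")[["conversions", "users"]].sum()
print(overall["conversions"] / overall["users"])

# Conversion rate by variant and device: B is better in every segment
by_device = data.set_index(["device", "group"])
print((by_device["conversions"] / by_device["users"]).unstack("group"))
```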

3. Simpson's paradox

To understand what happened, let's look at the distribution of devices between the variants. Examining the counts, it looks like we enrolled more phone users in group A than in group B, and phone users generally have higher conversion rates. This invalidates the test results and forces us to rerun the test with balanced distributions.
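A quick way to surface this kind of imbalance is to cross-tabulate group against device and, optionally, run a chi-square test of independence (a common check, though not one the lesson prescribes). The data below is synthetic and only meant to mimic the skew described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic assignment log: group A is skewed toward phone users,
# group B toward desktop users (illustrative numbers only).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group":  ["A"] * 1000 + ["B"] * 1000,
    "device": list(rng.choice(["phone", "desktop"], size=1000, p=[0.8, 0.2]))
            + list(rng.choice(["phone", "desktop"], size=1000, p=[0.2, 0.8])),
})

# Enrollment counts and per-group shares make the skew obvious
counts = pd.crosstab(df["group"], df["device"])
print(counts)
print(pd.crosstab(df["group"], df["device"], normalize="index"))

# The chi-square test quantifies whether the device mix differs between
# variants by more than chance; a tiny p-value flags an imbalance.
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"p-value for group/device independence: {p_value:.4f}")
```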

4. Simpson's paradox

With a balanced distribution, we can see that group A's conversion rate is higher not only overall, but also within each device segment. This lets us trust our results and generalize them more confidently to the respective populations.

5. Novelty effect

Switching over to generalizing results over time, occasionally we may notice that users behave differently depending on how long they have been exposed to a particular version of the experiment or a new feature. In the early days of the test, some users may interact with the new feature or design out of curiosity rather than because of its usefulness. This may lead us to believe that the treatment is performing better than the control at first, but the improvement dies down over time. We refer to this phenomenon as the novelty effect. Similarly, the opposite may happen: returning users who are accustomed to the old features avoid the new ones entirely, leading us to believe that the treatment is performing poorly. This is referred to as change aversion.

6. Novelty effect visual inspection

To inspect this effect visually, let's look at an example of an AB test where we changed the design of a call-to-action button in a marketing campaign to something more attention-grabbing. After plotting the lift in click-through rate of the new design compared to the old, we see that the average rates were higher over the first few days of the test compared to the lower, more stable rate over the remaining period. This is a clear symptom of novelty effects that needs to be taken into account when analyzing the test results.
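A plot like the one described can be produced along these lines; the daily click-through rates below are simulated with an exponentially decaying bump to mimic a novelty effect, so the shape is illustrative rather than real experiment data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated daily click-through rates for the old and new button designs,
# with an early bump on the treatment side that decays over time.
rng = np.random.default_rng(0)
days = pd.date_range("2024-01-01", periods=28, freq="D")
ctr_old = 0.10 + rng.normal(0, 0.003, size=len(days))
novelty_bump = 0.04 * np.exp(-np.arange(len(days)) / 4)
ctr_new = 0.11 + novelty_bump + rng.normal(0, 0.003, size=len(days))

# Daily lift of the new design relative to the old one, in percent
lift_pct = (ctr_new - ctr_old) / ctr_old * 100

plt.plot(days, lift_pct, marker="o")
plt.axhline(lift_pct[-14:].mean(), linestyle="--",
            label="average lift over the stable period")
plt.xlabel("Day of test")
plt.ylabel("Lift in click-through rate (%)")
plt.legend()
plt.show()
```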

7. Correcting for novelty effects

There are several ways we can correct for novelty effects when we detect them in our test. The first is to increase the test duration until the metrics converge to stable levels. Additionally, we can examine the results of new and returning user cohorts separately. If novelty effects exist, we can expect new users, who are seeing either the control or treatment variant for the first time, to have the cleanest data, whereas returning users may exhibit novelty effects or change aversion as they are inherently comparing their default experience to the new one.
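As a rough illustration of the cohort check, the sketch below compares the treatment lift for new versus returning users using made-up aggregates; a lift that holds for new users but shrinks or reverses for returning users points to novelty effects or change aversion.

```python
import pandas as pd

# Hypothetical aggregates per variant and cohort (illustrative numbers only)
results = pd.DataFrame({
    "group":  ["A", "B", "A", "B"],
    "cohort": ["new", "new", "returning", "returning"],
    "users":  [5_000, 5_000, 20_000, 20_000],
    "clicks": [550, 600, 2_200, 2_150],
})

# Click-through rate per variant within each cohort
results["ctr"] = results["clicks"] / results["users"]
by_cohort = results.pivot(index="cohort", columns="group", values="ctr")

# Relative lift of treatment (B) over control (A) for each cohort
by_cohort["lift_pct"] = (by_cohort["B"] / by_cohort["A"] - 1) * 100
print(by_cohort)
```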

8. Let's practice!

Let's look at some exercises.