
Analyzing difference in means A/B tests

1. Analyzing difference in means A/B tests

In this video we will talk about analyzing the difference in means.

2. Framework for difference in means

Similar to testing for a difference in proportions, let's assume we already ran our power analysis to determine the sample size required to detect a certain effect in average time on the checkout page. We ran our A/B test and performed all the necessary sanity checks. In this case our null hypothesis is that there is no difference in mean time on page between the groups, and our alternative hypothesis is that there is a difference. To evaluate whether the difference in mean time on page between groups A and B is statistically significant, we perform a two-sample t-test for a difference in means. If the calculated p-value is lower than our chosen significance threshold, we reject the null hypothesis and conclude that the treatment effect is statistically significant. Otherwise, we have no evidence that the group means come from distributions with different mean time on page, and therefore cannot reject the null hypothesis.
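As a rough sketch of this decision rule, the snippet below runs a two-sided, unpaired two-sample t-test with scipy.stats.ttest_ind on placeholder data; the sample sizes, means, and spread are illustrative assumptions, not the experiment's actual results (the lesson itself uses Pingouin, shown next).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Placeholder samples standing in for time on page per user (seconds);
# the means, spread, and sample sizes here are illustrative assumptions.
time_a = rng.normal(loc=44.6, scale=10.0, size=1000)
time_b = rng.normal(loc=42.0, scale=10.0, size=1000)

alpha = 0.05  # significance threshold chosen before the experiment

# Two-sided, unpaired two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(time_a, time_b)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis.")
else:
    print(f"p = {p_value:.4f} >= {alpha}: cannot reject the null hypothesis.")
```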

3. Pingouin t-test

Looking at the means per variant, it seems that group A spent the longest time on page at 44-point-6 seconds, while groups B and C spent almost equal average times on page at 42 seconds. Starting with a single comparison between groups B and C, the Pingouin ttest function takes the time on page per user column for each variant as input. We keep the other input parameters at their default values, since we are doing a two-sided t-test to check for a difference in means in either direction and our data is not paired. Looking at the function's output, we can see that the p-value is just under the 5% significance threshold, but the 95% confidence interval of the difference shows that its lower limit is quite close to zero. Therefore, depending on whether we consider this difference of at most one second meaningful from a practical perspective, we can choose to deploy design B or C.
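A minimal sketch of that call, assuming a per-user DataFrame with hypothetical column names 'variant' and 'time_on_page' (the data below is placeholder data generated only to make the example runnable):

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Placeholder per-user data; the column names 'variant' and
# 'time_on_page' and the generated values are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'variant': np.repeat(['A', 'B', 'C'], 500),
    'time_on_page': np.concatenate([
        rng.normal(44.6, 10.0, 500),
        rng.normal(42.0, 10.0, 500),
        rng.normal(42.0, 10.0, 500),
    ]),
})

# Mean time on page per variant
print(df.groupby('variant')['time_on_page'].mean())

time_b = df.loc[df['variant'] == 'B', 'time_on_page']
time_c = df.loc[df['variant'] == 'C', 'time_on_page']

# Defaults paired=False and alternative='two-sided': an unpaired test
# for a difference in the means in either direction.
result = pg.ttest(time_b, time_c)
print(result[['T', 'dof', 'p-val', 'CI95%']])
```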

4. Pingouin pairwise

If we want to perform the full set of comparisons between the three designs, however, we need to leverage the Pingouin pairwise function, which performs t-tests with the option of correcting for multiple comparisons. The function's input arguments are the dependent variable, which is the metric of interest's column, the between parameter, which is the variant group identifier, and the padjust parameter, which adds the correction for multiple comparisons. We chose the Bonferroni correction as the conservative approach. Examining the results, we can see that the corrected p-value for the B versus C comparison is now higher than our significance threshold, suggesting no difference in average time on page, while group A's mean difference from either group B or C remained statistically significant. Depending on how we view the time on page metric, we can declare design A either the winner or the loser. Increased time on page could be a good thing, signaling interest in the case of an article, or a bad thing, signaling confusion and a delayed purchase decision for a checkout page design. This example demonstrates how practical significance, metric interpretation, and domain knowledge play a big role in making decisions with A/B test results.
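A minimal sketch of that call, reusing the placeholder DataFrame from the previous example (recent Pingouin releases name the function pairwise_tests; older ones expose it as pairwise_ttests):

```python
import pingouin as pg

# Pairwise t-tests across all variant pairs, with Bonferroni correction
# for multiple comparisons; `df` is the placeholder per-user DataFrame
# built in the previous sketch.
pairwise = pg.pairwise_tests(
    data=df,
    dv='time_on_page',    # dependent variable: the metric of interest
    between='variant',    # grouping column identifying the variant
    padjust='bonf',       # Bonferroni correction (conservative)
)

# 'p-unc' is the uncorrected p-value, 'p-corr' the corrected one
print(pairwise[['A', 'B', 'p-unc', 'p-corr', 'p-adjust']])
```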

5. Let's practice!

Let's practice.