Experimental design: power analysis

1. Experimental design: power analysis

Now that we've explored the parameters needed to set up an A/B test, let's dive into the power analysis.

2. Effect size

For testing differences in means, once we have selected a suitable minimum detectable effect, we convert it into a standardized effect size known as Cohen's d, defined as the difference between the two means divided by the standard deviation. For differences in proportions, a common effect size is Cohen's h, calculated using the formula shown, h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)), where p1 and p2 are the two groups' proportions. A general rule of thumb is that 0.2 corresponds to a small effect, 0.5 to a medium effect, and 0.8 to a large one. That said, these levels are subjective, and the actual evaluation will depend on the problem at hand. To calculate Cohen's h in Python, we import the proportion_effectsize function from statsmodels and pass the two groups' proportions as arguments. Using the purchase rates of groups A and B in the checkout dataset as an example gives a standardized effect size of 0.07.
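
As a minimal sketch of that calculation: the purchase rates below are hypothetical placeholders rather than the checkout dataset's actual rates, so the resulting h will not necessarily match the 0.07 quoted above. The by-hand line is only there to show that proportion_effectsize applies the arcsine formula.

```python
import math

from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical purchase rates, for illustration only
# (not taken from the checkout dataset).
p_A = 0.82  # control group
p_B = 0.85  # treatment group

# Cohen's h as computed by statsmodels (first proportion minus second).
cohens_h = proportion_effectsize(p_B, p_A)

# The same quantity from the arcsine formula.
manual_h = 2 * math.asin(math.sqrt(p_B)) - 2 * math.asin(math.sqrt(p_A))

print(f"Cohen's h (statsmodels): {cohens_h:.3f}")
print(f"Cohen's h (by hand):     {manual_h:.3f}")
```

The order of the arguments only changes the sign of h. The magnitude is what feeds into the power calculation.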

3. Sample size estimation for proportions

There are several formulas for calculating the required sample size, but thankfully the statsmodels power module does the heavy lifting for us. The solve_power method of TTestIndPower takes the power level, alpha, and standardized effect size as arguments and calculates the sample size per group. Note that TTestIndPower runs power calculations for a t-test on two independent samples. Here the samples are the control group users and the users we plan to expose to the new design B, since we assume those users and their actions to be independent. Using the conventional values of 80 percent power and 5 percent alpha, along with the effect size between groups A and B in the checkout dataset, gives a required sample size of approximately 3000 users in each group.
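
A short sketch of that call, assuming the rounded effect size of 0.07 mentioned earlier. Because that value is rounded, the result should only be expected to land in the neighborhood of the figure quoted above, not to reproduce it exactly.

```python
import math

from statsmodels.stats.power import TTestIndPower

effect_size = 0.07  # rounded Cohen's h from the text (assumption: used as-is)
power = 0.80        # desired probability of detecting the effect
alpha = 0.05        # significance level

# Leaving nobs1 unset tells solve_power to solve for the size of group 1;
# with the default ratio=1.0 the second group has the same size.
n_raw = TTestIndPower().solve_power(
    effect_size=effect_size, power=power, alpha=alpha, nobs1=None
)

# Round up, since we cannot enroll a fraction of a user.
n_per_group = math.ceil(n_raw)
print(f"Required sample size per group: {n_per_group}")
```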

4. Effect of sample size and MDE on power

Let's examine the impact of sample size and effect size on the power of the test. We first import the TTestIndPower class from the statsmodels power module, which handles t-tests with independent samples. We then pass an array of sample sizes and an array of effect sizes to the plot_power method, leaving alpha at its default of 5%. As the plot shows, at the same 5 percent significance level, a higher-powered test requires a larger sample, and a larger sample is required to capture smaller effects. For example, to reach the 80 percent power dotted line while detecting a minimum standardized effect of 0.5, we would need a sample of approximately 70 users per group. But to detect a larger effect of 0.8, we would only need around 25 users per group.
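
The plot described above could be produced roughly as follows. The sample-size range and the three effect sizes are assumptions chosen for illustration, and the Agg backend line is only there so the sketch runs without a display; drop it in a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; remove when plotting interactively

import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Assumed grid: group sizes from 5 to 100 and three illustrative effect sizes.
sample_sizes = np.arange(5, 101)
effect_sizes = np.array([0.2, 0.5, 0.8])

# One power curve per effect size, with sample size on the x-axis.
fig = analysis.plot_power(
    dep_var="nobs", nobs=sample_sizes, effect_size=effect_sizes, alpha=0.05
)

# Instead of reading values off the plot, solve for them directly.
n_medium = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
n_large = analysis.solve_power(effect_size=0.8, power=0.8, alpha=0.05)
print(f"n per group for d = 0.5: {n_medium:.1f}")
print(f"n per group for d = 0.8: {n_large:.1f}")
```

Solving directly should give values in the mid-60s and mid-20s per group, in line with the approximate readings taken from the plot. Call fig.savefig with a file name of your choice to inspect the curves.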

5. Sample size estimation for means

Estimating the sample size for a difference in means follows the same steps as for proportions; the only difference is the standardized effect size, which is now the Cohen's d we introduced earlier. We start by calculating the baseline mean order value from group A's data. We then use the .std() method to calculate the sample standard deviation and assume the same value for the treatment group. After that, we define a desired new mean of 26 dollars and calculate Cohen's d as the difference between the two means divided by the standard deviation.

6. Sample size estimation for means

We again use 80 percent power and 5 percent alpha, along with the calculated effect size, which yields a sample size of approximately 86 users to enroll in each group.
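
The steps for means can be sketched end to end as below. The order values are made-up placeholder data standing in for group A of the checkout dataset, so the resulting sample size will differ from the figure quoted above; only the 26-dollar target mean comes from the text.

```python
import math

import pandas as pd
from statsmodels.stats.power import TTestIndPower

# Hypothetical order values for group A (placeholder data, in dollars).
order_values_A = pd.Series(
    [18.5, 31.0, 22.4, 27.9, 15.2, 35.6, 24.1, 20.3, 29.8, 25.2]
)

baseline_mean = order_values_A.mean()
# Sample standard deviation (pandas uses ddof=1 by default);
# the treatment group is assumed to share this value.
baseline_std = order_values_A.std()

target_mean = 26.0  # desired new mean order value from the text

# Cohen's d: difference in means divided by the standard deviation.
cohens_d = (target_mean - baseline_mean) / baseline_std

n_raw = TTestIndPower().solve_power(
    effect_size=cohens_d, power=0.80, alpha=0.05, nobs1=None
)
n_per_group = math.ceil(n_raw)  # round up to whole users

print(f"Cohen's d: {cohens_d:.3f}")
print(f"Required sample size per group: {n_per_group}")
```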

7. Let's practice!

Let's practice what we've learned.