1. A/B Testing best practices and advanced topics intro
It's time to talk about best practices and introduce some advanced topics.
2. Best practices
One of A/B testing's best practices is to avoid peeking at the results and drawing conclusions as soon as we notice one statistically significant instance. In other words, if an experiment is designed to last a week and we notice a statistically significant difference on day two, we need to be very careful about stopping the experiment early. This is due to a few reasons. The first has to do with error rates: recall that at any given point, we are accepting a default 5% false positive rate. Much like the multiple comparisons problem, every time we check the results we expose ourselves to the possibility of flagging a false result, inflating our error rate. The second reason is seasonality and day-of-the-week effects. If, for example, our customers are more likely to purchase or spend more time in the app on weekends than on weekdays, we need to let the experiment run long enough to capture that overall behavior, especially if we want to generalize our results across the full week.
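To see why peeking inflates the error rate, here is a minimal simulation sketch. It assumes an A/A setup where both groups come from the same distribution, so every significant result is a false positive; the sample sizes and daily check points are illustrative, not from the course.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_experiments = 2000              # simulated A/A experiments (no true difference)
daily_n = 200                     # new users per group per day (hypothetical)
n_days = 7                        # planned experiment length
check_days = [2, 3, 4, 5, 6, 7]   # days on which we "peek" at the results

false_positives_single = 0        # test only once, at the planned end
false_positives_peeking = 0       # stop at the first significant peek

for _ in range(n_experiments):
    a = rng.normal(0, 1, size=daily_n * n_days)
    b = rng.normal(0, 1, size=daily_n * n_days)

    # Single test at the planned end of the experiment
    _, p_final = stats.ttest_ind(a, b)
    false_positives_single += p_final < 0.05

    # Peek at the accumulated data each day and stop at the first "significant" result
    for day in check_days:
        n = daily_n * day
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:
            false_positives_peeking += 1
            break

print(f"False positive rate, single final test: {false_positives_single / n_experiments:.3f}")
print(f"False positive rate, peeking daily:     {false_positives_peeking / n_experiments:.3f}")
```

The single final test stays close to the nominal 5%, while stopping at the first significant peek pushes the false positive rate well above it.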
3. Best practices
Another good practice when testing big changes is to design simple, implementable versions before dedicating time and resources to building full-fledged features whose impact we don't yet know. One good way to do this is a painted door test, which presents a partially implemented feature to gauge users' interest before committing to building it. For example, clicking a toggle for an option that is not available yet could show a message thanking users for their interest and providing the earliest availability date. However, maintaining ethical standards with these tests is especially critical.
Finally, unless we want to draw conclusions about an entirely new experience, it's advisable to change only one variable at a time in each experiment to isolate the impact of that variable. For example, if we change the art, copy, and button colors on a landing page, we can only claim that the results are due to all three changes combined; we cannot attribute the impact to any one of these variables alone.
4. Advanced topics
Shifting gears to some advanced topics: in cases where we think the different variables in a test may interact when combined, we can resort to a more advanced setup known as a multifactorial design. For instance, changing the button's color alone may increase clicks, and changing the font size may also have an isolated effect, but changing both may not produce the sum of their individual effects. Some changes may cancel each other out, while others may be synergistic, producing an effect larger than the simple addition of the isolated impact of either change alone.
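As a minimal sketch of a 2x2 multifactorial design, the example below simulates hypothetical click data where each change helps on its own but the combined lift is smaller than the sum of the two, then fits a model with an interaction term. The probabilities and sample sizes are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_per_cell = 4000  # users per combination of factors (illustrative)

# 2x2 factorial design: button color (0 = old, 1 = new) x font size (0 = small, 1 = large)
rows = []
for color in (0, 1):
    for font in (0, 1):
        # Hypothetical click probabilities: each change adds 2 points alone,
        # but the combination falls short of the full 4-point sum
        p = 0.10 + 0.02 * color + 0.02 * font - 0.015 * color * font
        clicks = rng.binomial(1, p, size=n_per_cell)
        rows.append(pd.DataFrame({"color": color, "font": font, "click": clicks}))

df = pd.concat(rows, ignore_index=True)

# "color * font" expands to color + font + color:font;
# the color:font coefficient estimates the interaction between the two changes
model = smf.ols("click ~ color * font", data=df).fit()
print(model.summary())
```

A clearly negative color:font coefficient would indicate the two changes partially cancel each other, while a positive one would suggest synergy.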
All of the statistical techniques we covered so far rely on frequentist statistics, where the parameter we infer is fixed and the inference uses only the current experiment's data. The Bayesian approach, on the other hand, can incorporate knowledge from previous data and experiments to form a prior belief, which is combined with the experiment's data to generate a posterior probability distribution for the population parameter. More simply, it can answer questions such as "what is the probability that the conversion rate of group A is higher than that of group B?", which is easy to interpret. Lastly, another advanced yet common topic is A/B testing with network effects. This is common in social network A/B tests, where a particular user may be exposed to features in a test they are not enrolled in, causing leakage and contamination of results. One solution is to assign whole clusters of users to the control and treatment groups, choosing clusters so that interactions between them are minimal.
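Returning to the Bayesian comparison mentioned above, here is a minimal sketch of computing the probability that one group's conversion rate exceeds the other's, assuming a Beta-Binomial model with a uniform prior; the conversion counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical observed data: conversions out of visitors for each group
conversions_a, visitors_a = 530, 10_000
conversions_b, visitors_b = 572, 10_000

# Beta(1, 1) prior (uniform); previous experiments could supply a more informative prior.
# The posterior for a conversion rate is Beta(1 + conversions, 1 + non-conversions).
posterior_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, size=100_000)
posterior_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, size=100_000)

# Directly interpretable statement: probability that B's conversion rate exceeds A's
prob_b_better = (posterior_b > posterior_a).mean()
print(f"P(conversion rate B > conversion rate A) = {prob_b_better:.3f}")
```

The output is a single probability that stakeholders can read directly, instead of a p-value that must be interpreted against a significance threshold.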
5. Let's practice!
Let's solidify these concepts with some practice.