1. Ratio metrics and the delta method
Let's examine how we run AB tests with ratio metrics.
2. Ratio metrics A/B testing
We discussed mean metrics where the denominator is the number of users per variant such as average order value per user. We refer to this denominator as the unit of analysis, which is also our randomization unit since we randomize and assign units to each variant based on user_id. This is usually preferred because we want to measure the effects of our changes throughout a consistent user experience. If group A users see a different design every time they view a page or when they log in and out, the test results will be contaminated.
3. Ratio metrics A/B testing
Many of the metrics used in practice are ratio metrics such as click-through-rate, query per session, revenue per page view, and so on. In this case, the unit of analysis is no longer a user, but a page view or a session. This creates a problem for our metrics estimation since we've been relying on a critical assumption that the experimental units need to be independently identically distributed. Since one user can create more than one session and view a page more than once, we cannot claim that those sessions or page views are independent. This is where the Delta method can help.
4. Delta method motivation
We have so far been calculating the average order value and the purchase rate in a decoupled fashion. Meaning the average order value is only calculated when a user purchases an item.
Note how the mean order value is calculated based on the count of instances where a purchase occurred. But what if one group had high average order values and only a few purchases? This metric alone would miss that side of the story and the same can be true if we were to only use purchase rate.
Note how the mean order values calculated based on page views are different. In realistic AB testing scenarios, it is common to see these metrics move in opposite directions making it difficult to declare a winner. This creates the need to develop more comprehensive metrics to capture these effects. One metric would be the ratio of the total order values to the total page views per variant. Which is a good use case for utilizing the Delta method for variance estimation since the unit of analysis being checkout page views is more granular than the randomization unit of user_id.
5. Delta method variance
We won't go over the proof for the Delta method, but we listed some references. The formula and Python function shown can be used to calculate the variance of the ratio metrics which we can then use to create confidence intervals or test for statistical significance.
6. Delta method z-test
To do so, we pre-defined the z-test delta function that takes ratio metrics data for the control and treatment groups along with the desired alpha level for the test. For the common online AB tests traffic size, this method has sufficient accuracy. The function outputs the control and treatment means, the average difference between the two, the confidence intervals corresponding to the chosen alpha, and a p-value.
7. Python example
In Python, we first define the user-level ratio components for each of the variants starting with the numerator of total order value per user_id as the order value column, and the denominator as total page views per user_id as the page_view column. We then assign these columns the respective x and y variable names for the control and treatment and pass them to the ztest_delta function. Looks like the treatment group B has a higher average order value by four-point-eight dollars on average with a practically zero p-value. This signals a high confidence in rejecting the Null hypothesis of equal means.
8. Let's practice!
Time for some exercises!