Metrics design and estimation

1. Metrics design and estimation

Recall that AB tests require us to determine sound metrics that we'll track during our experiment. In this video we will explore metrics design and estimation.

2. Types of metrics

One way to categorize metrics is by the purpose they serve. Some metrics align testing efforts with business objectives and measure performance, while others help us better understand users' behavior. First, we have primary metrics, also known as north star, goal, or success metrics, or key performance indicators. These are the metrics that most accurately represent the business's success, the company's mission, or the goals of AB testing and optimization efforts. Examples are signup rate, average sales per user, and daily active users. Next, we have granular metrics, which go deeper than the primary metrics and are often more sensitive and actionable. In many cases, they capture users' intermediate actions along the user journey or marketing funnel. Decomposing a primary metric into its component pieces surfaces deeper levels that can be improved and influenced by direct design changes. For example, to improve signup rate, one can focus on increasing either or both of its components: clicks per visitor on the landing page, or final signups per click. Finally, we have instrumentation or guardrail metrics, such as page rendering speed and logging accuracy, but those are outside the scope of this course.
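
As a minimal sketch of that decomposition, using made-up funnel counts (all numbers here are purely illustrative):

```python
# Hypothetical funnel counts, for illustration only
visitors = 10_000  # landing-page visitors
clicks = 2_500     # clicks on the signup call-to-action
signups = 500      # completed signups

click_through_rate = clicks / visitors  # granular: 0.25
signups_per_click = signups / clicks    # granular: 0.20
signup_rate = signups / visitors        # primary: 0.05

# The primary metric is the product of its granular components
assert abs(signup_rate - click_through_rate * signups_per_click) < 1e-12
```

Moving either granular component moves the primary metric, which is exactly what makes granular metrics actionable targets for design changes.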

3. Types of metrics

Quantitatively, metrics fall into several buckets. First, we have means and percentiles, such as average sales per variant or median time spent on a page. Although means and medians are both measures of central tendency, medians tend to be less affected by outliers than means. Next, we have proportions and rates. A good example is page abandonment rate: the number of users abandoning a page without taking any action divided by the total number of users visiting the page. Lastly, we have ratio metrics such as revenue per session and click-through rate, which is the number of clicks on a certain call-to-action divided by the total number of page visits. Each type of metric is useful in its own way, and there is no one-size-fits-all. Depending on the use case, one can even leverage a combination of all these metrics to decide the success or failure of an AB test.
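
Here is a quick sketch of computing each bucket with pandas on synthetic session-level data; the column names and distributions are assumptions for illustration, not the course dataset:

```python
import numpy as np
import pandas as pd

# Synthetic session-level data; columns are illustrative only
rng = np.random.default_rng(42)
sessions = pd.DataFrame({
    "sales": rng.exponential(scale=30, size=1_000),         # revenue per session
    "time_on_page": rng.exponential(scale=45, size=1_000),  # seconds
    "abandoned": rng.integers(0, 2, size=1_000),            # 1 = left without action
    "clicks": rng.integers(0, 3, size=1_000),               # clicks on the CTA
})

mean_sales = sessions["sales"].mean()            # mean
median_time = sessions["time_on_page"].median()  # percentile (the 50th)
abandonment_rate = sessions["abandoned"].mean()  # proportion: abandoners / visitors
ctr = sessions["clicks"].sum() / len(sessions)   # ratio: clicks / page visits
```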

4. Metrics requirements

In terms of requirements, metrics have to be stable and robust so they don't move heavily with random variation, while at the same time remaining sensitive to the changes intended to influence specific user actions. Metrics also need to be measurable, since not all effects can be easily measured. For example, user satisfaction is too abstract to measure directly, but retention and return rate can be measured and serve as good proxies for satisfaction. Last but not least, metrics have to be non-gameable. For instance, if the goal is to increase click-through rate, one can create the biggest, brightest button on the page, and although this might increase the number of clicks, it is not guaranteed to get users to actually sign up. Another example is time on page: one can create a long, confusing page that people spend a lot of time on, but they may ultimately never come back to the site after this negative experience.

5. Python metrics estimation

Looking at the checkout dataset again, let's calculate the proportion of users who purchased something on the page, segmented by the gender column. It looks like females on average purchased at a higher rate than males: 90% versus 78%. Examining the average order value by gender, filtered to only the Safari and Chrome browsers using the OR logical operator, doesn't show a big difference.
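
A sketch of these calculations with pandas, assuming the checkout dataset is a DataFrame with purchased (0/1), gender, browser, and order_value columns; the exact column names and file name are assumptions:

```python
import pandas as pd

# Assumed schema: purchased (0/1), gender, browser, order_value
checkout = pd.read_csv("checkout.csv")  # hypothetical file name

# Purchase rate by gender: the mean of a 0/1 flag is a proportion
print(checkout.groupby("gender")["purchased"].mean())

# Average order value by gender, filtered to Safari or Chrome
# with the "|" (OR) logical operator
safari_or_chrome = (checkout["browser"] == "safari") | (checkout["browser"] == "chrome")
print(checkout[safari_or_chrome].groupby("gender")["order_value"].mean())
```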

6. Python metrics estimation

Looking at the same metrics segmented by browser type does reveal some slight differences. For now, we won't consider whether these differences are statistically significant or just due to chance.
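
Continuing the same sketch, segmenting by browser instead of gender is just a change of the groupby key:

```python
# Same metrics, segmented by browser type rather than gender
print(checkout.groupby("browser")["purchased"].mean())
print(checkout.groupby("browser")["order_value"].mean())
```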

7. Let's practice!

Time to practice.