Metrics design and estimation

1. Metrics design and estimation

Recall that AB tests require us to determine sound metrics that we'll track during our experiment. In this video we will explore metrics design and estimation.

2. Types of metrics

One way to categorize metrics is by the purpose they serve. Some metrics align testing efforts with business objectives and measure performance, while others help us better understand users' behavior. First, we have primary metrics, also known as north star, goal, or success metrics, or key performance indicators. These are the metrics that most accurately represent the business's success, the company's mission, or the goals of AB testing and optimization efforts. Examples are signup rate, average sales per user, and daily active users. Next, we have granular metrics, which go deeper than the primary metrics and are often more sensitive and actionable. In many cases, they capture users' intermediate actions along the user journey or marketing funnel. Decomposing a primary metric into its component pieces surfaces deeper levels that can be improved and influenced by direct design changes. For example, to improve signup rate, one can focus on increasing either or both of its components: clicks per visitor on the landing page, or final signups per click. Finally, we have instrumentation or guardrail metrics, such as page rendering speed and logging accuracy, but those are outside the scope of this course.
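
As a minimal sketch of that decomposition, using made-up funnel counts (all numbers here are purely illustrative):

```python
# Hypothetical funnel counts, for illustration only
visitors = 10_000  # landing-page visitors
clicks = 2_500     # clicks on the signup call-to-action
signups = 500      # completed signups

click_through_rate = clicks / visitors  # granular: 0.25
signups_per_click = signups / clicks    # granular: 0.20
signup_rate = signups / visitors        # primary: 0.05

# The primary metric is the product of its granular components
assert abs(signup_rate - click_through_rate * signups_per_click) < 1e-12
```

Moving either granular component moves the primary metric, which is exactly what makes granular metrics actionable targets for design changes.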

3. Types of metrics

Quantitatively, metrics fall into several buckets. First, we have means and percentiles, such as average sales per variant or median time spent on a page. Although means and medians are both measures of central tendency, medians tend to be less affected by outliers than means. Next, we have proportions and rates. A good example is page abandonment rate: the number of users abandoning a page without taking any action divided by the total number of users visiting the page. Lastly, we have ratio metrics such as revenue per session and click-through rate, which is the number of clicks on a certain call-to-action divided by the total number of page visits. Each type of metric is useful in its own way, and there is no one-size-fits-all. Depending on the use case, one can even leverage a combination of all these metrics to decide the success or failure of an AB test.
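
Here is a quick sketch of computing each bucket with pandas on synthetic session-level data; the column names and distributions are assumptions for illustration, not the course dataset:

```python
import numpy as np
import pandas as pd

# Synthetic session-level data; columns are illustrative only
rng = np.random.default_rng(42)
sessions = pd.DataFrame({
    "sales": rng.exponential(scale=30, size=1_000),         # revenue per session
    "time_on_page": rng.exponential(scale=45, size=1_000),  # seconds
    "abandoned": rng.integers(0, 2, size=1_000),            # 1 = left without action
    "clicks": rng.integers(0, 3, size=1_000),               # clicks on the CTA
})

mean_sales = sessions["sales"].mean()            # mean
median_time = sessions["time_on_page"].median()  # percentile (the 50th)
abandonment_rate = sessions["abandoned"].mean()  # proportion: abandoners / visitors
ctr = sessions["clicks"].sum() / len(sessions)   # ratio: clicks / page visits
```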

4. Metrics requirements

In terms of requirements, metrics have to be stable and robust so they don't move heavily with random variation, while at the same time remaining sensitive to the changes intended to influence specific user actions. Metrics also need to be measurable, since not all effects can be easily measured. For example, user satisfaction is too abstract to measure directly, but retention and return rate can be measured and serve as good proxies for satisfaction. Last but not least, metrics have to be non-gameable. For instance, if the goal is to increase click-through rate, one can create the biggest, brightest button on the page, and although this might increase the number of clicks, it is not guaranteed to get users to actually sign up. Another example is time on page: one can create a long, confusing page that people spend a lot of time on, but they may ultimately never come back to the site after this negative experience.

5. Python metrics estimation

Looking at the checkout dataset again, let's calculate the proportion of users who purchased something on the page, segmented by the gender column. It looks like females on average purchased at a higher rate than males: 90% versus 78%. Examining the average order value by gender, filtered to only the Safari and Chrome browsers using the OR logical operator, doesn't show a big difference.
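
A sketch of these calculations with pandas, assuming the checkout dataset is a DataFrame with purchased (0/1), gender, browser, and order_value columns; the exact column names and file name are assumptions:

```python
import pandas as pd

# Assumed schema: purchased (0/1), gender, browser, order_value
checkout = pd.read_csv("checkout.csv")  # hypothetical file name

# Purchase rate by gender: the mean of a 0/1 flag is a proportion
print(checkout.groupby("gender")["purchased"].mean())

# Average order value by gender, filtered to Safari or Chrome
# with the "|" (OR) logical operator
safari_or_chrome = (checkout["browser"] == "safari") | (checkout["browser"] == "chrome")
print(checkout[safari_or_chrome].groupby("gender")["order_value"].mean())
```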

6. Python metrics estimation

Looking at the same metrics segmented by browser type does reveal some slight differences. For now, we won't consider whether these differences are statistically significant or just due to chance.
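
Continuing the same sketch, segmenting by browser instead of gender is just a change of the groupby key:

```python
# Same metrics, segmented by browser type rather than gender
print(checkout.groupby("browser")["purchased"].mean())
print(checkout.groupby("browser")["order_value"].mean())
```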

7. Let's practice!

Time to practice.