
Data cleaning and exploratory analysis

1. Data cleaning and exploratory analysis

Let's talk about data cleaning and exploration.

2. Cleaning missing values

Data cleaning is too big a topic to cover fully in this course, but we'll focus on a few A/B testing-related points. How we deal with missing values will depend on the case, but it's useful to understand the reason behind missing data and its frequency. Sometimes we can drop the missing rows if they represent a small percentage and won't cause imbalances in the dataset. Other times we can impute the missing data using the mean, an interpolated value, or the most frequent value. And in other cases, having missing values is completely fine. We see one such case in the order value column of the checkout dataset. When a user doesn't make a purchase, the order value will naturally be empty, and it's important to keep it that way for accurate mean calculations. If we replace that value with zero, it gets counted in the denominator of the average order value when it shouldn't be. For example, replacing the missing values with zeros using the fillna method produced the wrong average order value because it mistakenly assigned a purchase value of zero to customers who didn't make a purchase at all.
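A minimal sketch of this pitfall, assuming a small hypothetical checkout DataFrame where order_value is NaN when no purchase was made (the column names and numbers here are illustrative, not the course dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical checkout data: order_value is NaN when the user made no purchase
checkout = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "purchased": [1, 0, 1, 0, 1],
    "order_value": [20.0, np.nan, 30.0, np.nan, 40.0],
})

# Correct: NaN rows are ignored, so only real purchases enter the mean
print(checkout["order_value"].mean())            # 30.0

# Misleading: filling with zero counts non-purchasers in the denominator
print(checkout["order_value"].fillna(0).mean())  # 18.0
```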

3. Cleaning duplicates

On some occasions we experience logging issues that produce duplicate data. Those rows should be dropped so we don't double-count and inflate our metrics. Here we use the drop_duplicates method while keeping the first row. The length of the data with and without dropping duplicates is the same, so we conclude that we don't have identical rows of data to drop.
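A quick sketch of that check, again on a hypothetical checkout DataFrame; if the two lengths match, there were no identical rows to drop:

```python
import pandas as pd

# Hypothetical checkout data (no identical rows in this toy example)
checkout = pd.DataFrame({
    "user_id": [1, 2, 3],
    "purchased": [1, 0, 1],
    "order_value": [20.0, None, 30.0],
})

# Drop fully identical rows, keeping the first occurrence of any duplicate
deduped = checkout.drop_duplicates(keep="first")

# Equal lengths mean there were no identical rows to drop
print(len(checkout), len(deduped))
```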

4. Cleaning duplicates

In other cases, the duplication is not an error. For example, the same user can land on the same page more than once during the test. What we do here depends on how we want to define our metrics. For the purchase rate metric with duplicated user_ids, we can define the numerator as the number of unique users who purchased at least once by taking the max of the purchased column before summing, or we can sum the total purchases if we want to detect overall changes in purchases. Both definitions are valid, but we need to apply them consistently across variants.
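A sketch of both definitions, assuming hypothetical columns named user_id, variant, and purchased:

```python
import pandas as pd

# Hypothetical checkout data where the same user can appear more than once
checkout = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3, 3],
    "variant":   ["A", "A", "A", "B", "B", "B"],
    "purchased": [1, 0, 0, 1, 1, 0],
})

# Definition 1: unique purchasers -- max() flags users who bought at least once
unique_purchasers = checkout.groupby(["variant", "user_id"])["purchased"].max()
purchase_rate = unique_purchasers.groupby("variant").mean()
print(purchase_rate)

# Definition 2: total purchases per variant, to track overall purchase volume
total_purchases = checkout.groupby("variant")["purchased"].sum()
print(total_purchases)
```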

5. EDA summary stats

One of the quickest ways to get summary statistics for our metrics is to group by the experiment variant and use the dot-agg method with the mean, standard deviation, and count functions. Applying this to the order value column shows that variant C has the highest mean order value at 35 dollars, B has the highest spread at 7 dollars, and A has the lowest number of purchasers.
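A minimal sketch of that aggregation, assuming columns named checkout_page and order_value (the values below are made up, not the course results):

```python
import pandas as pd

# Hypothetical order values per checkout page variant
checkout = pd.DataFrame({
    "checkout_page": ["A", "A", "B", "B", "C", "C"],
    "order_value":   [22.0, 28.0, 20.0, 38.0, 33.0, 37.0],
})

# Mean, standard deviation, and count of order value per variant
summary = checkout.groupby("checkout_page")["order_value"].agg(["mean", "std", "count"])
print(summary)
```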

6. EDA plotting

To visualize our results, we use seaborn, starting with a bar plot of the mean order value and passing the checkout page and order value columns to the x and y parameters. The default estimator is the mean, but any other aggregate function can be used.
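A sketch of the bar plot under the same column-name assumptions:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical order values per checkout page variant
checkout = pd.DataFrame({
    "checkout_page": ["A", "A", "B", "B", "C", "C"],
    "order_value":   [22.0, 28.0, 20.0, 38.0, 33.0, 37.0],
})

# Bar plot of the mean order value per variant (the mean is the default estimator)
sns.barplot(data=checkout, x="checkout_page", y="order_value")
plt.show()

# A different aggregate can be passed explicitly, for example the median
sns.barplot(data=checkout, x="checkout_page", y="order_value", estimator=np.median)
plt.show()
```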

7. EDA plotting

A histogram is another great way to visualize the distribution of our data. Here we use seaborn's displot function on the order value, passing the checkout page to the hue parameter to separate the variants. Glancing over the distributions, we can already confirm that group B has the widest spread of values and A has the narrowest spread but the lowest average order value.
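A sketch of the histogram, again with made-up order values:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical order values per checkout page variant
checkout = pd.DataFrame({
    "checkout_page": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "order_value":   [24, 25, 26, 27, 15, 25, 40, 48, 32, 34, 36, 38],
})

# Histogram of order values, colored by variant to compare the distributions
sns.displot(data=checkout, x="order_value", hue="checkout_page")
plt.show()
```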

8. EDA plotting

Finally, plotting the average daily metrics over the experiment duration with seaborn's lineplot function may reveal important patterns. Looking at the average proportion of users responding 'yes' to an ad in the AdSmart dataset over the experiment days shows interesting trends for each variant. Explaining such trends requires more context about how the test was run and some domain expertise, but uncovering them during the exploratory data analysis phase is often useful.
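A sketch of that time-series view, assuming AdSmart-style columns named date, experiment, and yes (a 1/0 response flag); these names and values are an assumption for illustration, not the dataset's documented schema:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical AdSmart-style data: one row per user with the date,
# the assigned variant, and a 1/0 flag for responding 'yes' to the ad
adsmart = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-03", "2020-07-03", "2020-07-04",
                            "2020-07-04", "2020-07-05", "2020-07-05"]),
    "experiment": ["control", "exposed"] * 3,
    "yes": [0, 1, 1, 1, 0, 1],
})

# Average daily proportion of 'yes' responses per variant over the test duration
# (lineplot aggregates repeated x values with the mean by default)
sns.lineplot(data=adsmart, x="date", y="yes", hue="experiment")
plt.show()
```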

9. Let's practice!

Time to practice.