1. Post-hoc analysis following ANOVA
After conducting ANOVA, we often need to understand specific differences between groups. This is where post-hoc analysis comes in, providing detailed insights into pairwise comparisons.
2. When to use post-hoc tests
Post-hoc tests are pivotal when ANOVA reveals significant differences among groups. They let us pinpoint exactly which pairs of groups differ from one another.
3. Key post-hoc methods
There are two common post-hoc methods:
The first is Tukey's HSD, named after statistician John Tukey and known for its robustness in multiple comparisons.
There's also the Bonferroni correction, named after mathematician Carlo Bonferroni, which adjusts p-values to control for Type I errors.
For broader comparisons, use Tukey's HSD; Bonferroni is better for reducing false positives in more focused tests.
4. The dataset: marketing ad campaigns
We'll work with a dataset of marketing campaigns, examining the Click_Through_Rate across the different ad campaigns to identify differences and determine which strategy is most effective.
5. Data organization with pivot tables
Pivot tables in pandas can be extremely helpful for organizing data, especially before conducting post-hoc analysis. A pivot table gives a clear comparison of the mean Click_Through_Rate for each campaign type.
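As a rough sketch, assuming the data is already loaded into a pandas DataFrame called marketing (the DataFrame name is an assumption; the Ad_Campaign and Click_Through_Rate columns come from the dataset described above), the pivot table could be built like this:

```python
import pandas as pd

# Assumed DataFrame `marketing`: one row per observation, with the campaign
# label in Ad_Campaign and the observed rate in Click_Through_Rate.
ctr_pivot = marketing.pivot_table(
    values="Click_Through_Rate",  # continuous response to summarize
    index="Ad_Campaign",          # one row per campaign type
    aggfunc="mean",               # compare mean click-through rates
)
print(ctr_pivot)
```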
6. Performing ANOVA
We start with ANOVA to assess if there's a significant difference in these Click_Through_Rates among the campaigns. This sets the stage for further analysis if significant differences are found.
First, we specify the different campaign types. Then we create the groups using a list comprehension to extract the Click_Through_Rate for each Ad_Campaign.
Next, we perform the ANOVA across the three campaign types, unpacking the groups using an asterisk, to compare their mean click-through rates.
The very small p-value here indicates significant differences in these means.
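Putting those steps together in a minimal sketch (again assuming the marketing DataFrame from before):

```python
from scipy.stats import f_oneway

# The three campaign types we want to compare
campaign_types = ["Loyalty Reward", "New Arrival", "Seasonal Discount"]

# List comprehension: one series of Click_Through_Rate values per campaign
groups = [
    marketing[marketing["Ad_Campaign"] == campaign]["Click_Through_Rate"]
    for campaign in campaign_types
]

# Unpacking with * passes each group as a separate argument to f_oneway
f_stat, p_value = f_oneway(*groups)
print(f"F-statistic: {f_stat:.3f}, p-value: {p_value:.2e}")
```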
7. Tukey's HSD test
If ANOVA indicates significant differences, Tukey's HSD test helps us understand exactly which campaigns differ.
The pairwise_tukeyhsd function from statsmodels.stats.multicomp takes the continuous response variable, Click_Through_Rate in this case, the categorical variable with more than two groups, Ad_Campaign, and the significance level, alpha.
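A sketch of the call, under the same assumed marketing DataFrame:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# endog: continuous response, groups: categorical variable, alpha: significance level
tukey_results = pairwise_tukeyhsd(
    endog=marketing["Click_Through_Rate"],
    groups=marketing["Ad_Campaign"],
    alpha=0.05,
)
print(tukey_results)  # prints the summary table interpreted below
```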
To interpret this table, we focus on the meandiff, p-adj (adjusted p-value), and reject columns, keeping in mind that meandiff is the second group's mean minus the first group's mean. For the first row, Loyalty Reward versus New Arrival, the mean difference is 0.2211 with an adjusted p-value below 0.05, indicating that the New Arrival group has a significantly higher mean than the Loyalty Reward group. For Loyalty Reward versus Seasonal Discount, on row 2, the mean difference is -0.2738; with an adjusted p-value below 0.05, this tells us the Seasonal Discount group has a significantly lower mean than the Loyalty Reward group. Lastly, for New Arrival versus Seasonal Discount, the mean difference is -0.4949 with an adjusted p-value below 0.05, indicating that the Seasonal Discount group has a significantly lower mean than the New Arrival group. The reject column is True for all three rows, confirming that each pairwise difference is significant at alpha equal to 0.05.
8. Bonferroni correction set-up
The Bonferroni correction is a stringent method to adjust p-values when conducting multiple pairwise comparisons, effectively reducing the chances of a Type I error.
A little more data preparation is required before applying the Bonferroni correction. We begin by creating an empty p-values list.
Then, we lay out a list of tuples containing the pairwise comparisons that we will iterate over.
Next, we iterate over the tuples in comparisons, using the tuple elements to extract the Click_Through_Rate for both groups.
We run ttest_ind on the click-through rates in a pairwise fashion and append the p-values to our list.
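Here is how that set-up might look as a sketch, continuing with the assumed marketing DataFrame:

```python
from scipy.stats import ttest_ind

# Empty list to collect the raw (unadjusted) p-values
p_values = []

# List of tuples: every pairwise comparison between the three campaigns
comparisons = [
    ("Loyalty Reward", "New Arrival"),
    ("Loyalty Reward", "Seasonal Discount"),
    ("New Arrival", "Seasonal Discount"),
]

for group1, group2 in comparisons:
    # Extract the click-through rates for each group in the pair
    ctr1 = marketing[marketing["Ad_Campaign"] == group1]["Click_Through_Rate"]
    ctr2 = marketing[marketing["Ad_Campaign"] == group2]["Click_Through_Rate"]

    # Independent two-sample t-test for this pair, keeping only the p-value
    p_values.append(ttest_ind(ctr1, ctr2).pvalue)
```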
9. Performing Bonferroni correction
Now we apply the Bonferroni correction using the multipletests function.
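Continuing from the set-up sketch above, the correction step could look like this:

```python
from statsmodels.stats.multitest import multipletests

# Bonferroni-adjust the raw p-values collected in the loop above
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for (group1, group2), p_adj, rej in zip(comparisons, p_adjusted, reject):
    print(f"{group1} vs {group2}: adjusted p-value = {p_adj:.2e}, reject H0: {rej}")
```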
The resulting adjusted p-values for the three comparisons are all extremely small. This again provides evidence that each of the three campaigns has a significantly different click-through rate from the others.
10. Let's practice!
Time to try out your own post-hoc analysis!