1. Introduction to correlations
Correlations investigate associations between variables.
2. Correlation in A/B design
A correlation assesses the strength and direction of the relationship between two variables: how consistently one variable increases or decreases as the other increases. We hypothesize that enjoyment is correlated with the time to eat pizza.
A/B test designs have two groups, such as Cheese and Pepperoni pizza toppings, and each group can have multiple measures, such as time to eat the pizza and enjoyment of the pizza, so several correlations can be run on an A/B design. The groups can be ignored, assessing the correlation of time and enjoyment of pizza in general.
3. Correlation in A/B design
We can also test within groups, assessing the correlation of time and enjoyment for Cheese pizza only and for Pepperoni pizza only. In A/B designs, the groups are generally of interest individually. Obtaining a different correlation in each group can reveal the impact of the group on the relationship between the variables.
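As a minimal sketch of the overall and within-group correlations described above, the following uses simulated data. The data frame pizza, its column names (topping, enjoyment, time), and all values are assumptions made up for illustration, not the course dataset.

```r
# Hypothetical A/B data: column names and values are invented for illustration.
set.seed(1)
pizza <- data.frame(
  topping   = rep(c("Cheese", "Pepperoni"), each = 50),
  enjoyment = runif(100, min = 1, max = 10)
)

# Simulate a different enjoyment/time relationship in each group.
pizza$time <- ifelse(
  pizza$topping == "Cheese",
  20 - 1.0 * pizza$enjoyment,
  20 - 0.2 * pizza$enjoyment
) + rnorm(100, sd = 1.5)

# Correlation ignoring the groups.
overall_cor <- cor(pizza$enjoyment, pizza$time)

# Correlation within each group: split the rows by topping, then apply cor().
group_cors <- sapply(
  split(pizza, pizza$topping),
  function(g) cor(g$enjoyment, g$time)
)

print(overall_cor)
print(group_cors)
```

Splitting by group before calling cor keeps the two correlations directly comparable, which is what makes a group difference visible.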
4. Correlation
Remember, correlation does not imply causation. The number of drownings and ice cream sales are significantly correlated, both increasing together. The likely explanation is that both ice cream consumption and drownings occur more often during warmer months. A/B tests can help us deduce causation by making and testing changes. Comparing pepperoni to cheese pizza, with topping as the only difference between the groups, can identify the topping as the likely cause of a difference between the groups' correlations. We are assessing the relationship between enjoyment and time to eat pizza, and think enjoyment may influence eating time. If the enjoyment and time relationship differs between the groups, we infer that the difference is caused by the topping. Making meaningful changes in A/B tests provides insight into the variables.
A correlation also does not determine whether one variable depends on the other. The number of drownings is not dependent on the number of ice cream sales, or vice versa. The data can be viewed using ggplot2: pass the dataset to ggplot, map the variables to x and y inside aes, and add geom_point to create a scatter plot.
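The scatter plot call described above might look like the sketch below. It assumes the ggplot2 package is installed. The data frame pizza and its columns are simulated stand-ins, not the course data.

```r
library(ggplot2)

# Hypothetical data: names and values are assumptions for illustration.
set.seed(1)
pizza <- data.frame(enjoyment = runif(50, min = 1, max = 10))
pizza$time <- 20 - 0.8 * pizza$enjoyment + rnorm(50, sd = 1.5)

# Dataset first, variables mapped to x and y inside aes(), then a point layer.
scatter <- ggplot(pizza, aes(x = enjoyment, y = time)) +
  geom_point()

print(scatter)
```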
5. Correlation coefficient
The correlation coefficient, r, measures the degree or strength of association, ranging from -1 to +1. A negative correlation indicates that one variable increases as the other decreases. A positive correlation means that as one variable increases, the other also increases. Zero indicates no linear correlation. A coefficient further from zero indicates a better fit and a stronger correlation. Of particular interest to many A/B design studies is the ability to predict data, which is based on correlations. For example, with a stronger correlation, the enjoyment of pizza can be better used to predict the time to eat the pizza, or vice versa.
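To make the sign and range of r concrete, here is a small sketch with made-up vectors. The variable names are hypothetical: an exactly linear increasing relationship gives r of 1, an exactly linear decreasing one gives r of -1, and unrelated noise typically gives a value near zero.

```r
set.seed(1)
x <- 1:20

# Exact linear relationships sit at the ends of the range.
r_positive <- cor(x, 2 * x + 1)    # y rises as x rises
r_negative <- cor(x, -3 * x + 5)   # y falls as x rises

# Unrelated noise: r is usually close to zero, though rarely exactly zero.
r_noise <- cor(x, rnorm(20))

print(c(positive = r_positive, negative = r_negative, noise = r_noise))
```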
6. Correlation values
The cor function gives the correlation coefficient. A correlation coefficient with an absolute value over 0.7, such as this one, is generally considered strong. In addition to the strength of the relationship, the proportion of variation in the dependent variable, x (here, time to eat), that can be attributed to the independent variable, y (enjoyment), can be found with R-squared, calculated by squaring the correlation coefficient saved as corvalue.
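A minimal sketch of the cor call and the R-squared calculation, using simulated data. Apart from corvalue, which the text names, the data frame pizza and its columns are assumptions. The last line checks the result against R-squared from a simple linear regression, which should agree.

```r
# Hypothetical data: names and values are assumptions for illustration.
set.seed(1)
pizza <- data.frame(enjoyment = runif(50, min = 1, max = 10))
pizza$time <- 20 - 0.8 * pizza$enjoyment + rnorm(50, sd = 1.5)

# Correlation coefficient, saved as corvalue.
corvalue <- cor(pizza$time, pizza$enjoyment)

# R-squared: the square of the correlation coefficient.
r_squared <- corvalue^2

# For comparison: R-squared reported by a one-predictor linear model.
model_r_squared <- summary(lm(time ~ enjoyment, data = pizza))$r.squared

print(c(r = corvalue, r_squared = r_squared, model = model_r_squared))
```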
7. Correlation limitations
Though outliers should be assessed in any test, correlations are particularly susceptible to distortion from outliers.
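The effect of an outlier can be illustrated with a short sketch on invented data. Two unrelated variables have a weak correlation until a single extreme point is added, after which the coefficient appears strong. All names and values here are hypothetical.

```r
set.seed(1)

# Two unrelated variables: any correlation is due to chance.
x <- runif(30, min = 1, max = 10)
y <- rnorm(30)
r_clean <- cor(x, y)

# Add one extreme observation far from the rest of the data.
x_out <- c(x, 100)
y_out <- c(y, 100)
r_outlier <- cor(x_out, y_out)

print(c(without_outlier = r_clean, with_outlier = r_outlier))
```

Plotting the data before computing a correlation is a simple way to catch points like this one.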
Note that the correlation coefficient does not give the slope of the line of best fit, shown here in red. This line is derived in regression analyses, which build on correlations. Additionally, the correlation coefficient itself is not an indication of statistical significance; it is used, along with the sample size, to determine a p-value and whether the null hypothesis can be rejected.
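Both points can be sketched with base R on simulated data. The variable names and values are assumptions. Changing the units of time from minutes to seconds changes the regression slope but leaves r unchanged, and cor.test reports the p-value that goes with the coefficient and sample size.

```r
# Hypothetical data: names and values are assumptions for illustration.
set.seed(1)
enjoyment <- runif(50, min = 1, max = 10)
time_min  <- 20 - 0.8 * enjoyment + rnorm(50, sd = 1.5)
time_sec  <- time_min * 60  # same data, different units

# The correlation coefficient ignores the units.
r_min <- cor(enjoyment, time_min)
r_sec <- cor(enjoyment, time_sec)

# The slope of the line of best fit comes from regression and does depend on units.
slope_min <- coef(lm(time_min ~ enjoyment))[["enjoyment"]]
slope_sec <- coef(lm(time_sec ~ enjoyment))[["enjoyment"]]

# Significance test for the correlation (Pearson by default).
test_result <- cor.test(enjoyment, time_min)

print(c(r_min = r_min, r_sec = r_sec))
print(c(slope_min = slope_min, slope_sec = slope_sec))
print(test_result$p.value)
```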
8. Let's practice!
Let's practice this.