1. Pearson correlation
The most common test to determine the statistical significance of a correlation coefficient is the Pearson Correlation.
2. Pearson correlation assumptions
The choice of correlation test depends on the properties of the data. Pearson correlation, or Pearson’s r, assesses the strength and direction of a linear relationship between two variables that each come from a normal distribution.
3. Pearson correlation and A/B tests
Suppose we want to run a Pearson correlation on data from an A/B design in which subjects eat either a Cheese pizza or a Pepperoni pizza. We are interested in the relationship between the time taken to eat the pizza and enjoyment of the pizza. If we assess the data ignoring groups, the null hypothesis of the Pearson correlation is that there is no relationship between time to eat and enjoyment of the pizza.
If we are interested in each group individually, the null hypothesis becomes that there is no relationship between time to eat and enjoyment of the Cheese pizza, and likewise for the Pepperoni pizza, which requires two correlation analyses.
4. Determine sample size
To determine the number of subjects needed, we can use pwr.r.test() from the pwr package, specifying the effect size we expect to detect, the power, and alpha, the significance level at which we will reject the null. To detect an effect of 0.3 with a power of 0.8 and an alpha of 0.05, we need 85 data points to contribute to the Pearson analysis.
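As a minimal sketch of the sample size calculation just described, assuming the pwr package is installed, the call could look like the following. The variable names sample_size and n_needed are illustrative only.

```r
# Sketch: sample size for a Pearson correlation (requires the pwr package).
library(pwr)

# Leave n unspecified so that pwr.r.test() solves for it.
sample_size <- pwr.r.test(r = 0.3,          # expected effect (correlation)
                          power = 0.8,      # desired power
                          sig.level = 0.05) # alpha

# The returned n is fractional, so round up to whole subjects.
n_needed <- ceiling(sample_size$n)
print(n_needed)
```

Rounding the fractional result up gives the number of data points to collect.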
5. Assessing linearity
Once the subjects have provided data, assess linearity by creating a scatter plot using ggplot() with enjoyment as x and time as y, and geom_point(). Provided the points do not curve, we have a linear relationship. Note that time and enjoyment of both groups are plotted together, ignoring the groups. Since the points do not separate into distinct clusters that might indicate one group versus the other, a linear correlation ignoring groups is reasonable.
If distinct clusters do form, a correlation ignoring groups can still be performed, but it should be followed by a test of each group, regardless of the hypotheses. Provided each cluster derives from a different group, the slope of the line ignoring the groups will be affected by the separation between the group clusters.
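The scatter plot described above might be sketched as follows. The course data is not shown here, so this example simulates a hypothetical data frame called pizza_data; the data frame name, the simulated values, and the lowercase column names enjoyment and time are all assumptions made for illustration.

```r
# Sketch: scatter plot to check linearity (requires the ggplot2 package).
library(ggplot2)

# Hypothetical, simulated stand-in for the pizza data (not the real data).
set.seed(42)
n <- 85
pizza_data <- data.frame(
  Topping   = rep(c("Cheese", "Pepperoni"), length.out = n),
  enjoyment = rnorm(n, mean = 6, sd = 1.5)
)
pizza_data$time <- 10 + 0.8 * pizza_data$enjoyment + rnorm(n, sd = 2)

# Enjoyment on x, time on y, one point per subject, groups ignored.
scatter <- ggplot(pizza_data, aes(x = enjoyment, y = time)) +
  geom_point()
print(scatter)
```

To see whether any clusters line up with the groups, one option is to add color = Topping inside aes().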
6. Assessing normality
Pearson is a parametric test, meaning the data are assumed to be normally distributed. To assess normality, use shapiro.test() on each of the time and enjoyment variables.
If the p-value of the Shapiro test is below alpha, commonly 0.05, we reject the null hypothesis of normality and treat the data as not normal. The p-values here are above 0.05, so we have no evidence against normality and the data can be assessed using the Pearson correlation. Note that we are ignoring groups, as both pizza topping groups are included in each Shapiro test.
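A minimal sketch of the two Shapiro tests, again on a simulated, hypothetical pizza_data data frame (the name and values are assumptions, not the course data), could look like this:

```r
# Hypothetical, simulated stand-in for the pizza data (not the real data).
set.seed(42)
n <- 85
pizza_data <- data.frame(
  Topping   = rep(c("Cheese", "Pepperoni"), length.out = n),
  enjoyment = rnorm(n, mean = 6, sd = 1.5)
)
pizza_data$time <- 10 + 0.8 * pizza_data$enjoyment + rnorm(n, sd = 2)

# One Shapiro-Wilk test per variable, with both topping groups pooled.
shapiro_time      <- shapiro.test(pizza_data$time)
shapiro_enjoyment <- shapiro.test(pizza_data$enjoyment)

# Compare each p-value with alpha (commonly 0.05).
alpha <- 0.05
print(shapiro_time$p.value > alpha)
print(shapiro_enjoyment$p.value > alpha)
```

A TRUE result means the test found no evidence against normality for that variable.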
7. Pearson ignoring groups
To run the Pearson correlation ignoring groups, use cor.test(), naming the two variables whose relationship you want to assess with the formula ~ x + y, here ~ enjoyment + time. Specify the method as "pearson". The raw, unstandardized data can be input here.
The output gives the t-value, the degrees of freedom, and the p-value to determine significance. It also provides the correlation coefficient, which can be squared to determine the proportion of variation in the dependent variable, y or time, that is attributed to the independent variable, x or enjoyment.
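The call and the pieces of its output could be sketched as below, using a simulated, hypothetical pizza_data data frame (the name and values are assumptions for illustration only):

```r
# Hypothetical, simulated stand-in for the pizza data (not the real data).
set.seed(42)
n <- 85
pizza_data <- data.frame(
  Topping   = rep(c("Cheese", "Pepperoni"), length.out = n),
  enjoyment = rnorm(n, mean = 6, sd = 1.5)
)
pizza_data$time <- 10 + 0.8 * pizza_data$enjoyment + rnorm(n, sd = 2)

# Pearson correlation on the pooled data, ignoring Topping.
pearson_all <- cor.test(~ enjoyment + time,
                        data = pizza_data,
                        method = "pearson")

pearson_all$statistic  # t-value
pearson_all$parameter  # degrees of freedom (n - 2)
pearson_all$p.value    # p-value
pearson_all$estimate   # correlation coefficient r

# Squaring r gives the proportion of variation explained.
r_squared <- unname(pearson_all$estimate)^2
print(r_squared)
```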
8. Pearson within groups
To restrict cor.test() to one group, add a subset argument that compares the group column, Topping, with the group of interest, Cheese.
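A sketch of the within-group call, on a simulated, hypothetical pizza_data data frame (the name and values are assumptions for illustration only), might be:

```r
# Hypothetical, simulated stand-in for the pizza data (not the real data).
set.seed(42)
n <- 85
pizza_data <- data.frame(
  Topping   = rep(c("Cheese", "Pepperoni"), length.out = n),
  enjoyment = rnorm(n, mean = 6, sd = 1.5)
)
pizza_data$time <- 10 + 0.8 * pizza_data$enjoyment + rnorm(n, sd = 2)

# Same formula as before, restricted to the Cheese rows via subset.
pearson_cheese <- cor.test(~ enjoyment + time,
                           data = pizza_data,
                           subset = Topping == "Cheese",
                           method = "pearson")
print(pearson_cheese)
```

Repeating the call with Topping == "Pepperoni" would give the second of the two within-group analyses.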
9. Power analysis
The power analysis for a Pearson correlation can be run using pwr.r.test(). Specify the correlation coefficient with r, the number of samples with n, and the significance level with sig.level. Leaving power unspecified makes the function solve for it.
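As a hedged sketch, assuming the pwr package is installed, the power analysis could look like this. The values plugged in are the planning values from the sample size step (r of 0.3, n of 85, alpha of 0.05) and are used only for illustration; in practice the correlation coefficient and sample size would come from the analysis itself.

```r
# Sketch: power of a Pearson correlation test (requires the pwr package).
library(pwr)

# Leave power unspecified so that pwr.r.test() solves for it.
power_result <- pwr.r.test(n = 85,           # number of samples
                           r = 0.3,          # correlation coefficient
                           sig.level = 0.05) # significance level

print(power_result$power)
```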
10. Let's practice!
Let’s practice the Pearson Correlation.