Chi-square test of independence
1. Chi-square test of independence
Just as ANOVA extends t-tests to more than two groups, chi-square tests of independence extend proportion tests to more than two groups.
2. Revisiting the proportion test
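The narration for this slide says the chi-square statistic is the square of the z-score. A quick numerical check in base R, using only the quoted z-score; treating the z-test as two-sided is an assumption made here so that the two p-values are comparable:

```r
# z-score quoted in the narration
z <- -4.22

# Squaring it gives the chi-square statistic (about 17.8)
chi_sq <- z^2

# Upper-tail chi-square p-value with one degree of freedom
p_chisq <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

# Two-sided p-value from the standard normal (assumed two-sided test)
p_z <- 2 * pnorm(-abs(z))

# The two p-values agree up to floating-point error
all.equal(p_chisq, p_z)
```

With one degree of freedom, the chi-square test and the two-sided z-test are the same test expressed on different scales.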
Here's the proportion test from last time. In the first column, the test statistic is the square of the z-score. Minus four-point-two-two squared is seventeen-point-eight. In the second column, chisq_df means the degrees of freedom of a chi-square test. Its value is one.
3. Independence of variables
That proportion test gave a significant result: there was evidence of an association between the hobbyist and age category variables. If that weren't the case, and the proportion of hobbyists was the same in each age category, the variables would be considered statistically independent. More formally, two categorical variables are statistically independent when the proportion of successes in the response variable is the same across all categories of the explanatory variable.
4. Job satisfaction and age category
Recall that the Stack Overflow sample has an age category variable with two categories and a job satisfaction variable with five categories.
5. Declaring the hypotheses
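The chi-square statistic discussed in this section compares observed counts with the counts expected under independence. A minimal base-R sketch of that calculation; the counts and category labels are made up for illustration and are not the Stack Overflow data:

```r
# Made-up counts for illustration only (NOT the Stack Overflow data):
# rows are 2 age categories, columns are 5 job satisfaction levels
observed <- matrix(
  c(30, 45, 25, 80, 70,
    40, 50, 35, 95, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    age_cat = c("group_a", "group_b"),   # placeholder labels
    job_sat = paste0("level_", 1:5)      # placeholder labels
  )
)

# Counts expected under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: squared gaps between observed and expected,
# each scaled by the expected count, then summed over all cells
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq
```

The larger the statistic, the further the observed table is from what independence would predict.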
We can declare hypotheses to test for independence of these variables. Here, age category is the response variable, and job satisfaction is the explanatory variable. The null hypothesis is that the variables are independent; the alternative is that they are associated. I've set a significance level of point-one. The test statistic is denoted chi-square. It quantifies how far the observed results are from the values you'd expect if the variables were independent.
6. Exploratory visualization: proportional stacked bar plot
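A sketch of the kind of plot this section describes, assuming ggplot2 is installed. The data frame `survey` is simulated as a stand-in for the Stack Overflow sample, and its column names and category labels are assumptions, not taken from the real dataset:

```r
library(ggplot2)

set.seed(1)  # make the simulated data reproducible
n <- 500
sat_levels <- c("Very dissatisfied", "Slightly dissatisfied", "Neither",
                "Slightly satisfied", "Very satisfied")  # assumed labels

# Simulated stand-in for the survey sample (NOT the real data)
survey <- data.frame(
  age_cat = factor(sample(c("At least 30", "Under 30"), n, replace = TRUE)),
  job_sat = factor(sample(sat_levels, n, replace = TRUE), levels = sat_levels)
)

p <- ggplot(survey, aes(x = job_sat, fill = age_cat)) +  # aesthetic fill: bar color by age category
  geom_bar(position = "fill") +                          # position "fill": every bar scaled to height 1
  ylab("proportion")
p
```

Because the simulated variables are drawn independently, the color split should sit at roughly the same height in all five bars.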
Let's explore the data using a proportional stacked bar plot. fill means two different things here. In the aesthetics, fill refers to the fill color of the bars. In geom_bar's position argument, "fill" makes each bar the same height, so you can compare proportions. If age category were independent of job satisfaction, the split between the age categories would be at the same height in each of the five bars. There's some variation here, but we'll need the hypothesis test to determine whether it's a significant difference.
7. Chi-square independence test using chisq_test()
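A sketch of the call this section describes, assuming the infer package is installed and R 4.1 or later for the native pipe. The data frame `survey` is simulated, with assumed column names and labels, so the numbers it produces are not the course's results:

```r
library(infer)

set.seed(1)  # make the simulated data reproducible
n <- 500
sat_levels <- c("Very dissatisfied", "Slightly dissatisfied", "Neither",
                "Slightly satisfied", "Very satisfied")  # assumed labels

# Simulated stand-in for the survey sample (NOT the real data)
survey <- data.frame(
  age_cat = factor(sample(c("At least 30", "Under 30"), n, replace = TRUE)),
  job_sat = factor(sample(sat_levels, n, replace = TRUE), levels = sat_levels)
)

# Formula interface: response on the left, explanatory on the right
result <- survey |> chisq_test(age_cat ~ job_sat)
result  # one row with columns statistic, chisq_df, p_value

# Degrees of freedom: (response categories - 1) * (explanatory categories - 1)
df_by_hand <- (nlevels(survey$age_cat) - 1) * (nlevels(survey$job_sat) - 1)

# The p-value is the upper-tail area of the chi-square distribution
p_by_hand <- pchisq(as.numeric(result$statistic), df = df_by_hand,
                    lower.tail = FALSE)
```

The hand-computed degrees of freedom and p-value should match the `chisq_df` and `p_value` columns of the result.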
The hypothesis test for independence is called a chi-square independence test. There's a base-R function for it called chisq.test, but like prop.test, it's fiddly to use. The easiest option is to use infer's chisq_test. Pipe from the sample dataset, passing a formula with the response variable on the left and the explanatory variable on the right. The p-value is point-two-three, which is above the significance level we set, so we fail to reject the null hypothesis: there's no significant evidence that age category and job satisfaction are associated. The chi-square distribution, whose CDF is used to calculate the p-value from the test statistic, has a degrees of freedom parameter, just like the t-distribution. The results show that there are four degrees of freedom. This is the number of response categories minus one, times the number of explanatory categories minus one. Two minus one times five minus one is four.
8. Swapping the variables?
If we swap the variables, so job satisfaction is the response, and age category is the explanatory variable, the visual inspection technique is the same: see if the splits for each bar are in similar places.
9. Chi-square both ways
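The symmetry discussed in this section can be checked with a small sketch. It uses base R's chisq.test on a table and its transpose rather than infer, and the counts are made up for illustration:

```r
# Made-up 2 x 5 table of counts (NOT the Stack Overflow data)
counts <- matrix(
  c(30, 45, 25, 80, 70,
    40, 50, 35, 95, 60),
  nrow = 2, byrow = TRUE
)

# Test with the variables one way round, then swapped (transposed table)
one_way   <- chisq.test(counts)
other_way <- chisq.test(t(counts))

# Statistic, degrees of freedom, and p-value are the same both ways
c(one_way$statistic, one_way$parameter, p = one_way$p.value)
c(other_way$statistic, other_way$parameter, p = other_way$p.value)
```

Transposing the table swaps the roles of rows and columns, but every cell keeps the same observed and expected count, so the sum that defines the statistic is unchanged.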
If we run the chi-square test with the variables swapped, then the results are identical. We should ask questions like "are variables X and Y independent?", not "is variable X independent from variable Y?", since the order doesn't matter.
10. What about direction and tails?
We didn't worry about tails in this test, and in fact chisq_test doesn't have an alternative argument. This is because the test statistic is based on the squared differences between observed and expected counts, and squares are non-negative. Only large values of the statistic count as evidence against independence, so chi-square tests are almost always right-tailed tests.
11. Let's practice!
Let's try some examples.