Chi-square goodness of fit tests

1. Chi-square goodness of fit tests

Last time you used a chi-square test to compare proportions in two categorical variables. This time, we'll use another variant of the chi-square test to compare a single categorical variable to a hypothesized distribution.

2. Purple links

The Stack Overflow survey contains a fun question about how the user feels when they discover that they already visited the top resource when trying to solve a coding problem. There are four possible answers, stored in the purple_link variable.

3. Declaring the hypotheses

Let's hypothesize that half the users in the population would respond "Hello old friend", and the other three responses would get one sixth each. The tribble function provides a convenient way of manually entering values for a data frame, and returns a standard tibble. The null hypothesis is that the sample matches this hypothesized distribution; the alternative hypothesis is that it does not. The test statistic measures how far the distribution of proportions observed in the sample is from the hypothesized distribution of proportions. I'll set a significance level of 0.01.
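As a rough sketch of what entering the hypothesized distribution with tribble might look like: only the "Hello old friend" label comes from the narration, so the other three labels here are placeholders, and the object names hypothesized and alpha are assumptions.

```r
library(tibble)

# Hypothesized distribution: one half for "Hello old friend",
# one sixth for each of the other three responses.
# "Response B/C/D" are placeholder labels, not the real survey answers.
hypothesized <- tribble(
  ~purple_link,        ~prop,
  "Hello old friend",  1 / 2,
  "Response B",        1 / 6,
  "Response C",        1 / 6,
  "Response D",        1 / 6
)

# Significance level used for the test
alpha <- 0.01

# Sanity check: the hypothesized proportions should sum to one
stopifnot(isTRUE(all.equal(sum(hypothesized$prop), 1)))
```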

4. Hypothesized counts by category

To visualize the distribution of the purple links, it will help to have the hypothesized counts. These are the hypothesized proportions times the total number of observations in the sample.
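Here is a minimal sketch of that calculation. The survey data itself is not reproduced here, so the survey tibble below is a small synthetic stand-in with made-up counts and placeholder response labels; only the proportions and the "proportion times sample size" step come from the narration.

```r
library(dplyr)
library(tibble)

# Synthetic stand-in for the survey: 120 made-up responses.
# "Response B/C/D" are placeholder labels, not the real answers.
survey <- tibble(
  purple_link = rep(
    c("Hello old friend", "Response B", "Response C", "Response D"),
    times = c(70, 20, 18, 12)
  )
)

hypothesized <- tribble(
  ~purple_link,        ~prop,
  "Hello old friend",  1 / 2,
  "Response B",        1 / 6,
  "Response C",        1 / 6,
  "Response D",        1 / 6
)

# Total number of observations in the sample
n_total <- nrow(survey)

# Hypothesized count = hypothesized proportion * sample size
hypothesized <- hypothesized %>%
  mutate(n = prop * n_total)

hypothesized
```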

5. Visualizing counts

The natural way to visualize the categories is with a bar plot. We're using geom_col here since we already calculated the counts. The purple points show the hypothesized counts, so you can compare the sample distribution to the hypothesized distribution. Two of the bars are close to the values we hypothesized and two are slightly different. We'll need to run the hypothesis test to see if the differences are statistically significant.
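A plot along these lines could be sketched as follows. The data is a synthetic stand-in with made-up counts and placeholder response labels, so the bars will not match the ones described above; the names survey, observed, and plot_counts are assumptions.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Synthetic stand-in for the survey (made-up counts, placeholder labels)
survey <- tibble(
  purple_link = rep(
    c("Hello old friend", "Response B", "Response C", "Response D"),
    times = c(70, 20, 18, 12)
  )
)

# Hypothesized counts: hypothesized proportion * sample size
hypothesized <- tribble(
  ~purple_link,        ~prop,
  "Hello old friend",  1 / 2,
  "Response B",        1 / 6,
  "Response C",        1 / 6,
  "Response D",        1 / 6
) %>%
  mutate(n = prop * nrow(survey))

# Observed counts per category
observed <- survey %>%
  count(purple_link)

# geom_col() draws bars from precomputed counts;
# geom_point() overlays the hypothesized counts in purple
plot_counts <- ggplot(observed, aes(x = purple_link, y = n)) +
  geom_col() +
  geom_point(data = hypothesized, color = "purple")

plot_counts
```

geom_col is used rather than geom_bar because the counts are already computed; geom_bar would count the rows itself.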

6. Chi-square goodness of fit test using chisq_test()

The one-sample chi-square test is called a goodness of fit test. To run it, we need the hypothesized proportions in vector form rather than in a tibble. Again we use chisq_test from infer. However, this time the arguments are different. Piping from the dataset, we set response to the name of the column of interest, and set p to the hypothesized distribution. The degrees of freedom are one less than the number of choices in the survey: four minus one is three. The p-value is very small, much lower than the significance level we set. Thus we reject the null hypothesis and conclude that the sample distribution of proportions is different from the hypothesized distribution of proportions.
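The call described above can be sketched as follows, again on a synthetic stand-in for the survey (made-up counts, placeholder labels), so its p-value says nothing about the real data. The sketch assumes that infer's chisq_test forwards p to stats::chisq.test and that p is matched to the factor levels by position; the hand-computed statistic is included only as a cross-check of what the test measures.

```r
library(infer)
library(tibble)

responses <- c("Hello old friend", "Response B", "Response C", "Response D")

# Synthetic stand-in for the survey; explicit factor levels fix the
# order in which the hypothesized proportions are matched
survey <- tibble(
  purple_link = factor(
    rep(responses, times = c(70, 20, 18, 12)),
    levels = responses
  )
)

# Hypothesized proportions as a plain vector, in factor-level order
hypothesized_props <- c(1 / 2, 1 / 6, 1 / 6, 1 / 6)

# Goodness of fit test: one response column, no explanatory variable
result <- survey %>%
  chisq_test(response = purple_link, p = hypothesized_props)

result

# Cross-check by hand: sum of (observed - expected)^2 / expected
observed_counts <- as.numeric(table(survey$purple_link))
expected_counts <- hypothesized_props * nrow(survey)
manual_stat <- sum((observed_counts - expected_counts)^2 / expected_counts)

# Degrees of freedom: number of categories minus one
manual_df <- length(responses) - 1
```

The result is a one-row tibble with statistic, chisq_df, and p_value columns; comparing p_value to the chosen significance level gives the test decision.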

7. Let's practice!

Let's do some one-sample chi-square tests.