Chi-squared test statistic

1. Chi-squared test statistic

When you look at the bar plots that relate party to spending on space exploration and spending on the military, you get two very different stories.

2. Comparing bar plots

The plot on the left shows little relationship between party affiliation and opinions on space spending, but the plot on the right suggests that opinions on military spending do differ by party. This is the structure in our particular dataset, but is it convincing evidence that this relationship exists in the population of all Americans? This is a question to be answered with a hypothesis test.

3. Hypothesis test

Recall that in a hypothesis test you specify the variables you're studying, assume a null hypothesis under which you can generate data, then for each generated dataset calculate a relevant test statistic that you can compare to your observed test statistic. In this case, since we're interested in the association between two variables, we can use the same null as before: that these variables are in fact independent. That allows us to generate data through permutation. The question is: what test statistic should we use?
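As a rough illustration of generating data through permutation, here is a minimal sketch in Python using only the standard library (the course itself works in R). The respondent data and the helper name `permute_once` are made up for this example and are not taken from the course dataset.

```python
import random
from collections import Counter

# Hypothetical toy data: one party label and one opinion per respondent.
party = ["Dem"] * 4 + ["Ind"] * 5 + ["Rep"] * 3
opinion = [
    "too little", "about right", "too much", "about right",
    "about right", "too little", "too much", "about right", "about right",
    "too much", "about right", "too little",
]

def permute_once(x, y, rng):
    """Return one dataset generated under independence by shuffling y against x."""
    shuffled = list(y)      # copy so the original responses are left untouched
    rng.shuffle(shuffled)
    return list(zip(x, shuffled))

rng = random.Random(1)      # seeded only so the sketch is repeatable
permuted = permute_once(party, opinion, rng)

# Shuffling breaks any link between the two variables,
# but the marginal counts of each variable are unchanged.
print(Counter(p for p, _ in permuted) == Counter(party))
print(Counter(o for _, o in permuted) == Counter(opinion))
```

Repeating the shuffle many times and computing a test statistic on each permuted dataset gives the null distribution that the observed statistic is compared against.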

4. Choosing a statistic

What we'd like is a test statistic that can capture how different each of these bar plots is from

5. Choosing a statistic

the bar plot that shows absolutely no relationship - this one. This plot, though, is built from proportions, and a statistic will be easier to build from counts,

6. Choosing a statistic

so let's switch to looking at the bar plot of counts for space and the corresponding contingency table, which this time we've renamed "observed_counts". What we'd like is a table of the counts that we could expect if these two variables were independent of one another. It's tempting to think that the counts in all cells should be equal, but keep in mind we have to respect the marginal distributions of both variables. For example, we need to be sure we're still expecting more Independents in our dataset than Republicans. Computing these expected counts while respecting the marginal distributions is a bit tedious, so we'll rely on R for this calculation. This is the appropriate table of expected counts if the variables were independent of one another; we'll call it "expected_counts". The question is, how can we summarize the difference between the expected table and the observed table in just a single number?
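To make "respecting the marginal distributions" concrete, here is a minimal sketch in Python using only the standard library (the course does this step in R). It uses the standard rule that, under independence, each expected count is the row total times the column total divided by the grand total. The numbers in `observed_counts` are hypothetical and are not the course data; the row and column layout is also an assumption.

```python
def expected_counts_from(observed):
    """Expected cell counts under independence, keeping both margins fixed."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Each expected count is (row total * column total) / grand total.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical table. Rows: Dem, Ind, Rep.
# Columns: too little, about right, too much.
observed_counts = [
    [8, 14, 8],
    [9, 26, 15],
    [3, 10, 7],
]

expected_counts = expected_counts_from(observed_counts)
for row in expected_counts:
    print(row)
# [6.0, 15.0, 9.0]
# [10.0, 25.0, 15.0]
# [4.0, 10.0, 6.0]
```

The expected cells are not all equal, yet every row and column of the expected table sums to the same total as in the observed table. In particular, the Independents row stays larger than the Republicans row.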

7. Choosing a statistic

One option would be to simply find the difference between the observed and expected counts in each cell and add them all up. That does result in a single number, but realize that the positive differences and the negative differences will cancel one another out, which isn't good. We can fix that by squaring each of those differences so that none of them is negative. That's a big improvement, but notice that the cells that have very large counts to begin with will dominate this sum. To put the cells on more even footing, we could divide each squared difference by the expected count before summing.
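The three candidate summaries can be compared side by side. This is a minimal sketch in Python using only the standard library, with hypothetical observed and expected tables (not the course data). The expected table was built from the margins of the observed one.

```python
# Hypothetical tables. Rows: Dem, Ind, Rep.
# Columns: too little, about right, too much.
observed = [
    [8, 14, 8],
    [9, 26, 15],
    [3, 10, 7],
]
expected = [
    [6.0, 15.0, 9.0],
    [10.0, 25.0, 15.0],
    [4.0, 10.0, 6.0],
]

# Pair up each observed cell with its expected cell.
cells = [
    (o, e)
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
]

# Option 1: raw differences cancel out, because both tables share the same margins.
sum_of_differences = sum(o - e for o, e in cells)

# Option 2: squared differences no longer cancel, but large cells dominate.
sum_of_squares = sum((o - e) ** 2 for o, e in cells)

# Option 3: scale each squared difference by its expected count.
scaled_sum = sum((o - e) ** 2 / e for o, e in cells)

print(sum_of_differences)  # 0.0
print(sum_of_squares)      # 10.0
print(round(scaled_sum, 2))
```

The first sum is zero no matter how far apart the tables are, which is why it cannot serve as a measure of distance.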

8. Chi-squared distance

This statistic that we've just formulated is called the chi-squared statistic. It captures the distance between a contingency table and the table you would expect if the variables were independent of one another. We found that the statistic for the relationship between party and natspac was 1.33. If we calculate the statistic for the relationship with natarms, we get a much greater distance: 18.97. With this statistic in hand, you can return to the hypothesis test to answer the question of whether either of these observed statistics, 1.33 or 18.97, is so great as to lead you to reject the null hypothesis that these spending priorities are independent of political party.
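Written as a formula, the statistic described here looks like the following. The notation is an assumption introduced for this note and does not come from the course slides: O_ij is the observed count in row i and column j, E_ij is the corresponding expected count, R_i and C_j are the row and column totals, and N is the grand total.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
```

A value of zero means the observed table matches the expected table exactly; larger values mean the observed table is further from what independence would predict.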

9. Let's practice!

You'll have a chance to answer that question in the following exercises.