Goodness of fit
The distribution of first digit counts in the Iran data captures many of the main features of Benford's Law: 1 is the most common digit, followed by 2, and the counts generally continue to decay as the digits increase. You'll note, though, that it's not a perfect fit. There sure are a lot of twos, and the number of sevens is actually greater than the number of sixes. Essentially, we're left wondering if these deviations from Benford's Law are just due to random chance, or if they indicate that Benford's Law is in fact not a good description of this voter data. That's a question of statistical significance, and to answer it, we first need to come up with a statistic: a measure of how far the vote distribution is from Benford's Law.
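Before defining that statistic, it helps to pin down exactly what Benford's Law predicts: the proportion of numbers whose leading digit is d is log10(1 + 1/d). A quick sketch in base R:

```r
# Expected proportion of each leading digit under Benford's Law:
# P(d) = log10(1 + 1/d), for d = 1, ..., 9
benford_probs <- log10(1 + 1 / (1:9))
names(benford_probs) <- 1:9
round(benford_probs, 3)
#>     1     2     3     4     5     6     7     8     9
#> 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
```

These nine proportions sum to one and decay from about 30 percent for a leading 1 down to under 5 percent for a leading 9, which is the pattern described above.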
In fact, we don't need to look far: the chi-squared distance that we used to measure independence in the last chapter can also be used to measure goodness of fit. Let's review how we used that statistic to measure the distance between these two visualizations: the relationship between political party and space funding that we actually observed (on the left) and the distribution we'd expect if they were independent of one another (on the right). This calculation goes cell by cell, so let's start up in this corner and take the difference between the observed count, O, and the expected count over here, E. To be sure this difference is positive, we square it, then we divide it by E to scale it by the size of this cell. The resulting number is this cell's contribution to the chi-squared statistic. We then move to the next cell and do the same thing: take the squared and scaled difference. We continue this routine through all 9 cells, then add them up: that is the chi-squared distance.
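Written as code, the whole routine is just a few lines. Here's a minimal sketch with a hypothetical 3-by-3 table of counts (the numbers are made up for illustration, not the counts from the slide):

```r
# Hypothetical observed counts: party (rows) by space-funding opinion (columns)
observed <- matrix(c(120,  90,  60,
                      80, 100,  70,
                      50,  60, 110),
                   nrow = 3, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Each cell contributes (O - E)^2 / E; summing over all 9 cells
# gives the chi-squared distance
sum((observed - expected)^2 / expected)
```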
For these distributions, that distance was 1-point-32. Let's see how this applies in the goodness of fit situation.
Here each cell is one of the digit categories, but the rest of the calculation is the same. We calculate the squared, scaled difference between the observed and expected counts of leading ones, add to that the same quantity for leading twos, and so on all the way up to leading nines; adding them together gives us our chi-squared distance.
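As a sketch, using the Benford proportions computed earlier and a hypothetical vector of leading-digit counts (again, made-up numbers rather than the actual Iran data):

```r
# Hypothetical counts of leading digits 1 through 9 in a set of vote totals
observed <- c(60, 42, 28, 19, 16, 12, 14, 10, 9)

# Expected counts if the leading digits followed Benford's Law exactly
expected <- sum(observed) * log10(1 + 1 / (1:9))

# The same squared, scaled differences, summed over the 9 digit cells
sum((observed - expected)^2 / expected)
```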
Let's walk through an example with the GSS data of using the chi-squared statistic to assess goodness of fit. Here's the distribution of the party variable, a simple bar chart with three categories. Say we'd like to assess the distance between this distribution and the uniform distribution, where each party is the same size. I've added a gold line to indicate what the bar heights would be if this were uniform. The first step in computing the chi-squared statistic is making a table of the counts in the bar chart. Once we have that table of counts, we need to describe the uniform distribution as a named vector of three probabilities, each one being 1/3. With those two components, we can use the built-in chisq-dot-test function to calculate the chi-squared distance, which we find is about 15-point-8 (the steps are sketched below). The next question is: is this a big distance? That's a question for a hypothesis test, which at this point you have a lot of experience with.
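A minimal sketch of those steps, assuming a gss2016 data frame with a party column whose levels are "D", "I", and "R" (the exact level names are an assumption here):

```r
# Step 1: a table of counts from the bar chart's variable
party_counts <- table(gss2016$party)

# Step 2: the uniform distribution as a named vector of probabilities
p_uniform <- c("D" = 1/3, "I" = 1/3, "R" = 1/3)

# Step 3: the built-in chi-squared goodness-of-fit test; the
# $statistic component is the chi-squared distance
chisq.test(party_counts, p = p_uniform)$statistic
```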
You can get a sense for the sort of data you would see under the null hypothesis that the distribution is uniform by simulating a single sample. Let's walk through how this code works. You take the gss2016 dataset, specify that the variable you're interested in is party, then hypothesize a null determined by the vector of probabilities corresponding to the uniform distribution. Since you're giving specific parameter values, this is called a "point" null hypothesis. That allows us to generate a single dataset through simulation: essentially flipping many three-sided coins where each face is "D", "I", or "R". This yields a new dataset under the null hypothesis. I'd like to visualize this data, so I'll go ahead and save it to sim_1, then construct a bar plot.
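In code, that pipeline looks something like the following sketch, using the infer package's verbs (the generate() type and the party level names are assumptions on my part):

```r
library(infer)    # provides specify(), hypothesize(), generate(), and %>%
library(ggplot2)

# One simulated dataset under the point null of a uniform distribution
sim_1 <- gss2016 %>%
  specify(response = party) %>%                  # the variable of interest
  hypothesize(null = "point",                    # a "point" null hypothesis
              p = c("D" = 1/3, "I" = 1/3, "R" = 1/3)) %>%
  generate(reps = 1, type = "simulate")          # flip the three-sided coins

# Bar plot of the simulated party distribution
ggplot(sim_1, aes(x = party)) +
  geom_bar()
```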
This one looks much closer to the uniform distribution, and we'd expect that the chi-squared distance is quite a bit less than 15-point-8. What we need to do next is repeat this many times to build up a null distribution of chi-squared statistics.
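That repetition is a small change to the same pipeline; a sketch (the number of reps, 500, is an arbitrary choice here):

```r
# Many simulated datasets, each reduced to its chi-squared statistic
null_dist <- gss2016 %>%
  specify(response = party) %>%
  hypothesize(null = "point",
              p = c("D" = 1/3, "I" = 1/3, "R" = 1/3)) %>%
  generate(reps = 500, type = "simulate") %>%
  calculate(stat = "Chisq")        # one statistic per simulated dataset

# See where the observed distance of about 15.8 falls in this distribution
visualize(null_dist)
```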
So let's jump into the exercises and port this idea of a goodness of fit hypothesis test using the chi-squared statistic back to your case study of evaluating Benford's Law's ability to describe voting data.