Goodness of fit
The distribution of first digit counts in the Iran data captures many of the main features of Benford's Law: 1 is the most common digit, followed by 2, and the counts generally continue to decay as the digits increase. You'll note, though, that it's not a perfect fit. There sure are a lot of twos, and the number of sevens is actually greater than the number of sixes. Essentially, we're left wondering if these deviations from Benford's Law are just due to random chance, or if they indicate that Benford's Law is in fact not a good description of this voter data. That's a question of statistical significance, and to answer it, we first need to come up with a statistic: a measure of how far the vote distribution is from Benford's Law.
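Before defining that statistic, it helps to pin down exactly what Benford's Law predicts: the proportion of numbers whose leading digit is d is log10(1 + 1/d). A quick sketch in base R:

```r
# Expected proportion of each leading digit under Benford's Law:
# P(d) = log10(1 + 1/d), for d = 1, ..., 9
benford_probs <- log10(1 + 1 / (1:9))
names(benford_probs) <- 1:9
round(benford_probs, 3)
#>     1     2     3     4     5     6     7     8     9
#> 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
```

These nine proportions sum to one and decay from about 30 percent for a leading 1 down to under 5 percent for a leading 9, which is the pattern described above.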
In fact, we don't need to look far: the chi-squared distance that we used to measure independence in the last chapter can also be used to measure goodness of fit. Let's review how we used that statistic to measure the distance between these two visualizations: the relationship between political party and space funding that we actually observed (on the left) and the distribution we'd expect if they were independent of one another (on the right). This calculation goes cell by cell, so let's start up in this corner and take the difference between the observed count, O, and the expected count over here, E. To be sure this difference is positive, we square it, then we divide it by E to scale it by the size of this cell. The resulting number is this cell's contribution to the chi-squared statistic. We then move to the next cell and do the same thing: take the squared and scaled difference. We continue this routine through all 9 cells, then add them up: that is the chi-squared distance.
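Written as code, the whole routine is just a few lines. Here's a minimal sketch with a hypothetical 3-by-3 table of counts (the numbers are made up for illustration, not the counts from the slide):

```r
# Hypothetical observed counts: party (rows) by space-funding opinion (columns)
observed <- matrix(c(120,  90,  60,
                      80, 100,  70,
                      50,  60, 110),
                   nrow = 3, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Each cell contributes (O - E)^2 / E; summing over all 9 cells
# gives the chi-squared distance
sum((observed - expected)^2 / expected)
```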
For these distributions, that distance was 1-point-32. Let's see how this applies in the goodness of fit situation.
Here each cell is one of the digit categories, but the rest of the calculation is the same. We calculate the squared, scaled difference between the observed and expected counts of leading ones, add to that the same quantity for leading twos, and so on all the way up to leading nines; adding them together gives us our chi-squared distance.
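As a sketch, using the Benford proportions computed earlier and a hypothetical vector of leading-digit counts (again, made-up numbers rather than the actual Iran data):

```r
# Hypothetical counts of leading digits 1 through 9 in a set of vote totals
observed <- c(60, 42, 28, 19, 16, 12, 14, 10, 9)

# Expected counts if the leading digits followed Benford's Law exactly
expected <- sum(observed) * log10(1 + 1 / (1:9))

# The same squared, scaled differences, summed over the 9 digit cells
sum((observed - expected)^2 / expected)
```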
Let's walk through an example with the GSS data of using the chi-squared statistic to assess goodness of fit. Here's the distribution of the party variable, a simple bar chart with three categories. Say we'd like to assess the distance between this distribution and the uniform distribution, where each party is the same size. I've added a gold line to indicate what the bar heights would be if this were uniform. The first step in computing the chi-squared statistic is making a table of the counts in the bar chart. Once we have that table of counts, we need to describe the uniform distribution as a named vector of three probabilities, each one being 1/3. With those two components, we can use the built-in chisq-dot-test function to calculate the chi-squared distance, which we find is about 15-point-8 (the steps are sketched below). The next question is: is this a big distance? That's a question for a hypothesis test, which at this point you have a lot of experience with.
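A minimal sketch of those steps, assuming a gss2016 data frame with a party column whose levels are "D", "I", and "R" (the exact level names are an assumption here):

```r
# Step 1: a table of counts from the bar chart's variable
party_counts <- table(gss2016$party)

# Step 2: the uniform distribution as a named vector of probabilities
p_uniform <- c("D" = 1/3, "I" = 1/3, "R" = 1/3)

# Step 3: the built-in chi-squared goodness-of-fit test; the
# $statistic component is the chi-squared distance
chisq.test(party_counts, p = p_uniform)$statistic
```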
You can get a sense for the sort of data you would see under the null hypothesis that the distribution is uniform by simulating a single sample. Let's walk through how this code works. You take the gss2016 dataset, specify that the variable you're interested in is party, then hypothesize a null determined by the vector of probabilities corresponding to the uniform distribution. Since you're giving specific parameter values, this is called a "point" null hypothesis. That allows us to generate a single dataset through simulation: essentially flipping many three-sided coins where each face is "D", "I", or "R". This yields a new dataset under the null hypothesis. I'd like to visualize this data, so I'll go ahead and save it to sim_1, then construct a bar plot.
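In code, that pipeline looks something like the following sketch, using the infer package's verbs (the generate() type and the party level names are assumptions on my part):

```r
library(infer)    # provides specify(), hypothesize(), generate(), and %>%
library(ggplot2)

# One simulated dataset under the point null of a uniform distribution
sim_1 <- gss2016 %>%
  specify(response = party) %>%                  # the variable of interest
  hypothesize(null = "point",                    # a "point" null hypothesis
              p = c("D" = 1/3, "I" = 1/3, "R" = 1/3)) %>%
  generate(reps = 1, type = "simulate")          # flip the three-sided coins

# Bar plot of the simulated party distribution
ggplot(sim_1, aes(x = party)) +
  geom_bar()
```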
This one looks much closer to the uniform distribution, and we'd expect that the chi-squared distance is quite a bit less than 15-point-8. What we need to do next is repeat this many times to build up a null distribution of chi-squared statistics.
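That repetition is a small change to the same pipeline; a sketch (the number of reps, 500, is an arbitrary choice here):

```r
# Many simulated datasets, each reduced to its chi-squared statistic
null_dist <- gss2016 %>%
  specify(response = party) %>%
  hypothesize(null = "point",
              p = c("D" = 1/3, "I" = 1/3, "R" = 1/3)) %>%
  generate(reps = 500, type = "simulate") %>%
  calculate(stat = "Chisq")        # one statistic per simulated dataset

# See where the observed distance of about 15.8 falls in this distribution
visualize(null_dist)
```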
So let's jump into the exercises and port this idea of a goodness of fit hypothesis test using the chi-squared statistic back to your case study of evaluating Benford's Law's ability to describe voting data.