Formulating and simulating a hypothesis
1. Formulating and simulating a hypothesis
When we studied linear regression, we assumed a linear model
2. 2008 US swing state election results
for how the data are generated and then estimated the parameters that are defined by that model. But how do we assess how reasonable it is that our observed data are actually described by the model? This is the realm of hypothesis testing. Let's start by thinking about a simpler scenario. Consider the following.
3. Insert title here...
Ohio and Pennsylvania are similar states. They are neighbors, and they both have liberal urban counties and also lots of rural conservative counties. I hypothesize that county-level voting in these two states has identical probability distributions. We have voting data to help test this hypothesis. Stated more concretely,
4. Hypothesis testing
we are going to assess how reasonable the observed data are assuming the hypothesis is true. The hypothesis we are testing is
5. Null hypothesis
typically called the null hypothesis. We might start by just plotting the two ECDFs of
6. ECDFs of swing state election results
the county-level votes. Whew! It is pretty tough to make a judgment here. Pennsylvania seems to be slightly more toward Obama in the middle part of the ECDFs, but not much. We can't really draw a conclusion here.
7. Percent vote for Obama
We could just compare some summary statistics. Again, this is a tough call. The means and medians of the two states are really close, and the standard deviations are almost identical. So eyeballing the data is not enough. To resolve this issue,
8. Simulating the hypothesis
we can simulate what the data would look like if the county-level voting trends in the two states were identically distributed. We can do this by putting the Democratic share of the vote for all of Pennsylvania's 67 counties and Ohio's 88 counties together.
9. Simulating the hypothesis
We then ignore what state they belong to. Next, we randomly scramble
10. Simulating the hypothesis
the ordering of the counties.
11. Simulating the hypothesis
We then re-label the first 67 to be "Pennsylvania" and the remaining ones to be "Ohio." So, we just redid the election as if there was no difference between Pennsylvania and Ohio.
12. Permutation
This technique of scrambling the order of an array is called a permutation. It is at the heart of simulating a null hypothesis where we assume two quantities are identically distributed.
13. Generating a permutation sample
Let's look at how we can implement this in Python. First, we need to make a single array with all of the counties in it. We do this using the np.concatenate function. Notice that this function takes a tuple of the arrays you wish to concatenate as an argument. Next, we use the function np.random.permutation to conveniently permute the entries of the array. We then assign the first 67 to be labeled Pennsylvania and the last 88 to be labeled Ohio. These samples are called permutation samples.
14. Let's practice!
Now, let's practice doing some permutation sampling of real data!
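Before practicing, the pooling, scrambling, and relabeling steps described above can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the function name permutation_sample and the array names dem_share_PA and dem_share_OH are assumptions made for this example, and the vote shares are randomly generated stand-ins rather than the real 2008 county results.

```python
import numpy as np


def permutation_sample(data_1, data_2):
    """Return one permutation sample drawn from two data sets.

    Hypothetical helper for illustration; the name is not from the transcript.
    """
    # Pool the two data sets; np.concatenate takes a tuple of arrays
    pooled = np.concatenate((data_1, data_2))

    # Randomly scramble the order of the pooled entries
    permuted = np.random.permutation(pooled)

    # Relabel: the first len(data_1) entries play the role of the first
    # data set, and the remaining entries play the role of the second
    return permuted[:len(data_1)], permuted[len(data_1):]


# Stand-in data (NOT real election results): 67 "Pennsylvania" counties
# and 88 "Ohio" counties with made-up Democratic vote shares in percent
np.random.seed(42)
dem_share_PA = np.random.uniform(20, 80, size=67)
dem_share_OH = np.random.uniform(20, 80, size=88)

perm_sample_PA, perm_sample_OH = permutation_sample(dem_share_PA, dem_share_OH)

print(len(perm_sample_PA), len(perm_sample_OH))
```

Each call produces a new pair of permutation samples, so the function can be called repeatedly to simulate the election many times under the null hypothesis that the two states are identically distributed.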