Formulating and simulating a hypothesis

1. Formulating and simulating a hypothesis

When we studied linear regression, we assumed a linear model

2. 2008 US swing state election results

for how the data are generated and then estimated the parameters that are defined by that model. But how do we assess how reasonable it is that our observed data are actually described by the model? This is the realm of hypothesis testing. Let's start by thinking about a simpler scenario. Consider the following.

3. Insert title here...

Ohio and Pennsylvania are similar states. They are neighbors and they both have liberal urban counties and also lots of rural conservative counties. I hypothesize that county-level voting in these two states has identical probability distributions. We have voting data to help test this hypothesis. Stated more concretely,

4. Hypothesis testing

we are going to assess how reasonable the observed data are assuming the hypothesis is true. The hypothesis we are testing is

5. Null hypothesis

typically called the null hypothesis. We might start by just plotting the two ECDFs of

6. ECDFs of swing state election results

the county-level votes. Whew! It is pretty tough to make a judgment here. Pennsylvania seems to be slightly more toward Obama in the middle part of the ECDFs, but not much. We can't really draw a conclusion here.
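For readers who want to reproduce this kind of comparison, here is a minimal sketch of how the ECDF points could be computed. The helper name ecdf and the array names dem_share_PA and dem_share_OH are assumptions made for illustration, and the values are random stand-ins, not the real election data.

```python
import numpy as np


def ecdf(data):
    """Return the x and y values of the empirical CDF of a 1-D array."""
    n = len(data)
    x = np.sort(data)              # observed values, smallest to largest
    y = np.arange(1, n + 1) / n    # fraction of points less than or equal to each x
    return x, y


# Hypothetical stand-in data (NOT the actual county-level vote shares)
rng = np.random.default_rng(42)
dem_share_PA = rng.uniform(20, 80, size=67)   # 67 Pennsylvania counties
dem_share_OH = rng.uniform(20, 80, size=88)   # 88 Ohio counties

x_PA, y_PA = ecdf(dem_share_PA)
x_OH, y_OH = ecdf(dem_share_OH)
```

Each (x, y) pair can then be drawn as unconnected points with a plotting library of your choice, one color per state, to overlay the two ECDFs.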

7. Percent vote for Obama

We could just compare some summary statistics. Again, this is a tough call. The means and medians of the two states are really close, and the standard deviations are almost identical. So eyeballing the data is not enough. To resolve this issue,

8. Simulating the hypothesis

we can simulate what the data would look like if the county-level voting trends in the two states were identically distributed. We can do this by putting the Democratic share of the vote for all of Pennsylvania's 67 counties and Ohio's 88 counties together.

9. Simulating the hypothesis

We then ignore what state they belong to. Next, we randomly scramble

10. Simulating the hypothesis

the ordering of the counties.

11. Simulating the hypothesis

We then re-label the first 67 to be "Pennsylvania" and the remaining ones to be "Ohio." So, we just redid the election as if there was no difference between Pennsylvania and Ohio.

12. Permutation

This technique of scrambling the order of an array is called permutation. It is at the heart of simulating a null hypothesis where we assume two quantities are identically distributed.

13. Generating a permutation sample

Let's look at how we can implement this in Python. First, we need to make a single array with all of the counties in it. We do this using the np.concatenate function. Notice that this function takes a tuple of the arrays you wish to concatenate as an argument. Next, we use the function np.random.permutation to conveniently permute the entries of the array. We then assign the first 67 to be labeled Pennsylvania and the last 88 to be labeled Ohio. These samples are called permutation samples.
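The steps just described can be sketched as follows. The variable names and the stand-in data are assumptions for illustration only; the real analysis would use the actual county-level vote shares in place of the random arrays.

```python
import numpy as np

np.random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in data (NOT the actual county-level vote shares)
dem_share_PA = np.random.uniform(20, 80, size=67)   # 67 Pennsylvania counties
dem_share_OH = np.random.uniform(20, 80, size=88)   # 88 Ohio counties

# Pool all counties into one array; np.concatenate takes a tuple of arrays
dem_share_both = np.concatenate((dem_share_PA, dem_share_OH))

# Scramble the order of the pooled counties (returns a shuffled copy)
dem_share_perm = np.random.permutation(dem_share_both)

# Re-label: the first 67 become "Pennsylvania", the remaining 88 become "Ohio"
perm_sample_PA = dem_share_perm[:len(dem_share_PA)]
perm_sample_OH = dem_share_perm[len(dem_share_PA):]
```

Slicing by len(dem_share_PA) rather than a hard-coded 67 keeps the sketch usable for any pair of data sets. Each run of the last three steps yields one new permutation sample.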

14. Let's practice!

Now, let's practice doing some permutation sampling of real data!
