1. Model Errors and Randomness
We've seen linear model parameters as distributions, spread about some central peak.
Now we'll relate the parameter distributions to the "standard error" of linear model parameters, and check whether our parameter estimates are affected by randomness.
2. Types of Errors
We start by considering three types of errors common to sampling and measurement.
First, measurement error: mistakes made when collecting or recording the data.
For example, if we had a broken sensor, or wrote down measured values wrong.
Second, sampling bias: drawing samples from one small portion of the population that is not representative of the rest.
For example, drawing temperatures only from August, when the days are hottest.
Third, variation due to random chance: for example, how do we know that the mean slope from the model fit is not due to just random fluctuations or "noise"?
3. Null Hypothesis
This last question can be restated:
"Is our effect due a relationship or due to random chance?"
Does the ordering or grouping of the data cause an effect larger than what could be produced by randomly shuffled data?
The rest of this lesson will focus on answering this question with the Null Hypothesis.
4. Ordered Data
To demonstrate, let's return to the hiking data, here plotted as total trip distance versus trip duration. We can see a linear relationship between distance and time.
5. Grouping Data
To simplify the demonstration, we group the data into short and long duration trips, shown here as red and blue.
6. Grouping Data
Now we discard the time values, and plot the histograms of distances in each group.
The short duration group has a mean trip distance of about 5 miles, and the long duration group, a mean of about 15.
The difference between any two points, one drawn from each group, is therefore, on average, about 10 miles: the difference of the group means, 15 minus 5.
7. Test Statistic
Let's see it in code.
Here we separate 1000 hiking trips into 2 groups, based on whether the trip duration was less than or greater than 5 hours.
As a test statistic, we compute the difference in distance between two randomly chosen points, one from each group, and repeat 500 times.
We then take the mean of these differences as our "effect size": a measure of how an increase in duration drives a change in distance.
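A minimal sketch of this computation, using synthetic stand-in arrays `durations` and `distances` (the course's actual data and variable names will differ):

```python
import numpy as np

np.random.seed(0)

# Synthetic stand-in data: 1000 trips, with durations in hours and
# distances in miles growing roughly linearly with duration.
durations = np.random.uniform(1, 10, size=1000)
distances = 2.0 * durations + np.random.normal(0, 2, size=1000)

# Split distances into short (< 5 hours) and long (>= 5 hours) groups.
group_short = distances[durations < 5]
group_long = distances[durations >= 5]

# Test statistic: distance difference between one random point drawn
# from each group, repeated 500 times.
test_statistic = np.array([np.random.choice(group_long) -
                           np.random.choice(group_short)
                           for _ in range(500)])

# Effect size: the mean of the test statistic distribution.
effect_size = np.mean(test_statistic)
```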
8. Shuffle and Regrouping
Now we shuffle the data, removing any sense of time grouping.
9. Shuffling and Regrouping
The two distributions of shuffled group distances are entirely overlapped. We can already see that the same test will likely average to zero.
10. Shuffle and Split
To do it in code, we use `np.concatenate()` and `np.random.shuffle()` to recombine and shuffle away the time-grouping.
Next, we use `slice_index` to slice the shuffled data into two halves.
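A rough sketch of this step, continuing the code above (taking `slice_index` to be the size of the first group is an assumption for illustration):

```python
# Recombine the two groups and shuffle in place, erasing the time grouping.
both = np.concatenate((group_short, group_long))
np.random.shuffle(both)

# Slice the shuffled data into two arbitrary halves of the original sizes.
slice_index = len(group_short)
shuffled_half1 = both[:slice_index]
shuffled_half2 = both[slice_index:]
```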
11. Resample and Test Again
Finally, we resample the arbitrary halves of the shuffled data, and recompute the test statistic and effect size.
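Continuing the sketch, the resampling mirrors the earlier test statistic, now applied to the shuffled halves:

```python
# Recompute the test statistic on the shuffled halves; with the time
# grouping destroyed, these differences should center near zero.
shuffled_statistic = np.array([np.random.choice(shuffled_half1) -
                               np.random.choice(shuffled_half2)
                               for _ in range(500)])
shuffled_effect_size = np.mean(shuffled_statistic)
```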
12. p-Value
To visualize what change, if any, came from shuffling, we plot the two test statistic distributions.
Note that if we divide these distance differences by the time differences, the test statistic becomes the speed, an estimate for the slope parameter of the linear model.
The distribution found from the UNSHUFFLED or "ordered" groups, is in red: its mean is 10, seen as a black vertical line.
The distribution found from the SHUFFLED data is in blue: its mean is ZERO.
The measure of how often a value over 10 can be generated by RANDOMNESS alone is the fraction of the shuffled distribution located to the right of 10. This fraction is called the "p-value"; here it is 0.12, meaning there is a 12 percent chance of getting a speed of 10 or more from random chance alone.
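In code, the p-value is simply the fraction of shuffled test statistics at least as large as the observed effect size, a one-liner given the arrays sketched above:

```python
# Fraction of the shuffled distribution lying to the right of the
# observed effect size: the p-value.
p_value = np.mean(shuffled_statistic >= effect_size)
```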
13. Let's practice!
Now it's your turn to compute test statistics in the context of a linear relationship.