1. Mathematical approximation
In this chapter, we'll be using mathematical approximations to test and estimate the slope parameter. The approximations will build on the t-distribution, which you may have seen in previous courses. The mathematical model is often correct and is usually easy to implement computationally.
2. Placeholder
Recall the Starbucks data. In this chapter, we will start with a continued investigation into the linear relationship between Fat and Calories.
3. Sampling distribution of slope: good t fit
Using the bootstrap sampling procedure from the previous chapter, we can generate a histogram of bootstrapped slopes. The red line on top of the histogram gives the appropriate t-distribution with n-2 degrees of freedom. You can see that the mathematical model (that is, the t-distribution) is remarkably similar to the distribution produced by the computational methods described previously.
The R code uses the `dt` function to provide the density of the t-distribution. You won't typically use the `dt` function in your analyses, but you should know that R is using the t-distribution in the background to calculate the p-value *because* the t-distribution fits the sampling distribution so well.
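The overlay described above can be sketched in base R. This is only an illustration under stated assumptions: the `coffee` data frame is simulated as a stand-in for the Starbucks data, the numbers in it are made up, and the bootstrapped slopes are standardized so that they sit on the same scale as the t-distribution (the transcript does not say how the original plot was scaled).

```r
# Simulated stand-in for the Starbucks data (hypothetical values, not the real data)
set.seed(47)
n <- 100
fat <- runif(n, min = 0, max = 30)
calories <- 150 + 11 * fat + rnorm(n, sd = 60)
coffee <- data.frame(fat, calories)

# Slope estimated from the observed (here: simulated) data
obs_fit <- lm(calories ~ fat, data = coffee)
obs_slope <- coef(obs_fit)[["fat"]]

# Bootstrap: resample rows with replacement, refit, and standardize each slope
boot_t <- replicate(1000, {
  rows <- sample(n, replace = TRUE)
  est <- summary(lm(calories ~ fat, data = coffee[rows, ]))$coefficients
  (est["fat", "Estimate"] - obs_slope) / est["fat", "Std. Error"]
})

# Histogram on the density scale, with the t density (n - 2 df) drawn in red
hist(boot_t, breaks = 40, freq = FALSE,
     main = "Bootstrapped slopes (standardized)", xlab = "t statistic")
curve(dt(x, df = n - 2), add = TRUE, col = "red", lwd = 2)
```

If the red curve tracks the bars closely, including at the edges, the t-distribution is a reasonable description of the sampling distribution for these data.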
4. Sampling distribution of slope: good t fit
We can look at the histogram more closely, because our real interest is in the tails of the distribution. Remember, decisions to reject or not reject the null hypothesis are made by looking at the tails of the distribution.
5. Model fit
Here, we see that the mathematical model fits the computational model quite well, even in the tails. The good fit is seen by looking at the red line which is slightly above the histogram in some places and slightly below the histogram in other places.
6. Fiber & protein, a poor linear model
Unlike the relationship between fat and calories, the relationship between fiber and protein does not seem to follow an obvious linear model. Although the least squares fit produces a line with a positive slope, the points don't seem to indicate a strong positive relationship.
7. Sampling distribution of slope: poor t fit
We can repeat the bootstrap analysis to identify whether the mathematical model fits the sampling distribution of the slopes when dealing with variables that do not seem to show a strong linear trend. We will argue that although the red line is not extremely different from the histogram, it is different enough to indicate concerns with the mathematical model used to describe fiber and protein.
Note that we again use the `dt` function, but the fit is not as good as it was with fat and calories. R won't know when the fit is good and when it isn't, so R will always use the t-distribution to calculate p-values, even when it isn't a good idea.
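To see that the reported p-value always comes from the t-distribution, it can be reproduced by hand with `pt`, the cumulative counterpart of `dt`. The sketch below uses simulated, deliberately skewed data as a hypothetical stand-in for fiber and protein. The variable names and numbers are assumptions, not the real data set.

```r
# Hypothetical skewed data standing in for fiber and protein
set.seed(47)
n <- 100
fiber <- rexp(n, rate = 1 / 2)
protein <- 8 + 0.5 * fiber + rexp(n, rate = 1 / 6)

fit <- lm(protein ~ fiber)
est <- summary(fit)$coefficients

# p-value as reported in the regression output
t_stat <- est["fiber", "t value"]
p_reported <- est["fiber", "Pr(>|t|)"]

# The same two-sided p-value computed directly from the t-distribution
p_by_hand <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

all.equal(p_reported, p_by_hand)
```

The two values agree no matter how skewed the data are, which is the point: the output carries no warning when the t-distribution is a poor fit.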
8. Sampling distribution of slope: poor t fit
Indeed, we are again most concerned with the fit of the t-distribution to the histogram in the tails of the distribution.
In looking at the left tail, notice that the red line is above the histogram values. That means values that extreme occur less often than the t-distribution suggests if the null hypothesis is true. Indeed, the p-value from the mathematical model will overestimate the true p-value.
9. Sampling distribution of slope: poor t fit
Notice that the histogram values in the right tail are all above the red line. We are likely to call those values significant, even if the null hypothesis is true.
The p-value given by the mathematical model in the right tail will underestimate the true p-value.
The discrepancy from the mathematical model does not matter if the effect is extremely strong or if there is absolutely no effect. It matters when the data have low power, either because of a small sample size or a minimal effect size.
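The tail comparisons above can also be checked numerically rather than by eye. The following is a rough sketch under assumptions: the data are simulated and skewed (not the real fiber and protein values), and the slopes are standardized before being compared with the t-distribution. It compares the area the t-distribution puts in each tail with the share of bootstrapped statistics that actually land there.

```r
# Hypothetical small, skewed sample with no true relationship
set.seed(47)
n <- 50
fiber <- rexp(n, rate = 1 / 2)
protein <- 8 + rexp(n, rate = 1 / 6)
dat <- data.frame(fiber, protein)
obs_slope <- coef(lm(protein ~ fiber, data = dat))[["fiber"]]

# Bootstrapped, standardized slopes
boot_t <- replicate(2000, {
  rows <- sample(n, replace = TRUE)
  est <- summary(lm(protein ~ fiber, data = dat[rows, ]))$coefficients
  (est["fiber", "Estimate"] - obs_slope) / est["fiber", "Std. Error"]
})

# Cutoffs that leave 2.5% in each tail under the t-distribution with n - 2 df
cutoffs <- qt(c(0.025, 0.975), df = n - 2)

tail_areas <- c(
  left_model      = 0.025,
  left_bootstrap  = mean(boot_t <= cutoffs[1]),
  right_model     = 0.025,
  right_bootstrap = mean(boot_t >= cutoffs[2])
)
print(round(tail_areas, 3))
```

In a tail where the bootstrap share is smaller than 0.025, the mathematical model overestimates the p-value. Where it is larger, the model underestimates it.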
10. Let's practice!
In this video, we've talked about the mathematical model and how it can sometimes lead to inaccurate p-values. In the following chapter, we'll cover the technical conditions which ensure the mathematical model does fit. But for now, it's your turn to practice *using* the mathematical model.