Variability in regression lines

1. Welcome to the course!

Hi, welcome to Inference for Linear Regression. I'm Jo Hardin, a professor of math and statistics at Pomona College, and I'll be your instructor for this course. I'm assuming that you've already worked through the first few courses in this intro stats series.

2. In this course you will

In this course, you will build on your previous work to make inferential (rather than descriptive) claims based on linear models. In particular, we will use the least squares regression line to test whether there is a linear relationship between two continuous variables. We will also construct confidence intervals that estimate the slope of the regression line.

3. Fat & calories: data

We will be working primarily with two continuous variables. Consider the scatterplot here: fat and calories are plotted for a handful of items on the Starbucks menu.

4. Fat & calories: linear model

As in previous courses, we fit the least squares regression line to the sample of observations. It seems like fat and calories have a reasonably strong positive linear association.

5. Fat & calories: sample 1 (n=20)

A subset of 20 items shows a similar positive trend between fat and calories, despite having fewer observations on the plot.

6. Fat & calories: sample 2 (n=20)

Indeed, a second sample of size 20 also shows a positive linear trend.

7. Fat & calories: samples 1 and 2

When the two samples are plotted on the same figure, we see that the least squares regression lines are not identical.
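To make the comparison of two samples concrete, here is a minimal sketch in R. It assumes the `starbucks` data frame from the `openintro` package (with columns `fat` and `calories`) is a reasonable stand-in for the menu data on the slides; the transcript does not name the data source, and the seed and object names are arbitrary.

```r
# Assumption: `starbucks` from the openintro package stands in for the
# menu data shown in the video.
library(openintro)

set.seed(4747)  # arbitrary seed, only so the sketch is reproducible

# Draw two separate samples of 20 menu items each
sample1 <- starbucks[sample(nrow(starbucks), 20), ]
sample2 <- starbucks[sample(nrow(starbucks), 20), ]

# Fit the least squares line calories ~ fat in each sample
fit1 <- lm(calories ~ fat, data = sample1)
fit2 <- lm(calories ~ fat, data = sample2)

# Intercepts and slopes side by side: the rows will differ
rbind(sample1 = coef(fit1), sample2 = coef(fit2))
```

The two rows of coefficients should be close but not equal, which is the sample-to-sample variability in the line.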

8. Sampling variability

That is, there is variability in the regression line from sample to sample. The concept of sampling variability is something you've seen before, but in this chapter, you will focus on the variability of the line instead of the variability of a single statistic.

9. Fat & calories: many samples

Indeed, when we take repeated samples of size 20 (here we took 50 different samples), every single line is different. Notice that the `rep_sample_n` command lets us take many samples of size 20, and ggplot fits the linear model separately for each of those samples.
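The slide code is not part of this transcript, so the following is only a sketch of what such a plot might look like. It assumes `rep_sample_n` comes from the `infer` package and that the data are the `starbucks` data frame from `openintro`; the object names are made up for illustration.

```r
library(openintro)  # assumed source of the `starbucks` data
library(infer)      # provides rep_sample_n()
library(ggplot2)

set.seed(4747)  # arbitrary seed for reproducibility

# 50 samples of 20 items each; rep_sample_n() adds a `replicate` column
many_samples <- rep_sample_n(starbucks, size = 20, reps = 50)

# The `group` aesthetic makes geom_smooth() fit one line per sample
many_lines_plot <- ggplot(many_samples,
                          aes(x = fat, y = calories, group = replicate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

many_lines_plot
```

Setting `se = FALSE` drops the confidence bands so that the 50 lines themselves are easy to compare.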

10. Fat & calories: sampling distribution of slopes

We can characterize the sampling distribution of the slopes by making a density plot (a smoothed histogram) showing the variability associated with the different slopes.

11. Interpret the density plot

The R code uses the `tidy` function in the `broom` package to pull out the slope coefficient for each of the separate models. Using ggplot, we can plot the different slope estimates. We can see that the slopes vary from about 8 to about 17. In no sample did the slope come anywhere close to zero. That is, there is a lot of evidence that the relationship between fat and calories is positive.
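As a rough sketch of that step (not necessarily the code shown on the slide), one way to collect the slopes is to fit a model per sample and keep only the `fat` row of each tidied result. The use of `group_modify` from `dplyr`, the `starbucks` data from `openintro`, and the object names are assumptions made for this example.

```r
library(openintro)  # assumed source of the `starbucks` data
library(infer)      # rep_sample_n()
library(dplyr)
library(broom)      # tidy()
library(ggplot2)

set.seed(4747)  # arbitrary seed for reproducibility

many_samples <- rep_sample_n(starbucks, size = 20, reps = 50)

# Fit calories ~ fat within each sample, tidy the coefficients,
# and keep only the slope (the row whose term is "fat")
slopes <- many_samples %>%
  group_by(replicate) %>%
  group_modify(~ tidy(lm(calories ~ fat, data = .x))) %>%
  ungroup() %>%
  filter(term == "fat")

# Density plot of the 50 slope estimates
ggplot(slopes, aes(x = estimate)) +
  geom_density()

# Smallest and largest slope across the samples
range(slopes$estimate)
```

The exact range depends on which samples are drawn, so it will not match the slide unless the same samples are used.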

12. Fat & carbohydrates: many samples

The same analysis can be done with different variables. Here, consider fat and carbohydrates. Again, for each of 50 samples of size 20, a different regression line (and different slope) is calculated.

13. Fat & carbohydrates: sampling distribution of slopes

Unlike the relationship between fat & calories, however, some of the sample slopes describing fat & carbohydrates were close to zero (or even negative!).

14. Interpreting the density plot

Although the relationship is possibly positive, the high sampling variability means we are unable to draw conclusions about the relationship between fat and carbohydrates.
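One simple way to put a number on "close to zero or even negative" is to count how many of the sample slopes fall at or below zero. The sketch below does this in base R for both pairs of variables. It assumes the `starbucks` data from `openintro`, treats `carb` as the response and `fat` as the explanatory variable (the transcript does not say which is which), and uses a hypothetical helper called `sample_slope`.

```r
library(openintro)  # assumed source of the `starbucks` data

set.seed(4747)  # arbitrary seed for reproducibility

# Hypothetical helper: slope of `formula` fit to one random sample of n rows
sample_slope <- function(formula, data, n = 20) {
  rows <- sample(nrow(data), n)
  unname(coef(lm(formula, data = data[rows, ]))[2])
}

# 50 sample slopes for each response, both with fat as the explanatory variable
calorie_slopes <- replicate(50, sample_slope(calories ~ fat, starbucks))
carb_slopes    <- replicate(50, sample_slope(carb ~ fat, starbucks))

# Share of sample slopes at or below zero for each response
c(calories = mean(calorie_slopes <= 0), carbs = mean(carb_slopes <= 0))
```

A share near zero for calories and a noticeably larger share for carbohydrates would line up with what the two density plots show.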

15. Let's practice!

Thanks for following along with this video. Now it is your turn to practice: first you'll review how to fit a linear model and tidy its output using the `broom` package, then you'll investigate how the lines change from sample to sample. Have fun!
