Modeling survey data

1. Modeling with linear regression

It's time to study the regression line in more detail.

2. Regression line

In the last video, we added the regression line to our scatter plot, which highlights the positive linear relationship between age and head size. We can now use this line to predict the head size of a given baby. For example, what would we predict the head size of a 4-month-old baby to be?

3. Regression line

Based on our plot, we would predict that a 4-month-old baby would have a head size of around 42.5 centimeters. While these orange reference lines were helpful in making our prediction, it would be easier to use the regression equation directly. So what does the regression equation look like?

4. Regression equation

The linear regression equation is, well, a line: y-hat equals a plus b times x, where a is the intercept, b is the slope, and y-hat is the predicted value for y based on a given value of x. But how do we know what a and b equal? As we mentioned before, they are chosen so that the sum of the squared distances between each y and its y-hat, each scaled by its survey weight, is as small as possible.
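Written out, the line and the fitting criterion described here might look like the following. The index $i$ for the $i$-th sampled baby and the symbol $w_i$ for its survey weight are notation assumed for illustration; the video itself names only a, b, x, y, and y-hat.

$$\hat{y} = a + b\,x$$

$$(a,\, b) = \underset{a,\, b}{\arg\min} \; \sum_{i} w_i \,\bigl(y_i - (a + b\,x_i)\bigr)^2$$

Each squared distance $(y_i - \hat{y}_i)^2$ is multiplied by $w_i$, so babies with larger survey weights pull harder on the fitted line.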

5. Fitting regression model

Instead of using calculus to compute a and b by hand, R will find a and b for you via the svyglm() function. The glm in the name stands for generalized linear model. Here we specify the formula, y tilde x, and our survey design. The summary() command provides useful output. Specifically, in the coefficients table, we see that a equals 38.1 and b equals 1.07. In other words, we expect a newborn to have a head circumference of around 38.1 centimeters and, for each additional month, we expect the head circumference to increase by about 1 centimeter. The table also displays the standard errors associated with the coefficient estimates. The last two columns provide test statistics and p-values for hypothesis tests. But what hypotheses are actually being tested? To answer this question, we need to come back to the definition of the regression line.
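As a rough sketch of what such a call can look like, the snippet below fits a survey-weighted line with svyglm() from the survey package. The data are simulated purely for illustration, and the names toy, age_months, head_circ, wt, and toy_design are hypothetical; the course's actual data set, variable names, and survey design are not shown in this transcript and will differ (a real design would usually also specify strata and clusters).

```r
library(survey)

# Toy data, simulated for illustration only; NOT the course's survey data.
set.seed(1)
n <- 200
toy <- data.frame(age_months = runif(n, 0, 6))
toy$head_circ <- 38 + 1 * toy$age_months + rnorm(n, sd = 1.5)
toy$wt <- runif(n, 500, 1500)  # hypothetical survey weights

# Simplest possible design: weights only, no strata or clusters.
toy_design <- svydesign(ids = ~1, weights = ~wt, data = toy)

# y ~ x formula plus the design; the default family is gaussian (a linear model).
mod <- svyglm(head_circ ~ age_months, design = toy_design)
summary(mod)

# Use the fitted equation directly: y-hat = a + b * x for a 4-month-old.
a <- coef(mod)[["(Intercept)"]]
b <- coef(mod)[["age_months"]]
a + b * 4

# The same arithmetic with the estimates quoted in the video (a = 38.1, b = 1.07):
38.1 + 1.07 * 4  # 42.38, close to the 42.5 read off the plot
```

The last line shows why the equation is handier than the reference lines: the prediction is one multiplication and one addition.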

6. Linear regression inference

Our regression line, y-hat equals a plus b times x, is actually the estimated regression line based on the sample. We are estimating the true linear relationship between x and the average (or expected) value of Y, denoted by capital E of Y. Therefore a estimates the true intercept, denoted by capital A, and b estimates the true slope, denoted by capital B. Additionally, we assume the standard deviation of Y equals sigma. This is a measure of how much points will vary around the regression line. A common inferential question we might ask is, "Are our two variables linearly related?" This is the same as asking if B is non-zero. Let's translate that question into null and alternative hypotheses.
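In symbols, the two lines described here can be set side by side. This only restates the narration; no notation beyond what it names is introduced.

$$E(Y) = A + B\,x \qquad \text{(true line, unknown)}$$

$$\hat{y} = a + b\,x \qquad \text{(estimated line, computed from the sample)}$$

If $B = 0$, the true line is flat: the expected value of $Y$ is $A$ for every $x$, so $x$ carries no linear information about $Y$.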

7. Linear regression inference

The null hypothesis is that there isn't a linear relationship, so the slope B equals zero. The alternative is that there is a relationship, so the slope is non-zero. And how does this relate to the summary of our model? Well, lucky for us, the test statistic and p-value in the second row of the coefficients table correspond to exactly these hypotheses. The test statistic, which is the estimated slope over its standard error, again follows a t-distribution. Here, with a test statistic of 18.1 and a p-value of essentially zero, we have strong evidence of a linear trend between age and head size.
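To make the link between the hypotheses and the summary table concrete, the sketch below recomputes the slope's test statistic by hand and compares it with the value in the table. It uses simulated toy data with hypothetical names (toy, x, y, wt, toy_design), not the course's data, so the numbers will not match the 18.1 quoted in the video.

```r
library(survey)

# Toy data, simulated for illustration only; NOT the course's survey data.
set.seed(42)
n <- 300
toy <- data.frame(x = runif(n, 0, 6))
toy$y <- 38 + 1 * toy$x + rnorm(n, sd = 1.5)
toy$wt <- runif(n, 500, 1500)  # hypothetical survey weights

toy_design <- svydesign(ids = ~1, weights = ~wt, data = toy)
mod <- svyglm(y ~ x, design = toy_design)

# Coefficients table: intercept in row 1, slope in row 2.
# Columns: estimate, standard error, test statistic, p-value.
coefs <- summary(mod)$coefficients

# Test statistic for H0: B = 0 is the estimated slope over its standard error.
t_by_hand <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

t_by_hand
coefs["x", 3]  # test statistic from the table; should match t_by_hand
coefs["x", 4]  # p-value for the two-sided test of B = 0
```

A large test statistic paired with a tiny p-value, as in the video's output, is what leads us to reject the null hypothesis of a zero slope.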

8. Let's practice!

Are you ready to build some regression models?

Analyzing Survey Data in R