
Explaining teaching score with age

1. Explaining teaching score with age

Equipped with the background and terminology from Chapter 1, you can begin formal modeling using "basic" linear regression, where you model an outcome variable y using a single explanatory/predictor variable x.

2. Refresher: Exploratory data visualization

Earlier, you explored the relationship between teaching score and age via a scatterplot and the correlation coefficient of -0.107, indicating a weakly negative relationship. You also saw that this scatterplot suffers from overplotting. Let's keep this overplotting in mind as we move forward. Now, can you visually summarize the above relationship with a "best fitting" line? A line that cuts through the cloud of points, separating the signal from the noise? Yes! Using a regression line!
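The correlation mentioned above can be recomputed directly. This is a minimal sketch, assuming the evals data frame from the moderndive package (with columns score and age), which is what this course uses:

```r
# Recompute the correlation between teaching score and age
# (assumes the evals data frame from the moderndive package)
library(moderndive)

cor(evals$score, evals$age)
# Per the lesson, this is roughly -0.107: a weakly negative relationship
```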

3. Regression line

Here is the ggplot2 code that produced the previous scatterplot. You can add a "best-fitting" line by adding a geom_smooth() layer, with method set to "lm" for linear model and se set to FALSE to omit standard error bars; SE bars are a concept for a more advanced course.
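A sketch of the code being described, again assuming the evals data frame from the moderndive package (score on the y-axis, age on the x-axis):

```r
# Scatterplot of teaching score over age, with a regression line added
library(ggplot2)
library(moderndive)  # provides the evals data (assumed here)

ggplot(evals, aes(x = age, y = score)) +
  geom_point() +
  # method = "lm" fits a linear model; se = FALSE omits standard error bars
  geom_smooth(method = "lm", se = FALSE)
```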

4. Regression line

Observe that the overall relationship is negative: as age increases, score decreases. This is consistent with our computed correlation coefficient of -0.107. Now, does that mean aging directly causes decreases in score? Not necessarily, as there may be other factors I'm not accounting for here. After all, correlation isn't necessarily causation. This "best-fitting" line is the "linear regression line", and is a "fitted linear model" f-hat. Let's draw connections with our earlier modeling theory.

5. Refresher: Modeling in general

Recall the general modeling framework. Using only y and x, you fit a model f-hat that hopefully closely approximates the true unknown f while ignoring the error. F-hat yields fitted/predicted values y-hat which hopefully closely match the observed y's. I'll define later what "closely match" means.

6. Modeling with basic linear regression

In linear regression, you assume f is a linear function, i.e. a line, defined by an intercept beta-0 and a slope for x beta-1. The observed value y thus has the following form. The fitted model f-hat is also assumed linear, but with a fitted, or estimated, intercept beta-0-hat and slope for x beta-1-hat. These values are computed using our observed data. Plugging x into f-hat yields fitted/predicted values y-hat. Note that there is no epsilon term here, as our fitted model f-hat should only capture signal and not noise.
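Written out, the two forms narrated above are (standard basic-regression notation, matching the beta-0, beta-1, and epsilon terms in the text):

```latex
% Observed values: the true linear model plus an error term
y = \beta_0 + \beta_1 \cdot x + \epsilon

% Fitted/predicted values: no epsilon, since f-hat captures only signal
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot x
```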

7. Back to regression line

The "best-fitting line" is thus beta-0-hat + beta-1-hat times x. But what are the numerical values of the fitted intercept and slope? I'll let R compute these for us.

8. Computing slope and intercept of regression line

You first fit an lm() linear model, using as arguments the data and a model formula of the form y tilde x, where y is the outcome and x is the explanatory variable. I'll save this in model_score_1 and display its contents. While the intercept of 4.461 has a mathematical interpretation (the value of y when x equals 0), it doesn't have a practical interpretation here: the teaching score when age is 0. The slope of -0.0059 quantifies the relationship between score and age. Its interpretation is rise-over-run: for every increase of one in age, there's an associated decrease of, on average, 0.0059 units in score. The negative slope emphasizes the negative relationship. However, this output is a bit sparse and not in dataframe format. Let's improve this.
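The fitting step described above looks like this, assuming the evals data frame from the moderndive package:

```r
# Fit a basic linear regression of teaching score on age
library(moderndive)  # provides the evals data (assumed here)

model_score_1 <- lm(score ~ age, data = evals)
model_score_1
# Per the lesson, the printed coefficients are an intercept of about
# 4.461 and an age slope of about -0.0059
```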

9. Computing slope and intercept of regression line

Let's apply the get_regression_table() function from the moderndive package to model_score_1. This produces what's known as a regression table. This function is an example of a wrapper function: it takes other existing functions and hides their internal workings, so that all you need to worry about are the input and output formats. The fitted intercept and slope are now in the second column, estimate. The additional columns, like std_error and p_value, all speak to the statistical significance of our results. However, I'll leave these concepts for a more advanced course on statistical inference.
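Putting the two steps together, a minimal sketch (again assuming the evals data from moderndive):

```r
# Produce a regression table in dataframe format
library(moderndive)  # provides evals and get_regression_table()

model_score_1 <- lm(score ~ age, data = evals)
# Wrapper function: returns the intercept and slope in an "estimate"
# column, alongside std_error, p_value, and other inference columns
get_regression_table(model_score_1)
```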

10. Let's practice!

Your turn! Instead of linearly modeling score as a function of age, you're going to linearly model teaching score as a function of beauty score!