1. Parts of a regression
Before we dive into hierarchical models, we'll look at the basics of regression: slopes and intercepts.
I'm reviewing them so that we use the same terms; if these are new, I suggest the DataCamp courses on regression modeling.
I use the terms linear regression and linear model interchangeably.
2. An intercept
In the most basic model, we estimate a single mean across all groups.
The name for this term, beta, is the intercept.
We also include an error term, epsilon.
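As a minimal sketch (assuming a data frame called myData with a response column y, both placeholder names), this intercept-only model can be fit in R with a formula that contains only a 1; its single estimated coefficient is the overall mean:

    # Intercept-only model: the estimated intercept equals the overall mean of y
    fit <- lm(y ~ 1, data = myData)
    coef(fit)   # same value as mean(myData$y)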
3. Multiple intercepts
For models with multiple intercepts, we can structure the model in two ways.
First, we can use a global intercept beta-naught and then model the effects of groups 2 and 3 compared to the global intercept.
Second, we can model each group as its own intercept.
Each modeling approach can be helpful depending upon the situation.
If we are interested in estimating how two different treatments differ from a reference, we use the first approach.
Conversely, if we are interested in estimating the mean of each group, we use the second approach.
Also, this model is essentially an Analysis of Variance (ANOVA).
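Here is a sketch of the two parameterizations, assuming myData has a response column y and a grouping column group stored as a factor (both placeholder names):

    # Approach 1: global intercept plus effects of groups 2 and 3 relative to it
    lm(y ~ group, data = myData)

    # Approach 2: drop the global intercept so each group gets its own intercept (mean)
    lm(y ~ group - 1, data = myData)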
4. Linear models in R
In R, the base function lm() builds linear models.
The syntax includes a formula and data.
This linear model could be described as "y is predicted by x, using the data, myData."
Also, to run an ANOVA, we can wrap the linear model's output in anova().
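For example, the call just described might look like this (myData and the column names are placeholders):

    # "y is predicted by x, using the data, myData"
    model <- lm(y ~ x, data = myData)

    # Wrap the fitted model in anova() to get the ANOVA table
    anova(model)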
5. A simple linear regression with slopes
You've now seen how intercepts can be used to build a linear model for discrete predictor variables.
Now, we're going to look at using continuous predictor variables: slopes.
The most basic regression includes a response variable, y, an intercept, beta-naught, a slope beta-one, and an error term.
The slope describes how the expected value of the response changes as the continuous predictor changes.
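Written out, this model takes the standard textbook form (my notation, not quoted from the slide):

    y_i = \beta_0 + \beta_1 x_i + \epsilon_i

where beta-naught is the intercept, beta-one is the slope, and epsilon is the error term.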
6. Multiple regression
The simple linear regression expands to include multiple predictor variables.
These include discrete predictors with corresponding intercepts as well as continuous predictors with corresponding slopes.
The DataCamp course on Multiple and Logistic Regression covers these in greater detail.
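As a sketch (the data frame and column names are hypothetical), a multiple regression combining a discrete predictor and a continuous predictor might look like:

    # One intercept per level of the factor 'group' plus a slope for x
    lm(y ~ group + x, data = myData)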
7. Multiple regression caveats
Multiple regression is a powerful tool, but has limitations.
First, predictor variables that are not independent of each other can change the estimates for one another.
Small departures from independence cause quantitative changes in the estimates when parameters are added to or dropped from the model.
Large departures can cause qualitative and statistical changes to the model.
Second, this dependence is why we say estimates have been "corrected for" the other variables (see the simulated sketch after this list).
Third, Simpson's paradox warns us that omitting an important predictor can make our model's conclusions wrong or even reverse them.
Fourth, multiple regression assumes linearity.
Fifth, we may need to consider interactions, which occur when groups have different slopes.
We'll look at interactions in the next exercise.
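To illustrate the first caveat, here is a small simulated sketch (hypothetical data, not from the course) showing how a correlated predictor shifts another predictor's estimate:

    set.seed(42)
    x1 <- rnorm(100)
    x2 <- x1 + rnorm(100, sd = 0.5)   # x2 is correlated with x1
    y  <- 2 * x1 + 3 * x2 + rnorm(100)

    coef(lm(y ~ x1))        # x1's slope absorbs part of x2's effect
    coef(lm(y ~ x1 + x2))   # estimates are now corrected for each other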
8. Multiple regression in R tips
Using linear models in R is relatively straightforward, but has some quirks.
First, a minus 1 in the formula is needed to estimate an intercept for each group rather than estimating the other groups relative to the first group's intercept.
Second, if a predictor is numeric, R treats it as a slope.
Fix this by converting the predictor to a factor when you want group intercepts instead.
Third, predictors may need scaling.
For example, if one predictor is measured in centimeters and another spans values thousands of times larger, expressing the second predictor in a larger unit such as kilometers would be more appropriate.
Also, when estimating a slope over time, you may want to rescale time so that it starts at zero.
Last, a shortcut for an interaction is x1*x2 rather than writing out the longer plus-and-colon syntax, x1 + x2 + x1:x2 (see the sketch below).
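As a hedged example of these quirks (myData and its column names are placeholders):

    # One intercept per group instead of effects relative to the first group
    lm(y ~ group - 1, data = myData)

    # Convert a numeric predictor to a factor if you want group intercepts
    lm(y ~ factor(group_id), data = myData)

    # Rescale a predictor, e.g., make time start at zero
    myData$time0 <- myData$time - min(myData$time)

    # Interaction shortcut: x1*x2 expands to x1 + x2 + x1:x2
    lm(y ~ x1 * x2, data = myData)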
9. Refresher of running and plotting a linear regression in R
In this example, we ran a linear model.
Then, we extracted the coefficients using the tidy() function from the broom package.
Last, we added the fitted line to a ggplot().
Notice how we extracted the intercept and slope and used the geom_abline() function.
I did this rather than adding a linear model with geom_smooth() because this approach extends well to complicated models.
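A sketch of that workflow, assuming myData has columns named x and y (placeholder names), might look like this:

    library(broom)
    library(ggplot2)

    # Fit the model and extract its coefficients with broom's tidy()
    model <- lm(y ~ x, data = myData)
    coefs <- tidy(model)

    # Pull out the intercept and slope, then draw the fitted line
    ggplot(myData, aes(x = x, y = y)) +
        geom_point() +
        geom_abline(intercept = coefs$estimate[1], slope = coefs$estimate[2])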
10. Let's practice!
Now, it's your turn for regressions!