1. Parallel slopes linear regression
Hi, I'm Richie. Welcome!
2. The previous course
This course builds on the skills from the previous course.
3. From simple regression to multiple regression
In that course, you performed linear and logistic regression with a single explanatory variable. This time, you'll learn to fit models that include multiple explanatory variables. This is sometimes called "multiple regression".
Including more explanatory variables in the model often gives you more insight into the relationship between the explanatory variables and the response, and can provide more accurate predictions. It's an important step towards mastering regression.
4. The course contents
Here's the plan. In Chapter 1, you'll explore parallel slopes linear regression. This is a special case of multiple linear regression, with one numeric explanatory variable and one categorical explanatory variable.
Chapter 2 introduces interactions between variables and covers Simpson's Paradox, a counter-intuitive result affecting models containing categorical explanatory variables.
Chapter 3 extends linear regression to even more explanatory variables, and gives some deeper insight into how linear regression works.
Finally, Chapter 4 introduces multiple logistic regression, the logistic distribution, and digs into how logistic regression works.
5. The fish dataset
Here's the fish dataset from the previous course. Each row represents a fish; the mass is the response variable, and there is one numeric explanatory variable (the length) and one categorical explanatory variable (the species).
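As a quick sketch of the structure I'll assume throughout these examples: a data frame named fish with columns mass_g, length_cm, and species. Those exact names are assumptions, not taken from the slide.

    # Peek at the first few rows of the assumed fish data frame.
    head(fish)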
6. One explanatory variable at a time
Recall that you run a linear regression by calling lm, passing a formula and a data frame. The formula has the response variable on the left and the explanatory variable on the right, with the variables separated by a tilde.
Here you can see mass modeled against length. Printing the model shows the model coefficients. With a single numeric explanatory variable, you get one intercept coefficient and one slope coefficient.
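Here's a minimal sketch of that call, using the assumed column names:

    # Model mass against length: one numeric explanatory variable.
    mdl_mass_vs_length <- lm(mass_g ~ length_cm, data = fish)

    # Printing the model shows one intercept and one slope coefficient.
    mdl_mass_vs_length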
Let's change the explanatory variable to species. Recall that when you have a categorical explanatory variable, the coefficients are a little easier to understand if you use "plus zero" to tell R not to include an intercept in the model.
Now you get one intercept coefficient for each category. That is, one coefficient for each species of fish.
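A sketch of the same model with species as the explanatory variable:

    # "+ 0" tells R not to include a global intercept, so you get
    # one intercept coefficient per species instead.
    mdl_mass_vs_species <- lm(mass_g ~ species + 0, data = fish)
    mdl_mass_vs_species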
7. Both variables at the same time
To include both explanatory variables in the model, you combine them on the right-hand side of the formula, separated with a plus, just like you did with the zero.
This time there is one slope coefficient, and one intercept coefficient for each category in the categorical variable.
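A sketch combining both explanatory variables, again with the assumed column names:

    # Both explanatory variables, with "+ 0" for per-species intercepts.
    mdl_mass_vs_both <- lm(mass_g ~ length_cm + species + 0, data = fish)
    mdl_mass_vs_both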
8. Comparing coefficients
Examining the coefficients of each model, it's clear that the numbers are different.
Notice that the slope coefficient for length, labeled length_cm, changes from thirty five to forty three once you include species in the model as well. The intercept coefficients for each species show an even bigger change. For example, once you add length into the model, the coefficient for bream, labeled speciesBream, changes from six hundred and eighteen to minus six hundred and seventy two.
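One way to put the numbers side by side, assuming the three models sketched above:

    # Extract and compare the coefficients of each model.
    coefficients(mdl_mass_vs_length)
    coefficients(mdl_mass_vs_species)
    coefficients(mdl_mass_vs_both)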
9. Visualization: 1 numeric explanatory var
Here's the standard visualization for a linear regression with a numeric explanatory variable. You draw a scatter plot, then use geom_smooth with method equals "lm" to add a linear trend line. Setting se to FALSE prevents a standard error ribbon from being drawn.
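A sketch of that plot, using the assumed fish columns:

    library(ggplot2)

    # Scatter plot of mass against length with a linear trend line.
    # se = FALSE suppresses the standard error ribbon.
    ggplot(fish, aes(length_cm, mass_g)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE)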
10. Visualization: 1 categorical explanatory var
For a categorical explanatory variable, there are a few possible plots. The simplest one is to draw a box plot for each category. The model coefficients are the means of each category, which I've added using stat_summary with the fun-dot-y argument set to mean. Setting shape equals 15 draws each mean as a square point.
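A sketch of that plot; note that fun.y has been renamed to fun in newer versions of ggplot2, where it still works but gives a deprecation warning.

    # Box plot per species, with each category mean overlaid as a square.
    ggplot(fish, aes(species, mass_g)) +
      geom_boxplot() +
      stat_summary(fun.y = mean, shape = 15)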
11. Visualization: both explanatory vars
With a numeric and a categorical explanatory variable, you can draw a scatter plot as before. ggplot2 doesn't have an easy way to plot the model results, but fortunately one is provided by the moderndive package.
In the plot the lines are parallel to each other. Consequently, this type of regression is nicknamed "parallel slopes regression", and the function to draw the lines is geom_parallel_slopes.
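A sketch of the parallel slopes plot, assuming the moderndive package is installed:

    library(ggplot2)
    library(moderndive)

    # One trend line per species, all sharing the same slope.
    ggplot(fish, aes(length_cm, mass_g, color = species)) +
      geom_point() +
      geom_parallel_slopes(se = FALSE)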
12. Let's practice!
Let's begin.