
Parallel slopes linear regression

1. Parallel slopes linear regression

Hi, I'm Maarten. Welcome!

2. The previous course

This course builds upon the skills from the previous course. In that last course, you performed linear and logistic regression with a single explanatory variable.

3. From simple regression to multiple regression

This time, you'll learn to fit models that include multiple explanatory variables. This is sometimes called "multiple regression". Including more explanatory variables in the model often gives you more insight into the relationship between the explanatory variables and the response, and can provide more accurate predictions. It's an important step towards mastering regression.

4. The course contents

Here's the plan. In Chapter 1, you'll explore parallel slopes linear regression. This is a special case of multiple linear regression, with one numeric explanatory variable and one categorical explanatory variable. Chapter 2 introduces interactions between variables and covers Simpson's Paradox, a counter-intuitive result affecting models containing categorical explanatory variables. Chapter 3 extends linear regression to even more explanatory variables, and gives some deeper insight into how linear regression works. Finally, Chapter 4 introduces multiple logistic regression, the logistic distribution, and digs into how logistic regression works.

5. The fish dataset

Here's the same fish dataset from the previous course. Each row represents a fish, the mass is the response variable, and there is one numeric and one categorical explanatory variable.

6. One explanatory variable at a time

Recall that you run a linear regression by using ols from statsmodels dot formula dot api, passing a formula and a DataFrame. The formula has the response variable on the left and the explanatory variable on the right, with the variables separated by a tilde. You then fit the model using dot fit. Here you can see mass modeled against length. Printing the model parameters using the params attribute shows the model coefficients. With a single numeric explanatory variable, you get one intercept coefficient and one slope coefficient. Let's change the explanatory variable to species. Recall that when you have a categorical explanatory variable, the coefficients are a little easier to understand if you use "plus zero" to tell statsmodels not to include an intercept in the model. Now you get one intercept coefficient for each category. That is, one coefficient for each species of fish.
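Here's a minimal sketch of those two models in code. The DataFrame name fish and the column names mass_g, length_cm, and species are assumptions based on the narration, and the rows are made up for illustration.

import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical slice of the fish dataset, just for illustration.
fish = pd.DataFrame({
    "species": ["Bream", "Bream", "Perch", "Perch", "Pike", "Roach"],
    "length_cm": [25.4, 31.0, 21.0, 36.5, 42.0, 20.5],
    "mass_g": [242.0, 500.0, 110.0, 556.0, 456.0, 120.0],
})

# Numeric explanatory variable: one intercept and one slope coefficient.
mdl_mass_vs_length = ols("mass_g ~ length_cm", data=fish).fit()
print(mdl_mass_vs_length.params)

# Categorical explanatory variable: "+ 0" drops the global intercept,
# so you get one intercept coefficient per species.
mdl_mass_vs_species = ols("mass_g ~ species + 0", data=fish).fit()
print(mdl_mass_vs_species.params)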

7. Both variables at the same time

To include both explanatory variables in the model, you combine them on the right-hand side of the formula, separated with a plus, just like you did with the zero. This time there is one slope coefficient, and an intercept coefficient for each category in the categorical variable.
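As a sketch, reusing the hypothetical fish DataFrame from above:

from statsmodels.formula.api import ols

# Both explanatory variables: one slope for length_cm plus one intercept per species.
mdl_mass_vs_both = ols("mass_g ~ length_cm + species + 0", data=fish).fit()
print(mdl_mass_vs_both.params)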

8. Comparing coefficients

Examining the coefficients of each model, it's clear that the numbers are different. Notice that the slope coefficient for length, labeled length_cm, changes from thirty five to forty three once you include species in the model as well. The intercept coefficients for each species show an even bigger change. For example, once you add length into the model, bream changes from six hundred and eighteen to minus six hundred and seventy two.

9. Visualization: 1 numeric explanatory variable

Here's the standard visualization for a linear regression with a numeric explanatory variable. Using Seaborn's regplot function, you draw a scatterplot with a linear trend line, specifying the x, y, and data arguments. Setting ci to None prevents plotting a confidence interval ribbon. You display the plot using plt dot show.
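A sketch of that call, again with the hypothetical fish DataFrame and column names from earlier:

import matplotlib.pyplot as plt
import seaborn as sns

# Scatterplot of mass against length with a linear trend line; ci=None suppresses the ribbon.
sns.regplot(x="length_cm", y="mass_g", data=fish, ci=None)
plt.show()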

10. Visualization: 1 categorical explanatory variable

For a categorical explanatory variable, there are a few possible plots. The simplest one is to draw a box plot of the response for each category. The model coefficients are the means of the response within each category, which I've added to the plot using the showmeans argument.
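Sketched with the same hypothetical columns; showmeans is passed through to matplotlib's underlying box plot so a marker is drawn at each category mean:

import matplotlib.pyplot as plt
import seaborn as sns

# One box of mass_g per species; showmeans marks the mean of each category.
sns.boxplot(x="species", y="mass_g", data=fish, showmeans=True)
plt.show()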

11. Visualization: both explanatory variables

Seaborn doesn't have an easy way to plot this 'both explanatory variables' model, but you can draw the trend lines manually. To do this, you first extract the model coefficients into separate intercepts and the slope, as shown here. You then draw a standard scatter plot, with one additional argument: hue. The hue argument will be used throughout the course whenever you want to color points by the values of a categorical or continuous variable. Lastly, you call plt dot axline four times, once for each fish species. axline draws a straight line, defined either by two points or by one point and a slope. In this case, the xy1 argument places the line at the intercept, which is different for each species, and assigning the common slope coefficient to the slope argument lets the function draw the line. Additionally, you can specify a color argument. Since the slope is the same in every plt dot axline call, the trend lines are parallel to each other. Consequently, this type of regression is nicknamed "parallel slopes regression".
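Here's a sketch of that manual plot, reusing the combined model from the earlier sketch. It assumes the fitted params are ordered with the four species intercepts first and the length_cm slope last, as printing params suggests; check the order in your own output before unpacking.

import matplotlib.pyplot as plt
import seaborn as sns

# Unpack the coefficients: one intercept per species, plus the common slope.
ic_bream, ic_perch, ic_pike, ic_roach, slope = mdl_mass_vs_both.params

# Scatterplot colored by species.
sns.scatterplot(x="length_cm", y="mass_g", hue="species", data=fish)

# One line per species: different intercepts, same slope, so the lines are parallel.
plt.axline(xy1=(0, ic_bream), slope=slope, color="blue")
plt.axline(xy1=(0, ic_perch), slope=slope, color="orange")
plt.axline(xy1=(0, ic_pike), slope=slope, color="green")
plt.axline(xy1=(0, ic_roach), slope=slope, color="red")
plt.show()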

12. Let's practice!

Time to dive into your first set of exercises.
