Multiple linear regression

1. Multiple linear regression

In this video, you'll learn how to use multiple linear regression.

2. Omitted variable bias

One threat to the accuracy of the simple linear regression from before is what's called "omitted variable bias". This occurs when a variable not included in the regression is correlated with both the explanatory variable and the response variable.

3. The more effort, the less success?

Imagine we are looking at the relationship between study time before an exam and exam success. If we just consider these two variables, we find a negative relationship: the more a person studies, the lower her exam score will be. Strange, isn't it?

4. The more effort, the more success!

Since IQ is positively related to exam success and negatively related to study time, it is exactly such an omitted variable, so we need to include it in the regression. Then, with the help of multiple regression, we can estimate the true, positive effect of study time.
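
To make this concrete, here is a small simulated sketch (the numbers and variable names are made up for illustration, not course data): IQ raises exam scores but lowers study time, so leaving IQ out of the model flips the sign of the study-time coefficient.

```r
# Simulated illustration of omitted variable bias (all numbers are assumptions)
set.seed(1)
iq        <- rnorm(500, mean = 100, sd = 15)
studyTime <- 40 - 0.3 * iq + rnorm(500, sd = 3)     # higher IQ -> less studying
examScore <- 0.5 * studyTime + 0.8 * iq + rnorm(500, sd = 5)

coef(lm(examScore ~ studyTime))        # biased: negative slope for studyTime
coef(lm(examScore ~ studyTime + iq))   # positive slope once IQ is included
```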

5. Multiple linear regression

Let's estimate a multiple regression model using the `lm` function, including all the variables in the dataset. `futureMargin` is now modeled as a function of `margin`, `nOrders`, `nItems`, and so on; we save the model as `multipleLM`. Just as before, we use `summary`, now with `multipleLM` as an argument. That worked, although we now encounter other problems.
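
As a sketch, the call could look like this; the data frame name (`salesData` here) and any predictor names not mentioned in the video are assumptions:

```r
# Fit a multiple linear regression of future margin on all candidate predictors
multipleLM <- lm(futureMargin ~ margin + nOrders + nItems +
                   marginPerOrder + marginPerItem + itemsPerOrder +
                   gender + age,
                 data = salesData)
summary(multipleLM)
```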

6. Multicollinearity

Multicollinearity is one threat to a multiple linear regression. It occurs whenever one explanatory variable can be largely explained by the remaining explanatory variables. Then, the regression coefficients become unstable and their standard errors are inflated. Due to the high correlation between `nOrders` and `nItems` as well as `marginPerOrder` and `marginPerItem`, these variables are candidates for multicollinearity.
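
A quick way to see why these pairs are suspicious is to check their pairwise correlations (again assuming the `salesData` data frame from the sketch above):

```r
# High correlations between these pairs hint at multicollinearity
cor(salesData$nOrders, salesData$nItems)
cor(salesData$marginPerOrder, salesData$marginPerItem)
```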

7. Variance Inflation Factors

To systematically check all variables in a model for multicollinearity, we calculate the variance inflation factors (VIFs) using the `vif` function from the *rms* package. These indicate the increase in the variance of an estimated coefficient due to multicollinearity. A VIF higher than 5 is problematic and values above 10 indicate poor regression estimates. Let's look at our model's variance inflation factors. As expected, the VIFs for `nOrders` and `nItems` as well as `marginPerOrder` and `marginPerItem` are rather high.
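
Following the transcript, the VIFs can be computed like this, assuming the `multipleLM` model from the sketch above:

```r
# Variance inflation factors for the full model, using vif() from the rms package
library(rms)
vif(multipleLM)
```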

8. New model

Hence, we exclude one variable of each pair from the regression, namely `nItems` and `marginPerOrder`. Here are the VIFs of the new model: they're all acceptable now.
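
A sketch of the reduced model, using the same assumed data frame and predictor names as before:

```r
# Refit without nItems and marginPerOrder, then re-check the VIFs
multipleLM2 <- lm(futureMargin ~ margin + nOrders + marginPerItem +
                    itemsPerOrder + gender + age,
                  data = salesData)
vif(multipleLM2)   # rms is already loaded from the previous step
```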

9. Interpretation of Coefficients

Finally, we are ready to interpret the model output. The intercept gives the expected margin in year 2 when all explanatory variables are set to 0. Hence, we observe an expected margin in year 2 of roughly 23, given that every explanatory variable is equal to zero. On its own, the intercept is usually hard to interpret in a multiple regression model. The coefficient of each explanatory variable gives the effect that a one-unit change in that variable has on the expected margin in year 2, with all other variables held constant. The coefficient estimate of roughly `0.4` for the margin variable means that the expected future margin increases by about 0.4 Euro for every additional Euro of margin in the current year. Let's also look at the coefficients' significance. By default, a t-test of whether the respective coefficient is 0 is conducted. If the p-value in the last column is smaller than 0.05, we conclude that the coefficient is significantly different from `0` at the `.05` significance level. In our example, all variables except gender, age, and the items per order are significant at that level. There is also a test of whether all slope coefficients are simultaneously equal to zero, but more on that later.
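
To read these values off the fitted object, the coefficient table of the summary can be inspected directly (using the assumed `multipleLM2` from the sketch above; the transcript reports an estimate of roughly 0.4 for `margin`):

```r
# Estimates, standard errors, t values, and p-values for each coefficient
coef(summary(multipleLM2))
```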

10. Let's practice!

Now, let's practice!