
Model selection: regression models

1. Model selection: regression models

Welcome back! In this lesson we're going to discuss multicollinearity: what it is and what to do about it. So let's get started!

2. Multicollinearity

Multicollinearity is when independent variables are highly correlated with one another. One of the outputs of a regression model is the set of estimated regression coefficients. In a multiple regression framework, each coefficient is interpreted as the amount of change in the dependent variable that can be explained by that independent variable while holding all other variables constant. But when independent variables are correlated, interpreting how much variance each one explains suddenly becomes much less clear, threatening the results of your linear regression analysis.
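To see why this matters in practice, here is a minimal sketch (not from the lesson) that fits a linear regression on two synthetic, nearly identical features; the feature names, values, and seed are made up purely for illustration.

```python
# A minimal sketch showing how correlated predictors muddy coefficient
# interpretation. The data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)    # x2 is almost a copy of x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # y really depends only on x1

X = np.column_stack([x1, x2])
model = LinearRegression().fit(X, y)

# With such strong correlation, the individual coefficients can land far
# from the "true" values (3 and 0) even though predictions stay accurate.
print(model.coef_)
```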

3. Effects of multicollinearity

Multicollinearity can affect machine learning models in several ways: it distorts coefficient estimates and p-values; it makes variance unpredictable, since coefficient estimates have inflated variance in the presence of multicollinearity; it can lead to overfitting of the model; it increases standard error measures, which lowers statistical significance and can lead to failing to reject the null hypothesis, a type II error in hypothesis testing; and it makes it more difficult to determine each feature's actual relationship with the target variable.

4. Techniques to address multicollinearity

So we definitely want to determine whether multicollinearity exists in our data and then do something about it, but what? The first thing you should do is create a correlation matrix. Then, since correlation matrices can be hard to interpret, plot the matrix as a heatmap to get a better visual understanding. You can calculate the variance inflation factor, also called VIF for short. You can introduce penalization (remember ridge and lasso?). And, finally, you can apply PCA, since it has the side effect of removing multicollinearity.
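To make the penalization idea concrete, here is a minimal sketch using sklearn's Ridge on synthetic data of my own (none of this comes from the exercises), showing how the L2 penalty stabilizes the coefficients of a collinear pair of features.

```python
# A sketch of one remedy: ridge regression's L2 penalty shrinks the
# coefficients of correlated features toward more stable values.
# The data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # highly collinear pair
y = 3 * x1 + rng.normal(scale=0.5, size=200)

X = np.column_stack([x1, x2])
ridge = Ridge(alpha=1.0).fit(X, y)

# The penalty tends to split the shared signal between the correlated
# features rather than letting one coefficient blow up.
print(ridge.coef_)
```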

5. Correlation matrix vs heatmap

As you can see from the correlation matrix of the classic mtcars dataset on the left, although readable, it can be difficult to interpret at a glance. Plotting the matrix as a heatmap makes it much easier to quickly explore the relationships between the variables. Here, darker blue suggests a strong positive relationship between features and darker red a strong negative one, both of which suggest further investigation is required.
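As a rough sketch of this comparison in Python, the snippet below builds the correlation matrix and then the heatmap; it assumes a local mtcars.csv file, which is an assumption rather than part of the lesson's setup.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the mtcars data (the "mtcars.csv" path is assumed)
mtcars = pd.read_csv("mtcars.csv")

# The raw correlation matrix: complete, but hard to scan by eye
corr_matrix = mtcars.select_dtypes(include="number").corr()
print(corr_matrix)

# The same matrix as a heatmap; with the RdBu colormap, dark blue marks
# strong positive correlations and dark red strong negative ones
sns.heatmap(corr_matrix, cmap="RdBu", vmin=-1, vmax=1, annot=True, fmt=".2f")
plt.show()
```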

6. Variance inflation factor

The variance inflation factor, or VIF, is another way to help determine whether or not features are collinear with each other. A VIF of 1 indicates that a feature is not correlated with the other features and is therefore not collinear. The higher the correlation, however, the higher the VIF value. Values between 1 and 5 can generally be safely ignored, but for VIF values greater than 5, applying further techniques to address the variables experiencing multicollinearity is highly suggested. There is no function readily available in sklearn, so you won't be practicing this in the exercises, but feel free to explore VIF on your own.
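Although sklearn doesn't provide VIF, statsmodels does; here is a minimal sketch with a few made-up rows of mtcars-like numbers (the column names and values are illustrative, not the exercise data).

```python
# Computing VIF with statsmodels; the tiny DataFrame is illustrative only.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = pd.DataFrame({
    "hp":   [110, 110, 93, 110, 175, 105],
    "disp": [160, 160, 108, 258, 360, 225],
    "wt":   [2.62, 2.88, 2.32, 3.21, 3.44, 3.46],
})

# Add an intercept column so the VIFs aren't distorted by uncentered data
X_const = add_constant(X)

vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const"))  # values above 5 warrant further investigation
```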

7. Functions

In the exercises you'll practice exploring multicollinearity and perform techniques to address it, starting with linear regression. The corr function returns a correlation matrix, while sns dot heatmap of that matrix plots it as a heatmap. Once you've trained a model, the coef underscore attribute provides the estimated model coefficients. The mean squared error and r2 underscore score functions return the MSE and r-squared, respectively. Finally, a little review: the dot columns attribute on a DataFrame returns the names of the columns.
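Putting those pieces together, here is a compact sketch of the workflow; the features.csv file and the "target" column name are assumptions for illustration, not the actual exercise data.

```python
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("features.csv")       # assumed file name
print(df.columns)                      # .columns lists the column names

corr_matrix = df.corr()                # .corr() builds the correlation matrix
sns.heatmap(corr_matrix)               # sns.heatmap() visualizes it

X = df.drop(columns="target")          # "target" column name is illustrative
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.coef_)                     # estimated coefficients

y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))   # MSE
print(r2_score(y_test, y_pred))             # r-squared
```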

8. Let's practice!

Okay, are you ready to put all this into practice?