
Multicollinearity

1. Detecting and dealing with multicollinearity

Insignificant variables in a model can be a sign of correlation between the independent variables. If several variables are strongly related, each one adds little new, linearly independent information for predicting the outcome. So what is correlation?

2. Understanding correlation

Correlation is a measure of the association between two numeric variables.

3. Visualizing correlation

The correlation coefficient r measures the strength and the direction of the association between two variables and ranges from -1 to +1. If r is close to -1, it reflects a strong negative relationship between the variables, while if r is close to +1, it reflects a strong positive relationship. A correlation coefficient near zero indicates a weak relationship between the variables.
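
For instance, here is a minimal sketch of how these patterns look, using simulated data and base R:

set.seed(42)
x <- rnorm(100)
y_pos <-  x + rnorm(100, sd = 0.3)  # strong positive relationship: r near +1
y_neg <- -x + rnorm(100, sd = 0.3)  # strong negative relationship: r near -1

plot(x, y_pos)   # points rise from left to right
cor(x, y_pos)    # close to +1
cor(x, y_neg)    # close to -1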

4. Calculating correlation in R

For example, if you are interested in knowing whether there is a relationship between employee age and compensation, a correlation coefficient can be calculated to answer this question. To calculate the correlation coefficient between two variables, you can use the cor() function, as shown here.
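
As a minimal sketch (the age and compensation vectors below are made-up illustration data):

# Hypothetical employee data
age <- c(25, 32, 41, 38, 29, 50, 45, 36)
compensation <- c(48000, 56000, 71000, 65000, 52000, 90000, 80000, 60000)

cor(age, compensation)  # close to +1: a strong positive association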

5. What is multicollinearity?

However, there can also be situations where one independent variable is highly collinear with a set of two or more other independent variables. This phenomenon is called multicollinearity. One of the assumptions of a logistic regression model is that there is little or no multicollinearity among the independent variables. If your independent variables are multicollinear, it becomes difficult to assess how much effect each variable has on the dependent variable.

6. How to detect multicollinearity?

To measure multicollinearity, you compute the variance inflation factor (VIF) of each independent variable. You can do this using the vif() function from the car package. To calculate the VIF of each variable, you pass the fitted logistic regression model to vif().
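
A hedged sketch of that call; the model name churn_model, the data frame hr_data, and its columns are hypothetical stand-ins for your own model:

library(car)

# Hypothetical logistic regression: attrition is a binary outcome
churn_model <- glm(attrition ~ age + gender + compensation + job_level,
                   data = hr_data, family = "binomial")

vif(churn_model)  # one VIF (or GVIF) value per independent variable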

7. Variance inflation factor

The output of vif() is a matrix, as shown here. The variance inflation factor is always greater than or equal to 1 and appears in the GVIF column. It indicates by what percentage the variance of each coefficient is inflated. For example, the VIF of the gender variable is 1.26, which means its variance is 26% larger than it would be if gender were uncorrelated with the other independent variables.
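
As an illustration of the shape of that matrix (only the gender value of 1.26 comes from the lesson; the other rows and values are made-up placeholders):

              GVIF Df GVIF^(1/(2*Df))
age           1.12  1            1.06
gender        1.26  1            1.12
job_level     1.80  2            1.16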

8. Rule of thumb for interpreting VIF value

If the VIF of a variable is 1, it means that the variable is not correlated with any of the other variables. In practice, however, there is always a small amount of collinearity among the independent variables. When the VIF is between 1 and 5, moderate multicollinearity exists, but it is not a cause for concern. As a rule of thumb, a VIF greater than 5 is problematic, and the variable should be removed from the model.

9. How to deal with multicollinearity?

So this is what you are going to do in the following exercises. First, check whether any variable in your full model has a VIF greater than 5. If only one variable does, remove that variable from the model. If multiple variables have a VIF greater than 5, remove the variable with the highest VIF, then refit the model. Repeat this process until all remaining variables have a VIF of less than 5.
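
A hedged sketch of that loop, reusing the hypothetical churn_model from before (the removal syntax itself is explained in the next step):

library(car)

model <- churn_model
repeat {
  v <- vif(model)
  # vif() returns a matrix when factor terms are present, else a vector
  gvif <- if (is.matrix(v)) v[, "GVIF"] else v
  vars <- if (is.matrix(v)) rownames(v) else names(v)
  if (max(gvif) < 5) break                    # rule of thumb from the lesson
  worst <- vars[which.max(gvif)]              # variable with the highest VIF
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}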

10. Removing a variable from a model

Note that you can easily remove a variable using the formula syntax shown here. After the period, add a minus sign followed by the variable name. This fits a model with all the variables except the one after the minus sign.
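
A hedged example using the hypothetical names from before; the period stands in for all the other variables, and "- job_level" excludes job_level:

model_reduced <- glm(attrition ~ . - job_level, data = hr_data,
                     family = "binomial")

# Equivalently, refit an existing model without that variable:
model_reduced <- update(churn_model, . ~ . - job_level)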

11. Let's practice!

Now it's your turn to detect multicollinear variables in the dataset.
