Multicollinearity techniques - feature engineering
Multicollinearity is a common issue that can degrade model performance and interpretability in many machine learning contexts. Knowing how to discuss it can take your explanation of modeling from good to great and really set you apart in an interview.
In this exercise, you'll practice creating a baseline model using Linear Regression on the diabetes dataset and explore some of the output metrics. Then you'll practice techniques to visually explore the correlation between the independent variables before finally performing feature engineering on 2 variables that are highly correlated.
For the first two steps, use X_train, X_test, y_train, and y_test, which have been imported to your workspace.
Additionally, all relevant packages have been imported for you: pandas as pd, train_test_split from sklearn.model_selection, LinearRegression from sklearn.linear_model, mean_squared_error and r2_score from sklearn.metrics, matplotlib.pyplot as plt, and seaborn as sns.
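The correlation check and feature engineering steps described above can be sketched as follows. This is a minimal sketch only: it loads the diabetes data itself (the exercise workspace already provides the split X_train / X_test), and the choice of replacing s1 and s2 with their mean is one illustrative technique, not the exercise's prescribed solution.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes

# Assumption: this sketch loads the data itself; in the exercise
# workspace the predictors are already available as X_train / X_test.
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Pairwise Pearson correlations between the independent variables
corr = X.corr()

# Heatmap: off-diagonal cells near +/-1 flag highly correlated pairs
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

# In this dataset, s1 (total serum cholesterol) and s2 (LDL) are
# strongly correlated; one simple engineering fix is to replace the
# pair with a single combined feature, here their mean
X_eng = X.drop(columns=['s1', 's2'])
X_eng['s1_s2_mean'] = X[['s1', 's2']].mean(axis=1)
```

Averaging is just one option; dropping one variable of the pair, or using a dimensionality-reduction step such as PCA, are equally common answers in an interview setting.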
This exercise is part of the course Practicing Machine Learning Interview Questions in Python.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Instantiate, fit, predict
lin_mod = ____()
lin_mod.____(____, ____)
y_pred = lin_mod.____(____)
# Coefficient estimates
print('Coefficients: \n', lin_mod.____)
# Mean squared error
print("Mean squared error: %.2f"
% ____(____, ____))
# Explained variance score
print('R_squared score: %.2f' % ____(____, ____))
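For reference, one way the completed scaffold might look is below. This is a self-contained sketch that creates its own train/test split (the exercise workspace provides X_train, X_test, y_train, and y_test pre-split, so the test_size and random_state values here are assumptions):

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load and split the diabetes data (assumed split parameters; the
# exercise workspace supplies this split already)
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X, diabetes.target, test_size=0.3, random_state=123)

# Instantiate, fit, predict
lin_mod = LinearRegression()
lin_mod.fit(X_train, y_train)
y_pred = lin_mod.predict(X_test)

# Coefficient estimates
print('Coefficients: \n', lin_mod.coef_)

# Mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# Explained variance score
print('R_squared score: %.2f' % r2_score(y_test, y_pred))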