Multicollinearity techniques - feature engineering
Multicollinearity is a common issue that can degrade model performance and interpretability in many machine learning contexts. Knowing how to discuss it can take your explanation of modeling from good to great and really set you apart in an interview.
In this exercise, you'll practice creating a baseline model using Linear Regression on the diabetes dataset and explore some of the output metrics. Then you'll practice techniques to visually explore the correlation between the independent variables before finally performing feature engineering on 2 variables that are highly correlated.
For the first two steps, use X_train, X_test, y_train, and y_test, which have been imported to your workspace.
Additionally, all relevant packages have been imported for you: pandas as pd, train_test_split from sklearn.model_selection, LinearRegression from sklearn.linear_model, mean_squared_error and r2_score from sklearn.metrics, matplotlib.pyplot as plt, and seaborn as sns.
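The correlation check and feature engineering steps described above can be sketched as follows. This is a minimal sketch only: it loads the diabetes data itself (the exercise workspace already provides the split X_train / X_test), and the choice of replacing s1 and s2 with their mean is one illustrative technique, not the exercise's prescribed solution.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes

# Assumption: this sketch loads the data itself; in the exercise
# workspace the predictors are already available as X_train / X_test.
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Pairwise Pearson correlations between the independent variables
corr = X.corr()

# Heatmap: off-diagonal cells near +/-1 flag highly correlated pairs
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

# In this dataset, s1 (total serum cholesterol) and s2 (LDL) are
# strongly correlated; one simple engineering fix is to replace the
# pair with a single combined feature, here their mean
X_eng = X.drop(columns=['s1', 's2'])
X_eng['s1_s2_mean'] = X[['s1', 's2']].mean(axis=1)
```

Averaging is just one option; dropping one variable of the pair, or using a dimensionality-reduction step such as PCA, are equally common answers in an interview setting.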
This exercise is part of the course Practicing Machine Learning Interview Questions in Python.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Instantiate, fit, predict
lin_mod = ____()
lin_mod.____(____, ____)
y_pred = lin_mod.____(____)
# Coefficient estimates
print('Coefficients: \n', lin_mod.____)
# Mean squared error
print("Mean squared error: %.2f"
% ____(____, ____))
# Explained variance score
print('R_squared score: %.2f' % ____(____, ____))
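For reference, one way the completed scaffold might look is below. This is a self-contained sketch that creates its own train/test split (the exercise workspace provides X_train, X_test, y_train, and y_test pre-split, so the test_size and random_state values here are assumptions):

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load and split the diabetes data (assumed split parameters; the
# exercise workspace supplies this split already)
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X, diabetes.target, test_size=0.3, random_state=123)

# Instantiate, fit, predict
lin_mod = LinearRegression()
lin_mod.fit(X_train, y_train)
y_pred = lin_mod.predict(X_test)

# Coefficient estimates
print('Coefficients: \n', lin_mod.coef_)

# Mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# Explained variance score
print('R_squared score: %.2f' % r2_score(y_test, y_pred))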