Filtering out highly correlated features
You're going to automate the removal of highly correlated features from the numeric ANSUR dataset. You'll calculate the correlation matrix and filter out columns with a correlation coefficient greater than 0.95 or less than -0.95.
Since each correlation coefficient occurs twice in the matrix (the correlation of A to B equals the correlation of B to A), you'll want to ignore half of the correlation matrix so that only one of each pair of correlated features is removed. You can use a boolean mask over the upper triangle for this purpose, as the sketch below illustrates.
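To see how the mask trick works before applying it to ANSUR, here is a minimal sketch on a made-up 3x3 correlation matrix (the feature names A, B, C and the coefficients are invented for illustration; they are not from the ANSUR data):

import numpy as np
import pandas as pd

# Toy absolute-correlation matrix: A and B are highly correlated
corr = pd.DataFrame(
    [[1.00, 0.98, 0.10],
     [0.98, 1.00, 0.12],
     [0.10, 0.12, 1.00]],
    columns=["A", "B", "C"], index=["A", "B", "C"])

# np.triu() marks the upper triangle (diagonal included) with True
mask = np.triu(np.ones_like(corr, dtype=bool))

# DataFrame.mask() replaces cells where the mask is True with NaN, so
# each coefficient survives exactly once, in the lower triangle
print(corr.mask(mask))
#       A     B   C     (roughly)
# A   NaN   NaN NaN
# B  0.98   NaN NaN
# C  0.10  0.12 NaN

Because only column A still holds the 0.98 coefficient, a threshold check on the columns flags A but not B, so just one feature of the correlated pair gets dropped.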
Exercise instructions
- Calculate the correlation matrix of ansur_df and take the absolute value of this matrix.
- Create a boolean mask with True values in the upper right triangle and apply it to the correlation matrix.
- Set the correlation coefficient threshold to 0.95.
- Drop all the columns listed in to_drop from the DataFrame.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Calculate the correlation matrix and take the absolute value
corr_df = ansur_df.____().____()
# Create a True/False mask and apply it
mask = np.____(np.____(corr_df, dtype=____))
tri_df = corr_df.____(mask)
# List column names of highly correlated features (r > 0.95)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > ____)]
# Drop the features in the to_drop list
reduced_df = ansur_df.____(____, axis=1)
print(f"The reduced_df DataFrame has {reduced_df.shape[1]} columns.")
This exercise is part of the course
Dimensionality Reduction in Python
Understand the concept of reducing dimensionality in your data, and master the techniques to do so in Python.
In this first of two chapters on feature selection, you'll learn about the curse of dimensionality and how dimensionality reduction can help you overcome it. You'll be introduced to a number of techniques for detecting and removing features that add little value to the dataset, either because they have little variance, too many missing values, or a strong correlation with other features.