Filtering out highly correlated features

You're going to automate the removal of highly correlated features in the numeric ANSUR dataset. You'll calculate the correlation matrix and filter out columns that have a correlation coefficient of more than 0.95 or less than -0.95.
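The two-sided condition (above 0.95 or below -0.95) can be checked with a single threshold once the absolute value is taken. A minimal sketch on made-up data, where the column names `a` and `b`, the seed, and the noise level are all illustrative assumptions rather than anything from the ANSUR dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data: b is almost exactly the negative of a
rng = np.random.default_rng(42)
x = rng.normal(size=100)
toy_df = pd.DataFrame({"a": x, "b": -x + 0.01 * rng.normal(size=100)})

# Pearson correlation between a and b is strongly negative
r = toy_df.corr().loc["a", "b"]

# After taking the absolute value, one comparison covers both tails
is_highly_correlated = abs(r) > 0.95
print(r, is_highly_correlated)
```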

Since each correlation coefficient occurs twice in the matrix (the correlation of A to B equals the correlation of B to A), you'll want to ignore half of the correlation matrix so that only one of the two correlated features is removed. Use a mask trick for this purpose.
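To make the mask trick concrete, here is a small sketch on a hand-written 3x3 correlation matrix. The feature names `A`, `B`, `C` and the coefficient values are hypothetical; `np.triu`, `np.ones_like`, and `DataFrame.mask` are one way to build and apply such a mask.

```python
import numpy as np
import pandas as pd

# Hypothetical absolute correlation matrix for three features
corr_df = pd.DataFrame(
    [[1.00, 0.97, 0.10],
     [0.97, 1.00, 0.20],
     [0.10, 0.20, 1.00]],
    index=list("ABC"),
    columns=list("ABC"),
)

# Boolean array that is True on the diagonal and above it
mask = np.triu(np.ones_like(corr_df, dtype=bool))

# DataFrame.mask replaces entries where the mask is True with NaN,
# so each pair (A, B) survives only once, below the diagonal
tri_df = corr_df.mask(mask)
print(tri_df)
```

With the diagonal and upper triangle hidden, the A/B pair appears only in column A, so a column-wise threshold check would flag A and leave B in place.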

This exercise is part of the course Dimensionality Reduction in Python.


Exercise instructions

Calculate the correlation matrix of ansur_df and take the absolute value of this matrix.
Create a boolean mask with True values in the upper right triangle and apply it to the correlation matrix.
Set the correlation coefficient threshold to 0.95.
Drop all the columns listed in to_drop from the DataFrame.
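The four steps above can be sketched end to end on a small synthetic DataFrame. This is one possible way to carry them out, not necessarily the intended solution: `toy_df` stands in for `ansur_df`, and its column names, the random seed, and the noise level are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ansur_df: two nearly identical columns plus one unrelated column
rng = np.random.default_rng(0)
stature = rng.normal(size=200)
toy_df = pd.DataFrame({
    "stature": stature,
    "stature_copy": stature + 0.01 * rng.normal(size=200),
    "weight": rng.normal(size=200),
})

# Step 1: absolute correlation matrix
corr_df = toy_df.corr().abs()

# Step 2: hide the diagonal and upper triangle so each pair is counted once
mask = np.triu(np.ones_like(corr_df, dtype=bool))
tri_df = corr_df.mask(mask)

# Step 3: flag columns with any remaining coefficient above the threshold
threshold = 0.95
to_drop = [c for c in tri_df.columns if any(tri_df[c] > threshold)]

# Step 4: drop the flagged columns
reduced_df = toy_df.drop(to_drop, axis=1)

print(to_drop)
print(f"The reduced_df DataFrame has {reduced_df.shape[1]} columns.")
```

Only one column of the near-duplicate pair is flagged, because the other half of the pair sits in the masked part of the matrix.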

Hands-on interactive exercise

Complete the sample code below to finish this exercise.

# Calculate the correlation matrix and take the absolute value
corr_df = ansur_df.____().____()

# Create a True/False mask and apply it
mask = np.____(np.____(corr_df, dtype=____))
tri_df = corr_df.____(mask)

# List column names of highly correlated features (|r| > 0.95)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > ____)]

# Drop the features in the to_drop list
reduced_df = ansur_df.____(____, axis=1)

print(f"The reduced_df DataFrame has {reduced_df.shape[1]} columns.")
