Using Corr()
The old adage 'Correlation does not imply Causation' is a cautionary tale. However, correlation does give us a good nudge to know where to start looking promising features to use in our models. Use this exercise to get a feel for searching through your data for the first time, trying to find patterns.
A list called columns
containing column names has been created for you. In this exercise you will compute the correlation between those columns and 'SALESCLOSEPRICE'
, and find the maximum.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Use a
for
loop iterate through thecolumns
. - In each loop cycle, compute the correlation between the current column and
'SALESCLOSEPRICE'
using thecorr()
method. - Create logic to update the maximum observed correlation and with which column.
- Print out the name of the column that has the maximum correlation with
'SALESCLOSEPRICE'
.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Name and value of col with max corr
corr_max = 0
corr_max_col = columns[0]
# Loop to check all columns contained in list
for ____ in ____:
# Check the correlation of a pair of columns
corr_val = df.____(____, ____)
# Logic to compare corr_max with current corr_val
if ____ ____ ____:
# Update the column name and corr value
corr_max = corr_val
corr_max_col = col
print(corr_max_col)