Differences
Let's explore generating new features from existing ones. In the Midwest of the U.S., many single-family homes have extra land around them for green space. In this example you will create a new feature called 'YARD_SIZE', and then see if the new feature is correlated with our outcome variable.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Create a new column using withColumn() called LOT_SIZE_SQFT and convert ACRES to square feet by multiplying by acres_to_sqfeet, the conversion factor.
- Create another new column called YARD_SIZE by subtracting FOUNDATIONSIZE from LOT_SIZE_SQFT.
- Run corr() on each of the independent variables YARD_SIZE, FOUNDATIONSIZE, LOT_SIZE_SQFT against the dependent variable SALESCLOSEPRICE. Does the new feature show a stronger correlation than either of its components?
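The arithmetic behind these steps can be checked outside Spark. The sketch below uses four made-up rows (not course data) and a hand-written Pearson helper, both of which are assumptions for illustration. It also shows that LOT_SIZE_SQFT, being ACRES times a positive constant, has the same Pearson correlation with SALESCLOSEPRICE as ACRES does, so either column can be used for that comparison.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical rows, invented for this sketch only.
acres = [0.25, 0.50, 0.10, 1.00]
foundation_size = [1200, 1500, 900, 2000]
sales_close_price = [250000, 320000, 180000, 450000]

acres_to_sqfeet = 43560  # square feet in one acre

# Step 1: convert lot size from acres to square feet.
lot_size_sqft = [a * acres_to_sqfeet for a in acres]

# Step 2: yard size is the lot area minus the foundation footprint.
yard_size = [lot - f for lot, f in zip(lot_size_sqft, foundation_size)]

# Step 3: correlate each candidate feature with the sale price.
corr_acres = pearson(acres, sales_close_price)
corr_lot = pearson(lot_size_sqft, sales_close_price)
corr_foundation = pearson(foundation_size, sales_close_price)
corr_yard = pearson(yard_size, sales_close_price)

print("ACRES:", corr_acres)
print("LOT_SIZE_SQFT:", corr_lot)
print("FOUNDATIONSIZE:", corr_foundation)
print("YARD_SIZE:", corr_yard)
```

On real data the values will differ; the point is only that rescaling a column leaves its correlation unchanged, while a difference of two columns can correlate differently from either component.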
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Lot size in square feet
acres_to_sqfeet = 43560
df = df.____(____, df[____] * ____)
# Create new column YARD_SIZE
df = df.____(____, df[____] - df[____])
# Corr of ACRES vs SALESCLOSEPRICE
print("Corr of ACRES vs SALESCLOSEPRICE: " + str(df.____(____, ____)))
# Corr of FOUNDATIONSIZE vs SALESCLOSEPRICE
print("Corr of FOUNDATIONSIZE vs SALESCLOSEPRICE: " + str(df.____(____, ____)))
# Corr of YARD_SIZE vs SALESCLOSEPRICE
print("Corr of YARD_SIZE vs SALESCLOSEPRICE: " + str(df.____(____, ____)))