Differences
Let's explore generating new features from existing ones. In the Midwest of the U.S., many single-family homes have extra land around them for green space. In this example you will create a new feature called 'YARD_SIZE', and then check whether it is correlated with the outcome variable.
This exercise is part of the course
Feature Engineering with PySpark
Instructions
- Create a new column called LOT_SIZE_SQFT using withColumn(), converting ACRES to square feet by multiplying by the conversion factor acres_to_sqfeet.
- Create another new column called YARD_SIZE by subtracting FOUNDATIONSIZE from LOT_SIZE_SQFT.
- Run corr() on each of the independent variables YARD_SIZE, FOUNDATIONSIZE, and LOT_SIZE_SQFT against the dependent variable SALESCLOSEPRICE. Does the new feature show a stronger correlation than either of its components?
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Lot size in square feet
acres_to_sqfeet = 43560
df = df.____(____, df[____] * ____)
# Create new column YARD_SIZE
df = df.____(____, df[____] - df[____])
# Corr of ACRES vs SALESCLOSEPRICE
print("Corr of ACRES vs SALESCLOSEPRICE: " + str(df.____(____, ____)))
# Corr of FOUNDATIONSIZE vs SALESCLOSEPRICE
print("Corr of FOUNDATIONSIZE vs SALESCLOSEPRICE: " + str(df.____(____, ____)))
# Corr of YARD_SIZE vs SALESCLOSEPRICE
print("Corr of YARD_SIZE vs SALESCLOSEPRICE: " + str(df.____(____, ____)))