Adjusting Time Features
We have mentioned throughout this course some of the dangers of leaking information to your model during training. Data leakage will cause your model to have very optimistic metrics for accuracy but once real data is run through it the results are often very disappointing.
In this exercise, we are going to ensure that DAYSONMARKET only reflects what information we have at the time of predicting the value. I.e., if the house is still on the market, we don't know how many more days it will stay on the market. We need to adjust our test_df to reflect what information we currently have as of 2017-12-10.  
NOTE: This example will use the lit() function. This function is used to allow single values where an entire column is expected in a function call.
Cet exercice fait partie du cours
Feature Engineering with PySpark
Instructions
- Import the following functions from pyspark.sql.functionsto use later on:datediff(),to_date(),lit().
- Convert the date string '2017-12-10' to a pyspark date by first calling the literal function, lit()on it and thento_date()
- Create test_dfby filteringOFFMKTDATEgreater than or equal to thesplit_dateandLISTDATEless than or equal to thesplit_dateusingwhere().
- Replace DAYSONMARKETby calculating a new column calledDAYSONMARKET, the new column should be the difference betweensplit_dateandLISTDATEusedatediff()to perform the date calculation. Inspect the new column and the original using the code provided.
Exercice interactif pratique
Essayez cet exercice en complétant cet exemple de code.
from pyspark.sql.functions import ____, ____, ____
split_date = ____(____('2017-12-10'))
# Create Sequential Test set
test_df = df.____(df[____] >= ____).____(df[____] <= ____)
# Create a copy of DAYSONMARKET to review later
test_df = test_df.withColumn('DAYSONMARKET_Original', test_df['DAYSONMARKET'])
# Recalculate DAYSONMARKET from what we know on our split date
test_df = test_df.withColumn(____, ____(____, ____))
# Review the difference
test_df[['LISTDATE', 'OFFMKTDATE', 'DAYSONMARKET_Original', 'DAYSONMARKET']].show()