Adjusting Time Features
We have mentioned throughout this course some of the dangers of leaking information to your model during training. Data leakage will cause your model to have very optimistic metrics for accuracy but once real data is run through it the results are often very disappointing.
In this exercise, we are going to ensure that DAYSONMARKET
only reflects what information we have at the time of predicting the value. I.e., if the house is still on the market, we don't know how many more days it will stay on the market. We need to adjust our test_df
to reflect what information we currently have as of 2017-12-10.
NOTE: This example will use the lit()
function. This function is used to allow single values where an entire column is expected in a function call.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Import the following functions from
pyspark.sql.functions
to use later on:datediff()
,to_date()
,lit()
. - Convert the date string '2017-12-10' to a pyspark date by first calling the literal function,
lit()
on it and thento_date()
- Create
test_df
by filteringOFFMKTDATE
greater than or equal to thesplit_date
andLISTDATE
less than or equal to thesplit_date
usingwhere()
. - Replace
DAYSONMARKET
by calculating a new column calledDAYSONMARKET
, the new column should be the difference betweensplit_date
andLISTDATE
usedatediff()
to perform the date calculation. Inspect the new column and the original using the code provided.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.sql.functions import ____, ____, ____
split_date = ____(____('2017-12-10'))
# Create Sequential Test set
test_df = df.____(df[____] >= ____).____(df[____] <= ____)
# Create a copy of DAYSONMARKET to review later
test_df = test_df.withColumn('DAYSONMARKET_Original', test_df['DAYSONMARKET'])
# Recalculate DAYSONMARKET from what we know on our split date
test_df = test_df.withColumn(____, ____(____, ____))
# Review the difference
test_df[['LISTDATE', 'OFFMKTDATE', 'DAYSONMARKET_Original', 'DAYSONMARKET']].show()