Get startedGet started for free

Adjusting Time Features

We have mentioned throughout this course some of the dangers of leaking information to your model during training. Data leakage will cause your model to have very optimistic metrics for accuracy but once real data is run through it the results are often very disappointing.

In this exercise, we are going to ensure that DAYSONMARKET only reflects what information we have at the time of predicting the value. I.e., if the house is still on the market, we don't know how many more days it will stay on the market. We need to adjust our test_df to reflect what information we currently have as of 2017-12-10.

NOTE: This example will use the lit() function. This function is used to allow single values where an entire column is expected in a function call.

This exercise is part of the course

Feature Engineering with PySpark

View Course

Exercise instructions

  • Import the following functions from pyspark.sql.functions to use later on: datediff(), to_date(), lit().
  • Convert the date string '2017-12-10' to a pyspark date by first calling the literal function, lit() on it and then to_date()
  • Create test_df by filtering OFFMKTDATE greater than or equal to the split_date and LISTDATE less than or equal to the split_date using where().
  • Replace DAYSONMARKET by calculating a new column called DAYSONMARKET, the new column should be the difference between split_date and LISTDATE use datediff() to perform the date calculation. Inspect the new column and the original using the code provided.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from pyspark.sql.functions import ____, ____, ____

split_date = ____(____('2017-12-10'))
# Create Sequential Test set
test_df = df.____(df[____] >= ____).____(df[____] <= ____)

# Create a copy of DAYSONMARKET to review later
test_df = test_df.withColumn('DAYSONMARKET_Original', test_df['DAYSONMARKET'])

# Recalculate DAYSONMARKET from what we know on our split date
test_df = test_df.withColumn(____, ____(____, ____))

# Review the difference
test_df[['LISTDATE', 'OFFMKTDATE', 'DAYSONMARKET_Original', 'DAYSONMARKET']].show()
Edit and Run Code