ComeçarComece de graça

Adjusting Time Features

We have mentioned throughout this course some of the dangers of leaking information to your model during training. Data leakage will cause your model to have very optimistic metrics for accuracy but once real data is run through it the results are often very disappointing.

In this exercise, we are going to ensure that DAYSONMARKET only reflects what information we have at the time of predicting the value. I.e., if the house is still on the market, we don't know how many more days it will stay on the market. We need to adjust our test_df to reflect what information we currently have as of 2017-12-10.

NOTE: This example will use the lit() function. This function is used to allow single values where an entire column is expected in a function call.

Este exercício faz parte do curso

Feature Engineering with PySpark

Ver curso

Instruções do exercício

  • Import the following functions from pyspark.sql.functions to use later on: datediff(), to_date(), lit().
  • Convert the date string '2017-12-10' to a pyspark date by first calling the literal function, lit() on it and then to_date()
  • Create test_df by filtering OFFMKTDATE greater than or equal to the split_date and LISTDATE less than or equal to the split_date using where().
  • Replace DAYSONMARKET by calculating a new column called DAYSONMARKET, the new column should be the difference between split_date and LISTDATE use datediff() to perform the date calculation. Inspect the new column and the original using the code provided.

Exercício interativo prático

Experimente este exercício completando este código de exemplo.

from pyspark.sql.functions import ____, ____, ____

split_date = ____(____('2017-12-10'))
# Create Sequential Test set
test_df = df.____(df[____] >= ____).____(df[____] <= ____)

# Create a copy of DAYSONMARKET to review later
test_df = test_df.withColumn('DAYSONMARKET_Original', test_df['DAYSONMARKET'])

# Recalculate DAYSONMARKET from what we know on our split date
test_df = test_df.withColumn(____, ____(____, ____))

# Review the difference
test_df[['LISTDATE', 'OFFMKTDATE', 'DAYSONMARKET_Original', 'DAYSONMARKET']].show()
Editar e executar o código