Creating Time Splits
In the video, we learned why splitting data randomly can be dangerous for time series: data from the future can leak into training and cause our model to overfit. Often with time series, you acquire new data as it becomes available and will want to retrain your model on the newest data. In the video, we showed how to do a percentage split for test and training sets, but suppose you wish to train on all available data except for the last 45 days, which you want to use as a test set.
In this exercise, we will create a function to find the split date for using the last 45 days of data for testing and the rest for training. Please note that timedelta() has already been imported for you from the standard Python library datetime.
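As a minimal sketch of the date arithmetic involved, the snippet below subtracts 45 days from a date using timedelta(). The value of max_date here is a made-up placeholder, not a date taken from the course dataset.

```python
from datetime import date, timedelta

# Hypothetical "most recent date" in a dataset (illustrative value only)
max_date = date(2018, 1, 31)

# Subtracting a timedelta moves the date back by that many days
split_date = max_date - timedelta(days=45)

print(split_date)  # 2017-12-17
```

Everything before split_date would go to training and everything from split_date onward to testing.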
This exercise is part of the course
Feature Engineering with PySpark
Instructions
- Create a function train_test_split_date() that takes in a dataframe, df, the date column to use for splitting, split_col, and the number of days to use for the test set, test_days, which should have a default value of 45.
- Find the min and max dates for split_col using agg().
- Find the date to split the test and training sets by subtracting test_days from max_date using timedelta(), which takes a days parameter; in this case, pass in test_days.
- Using OFFMKTDATE as the split_col, find split_date and use it to filter the dataframe into two new ones, train_df and test_df, where test_df is only the last 45 days of the data. Additionally, ensure that test_df only contains homes listed as of the split date by filtering df['LISTDATE'] less than or equal to split_date.
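To make the filtering logic in the last step concrete, here is a plain-Python analogue on a handful of toy records. It is only a sketch: the homes list and its dates are invented for illustration, and the list comprehensions stand in for the dataframe filters rather than reproducing the PySpark calls the exercise asks for.

```python
from datetime import date, timedelta

# Toy records standing in for dataframe rows (all values are made up)
homes = [
    {"LISTDATE": date(2017, 1, 10), "OFFMKTDATE": date(2017, 3, 1)},
    {"LISTDATE": date(2017, 5, 20), "OFFMKTDATE": date(2017, 7, 15)},
    {"LISTDATE": date(2017, 10, 1), "OFFMKTDATE": date(2017, 12, 20)},
    {"LISTDATE": date(2017, 12, 10), "OFFMKTDATE": date(2017, 12, 31)},
]

test_days = 45

# Latest off-market date in the data, then step back test_days from it
max_date = max(h["OFFMKTDATE"] for h in homes)
split_date = max_date - timedelta(days=test_days)

# Training set: homes that went off market before the split date
train = [h for h in homes if h["OFFMKTDATE"] < split_date]

# Test set: homes that went off market on or after the split date
# AND were already listed as of the split date
test = [
    h for h in homes
    if h["OFFMKTDATE"] >= split_date and h["LISTDATE"] <= split_date
]

print(split_date, len(train), len(test))  # 2017-11-16 2 1
```

Note that the fourth record lands in neither set: it went off market inside the test window but was listed after the split date, so the LISTDATE condition drops it.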
Hands-on interactive exercise
Try this exercise by completing this sample code.
def train_test_split_date(df, split_col, test_days=____):
  """Calculate the date to split test and training sets"""
  # Find how many days our data spans
  max_date = df.____({____: ____}).collect()[0][0]
  min_date = df.____({____: ____}).collect()[0][0]
  # Subtract an integer number of days from the last date in dataset
  split_date = ____ - timedelta(days=____)
  return split_date
# Find the date to use in splitting test and train
split_date = train_test_split_date(df, ____)
# Create Sequential Test and Training Sets
____ = df.where(df[____] < split_date) 
____ = df.where(df[____] >= split_date).where(df[____] <= split_date)