Creating Time Splits
In the video, we learned why splitting data randomly can be dangerous for time series: data from the future can leak into training and cause the model to overfit. With time series, you often acquire new data as it becomes available, and you will want to retrain your model on the newest data. The video showed how to do a percentage split for test and training sets, but suppose you wish to train on all available data except for the last 45 days, which you want to use as a test set.
In this exercise, we will create a function that finds the split date so that the last 45 days of data are used for testing and the rest for training. Please note that timedelta() has already been imported for you from the standard Python library datetime.
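As a quick illustration of the date arithmetic involved, here is a minimal sketch in plain Python. The value of max_date is a made-up example, not a date taken from the course dataset.

```python
from datetime import date, timedelta

# Hypothetical last date in the dataset (illustrative value only)
max_date = date(2018, 1, 31)
test_days = 45

# Subtracting a timedelta from a date yields another date
split_date = max_date - timedelta(days=test_days)
print(split_date)  # 2017-12-17
```

Everything on or after split_date would fall into the test window, and everything before it into the training window.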
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Create a function train_test_split_date() that takes in a dataframe, df, the date column to use for splitting, split_col, and the number of days to use for the test set, test_days, which should have a default value of 45.
- Find the min and max dates for split_col using ().
- Find the date to split the test and training sets by subtracting test_days from max_date using timedelta(), which takes a days parameter; in this case, pass in test_days.
- Using OFFMKTDATE as the split_col, find split_date and use it to filter the dataframe into two new ones, train_df and test_df, where test_df is only the last 45 days of the data. Additionally, ensure that test_df only contains homes listed as of the split date by filtering df['LISTDATE'] less than or equal to split_date.
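The steps above can be sketched outside Spark to make the logic concrete. The following is a plain-Python analogue that works on a list of dictionaries rather than a PySpark dataframe, so it is not the exercise solution. The records are hypothetical; only the column names OFFMKTDATE and LISTDATE come from the exercise.

```python
from datetime import date, timedelta

def train_test_split_date(rows, split_col, test_days=45):
    """Return the date that lies test_days before the latest split_col value."""
    max_date = max(row[split_col] for row in rows)
    return max_date - timedelta(days=test_days)

# Hypothetical records (assumed values, for illustration only)
rows = [
    {"LISTDATE": date(2017, 9, 1), "OFFMKTDATE": date(2017, 10, 1)},
    {"LISTDATE": date(2017, 11, 20), "OFFMKTDATE": date(2018, 1, 5)},
    {"LISTDATE": date(2018, 1, 10), "OFFMKTDATE": date(2018, 1, 31)},
]

split_date = train_test_split_date(rows, "OFFMKTDATE")

# Training set: homes that went off the market before the split date
train_rows = [r for r in rows if r["OFFMKTDATE"] < split_date]

# Test set: homes off the market on or after the split date,
# restricted to those already listed as of the split date
test_rows = [
    r for r in rows
    if r["OFFMKTDATE"] >= split_date and r["LISTDATE"] <= split_date
]
```

With these sample records, the third home is excluded from both sets: it went off the market inside the test window but was not yet listed on the split date.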
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def train_test_split_date(df, split_col, test_days=____):
    """Calculate the date to split test and training sets"""
    # Find how many days our data spans
    max_date = df.____({____: ____}).collect()[0][0]
    min_date = df.____({____: ____}).collect()[0][0]
    # Subtract an integer number of days from the last date in dataset
    split_date = ____ - timedelta(days=____)
    return split_date

# Find the date to use in splitting test and train
split_date = train_test_split_date(df, ____)

# Create Sequential Test and Training Sets
____ = df.where(df[____] < split_date)
____ = df.where(df[____] >= split_date).where(df[____] <= split_date)