Date Math
In this example, we'll look at verifying the frequency of our data. The Mortgage dataset is supposed to have weekly data but let's make sure by lagging the report date and then taking the difference of the dates.
Recall that to create a lagged feature we will need to create a window()
. window()
allows you to return a value for each record based off some calculation against a group of records, in this case, the previous period's mortgage rate.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Cast
mort_df['DATE']
to date type withto_date()
- Create a window with the
Window()
function and useorderBy()
to sort bymort_df[DATE]
- Create a new column
DATE-1
usingwithColumn()
by lagging theDATE
column withlag()
and window it usingover(w)
- Calculate the difference between
DATE
andDATE-1
usingdatediff()
and name itDays_Between_Report
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.sql.functions import lag, datediff, to_date
from pyspark.sql.window import Window
# Cast data type
mort_df = mort_df.withColumn(____, ____(____))
# Create window
w = ____().____(____)
# Create lag column
mort_df = mort_df.____(____, ____(____, count=1).____(w))
# Calculate difference between date columns
mort_df = mort_df.withColumn(____, ____(____, ____))
# Print results
mort_df.select('Days_Between_Report').distinct().show()