Get startedGet started for free

Date Math

In this example, we'll look at verifying the frequency of our data. The Mortgage dataset is supposed to have weekly data but let's make sure by lagging the report date and then taking the difference of the dates.

Recall that to create a lagged feature we will need to create a window(). window() allows you to return a value for each record based off some calculation against a group of records, in this case, the previous period's mortgage rate.

This exercise is part of the course

Feature Engineering with PySpark

View Course

Exercise instructions

  • Cast mort_df['DATE'] to date type with to_date()
  • Create a window with the Window() function and use orderBy() to sort by mort_df[DATE]
  • Create a new column DATE-1 using withColumn() by lagging the DATE column with lag() and window it using over(w)
  • Calculate the difference between DATE and DATE-1 using datediff() and name it Days_Between_Report

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from pyspark.sql.functions import lag, datediff, to_date
from pyspark.sql.window import Window

# Cast data type
mort_df = mort_df.withColumn(____, ____(____))

# Create window
w = ____().____(____)
# Create lag column
mort_df = mort_df.____(____, ____(____, count=1).____(w))

# Calculate difference between date columns
mort_df = mort_df.withColumn(____, ____(____, ____))
# Print results
mort_df.select('Days_Between_Report').distinct().show()
Edit and Run Code