Date Math
In this example, we'll look at verifying the frequency of our data. The Mortgage dataset is supposed to contain weekly data, but let's make sure by lagging the report date and then taking the difference between the dates.
Recall that to create a lagged feature we will need to create a window with Window(). A window allows you to return a value for each record based on some calculation against a group of records, in this case, the previous period's mortgage rate.
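As a quick illustration, here is a minimal sketch of how a window combined with lag() behaves on a toy DataFrame; the column names and values below are invented for demonstration and are not part of the Mortgage dataset.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data: one rate per week (values invented for illustration)
toy_df = spark.createDataFrame(
    [('2023-01-02', 6.1), ('2023-01-09', 6.3), ('2023-01-16', 6.2)],
    ['DATE', 'RATE'])

# A window ordered by DATE lets lag() look one record back for each row
w = Window().orderBy('DATE')
toy_df = toy_df.withColumn('PREV_RATE', lag('RATE', 1).over(w))
toy_df.show()  # the first row has no previous record, so PREV_RATE is null there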
This exercise is part of the course Feature Engineering with PySpark.
Exercise instructions
- Cast mort_df['DATE'] to date type with to_date()
- Create a window with the Window() function and use orderBy() to sort by mort_df['DATE']
- Create a new column DATE-1 using withColumn() by lagging the DATE column with lag() and windowing it with over(w)
- Calculate the difference between DATE and DATE-1 using datediff() and name it Days_Between_Report (a short sketch of to_date() and datediff() follows this list)
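Before completing the template, here is a small sketch of the two date functions named above on their own; the column names and date strings are made up for demonstration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, datediff

spark = SparkSession.builder.getOrCreate()

# to_date() casts a string column to DateType; datediff() returns whole days
demo_df = spark.createDataFrame([('2023-01-02', '2023-01-09')], ['start', 'end'])
demo_df = demo_df.withColumn('start', to_date('start')).withColumn('end', to_date('end'))
demo_df.select(datediff('end', 'start').alias('days_between')).show()  # 7 days here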
Hands-on interactive exercise
Try this exercise by completing the sample code.
from pyspark.sql.functions import lag, datediff, to_date
from pyspark.sql.window import Window
# Cast data type
mort_df = mort_df.withColumn(____, ____(____))
# Create window
w = ____().____(____)
# Create lag column
mort_df = mort_df.____(____, ____(____, count=1).____(w))
# Calculate difference between date columns
mort_df = mort_df.withColumn(____, ____(____, ____))
# Print results
mort_df.select('Days_Between_Report').distinct().show()
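One possible completion of the blanks above, assuming mort_df is already loaded and its report-date column is named 'DATE' as the instructions suggest. The lag() offset is passed positionally here because the keyword name differs across Spark versions (count in Spark 2.x, offset in Spark 3+).

from pyspark.sql.functions import lag, datediff, to_date
from pyspark.sql.window import Window
# Cast data type
mort_df = mort_df.withColumn('DATE', to_date(mort_df['DATE']))
# Create window ordered by the report date
w = Window().orderBy(mort_df['DATE'])
# Create lag column holding the previous report date
mort_df = mort_df.withColumn('DATE-1', lag(mort_df['DATE'], 1).over(w))
# Calculate difference between date columns
mort_df = mort_df.withColumn('Days_Between_Report', datediff(mort_df['DATE'], mort_df['DATE-1']))
# Print results
mort_df.select('Days_Between_Report').distinct().show()

If the data really are weekly, the distinct difference should come out to 7 days, plus a null for the first record, which has nothing to lag against.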