ComenzarEmpieza gratis

Date Math

In this example, we'll look at verifying the frequency of our data. The Mortgage dataset is supposed to have weekly data but let's make sure by lagging the report date and then taking the difference of the dates.

Recall that to create a lagged feature we will need to create a window(). window() allows you to return a value for each record based off some calculation against a group of records, in this case, the previous period's mortgage rate.

Este ejercicio forma parte del curso

Feature Engineering with PySpark

Ver curso

Instrucciones del ejercicio

  • Cast mort_df['DATE'] to date type with to_date()
  • Create a window with the Window() function and use orderBy() to sort by mort_df[DATE]
  • Create a new column DATE-1 using withColumn() by lagging the DATE column with lag() and window it using over(w)
  • Calculate the difference between DATE and DATE-1 using datediff() and name it Days_Between_Report

Ejercicio interactivo práctico

Prueba este ejercicio y completa el código de muestra.

from pyspark.sql.functions import lag, datediff, to_date
from pyspark.sql.window import Window

# Cast data type
mort_df = mort_df.withColumn(____, ____(____))

# Create window
w = ____().____(____)
# Create lag column
mort_df = mort_df.____(____, ____(____, count=1).____(w))

# Calculate difference between date columns
mort_df = mort_df.withColumn(____, ____(____, ____))
# Print results
mort_df.select('Days_Between_Report').distinct().show()
Editar y ejecutar código