Get startedGet started for free

Convert window function from dot notation to SQL

We are going to add a column to a train schedule so that each row contains the number of minutes for the train to reach its next stop.

  • We have a dataframe df where df.columns == ['train_id', 'station', 'time'].
  • df is registered as a SQL table named schedule.
  • The following window function query uses dot notation. It gives a new dataframe dot_df.
window = Window.partitionBy('train_id').orderBy('time')
dot_df = df.withColumn('diff_min', 
                    (unix_timestamp(lead('time', 1).over(window),'H:m') 
                     - unix_timestamp('time', 'H:m'))/60)

Note the use of the unix_timestamp function, which is equivalent to the UNIX_TIMESTAMP SQL function.

Please be aware of the scaffolding in the sample code. Formatting the answer according to the scaffolding will ensure that your submitted answer is not erroneously rejected due to a formatting issue.

This exercise is part of the course

Introduction to Spark SQL in Python

View Course

Exercise instructions

  • Create a SQL query to obtain an identical result to dot_df. Please format the query according to the scaffolding (i.e., the placeholder underscores _____) .

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a SQL query to obtain an identical result to dot_df
query = """
SELECT *, 
(____(____(time, 1) ____ (____ BY train_id ____ BY time),'H:m') 
 - ____(time, 'H:m'))/60 AS diff_min 
FROM schedule 
"""
sql_df = spark.sql(query)
sql_df.show()
Edit and Run Code