More ID tricks
Once you define a Spark process, you'll likely want to run it many times. Depending on your needs, you may want to start your IDs at a certain value so they don't overlap with IDs from previous runs of the Spark task. This behavior is similar to how auto-incrementing IDs work in a relational database. Your task is to make sure that the IDs output from a monthly Spark job start at the highest value from the previous month.
The `spark` session and two DataFrames, `voter_df_march` and `voter_df_april`, are available in your workspace. The `pyspark.sql.functions` library is available under the alias `F`.
This exercise is part of the course Cleaning Data with PySpark.
Exercise instructions
- Determine the highest `ROW_ID` in `voter_df_march` and save it in the variable `previous_max_ID`. The statement `.rdd.max()[0]` will get the maximum ID.
- Add a `ROW_ID` column to `voter_df_april` starting at the value of `previous_max_ID` + 1.
- Show the `ROW_ID`s from both DataFrames and compare.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Determine the highest ROW_ID and save it in previous_max_ID
____ = ____.select('ROW_ID').rdd.max()[0] + 1
# Add a ROW_ID column to voter_df_april starting at the desired value
voter_df_april = ____.withColumn('ROW_ID', ____ + ____)
# Show the ROW_ID from both DataFrames and compare
____.select('ROW_ID').show()
____
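The offset logic the exercise asks for can be sketched without Spark. This is a plain-Python illustration, not the exercise solution: it assigns sequential IDs to stand in for `F.monotonically_increasing_id()` (which in real Spark only guarantees increasing, not consecutive, values), then offsets the second batch by the previous maximum plus one so the two runs never overlap.

```python
# Simulate two monthly batches of records (contents are placeholders)
march_rows = ["voter_%d" % i for i in range(100)]  # 100 March records
april_rows = ["voter_%d" % i for i in range(80)]   # 80 April records

# Assign ROW_IDs to March starting at 0, like a fresh ID column
march_ids = list(range(len(march_rows)))

# Determine the highest ROW_ID from the previous run
previous_max_ID = max(march_ids)

# April IDs start at previous_max_ID + 1, so no ID is reused
april_ids = [previous_max_ID + 1 + i for i in range(len(april_rows))]

# The two ID ranges are disjoint and contiguous
print(march_ids[-1], april_ids[0])
```

In the actual exercise, the same offset is applied inside `withColumn` by adding `previous_max_ID` to the ID-generating expression; the pattern of "read the old maximum, then shift the new column" is what carries over.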