More ID tricks
Once you define a Spark process, you'll likely want to run it many times. Depending on your needs, you may want to start your IDs at a certain value so they don't overlap with IDs from previous runs of the Spark task. This behavior is similar to how auto-incrementing IDs work in a relational database. Your task is to make sure that the IDs output from a monthly Spark job start at the highest value from the previous month.
The `spark` session and two DataFrames, `voter_df_march` and `voter_df_april`, are available in your workspace. The `pyspark.sql.functions` library is available under the alias `F`.
This exercise is part of the course Cleaning Data with PySpark.
Exercise instructions
- Determine the highest `ROW_ID` in `voter_df_march` and save it in the variable `previous_max_ID`. The statement `.rdd.max()[0]` will get the maximum ID.
- Add a `ROW_ID` column to `voter_df_april` starting at the value of `previous_max_ID` + 1.
- Show the `ROW_ID`s from both DataFrames and compare.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Determine the highest ROW_ID and save it in previous_max_ID
____ = ____.select('ROW_ID').rdd.max()[0] + 1
# Add a ROW_ID column to voter_df_april starting at the desired value
voter_df_april = ____.withColumn('ROW_ID', ____ + ____)
# Show the ROW_ID from both DataFrames and compare
____.select('ROW_ID').show()
____
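The offset logic the exercise asks for can be sketched without Spark. This is a plain-Python illustration, not the exercise solution: it assigns sequential IDs to stand in for `F.monotonically_increasing_id()` (which in real Spark only guarantees increasing, not consecutive, values), then offsets the second batch by the previous maximum plus one so the two runs never overlap.

```python
# Simulate two monthly batches of records (contents are placeholders)
march_rows = ["voter_%d" % i for i in range(100)]  # 100 March records
april_rows = ["voter_%d" % i for i in range(80)]   # 80 April records

# Assign ROW_IDs to March starting at 0, like a fresh ID column
march_ids = list(range(len(march_rows)))

# Determine the highest ROW_ID from the previous run
previous_max_ID = max(march_ids)

# April IDs start at previous_max_ID + 1, so no ID is reused
april_ids = [previous_max_ID + 1 + i for i in range(len(april_rows))]

# The two ID ranges are disjoint and contiguous
print(march_ids[-1], april_ids[0])
```

In the actual exercise, the same offset is applied inside `withColumn` by adding `previous_max_ID` to the ID-generating expression; the pattern of "read the old maximum, then shift the new column" is what carries over.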