Adding an ID Field

When working with data, you often need only a few fields and want to run operations on just those. In this exercise, you will find all the unique voter names in the DataFrame and give each one a unique ID number. Remember that Spark assigns these IDs based on the DataFrame's partitions, so the ID values may be much greater than the actual number of rows in the DataFrame.

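Because Spark processes data lazily, the IDs are not actually generated until an action is performed. The values are unique and increasing but not consecutive, so they can look somewhat random: they depend on how the data is split into partitions, which in turn varies with the size of the dataset.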
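To see why partition-based IDs can be far larger than the row count, here is a minimal pure-Python sketch. It assumes the layout the Spark documentation describes for its current implementation: the partition index in the upper bits of a 64-bit integer and the row's position within the partition in the lower 33 bits. The helper name sketch_row_id and the partition sizes are hypothetical, chosen only for illustration; this is not Spark's own code.

```python
# Illustrative sketch only: mimics the documented bit layout of Spark's
# partition-based IDs. The function name and the data are hypothetical.
ROW_BITS = 33  # lower 33 bits hold the row's position within its partition


def sketch_row_id(partition_index: int, row_in_partition: int) -> int:
    """Combine a partition index and a per-partition row position into one ID."""
    return (partition_index << ROW_BITS) + row_in_partition


# Hypothetical layout: 5 rows spread over 3 partitions (2, 2, and 1 rows).
rows_per_partition = [2, 2, 1]
ids = [
    sketch_row_id(partition, row)
    for partition, row_count in enumerate(rows_per_partition)
    for row in range(row_count)
]

print(ids)
```

With only five rows, the largest ID in this sketch is already above 17 billion, because each new partition starts its numbering at the next multiple of 2**33. The IDs remain unique and increasing, which is all the exercise relies on.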

The spark session and a Spark DataFrame df containing the DallasCouncilVotes.csv.gz file are available in your workspace. The pyspark.sql.functions library is available under the alias F.

This exercise is part of the course

Cleaning Data with PySpark


Exercise instructions

Select the unique entries from the column VOTER NAME and create a new DataFrame called voter_df.
Count the rows in the voter_df DataFrame.
Add a ROW_ID column using the appropriate Spark function.
Show the rows with the 10 highest ROW_IDs.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Select all the unique council voters
voter_df = df.____(df["VOTER NAME"]).____()

# Count the rows in voter_df
print("\nThere are %d rows in the voter_df DataFrame.\n" % ____)

# Add a ROW_ID
voter_df = voter_df.____('ROW_ID', F.____())

# Show the 10 rows with the highest IDs
voter_df.orderBy(voter_df.____.desc()).show(____)
