Adding an ID Field
When working with data, you sometimes want to access only certain fields and perform operations on them. In this case, find all the unique voter names in the DataFrame and add a unique ID number to each. Remember that Spark IDs are assigned based on the DataFrame partition; as such, the ID values may be much greater than the actual number of rows in the DataFrame.
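With Spark's lazy processing, the IDs are not actually generated until an action is performed. Because the values depend on how the rows are split across partitions, they can look somewhat arbitrary, and the partitioning itself can vary with the size of the dataset.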
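To see why partition-based IDs can far exceed the row count, here is a minimal plain-Python sketch. It assumes the layout described in the Spark documentation for `monotonically_increasing_id()`: the partition index in the upper 31 bits and the row's position within its partition in the lower 33 bits. The function name `simulated_id` and the partition sizes are hypothetical, chosen only for illustration. This is not Spark code.

```python
# Number of low-order bits reserved for the row position within a partition
# (assumption: the layout documented for monotonically_increasing_id()).
RECORD_BITS = 33


def simulated_id(partition_index: int, row_in_partition: int) -> int:
    """Combine a partition index and a row position into one 64-bit style ID."""
    return (partition_index << RECORD_BITS) + row_in_partition


# Hypothetical DataFrame layout: 3 partitions holding 2, 1, and 2 rows (5 rows total).
rows_per_partition = [2, 1, 2]

ids = [
    simulated_id(partition, row)
    for partition, row_count in enumerate(rows_per_partition)
    for row in range(row_count)
]

print("row count:", len(ids))
print("largest ID:", max(ids))
```

Under these assumptions, the IDs are unique and increasing but not consecutive: every new partition starts at a multiple of 2**33, so even five rows can produce IDs in the billions.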
The spark session and a Spark DataFrame df containing the DallasCouncilVotes.csv.gz file are available in your workspace. The pyspark.sql.functions library is available under the alias F.
This exercise is part of the course Cleaning Data with PySpark.
Exercise instructions
- Select the unique entries from the column VOTER NAME and create a new DataFrame called voter_df.
- Count the rows in the voter_df DataFrame.
- Add a ROW_ID column using the appropriate Spark function.
- Show the rows with the 10 highest ROW_IDs.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Select all the unique council voters
voter_df = df.____(df["VOTER NAME"]).____()
# Count the rows in voter_df
print("\nThere are %d rows in the voter_df DataFrame.\n" % ____)
# Add a ROW_ID
voter_df = voter_df.____('ROW_ID', F.____())
# Show the rows with 10 highest IDs in the set
voter_df.orderBy(voter_df.____.desc()).show(____)