Aan de slagGa gratis aan de slag

Adding an ID Field

When working with data, you sometimes only want to access certain fields and perform various operations. In this case, find all the unique voter names from the DataFrame and add a unique ID number. Remember that Spark IDs are assigned based on the DataFrame partition - as such the ID values may be much greater than the actual number of rows in the DataFrame.

With Spark's lazy processing, the IDs are not actually generated until an action is performed and can be somewhat random depending on the size of the dataset.

The spark session and a Spark DataFrame df containing the DallasCouncilVotes.csv.gz file are available in your workspace. The pyspark.sql.functions library is available under the alias F.

Deze oefening maakt deel uit van de cursus

Cleaning Data with PySpark

Cursus bekijken

Oefeninstructies

  • Select the unique entries from the column VOTER NAME and create a new DataFrame called voter_df.
  • Count the rows in the voter_df DataFrame.
  • Add a ROW_ID column using the appropriate Spark function.
  • Show the rows with the 10 highest ROW_IDs.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

# Select all the unique council voters
voter_df = df.____(df["VOTER NAME"]).____()

# Count the rows in voter_df
print("\nThere are %d rows in the voter_df DataFrame.\n" % ____)

# Add a ROW_ID
voter_df = voter_df.____('ROW_ID', F.____())

# Show the rows with 10 highest IDs in the set
voter_df.orderBy(voter_df.____.desc()).show(____)
Code bewerken en uitvoeren