IDs with different partitions
You've just completed adding an ID field to a DataFrame. Now take a look at what happens when you do the same thing on DataFrames with different numbers of partitions.
To check the number of partitions, call the method .rdd.getNumPartitions() on a DataFrame.
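For reference, here is a minimal, self-contained sketch of that call. The SparkSession creation and the spark.range() example data are assumptions for illustration only; in the exercise workspace the spark session is already provided.

from pyspark.sql import SparkSession

# Assumed setup: build a local SparkSession (the exercise already provides `spark`)
spark = SparkSession.builder.appName('partition_check').getOrCreate()

# Hypothetical example DataFrame, explicitly split into 4 partitions
example_df = spark.range(1000).repartition(4)

# .rdd.getNumPartitions() reports how many partitions back the DataFrame
print(example_df.rdd.getNumPartitions())  # prints 4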
The spark session and two DataFrames, voter_df and voter_df_single, are available in your workspace. The instructions will help you discover the difference between the DataFrames. The pyspark.sql.functions library is available under the alias F.
Exercise instructions
- Print the number of partitions on each DataFrame.
- Add a ROW_ID field to each DataFrame.
- Show the top 10 IDs in each DataFrame.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Print the number of partitions in each DataFrame
print("\nThere are %d partitions in the voter_df DataFrame.\n" % ____)
print("\nThere are %d partitions in the voter_df_single DataFrame.\n" % ____)
# Add a ROW_ID field to each DataFrame
voter_df = voter_df.____('ROW_ID', ____)
voter_df_single = ____
# Show the top 10 IDs in each DataFrame
voter_df.____(voter_df.____.desc()).show(____)
____.orderBy(____).show(10)
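One possible completion is sketched below. It assumes voter_df, voter_df_single, and the alias F are available as described above, and it uses pyspark.sql.functions.monotonically_increasing_id() to generate the IDs; the exact solution the exercise expects may differ in minor details.

# Print the number of partitions in each DataFrame
print("\nThere are %d partitions in the voter_df DataFrame.\n" % voter_df.rdd.getNumPartitions())
print("\nThere are %d partitions in the voter_df_single DataFrame.\n" % voter_df_single.rdd.getNumPartitions())

# Add a ROW_ID field to each DataFrame using a monotonically increasing ID
voter_df = voter_df.withColumn('ROW_ID', F.monotonically_increasing_id())
voter_df_single = voter_df_single.withColumn('ROW_ID', F.monotonically_increasing_id())

# Show the top 10 IDs in each DataFrame
voter_df.orderBy(voter_df.ROW_ID.desc()).show(10)
voter_df_single.orderBy(voter_df_single.ROW_ID.desc()).show(10)

The point of the comparison: monotonically_increasing_id() puts the partition ID in the upper bits of the 64-bit result and a per-partition counter in the lower bits, so the IDs are unique and increasing but not consecutive. A single-partition DataFrame therefore gets small, sequential IDs, while a DataFrame with many partitions shows much larger top IDs, with big gaps between partitions.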