Exercise

Writing Spark configurations

Now that you've reviewed some of the Spark configurations on your cluster, you want to modify some of the settings to tune Spark to your needs. You'll then import some data to verify that your changes have taken effect.

The spark.sql.shuffle.partitions setting is initially at its default value of 200 partitions.
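
To check the current value before changing anything, a minimal sketch might look like the following. It assumes the spark session object provided in this exercise is already in scope; the helper name current_shuffle_partitions is hypothetical and not part of the exercise.

```python
def current_shuffle_partitions(spark):
    """Return the session's shuffle partition setting as an int.

    `spark` is assumed to be an existing SparkSession.
    """
    # spark.conf.get returns configuration values as strings
    return int(spark.conf.get("spark.sql.shuffle.partitions"))
```

Calling print(current_shuffle_partitions(spark)) in the exercise session would show the value currently in effect.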

The spark object is available for use. A file named departures.txt.gz is available for import. An initial DataFrame containing the distinct rows from departures.txt.gz is available as departures_df.

Instructions

  • Store the number of partitions in departures_df in the variable before.
  • Change the spark.sql.shuffle.partitions configuration to 500 partitions.
  • Recreate the departures_df DataFrame by reading the distinct rows from the departures file.
  • Print the number of partitions from before and after the configuration change.
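
The steps above can be sketched roughly as follows. This is one possible approach, not the official solution: it assumes spark is an existing SparkSession, that the departures file can be read with spark.read.csv (the exercise does not state the file format), and the helper name reload_with_shuffle_partitions is hypothetical.

```python
def reload_with_shuffle_partitions(spark, departures_df, path, num_partitions=500):
    """Change the shuffle partition setting and rebuild the DataFrame.

    Returns (before, after, new_df), where before and after are the
    partition counts of the old and the rebuilt DataFrame.
    """
    # Partition count of the existing DataFrame, captured before the change
    before = departures_df.rdd.getNumPartitions()

    # Runtime SQL settings are changed through spark.conf; the value is
    # passed as a string, which is accepted across PySpark versions
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))

    # Assumption: the file is CSV-like. distinct() requires a shuffle, so
    # the rebuilt DataFrame is planned with the new setting
    new_df = spark.read.csv(path).distinct()
    after = new_df.rdd.getNumPartitions()

    return before, after, new_df
```

In the exercise session this might be used as before, after, departures_df = reload_with_shuffle_partitions(spark, departures_df, "departures.txt.gz"), followed by printing both counts. On recent Spark versions with adaptive query execution enabled, the observed count after the change may be lower than the configured value, because small shuffle partitions can be coalesced.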