Writing Spark configurations
Now that you've reviewed some of the Spark configurations on your cluster, you want to modify some of the settings to tune Spark to your needs. You'll import some data to verify that your changes have affected the cluster.
The `spark.sql.shuffle.partitions` configuration is initially set to the default value of 200 partitions.

The `spark` object is available for use. A file named `departures.txt.gz` is available for import. An initial DataFrame containing the distinct rows from `departures.txt.gz` is available as `departures_df`.
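Before changing anything, it can help to confirm the current setting. A minimal sketch, assuming an active `SparkSession` named `spark` as described above:

```python
# Read the current shuffle partition setting;
# conf.get returns the value as a string, e.g. '200'
print(spark.conf.get('spark.sql.shuffle.partitions'))
```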
This exercise is part of the course Cleaning Data with PySpark.

Exercise instructions
- Store the number of partitions in `departures_df` in the variable `before`.
- Change the `spark.sql.shuffle.partitions` configuration to 500 partitions.
- Recreate the `departures_df` DataFrame, reading the distinct rows from the departures file.
- Print the number of partitions from before and after the configuration change.
Hands-on interactive exercise

Try this exercise by completing the sample code below.
```python
# Store the number of partitions in variable
before = departures_df.____

# Configure Spark to use 500 partitions
____('spark.sql.shuffle.partitions', ____)

# Recreate the DataFrame using the departures data file
departures_df = spark.read.csv('departures.txt.gz').____

# Print the number of partitions for each instance
print("Partition count before change: %d" % ____)
print("Partition count after change: %d" % ____)
```