Comparing broadcast vs normal joins
You've created two types of joins, normal and broadcast. Now your manager would like to know what performance improvement you get from using Spark's broadcast optimization. If the results are promising, you'll be given more opportunity to tweak the Spark setup as needed.
Your DataFrames normal_df and broadcast_df are available for your use.
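For context, here is a minimal sketch of how the two DataFrames might have been built; the source DataFrames (flights_df and airports_df), their column names, and the join condition are assumptions for illustration, not part of the exercise:

from pyspark.sql.functions import broadcast

# Normal join: Spark may shuffle both DataFrames across the cluster
normal_df = flights_df.join(airports_df,
    flights_df["Destination Airport"] == airports_df["IATA"])

# Broadcast join: the smaller DataFrame is copied to every worker node,
# so the larger DataFrame does not need to be shuffled
broadcast_df = flights_df.join(broadcast(airports_df),
    flights_df["Destination Airport"] == airports_df["IATA"])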
This exercise is part of the course Cleaning Data with PySpark.
Exercise instructions
- Execute .count() on the normal DataFrame.
- Execute .count() on the broadcast DataFrame.
- Print the counts and durations of the DataFrames, noting any differences.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
start_time = time.time()
# Count the number of rows in the normal DataFrame
normal_count = ____
normal_duration = time.time() - start_time

start_time = time.time()
# Count the number of rows in the broadcast DataFrame
broadcast_count = ____
broadcast_duration = time.time() - start_time

# Print the counts and the duration of the tests
print("Normal count:\t\t%d\tduration: %f" % (normal_count, normal_duration))
print("Broadcast count:\t%d\tduration: %f" % (broadcast_count, broadcast_duration))