Comparing broadcast vs normal joins
You've created two types of joins: normal and broadcasted. Now your manager would like to know how much performance improves when you use Spark optimizations. If the results are promising, you'll be given more opportunity to tweak the Spark setup as needed.
Your DataFrames normal_df and broadcast_df are available for your use.
This exercise is part of the course
Cleaning Data with PySpark
Exercise instructions
- Execute .count() on the normal DataFrame.
- Execute .count() on the broadcasted DataFrame.
- Print the count and duration of the DataFrames, noting any differences.
Hands-on interactive exercise
Try this exercise by completing the sample code below.
import time
start_time = time.time()
# Count the number of rows in the normal DataFrame
normal_count = ____
normal_duration = time.time() - start_time
start_time = time.time()
# Count the number of rows in the broadcast DataFrame
broadcast_count = ____
broadcast_duration = time.time() - start_time
# Print the counts and the duration of the tests
print("Normal count:\t\t%d\tduration: %f" % (normal_count, normal_duration))
print("Broadcast count:\t%d\tduration: %f" % (broadcast_count, broadcast_duration))
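The timing pattern in the template can also be wrapped in a small helper, which keeps the two measurements identical in structure. The following is a minimal sketch, not the course's solution: `timed_count` and `FakeFrame` are hypothetical names introduced here, and `FakeFrame` is only a stand-in with a `.count()` method so the example runs without a Spark session.

```python
import time


def timed_count(df):
    """Return (row_count, seconds_elapsed) for a single df.count() call."""
    start = time.perf_counter()
    rows = df.count()  # on a Spark DataFrame, count() is an action and runs the job
    return rows, time.perf_counter() - start


class FakeFrame:
    """Hypothetical stand-in exposing .count(); not a Spark class."""

    def __init__(self, rows, delay):
        self._rows = rows
        self._delay = delay

    def count(self):
        time.sleep(self._delay)  # simulate the cost of executing the join
        return self._rows


if __name__ == "__main__":
    # The delays are made-up values for illustration, not measured results.
    normal_count, normal_duration = timed_count(FakeFrame(1000, 0.05))
    broadcast_count, broadcast_duration = timed_count(FakeFrame(1000, 0.01))
    print("Normal count:\t\t%d\tduration: %f" % (normal_count, normal_duration))
    print("Broadcast count:\t%d\tduration: %f" % (broadcast_count, broadcast_duration))
```

With the exercise's real DataFrames, the same helper would presumably be called as `timed_count(normal_df)` and `timed_count(broadcast_df)`. A single run can be noisy, so repeating each measurement a few times gives a fairer comparison.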