When to use a broadcast join
Now that you know shuffle is the bottleneck, your team at Global Retail Analytics wants to fix a slow query. It joins a sales_transactions table (50 million rows) with a product_categories table (2,000 rows) using a standard left join, and it runs for over ten minutes. You suggest switching to a broadcast join using F.broadcast().
Why would a broadcast join speed up this query?
This exercise is part of the course
Data Transformation with Spark SQL in Databricks