Optimization and query plans
1. Optimization and query plans
In the last video, we saw how delaying collect sped up the team's queries. But how does Polars actually make that happen? Let's demystify query optimization.2. Introducing query optimization
Polars uses a query optimizer behind the scenes. It takes our code to create the naive plan and then optimizes it by doing things like3. Introducing query optimization
processing the minimal number of rows or columns.4. Introducing query optimization
Or find serial operations5. Introducing query optimization
And run them in parallel.6. Introducing query optimization
Or find any operations repeated across queries7. Introducing query optimization
And instead, share the output, as we saw with collect_all. Let's examine the unoptimized and optimized query plans to see how this works in practice.8. Most request types by department
The Chicago data team now wants to understand which departments handle the most types of requests. We'll build this query step by step, then look at how Polars optimizes it.9. Most request types by department
They start by filtering to completed requests.10. Most request types by department
Then they group by department.11. Most request types by department
And aggregate the number of unique request types with n_unique.12. Most request types by department
Finally, they sort the results and keep the top five. That's the full query, but what's Polars actually planning to do with it? When we build a lazy query, Polars first creates an unoptimized plan called the naive plan.13. Unoptimized plan
We can examine the naive plan by printing the LazyFrame.14. Unoptimized plan
We read query plans from the bottom. The first stage scans the CSV. PROJECT star-slash-39 means all 39 columns are loaded "Project" or "Projection" means choosing which columns to use.15. Unoptimized plan
The next block filters for status after the full CSV has been loaded.16. Unoptimized plan
Then we group by the department and count the request types.17. Unoptimized plan
The last step sorts the result and slices the first five rows. So the unoptimized plan is really just our code reflected back in the same order. Now let's see how Polars rewrites this plan to run faster.18. Optimized plan
To see the optimized plan, we call explain() and print the result.19. Optimized plan
The optimized plan shows PROJECT 3 out of 39. Polars sees it only needs three columns from the CSV. Project always means selecting columns. SELECTION on the bottom line says Status must equal Completed, so Polars also limits the rows it reads.20. Optimized plan
The next line says simple pi 2/2 followed by two column names. Pi is short for project, as this is another place where Polars sees that only two columns are needed for the rest of the query.21. Optimized plan
Above that we have the groupby-agg step which is the same as in the unoptimized plan. However, there is still one query optimization left.22. Optimized plan
At the top, Polars has combined the sort and slice operations into two steps: SORT BY and FILTER because the query optimizer sees that it can find the top 10 rows in the n_request_types column, filter out the rest, and then sort just those 10 rows. This is much faster than sorting the entire DataFrame!23. Optimized plan as a graph
We can also see the optimized plan as a graph with show_graph. This makes the flow easier to read at a glance from scan to final sort.24. Further optimizations
Now that we can read query plans, the team asks us to check another pipeline they think might be sub-optimal.25. Further optimizations
In this pipeline, there are two separate filter statements26. Further optimizations
followed by two with_columns statements. In eager mode - where each operation runs immediately - this would be sub-optimal because Polars would: - apply the filters one after the other - run the with_columns one after the other Things are different in lazy mode.27. Further optimizations
In the optimized plan, Polars combines the two filters into a single SELECTION with an AND predicate it can run in parallel.28. Further optimizations
Polars also clusters the two with_columns operations into one block, so it can run them in parallel. As you start to work with more complex queries, it becomes essential to read the query plan and understand how Polars processes your data.29. Let's practice!
Now it's your turn to optimize some queries!Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.