Optimization and query plans

1. Optimization and query plans

In the last video, we saw how delaying collect sped up the team's queries. But how does Polars actually make that happen? Let's demystify query optimization.

2. Introducing query optimization

Polars uses a query optimizer behind the scenes. It takes our code to create the naive plan and then optimizes it by doing things like

3. Introducing query optimization

processing the minimal number of rows or columns.

4. Introducing query optimization

Or find serial operations

5. Introducing query optimization

And run them in parallel.

6. Introducing query optimization

Or find any operations repeated across queries

7. Introducing query optimization

And instead, share the output, as we saw with collect_all. Let's examine the unoptimized and optimized query plans to see how this works in practice.

8. Most request types by department

The Chicago data team now wants to understand which departments handle the most types of requests. We'll build this query step by step, then look at how Polars optimizes it.

9. Most request types by department

They start by filtering to completed requests.

10. Most request types by department

Then they group by department.

11. Most request types by department

And aggregate the number of unique request types with n_unique.

12. Most request types by department

Finally, they sort the results and keep the top five. That's the full query, but what's Polars actually planning to do with it? When we build a lazy query, Polars first creates an unoptimized plan called the naive plan.

13. Unoptimized plan

We can examine the naive plan by printing the LazyFrame.

14. Unoptimized plan

We read query plans from the bottom. The first stage scans the CSV. PROJECT star-slash-39 means all 39 columns are loaded "Project" or "Projection" means choosing which columns to use.

15. Unoptimized plan

The next block filters for status after the full CSV has been loaded.

16. Unoptimized plan

Then we group by the department and count the request types.

17. Unoptimized plan

The last step sorts the result and slices the first five rows. So the unoptimized plan is really just our code reflected back in the same order. Now let's see how Polars rewrites this plan to run faster.

18. Optimized plan

To see the optimized plan, we call explain() and print the result.

19. Optimized plan

The optimized plan shows PROJECT 3 out of 39. Polars sees it only needs three columns from the CSV. Project always means selecting columns. SELECTION on the bottom line says Status must equal Completed, so Polars also limits the rows it reads.

20. Optimized plan

The next line says simple pi 2/2 followed by two column names. Pi is short for project, as this is another place where Polars sees that only two columns are needed for the rest of the query.

21. Optimized plan

Above that we have the groupby-agg step which is the same as in the unoptimized plan. However, there is still one query optimization left.

22. Optimized plan

At the top, Polars has combined the sort and slice operations into two steps: SORT BY and FILTER because the query optimizer sees that it can find the top 10 rows in the n_request_types column, filter out the rest, and then sort just those 10 rows. This is much faster than sorting the entire DataFrame!

23. Optimized plan as a graph

We can also see the optimized plan as a graph with show_graph. This makes the flow easier to read at a glance from scan to final sort.

24. Further optimizations

Now that we can read query plans, the team asks us to check another pipeline they think might be sub-optimal.

25. Further optimizations

In this pipeline, there are two separate filter statements

26. Further optimizations

followed by two with_columns statements. In eager mode - where each operation runs immediately - this would be sub-optimal because Polars would: - apply the filters one after the other - run the with_columns one after the other Things are different in lazy mode.

27. Further optimizations

In the optimized plan, Polars combines the two filters into a single SELECTION with an AND predicate it can run in parallel.

28. Further optimizations

Polars also clusters the two with_columns operations into one block, so it can run them in parallel. As you start to work with more complex queries, it becomes essential to read the query plan and understand how Polars processes your data.

29. Let's practice!

Now it's your turn to optimize some queries!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.