Using optimization to full effect

1. Using optimization to full effect

We've seen how Polars optimizes queries automatically. Now let's help the Chicago team go further with techniques that make their pipelines even faster.

2. Pivot in lazy mode

In one pipeline, the team has a query with the monthly requests by department in long format. They need to pivot it on department to wide format while keeping their query optimized and scalable by doing it in lazy mode.

3. Pivoting departments

We know they can do a lazy pivot in this case because the pivoted column names will be the department names.

4. Pivoting departments

We start by calling pivot on the monthly_requests LazyFrame.

5. Pivoting departments

As in eager mode, we specify to pivot on department, with month as the row index and values from the count column.

6. Pivoting departments

And for a lazy pivot, we pass the department_names list as the on_columns argument.

7. Pivoting departments

Now the team can run an optimized pipeline end-to-end to get their pivoted output.

8. Profiling lazy queries

With the pivot sorted, the team has a new question: how does Polars actually spend its time running a query? To find out, we profile one of their queries on the cost of requests.

9. Profiling lazy queries

They first cast the integer columns to 32-bit integers to reduce memory usage.

10. Profiling lazy queries

Then they filter the dataset to completed requests.

11. Profiling lazy queries

And finally, they sort by estimated cost.

12. Profiling lazy queries

To get timing information, we execute the query with profile instead of collect. The output is a tuple where the first element is the output DataFrame and the second is a timings DataFrame.

13. Reading profile output

The node column shows the name of each stage, while start and end mark when each stage started and stopped in microseconds.

14. Reading profile output

We've also added duration_seconds to make it easier to read. The first node, optimization, combines several operations into one stage, including parsing the CSV. Because the optimizer merges steps, profile gives limited visibility into individual operations. The next line shows time spent casting integers and sorting by cost. The sort takes the longest, and the team realizes they can drop it since it's not strictly needed.

15. Sorted data and fast paths

But sorting isn't always wasted work. Polars has another category of optimizations for sorted data called fast-path algorithms. Imagine we wanted rows with id less than 4 in this table.

16. Sorted data and fast paths

If Polars knows the data is sorted, it can stop on the third row.

17. Fast-path filter

In this query, the data analytics team analyzes requests from before the 2020 city council elections. But how do they tell Polars that the CSV is sorted by the CREATED_DATE column?

18. Fast-path filter

We do this by passing CREATED_DATE to the set_sorted method.

19. Fast-path filter

Now we apply the filter to get the requests before 2020 and execute. Polars scans through the CSV but stops when it finds the first row created after 2019.

20. Fast-path statistics

Some common statistics can also be calculated on fast paths.

21. Fast-path statistics

Polars can get the min of a sorted column by taking the first value or the median of a column by jumping directly to the middle values.

22. Other fast-path optimizations

There are still more pipelines where the team can apply fast-path optimizations. For example, joins can be much faster if Polars knows that both of the join columns are sorted.

23. Let's practice!

In this video, we saw how we can keep a pipeline lazy even if we need to pivot, how profiling shows where time is spent, and how sorted fast-paths let Polars run certain queries with much less work. Now it's time to practice getting more out of optimization.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.