Using optimization to full effect
1. Using optimization to full effect
We've seen how Polars optimizes queries automatically. Now let's help the Chicago team go further with techniques that make their pipelines even faster.2. Pivot in lazy mode
In one pipeline, the team has a query with the monthly requests by department in long format. They need to pivot it on department to wide format while keeping their query optimized and scalable by doing it in lazy mode.3. Pivoting departments
We know they can do a lazy pivot in this case because the pivoted column names will be the department names.4. Pivoting departments
We start by calling pivot on the monthly_requests LazyFrame.5. Pivoting departments
As in eager mode, we specify to pivot on department, with month as the row index and values from the count column.6. Pivoting departments
And for a lazy pivot, we pass the department_names list as the on_columns argument.7. Pivoting departments
Now the team can run an optimized pipeline end-to-end to get their pivoted output.8. Profiling lazy queries
With the pivot sorted, the team has a new question: how does Polars actually spend its time running a query? To find out, we profile one of their queries on the cost of requests.9. Profiling lazy queries
They first cast the integer columns to 32-bit integers to reduce memory usage.10. Profiling lazy queries
Then they filter the dataset to completed requests.11. Profiling lazy queries
And finally, they sort by estimated cost.12. Profiling lazy queries
To get timing information, we execute the query with profile instead of collect. The output is a tuple where the first element is the output DataFrame and the second is a timings DataFrame.13. Reading profile output
The node column shows the name of each stage, while start and end mark when each stage started and stopped in microseconds.14. Reading profile output
We've also added duration_seconds to make it easier to read. The first node, optimization, combines several operations into one stage, including parsing the CSV. Because the optimizer merges steps, profile gives limited visibility into individual operations. The next line shows time spent casting integers and sorting by cost. The sort takes the longest, and the team realizes they can drop it since it's not strictly needed.15. Sorted data and fast paths
But sorting isn't always wasted work. Polars has another category of optimizations for sorted data called fast-path algorithms. Imagine we wanted rows with id less than 4 in this table.16. Sorted data and fast paths
If Polars knows the data is sorted, it can stop on the third row.17. Fast-path filter
In this query, the data analytics team analyzes requests from before the 2020 city council elections. But how do they tell Polars that the CSV is sorted by the CREATED_DATE column?18. Fast-path filter
We do this by passing CREATED_DATE to the set_sorted method.19. Fast-path filter
Now we apply the filter to get the requests before 2020 and execute. Polars scans through the CSV but stops when it finds the first row created after 2019.20. Fast-path statistics
Some common statistics can also be calculated on fast paths.21. Fast-path statistics
Polars can get the min of a sorted column by taking the first value or the median of a column by jumping directly to the middle values.22. Other fast-path optimizations
There are still more pipelines where the team can apply fast-path optimizations. For example, joins can be much faster if Polars knows that both of the join columns are sorted.23. Let's practice!
In this video, we saw how we can keep a pipeline lazy even if we need to pivot, how profiling shows where time is spent, and how sorted fast-paths let Polars run certain queries with much less work. Now it's time to practice getting more out of optimization.Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.