Effective query execution
1. Effective query execution
Welcome to the course! I'm Liam, a data scientist
2. Meet your instructor
and Polars contributor. I'll guide you through building scalable data pipelines with Polars.
3. From laptop to cloud
Here's the promise of Polars: you write one pipeline on your laptop, and that same code scales to larger-than-memory, multi-file datasets. But making that happen isn't automatic. This course teaches you the techniques that make it possible.
4. Is this course for you?
Before we dive in, this course builds on two earlier ones. You should be comfortable creating a lazy query with scan_csv and writing expressions with filter, select, and group_by. If not, start with Introduction to Polars and Transforming Data with Polars.
5. Chapter 1 - optimization
So what's ahead? We start in Chapter 1 with query optimization,
6. Chapter 1 - optimization
learning how Polars can automatically speed up your queries when you structure them right.
7. Chapter 2 - effective I/O
Chapter 2 tackles data sources like CSV and Parquet, and how file format choices impact pipeline performance.
8. Chapter 3 - richer dtypes
Chapter 3 goes deeper into Polars' type system, including nested dtypes and memory-efficient representations that keep your pipelines less memory hungry.
9. Chapter 4 - scaling pipelines
And Chapter 4 scales up with the streaming engine for datasets that don't fit in memory, plus testing to keep things robust. By the end, you'll have a complete toolkit for production-ready Polars pipelines.
10. Chicago requests dataset
Alright, let's jump in. The Mayor of Chicago's data analytics team has built some Polars pipelines, but their queries are slow, and they've called us in to help.
11. Inspecting the dataset
Their main dataset is massive, with millions of rows across 39 columns, recording every service request from Chicago's citizens. We ask a team member to walk us through their code, and they start by creating a LazyFrame called requests with pl.scan_csv.
12. Inspecting the dataset
Then they call collect to execute the query. Recall, this turns the LazyFrame into an in-memory DataFrame.
13. Inspecting the dataset
And then head to grab the first five rows. We see pothole complaints, tree trim requests, garbage pickups. But here's the problem: they loaded the entire CSV just to look at five rows.
14. Limiting rows processed
We show them a better approach: call head before collect, while the query is still lazy.
15. Limiting rows processed
Now, when we call collect, Polars knows we only need five rows and stops reading early. A real time-saver during development. But the bigger lesson here is to delay calling collect. We'll see exactly why with their first pipeline.
16. Department summary query
Each week, the team prepares a report for city council: a breakdown of completed requests by department.
17. Department summary query
They start with the lazy scan of the CSV.
18. Department summary query
And then filter to completed requests only.
19. Department summary query
And then they make a key mistake: they call collect, converting to an eager DataFrame too early.
20. Department summary query
Only after that do they group by department and count rows. See the problem? Polars couldn't optimize anything because collect cut the lazy pipeline short.
21. Department summary query
The fix? Move collect to the end. Now Polars sees the full picture. It only needs the STATUS and DEPARTMENT columns from the CSV. Everything else gets skipped. The team's impressed, but they mention their full pipeline has two queries sharing the same source.
22. Diverging query branches
We've already seen the first query: completed requests by department. The second counts completions by month. Same source, different groupings.
23. How the team runs this today
Right now, the team runs these separately, so the CSV gets loaded twice. Wouldn't it be better to load it just once?
24. Executing diverging queries
That's exactly what pl.collect_all does.
25. Executing diverging queries
We pass both lazy queries as a list, and Polars recognizes they share a source, so the CSV gets loaded only once.
26. Executing diverging queries
The result is a list of DataFrames, one per query. The team's queries have gone from minutes to seconds, and we've only just started optimizing.
27. Let's practice!
Time to put these techniques into practice!