Effective query execution

1. Effective query execution

Welcome to the course! I'm Liam, a data scientist

2. Meet your instructor

and Polars contributor. I'll guide you through building scalable data pipelines with Polars.

3. From laptop to cloud

Here's the promise of Polars: you write one pipeline on your laptop, and that same code scales to larger-than-memory, multi-file datasets. But making that happen isn't automatic. This course teaches you the techniques that make it possible.

4. Is this course for you?

Before we dive in, note that this course builds on two earlier ones. You should be comfortable creating a lazy query with scan_csv and writing expressions with filter, select, and group_by. If not, start with Introduction to Polars and Transforming Data with Polars.

5. Chapter 1 - optimization

So what's ahead? We start in Chapter 1 with query optimization,

6. Chapter 1 - optimization

learning how Polars can automatically speed up your queries when you structure them right.

7. Chapter 2 - effective I/O

Chapter 2 tackles data sources like CSV and Parquet, and how file format choices impact pipeline performance.

8. Chapter 3 - richer dtypes

Chapter 3 goes deeper into Polars' type system, including nested dtypes and memory-efficient representations that reduce your pipelines' memory use.

9. Chapter 4 - scaling pipelines

And Chapter 4 scales up with the streaming engine for datasets that don't fit in memory, plus testing to keep things robust. By the end, you'll have a complete toolkit for production-ready Polars pipelines.

10. Chicago requests dataset

Alright, let's jump in. The Mayor of Chicago's data analytics team has built some Polars pipelines, but their queries are slow, and they've called us in to help.

11. Inspecting the dataset

Their main dataset is massive, with millions of rows across 39 columns, recording every service request from Chicago's citizens. We ask a team member to walk us through their code, and they start by creating a LazyFrame called requests with pl.scan_csv.

12. Inspecting the dataset

Then they call collect to execute the query. Recall that this turns the LazyFrame into an in-memory DataFrame.

13. Inspecting the dataset

Then they call head to grab the first five rows. We see pothole complaints, tree trim requests, garbage pickups. But here's the problem: they loaded the entire CSV just to look at five rows.

14. Limiting rows processed

We show them a better approach: call head before collect, while the query is still lazy.

15. Limiting rows processed

Now, when we call collect, Polars knows we only need five rows and stops reading early. A real time-saver during development. But the bigger lesson here is to delay calling collect. We'll see exactly why with their first pipeline.

16. Department summary query

Each week, the team prepares a report for city council: a breakdown of completed requests by department.

17. Department summary query

They start with the lazy scan of the CSV.

18. Department summary query

And then filter to completed requests only.

19. Department summary query

And then they make a key mistake: they call collect, converting to an eager DataFrame too early.

20. Department summary query

Only after that do they group by department and count rows. See the problem? Because collect cut the lazy pipeline short, Polars couldn't optimize the query as a whole: it had to load every column without knowing which ones the grouping step would need.

21. Department summary query

The fix? Move collect to the end. Now Polars sees the full picture. It only needs the STATUS and DEPARTMENT columns from the CSV. Everything else gets skipped. The team's impressed, but they mention their full pipeline has two queries sharing the same source.

22. Diverging query branches

We've already seen the first query: completed requests by department. The second counts completions by month. Same source, different groupings.

23. How the team runs this today

Right now, the team runs these separately, so the CSV gets loaded twice. Wouldn't it be better to load it just once?

24. Executing diverging queries

That's exactly what pl.collect_all does.

25. Executing diverging queries

We pass both lazy queries as a list, and Polars recognizes they share a source, so the CSV gets loaded only once.

26. Executing diverging queries

The result is a list of DataFrames, one per query. The team's queries have gone from minutes to seconds, and we've only just started optimizing.

27. Let's practice!

Time to put these techniques into practice!
