1. Introducing lazy mode
Now we meet one of Polars' most powerful features: lazy mode.
2. Eager mode vs. lazy mode
Polars offers two modes for working with data: eager and lazy.
Let's say our colleague asks for the name and price of every property in our rentals dataset. We start in eager mode with pl.read_csv,
which loads all the rentals data into a DataFrame.
3. Eager mode vs. lazy mode
Alternatively, we can start a lazy query with pl.scan_csv. When we run pl.scan_csv, Polars does two things.
First, it starts a query plan that sets out what we want to do. At this stage, the plan is just to load the CSV.
Second, it checks the first rows of the CSV to get the schema, that is, the column names and dtypes.
4. Eager mode vs. lazy mode
So if we run the eager query, we get the full DataFrame.
But if we run the lazy query, we get a LazyFrame that holds a query plan rather than any data. There'll be more on this later.
5. Eager mode vs. lazy mode
As our colleague only wants names and prices, we need to select name and price.
In eager mode, Polars reads the full CSV into a DataFrame in the first line, and then drops all columns apart from name and price in the second line.
6. Eager mode vs. lazy mode
Adding this select step in lazy mode updates the query plan.
Polars optimizes the query plan and limits the amount of data it will load into a DataFrame to only the selected columns.
7. Optimized query plan
Ending a lazy query with .explain() returns the optimized query plan as text, which we can print.
The first line of the optimized plan is the scan of the CSV file.
The second line says Project 2 out of 8 columns, meaning only 2 out of 8 columns will be loaded into a DataFrame.
"Project" refers to projection pushdown, the technical name for limiting the columns.
8. Executing a lazy query
Now we turn our lazy query into a DataFrame to share with our colleague.
We call .collect() at the end of the lazy query, which tells Polars to execute the optimized query plan and return a DataFrame.
The result of the optimized query matches the eager query, but the lazy query is faster and uses less memory because Polars never loads the unused columns.
9. Eager mode vs. lazy mode
The key difference between these modes is that in eager mode, Polars executes code one line at a time,
whereas in lazy mode, Polars finds an optimized way to execute the full set of operations.
10. Eager mode vs. lazy mode
So if lazy mode is optimized, when should we use eager mode?
Eager mode is best for seeing what happens step by step, as we do in this course. For similar reasons, eager mode is useful for debugging.
We use lazy mode when we want to optimize a script for speed and memory use. As our rental properties dataset grows, we would start to see much faster performance with lazy mode.
11. Let's practice!
Now it's time to create your own lazy queries.