Sinking large queries to disk

1. Sinking large queries to disk

The Chicago team can now target different execution engines. But there is still a problem: some production queries are not meant to become in-memory DataFrames at all.

2. A large output problem

The team has noticed that some pipelines create large outputs that are only meant to be written to Parquet files for downstream use, not inspected in memory. For example, they need to write a cleaned version of the 311 requests dataset for a separate auditing team.

3. A cleaned request dataset

The team starts with the 311 Parquet file.

4. A cleaned request dataset

They need a cleaned dataset with transformed columns and the same number of rows.

5. Collect then write

One option is to collect the result and then call write_parquet. This is fine when the output is small. But for a full cleaned 311 extract, the team wants to know if they can just write the output directly to disk.

6. Sinking to Parquet

That's what the sink_parquet method is for. It writes the lazy query result straight to a Parquet file. By default, it uses the streaming engine, so it works even when the output is large.

7. Controlling the sink

The sink still gives the team familiar Parquet controls like setting compression or the row group size. So sink_parquet is the disk-writing version of a lazy execution path.

8. Other sink methods

Parquet is the main sink format, but not the only one. Polars also has sink methods for CSV and newline-delimited JSON.

9. Partitioned output

In real data lakes, one large Parquet output may be inefficient for querying. As in Chapter 2, the team may want a directory layout partitioned by CREATED_DATE. For that, sink_parquet can take a Polars PartitionBy object.

10. Partitioned output

Here, the output is written in a local directory and partitioned by CREATED_DATE. The max_rows_per_file argument keeps any one file from growing too large.

11. Partitioned output

The mkdir argument tells Polars to create missing output directories when needed.

12. Partitioned output

The result is the same kind of partitioned layout we saw in Chapter 2. Instead of one file, the dataset is arranged into folders keyed by CREATED_DATE. Downstream queries can then target only the partitions they need.

13. Multi-column partitions

Partitions can also use multiple columns. Here, the team writes one dataset partitioned first by STATUS and then by CREATED_DATE. This matches common filters, like completed requests for a specific day. The team also has pipelines where they run multiple queries against the same source and write each output to a separate file. Can they do that efficiently?

14. Building lazy sinks

It turns out the team needs two outputs. First, an internal dataset that keeps everything. They create an internal_sink by calling sink_parquet with lazy equals True. The lazy argument tells Polars to return a LazyFrame that represents the write, instead of executing it right away.

15. Building lazy sinks

Second, a public output that drops some internal-only columns. This goes into public_sink. At this point, neither file has been written yet.

16. Multiplexing sinks

Then pl.collect_all executes both sink queries together. This is called multiplexing. Polars plans the outputs together rather than as isolated jobs. collect_all is useful when several outputs share the same expensive scan or preparation steps.

17. Written outputs

Both files are written to disk in one pass. Usually, a file is exactly what the team needs. But what if the destination isn't a file?

18. Custom batch sinks

The team's audit service, for instance, expects records pushed to it in JSON chunks. So they define a function that takes one batch and sends it onward.

19. Custom batch sinks

Then they pass that function to sink_batches. Polars streams the query and calls send_batch on each chunk. The chunk_size argument sets how many rows go into each batch, and the final batch simply holds whatever rows are left over.

20. Sink workflow

To recap: use sink_parquet when a large query should go straight to disk. Use PartitionBy when the output should become a partitioned dataset. Use lazy sinks with collect_all when several outputs need to be planned and executed together. Finally, use sink_batches when each batch should be sent to custom Python code instead of a built-in file sink.

21. Let's practice!

Now let's practice sinking large lazy query results directly to disk.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.