Working with Multifile Datasets

1. Working with Multifile Datasets

Now the team is getting a new file every day. So today, we help them build pipelines over many files at once.

2. Daily 311 files

After a change to internal processing, the Chicago team now receives one CSV per day, with the date in the file name. How can they build efficient pipelines from this data?

3. Scanning many CSV files

First, the team wants to scan all data from 2026. Can they do this without a cumbersome loop over files?

4. Scanning many CSV files

Yes, by using a glob pattern. Here, the wildcard asterisk picks out all CSV files that have 2026 in the date component.

5. Scanning many CSV files

Once the files are scanned, the rest of the lazy pipeline looks the same. But the team finds that targeting different date ranges requires increasingly awkward glob patterns. They wonder whether Polars itself can handle this.

6. Writing hive partitions

There is. They start by reading the 311 CSV and writing it back out as Parquet.

7. Writing hive partitions

First, they provide the target output directory for the dataset.

8. Writing hive partitions

Then they add the partition_by argument and choose CREATED_DATE as the partition column.

9. Partitioned datasets

Now the data is stored in a structure known as a Hive partition. The directory names are written as key-value pairs, for example, CREATED_DATE=2025-12-31.

10. Partitioned datasets

Another directory is CREATED_DATE=2026-01-01, and each directory holds a Parquet file. Part of the data now lives in the path instead of inside the file.

11. Scanning a hive-partitioned dataset

When the team sets hive_partitioning to True in scan_parquet, Polars parses the partition values from the directory names. The team can now filter by those partition keys as if they were regular columns.

12. A partition-aware query

Here they filter to partitions from January 1st, 2026 onward.

13. A partition-aware query

They then group by Department, count the rows, and collect. The result is an ordinary DataFrame.

14. A partition-unaware query!

The team works with the Hive-partitioned dataset, but they ask why this query on 2026 data is still slow. We explain that Polars can only skip partitions when the filter uses the exact partition column, which here is CREATED_DATE. Filtering on a derived column like YEAR does not trigger that optimization, even if the filter seems logically equivalent. But loading many files brings other challenges, too.

15. A schema drift problem

One challenge is that the datasets often change over time. They show us this example with the daily CSV files, where the first file has the BLOCK_CODE column, but it's missing from the next file. The team needs one scan that can handle both file layouts.

16. Inserting missing columns

By default, Polars expects the files in the scan to have matching columns, so schema drift would raise an error. We solve this by setting the missing_columns argument to insert.

17. Checking the combined schema

Now the rows from the first file have BLOCK_CODE values, while the rows from the second file have nulls in that column. That lets the team keep one unified schema across the dataset. The same missing_columns argument is also available for Parquet.

18. A dtype mismatch problem

Another common issue is that the same column has different dtypes in different daily files. Here, one day stores WARD as integers, while another stores it as strings.

19. Two scans with different dtypes

The team scans the two daily files separately so each one gets the dtype that matches its source data. For the first file, they use the default scan because WARD is already stored as integers. For the second file, they add schema_overrides so Polars reads WARD as String to match that source file. Now they need to combine those two lazy queries into one dataset.

20. Combining with vertical_relaxed

With vertical_relaxed, Polars concatenates the queries and coerces mismatched columns to a common supertype. A supertype is a dtype that can hold the values from both inputs. In this case, WARD is cast to String instead of causing a dtype conflict.

21. Checking the relaxed concat

The WARD column is now String for all rows. So vertical_relaxed combines daily batches that are logically the same but have slightly different dtypes.

22. Let's practice!

Now it's time to practice loading multifile datasets in Polars.
