Working with Multifile Datasets
1. Working with Multifile Datasets
Now the team is getting a new file every day. So today, we help them build pipelines over many files at once.
2. Daily 311 files
After a change to internal processing, the Chicago team now receives one CSV per day, with the date in the file name. How can they build efficient pipelines from this data?
3. Scanning many CSV files
First, the team wants to scan all data from 2026. Can they do this without a cumbersome loop over files?
4. Scanning many CSV files
Yes, by using a glob pattern. Here, the wildcard asterisk picks out all CSV files that have 2026 in the date component.
5. Scanning many CSV files
Once the files are scanned, the rest of the lazy pipeline looks the same. But they find they need to write increasingly awkward glob patterns to target queries at different date ranges. They wonder whether there is a way to handle this within Polars.
6. Writing hive partitions
There is. They start by reading the 311 CSV and writing it back out as Parquet.
7. Writing hive partitions
First, they provide the target output directory for the dataset.
8. Writing hive partitions
Then they add the partition_by argument and choose CREATED_DATE as the partition column.
9. Partitioned datasets
Now the data is stored in a structure known as a Hive partition. The directory names are written as key-value pairs, for example, CREATED_DATE=2025-12-31.
10. Partitioned datasets
Or CREATED_DATE=2026-01-01 with a Parquet file in each directory. Now part of the data lives in the path instead of inside the file.
11. Scanning a hive-partitioned dataset
But by setting hive_partitioning to True in scan_parquet, Polars can parse partition values from the directory names. The team can now filter by those partition keys as if they were regular columns.
12. A partition-aware query
Here they filter to partitions from January 1st, 2026 onward.
13. A partition-aware query
And then they group by Department, count rows, and collect to get the usual output.
14. A partition-unaware query!
The team works with the Hive-partitioned dataset, but they ask why this query on 2026 data is still slow. We explain that Polars can only skip partitions when the filter uses the exact partition column, which here is CREATED_DATE. Filtering on a derived column like YEAR does not trigger that optimization, even if it seems logically equivalent. But loading many files brings other challenges, too.
15. A schema drift problem
One challenge is that the datasets often change over time. They show us this example with the daily CSV files, where the first file has the BLOCK_CODE column, but it's missing from the next file. The team needs one scan that can handle both file layouts.
16. Inserting missing columns
By default, Polars expects the files in the scan to have matching columns, so schema drift would raise an error. We solve this by setting the missing_columns argument to insert.
17. Checking the combined schema
Now the rows from the first file have BLOCK_CODE values, while the rows from the second file have nulls in that column. That lets the team keep one unified schema across the dataset. The same missing_columns argument is also available for Parquet.
18. A dtype mismatch problem
Another common issue is that the same column has different dtypes in different daily files. Here, one day stores WARD as integers, while another stores it as strings.
19. Two scans with different dtypes
The team scans the two daily files separately so each one gets the dtype that matches its source data. For the first file, they use the default scan because WARD is already stored as integers. For the second file, they add schema_overrides so Polars reads WARD as String to match that source file. Now they need to combine those two lazy queries into one dataset.
20. Combining with vertical_relaxed
With vertical_relaxed, Polars concatenates the queries and coerces mismatched columns to a common supertype. A supertype is a dtype that can hold the values from both inputs. In this case, WARD is cast to String instead of causing a dtype conflict.
21. Checking the relaxed concat
The WARD column is now String for all rows. So vertical_relaxed combines daily batches that are logically the same but have slightly different dtypes.
22. Let's practice!
Now it's time to practice loading multifile datasets in Polars.