Working with Parquet files

1. Working with Parquet files

Welcome back! Even the best query plan can't outrun a slow file format. That's the wall the Chicago team has hit with their CSV files. Today, we'll show them why Parquet is a better fit.

2. File storage formats

To understand why, consider a DataFrame like this.

3. CSV Format

When we save it as a CSV, each record is written row by row. So even when the team only wants a few columns, Polars still has to scan every row to find them. What does Parquet do differently?

4. Parquet Format

In Parquet, data is stored by column instead of by row. That means Polars can read only the columns needed for a query and skip the rest. So how can the team move their archive across?

5. Converting the archive

First, they read the CSV.

6. Converting the archive

Then they write it as a Parquet file with write_parquet. By convention, the filename extension is dot-parquet.

7. Inspecting a Parquet file

Now the team wants to inspect the file. Unlike CSV, where all data is stored as text, a Parquet file stores data with a schema. We can get the schema with read_parquet_schema without loading the dataset into memory, so the team knows the column names and data types before they build a query. But is the payoff really large enough to justify the conversion from CSV to Parquet?

8. CSV vs Parquet

Once the archive is converted, the team compares the two formats. First, the Parquet version is much smaller on disk. Second, their pipeline with Parquet is much faster than with CSV. So can the team just use Parquet for everything?

9. Parquet is not for appending rows

Not quite. New requests to the city arrive like this, one at a time. But Parquet is not designed for efficiently appending small numbers of rows. We advise them to keep CSV for efficient appends and create complete analytical Parquet files for the queries.

10. Parquet row groups

We have seen that Parquet stores data by column. It also splits the rows into blocks called row groups, and within each row group every column is stored as its own chunk. Polars can use these groups to skip data it doesn't need during a query.

11. Filtering with row groups

Say the team wants to filter on CREATED_DATE for rows before 2020. We use scan_parquet here to kick off a lazy query.

12. Parquet row groups

Because the data is sorted by CREATED_DATE, all rows before 2020 land inside a single row group.

13. Parquet row groups

Parquet stores the minimum and maximum value for each column in each row group. Polars reads these statistics and sees that only row group 1 is needed. Polars can also take advantage of row groups when reading Parquet in parallel.
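To make the idea concrete, here is a toy model of that skipping decision in plain Python. It is not how Polars implements it internally; the RowGroupStats class, the groups_to_read helper, and the date ranges are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RowGroupStats:
    """Toy stand-in for the min/max statistics of one row group."""
    index: int
    min_date: date
    max_date: date


def groups_to_read(stats, cutoff):
    """Return the row groups that could hold rows with CREATED_DATE < cutoff.

    A group can be skipped when even its smallest date is not before the cutoff.
    """
    return [g.index for g in stats if g.min_date < cutoff]


# Hypothetical statistics for a file sorted by CREATED_DATE.
stats = [
    RowGroupStats(1, date(2018, 1, 1), date(2019, 12, 31)),
    RowGroupStats(2, date(2020, 1, 1), date(2021, 6, 30)),
    RowGroupStats(3, date(2021, 7, 1), date(2023, 12, 31)),
]

needed = groups_to_read(stats, date(2020, 1, 1))
print(needed)  # [1]
```

Only the statistics are consulted here, which is why sorted data helps: the date ranges of the groups don't overlap, so most groups can be ruled out without reading them.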

14. Scanning with parallel strategies

Polars can read Parquet files in parallel across columns, across row groups, or both. By default, Polars estimates which strategy is better for each query, but it won't always find the fastest way.

15. Scanning with parallel strategies

When a bottleneck needs speeding up, the team can choose the strategy themselves with the parallel argument. In this case, they find that reading over row groups in parallel is 20% faster. Other options are columns, which reads in parallel over columns, and prefiltered, where Polars first evaluates the filter to work out which rows are needed and then reads only those rows from the remaining columns. This can be fast for large files with lots of row groups. We can also turn parallel reading off with none.

16. Controlling Parquet writes

The team can also control how the Parquet file is laid out. Parquet files use compression to reduce file size. The compression_level argument controls how aggressive that compression is; with Polars' default zstd compression, valid levels run from 1 to 22. A value between 3 and 6 is a good starting point for most use cases. Higher levels give smaller files but take longer to write.

17. Controlling Parquet writes

Row-group size controls how many rows are stored in each row group. This can help concentrate subsets of data into particular row groups. Smaller row groups mean Polars has to read more statistics metadata and do more read operations, but there is more filtering potential.

18. Let's practice!

Now it's time to practice working with Parquet files in Polars!
