Parsing CSVs

1. Parsing CSVs

In the previous video, we told the team to keep CSV for new data. But not all CSVs are clean. So today we help them load messy CSVs into reliable Polars pipelines.

2. Finance CSV extract

The team receives a CSV from the finance team with costs for each request. But Polars can't read it out of the box. We need to look at the raw file to understand the issues. Problem number one is that the file starts with two descriptive rows before the real header. Problem number two is that the columns are separated with semicolons instead of commas.

3. Skipping rows

The first fix is to skip the two descriptive rows before the real header. Now, Polars starts reading where the column names appear. This is a small CSV, so we use read_csv, but it works the same way in lazy mode.

4. Setting the separator

Then we tell Polars that this file is separated with semicolons.

5. Checking the parsed file

To verify the parsing, the team previews a few rows. Now the columns are split correctly, and the header row has been read as column names.

6. A schema inference problem

Later, the team receives a new file with another issue. When reading a CSV, Polars looks through the first 100 rows by default to infer the schema. If the first 100 REQUEST_COST values are all integers, Polars infers that column as Int64.

7. Schema inference

But when a later REQUEST_COST row contains 61.5, it no longer fits the inferred integer dtype, and read_csv fails.

8. Schema inference

The first decimal value occurs on the 150th line, so we set infer_schema_length to 200 to make sure Polars catches it. A larger infer_schema_length slows down the read, but Polars is more likely to correctly infer the dtype.

9. Schema inference

Now Polars sees the decimal value during inference, so it infers REQUEST_COST as Float64. This is often the simplest way to fix an inference problem in a messy CSV. But what happens when the team knows the schema in advance?

10. Providing the schema

In this case, they can pass the desired schema directly as a dictionary.

11. Overriding the inferred schema

Sometimes the team only needs to correct some inferred columns rather than provide the full schema. In those cases, they can override part of the inferred schema with schema_overrides.

12. Checking the override

Now the numeric column is guaranteed to have the dtype the team expects. Schema overrides are especially helpful when a pipeline must stay stable across repeated CSV deliveries whose contents vary.

13. A bad data problem

These messy finance CSVs often have genuine bad data as well. The team knows that the WARD column sometimes contains the text "unknown" instead of integers. That kind of value causes parsing to fail if Polars tries to read WARD as an integer column.

14. Ignoring parse errors

The team's preferred strategy to deal with this is to set ignore_errors=True. This sets any values Polars can't parse to null, so it can continue parsing. We advise the team that this is a dangerous approach that can hide other errors.

15. Marking null values

If the team knows a placeholder like "unknown" might be present, they should tell Polars explicitly with the null_values argument. This preserves the intended dtype and avoids more serious errors creeping through unnoticed.

16. Let's practice!

Now it's time to practice parsing messy CSV files in Polars.
