Inserting missing columns
One year's extracted file is missing the pub column (publisher), but the team still wants to scan both files as one dataset. Pick the right argument so Polars inserts null where a column is missing instead of failing.
polars is loaded as pl, and the directory is in DRIFT_DIR. The header of each file is printed for you, so you can see the schema difference.
This exercise is part of the course
Scaling and Optimizing Data Pipelines with Polars
Exercise instructions
- Use a glob pattern to scan every seattle_*.csv file in DRIFT_DIR.
- Add the right argument so Polars inserts nulls for columns that are missing in some files.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Scan both yearly files as one combined dataset
combined = pl.scan_csv(
    str(DRIFT_DIR / "____"),
    try_parse_dates=True,
    # Insert missing columns instead of failing on schema differences
    ____="____",
)
result = combined.select("date", "format", "title", "pub").collect()
print("First rows (from 2023 file):")
print(result.head(3))
print("\nLast rows (from 2024 file):")
print(result.tail(3))