
Scanning a hive-partitioned dataset

The team also stores cleaned-up Parquet checkouts in a hive-partitioned layout, with one directory per year (checkoutyear=2023/, checkoutyear=2024/). Scan the partitioned dataset and filter on the partition column so Polars only reads the years you actually need.

polars is loaded as pl, and the root directory is in HIVE_DIR. The partition directories are printed for you, so you can see the layout.

This exercise is part of the course

Scaling and Optimizing Data Pipelines with Polars


Exercise instructions

  • Scan HIVE_DIR using the right argument to enable hive partitioning.
  • Filter the result to checkouts from 2024 onward.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

requests = pl.scan_parquet(
    HIVE_DIR,
    # Enable hive partitioning
    ____=True,
)

result = (
    requests
    # Filter to checkouts from 2024 onward (filters on the partition column)
    .filter(pl.col("checkoutyear") >= ____)
    .group_by("format")
    .agg(pl.col("checkouts").sum().alias("total"))
    .sort("total", descending=True)
    .collect()
)
print(result)