Sinking a cleaned extract to Parquet

Back to the Seattle library data. The team has a cleaned-up checkout extract they want to write to Parquet for downstream tools, but they don't want to materialize the whole thing in memory first. Write the lazy query straight to disk.

clean_checkouts is preloaded, along with the export path CLEAN_EXPORT_PATH.

This exercise is part of the course

Scaling and Optimizing Data Pipelines with Polars

Exercise instructions

  • Write clean_checkouts to CLEAN_EXPORT_PATH directly from the lazy query.
  • Set the row group size to 5,000.
  • Use the streaming engine.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Write clean_checkouts straight to disk
clean_checkouts.____(
    CLEAN_EXPORT_PATH,
    # 5,000 rows per row group
    row_group_size=____,
    # Streaming engine
    engine="____",
)

# Confirm what landed in the Parquet file
result = pl.scan_parquet(CLEAN_EXPORT_PATH).select(
    pl.len().alias("rows"),
    pl.col("checkouts").sum().alias("total_checkouts"),
).collect()
print(result)