Multiplexing Parquet sinks

The team now needs two extracts from the same cleaned query: one for digital checkouts and one for physical. Build both writes lazily so Polars can plan the shared scan once, then run them together in one pass.

clean_checkouts is preloaded, along with DIGITAL_EXPORT_PATH and PHYSICAL_EXPORT_PATH.

This exercise is part of the course

Scaling and Optimizing Data Pipelines with Polars

Exercise instructions

  • Build both sinks lazily so they don't execute immediately.
  • Run both sinks together on the streaming engine.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Lazy sink for digital
digital_sink = (
    clean_checkouts
    .filter(pl.col("use") == "Digital")
    .sink_parquet(DIGITAL_EXPORT_PATH, lazy=____)
)

# Lazy sink for physical
physical_sink = (
    clean_checkouts
    .filter(pl.col("use") == "Physical")
    .sink_parquet(PHYSICAL_EXPORT_PATH, lazy=____)
)

# Run both sinks together
pl.____([digital_sink, physical_sink], engine="streaming")

# Check the row counts of each extract
result = pl.DataFrame(
    {
        "extract": ["digital", "physical"],
        "rows": [
            pl.scan_parquet(DIGITAL_EXPORT_PATH).select(pl.len()).collect().item(),
            pl.scan_parquet(PHYSICAL_EXPORT_PATH).select(pl.len()).collect().item(),
        ],
    }
)
print(result)