Multiplexing Parquet sinks
The team now needs two extracts from the same cleaned query: one for digital checkouts and one for physical. Build both writes lazily so Polars can plan the shared scan once, then run them together in one pass.
The LazyFrame clean_checkouts is preloaded, along with the output paths DIGITAL_EXPORT_PATH and PHYSICAL_EXPORT_PATH.
This exercise is part of the course
Scaling and Optimizing Data Pipelines with Polars
Exercise instructions
- Build both sinks lazily so they don't execute immediately.
- Run both sinks together on the streaming engine.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Lazy sink for digital
digital_sink = (
    clean_checkouts
    .filter(pl.col("use") == "Digital")
    .sink_parquet(DIGITAL_EXPORT_PATH, lazy=____)
)
# Lazy sink for physical
physical_sink = (
    clean_checkouts
    .filter(pl.col("use") == "Physical")
    .sink_parquet(PHYSICAL_EXPORT_PATH, lazy=____)
)
# Run both sinks together
pl.____([digital_sink, physical_sink], engine="streaming")
# Check the row counts of each extract
result = pl.DataFrame(
    {
        "extract": ["digital", "physical"],
        "rows": [
            pl.scan_parquet(DIGITAL_EXPORT_PATH).select(pl.len()).collect().item(),
            pl.scan_parquet(PHYSICAL_EXPORT_PATH).select(pl.len()).collect().item(),
        ],
    }
)
print(result)