Shuffle and streaming engine

1. Shuffle and streaming engine

Now let’s explore some Dataflow-specific performance optimization options. Dataflow Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner.

Currently, Dataflow uses a shuffle implementation that runs entirely on worker virtual machines and consumes worker CPU, memory, and persistent disk storage. The service-based Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend.

The service-based Dataflow Shuffle has the following benefits:

- Faster execution time of batch pipelines for the majority of pipeline job types.
- A reduction in consumed CPU, memory, and persistent disk storage resources on the worker VMs.
- Better autoscaling, since VMs no longer hold any shuffle data and can therefore be scaled down earlier.
- Better fault tolerance: an unhealthy VM holding Dataflow Shuffle data does not cause the entire job to fail, as it would without the feature.

Similarly, the Streaming Engine feature offloads window state storage from the persistent disks (PDs) attached to workers to a backend service. It also implements an efficient shuffle for streaming cases. The Dataflow Shuffle service applies to batch pipelines, while the Streaming Engine service is built for streaming pipelines.

No code changes are required to get the benefits of these features. Worker nodes continue running your user code that implements data transforms, and transparently communicate with the Streaming or Shuffle engine to store the pipeline state. Many scalability and autoscaling issues can be resolved by enabling Shuffle and Streaming Engine for your batch and streaming pipelines, respectively.

This is the end of this module. You should now be able to understand performance considerations for pipelines, and consider how the shape of your data can affect pipeline performance.
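Because these features are switched on through pipeline options rather than code changes, a small sketch may help show what "enabling" them could look like in practice. The helper below is hypothetical (the function name `dataflow_service_flags` is not part of any SDK); it only assembles the command-line flags that the Apache Beam Python SDK is commonly given for this purpose. Depending on your SDK version and region, one or both features may already be on by default, so treat the flags as an assumption to check against your environment.

```python
from typing import List


def dataflow_service_flags(streaming: bool) -> List[str]:
    """Hypothetical helper: build flags that opt a Dataflow job into the
    service-based backend matching its pipeline type."""
    flags = ["--runner=DataflowRunner"]
    if streaming:
        # Streaming pipelines: offload window state and shuffle
        # to the Streaming Engine backend.
        flags += ["--streaming", "--enable_streaming_engine"]
    else:
        # Batch pipelines: move the shuffle out of the worker VMs
        # and into the service-based Dataflow Shuffle.
        flags += ["--experiments=shuffle_mode=service"]
    return flags


if __name__ == "__main__":
    print(" ".join(dataflow_service_flags(streaming=False)))
    print(" ".join(dataflow_service_flags(streaming=True)))
```

The resulting list can be passed to `PipelineOptions` from `apache_beam.options.pipeline_options`, together with the usual project, region, and temp location settings. The pipeline code itself stays unchanged, which is the point made above.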

2. Let's practice!
