Shuffle and streaming engine

1. Shuffle and streaming engine

Now let’s explore some Dataflow-specific performance optimization options. Dataflow Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. Currently, Dataflow uses a shuffle implementation that runs entirely on worker virtual machines and consumes worker CPU, memory, and persistent disk storage. The service-based Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend.

The service-based Dataflow Shuffle has several benefits. It gives faster execution times for the majority of batch pipeline job types. It reduces the CPU, memory, and persistent disk storage consumed on the worker VMs. It enables better autoscaling, since VMs no longer hold any shuffle data and can therefore be scaled down earlier. It also improves fault tolerance: an unhealthy VM holding Dataflow Shuffle data will not cause the entire job to fail, as would happen without the feature.

The Streaming Engine feature offloads window state storage from the persistent disks (PDs) attached to workers to a backend service. It also implements an efficient shuffle for streaming cases. The Dataflow Shuffle service applies to batch pipelines, while the Streaming Engine service is built for streaming pipelines.

No code changes are required to get the benefits of these features. Worker nodes continue running the user code that implements your data transforms, and transparently communicate with the Streaming Engine or Shuffle service to store the pipeline state. Many scalability and autoscaling issues can be resolved by enabling Shuffle and Streaming Engine for your batch and streaming pipelines, respectively.

This is the end of this module. You should now be able to understand performance considerations for pipelines and consider how the shape of your data can affect pipeline performance.
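Since no pipeline code changes are needed, enabling these features usually comes down to launch options. The sketch below is a minimal illustration, not an official recipe. It assumes the flag names `--experiments=shuffle_mode=service` (batch) and `--enable_streaming_engine` (streaming) as used with the Beam Python SDK, so check them against the current Dataflow documentation. The helper names and the project and region values are hypothetical.

```python
from typing import List


def dataflow_service_flags(streaming: bool) -> List[str]:
    """Return the extra launch flag for the matching service-side feature.

    The flag names are assumptions based on the Beam Python SDK options;
    confirm them against current Dataflow documentation before use.
    """
    if streaming:
        # Streaming Engine: moves window state and streaming shuffle
        # off the workers' persistent disks into a backend service.
        return ["--enable_streaming_engine"]
    # Service-based Dataflow Shuffle: applies to batch pipelines only.
    return ["--experiments=shuffle_mode=service"]


def build_launch_args(base_args: List[str], streaming: bool) -> List[str]:
    """Copy base_args and append the feature flag for the pipeline type."""
    args = list(base_args)  # do not mutate the caller's list
    if streaming and "--streaming" not in args:
        args.append("--streaming")
    return args + dataflow_service_flags(streaming)


if __name__ == "__main__":
    # Hypothetical project and region values, for illustration only.
    base = [
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-central1",
    ]
    print(build_launch_args(base, streaming=False))
    print(build_launch_args(base, streaming=True))
```

The resulting list could then be passed to the pipeline's options parser at launch. The transform code itself stays unchanged in both cases.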

2. Let's practice!

Serverless Data Processing with Dataflow: Operations

In this module, we learn how to use the Jobs List page to filter for jobs that we want to monitor or investigate. We look at how the Job Graph, Job Info, and Job Metrics tabs collectively provide a comprehensive summary of your Dataflow job. Lastly, we learn how we can use Dataflow’s integration with Metrics Explorer to create alerting policies for Dataflow metrics.

Exercise 1: Job List
Exercise 2: Job Info
Exercise 3: Job Graph
Exercise 4: Job Metrics
Exercise 5: Metrics Explorer
Exercise 6: Quiz Question 1
Exercise 7: Quiz Question 2
Exercise 8: Additional Resources

This module reviews the topics covered in the course.

Exercise 1: Course Summary