Dataflow Shuffle Service

1. Dataflow Shuffle Service

person: In this video, we look at the Dataflow Shuffle service. Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. Currently, Dataflow uses a shuffle implementation that runs entirely on worker virtual machines and consumes worker CPU, memory, and persistent disk storage. The service-based Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend. With the Dataflow Shuffle service, most job types see faster batch pipeline execution. The worker nodes consume less CPU, memory, and persistent disk storage, and your pipelines autoscale better because the worker VMs no longer hold any shuffle data and can therefore be scaled down earlier. The service also gives you better fault tolerance: an unhealthy VM holding Dataflow Shuffle data will not cause the entire job to fail, as it would without the feature. See the official Dataflow documentation to learn how to enable the Dataflow Shuffle service for your batch pipelines.
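To make "partitions and groups data by key" concrete, here is a minimal single-process sketch in plain Python. It is not how Dataflow implements shuffle, and it does not use the Apache Beam API. The function names (partition_by_key, group_partition, shuffle) and the hash-modulo partitioning scheme are assumptions chosen for illustration only.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, Any]


def partition_by_key(pairs: Iterable[Pair], num_partitions: int) -> List[List[Pair]]:
    """Route each (key, value) pair to a partition chosen from the key's hash.

    Hypothetical helper: every pair with the same key lands in the same
    partition, which is the property a shuffle needs before grouping.
    """
    partitions: List[List[Pair]] = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def group_partition(partition: Iterable[Pair]) -> Dict[Hashable, List[Any]]:
    """Group the values of one partition by key, preserving arrival order."""
    grouped: Dict[Hashable, List[Any]] = defaultdict(list)
    for key, value in partition:
        grouped[key].append(value)
    return dict(grouped)


def shuffle(pairs: Iterable[Pair], num_partitions: int = 4) -> Dict[Hashable, List[Any]]:
    """Partition the pairs, group each partition, and merge the results.

    Keys never span partitions, so merging the per-partition dictionaries
    cannot overwrite an existing key.
    """
    merged: Dict[Hashable, List[Any]] = {}
    for partition in partition_by_key(pairs, num_partitions):
        merged.update(group_partition(partition))
    return merged


if __name__ == "__main__":
    events = [("apple", 1), ("pear", 2), ("apple", 3), ("fig", 4), ("pear", 5)]
    print(shuffle(events, num_partitions=3))
```

In this sketch, each partition stands in for the shuffle data that one worker VM would hold. With the worker-based implementation described above, that data lives on the worker's own CPU, memory, and disk; with the Dataflow Shuffle service, it lives in the service backend instead.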

2. Let's practice!


Serverless Data Processing with Dataflow: Foundations


This module covers the course outline and does a quick refresh on the Apache Beam programming model and Google’s Dataflow managed service.

Exercise 1: Course Introduction
Exercise 2: Beam and Dataflow Refresher

In this course, we started with a refresher on what Apache Beam is and how it relates to Dataflow.

Exercise 1: Course Summary
Exercise 2: Additional Resources