Get startedGet started for free

Pipeline Design

1. Pipeline Design

Hi, I am Ajay, I am a Strategic Cloud Engineer at Google. In this module, we will discuss performance considerations we should be aware of while developing batch and streaming pipelines in Dataflow. We will begin with some general pipeline design considerations, then we will discuss effects of data shape on performance of a pipeline. Next we will see impact of external systems on a dataflow pipeline. In the end we will wrap up this section with insights of Dataflow specific performance optimization options. Let’s begin with pipeline design decisions which impacts performance of a pipeline. We might sometimes underestimate simple considerations that are critical to a pipeline’s performance. Filtering data early might be considered one such option. It is recommended to place transformations that reduce the volume of data as high up on the graph as possible. This includes placing them above window operations, even though the Window transform itself does nothing more than tag elements in preparation for the next aggregation step in the DAG. Choose coders that provide good performance. For example, in the Java SDK, do not use SerializableCoder. Choose a more efficient coder, for example ProtoCoder or Schemas. Pro Tip: Encoding and decoding are a large source of overhead. Thus, if you have a large blob but only need part of it, you could selectively decode just that part. For example, 'com.google.protobuf.FieldMask' in Protobufs enables reading specific bits of information without deserializing whole blob. If your pipeline has large windows aggregating large volumes of data, you can create smaller Window + Combine patterns before the main sliding window to reduce the volume of elements to be processed when the window slides. Runners may support fusion as part of graph optimization. Such optimizations can include fusing multiple steps or transforms in your pipeline's execution graph into a single phase. There are a few cases in your pipeline where you may want to prevent the Dataflow service from performing fusion optimizations. Before we discuss graph optimization further let’s first understand what a fanout transformation is. In a fanout transformation a single element can output hundreds or thousands of times as many elements. For example in this diagram, a single input element (key1 and value1) outputs multiple output elements. The primary example where fusion is not desirable is a large fanout transform. To prevent Fusion you can insert a Reshuffle after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation. Alternatively you can pass your intermediate PCollection as a side input to another ParDo. The Dataflow service always materializes side inputs. One of the common service ticket resolutions for performance comes from that old favorite: too much logging! In the Dataflow runner, logs are sent from all workers to a central location in Stackdriver. A thousand machines, all pushing hundreds of logs per second, can cause massive back pressure! Log.info should almost always be avoided against PCollection element granularity. These will rarely be useful in logs. Log.error should also be carefully considered. Using a dead letter pattern followed by a count per window of 5 minutes may be better suited for reporting data errors.

2. Let's practice!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.