Sources, Sinks & external systems
1. Sources, Sinks & external systems
In this section, we'll discuss the impact of external systems on a Dataflow pipeline. In the Dataflow service, most sources and sinks abstract away the need to deal with Read stage parallelism. Sometimes, this hides underlying issues that impact a pipeline's performance. For example, if you're reading gzip files via TextIO, a gzip file can't be read in parallel, so a single thread will deal with each file. This has three negative effects. First, only one machine can do the read I/O operation for that file. Second, after the read stage, all fused stages need to run on the same worker that read the data. Third, in any shuffle stage, a single machine needs to push all the data from the file to all the other machines, so that single host's network becomes the bottleneck. To avoid this, switch to uncompressed files while using TextIO, or switch to a compressed Avro format.
Beam runners are designed to rapidly chew through parallel work. They can spin up many threads across many machines to achieve this goal. This can easily swamp an external system. This is an issue for both batch and streaming pipelines, but the effects on the external system are often more pronounced in batch, or during backlog processing in a streaming pipeline. To alleviate this issue, batch the calls to the external system using a mechanism like the GroupIntoBatches transform, or @StartBundle and @FinishBundle. You should also provision external services to handle the peak volume of the Dataflow pipeline.
While working in the cloud, it's sometimes easy to forget the impact of the simple choices we make while developing applications. Colocation is one such aspect. Using services and resources from the same region usually means lower latency for interservice communication. This lower latency may result in significant performance gains, especially when the pipeline interacts heavily with external services like BigQuery, Bigtable, or any other service outside of Dataflow.
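To make the batching advice for external calls more concrete, here is a minimal sketch in plain Python. It only mirrors the bundle lifecycle: in the Beam Python SDK, the counterparts of @StartBundle and @FinishBundle are the `start_bundle` and `finish_bundle` methods of a `DoFn`, and in a real pipeline this class would extend `apache_beam.DoFn`. The class name `BatchingWriter`, the `send_batch` callback, and the `max_batch_size` parameter are all hypothetical names chosen for illustration, not part of any library.

```python
from typing import Callable, List


class BatchingWriter:
    """Buffers elements and sends them to an external system in batches.

    Assumption: method names mirror the Beam Python DoFn lifecycle
    (start_bundle, process, finish_bundle). This sketch does not import Beam.
    """

    def __init__(self, send_batch: Callable[[List[str]], None],
                 max_batch_size: int = 100) -> None:
        if max_batch_size < 1:
            raise ValueError("max_batch_size must be at least 1")
        # Hypothetical client call: one request to the external system per batch.
        self._send_batch = send_batch
        self._max_batch_size = max_batch_size
        self._buffer: List[str] = []

    def start_bundle(self) -> None:
        # Start every bundle with an empty buffer.
        self._buffer = []

    def process(self, element: str) -> None:
        # Buffer the element; only call out once a full batch is ready.
        self._buffer.append(element)
        if len(self._buffer) >= self._max_batch_size:
            self._flush()

    def finish_bundle(self) -> None:
        # Send whatever is left so no element is dropped at the end of a bundle.
        self._flush()

    def _flush(self) -> None:
        if self._buffer:
            self._send_batch(list(self._buffer))
            self._buffer = []


# Usage: record each request in a list instead of calling a real service.
calls: List[List[str]] = []
writer = BatchingWriter(send_batch=calls.append, max_batch_size=3)
writer.start_bundle()
for record in ["a", "b", "c", "d", "e", "f", "g"]:
    writer.process(record)
writer.finish_bundle()
print(len(calls))  # 3 requests instead of 7
```

With a batch size of 3, the seven elements reach the external system in three requests rather than seven. The batch size is the knob to tune against what the external service can absorb at the pipeline's peak volume.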
Let's practice!