Pipeline Optimizations

1. Pipeline Optimizations

person: This is our last section on Dataflow best practices. Here we'll explore a few things to keep in mind while designing Dataflow pipelines. Let's look at some general guidelines to consider while developing them.

Whenever possible, filter data early in the pipeline, and move any steps that reduce data volume up in your pipeline. This reduces the overall data volume flowing through the pipeline, enabling efficient use of pipeline resources. This includes placing them above window operations as well, even though the window transform itself does nothing more than tag elements in preparation for the next aggregation step in the DAG.

Data collected from external systems often needs cleaning, and a single message can suffer from multiple issues that need correction. Think carefully about the directed acyclic graph, or DAG, you will need. If an element contains data with multiple defects, you must ensure that the element flows through all of the appropriate transforms.

Whenever possible, apply data transformations serially to let the Dataflow service optimize the DAG for you. When transformations are applied serially, they can be merged together into a single stage, enabling them to be processed on the same worker nodes and reducing costly I/O and network operations.

If your pipeline interacts with external systems, look out for back pressure from those systems. This may be a key-value store like Bigtable or [indistinct] used for lookups in the pipeline, or an [indistinct] sink your pipeline writes to. It is recommended that you ensure the external systems have appropriate capacity to avoid back pressure issues. Enabling autoscaling for Dataflow pipelines is also a good idea. If for some reason your [indistinct] system is backlogged, your Dataflow pipeline can scale down instead of underutilizing pipeline resources.

In this module, we started with an intro to Beam schemas. We discussed their usefulness when dealing with structured data.
Then we looked into best practices for handling unprocessable or erroneous records. Next, we covered best practices around error handling and generation of modules. We wrapped up this session with an overview of DAG optimizations and ways to exploit the lifecycle of [indistinct] to do batch processing. Thanks for joining.
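
As a rough illustration of the filter-early guideline from this section, here is a minimal plain-Python sketch. It is not Beam or Dataflow code: the event data, the stage names, and the `run` helper are all hypothetical, invented only to count how many elements each step has to process depending on where the filter sits.

```python
from collections import Counter

# Hypothetical raw events: (user, event_type, amount).
EVENTS = [
    ("alice", "purchase", 30),
    ("bob", "page_view", 0),
    ("alice", "page_view", 0),
    ("carol", "purchase", 12),
    ("bob", "page_view", 0),
    ("carol", "page_view", 0),
]


def is_purchase(event):
    """Volume-reducing step: keep only purchase events."""
    return event[1] == "purchase"


def enrich(event):
    """Stand-in for an expensive per-element step (here it only upper-cases the user)."""
    user, event_type, amount = event
    return (user.upper(), event_type, amount)


def to_key_value(event):
    """Shape the element as (key, value), ready for a later aggregation."""
    return (event[0], event[2])


def run(stages, events):
    """Apply the stages in order and record how many elements enter each one."""
    seen = Counter()
    data = list(events)
    for name, kind, fn in stages:
        seen[name] = len(data)
        if kind == "filter":
            data = [element for element in data if fn(element)]
        else:  # kind == "map"
            data = [fn(element) for element in data]
    return data, seen


# Same steps, two orderings: filter after the expensive step, or before it.
filter_late = [
    ("enrich", "map", enrich),
    ("filter", "filter", is_purchase),
    ("to_key_value", "map", to_key_value),
]
filter_early = [
    ("filter", "filter", is_purchase),
    ("enrich", "map", enrich),
    ("to_key_value", "map", to_key_value),
]

late_out, late_seen = run(filter_late, EVENTS)
early_out, early_seen = run(filter_early, EVENTS)

print("filter late :", dict(late_seen))
print("filter early:", dict(early_seen))
```

Both orderings produce the same result, but with the filter first the stand-in for the expensive step handles only the two purchase events instead of all six. A real pipeline would express these steps as transforms, and the actual savings depend on the data and the runner.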

2. Let's practice!
