Data Shape

1. Data Shape

In this section we will discuss the effects of data shape on a pipeline’s performance. Unique characteristics of data you are processing also effects the performance of Dataflow pipelines. For example, data skew often results in unbalanced data processing. As discussed in the “Serverless Data Processing with Dataflow” course, operations like groupByKey merge multiple PCollections into one. During a GroupByKey or combine operation, keys will be shuffled to workers. All values related to one key will be sent to the same machine throughout the process. This can be a problem when the data you are processing is skewed. For example, columns used as keys that are @nullable often end up being hot keys. To mitigate the hot key issue, we can use one of the following three techniques. The first one is to use the helper API “withFanout(int)”. This allows for the definition of intermediate workers before the final combine step. Another similar API is withHotKeyFanout(Sfn). It is available for Combine.perkey and allows for a function to determine intermediate steps. Using Dataflow Shuffle for batch or Streaming Service also alleviates this issue. The Dataflow shuffle or streaming service offloads the shuffle operation to a backend service. This means that shuffle operation is not constrained by resources available on a single worker machine. The Dataflow service makes it easier to detect and surface hot keys. To do so, set “hotKeyLoggingEnabled” flag to true. Enabling this flag will print the specific key that is your bottleneck, which can help Dataflow developers to implement custom logic for that specific key. Without the flag, Dataflow will print if they think they've detected a hot key, but cannot reveal what that key is. Key space used in your pipeline also has an impact on its performance. For example, the maximum amount of parallelism is determined by the number of keys. More machines will not be able to do any more work if key space is limited. Below are some general guidelines regarding key space: Too few keys is bad for performance. Limited keyspace will make it hard to share workload, and per-key ordering will kill performance. Too many keys can be bad too as overhead starts to creep in. If the key space is very large, consider using hashes separating keys out internally. This is especially useful if keys carry date/time information. In this case you can "re-use" processing keys from the past that are no longer active, essentially for free. Pro Tip! If windows are distinct, the window can be added as part of the key to shard work across more workers. Adding the window to the key improves the ability of the system to parallelize processing since those keys can now be processed in parallel on different machines since they are now recognized as unrelated.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.

Serverless Data Processing with Dataflow: Operations

AdvancedSkill Level

4.9+

7 reviews

In this module, we learn how to use the Jobs List page to filter for jobs that we want to monitor or investigate. We look at how the Job Graph, Job Info, and Job Metrics tabs collectively provide a comprehensive summary of your Dataflow job. Lastly, we learn how we can use Dataflow’s integration with Metrics Explorer to create alerting policies for Dataflow metrics.

Exercise 1: Job List Exercise 2: Job Info Exercise 3: Job Graph Exercise 4: Job Metrics Exercise 5: Metrics Explorer Exercise 6: Quiz Question 1 Exercise 7: Quiz Question 2 Exercise 8: Additional Resources

This module reviews the topics covered in the course

Exercise 1: Course Summary