Job Graph

1. Job Graph

person: Now let's focus on what we see on the job graph. In the middle of the page, we see a visual representation of the steps from my Beam code. I am running a batch pipeline that reads data from a BigQuery table, reshuffles it, and writes it to a Cloud Storage bucket in TFRecord format. The records are split across training, validation, and test sets. In Beam, some steps are made up of substeps. We can see these by expanding each step. As this is a batch job, steps are executed sequentially. The next part should not start until the one before it finishes.

Dataflow optimizes both streaming and batch pipelines. Part of the optimization includes fusing multiple steps in your pipeline into single stages. If we press on any step, we can see how Dataflow splits that step into a number of optimized stages. Some stages will be shared between different steps. For example, if I press on the RecordToExample step and view its stages, and then press on ReshuffleResults and view its stages, we can see they share a common stage. When that stage starts, the UI will show both RecordToExample and ReshuffleResults running. If we were running a streaming job, all the stages and steps would run concurrently.

Pressing on each step not only shows us the optimized stages, but also throughput info for each step across time. Below that, we see the total number of elements added and their estimated size. Another metric available at each step and substep is the wall time. This shows the total amount of time spent by the assigned workers to run each step. This can be a useful metric to look at when you want to see where your workers are spending the most time.

Eventually, the batch job will complete. Because batch jobs deal with a known amount of data, they do finish. Once a job finishes, all steps should be marked with a green check mark, as shown here. If a job fails, the steps that failed will be shown in red with an error symbol.
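The shared-stage behavior described above can be pictured with a small toy model. This is only a sketch in plain Python, not Dataflow's internal representation: the stage IDs (F1 to F3), the grouping of steps into stages, and the ReadFromBigQuery and WriteTFRecords step names are all hypothetical. Only RecordToExample and ReshuffleResults come from the walkthrough.

```python
# Toy model of fused stages; this is not Dataflow's internal representation.
# Stage IDs (F1..F3), the groupings, and the Read/Write step names are hypothetical.
STAGE_TO_STEPS = {
    "F1": ["ReadFromBigQuery", "RecordToExample"],
    "F2": ["RecordToExample", "ReshuffleResults"],  # the stage both steps share
    "F3": ["ReshuffleResults", "WriteTFRecords"],
}


def stages_for_step(step):
    """Return the optimized stages that execute any part of the given step."""
    return sorted(s for s, steps in STAGE_TO_STEPS.items() if step in steps)


def shared_stages(step_a, step_b):
    """Return the stages that two steps have in common."""
    return sorted(set(stages_for_step(step_a)) & set(stages_for_step(step_b)))


def steps_shown_running(active_stage):
    """Return every step that would be marked as running while a stage is active."""
    return list(STAGE_TO_STEPS[active_stage])


if __name__ == "__main__":
    common = shared_stages("RecordToExample", "ReshuffleResults")
    print("Shared stages:", common)
    for stage in common:
        print(stage, "marks as running:", steps_shown_running(stage))
```

In this model, a step appears as running whenever any stage containing it is active, which is why two steps can light up at once in a batch job even though stages run one after another.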
As streaming jobs process unbounded collections, there is no completion time for the job unless you cancel or drain it.

Beam lets you set custom metrics for your pipeline. The Metrics class has three methods that can be used: counter, distribution, and gauge. The counter method lets you increment and decrement a count for any variable or event you are interested in checking. The distribution method does not give you a histogram; it tracks the count, minimum, maximum, and mean of the values you report. The gauge method lets you see the latest value of the variable you set it to track. Please review our public docs to see which custom metric types are supported in Dataflow. The Dataflow UI displays any custom metric on the right pane of the Job Graph page. For example, this is a pipeline run of the Beam Java word count example. On the Dataflow UI, we see the custom metrics associated with the job here. We count the number of empty lines and track the length distribution of the lines.
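To make the difference between the three metric types concrete, here is a minimal sketch of what each one aggregates. These are plain-Python stand-ins, not the Beam classes: the class names and the track_lines helper are hypothetical, and only the method names (inc, dec, update, set) mirror the Beam Python SDK. The example follows the word count case above, counting empty lines and tracking line lengths.

```python
import math


class Counter:
    """Stand-in for a counter metric: a value that can go up or down."""

    def __init__(self):
        self.value = 0

    def inc(self, n=1):
        self.value += n

    def dec(self, n=1):
        self.value -= n


class Distribution:
    """Stand-in for a distribution metric: summary statistics, no buckets."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, value):
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self):
        # The mean is derived from the running sum and count.
        return self.total / self.count if self.count else None


class Gauge:
    """Stand-in for a gauge metric: remembers only the latest value."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


def track_lines(lines):
    """Hypothetical helper: record word-count style metrics for some lines."""
    empty_lines, line_lengths, last_length = Counter(), Distribution(), Gauge()
    for line in lines:
        if not line.strip():
            empty_lines.inc()
        line_lengths.update(len(line))
        last_length.set(len(line))
    return empty_lines, line_lengths, last_length


if __name__ == "__main__":
    empty, lengths, last = track_lines(["to be", "", "abcd"])
    print(empty.value, lengths.count, lengths.minimum, lengths.maximum, lengths.mean, last.value)
```

Because the distribution keeps only four running values, it cannot tell you how many lines fell into a given length range, which is what a histogram would provide.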

2. Let's practice!
