Job Metrics

1. Job Metrics

person: The other page available on the Dataflow UI is the Job Metrics tab. This shows us time series data for our job. The page differs between batch and streaming jobs. Let's look at the Job Metrics for the BigQuery to TensorFlow Records batch pipeline we ran earlier.
The first graph shows the number of workers that ran across the lifetime of the job with autoscaling enabled. At certain points during the job, we can see that the Dataflow service decided that more workers were needed to increase the job throughput. The green line shows how many workers are needed, and the blue line shows the current number of workers. There will be a small time gap between the two, as each new worker needs time to spin up and to have work assigned to it.
The second graph shows the throughput for each sub-step versus time. Recall that your Beam steps are made up of sub-steps, and here we see the throughput of each one. In the Job Graph tab, we discussed how batch pipelines do not run all the steps concurrently. We can see that on this graph. The first hump shows the records being read, and the second one shows the records being partitioned and saved to Google Cloud Storage.
The third graph shows each worker's CPU utilization percentage. In our job run, we see that all workers reached near 100% CPU utilization. A healthy pipeline should have all the workers running at around the same CPU utilization rate. If you see that a couple of your workers are running at 100% and the rest of the workers have low utilization, your pipeline is likely unhealthy and suffering from an uneven distribution of workload. Some Beam operations, like GroupByKey, cannot split the work for a single key across workers. Each worker will be assigned a range of keys to group. If your data is heavily skewed, one worker could end up doing all the work while the others do nothing. On the CPU utilization graph, we see this as a couple of workers having high CPU utilization while the others have low CPU utilization.
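To make the skew scenario concrete, here is a minimal Python sketch. It is not Dataflow's actual work-assignment logic: it assumes a hypothetical scheme where every element of a key goes to one worker chosen by a stable hash, and the names worker_loads and is_skewed, as well as the 2x threshold, are made up for illustration.

```python
import zlib
from collections import Counter


def worker_loads(keys, num_workers):
    """Count how many elements each (hypothetical) worker would have to group."""
    loads = [0] * num_workers
    for key, count in Counter(keys).items():
        # Assumption: all elements of a key land on one worker, chosen by a
        # stable hash. Dataflow's real key-range assignment is more involved.
        worker = zlib.crc32(key.encode("utf-8")) % num_workers
        loads[worker] += count
    return loads


def is_skewed(loads, ratio=2.0):
    """Flag the distribution if the busiest worker exceeds `ratio` times the mean load."""
    mean = sum(loads) / len(loads)
    return max(loads) > ratio * mean


# One "hot" key carries 90% of the elements.
skewed_keys = ["hot"] * 900 + [f"user-{i}" for i in range(100)]
skewed_loads = worker_loads(skewed_keys, num_workers=4)
print(skewed_loads, is_skewed(skewed_loads))
```

Whichever worker receives the hot key ends up with at least 900 of the 1,000 elements, which is the pattern that would show up as one busy worker and several idle ones on the CPU utilization graph.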
The last graph in batch pipelines is the worker error log count. As the name suggests, this shows the number of log entries from the workers that had a level of error. In batch jobs, if processing an element fails four times in a row, the whole pipeline fails.
Let us now look at a streaming pipeline's Job Metrics page. This is for a pipeline I ran that reads from Pub/Sub and sinks to BigQuery. Just like batch pipelines, there are graphs for autoscaling, throughput, CPU utilization, and worker error log count. In addition to these graphs, there are a few graphs specific to streaming jobs. Let us start with the first two, the data freshness and system latency graphs. These graphs are great for measuring the health of a streaming pipeline.
The data freshness graph shows the difference between real time and the output watermark. The output watermark is a timestamp such that any element with an earlier timestamp is nearly guaranteed to have been processed. For example, if the current time is 9:26 a.m., and the data freshness graph's value at that time is six minutes, that means that all elements with a timestamp of 9:20 a.m. or earlier have arrived and have been processed by the pipeline.
The system latency graph shows how long it takes elements to go through the pipeline. If the pipeline is blocked at any stage, the latency will increase. For example, imagine our pipeline reads from Pub/Sub, does some Beam transformation on the elements, then sinks them into Spanner. Suddenly, Spanner goes down for five minutes. When this happens, Pub/Sub won't receive confirmation from Dataflow that an element has been written to Spanner. This confirmation is needed for Pub/Sub to delete that element. As there is no confirmation, the system latency and data freshness graphs will both rise to five minutes. Once the Spanner service comes back, all the elements will be written into Spanner and Dataflow will confirm that with Pub/Sub, returning the system latency and data freshness graphs to normal.
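The 9:26 a.m. data freshness example above is simple arithmetic, and a short Python sketch may help. The function name output_watermark and the calendar date are assumptions for illustration only; nothing here calls a Dataflow API.

```python
from datetime import datetime, timedelta


def output_watermark(now, data_freshness):
    """Data freshness = now - output watermark, so the watermark is now - freshness."""
    return now - data_freshness


# The date is arbitrary; only the time of day matters for this example.
now = datetime(2024, 1, 1, 9, 26)
watermark = output_watermark(now, timedelta(minutes=6))
print(watermark.strftime("%H:%M"))  # 09:20
```

Read the other way around, a data freshness value that keeps growing means the output watermark has stopped advancing while real time moves on, which is what the Spanner outage scenario would look like.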
In addition to the data freshness and system latency graphs, streaming jobs can also have input and output metrics at the bottom of the metrics page. Input metrics and output metrics are displayed if your streaming Dataflow job has read or written records using Pub/Sub. In my case, I only had Pub/Sub as an input, so I can only see input metrics. If I have more than one Pub/Sub source or sink, I can view the metrics of any one of them by clicking on the drop-down and choosing the Pub/Sub source or sink I want. In my case, I only have one Pub/Sub source, and that is my subscription named data flow fund.
The first graph we'll talk about is the requests per second graph. Requests per second is the rate of API requests to read or write data by the source or sink over time. If this rate drops to zero, or decreases significantly for an extended period relative to your expected behavior, then the pipeline might be blocked performing certain operations, or there is no data to read. If this happens, you should review steps that have a high system watermark to see where the blockage is happening. Also, examine the worker logs for errors or indications that slow processing is occurring.
The second graph is the response errors per second by error type graph. Response errors per second by error type is the rate of failed API requests to read or write data by the source or sink over time. If errors occur frequently and repeatedly, see what they are and cross-reference them against the Pub/Sub error code documentation.
For all pipelines, you can restrict the timeline for the graphs and logs using the time selector tool. Right now I have a job that has been running for a few hours. How do I focus on a specific time interval? This is where the time selector tool comes in, and I'll show you how to use it. Open the time selector tool by pressing the button showing the currently selected time range.
This will open a drop-down menu where you can select a time range for the charts and logs, ranging from hours to the maximum lifetime of the pipeline. You can even choose a custom time range by setting the start and end time you want to view. Let's click the max time for the pipeline to see how the graphs change across the pipeline's entire run, and press Apply to see the change.
Keep an eye on the data freshness and system latency graphs. At the beginning of our run, the pipeline had a lot of data to read. If I bring the cursor near the peak of the data freshness graph, we can see the pipeline was approximately 16 hours behind wall time when it started. This is because I first sent data to a Pub/Sub subscription for 16 hours before starting the pipeline.
If I want to zoom into a specific time period on the graph, I press on the start point I am interested in, then hold and drag to the end of the time period I am interested in. Once I release the pointer, all the graphs will be zoomed into the highlighted time range. If I want to exit the zoomed view, I press the Reset Zoom button at the top.
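Going back to the requests per second graph: the idea of a rate that drops to zero or decreases significantly for an extended period can be sketched in Python. The function name sustained_drop, the 10% threshold, and the five-sample window are all assumptions chosen for illustration. The sketch works on a plain list of samples and does not call any Dataflow or Cloud Monitoring API.

```python
def sustained_drop(rates, expected, fraction=0.1, min_points=5):
    """Return True if at least `min_points` consecutive samples fall below
    `fraction` of the expected request rate."""
    threshold = fraction * expected
    run = 0  # length of the current streak of low samples
    for rate in rates:
        if rate < threshold:
            run += 1
            if run >= min_points:
                return True
        else:
            run = 0  # a healthy sample breaks the streak
    return False


# Requests per second sampled at regular intervals (made-up numbers).
stalled = [100, 0, 0, 0, 0, 0, 100]
flapping = [100, 0, 100, 0, 100, 0]
print(sustained_drop(stalled, expected=100))
print(sustained_drop(flapping, expected=100))
```

Requiring several consecutive low samples is a deliberate choice here: a single dip is often noise, while a long streak is the situation where checking the worker logs becomes worthwhile.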
