Monitoring

1. Monitoring

person: Now we will discuss how a good monitoring strategy can improve the reliability of your Dataflow workloads. First, a reminder: for batch jobs, tasks with failing items are retried up to four times. After four failures, the job fails. As mentioned before, a batch job can be rerun with little fear of data loss or interruption of existing services, as long as the rerun completes within your desired service level objective or resolution. For streaming pipelines, failing work items are retried indefinitely. In the rest of this module, we will discuss techniques to prevent your streaming workloads from being stuck forever.

Erroneous records may cause your pipeline to get stuck or fail outright. As described in previous modules, we highly recommend implementing a dead-letter queue and error logging to prevent these failure modes. This can help catch problems in your code or in data shapes. This code snippet shows an example of the pattern written in Java. There are a few things to note here. We wrap the user code inside the processElement function in a try-catch block. Inside the catch block, we do not log every error or exception, as doing so may overwhelm the whole pipeline. Instead, we send the erroneous record to an alternative dead-letter sink. We use tuple tags so that we can write to multiple outputs from the same transform. This lets us write to downstream PCollections as well as send raw data to a persistent storage medium like BigQuery or Cloud Storage, so that we can inspect it offline.

To maximize the reliability of your workloads, it is essential to implement a robust monitoring and alerting strategy. Monitoring and alerting policies can help you catch issues with your data processing before they bring down production systems. Monitoring lets you combine different types of metrics and observe important service level indicators, or SLIs, of pipeline performance.
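The dead-letter pattern described above can be modeled in plain Java, without any Beam dependency. This is a minimal sketch and not the snippet from the course slide: the names `DeadLetterSketch`, `DeadLetter`, `Result`, `parse`, and `process` are hypothetical, and integer parsing stands in for arbitrary user code. In a real pipeline, the try-catch block would sit inside the DoFn's process-element method, and the two lists would be separate outputs selected with tuple tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Framework-free model of dead-letter routing. All names are hypothetical;
 * in a Beam pipeline the two lists in Result would be tagged outputs.
 */
final class DeadLetterSketch {

    /** A failed record: the raw input plus a short reason, kept for offline inspection. */
    static final class DeadLetter {
        final String rawRecord;
        final String reason;

        DeadLetter(String rawRecord, String reason) {
            this.rawRecord = rawRecord;
            this.reason = reason;
        }
    }

    /** The two outputs of the processing step. */
    static final class Result {
        final List<Integer> parsed = new ArrayList<>();
        final List<DeadLetter> deadLetters = new ArrayList<>();
    }

    /** Stand-in for user code: parses a record as an integer, throws on malformed input. */
    static int parse(String record) {
        return Integer.parseInt(record.trim());
    }

    /** Wraps the user code in try-catch and routes failures to the dead-letter output. */
    static Result process(List<String> records) {
        Result result = new Result();
        for (String record : records) {
            try {
                result.parsed.add(parse(record));
            } catch (RuntimeException e) {
                // Keep the raw record instead of logging each failure or rethrowing,
                // so one bad record cannot stall or fail the whole run.
                result.deadLetters.add(new DeadLetter(record, e.getClass().getSimpleName()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result result = process(Arrays.asList("1", "oops", " 3 "));
        System.out.println("parsed records: " + result.parsed.size());
        System.out.println("dead letters: " + result.deadLetters.size());
    }
}
```

Keeping the raw record alongside a short reason, rather than only an error message, is what makes offline inspection and later reprocessing possible.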
If you compare those SLIs against acceptable thresholds, monitoring can give you critical insights for early detection of potential issues. Dataflow provides a web-based monitoring interface that can be used to view and manage jobs. You can create metrics-based alerts with a couple of clicks. We covered this in our monitoring module. In addition, Dataflow's integration with Cloud Monitoring provides extensive flexibility for pipeline monitoring. You can collect custom metrics that point to health conditions that are relevant for your use case, like the number of erroneous records that have been detected. The possibilities are endless with Dataflow's monitoring integration.

For batch workloads, you might be interested in the overall runtime of your job. If the job runs on a recurring schedule, you might want to ensure that the job completes successfully within a given period of time. Some variance in pipeline execution time is expected across runs due to a variety of factors, but if a run violates your service level objectives, or SLOs, you need to be notified right away. With Cloud Monitoring, you can track the elapsed time for a job and create an alert that goes off if the elapsed time exceeds a threshold that is equivalent to your SLO. This can be done entirely through Cloud Monitoring's integration with Dataflow.

For streaming pipelines, you want steady and sustained data processing. Dataflow provides standard metrics like data freshness and system latency that make it easy to track whether your pipeline is falling behind. You can create an alert with a couple of clicks from the Dataflow monitoring UI that is triggered if the selected metric exceeds the specified threshold. This is an example of a simple alerting policy. You can combine it with custom metrics or with other statistics to determine the failure condition that matters to your workload. These alerts are essential for improving the reliability posture of your Dataflow pipelines.
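The alert conditions discussed above come down to comparing a metric against a threshold. The following plain-Java sketch models that logic only; it does not call Cloud Monitoring or Dataflow, and the class name `SloAlertSketch`, its method names, and every threshold value are hypothetical assumptions chosen for illustration.

```java
import java.time.Duration;

/**
 * Plain-Java model of threshold-based alert conditions. It does not call
 * Cloud Monitoring; every threshold below is a hypothetical example value.
 */
final class SloAlertSketch {

    /** Batch: true when the job's elapsed time is longer than the SLO allows. */
    static boolean batchRuntimeViolated(Duration elapsed, Duration slo) {
        return elapsed.compareTo(slo) > 0;
    }

    /** Streaming: true when the reported data freshness is worse than the threshold. */
    static boolean freshnessViolated(Duration dataFreshness, Duration threshold) {
        return dataFreshness.compareTo(threshold) > 0;
    }

    /** Custom metric: true when the share of erroneous records exceeds maxRatio. */
    static boolean errorRatioViolated(long erroneousRecords, long totalRecords, double maxRatio) {
        if (totalRecords <= 0) {
            return false; // no records processed, so there is no ratio to evaluate
        }
        return (double) erroneousRecords / totalRecords > maxRatio;
    }

    /** Combined streaming condition: stale data or too many erroneous records. */
    static boolean streamingAlert(Duration dataFreshness, Duration freshnessThreshold,
                                  long erroneousRecords, long totalRecords, double maxRatio) {
        return freshnessViolated(dataFreshness, freshnessThreshold)
                || errorRatioViolated(erroneousRecords, totalRecords, maxRatio);
    }

    public static void main(String[] args) {
        Duration slo = Duration.ofHours(2); // hypothetical batch SLO
        System.out.println("batch alert: " + batchRuntimeViolated(Duration.ofMinutes(135), slo));
        System.out.println("streaming alert: "
                + streamingAlert(Duration.ofMinutes(3), Duration.ofMinutes(5), 12, 1000, 0.01));
    }
}
```

In practice the monitoring service evaluates these conditions for you; the sketch only makes explicit how a standard metric and a custom metric might be combined into one failure condition.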

2. Let's practice!
