Introduction to Reliability

Federico: Hi, I'm Federico Patota, a cloud consultant here at Google. In this module, we will learn how to implement reliability in Dataflow pipelines. There are different approaches to reliability depending on the type of pipeline you are running.

Batch jobs are simple. If a batch job does not launch, or if it fails during execution, you can always rerun the job. Source data is not lost, and partial data written to sinks can be rewritten, if it was written at all.

Streaming jobs, on the other hand, are more complex. Streaming jobs process data continuously and behave like a long-lived application, so reliability is of the utmost importance. You must be vigilant for various failure modes, and when a failure inevitably occurs, you must act fast to minimize data loss and downtime.

Most of the reliability best practices in this module are for streaming pipelines. In particular, the second half of this module focuses on disaster recovery and high availability configurations. That said, the lessons covered in the monitoring and geolocation sections are also relevant for your batch workloads.

We can classify pipeline failures into two broad categories: failures related to user code and data shapes, and failures caused by outages.

Software bugs are a reality of software engineering, and data processing applications are no exception. Transient errors and corrupted data can also impact your data processing jobs. So it's important to know how to mitigate the adverse effects that software bugs can produce unintentionally.

Dataflow sits at the center of multiple Google Cloud services. That also means that Dataflow is susceptible to various outage modalities, including service outages, zonal outages, and regional outages. If a network service is down, Dataflow will likely be impacted by it. Similarly, if Compute Engine instances are inaccessible in a particular zone or region where Dataflow workers are running, the data processing jobs will be affected.

Since Dataflow often sits between different parts of a customer application running on GCP, users need to be especially vigilant for any service disruptions. In the following modules, we will discuss different strategies to mitigate the risk of these incidents and increase the reliability of your Dataflow workloads.
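
The point that a failed batch job can simply be rerun can be sketched as a small retry wrapper. This is only an illustration under stated assumptions, not part of Dataflow or Apache Beam: `run_with_retries`, `launch_job`, `max_attempts`, and `backoff_seconds` are hypothetical names, and `launch_job` stands in for whatever code submits the batch pipeline and waits for it to finish.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(
    launch_job: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> T:
    """Rerun a batch job until it succeeds or the attempts run out.

    `launch_job` is a hypothetical callable that launches the batch job,
    blocks until it completes, and raises an exception on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            # A rerun starts from the unchanged source data, so any
            # partial sink output from a failed attempt gets rewritten.
            return launch_job()
        except Exception as error:
            last_error = error
            if attempt < max_attempts:
                # Wait a little longer after each failed attempt.
                time.sleep(backoff_seconds * attempt)

    raise RuntimeError(
        f"batch job failed after {max_attempts} attempts"
    ) from last_error
```

A wrapper like this only suits batch jobs, where the source data is still available and sink output can be rewritten. It does not address streaming jobs, which need the strategies discussed in the rest of the module.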

Serverless Data Processing with Dataflow: Operations

In this module, we learn how to use the Jobs List page to filter for jobs that we want to monitor or investigate. We look at how the Job Graph, Job Info, and Job Metrics tabs collectively provide a comprehensive summary of your Dataflow job. Lastly, we learn how we can use Dataflow’s integration with Metrics Explorer to create alerting policies for Dataflow metrics.

Exercise 1: Job List
Exercise 2: Job Info
Exercise 3: Job Graph
Exercise 4: Job Metrics
Exercise 5: Metrics Explorer
Exercise 6: Quiz Question 1
Exercise 7: Quiz Question 2
Exercise 8: Additional Resources

This module reviews the topics covered in the course.

Exercise 1: Course Summary