Disaster Recovery

1. Disaster Recovery

In this video, we will look at disaster recovery methods with Dataflow. These methods apply only to streaming pipelines. Data is your most prized asset, which is why it is essential to have a disaster recovery strategy in place for your production systems. One way is to take snapshots of your data source. This capability is supported in many popular relational databases and data warehouses. But what if you are using a messaging service? Pub/Sub offers this capability too.
You can implement a disaster recovery strategy with two features: Pub/Sub Snapshots, which let you capture the message acknowledgement state of a subscription, and Pub/Sub Seek, which lets you alter the acknowledgement state of messages in bulk. If you use this strategy, you will have to reprocess messages in the event of a pipeline failure. This means you will have to consider how to reconcile this in your data sink and deduplicate records that have been written twice.
Let's go over what we need to do to use Pub/Sub Snapshots to support our disaster recovery requirements. First, take a snapshot of the Pub/Sub subscription. To do this, you can use the command line interface, CLI for short, or the Cloud console. After your Pub/Sub snapshot has been created, you can stop and drain your Dataflow pipeline. You can do this using the command line interface or on the Job Details page of the Dataflow UI. Once your pipeline has stopped processing messages, you can use Pub/Sub's Seek functionality to revert the acknowledgement state of the messages in your subscription to the snapshot. Again, you can achieve this using the command line tool. Finally, you are ready to resubmit your pipeline. You can launch it in any of the ways you normally deploy a Dataflow job, either directly from your development environment or by using the command line tool to launch a template. The example on this slide shows a simple command for a templated job that has been launched with the command line interface.
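The four steps above can be sketched as a small Python helper that assembles the corresponding gcloud commands without running them. This is only an illustration under stated assumptions: the snapshot, subscription, job, and template names are hypothetical placeholders, and your own launch command may need extra flags such as template parameters.

```python
import shlex
from typing import List


def pubsub_recovery_commands(
    snapshot: str,
    subscription: str,
    job_id: str,
    region: str,
    new_job_name: str,
    template_path: str,
) -> List[List[str]]:
    """Return the gcloud commands for the four recovery steps, in order."""
    return [
        # 1. Capture the acknowledgement state of the subscription.
        ["gcloud", "pubsub", "snapshots", "create", snapshot,
         f"--subscription={subscription}"],
        # 2. Stop the running pipeline by draining it.
        ["gcloud", "dataflow", "jobs", "drain", job_id,
         f"--region={region}"],
        # 3. Rewind the subscription to the snapshot.
        ["gcloud", "pubsub", "subscriptions", "seek", subscription,
         f"--snapshot={snapshot}"],
        # 4. Relaunch the pipeline from a template.
        ["gcloud", "dataflow", "jobs", "run", new_job_name,
         f"--gcs-location={template_path}", f"--region={region}"],
    ]


# Hypothetical names, used only for illustration.
commands = pubsub_recovery_commands(
    snapshot="orders-snap",
    subscription="orders-sub",
    job_id="2024-01-01_00_00_00-1234567890",
    region="us-central1",
    new_job_name="orders-pipeline-restored",
    template_path="gs://my-bucket/templates/orders",
)
for cmd in commands:
    # Print only; nothing is executed against a real project.
    print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; in practice you would review each command and run it in order, waiting for the drain to finish before the seek.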
An important caveat to consider is that Pub/Sub messages have a maximum data retention of seven days. This means that after seven days, a Pub/Sub snapshot no longer has any use for your stream processing. If you choose to use Pub/Sub Snapshots for your disaster recovery, we recommend that you take snapshots weekly at a minimum to ensure that you do not lose any data in the event of a pipeline failure.
Using Pub/Sub Snapshots in conjunction with Seek is a good starting point. But when you are using Pub/Sub and Dataflow for your streaming analytics, there are important things to consider. When you use Pub/Sub Seek to restart your data pipeline from a Pub/Sub snapshot, messages will be reprocessed. This creates a few challenges. First, you might observe duplicate records in your sink. The amount of duplication depends on how many messages were processed between the time the snapshot was taken and the time the pipeline was terminated. In addition, data that has been read by your pipeline but has yet to be processed and written to the sink will need to be processed over again. Remember that Dataflow acknowledges a message from Pub/Sub when it has read the message, not when the record has been written to the sink. This presents a challenge for pipelines with complex transformation logic. For example, if your pipeline is processing millions of messages per second and goes through multiple processing steps, having to reprocess the data represents a significant amount of lost compute.
Lastly, if your pipeline implements exactly-once processing, windowing logic will be interrupted when you drain and restart your pipeline. Since you lose the buffered state when you drain your pipeline, you must conduct a tedious reconciliation exercise if exactly-once processing is a requirement for your use case. Luckily, Dataflow also has native snapshot capabilities.
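To make the duplicate-record challenge described above concrete, here is a minimal sketch of a sink that ignores replayed messages. It assumes every record carries a stable unique ID (the `message_id` argument here is a hypothetical field), and it uses an in-memory dictionary as a stand-in for a real sink, where you would more likely rely on an upsert or merge keyed on that ID.

```python
from typing import Dict, List, Tuple


class IdempotentSink:
    """Toy in-memory sink that keeps one row per message ID."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}
        self.duplicates_skipped = 0

    def write(self, message_id: str, payload: str) -> bool:
        """Store the payload unless this ID was already written."""
        if message_id in self.rows:
            self.duplicates_skipped += 1
            return False
        self.rows[message_id] = payload
        return True


sink = IdempotentSink()

# Messages written before the pipeline was drained.
before_drain: List[Tuple[str, str]] = [("m1", "a"), ("m2", "b"), ("m3", "c")]

# Suppose the snapshot was taken after m1: seeking back to it
# redelivers m2 and m3, followed by the new message m4.
after_seek: List[Tuple[str, str]] = [("m2", "b"), ("m3", "c"), ("m4", "d")]

for message_id, payload in before_drain + after_seek:
    sink.write(message_id, payload)

print(len(sink.rows), sink.duplicates_skipped)
```

Note that this only addresses duplicates in the sink. It does not recover the compute spent on reprocessing, and it does not restore windowing state lost in the drain.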
If you recall, we introduced Dataflow Snapshots as a useful tool for testing and rolling back updates to streaming pipelines in our Testing and CI/CD module. Dataflow Snapshots can also be used for disaster recovery scenarios. Since Dataflow Snapshots save the streaming pipeline state, we can restart the pipeline without reprocessing in-flight data. This saves you money whenever you have to restart your pipeline. Moreover, you can restore your pipeline much faster than with the Pub/Sub Snapshots and Seek strategy. This ensures that you have minimal downtime. Dataflow snapshots can be created with a corresponding Pub/Sub source snapshot. This helps you coordinate the snapshot of your pipeline with your source. In other words, you can pick up your processing where you left off when you restart the pipeline. This saves you the hassle of having to manage Pub/Sub snapshots yourself.
Let's take a look at how we can use Dataflow Snapshots for disaster recovery scenarios. Our first step is to create a snapshot of the Dataflow pipeline. We can do this directly in the UI with the Create Snapshot button in the menu bar. You will be prompted to create a snapshot with or without sources. If your pipeline is using Pub/Sub, we recommend that you select the With Sources option. You can also create a snapshot using the command line interface. Next, we need to stop and drain the Dataflow pipeline. This is also possible both in the UI and using the command line interface. Lastly, we create a new job from the snapshot. This is accomplished by passing the snapshot ID as a parameter when you deploy your job from your deployment environment. Since Dataflow snapshots, like their Pub/Sub counterparts, have a maximum retention of seven days, we recommend scheduling a coordinated Dataflow and Pub/Sub snapshot at least once a week.
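The three Dataflow Snapshot steps above can be sketched the same way, as a helper that builds the commands without executing them. Treat the details as assumptions to verify against your own setup: the job ID, snapshot ID, and `my_pipeline.py` script are hypothetical, and the `--create_from_snapshot` option is, to our understanding, the Beam Python SDK's name for the snapshot parameter (other SDKs spell it differently).

```python
from typing import Dict, List


def dataflow_snapshot_commands(
    job_id: str,
    region: str,
    snapshot_id: str,
    pipeline_script: str,
    with_sources: bool = True,
    ttl: str = "7d",
) -> Dict[str, List[str]]:
    """Build the commands for snapshot, drain, and restore."""
    create = [
        "gcloud", "dataflow", "snapshots", "create",
        f"--job-id={job_id}",
        f"--region={region}",
        f"--snapshot-ttl={ttl}",
    ]
    if with_sources:
        # Also snapshot the Pub/Sub source (the With Sources option).
        create.append("--snapshot-sources=true")

    drain = ["gcloud", "dataflow", "jobs", "drain", job_id,
             f"--region={region}"]

    # Assumption: the pipeline is a Beam Python script and accepts
    # --create_from_snapshot; check the option name for your SDK.
    restore = [
        "python", pipeline_script,
        "--runner=DataflowRunner",
        f"--region={region}",  # must match the snapshot's region
        f"--create_from_snapshot={snapshot_id}",
    ]
    return {"create": create, "drain": drain, "restore": restore}


# Hypothetical identifiers, used only for illustration.
steps = dataflow_snapshot_commands(
    job_id="2024-01-01_00_00_00-1234567890",
    region="us-central1",
    snapshot_id="snapshot-abc123",
    pipeline_script="my_pipeline.py",
)
for name, cmd in steps.items():
    print(name, "->", " ".join(cmd))
```

The snapshot ID is only known once the snapshot has been created, so in a real run you would read it from the output of the first command before building the restore step.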
This means that if your pipeline goes down, you have a point in time in the past seven days from which you can restart processing, ensuring you can almost always avoid a data loss scenario. You can use Cloud Composer or Cloud Scheduler to schedule this weekly snapshot.
Snapshots are located in the region of their origin job. When you create a job from a snapshot, you must launch the job in the same region. This is useful for zonal outages. If a zone goes down, you can relaunch the job from a snapshot in a different zone in the same region. This protects your workloads against zonal outages. However, Dataflow Snapshots cannot help you migrate to a different region in the event of a regional outage. The best action to take in that event is to wait for the region to come back online or to relaunch the job in a new region without the snapshot. If you've taken a snapshot, though, you can ensure that your data is not lost.
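The retention and region rules above can be expressed as a small selection function. This is a sketch under assumptions: `SnapshotRecord` is a hypothetical bookkeeping structure of our own, not a Dataflow API type, and the seven-day window is the retention stated in this video.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

RETENTION = timedelta(days=7)  # maximum snapshot retention, per this video


@dataclass(frozen=True)
class SnapshotRecord:
    """Hypothetical record of a snapshot we have taken."""
    snapshot_id: str
    region: str
    created_at: datetime


def pick_restore_snapshot(
    snapshots: List[SnapshotRecord],
    job_region: str,
    now: datetime,
) -> Optional[SnapshotRecord]:
    """Return the newest unexpired snapshot in the job's region, if any."""
    usable = [
        s for s in snapshots
        if s.region == job_region and now - s.created_at < RETENTION
    ]
    return max(usable, key=lambda s: s.created_at, default=None)


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
history = [
    SnapshotRecord("snap-old", "us-central1", now - timedelta(days=8)),
    SnapshotRecord("snap-recent", "us-central1", now - timedelta(days=2)),
    SnapshotRecord("snap-eu", "europe-west1", now - timedelta(days=1)),
]

chosen = pick_restore_snapshot(history, "us-central1", now)
print(chosen.snapshot_id if chosen else "no usable snapshot")

# A different region has no usable snapshot, which mirrors the
# regional-outage limitation described above.
other_region = pick_restore_snapshot(history, "asia-east1", now)
print(other_region)
```

When the function returns nothing, you are in the situation described above: either wait for the region to recover or relaunch in a new region without the snapshot.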

2. Let's practice!
