Deployment

1. Deployment

Having tested, built, and established our environment, let's talk about deployment. We’ll look at three distinct stages of the pipeline lifecycle: deployment, in-flight actions, and termination. Let’s begin with deployment. There are two ways to deploy your Dataflow job. We can use a direct launch, in which we run the pipeline directly from the development environment. For Java pipelines this means running the pipeline from Gradle or Maven. For Python, this means running the python script directly. This method can be used for both batch and streaming pipelines. We can also use templates. Templates allow us to launch a pipeline without having access to a developer environment. Templates are built and deployed as an artifact on Cloud Storage, and can be used for both batch and streaming pipelines. Separating the development and execution environments makes it easy to automate your Dataflow deployments and enable non-technical users. If you're using an external scheduler, like Airflow, you'll be able to use Airflow's built-in support for Dataflow, which calls a template when invoked. You would supply your pipeline options via the operator arguments. More generically, if you're orchestrating the deployment via CLI tooling, you can provide the options via gcloud commands. You might notice that Dataflow SQL is not on this list. Dataflow SQL is a special case of a templated deployment—it’s actually a user interface built on top of a flex template. When we're deploying our pipeline for the first time, we need to submit the pipeline to the Dataflow service with a unique name. This name will be used by the monitoring tools when you look at the monitoring page in the console. Next, we’ll move onto pipeline actions that can be taken on in-flight pipelines. These actions are only available to streaming pipelines, since batch jobs can simply be relaunched. As a streaming pipeline processes data, it accumulates state. It's useful to have ways to preserve the state so that we can manage changes to our pipeline without the risk of permanent data loss. We can use snapshots for this. With snapshots, we can save the state of an executing pipeline before launching a new pipeline. This way, if we need to roll our pipeline back to a previous version, we can do that without losing the data processed by the version being rolled back. Since streaming pipelines are long-running applications, we’re likely to need to modify our pipeline from time to time. Now that we’ve saved the state of the running pipeline with a snapshot, we can safely update it. When you update a job on a Dataflow service, you replace the existing job with a new job that runs your updated pipeline code. The Dataflow service retains the job name, but runs the replacement job with an updated job ID. If for any reason you are not happy with how the replacement job is running, you can roll back to the prior version by creating a job from a snapshot. Let’s explore these two actions a little more deeply. Dataflow snapshots are useful for several scenarios. As mentioned, Dataflow Snapshots provides a copy of intermediate state of your pipeline at the moment the snapshot is taken. You can use that snapshot to validate a pipeline update, or use it as checkpoint for you to roll back your pipeline in the event of an unhealthy release. You can also use Snapshots for backups and recovery use cases. We’ll explore this in more depth in the Reliability module. Lastly, Snapshots create a safe path for migrating pipelines to Streaming Engine. If you want to reap the benefits of smoother autoscaling and superior performance, you can take a snapshot and create a job from that snapshot. The new job will run on Streaming Engine. The flip side of this is that jobs created from Snapshots cannot be run with Streaming Engine disabled. Let’s see how we use Dataflow Snapshots. We’ll cover two Snapshot workflows: creating snapshots, and creating a job from a snapshot. We can create a snapshot in the UI or using the CLI. We can navigate to the Job Details page of the pipeline of interest. You’ll see a Create Snapshot button to initiate the snapshot. After you click on the button, you’ll be prompted to select if your snapshot will or will not be created with a source snapshot. If you are using Pub/Sub, creating a snapshot with source will allow you to create a coordinated snapshot between your unread messages and accumulated state. This makes it easier to roll back your pipeline to a known point in time. The pipeline will pause processing while the snapshot is being taken. Depending on how much state is buffered, it could take a matter of minutes. We recommend planning to take snapshots during periods of the day when latency can be tolerated, such as non-business hours. You can also create snapshots using the CLI or API. These methods allow you to automate snapshots of your Dataflow pipelines, which lends itself nicely to scheduling snapshots on a weekly cadence. To create a job from a snapshot, we have to pass in two extra parameters, as seen in the sample command here. We have to enable Streaming Engine with the --enableStreamingEngine flag. Secondly, we pass in the Snapshot ID into the createFromSnapshot parameter. If you are creating a job from the snapshot for a modified graph, the new graph must be compatible with the prior job. We’ll discuss update compatibility shortly. Now that we’ve snapshotted our pipeline, we’re ready to update our pipeline. There are various reasons why you might want to update your Dataflow job: One is to enhance or otherwise improve your pipeline code. Another is to fix bugs in your pipeline code. You might also want to update your pipeline to handle changes in the data format. Finally, you might want to change your pipeline to account for version and other changes in your data source. To update your pipeline, you’ll need to do a couple of things. First, you need to pass the "update" and "jobName" options when you submit the new pipeline. You’ll have to set jobName to the name of the existing pipeline, or else the old job will not be replaced. This tells Dataflow that you're updating the job, rather than deploying a new pipeline. Second, if you added, removed, or changed any transform names, you'll need to tell Dataflow about these changes by providing a transformNameMapping. The replacement job will preserve any intermediate state data from the prior job. Note, however, that changing the windowing or triggering strategies will not affect data that's already buffered or already being processed by the pipeline. "In-flight" data will still be processed by the transforms in your new pipeline. Additional transforms that you add in your replacement pipeline code may or may not take effect, depending on where the records are buffered. Updates can also be triggered via the API. This can enable continuous deployment contingent on other tests passing. When you update your job, the Dataflow service performs a compatibility check between your currently running job and your potential replacement job. The compatibility check ensures that things like intermediate state information and buffered data can be transferred from your prior job to your replacement job. This means that some changes are not possible with streaming update. Let’s review the most common compatibility breaks. Modifying your pipeline without providing a transform mapping will fail the compatibility check. When you update a job, the Dataflow service tries to match your transforms from your prior job to your new job so that intermediate state data for each step can be fully processed. If your changes have renamed or removed any steps, you will have to provide a transform mapping so that Dataflow can match the state. Adding or removing side inputs will also cause the check to fail. Changing coders. The Dataflow service isn’t able to serialize or deserialize records if your updated job uses different data encoding. Running your job with a new zone and a new region will also cause your compatibility check to fail. Your replacement job must run in the same zone in which you ran your prior job. Lastly, removing stateful operations. Dataflow fuses multiple steps together for efficiency, but if you’ve removed a state-dependent operation from within a fused step, the check will fail. If your pipeline requires any of these changes, we recommend draining your pipeline, then launching a new job with the updated code. Now that we’ve discussed actions you can take on your streaming pipeline, we’ll discuss two ways that you can terminate your pipeline. First, we start with drain. Selecting drain will tell the Dataflow service to stop consuming messages from your source and finish processing all buffered data. After the last record is processed, the Dataflow workers are torn down. This action is only applicable to streaming pipelines. Secondly, we can cancel the job. Using the Cancel option ceases all data processing and drops any intermediate, unprocessed data. We can cancel both batch and streaming jobs. Let’s take a closer look at both of these options. We can terminate our job from the UI. When you navigate to the Job Details page of your job, you will find a Stop button in the menu bar. You’ll be prompted between two options for stopping your job: canceling your job, or draining your job. Let’s explore each of these options. When we drain the pipeline, it will stop pulling data from the source, and it will finish processing data that has already been read into the pipeline. This provides an advantage from cancelling your job outright, since no record is dropped. When you relaunch your streaming job, it will continue processing unacknowledged messages from your source. However, when a streaming pipeline is drained, the watermark is moved to infinity, which closes all windows. Closing all the windows in this way will result in incomplete aggregations, since draining the pipeline will not wait for open windows to be closed before stopping pulling from the source system. Consider the impact of incomplete aggregations on downstream systems when draining your pipelines. Beam attaches a PaneInfo object that provides information about the pane an element belongs to, as every pane is implicitly associated with a window. You can use PaneInfo to identify incomplete windows and choose to write that data elsewhere, which can save you the hassle of reconciling incomplete aggregations with your production dataset. When you cancel a job, Dataflow will immediately begin shutting down the resources associated with your job. The pipeline will stop pulling and processing data immediately, which means you may lose any data that was still being processed when the pipeline was canceled. If your use case can tolerate data loss, then cancelling your job will fit your purpose. So, to summarize the lifecycle of a streaming pipeline, let's review our deployment options. First, if it's the first time deploying the pipeline, there's no existing state to consider. So you just deploy the pipeline. If there's an existing pipeline, and you want to update it, you should take a snapshot of your pipeline. This ensures that you have a working state you can revert your pipeline to if you observe an issue with your new deployment. Once you’ve taken a snapshot, you’re ready to update your job. You need to account for any changes to the names of the pipeline's transformations by providing the mapping from old names to the new. If the updated pipeline is compatible, the update will succeed, and you'll get a new pipeline in place of the old, without losing the state of the previous version of the pipeline. If the update is not possible, then you’ll need to choose between drain and cancel options. If you can replay the source, then you can choose to cancel the pipeline, which will drop any in-flight data. You can then deploy the new pipeline, and replay the data from the source. Note that we cannot use a Dataflow snapshot if the pipeline modifications are not update compatible, but taking a snapshot of the source with a Pub/Sub snapshot, you can minimize unnecessary reprocessing. If replay is not possible, then you can drain the pipeline. This will not lose data, but you may end up with incomplete aggregations in your output sink. Your downstream business logic should inform the appropriate approach to handling this. Once the pipeline has been drained, you can relaunch the pipeline. This is the end of the module. You should now be able to: Execute test approaches with your Dataflow pipeline, Snapshot and update your Dataflow pipelines, And drain and cancel your Dataflow pipelines.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.