Testing and CI/CD Overview

1. Testing and CI/CD Overview

Hi! My name is Vince Gonzalez, and I am a data engineer with Google Cloud. This module will introduce frameworks and features available to streamline your CI/CD workflow for Dataflow pipelines. In this module, we'll cover an overview of testing and CI/CD. We'll discuss unit testing your Beam pipelines, integration tests, artifact building, and considerations around deploying your pipelines. Let's get into the overview.
Software engineers are no strangers to application lifecycle management. After all, that is how we keep applications fresh and up to date. Dataflow pipelines are no different. Dataflow pipelines are authored according to well-understood best practices within software engineering.
First, Dataflow pipelines need a comprehensive testing strategy. So we should be implementing unit tests, integration tests, and end-to-end tests to ensure that our pipeline behaves as we expect. The approach to deployment should also be well structured. A haphazard rollout can result in corrupted data being written to the sink or disruptions to your downstream applications. Finally, data engineers should strive to validate changes made to pipeline logic, and have a rollback plan if there is a bad release.
While all these considerations are similar to general application development, there are some key differences to point out. Data pipelines often aggregate data, and this makes them stateful, in that they must accumulate the result of some aggregation over time. This means that if you need to update your pipeline, you need to consider any state that may exist in the pipeline you're updating. We'll discuss this in more detail later, but when you change your pipeline, you'll need to account for existing state, as well as any changes to the pipeline logic and topology. Changes you make need to be compatible with the pipeline you're updating. If they are not, you will have to devise alternate migration strategies that might require reprocessing data.
If you do roll out a bad configuration, you could be dealing with more than just an unpleasant experience for end users. If your pipeline produces non-idempotent side effects in external systems, you will have to account for those effects after a rollback. This raises the stakes for ensuring safe releases.
Now that we understand some of the challenges that come with testing and deploying data processing applications, let's take a look at what testing and CI/CD look like with Beam and Dataflow. Testing in Beam is summed up well by this diagram. You can read this from the center out, starting with the Beam pipeline itself, and some hand-crafted test inputs, then moving to other PTransforms and DoFn subclasses, before considering integration testing, which involves real data sources and sinks.
So let's talk about unit tests. All pipelines revolve around transforms, and the lowest level we typically deal with in Beam is the DoFn. Since these are essentially functions, we validate their behavior with unit tests that operate on input datasets. They produce output datasets that we validate with assertions. Similarly, we can provide test inputs to the entire pipeline, which might contain our DoFns as well as other PTransforms and DoFn subclasses. We also assert that the results of the entire pipeline are what we expect.
For system integration tests, we incorporate a small amount of test data using the actual I/Os. This should be a small amount of data, since our goal is to ensure the interaction with the I/Os produces the expected results. Finally, end-to-end tests use a full testing dataset, which is more representative of the data our pipeline will see in production.
Whatever tooling you're using in your CI/CD testing environment, you'll make use of the Direct Runner, which runs on your local machine, and your production runners, which run on the cloud service of your choice, like Dataflow.
The Direct Runner will be used for local development, unit tests, and small integration tests with your data sources. You'll use your production runner when it's time to do larger integration tests, when you want to test performance, and when you want to test pipeline deployment and rollback.
More broadly, the CI/CD lifecycle looks something like this. It's iterative, and moves through a cycle of development, building artifacts and testing, followed by deployment. In the development part of the lifecycle, we write our code, executing unit tests locally using the Direct Runner and executing integration tests using the Dataflow runner. As we develop and test, we're committing to source repositories along the way. These commits and pushes trigger the continuous integration system to compile and test our code in an automated manner, using Cloud Build or a similar CI system. Once the builds complete successfully, artifacts are deployed, first to a preproduction environment where end-to-end tests are run. If these succeed, we deploy to our production environment.
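
The split between the Direct Runner and a production runner described above could be captured in a small helper that a CI job calls to choose pipeline flags for each kind of test. This is only a sketch under stated assumptions: the stage names and the `runner_flags` function are hypothetical, while `--runner`, `--project`, `--region`, and `--temp_location` are standard Beam pipeline flags.

```python
from typing import List, Optional

# Hypothetical stage names for this sketch; the course does not define them.
LOCAL_STAGES = {"unit", "small-integration"}
CLOUD_STAGES = {"large-integration", "performance", "deployment"}


def runner_flags(
    stage: str,
    project: Optional[str] = None,
    region: Optional[str] = None,
    temp_location: Optional[str] = None,
) -> List[str]:
    """Return Beam command-line flags for the given test stage."""
    if stage in LOCAL_STAGES:
        # Local development and small tests run on the Direct Runner.
        return ["--runner=DirectRunner"]
    if stage in CLOUD_STAGES:
        # Larger tests run on Dataflow, which needs cloud settings.
        required = (
            ("project", project),
            ("region", region),
            ("temp_location", temp_location),
        )
        missing = [name for name, value in required if not value]
        if missing:
            raise ValueError(f"Dataflow stages need: {', '.join(missing)}")
        return [
            "--runner=DataflowRunner",
            f"--project={project}",
            f"--region={region}",
            f"--temp_location={temp_location}",
        ]
    raise ValueError(f"Unknown stage: {stage!r}")
```

The returned list can be passed to Beam's `PipelineOptions(flags)` when constructing the pipeline, so the same pipeline code runs locally for unit tests and on Dataflow for performance or deployment tests.
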

2. Let's practice!
