Continuous integration and continuous delivery (CI/CD) for data pipelines

1. Continuous integration and continuous delivery (CI/CD) for data pipelines

If DevOps practices help you collaborate with many teammates and increase your velocity in introducing changes, how do they help ensure that those changes are correct and reliable and don't negatively impact, say, a data pipeline? This is where the concepts of continuous integration and continuous delivery, often shortened to CI/CD, play an important role.

Continuous integration refers to the practice of introducing changes into a central codebase and running automated tests and builds on those changes. This practice helps teams quickly find and address any bugs, while also validating the correctness of the new changes against the existing codebase. If anything is out of the ordinary, say a test fails because of a breaking change, you're able to address it before the change is rolled out to large swaths of end users or systems. Oftentimes, this entire workflow is automated through the use of a third-party tool.

Continuous delivery refers to the practice of pushing changes into dedicated environments. By environments, I mean a data environment used for a specific purpose. For example, when building apps or pipelines, it's very common for teams to have multiple environments, each with its own purpose. One might be a staging environment, where teams can safely test out new changes in a place that won't impact end users. And by test, I don't mean running automated tests; those likely already ran before the change made it into an environment. I mean actually seeing and interacting with a change in the app or pipeline, to observe how it works or how it doesn't. Another environment might be a production environment, which typically represents the real environment where end users actually operate.

And these are just a couple of examples. I've seen environments of all sorts, with all different names, from development to staging to testing to production. In fact, production is quite common from a naming perspective. You set up environments just like these earlier.

In this course, we're going to focus on the latter part of CI/CD: continuous delivery. There are many different frameworks and approaches for testing code in an automated fashion, and we won't be diving into the details of those, because implementations can vary immensely from data environment to data environment. Rather, we're going to focus on the functionality and tools that Snowflake provides to help you implement efficient continuous delivery of data pipelines. We'll focus on introducing changes using source control, deploying those changes to specific environments to test them out, and automating this entire workflow with the command line and GitHub Actions.

Here's specifically what we'll do. First, we're going to use Snowflake CLI. Earlier, I mentioned using tools to move quickly as part of DevOps, and oftentimes teams use product-specific command line interfaces to achieve this. We'll use Snowflake CLI to help deploy changes into multiple environments. Second, since we're using GitHub to host our source-controlled files, we'll also use GitHub Actions to automate the deployment of our pipelines. We'll dive into the details of the different components in an upcoming exercise, and I'll walk you through the process step by step.

We've now covered two of the three DevOps best practices for data engineering: source control and declarative change management. Let's dive into the third: implementing continuous delivery for our data pipelines.
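
To make the flow described above a little more concrete, here is a minimal Python sketch of the idea, not the course's actual setup. It only models the logic: automated checks run first (continuous integration), and only a change that passes is prepared for the environment its branch maps to (continuous delivery). The environment names, database names, branch mapping, and function names are all hypothetical, and nothing here calls Snowflake CLI or GitHub Actions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Environment:
    """A data environment used for a specific purpose."""
    name: str
    database: str  # hypothetical database name, for illustration only


# Hypothetical environments; real names vary from team to team.
ENVIRONMENTS: Dict[str, Environment] = {
    "staging": Environment("staging", "PIPELINE_STAGING"),
    "production": Environment("production", "PIPELINE_PROD"),
}

# Hypothetical rule: which source-control branch feeds which environment.
BRANCH_TO_ENV: Dict[str, str] = {
    "staging": "staging",
    "main": "production",
}


def run_checks(checks: List[Callable[[], bool]]) -> bool:
    """Continuous integration step: every automated check must pass."""
    return all(check() for check in checks)


def render_sql(template: str, env: Environment) -> str:
    """Fill the environment-specific database name into a SQL template."""
    return template.format(database=env.database)


def deliver(
    branch: str,
    template: str,
    checks: List[Callable[[], bool]],
) -> Optional[str]:
    """Continuous delivery step.

    Returns the SQL to run in the target environment, or None when the
    branch has no environment or when any check fails.
    """
    if branch not in BRANCH_TO_ENV:
        return None
    if not run_checks(checks):
        return None
    env = ENVIRONMENTS[BRANCH_TO_ENV[branch]]
    return render_sql(template, env)


if __name__ == "__main__":
    template = "CREATE TABLE IF NOT EXISTS {database}.RAW.ORDERS (ID INT)"
    print(deliver("staging", template, [lambda: True]))
    print(deliver("main", template, [lambda: False]))
```

In a real workflow, an automation tool would trigger this kind of logic on each change, and the resulting statement would be handed to a command line tool to run against the chosen environment. That part is left out here because it depends on the specific setup.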

2. Let's practice!
