
Running a data pipeline in production

1. Running a data pipeline in production

So far, we've explored techniques to extract, transform, and load data. We've also learned about tools to test data pipelines and make sure they run as expected. In this last video, we'll focus on best practices for running a data pipeline in production.

2. Data pipeline architecture patterns

Several techniques exist to run a data pipeline in a production-like setting. One of the most common is executing a script that triggers the extract, transform, and load logic that forms the pipeline. On the left, both the function definitions and the execution of the pipeline exist in a single file. While this is an easy way to define and run a pipeline, it's not always the best. Too much code in a single file can cause confusion when debugging or sharing your code. A better way to architect a pipeline involves storing function definitions in a separate file from the execution logic. Then, these functions can be imported and called as needed. On the right, the extract, transform, and load functions are stored in the pipeline_utils-dot-py file. When needed for execution, they are imported and called. This architecture pattern helps to separate execution details from the definitions of the extract, transform, and load logic.
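To make this pattern concrete, here is a minimal sketch of the two-file layout just described. The file name pipeline_utils.py and the extract, transform, and load function names come from the video; the function bodies, data paths, and the run_pipeline.py script name are illustrative assumptions.

# pipeline_utils.py -- function definitions live here
import pandas as pd

def extract(file_path):
    # Read raw data from a CSV file (placeholder source)
    return pd.read_csv(file_path)

def transform(raw_df):
    # Apply a simple cleaning step (placeholder transformation)
    return raw_df.dropna()

def load(clean_df, target_path):
    # Persist the cleaned data to Parquet (placeholder destination)
    clean_df.to_parquet(target_path)

# run_pipeline.py -- execution logic imports and calls the functions
from pipeline_utils import extract, transform, load

raw_data = extract("raw_sales.csv")
clean_data = transform(raw_data)
load(clean_data, "clean_sales.parquet")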

3. Running a data pipeline end-to-end

Using the architecture we have just discussed, we can combine all of the tools we've learned so far to run the pipeline end-to-end. First, we'll import the logging module, as well as the extract, transform, and load functions from the pipeline_utils-dot-py file. After the logger is configured, we can execute our ETL logic inside a try-except block. This, coupled with our logging statements, creates a basic monitoring and alerting solution that allows a Data Engineer to track the execution of a pipeline. When placed within a dot-py script, this pipeline can be executed, and will extract, transform, and load data in a production environment. However, there is some added functionality that most Data Engineers look for when deploying a pipeline to production.
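A minimal sketch of that end-to-end script might look like the following; the logging configuration, messages, and file paths are assumptions, but the structure (imports, logger setup, then the ETL calls inside a try-except block) follows the steps described above.

# etl_pipeline.py -- end-to-end execution with basic monitoring
import logging
from pipeline_utils import extract, transform, load

# Configure the logger (format and level are assumptions)
logging.basicConfig(format="%(levelname)s: %(message)s", level=logging.DEBUG)

try:
    # Run the ETL logic, logging progress along the way
    raw_data = extract("raw_sales.csv")
    logging.info("Extracted data from the source file.")

    clean_data = transform(raw_data)
    logging.info("Transformed the raw data.")

    load(clean_data, "clean_sales.parquet")
    logging.info("Successfully loaded data to the target.")

except Exception as e:
    # Surface failures so they can be tracked and alerted on
    logging.error(f"Pipeline failed with error: {e}")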

4. Orchestrating data pipelines in production

When choosing how to deploy a data pipeline, Data Engineers look for the ability to automatically run a pipeline on a schedule, with flexible resources available at run time. In addition to this, it's important that a pipeline retry on failure, and alert when an error occurs. While we've been able to implement some of this logic using homegrown tooling, most Data Engineers will reach for an orchestration, or ETL, tool to help deliver a production-grade pipeline. Orchestration tools deliver additional features to a pipeline's execution, including scheduling, resource scaling, and more robust error handling. The most commonly-used orchestration tool is Apache Airflow. A little more than forty percent of all data pipelines are implemented using this tool. In terms of market share, Airflow is far and away the most popular tool for building, deploying, and monitoring data pipelines. In addition to being open source, it offers a wide range of features and plug-ins. Outside of Airflow, tools such as Prefect and Dagster are also used for orchestration. Despite the wide range of features that third-party orchestration tools bring to the table, many pipelines are still built without the help of a third-party tool, and use homegrown or custom-built solutions. Choosing an orchestration tool is an important decision, and choosing the right one can dramatically expedite pipeline development.
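To illustrate how an orchestration tool adds scheduling, retries, and alerting on top of the same ETL functions, here is a minimal Apache Airflow sketch. The DAG name, schedule, retry settings, and email address are illustrative assumptions, not part of the video.

# sales_etl_dag.py -- orchestrating the pipeline with Apache Airflow (illustrative)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipeline_utils import extract, transform, load

def run_etl():
    # Reuse the same functions defined in pipeline_utils.py
    raw_data = extract("raw_sales.csv")
    clean_data = transform(raw_data)
    load(clean_data, "clean_sales.parquet")

with DAG(
    dag_id="sales_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",            # run automatically on a schedule
    catchup=False,
    default_args={
        "retries": 2,                      # retry on failure
        "retry_delay": timedelta(minutes=5),
        "email_on_failure": True,          # alert when an error occurs
        "email": ["data-team@example.com"],
    },
) as dag:
    run_etl_task = PythonOperator(task_id="run_etl", python_callable=run_etl)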

5. Let's practice!

Let's practice running a pipeline end-to-end in a production-like setting with a few hands-on exercises!
