
Orchestration

1. Orchestration

As we get more data, we need to properly coordinate the different jobs that operate our platform. Let's review how orchestration solves this.

2. What is orchestration?

Let's imagine we have five tables to ingest using a delta approach. After we ingest that data, we want to perform some data quality validations and transformations and finally create a datamart with that data to feed a dashboard. This process needs to happen every day. So, how would we do it? Well, there are a couple of approaches. First, we could try to create multiple schedulers to run every job independently, estimating how long each job takes in order to schedule the next one. Another possibility would be to schedule the ingestion jobs and send some sort of notification after they end, so someone on the team can start the next job, or have a service that listens for that notification and starts the next job automatically. This last approach is a form of orchestration. Orchestration is the automated configuration and coordination of complex workflows. As we can imagine, there are many benefits to it, but the most important one is freeing up human resources.

3. Orchestration vs scheduling

So, in our previous example, we also mentioned scheduling. That's a valid option, but, as we may imagine, it's extremely sensitive to many factors we don't necessarily control, such as how long each job actually takes. That's why orchestration comes into play. Nonetheless, orchestration doesn't discard scheduling. In fact, scheduling is most often what starts our orchestration workflows.

4. Apache Airflow

One popular tool for orchestration is Apache Airflow. It is a platform to programmatically author, schedule, and monitor workflows using Python code. We will use it to explain some of the main concepts around orchestration.
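To make this concrete, here is a minimal sketch of an Airflow workflow defined in Python. It assumes Airflow 2.x (the schedule argument requires version 2.4 or later), and the DAG id and command are purely illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal workflow: one DAG containing a single task.
with DAG(
    dag_id="hello_airflow",           # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # Airflow triggers one run per day
    catchup=False,
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Airflow!'",
    )
```

Once a file like this is placed in Airflow's dags folder, the scheduler picks it up and triggers it on the defined interval.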

5. Core concepts of orchestration

Let's start from the ground up. A task is the basic unit of execution in Airflow. It may be as simple as running a SQL query or a Python script, or as complex as starting a whole ETL job. However, complex workflows require many tasks. Some can be done independently, but others may depend on the results of previous tasks. This is where dependencies come in. For example, task B, which may involve data analysis, can only be completed after task A, the data extraction. These tasks and dependencies form what we call a Directed Acyclic Graph, or DAG. In Airflow, a DAG is essentially a workflow. It's "directed" because tasks follow specific paths and "acyclic" because tasks don't loop back on themselves, ensuring workflows always move forward. For instance, in this DAG, we have task "A" that starts the DAG, then we have tasks "B" and "C" that depend on "A", and finally, task "D" depends on the previous ones without any cycles.
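As a sketch, the A, B, C, D example could be expressed in Airflow like this, using placeholder EmptyOperator tasks (available since Airflow 2.3; earlier versions call it DummyOperator). The DAG id is illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="example_dag_structure",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule=None,                    # no schedule; focus on structure
) as dag:
    # Four placeholder tasks mirroring the A, B, C, D example.
    a = EmptyOperator(task_id="A")
    b = EmptyOperator(task_id="B")
    c = EmptyOperator(task_id="C")
    d = EmptyOperator(task_id="D")

    # Dependencies: A runs first, B and C depend on A,
    # and D depends on both B and C. No cycles are possible.
    a >> [b, c]
    [b, c] >> d
```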

6. Core concepts of orchestration

Now, how are these tasks carried out? Here's where operators come into play. Each task is an instance of an operator class. The operator determines the nature of the task, whether it's running a Bash command, executing a Python function, or even waiting for a certain condition to be met, which leads us to sensors. Sensors are a special kind of operator. They wait for a specific condition to be met. For instance, a sensor could pause the workflow until a particular file lands in a specific location. Finally, we have the scheduler. The scheduler automates the triggering of our tasks based on a given interval. It checks the DAGs to see if they have any tasks to run and triggers them accordingly.
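Below is a hedged sketch combining these ideas: a FileSensor waits for a file to land, a PythonOperator processes it once the file arrives, and the scheduler triggers the DAG daily. The DAG id, file path, and function are illustrative, and the sensor assumes Airflow's default fs_default filesystem connection:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def summarize():
    # Placeholder for the real processing logic.
    print("Summarizing the newly arrived file...")


with DAG(
    dag_id="operators_and_sensors",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # the scheduler triggers one run per day
    catchup=False,
) as dag:
    # Sensor: pauses the workflow until the file shows up.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/sales.csv",  # illustrative path
        poke_interval=60,                     # check every 60 seconds
    )

    # Operator: runs a Python function once the sensor succeeds.
    summarize_file = PythonOperator(
        task_id="summarize_file",
        python_callable=summarize,
    )

    wait_for_file >> summarize_file
```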

7. Let's practice!

Let's apply what we've learned about orchestration.