
Workflow scheduling frameworks

1. Workflow scheduling frameworks

Hi, and welcome to the last video of the chapter. In this video, we'll introduce the final tool we'll talk about in this course: the workflow scheduling framework. We've seen how to pull data from existing databases, and we've seen parallel computing frameworks like Spark. However, something needs to tie all of these jobs together, and that is the task of the workflow scheduling framework: to orchestrate these jobs.

2. An example pipeline

Let's take an example. You can write a Spark job that pulls data from a CSV file, filters out some corrupt records, and loads the data into a SQL database, ready for analysis. However, let's say you need to do this every day as new data comes into the CSV file. One option is to run the job manually each day. Of course, that doesn't scale well: what about the weekends? There are simple tools that could solve this problem, like cron, the Linux scheduling utility. However, let's say you have one job for the CSV file, another job to pull in and clean data from an API, and a third job that joins the data from the CSV and the API together. The third job depends on the first two finishing. It quickly becomes apparent that we need a more holistic approach, and a simple tool like cron won't suffice.
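As a rough sketch of the first job only (with hypothetical file paths, table name, and connection settings, not the course's actual code), such a Spark job could look like this in PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_sql").getOrCreate()

# Read the daily CSV file and drop records Spark cannot parse.
clean = (
    spark.read
    .option("header", True)
    .option("mode", "DROPMALFORMED")  # filter out corrupt records
    .csv("/data/incoming/sales.csv")
)

# Load the cleaned data into a SQL database over JDBC.
(
    clean.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")
    .option("dbtable", "sales_clean")
    .option("user", "pipeline_user")
    .option("password", "secret")  # placeholder; use a secrets store in practice
    .mode("append")
    .save()
)
```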

3. DAGs

So, we end up with dependencies between jobs. A great way to visualize these dependencies is through Directed Acyclic Graphs, or DAGs. A DAG is a set of nodes connected by directed edges, with no cycles in the graph, which means that no path following the directed edges visits a specific node more than once. In the example on the slide, Job A needs to happen first, then Job B, which enables Jobs C and D, and finally Job E. As you can see, it feels natural to represent this kind of workflow as a DAG. The jobs represented by the DAG can then run on a daily schedule, for example.
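As a small illustration (not from the course itself), the slide's example DAG can be written down as a plain adjacency list, where each job maps to the jobs that may start once it finishes:

```python
# Hypothetical representation of the slide's example DAG as an adjacency list.
dag = {
    "job_a": ["job_b"],           # Job A runs first
    "job_b": ["job_c", "job_d"],  # B unlocks C and D
    "job_c": ["job_e"],
    "job_d": ["job_e"],           # E waits for both C and D
    "job_e": [],                  # no outgoing edges: the end of the pipeline
}
```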

4. The tools for the job

So, we've talked about dependencies and scheduling DAGs, but what tools are available to us? Well, first of all, some people use the Linux tool cron. However, most companies use a more full-fledged solution. There's Spotify's Luigi, which allows for the definition of DAGs for complex pipelines. However, for the remainder of the video, we'll focus on Apache Airflow, which has grown into the de facto workflow scheduling framework.

5. Apache Airflow

Airbnb created Airflow as an internal tool for workflow management. They open-sourced Airflow in 2015, and it joined the Apache Software Foundation's incubator program in 2016. Airflow is built around the concept of DAGs: using Python, developers can create and test the DAGs that build up complex pipelines.

6. Airflow: an example DAG

Let's look at an example from an e-commerce use case in the DAG shown on the slide. The first job starts a Spark cluster. Once it's started, we can pull in customer and product data by running the ingest_customer_data and ingest_product_data jobs. Finally, we aggregate both tables using the enrich_customer_data job, which runs after both ingest_customer_data and ingest_product_data complete.

7. Airflow: an example in code

In code, it would look something like this. First, we create a DAG using the `DAG` class. Afterward, we use an operator to define each of the jobs. Several kinds of operators exist in Airflow. There are simple ones like BashOperator and PythonOperator that execute bash or Python code, respectively. You can also write your own operators, like the SparkJobOperator or StartClusterOperator in the example. Finally, we define the connections between these operators using `.set_downstream()`.
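Here is a minimal sketch of such a DAG, assuming Airflow 2.x. The StartClusterOperator and SparkJobOperator mentioned above are custom operators rather than Airflow built-ins, so the built-in BashOperator stands in for them here; the task names mirror the e-commerce example, while the commands, dates, and schedule are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Create the DAG itself: an id, a start date, and a daily schedule.
dag = DAG(
    dag_id="customer_enrichment",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
)

# One operator per job. BashOperator is a placeholder for the custom
# StartClusterOperator / SparkJobOperator from the slide.
start_cluster = BashOperator(
    task_id="start_cluster",
    bash_command="echo 'start Spark cluster'",
    dag=dag,
)
ingest_customer_data = BashOperator(
    task_id="ingest_customer_data",
    bash_command="echo 'ingest customer data'",
    dag=dag,
)
ingest_product_data = BashOperator(
    task_id="ingest_product_data",
    bash_command="echo 'ingest product data'",
    dag=dag,
)
enrich_customer_data = BashOperator(
    task_id="enrich_customer_data",
    bash_command="echo 'join customer and product data'",
    dag=dag,
)

# Wire up the dependencies: ingestion starts once the cluster is up,
# and enrichment waits for both ingestion jobs to finish.
start_cluster.set_downstream(ingest_customer_data)
start_cluster.set_downstream(ingest_product_data)
ingest_customer_data.set_downstream(enrich_customer_data)
ingest_product_data.set_downstream(enrich_customer_data)
```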

8. Let's practice!

Ok, let's see how you do in the exercises.
