1. Airflow DAGs
Welcome back! You've successfully interacted with some basic Airflow workflows via the command line. Let's now take a look at the primary building block of those workflows - the DAG.
2. What is a DAG?
Our first question is what is a DAG?
Beyond any specific mathematical meaning, a DAG, or Directed Acyclic Graph, has the following attributes:
It is Directed, meaning there is an inherent flow representing the dependencies, or order of execution, between components. These dependencies (even implicit ones) give the tooling the context it needs to decide how to order the running of components.
A DAG is also Acyclic - it does not loop or repeat. This does not imply that the entire DAG cannot be rerun, only that the individual components are executed once per run.
In this case, a Graph represents the components and the relationships (or dependencies) between them.
The term DAG appears often in data engineering, not just in Airflow but also in Apache Spark, dbt, and other tools.
3. DAG in Airflow
As we're working with Airflow, let's look at its implementation of the DAG concept.
Within Airflow, DAGs are written in Python, but can use components written in other languages or technologies. This means we'll define the DAG using Python, but we could include Bash scripts, other executables, Spark jobs, and so on.
Airflow DAGs are made up of components to be executed, such as operators, sensors, etc. Airflow typically refers to these as tasks. We'll cover these in much greater depth later on, but for now think of a task as a thing within the workflow that needs to be done.
Airflow DAGs contain dependencies that are defined, either explicitly or implicitly. These dependencies define the execution order so Airflow knows which components should be run at what point within the workflow. For example, you would likely want to copy a file to a server prior to trying to import it to a database.
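As a preview (operators and the dependency syntax are covered in detail later in the course), here is a minimal, hypothetical sketch of that file-copy-before-import example; the DAG name, task names, and bash commands are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical two-task workflow: copy a file to a server, then import it
with DAG('copy_then_import', start_date=datetime(2023, 1, 1)) as dag:
    copy_file = BashOperator(
        task_id='copy_file',
        bash_command='scp data.csv user@server:/data/data.csv',  # placeholder command
    )
    import_file = BashOperator(
        task_id='import_file',
        bash_command='echo "import /data/data.csv into the database"',  # placeholder command
    )

    # The dependency: copy_file must complete before import_file runs
    copy_file >> import_file
```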
4. Define a DAG
Let's look at defining a simple DAG within Airflow. When defining the DAG in Python, you must first import the DAG object from airflow.
Once imported, we create a default arguments dictionary consisting of attributes that will be applied to the components of our DAG. These attributes are optional, but provide a lot of power to define the runtime behavior of Airflow. Here we define the owner name as jdoe, an email address for any alerting, and specify the start date of the DAG. The start date represents the earliest datetime that a DAG could be run.
Finally, we define our DAG object using a Python context manager, starting with DAG, then the name of the DAG, etl_workflow, and assign the default arguments dictionary to the default_args argument. We alias this with the name etl_dag. Note that an alias is not always required, so you may see DAGs defined either way. There are many other optional configurations we will use later on. The remainder of our code would appear within the context manager block - we'll cover that later in the course.
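Putting that together, here is a minimal sketch of the context-manager style definition. The owner, DAG name, and alias come from the example above; the email address and start date are placeholder values:

```python
from datetime import datetime
from airflow import DAG

# Default arguments applied to each task in the DAG
default_args = {
    'owner': 'jdoe',
    'email': ['jdoe@example.com'],        # placeholder address for alerting
    'start_date': datetime(2023, 1, 1),   # placeholder earliest run date
}

# Define the DAG with a context manager and alias it as etl_dag
with DAG('etl_workflow', default_args=default_args) as etl_dag:
    # Tasks and dependencies would be defined here
    pass
```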
Note that DAG is case-sensitive in Python code.
5. Define a DAG (before Airflow 2.x)
It should be noted that there is an older method of defining DAGs, from before the release of Airflow 2. Notice that in this case, the DAG is assigned to a variable named etl_dag instead of using a with statement as before. Both methods behave the same way within Airflow, and you may see either at various points in this course.
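For comparison, here is a sketch of the pre-Airflow 2.x style, where the DAG object is assigned directly to a variable (same placeholder values as above):

```python
from datetime import datetime
from airflow import DAG

default_args = {
    'owner': 'jdoe',
    'start_date': datetime(2023, 1, 1),  # placeholder earliest run date
}

# Older style: assign the DAG object to a variable instead of using `with`
etl_dag = DAG('etl_workflow', default_args=default_args)
```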
6. DAGs on the command line
When working with DAGs (and Airflow in general), you'll often want to use the airflow command line tool.
The airflow command line program contains many subcommands that handle various aspects of running Airflow. You've used a couple of these already in previous exercises.
Use the airflow -h command for help and descriptions of the subcommands.
Many of these subcommands are related to DAGs.
You can use the airflow dags list subcommand to see all recognized DAGs in an installation.
When in doubt, try a few different commands to find the information you're looking for.
7. Command line vs Python
You may be wondering when to use the Airflow command line tool versus writing Python. In general, the airflow command line program is used to start Airflow processes (i.e., the webserver or scheduler), manually run DAGs or tasks, and review logging information.
Python code is typically used to create and edit DAGs, as well as to write the actual data processing code.
8. Let's practice!
Now that we've covered the basics of a DAG in Airflow and how to create one, let's practice working with DAGs!