Airflow Dags
1. Airflow Dags
Welcome back! You've successfully interacted with a basic Airflow workflow through the web UI. Let's now take a look at the primary building block of these workflows: the Dag.
2. What is a Dag?
What is a Dag? A Dag, or Directed Acyclic Graph, has the following characteristics. It is Directed, which means there are dependencies between the components that determine the order in which they run. It is Acyclic, which means the dependencies never loop back on themselves, so components do not repeat. While you can re-run an entire Dag, each component is executed only once per run. In this context, a Graph represents the set of components and their dependencies.
3. Dag in Airflow
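Before the Airflow specifics, the directed and acyclic properties just described can be pictured with a small plain-Python sketch. It does not use Airflow at all; it relies only on the standard library's graphlib module, and the component names are hypothetical, chosen purely for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical components; each key maps to the set of components it depends on.
dependencies = {
    "copy_file": set(),
    "import_to_db": {"copy_file"},
    "build_report": {"import_to_db"},
}


def run_order(deps):
    """Return one valid execution order in which every component appears once."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True if the dependencies loop back on themselves."""
    try:
        run_order(deps)
    except CycleError:
        return True
    return False


order = run_order(dependencies)
print(order)  # ['copy_file', 'import_to_db', 'build_report']
print(has_cycle({"a": {"b"}, "b": {"a"}}))  # True
```

The dependencies give the graph its direction, and the cycle check is what "acyclic" rules out: two components that each wait on the other could never be ordered.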
Let's look at Airflow's implementation of the Dag concept. Airflow Dags are written in Python, but can use components written in other languages. We define the Dag using Python, but could include Bash scripts, Spark jobs, and others. Airflow Dags consist of components to be executed, such as operators or sensors. Airflow typically refers to these as tasks. We'll cover these in greater depth later, but for now think of a task as a job within the workflow that needs to be done. Airflow Dags contain dependencies that define the execution order of components within a workflow. These can be explicit or implicit. For example, a file must be copied to a server before it can be imported into a database.
4. Define a Dag
Let's define a simple Dag within Airflow. First, we import the dag decorator from airflow.sdk. This decorator-based style is called the TaskFlow API. We then apply the decorator, @dag, with arguments defining the runtime behavior of the Dag. We include a dag_id, which is the name of the Dag as it appears to Airflow. We define an email address for any alerting, then specify the start date, which represents the earliest datetime the Dag could be run. Airflow uses the Pendulum library for datetime handling, so we import datetime from pendulum and pass a timezone using the tz argument. Next, we define the function that acts as our entry point to the Dag, in this case etl_workflow. The function name and dag_id might match, but this is not required. We'll cover this function later, but note that it must be called at the end of the Dag file to tell Airflow what to run. Note that Python is case sensitive, so the decorator must be written in lowercase as dag.
5. Dags on the command line
When working with Airflow, you'll often want to use the airflow command line tool, which contains many subcommands that handle various aspects of running Airflow. Use the airflow -h command for help and descriptions of the subcommands. Many of these subcommands are related to Dags. airflow dags list shows all recognized Dags in an installation. airflow dags reserialize will reload all the workflows from the Python Dag files. airflow tasks test will run a specific task within a Dag. When in doubt, try a few different commands to find the information you're looking for.
6. Command line vs Python
You may be wondering when to use the Airflow command line tool vs writing Python. In general, the airflow command line program is used to start Airflow processes, such as the scheduler or the web server (airflow api-server in Airflow 3, airflow webserver in Airflow 2), to manually run Dags or tasks, and to review logging information. Python code is usually used to create and edit a Dag, as well as to write the actual data processing code itself.
7. Let's practice!
Now that we've covered Dag basics and how to create one, let's practice!