Introduction to Apache Airflow
1. Introduction to Apache Airflow
Welcome to Introduction to Airflow! I'm Mike Metzger, a Data Engineer, and I'll be your instructor while we learn the components of Apache Airflow and why you'd want to use it. Let's get started!2. What is data engineering?
Before getting into workflows and Airflow, let's discuss a bit about data engineering. While there are many specific definitions based on context, the general meaning behind data engineering is taking any action involving data and making it a reliable, repeatable, and maintainable process.3. What is a workflow?
Before we can really discuss Airflow, we need to talk about workflows. A workflow is a set of steps to accomplish a given data engineering task. These can include any given task, such as downloading a file, copying data, filtering information, writing to a database, and so forth. A workflow is of varying levels of complexity. Some workflows may only have 2 or 3 steps, while others consist of hundreds of components. The complexity of a workflow is completely dependent on the needs of the user. We show an example of a possible workflow to the right. It's important to note that we're defining a workflow here in a general data engineering sense. This is an informal definition to introduce the concept. As you'll see later, workflow can have specific meaning within specific tools.4. What is Airflow?
Airflow is a platform to program workflows (general), including the creation, scheduling, and monitoring of said workflows.5. Airflow continued...
Airflow can use various tools and languages, but the actual workflow code is written with Python. Airflow implements workflows as DAGs, or Directed Acyclic Graphs. We'll discuss exactly what this means throughout this course, but for now think of it as a set of tasks and the dependencies between them. Airflow can be accessed and controlled via code, via the command-line, or via a built-in web interface or REST API. We'll look at all these options later on.6. Other workflow tools
Airflow is not the only tool available for running data engineering workflows. Some other options are Spotify's Luigi, Microsoft's SSIS, or even just Bash scripting. We'll use some Bash scripting within our Airflow usage, but otherwise we'll focus on Airflow.7. Quick introduction to DAGs
A DAG stands for a Directed Acyclic Graph. In Airflow, this represents the set of tasks that make up your workflow. It consists of the tasks and the dependencies between tasks. DAGs are created with various details about the DAG, including the name, start date, owner, email alerting options, etc.8. DAG code example
We will go into further detail in the next lesson but a very simple DAG is defined using the following code. A new DAG is created with the dag underscore id of etl underscore pipeline and a default underscore args dictionary containing a start underscore date for the DAG. Note that within any Python code, this is referred to via the variable identifier, etl underscore dag, but within the Airflow shell command, you must use the dag underscore id.9. Running a workflow in Airflow
To get started, let's look at how to run a component of an Airflow workflow. These components are called tasks and simply represent a portion of the workflow. We'll go into further detail in later chapters. There are several ways to run a task, but one of the simplest is using the airflow tasks test shell command. Airflow tasks test requires two arguments, a dag_id and a task_id, along with an optional execution_date. All of these arguments have specific meaning and will make more sense later in the course. For our example, we'll use a dag_id of example-etl, a task named download-file, and an execution date of 2024-01-10. This task would simply download a specific file, perhaps a daily update from a remote source. Our command as such is airflow tasks test example-etl download-file 2024-01-10. This will then run the specified task within Airflow.10. Let's practice!
We've looked at Airflow and some of the basic aspects of why you'd use it. We've also looked at how to run a task within Airflow from the command-line. Let's practice what we've learned.Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.