Exercise

Scheduling the full data pipeline with Airflow

In the previous exercises, you learned about several Airflow operators that can trigger small data pipelines working with files in the data lake. These are the pipelines you learned about in Chapters 1 and 2! You also saw how to specify the order in which steps run, using the .set_upstream() and .set_downstream() methods (or the bitshift operators >> and <<).
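
As a refresher, here is a minimal sketch of both ordering styles. The DAG name and tasks are hypothetical, and the import paths shown are Airflow 1.x style (newer versions import BashOperator from airflow.operators.bash):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A hypothetical two-task DAG, used only to illustrate task ordering.
dag = DAG(
    dag_id="ordering_demo",
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
)

first = BashOperator(task_id="first", bash_command="echo first", dag=dag)
second = BashOperator(task_id="second", bash_command="echo second", dag=dag)

# Method style: "second" runs only after "first" succeeds.
second.set_upstream(first)

# Equivalent bitshift style (use one or the other, not both):
# first >> second
```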

Now it’s time for the frosting on the cake: bring the operators from the previous exercises together and schedule them in the right order!

The operators you will need (SparkSubmitOperator, PythonOperator, and BashOperator) have already been imported.

Instructions

100 XP
  • Use the correct operators for the ingest (a bash task), clean (a Spark job), and insight (another Spark job) tasks.
  • Define the order in which the tasks should be run (one possible arrangement is sketched below).
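
For illustration, here is a minimal sketch of how the three tasks could be wired together. The DAG configuration, file paths, and the ingestion command are assumptions made for this sketch; the exercise environment supplies its own values, and the import paths are Airflow 1.x style:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

# Hypothetical DAG configuration; the exercise defines its own.
dag = DAG(
    dag_id="full_data_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
)

# Ingest: a bash task that lands raw files in the data lake.
ingest = BashOperator(
    task_id="ingest",
    bash_command="bash /home/repl/ingest.sh",  # hypothetical ingestion script
    dag=dag,
)

# Clean: a Spark job that cleans the ingested data.
clean = SparkSubmitOperator(
    task_id="clean",
    application="/home/repl/spark/clean_data.py",  # hypothetical path
    dag=dag,
)

# Insight: a second Spark job that derives insights from the clean data.
insight = SparkSubmitOperator(
    task_id="insight",
    application="/home/repl/spark/compute_insights.py",  # hypothetical path
    dag=dag,
)

# Schedule the tasks in order: ingest, then clean, then insight.
ingest >> clean >> insight
```

The bitshift chain on the last line is equivalent to calling clean.set_upstream(ingest) followed by insight.set_upstream(clean).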