Templating, idempotency, and backfilling

1. Templating, idempotency, and backfilling

What happens when we re-run a pipeline? Does it produce the same result, or does it break things?

2. Jinja templates in Airflow

We already know the basics of Jinja templating, so let's apply them in a production context. Here we use the @task.bash decorator, which lets us return a bash command as a string from a Python function. Inside that string, {{ ds }} is a Jinja template variable that Airflow renders as the run's logical date in YYYY-MM-DD format. Each Dag run gets its own logical date, so the same task definition produces a different output file for every run. One important detail: Jinja rendering only works in templated fields, the specific parameters, such as bash_command and sql, that Airflow passes through the Jinja engine before execution.
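To make the rendering step concrete, here is a minimal sketch that runs without Airflow. It is not Airflow's API: the render function is a simplified stand-in for the Jinja engine, and the script name, output path, and the year 2025 are assumptions made up for illustration. It only shows the idea that one command template yields a different command for each logical date.

```python
import re
from datetime import date, timedelta

# Hypothetical command template; the script name and path are illustrative only.
COMMAND_TEMPLATE = "python export_sales.py --date {{ ds }} > /tmp/sales_{{ ds }}.csv"


def render(template: str, context: dict) -> str:
    """Replace each {{ name }} placeholder with context[name].

    A simplified stand-in for Jinja rendering; Airflow uses the real Jinja engine.
    """
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(context[match.group(1)]),
        template,
    )


def commands_for(start: date, days: int) -> list[str]:
    """Render the template once per consecutive logical date."""
    return [
        render(COMMAND_TEMPLATE, {"ds": (start + timedelta(days=i)).isoformat()})
        for i in range(days)
    ]


commands = commands_for(date(2025, 3, 30), 2)
for command in commands:
    print(command)
```

Each rendered command writes to its own file, which is what lets every run keep its output separate from the others.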

3. The idempotency problem

But date-awareness alone isn't enough. Look at what happens: the first run inserts three sales rows for March 31st. Then someone re-runs the pipeline for the same date. The INSERT appends the same three rows again, so now the table has six rows, every one duplicated. This is the core problem idempotency solves. A task should produce the same result whether we run it once or ten times.
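We can reproduce the duplication with a small sketch that uses Python's built-in sqlite3 module instead of an Airflow task. The table layout, product names, amounts, and the year 2025 are assumptions for illustration; only the three-rows-then-six-rows behavior comes from the example above.

```python
import sqlite3

# Three hypothetical sales rows for the same logical date.
ROWS = [
    ("2025-03-31", "widget", 10.0),
    ("2025-03-31", "gadget", 25.5),
    ("2025-03-31", "gizmo", 7.25),
]


def naive_load(conn: sqlite3.Connection, rows: list) -> None:
    """Append rows without checking whether the date was already loaded."""
    conn.executemany(
        "INSERT INTO sales (sale_date, product, amount) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def count_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")

naive_load(conn, ROWS)
after_first_run = count_rows(conn)

naive_load(conn, ROWS)  # the re-run for the same date
after_rerun = count_rows(conn)

print(after_first_run, after_rerun)  # 3 6
```

The second call changes the table even though the input is identical, so this load is not idempotent.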

4. The delete-then-insert pattern

The most common pattern is delete-then-insert. Let's walk through the code. First, the SQLExecuteQueryOperator queries the staging table, filtering by the logical date using {{ ds }}. This gives us only the rows for the current run's date. Next, the SQLInsertRowsOperator inserts those rows into the sales table. The key is the preoperator parameter: before inserting, it runs a DELETE that removes any existing rows for that date. If the task runs again, the delete clears the previous result and the insert writes it fresh, so there are no duplicates and the result is the same every time. This diagram shows the flow: delete first, then insert. There are other approaches, such as UPSERT or MERGE, but delete-then-insert is the most common starting point for idempotent pipelines.
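The same pattern can be sketched outside Airflow with sqlite3. This is not the operator code from the slide: the table names, columns, sample rows, and the year 2025 are assumptions, and wrapping the delete and insert in one transaction is a design choice of this sketch rather than something the operators above are stated to do.

```python
import sqlite3


def load_sales_for_date(conn: sqlite3.Connection, ds: str) -> None:
    """Replace the sales rows for one logical date with the staging rows for it."""
    with conn:  # one transaction: commit on success, roll back on error
        # Step 1: delete whatever a previous run wrote for this date.
        conn.execute("DELETE FROM sales WHERE sale_date = ?", (ds,))
        # Step 2: insert the rows for this date fresh from staging.
        conn.execute(
            "INSERT INTO sales (sale_date, product, amount) "
            "SELECT sale_date, product, amount FROM staging_sales "
            "WHERE sale_date = ?",
            (ds,),
        )


def count_for_date(conn: sqlite3.Connection, ds: str) -> int:
    return conn.execute(
        "SELECT COUNT(*) FROM sales WHERE sale_date = ?", (ds,)
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging_sales (sale_date TEXT, product TEXT, amount REAL);
    CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO staging_sales VALUES
        ('2025-03-30', 'widget', 4.0),
        ('2025-03-31', 'widget', 10.0),
        ('2025-03-31', 'gadget', 25.5),
        ('2025-03-31', 'gizmo', 7.25);
    INSERT INTO sales VALUES ('2025-03-30', 'widget', 4.0);
    """
)

counts = []
for _ in range(3):  # run the same load three times for the same date
    load_sales_for_date(conn, "2025-03-31")
    counts.append(count_for_date(conn, "2025-03-31"))

print(counts)  # [3, 3, 3]
print(count_for_date(conn, "2025-03-30"))  # 1
```

Because the DELETE is scoped to the run's date, re-running one date never touches the rows that belong to another.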

5. Backfilling

Idempotent pipelines unlock backfilling. Backfilling means reprocessing a range of historical dates, for example after a bug fix or a schema change. When we trigger a backfill, Airflow creates one Dag run for each logical date that the schedule defines within the range, and each run processes its own date independently. The number of runs therefore depends on the schedule: a daily Dag backfilled over three days creates three runs, while a weekly Dag over the same range creates fewer, because the schedule only calls for one execution per week. Combined with delete-then-insert, backfilling is safe because we can reprocess any date range without corrupting our data.
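The run count can be illustrated with a short date calculation. This is a simplified sketch, not how Airflow computes runs: it anchors the schedule at the start of the range, whereas a real schedule is anchored by its own definition, so a weekly Dag might have one run or none inside a given three-day window. The dates are assumptions.

```python
from datetime import date, timedelta


def logical_dates(start: date, end: date, step: timedelta) -> list[date]:
    """Logical dates from start to end (inclusive), spaced by step."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += step
    return dates


range_start = date(2025, 3, 29)
range_end = date(2025, 3, 31)

daily_runs = logical_dates(range_start, range_end, timedelta(days=1))
weekly_runs = logical_dates(range_start, range_end, timedelta(weeks=1))

print(len(daily_runs), len(weekly_runs))  # 3 1
```

Each date in the list stands for one independent run, which is why an idempotent task can safely be applied to every entry.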

6. Backfill in the UI

We can trigger a backfill directly from the Airflow UI. Click the trigger button, select the Backfill option instead of Single Run, then set From and To dates to define the range. The UI offers several options: we can trigger only Missing Runs, Missing and Errored Runs, or All Runs to reprocess everything. We can also control parallelism with Max Active Runs and change the execution order. Once we click Run Backfill, Airflow creates the runs and starts processing them.

7. Let's practice!

Let's make some pipelines idempotent and try a backfill.
