Robust Dags with failure handling and retries

1. Robust Dags with failure handling and retries

Our pipeline ran perfectly for three weeks. Then one morning, an API we depend on returns an error, and the task fails.

2. Why tasks fail

In production, tasks might fail for reasons unrelated to our code. The task connects to external systems, and each one can fail in its own way. APIs time out or hit rate limits. Network connections drop briefly, or domain name resolution delays prevent reaching the server. Databases and shared storage slow down when multiple tasks compete for the same resource. The question isn't whether tasks will fail, but how our Dag handles it when they do.

3. Retries

The first line of defense is retries. Here we have a task called fetch_weather that calls a weather API. The key is in the decorator: retries=3 means Airflow will try this task up to three additional times if it fails, and retry_delay sets a two-minute wait between each attempt. If the API returns a 503 error indicating the server is temporarily unavailable, Airflow waits 2 minutes and tries again. Transient failures are temporary problems that resolve on their own, like a momentary network timeout or a busy server. For these kinds of issues, retries are often all we need.

4. Exponential backoff

But fixed delays can be too aggressive. If an API is recovering from an outage, retrying every two minutes can make things worse. Adding retry_exponential_backoff=2 doubles the delay after each attempt. So the first retry waits 2 minutes, the second waits 4, and the third waits 8. This gives the external service time to recover instead of piling on more requests.

5. on_failure_callback

When retries are exhausted, and the task still fails, we want to know about it. Here we define a callback function called alert_on_failure. It receives a context dictionary with everything we need. context["dag"] gives us the Dag object with its dag_id, context["ti"] gives us the task instance with its task_id, and more. We attach it to the task using the on_failure_callback parameter. When the task fails after all retries, Airflow calls this function automatically. In a real pipeline, we'd route this to Slack, PagerDuty, or any other alerting tool.

6. Callbacks at Dag vs task level

We can set callbacks at two levels. At the Dag level, on_failure_callback on the @dag decorator fires when any task in the Dag fails. This works as a catch-all alert to the team channel. At the task level, the on_failure_callback on the @task decorator fires only when that specific task fails, overriding the Dag-level one. Most teams use both: a Dag-level callback for general awareness and a task-level callback for critical tasks that need immediate on-call attention.

7. max_consecutive_failed_dag_runs

What about a Dag that keeps failing run after run? Each failure triggers an alert, and suddenly our on-call channel is a wall of noise. The max_consecutive_failed_dag_runs parameter solves this. We set it to 3 on the @dag decorator. As the diagram shows, after three consecutive failed runs, Airflow automatically pauses the Dag. No more alerts for a known-broken pipeline. The Dag stays paused until someone investigates the root cause and manually unpauses it.

8. Putting it all together

Here is the full picture. Let's walk through it top to bottom. The @dag decorator has max_consecutive_failed_dag_runs=3 to auto-pause after persistent failures, and on_failure_callback=alert_team to notify the team channel on any failure. The @task decorator has retries=3 with exponential backoff to handle transient API issues, and its own on_failure_callback=page_oncall to escalate to on-call when retries are exhausted. These three layers work together: retries handle the common case, callbacks alert the right people, and auto-pause prevents alert fatigue.

9. Let's practice!

Time to configure your own failure handling.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.