Organizing complex Dags with Task Groups

1. Organizing complex Dags with Task Groups

A Dag with five tasks is easy to read. A Dag with fifty tasks is a wall of nodes. Task groups let us bring that under control.

2. The complexity problem

As pipelines grow, the Graph view becomes overwhelming. We end up with dozens of tasks sprawling across the screen. The names blur together, the dependencies tangle, and it takes real effort to understand what the pipeline does. New team members stare at it, not knowing where to start. Task groups solve this by letting us organize tasks into collapsible visual blocks.

3. @task_group

Let's look at this example. We import task_group from airflow.sdk alongside dag and task. The @task_group decorator wraps a function that contains related tasks. The group_id defaults to the function name, but can be overridden by passing group_id to the decorator. The default_args parameter applies shared configuration to every task inside the group. Setting retries to 3 means both extract_orders and transform_orders inherit that retry policy without repeated code. Inside the function, we define tasks and dependencies as usual. We call the group function just like we would call a task, and in the Graph view, it appears as a single expandable block.

4. How it looks in the Airflow UI

The difference is almost entirely visual. On the left, without task groups, all tasks sit at the same level in the Graph view. On the right, with task groups, the tasks are wrapped in collapsible blocks. We can click to expand a block and see the individual tasks inside, then click again to collapse it. The scheduler still sees individual tasks with individual dependencies. The one change outside the UI is that task IDs get a group prefix, in the form group_id.task_id, which keeps them unique and traceable in logs.
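To make the prefixing concrete, here is a plain-Python sketch of the naming rule. It is a model for illustration, not Airflow's own code, and it assumes the default behaviour where enclosing group IDs are joined to the task ID with dots. The helper prefixed_task_id and the group name process_orders are hypothetical.

```python
def prefixed_task_id(task_id, group_path=()):
    """Return the ID a task would carry inside the given chain of groups.

    group_path lists the enclosing group IDs from outermost to innermost.
    """
    return ".".join([*group_path, task_id])


print(prefixed_task_id("extract_orders"))
print(prefixed_task_id("extract_orders", ["process_orders"]))
```

The first call prints extract_orders and the second prints process_orders.extract_orders, which is the ID we would search for in the logs.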

5. Nesting and custom display names

Groups can nest, and each level can have its own display name. The outer group uses group_display_name="Process All", which sets a human-readable label in the Airflow UI, independent of the Python function name process_all. Inside, we have two nested groups: Ingest Orders for the order processing stream and Process Returns for the returns stream.

6. The factory pattern

Because @task_group functions are just Python functions, we can call them multiple times with different parameters. Here, process_source defines a generic extract-transform pattern that takes a source name and a file path. We call it three times: once for orders, once for returns, and once for events. Each call creates a separate group in the Graph view with its own task instances. This is the factory pattern, a clean way to reuse pipeline logic without duplicating code.

7. Guidelines for grouping

Here are a few guidelines to keep groups useful. First, group by domain or concern, not by operator type. Put all order-related tasks together, not all Python tasks together. Second, use group_id for clear programmatic names that appear in task IDs and logs, and group_display_name for human-readable labels in the UI. Third, use default_args to share configuration like retries across all tasks in a group, keeping it in one place. And finally, apply Miller's Law: George Miller's classic finding is that people can hold about seven items, plus or minus two, in working memory at once. Aim for five to seven top-level items in the Graph view. If there are more, that's a signal to introduce a task group.
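The last guideline can be turned into a quick check. The sketch below is plain Python, not an Airflow feature. It assumes that dots in a task ID appear only as group separators, so everything before the first dot is the outermost group. The helper names and the sample task IDs are hypothetical.

```python
def top_level_items(task_ids):
    """Return the distinct top-level nodes implied by dotted task IDs."""
    # Everything before the first dot is the outermost group; an ID
    # with no dot is a top-level task.
    return sorted({task_id.split(".", 1)[0] for task_id in task_ids})


def exceeds_guideline(task_ids, limit=7):
    """True when there are more than `limit` top-level items."""
    return len(top_level_items(task_ids)) > limit


task_ids = [
    "start",
    "process_orders.extract_orders",
    "process_orders.transform_orders",
    "process_returns.extract_returns",
    "notify",
]
print(top_level_items(task_ids))
print(exceeds_guideline(task_ids))
```

Here five tasks collapse into four top-level items (notify, process_orders, process_returns, start), so the check prints False.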

8. Let's practice!

Time to organize some Dags.
