Data quality checks

1. Data quality checks

Your pipeline loaded the data. But is the data actually correct?

2. Why quality gates matter

Before we dive in, a quick note: partitioning gave us one way to coordinate work across pipelines, but the quality patterns we'll cover next apply to any scheduled Dag, partitioned or not. A pipeline that loads data without checking it is a pipeline you can't trust. One upstream change introduces null values or negative revenue, and every downstream dashboard shows wrong numbers. By the time someone notices, the damage is already done. Quality gates catch these problems at the source, right after the data lands, before it reaches consumers.

3. SQLColumnCheckOperator

The SQLColumnCheckOperator validates individual columns in a table. We import it from the same common SQL provider package as the SQLExecuteQueryOperator we already used to load data. The operator takes a conn_id, a table name, and a column_mapping dictionary. Each key is a column name, and the value defines which checks to run. Here we check that total_revenue has a minimum greater than zero and that total_orders has no null values.

4. Column check types

Let's look at the check types in detail. The min check validates the minimum value in a column. Here, greater_than zero catches any negative or zero revenue. The null_check counts null values. We assert it equals zero, meaning no nulls are allowed in total_orders. The distinct_check counts unique values, useful for verifying a column has the expected variety. Each check uses comparison operators like greater_than, equal_to, or less_than to define what passes and what fails. There are 5 checks and 5 comparators available.

5. SQLTableCheckOperator

While column checks validate individual fields, the SQLTableCheckOperator validates properties of the entire table. It takes a checks dictionary where each key is a check name, and the value contains a check_statement. This can be any SQL expression that evaluates to true or false. Here we verify that the row count is greater than zero, ensuring the table actually has data after our load. You can write any SQL condition as a check statement.

6. Chaining quality checks

In practice, quality checks are chained after the load task. Looking at the snippet, we have three tasks. The load task uses a SQLExecuteQueryOperator that runs our aggregate_sales.sql file to populate the daily_summary table. The check_columns task runs column-level validations on that table. The check_table task runs the table-level validations on top. We then connect them in order: load, then check_columns, then check_table. So the data loads first, column checks validate individual values, and finally, the table check confirms overall integrity. If any check fails, the pipeline stops and downstream consumers never see the bad data. This load-then-validate pattern is the standard approach for quality gates in production pipelines.

7. Column vs table checks

Column checks and table checks answer different questions. Use SQLColumnCheckOperator when you need to validate individual values: are there nulls, are values within range? Use SQLTableCheckOperator for aggregate conditions: does the table have rows, is the total within expected bounds? Most production pipelines combine both for thorough coverage. In the exercises, you'll pick the right operator for different data quality scenarios.

8. Monitoring quality at scale with Astro

In production, you might have hundreds of quality checks across dozens of pipelines. Astronomer's Astro Observe gives you a single place to monitor pipeline health. It tracks data freshness against SLAs, visualizes lineage across Dags and tables, and monitors data quality at scale. When something fails, it traces the impact downstream and uses AI to help identify the root cause.

9. Let's practice!

Let's add quality gates to our pipeline.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.