Git-like features with Write-Audit-Publish and Branching and Tagging

1. Git-like features with Write-Audit-Publish and Branching and Tagging

We've just seen how flexible and powerful Iceberg is when it comes to ingesting your data. This implies that your Iceberg tables will likely change over time, and therefore we need a way to track those changes. Fortunately, Apache Iceberg handles table evolution and experimentation for you. For those of you who are familiar with Git, Iceberg tables can behave in a similar way to a Git repository: you can test changes before making them globally visible, branch your table to experiment with different partitioning approaches, and even roll back mistakes. These features transform how you work with production data, making it safer and easier to iterate on pipelines and transformations.

Let's start with the write-audit-publish concept, commonly called WAP. This is a standard data engineering pattern that Iceberg supports natively. The idea is simple but powerful. When you write a commit to the table, it doesn't have to be immediately visible to all readers. Instead, the commit exists in the table's history, but only readers who explicitly opt in can see it. This gives you the opportunity to audit the new data by verifying it meets business requirements, checking data quality, and running validation queries before publishing it to all consumers. Think of it like a staging area where changes are committed but not yet deployed to production, which protects you from fielding questions about data that hasn't been fully confirmed as correct.

Let's see this in action with a practical example using our New York City taxi dataset. In this example, pretend we realize that the fare rate is double what it should have been. To fix this, we want to add a new calculated field to the dataset. The data already has trip costs, but we want to add the new correct rate as well. We'll use an update command to backfill the calculation across all existing records. Before we do that, we need to add the column to our schema, since update doesn't support adding new columns.
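
To make the walkthrough concrete, here is a minimal sketch of the whole WAP sequence for this example, written as Python helpers that build Spark SQL strings. The catalog `demo`, the table `nyc.taxis`, the columns `fare_amount` and `corrected_fare`, and the WAP ID `fare_backfill` are all assumptions made up for illustration, not names from the course. The `write.wap.enabled` table property, the `spark.wap.id` session setting, the `snapshots` metadata table, and the `cherrypick_snapshot` procedure come from Iceberg's Spark integration, but the exact commands used in the lesson may differ.

```python
# Hypothetical names: catalog "demo", table "nyc.taxis", columns
# "fare_amount" and "corrected_fare", and WAP ID "fare_backfill".
CATALOG = "demo"
TABLE = "nyc.taxis"
FULL_NAME = f"{CATALOG}.{TABLE}"
WAP_ID = "fare_backfill"


def stage_statements():
    """SQL for the write step: add the column, enable WAP, stage the update."""
    return [
        # UPDATE cannot create columns, so the schema change comes first.
        f"ALTER TABLE {FULL_NAME} ADD COLUMNS (corrected_fare double)",
        # Allow staged (unpublished) commits on this table.
        f"ALTER TABLE {FULL_NAME} SET TBLPROPERTIES ('write.wap.enabled' = 'true')",
        # Writes in this session are now staged under the WAP ID.
        f"SET spark.wap.id = {WAP_ID}",
        # Assumed fix: the recorded fare is double the correct one.
        f"UPDATE {FULL_NAME} SET corrected_fare = fare_amount / 2",
    ]


def find_staged_snapshot_sql():
    """Look up the staged snapshot by the WAP ID stored in its summary."""
    return (
        f"SELECT snapshot_id FROM {FULL_NAME}.snapshots "
        f"WHERE summary['wap.id'] = '{WAP_ID}'"
    )


def audit_sql(snapshot_id):
    """Count rows in the staged snapshot that still lack a corrected fare."""
    return (
        f"SELECT count(*) AS missing FROM {FULL_NAME} "
        f"VERSION AS OF {snapshot_id} WHERE corrected_fare IS NULL"
    )


def publish_sql(snapshot_id):
    """Make the staged snapshot the current table state."""
    return f"CALL {CATALOG}.system.cherrypick_snapshot('{TABLE}', {snapshot_id})"


def run_wap(spark):
    """Drive the flow on a SparkSession that already has Iceberg configured."""
    for statement in stage_statements():
        spark.sql(statement)
    snapshot_id = spark.sql(find_staged_snapshot_sql()).first()["snapshot_id"]
    missing = spark.sql(audit_sql(snapshot_id)).first()["missing"]
    if missing == 0:
        # Publish only when the audit finds no gaps.
        spark.sql(publish_sql(snapshot_id))
    return snapshot_id, missing
```

Calling `run_wap(spark)` on a session with the Iceberg SQL extensions enabled would stage, audit, and publish in one pass. If the audit finds missing values, the sketch leaves the snapshot unpublished, so normal reads keep seeing the old table state.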
Now we can execute our update operation with a WAP ID configured. This will create a new snapshot with our new column fully populated. At this point, something interesting has happened. The data has been written to the table. New data files exist. Manifests have been created. There's a new snapshot in the table history. But if you query the table normally, you won't see these changes. Why is this? It's because the new snapshot isn't published yet, thanks to our WAP flow. To audit it, we need to explicitly read from the WAP snapshot.

Now we can validate our work. Does every row have a corrected fare? Are the values correct? Did we handle nulls properly? This is your chance to catch problems before they affect downstream consumers. If something looks wrong, you can simply drop the WAP commit, and Iceberg will automatically clean up all those data files as if the operation never happened. If everything checks out, we publish the snapshot, making it visible to all readers. This pattern is invaluable for production systems where data quality matters.

WAP works great for single commits, but sometimes you need to test a series of changes together. Perhaps you're creating a multi-step transformation pipeline or a complex schema evolution. For these scenarios, Iceberg supports branching and tagging. And yes, it works just like Git. Branching allows you to fork the table's history. You create a branch, and now you have two completely valid Iceberg histories that can evolve independently. Why would you want to do that? Simply put, branching means you can write, modify, and test new pipelines without affecting the main table that production workloads are reading from. When you're satisfied with your changes, you can merge that branch back into main, or if the experiment didn't work out, you can simply delete that branch. So how does this work in practice?
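
As a rough sketch of that branch life cycle in Spark SQL, the Python helpers below build the statements for creating a branch, writing to it, reading from it, and then either merging or deleting it. The catalog `demo`, the table `nyc.taxis`, the branch name `feb_load`, and the staging table `demo.nyc.taxis_feb_staging` are hypothetical. The `CREATE BRANCH` and `DROP BRANCH` statements, the `branch_<name>` write target, and the `fast_forward` procedure are part of Iceberg's Spark support. A fast-forward is only one way to merge, and as in Git it is expected to work only when main has no commits of its own since the branch was created.

```python
# Hypothetical names throughout: catalog "demo", table "nyc.taxis",
# branch "feb_load", and a staging table holding the extra month of trips.
CATALOG = "demo"
TABLE = "nyc.taxis"
FULL_NAME = f"{CATALOG}.{TABLE}"
BRANCH = "feb_load"
STAGING = "demo.nyc.taxis_feb_staging"


def create_branch_sql():
    # Fork the table history at the current snapshot.
    return f"ALTER TABLE {FULL_NAME} CREATE BRANCH `{BRANCH}`"


def load_branch_sql():
    # The "branch_<name>" suffix sends the write to the branch, not to main.
    return f"INSERT INTO {FULL_NAME}.branch_{BRANCH} SELECT * FROM {STAGING}"


def read_branch_sql():
    # Count rows as seen from the branch; main is not affected by the load.
    return f"SELECT count(*) AS trips FROM {FULL_NAME} VERSION AS OF '{BRANCH}'"


def merge_sql():
    # Move main forward to the head of the branch.
    return f"CALL {CATALOG}.system.fast_forward('{TABLE}', 'main', '{BRANCH}')"


def drop_branch_sql():
    # Abandon the experiment by removing the branch reference.
    return f"ALTER TABLE {FULL_NAME} DROP BRANCH `{BRANCH}`"


def branch_workflow(keep):
    """Return the statements in order; merge if keep is True, else drop."""
    steps = [create_branch_sql(), load_branch_sql(), read_branch_sql()]
    steps.append(merge_sql() if keep else drop_branch_sql())
    return steps


if __name__ == "__main__":
    for statement in branch_workflow(keep=True):
        print(statement)
```

Each string could be passed to `spark.sql(...)` on a session configured with the Iceberg SQL extensions. Keeping the statements as plain strings makes the order of operations easy to review before anything touches a real table.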
You may be familiar with the Git way of branching, and the Iceberg command to create a branch is similar. Once it runs, we're working on a new branch. Let's add another month of taxi data to this branch only. We can query this branch to verify the data looks correct, run our quality checks, or maybe even point a BI tool at it temporarily to see how the dashboards render. Meanwhile, the main branch continues serving production traffic completely unaffected. Until we merge back into the main branch, we are free to make additional changes or fixes on our branch without touching main. In that way, a branch is like your own Iceberg table playground.

Once we're happy with all of our changes, we'll merge the branch back to main. All those commits, the data additions and the modifications, become part of the main table's history. It's worth noting that branches can live indefinitely. You don't have to merge them back if you have a purpose for leaving them hanging. Some teams maintain long-lived branches for different environments, like a staging branch that mirrors production structure but with test data, or a QA branch used for longer-term testing of specific functionality or features.

Iceberg also supports tagging, which, like Git, lets you give specific snapshots memorable names. Instead of referring to a snapshot by its ID, which can be difficult to remember, you can tag it with something meaningful, like end-of-quarter or before-schema-migration. Tags are particularly useful for marking checkpoints in streaming jobs, creating reproducible reference points for auditing, or marking important milestones in your table's history. Creating a tag for a snapshot takes a single command.

Now, even with all these safeguards, mistakes happen. Maybe you merged a branch too early and discovered a critical bug. Maybe a pipeline ran with incorrect logic and corrupted data. Iceberg has you covered with rollback functionality.
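
Tagging and rollback both act on snapshot IDs, so they can be sketched together. The helpers below build Spark SQL for listing snapshots, tagging one, reading from the tag, and rolling the table back. The catalog `demo`, the table `nyc.taxis`, the tag name `end_of_quarter`, the snapshot ID `1234`, and the 365-day retention are placeholders chosen for illustration. The `CREATE TAG` statement, the `snapshots` metadata table, and the `rollback_to_snapshot` procedure are part of Iceberg's Spark support, though the lesson's own commands may look different.

```python
# Hypothetical names: catalog "demo" and table "nyc.taxis". Snapshot IDs
# and tag names passed to these helpers are placeholders as well.
CATALOG = "demo"
TABLE = "nyc.taxis"
FULL_NAME = f"{CATALOG}.{TABLE}"


def list_snapshots_sql():
    # The snapshots metadata table shows which snapshot IDs exist.
    return (
        f"SELECT committed_at, snapshot_id, operation "
        f"FROM {FULL_NAME}.snapshots ORDER BY committed_at"
    )


def create_tag_sql(tag, snapshot_id, retain_days=365):
    # Attach a memorable name to one snapshot and keep it for a set period.
    return (
        f"ALTER TABLE {FULL_NAME} CREATE TAG `{tag}` "
        f"AS OF VERSION {int(snapshot_id)} RETAIN {int(retain_days)} DAYS"
    )


def read_tag_sql(tag):
    # Query the table as it looked at the tagged snapshot.
    return f"SELECT * FROM {FULL_NAME} VERSION AS OF '{tag}'"


def rollback_sql(snapshot_id):
    # Make an earlier snapshot the current state of the table.
    return (
        f"CALL {CATALOG}.system.rollback_to_snapshot("
        f"'{TABLE}', {int(snapshot_id)})"
    )


if __name__ == "__main__":
    print(list_snapshots_sql())
    print(create_tag_sql("end_of_quarter", 1234))
    print(read_tag_sql("end_of_quarter"))
    print(rollback_sql(1234))
```

In practice you would look up the real snapshot ID with the listing query first, then pass it to the tag or rollback helper.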
You can revert your table to any previous snapshot, setting it as the current snapshot in the table. Rollback is your safety net. It's an essential tool for disaster recovery, and you'll almost certainly use it at some point. If you find yourself needing to do a rollback, use Iceberg's rollback procedure, passing it the snapshot you want to return to.

Now let's zoom back out and recap what you learned. The main point I want you to understand is that when you put together WAP for single-commit validation, branching for multi-step experimentation, tagging for marking important points, and rollback for recovery, you get a robust framework for safely working with production data. You can test pipeline changes against real production tables without risk. You can validate data quality before exposing it to consumers. This fundamentally changes how you approach data engineering, making production systems much more approachable and reducing the fear of making changes.

In the exercises, you'll practice creating branches, using WAP to validate merges, and recovering from mistakes with rollback. These aren't advanced features you'll rarely use. Rather, they're fundamental tools that can be part of your everyday workflow with Iceberg. Let's get hands-on with them.

2. Let's practice!
