Schema Evolution for Iceberg Tables

1. Schema Evolution for Iceberg Tables

Now that you've had some hands-on practice with branching and WAP, let's turn our attention to another critical aspect of working with production data: schema evolution. As business requirements change, your tables need to change with them, and Iceberg is exceptionally well-suited to handle whatever new form your data needs to take. The structure of an Iceberg table can evolve in two major dimensions: the schema itself, which includes the columns, their types, and constraints, and the partitioning specification that determines how data is physically organized.

Let's start with schema changes, since these are by far the most common day-to-day modifications you'll make. In Iceberg, columns can be added, removed, or renamed whenever you need. We saw this in our earlier example using update to add formulaic data. You can mark columns as optional or required, creating contracts that any engine writing to the table must uphold. These aren't just suggestions; they're enforced constraints that ensure data quality and consistency across all the different tools and systems that might interact with your table. This is important because, remember, Iceberg is an open format and can interact with an extremely wide variety of tools.

The syntax for schema changes is straightforward. We add a column like this, we rename a column like this, and we drop a column like this.

Here's what makes Iceberg schema changes special: changing the schema is purely a metadata operation. It happens almost instantly, regardless of how much data is in your table. You're not rewriting terabytes of Parquet files when you add a column; you're just writing a new metadata.json file that records the updated schema. This means schema changes are less disruptive for your organization: there's no need to process them overnight or to lock others out of your table for hours. But the real magic is in how Iceberg tracks columns internally.
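
Before we get to that tracking mechanism, here is a minimal sketch of the metadata-only idea in plain Python. It is a toy model, not Iceberg's API or file layout: `ToyMetadata`, its fields, the column names, and the file paths are all made up for illustration. The point it shows is that adding, renaming, and dropping a column each produce a new metadata version while the list of data files stays exactly the same.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyMetadata:
    """Toy stand-in for a table metadata file; not the real Iceberg layout."""
    version: int
    columns: Tuple[str, ...]
    data_files: Tuple[str, ...]  # paths only; the files are never opened here


def add_column(meta: ToyMetadata, name: str) -> ToyMetadata:
    # A schema change writes a new metadata version and leaves data_files alone.
    return replace(meta, version=meta.version + 1, columns=meta.columns + (name,))


def rename_column(meta: ToyMetadata, old: str, new: str) -> ToyMetadata:
    renamed = tuple(new if col == old else col for col in meta.columns)
    return replace(meta, version=meta.version + 1, columns=renamed)


def drop_column(meta: ToyMetadata, name: str) -> ToyMetadata:
    kept = tuple(col for col in meta.columns if col != name)
    return replace(meta, version=meta.version + 1, columns=kept)


# Hypothetical table with two data files already written.
v1 = ToyMetadata(1, ("order_id", "amount", "notes"),
                 ("data/00001.parquet", "data/00002.parquet"))
v2 = add_column(v1, "discount_applied")
v3 = rename_column(v2, "amount", "total_amount")
v4 = drop_column(v3, "notes")

print(v4.columns)                      # ('order_id', 'total_amount', 'discount_applied')
print(v4.data_files == v1.data_files)  # True: no data file was rewritten
```

Notice that this toy never reads the data files, so it says nothing about how old files stay readable after a rename. That is exactly the problem field IDs solve.
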
When you add a new column, Iceberg creates a metadata entry that ties a specific field ID to that column. The field ID is an internal identifier that's separate from the column name. In the actual data files, columns are referenced by these field IDs, not by names. Why does this matter? Because it means Iceberg can unambiguously identify which column is which, even if a name changes. Field IDs are always incremented and never reused, which has a profound implication: it's impossible to accidentally resurrect deleted data, even if you drop a column and later add a new column with the exact same name.

Let me illustrate this with an example. Say you have a column called customer_id with field ID 5. You decide you don't need it anymore and drop the column. All the existing data files still physically contain that data. Remember, we didn't rewrite the files. But the Iceberg schema no longer maps field ID 5 to anything, so it's invisible to queries. Now, months later, you realize you need a customer_id column again, so you add it back. The new column gets field ID 23, because field IDs always increment. When Iceberg reads old data files, it sees field ID 5 in there, looks at the current schema, and finds no mapping for field ID 5, so that data remains invisible. The new customer_id column with field ID 23 will only have data in files written after you added it back, and so that is what can be queried.

Compare this to purely name-based systems, like some older Hive configurations, where dropping and re-adding a column with the same name would resurrect all the old deleted data. That's a recipe for data quality disasters, and Iceberg avoids it entirely.

Now, there's a caveat here for data files not originally written by Iceberg. Say you use the snapshot procedure we discussed earlier on plain Parquet files. Those files don't have Iceberg field IDs embedded in them.
In this case, Iceberg uses name-based matching to assign pseudo field IDs when it first encounters the files, but any new data written by Iceberg will always include proper field IDs.

When you add new columns to a table with existing data, Iceberg gives you a powerful tool to handle the mismatch between old files that don't have the column and new files that do: default values. Iceberg actually supports two different types of defaults that serve different purposes.

Let's take a look at the first type, initial defaults. An initial default defines what readers should return for the new column when reading old data files that were written before the column existed. For example, if you add a Boolean discount_applied column to your sales table, you might set the initial default to false so that all the historical records without this field are interpreted as not having a discount. The initial default can only be set once, when you first add the column, and it applies retroactively to all existing data. Remember, if you set it incorrectly, you can drop the column, but re-adding that column will create a new, distinct column holding none of the previous data.

The second type is the write default. This tells Iceberg writers what value to use for the new column if the incoming data doesn't explicitly provide one. This is useful for ETL pipelines or applications that might not immediately be updated to include the new column. Unlike the initial default, the write default can be changed whenever you like as requirements evolve. In short, initial defaults fill in the value for rows that already exist, and write defaults fill in the value for future rows that don't supply one.

So, to review: schema evolution in Iceberg lets you add, remove, or rename columns instantly. It happens entirely in the metadata, with no data needing to be rewritten. And it does all this safely by using field IDs to prevent resurrecting old data.
Together, these features let you rapidly evolve your schema and backfill constant values for new columns without rewriting data or breaking existing pipelines. It's an elegant solution to a problem that plagues many data systems and helps ensure your systems keep running and your coworkers can keep working. In the next video, we'll explore partition evolution and see how Iceberg lets you change how your data is physically organized over time.
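
To make the review concrete, here is a small plain-Python sketch of field IDs and the two kinds of defaults working together. It is a toy model under stated assumptions, not Iceberg's implementation or API: `ToyTable`, `Column`, and the `channel` column are hypothetical names, and the field IDs come out as 2 and 3 rather than the 5 and 23 used in the narration. Rows in each "file" are stored keyed by field ID, and reads resolve columns through the current schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Column:
    field_id: int
    name: str
    initial_default: Any = None  # returned for files written before the column existed
    write_default: Any = None    # used by writers when a row omits the column


@dataclass
class ToyTable:
    columns: Dict[int, Column] = field(default_factory=dict)  # current schema, by field ID
    last_field_id: int = 0
    files: List[List[Dict[int, Any]]] = field(default_factory=list)  # rows keyed by field ID

    def add_column(self, name: str, initial_default: Any = None,
                   write_default: Any = None) -> int:
        self.last_field_id += 1  # IDs only go up and are never reused
        fid = self.last_field_id
        self.columns[fid] = Column(fid, name, initial_default, write_default)
        return fid

    def drop_column(self, name: str) -> None:
        fid = next(c.field_id for c in self.columns.values() if c.name == name)
        del self.columns[fid]  # metadata only: existing files keep their values

    def write_file(self, rows: List[Dict[str, Any]]) -> None:
        # Writers fill in the write default for any column the row omits.
        self.files.append([
            {c.field_id: row.get(c.name, c.write_default) for c in self.columns.values()}
            for row in rows
        ])

    def scan(self) -> List[Dict[str, Any]]:
        # Readers look up each current field ID; if a file lacks it,
        # they return the column's initial default instead.
        return [
            {c.name: row.get(c.field_id, c.initial_default) for c in self.columns.values()}
            for data_file in self.files
            for row in data_file
        ]


table = ToyTable()
table.add_column("order_id")     # field ID 1
table.add_column("customer_id")  # field ID 2
table.write_file([{"order_id": 1, "customer_id": "C-17"}])

table.drop_column("customer_id")
new_id = table.add_column("customer_id")  # field ID 3, not 2
table.add_column("discount_applied", initial_default=False, write_default=False)
table.add_column("channel", initial_default="unknown", write_default="web")  # hypothetical column
table.write_file([{"order_id": 2, "customer_id": "C-42"}])

rows = table.scan()
print(rows[0])  # old file: customer_id is None, defaults come from initial_default
print(rows[1])  # new file: customer_id is 'C-42', channel comes from write_default
```

In this sketch the old value "C-17" is still sitting in the first file under field ID 2, but nothing in the current schema points at that ID, so the re-added customer_id reads as empty for that row. The `channel` column shows the split between the two defaults: the old row reads "unknown" (initial default), while the new row was written with "web" (write default).
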

2. Let's practice!
