Table Maintenance for Iceberg - The Basics

1. Table Maintenance for Iceberg - The Basics

Now that you understand how to write data to Iceberg tables and have gotten some hands-on practice, we need to talk about keeping those tables healthy over time. Just like a car needs oil changes and tune-ups, Iceberg tables need regular maintenance to maintain optimal performance. The good news is that Iceberg's maintenance operations are well-defined, relatively straightforward to run, and way cheaper than an oil change.

Iceberg tables generally need two types of maintenance to keep performing well. First, they need to minimize the number of small data and metadata files. And second, they need to keep the size of the metadata.json file from growing without bound. Let's break down each of these problems and how to solve them.

Small files are a common problem in any data lake, but they're especially prevalent with streaming workloads or frequent small commits, like we often see in Iceberg. Remember from our last video that every write operation creates new data files? If you're doing micro-batches every minute, that's 1,440 commits a day, so you might be creating thousands of tiny files per day. This is bad for performance because most query engines don't open multiple data files simultaneously within a single worker task, so every extra file adds open-and-read overhead and creates a bottleneck. More files means more overhead, slower queries, and ballooning metadata. Suddenly, all that time you saved with automatic partitioning and metadata management is gone and replaced with this new problem. There are some forward-looking proposals to reduce the need for maintenance in future versions of Iceberg, but for current users at the time of filming, this is still a real issue.

The solution is a process called compaction, which is implemented in Spark's Iceberg integration as a family of rewrite procedures, including rewrite_data_files and rewrite_manifests. The underlying mechanism is straightforward. Take multiple small files within a partition and rewrite their contents into fewer, larger files.
Then, commit a new snapshot that references the consolidated files and marks the old, small files as deleted. Run regularly, these operations keep your file counts and metadata under control.

Before diving into the compaction procedures themselves, let's talk about a major source of metadata bloat: snapshot accumulation. Every write to an Iceberg table creates a new snapshot, and all of those snapshots are tracked in the metadata.json file. This file contains the complete snapshot history. Yes, every commit, when it happened, which manifest list it points to, and so on. In fact, this history of snapshots is what enables Iceberg time travel. As you can imagine, over time, an unmaintained metadata.json file can grow to megabytes or even tens of megabytes.

Why is this a problem? Because every write operation needs to read the current metadata.json, make changes, and write out a new version. The larger this file gets, the more expensive those operations become. Parsing and serializing a 50-megabyte JSON file adds significant overhead to every commit and scan. There are some optimizations and REST catalog implementations that can reduce the payload size, but the fundamental issue remains that unbounded history growth eventually creates performance problems.

The solution is to expire old snapshots, essentially cutting off the table's history at a certain point in time. You may remember seeing this when we were demonstrating how to remove a staged write-audit-publish snapshot. To expire your old snapshots, you would run a one-line call to the expire snapshots procedure. This operation removes snapshots older than a retention period set by you or by your organization's policies. When you expire snapshots, you're not just cleaning up metadata.json, but also enabling Iceberg to remove any data files that are no longer referenced by any retained snapshot.
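As a rough sketch of the expire-snapshots call described above, the Python helper below builds the SQL statement for the Spark procedure expire_snapshots, using its table, older_than, and retain_last arguments. The catalog name my_catalog, the table name db.events, and the helper function itself are hypothetical and only there for illustration; the seven-day retention is just an example value, not a recommendation from the course.

```python
from datetime import datetime, timedelta


def build_expire_snapshots_sql(catalog, table, now, retention_days=7, retain_last=1):
    """Build a CALL statement for Iceberg's Spark expire_snapshots procedure.

    Hypothetical helper: it only formats the SQL text, it does not run it.
    """
    # Snapshots committed before this cutoff become eligible for expiration.
    cutoff = now - timedelta(days=retention_days)
    cutoff_literal = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff_literal}', "
        f"retain_last => {retain_last})"
    )


# Assumed names: catalog "my_catalog" and table "db.events" are placeholders.
statement = build_expire_snapshots_sql(
    catalog="my_catalog",
    table="db.events",
    now=datetime(2024, 6, 15, 12, 0, 0),
    retention_days=7,
)
print(statement)

# With a SparkSession already configured for an Iceberg catalog,
# the statement would be executed like this:
# spark.sql(statement)
```

The retain_last argument keeps at least that many of the most recent snapshots even if they are older than the cutoff, which is a useful safety net for tables that are written to only rarely.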
Expiring snapshots can also help you stay compliant with requirements around sensitive data that cannot be kept for extended periods of time.

There's an important tradeoff here. Expiring snapshots prevents time travel queries to before the expiration point. If you expire everything older than seven days, you can't run an as-of-timestamp query against a point 10 days ago. This also affects long-running queries. For example, expiring all but the last hour of snapshots could break any query that takes longer than an hour to complete, because the snapshot it is reading may be expired underneath it. You need to balance your desire for a clean, compact metadata.json against your needs for historical queryability. Many organizations keep 7 to 30 days of history for operational time travel and rely on separate archival processes for longer-term retention.

One subtle but important point: the expire snapshots procedure is what actually deletes old data files from object storage. When you rewrite files during compaction or delete data, those old files stick around until they're no longer referenced by any snapshot. Expiring snapshots is what explicitly garbage collects those old files. So if you notice your storage costs aren't decreasing after deleting data or running compaction, it's probably because you haven't expired the snapshots that reference those old files.

In the next video, we'll look at the other critical maintenance operations: manifest compaction, data file compaction, and handling orphaned files.
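To make the garbage-collection point concrete, here is a small, simplified model in Python. It is not Iceberg's implementation: the Snapshot class, the expire_snapshots function, and the file names are all made up for illustration, and it ignores manifests, branches, and tags. It only shows the rule described above, that a data file becomes deletable once no retained snapshot references it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: datetime
    data_files: frozenset  # every data file visible in this snapshot


def expire_snapshots(snapshots, older_than, retain_last=1):
    """Return (retained snapshots, data files that are now safe to delete)."""
    if retain_last < 1:
        raise ValueError("retain_last must be at least 1")
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    # The newest `retain_last` snapshots are kept even if they are old.
    protected = {s.snapshot_id for s in ordered[-retain_last:]}
    retained = [
        s for s in ordered
        if s.committed_at >= older_than or s.snapshot_id in protected
    ]
    retained_ids = {s.snapshot_id for s in retained}
    expired = [s for s in ordered if s.snapshot_id not in retained_ids]
    still_referenced = set().union(*(s.data_files for s in retained))
    expired_files = set().union(*(s.data_files for s in expired))
    # Only files that no retained snapshot can see may be deleted.
    return retained, expired_files - still_referenced


now = datetime(2024, 6, 15)
history = [
    Snapshot(1, datetime(2024, 6, 1), frozenset({"a.parquet", "b.parquet"})),
    Snapshot(2, datetime(2024, 6, 5), frozenset({"a.parquet", "b.parquet", "c.parquet"})),
    # Compaction rewrote the three small files into a single larger one.
    Snapshot(3, datetime(2024, 6, 12), frozenset({"abc-compacted.parquet"})),
]

# A cutoff older than every snapshot expires nothing, so nothing is deletable:
# all four files stay in storage even though compaction has already run.
_, deletable_before = expire_snapshots(history, older_than=datetime(2024, 5, 1))

# Expiring everything older than seven days drops snapshots 1 and 2,
# and only then do the three small files become deletable.
retained, deletable_after = expire_snapshots(history, older_than=now - timedelta(days=7))
print([s.snapshot_id for s in retained], sorted(deletable_after))
```

In this toy history, the small files only become deletable after the second call, which mirrors why storage costs stay flat until the snapshots that reference the old files have been expired.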

2. Let's practice!
