Ingesting Data into Apache Iceberg

1. Ingesting Data into Apache Iceberg

You know how to build iceberg tables and evolve them safely, but now comes the hard part. We have to keep them performing in production. How do you handle streaming data arriving every second alongside batch jobs running hourly? What happens when multiple writers try to update the same table simultaneously? Why are some of your queries blazing fast while others crawl, even though they're hitting the same table? Performance isn't just about how you model your tables. It's about how you write to them, how you maintain them, and how you handle the reality of concurrent workloads at scale. In this final module, we're going to cover the practical engineering of production iceberg systems, covering the topics of ingestion patterns, concurrency management, maintenance operations, and write optimizations. Buckle up, because we're diving deep into the icy waters. In this first lesson of your final module, we're going to explore the mechanics of writing to iceberg tables in production. We'll look at how different write patterns, including inserts, updates, deletes, and merges actually work under the hood. We'll discuss the performance trade-offs between merge-on-read and copy-on-write strategies, and we'll cover how iceberg manages concurrency and avoids conflicts when multiple writers are hitting the same table at the same time. By the end of this lesson, you'll understand not just how to write data to iceberg, but how to do it efficiently and safely at scale. Let's dive in. Before we get into the specifics, I need to give you an important caveat. Most of what we're going to discuss, particularly around isolation levels, serializability, and conflict detection, is engine-specific. The examples and behavior I'll show are based on how Apache Spark handles these scenarios, but the general principles apply across other engines, like Trino, Flink, and Presto. And you are, of course, free to experiment with those on your own. Each engine makes slightly different decisions about when commits conflict and how to resolve them. If you're not familiar with terms like isolation or serializability, don't worry. It's not critical to understanding the mechanics of what's happening. Just know that different engines may behave slightly differently in edge cases. Now let's talk about the two primary modes of data ingestion, streaming and batch. These represent fundamentally different patterns for how data flows into your tables. Streaming ingestion comes from sources that are constantly producing new data. It's like a rushing river that's constantly turning a wheel and generating electricity, except that the river is your data and the wheel is your table updates. The things like Kafka topics, change data capture from databases, IoT sensor feeds, or application logs are all likely to come to you via a streaming data pattern. Streaming jobs run continuously, processing small batches of records and committing them frequently. They typically use checkpointing for recovery, maintaining state about what data has been processed so they can resume after failures without duplicating or losing data. Batch ingestion, on the other hand, focuses on periodic scheduled jobs with generally larger chunks of data being processed at once. In our river example, this would be like taking a giant bucket of water and periodically dumping it on our wheel to turn it. Examples of batch processing include loading daily extracts from a data warehouse, processing files that land in an S3 bucket once an hour, or running nightly aggregations. These jobs have a clear start and end. Process a defined data set and then terminate. Here's what's interesting. Regardless of whether the data arrives through streaming or batch ingestion, to Iceberg they end up looking exactly the same. Every write operation creates a new snapshot, which means new manifest files pointing to new data files. A streaming job that commits every minute creates a snapshot per minute. A batch job that runs daily creates one snapshot per day. In both cases, you will need maintenance jobs to clean up old metadata, although streaming will obviously require more frequent maintenance than batch jobs. But the underlying structure, aka the metadata hierarchy we discussed, is identical. This uniformity is one of Iceberg's strengths. The same table can serve both streaming and batch workloads simultaneously and without a conflict. Let's start with the simplest write operations, inserts, deletes, and overwrites. These are likely fundamentals you are used to working with, and they serve as the foundation for working with Iceberg tables. An insert operation adds new rows to the table. To insert a row or rows, you execute this command. Under the hood, this writes new data files to the slash data directory and creates a new manifest that references these files. The new manifest is added to a manifest list, which becomes a new snapshot. The metadata.json is updated to point to this latest snapshot. All of this happens automatically and has a built-in failsafe. The entire operation either succeeds and the new snapshot becomes visible with all metadata updated, or it fails and nothing changes. There's no partial state where some files are visible but others aren't. This protects Iceberg tables from the problem of tracking which pieces of data were inserted and which weren't. Performance-wise, inserts are generally the fastest write operation because they're purely additive. No existing data files need to be read or rewritten. The only overhead is writing the new data and updating the metadata. A delete operation with a partition or file-level predicate can also be very efficient. To perform this operation, you execute code like this. If your delete predicate aligns with partition boundaries, such as deleting all the data from a specific day, Iceberg can simply mark those entire data files as deleted in the metadata without rewriting anything. The files still exist in object storage, but the new snapshot manifests don't reference them, so they're invisible to queries and they'll be cleaned up later. A similar optimization can take place if a file can be completely excluded based on column metrics. For example, if there's only a single file with a column x is less than 5, and you delete all records with x is less than 5, that file can be removed at the metadata level. An overwrite operation replaces data matching a predicate with new data. This can be achieved using code that looks like this. This is essentially a delete followed by an insert, and it's atomic. Readers never see the table in the intermediate state where the old data is gone but the new data isn't there yet, thanks to how the metadata process works. Now let's talk about row-level operations of updates, merges, and deletes, which target specific rows within files rather than entire files or partitions. This is where things get more interesting and where Iceberg gives you some strategic choices. A row-level operation is triggered when your predicate can't eliminate entire files. If you run updateTaxiTrips setFairAmount equals to FairAmount times 1.1, where TripID equals 12345 on our New York City dataset, Iceberg has to find which file contains that specific TripID, and unless you're very lucky, that file contains many other rows with different TripIDs as well. You can't delete the whole file, you need to modify specific rows inside of it. These row-level operations are each extremely useful but can have different costs based on what write mode you use to apply them. We'll dive deeper into these modes in the next video.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.