Handling Concurrency in Apache Iceberg
1. Handling Concurrency in Apache Iceberg
Now that we've discussed row-level operations, let's briefly touch on conflict detection and concurrency. One of Iceberg's key features is that multiple writers can commit to the same table concurrently. But what happens when two writers try to modify the same data at the same time?

Iceberg uses optimistic concurrency control. When you start a write operation, you're working against a specific snapshot of the table. When you commit, Iceberg checks whether any conflicting changes have been committed since you started. What constitutes a conflict depends on the engine and the configured isolation level. But generally, if two writers are modifying different files or different partitions, there's no conflict, and the later writer just has to retry its commit with new, automatically generated metadata. This type of retry is very fast, usually under a second, and you generally won't notice when it happens. If two operations are modifying the same file, one commit will fail, and in Spark, an exception is returned to the client. At this point, it's up to the user to decide what the proper retry behavior should be, since retrying requires generating completely new data files.

The key things to worry about are operations that touch the same partitions or the same files. If you have two batch jobs both overwriting yesterday's partition simultaneously, one will fail and need to be rerun. If you have concurrent updates hitting the same subset of rows, one will be rejected.

How do you avoid conflicts? First, partition your data well so different writers naturally work on different partitions. A streaming job writing today's data and a batch job backfilling last month's data won't conflict if they're in different partitions. Second, make sure that you are exposing predicates to the query engine. For example, in a Spark MERGE statement, only the conditions in the ON clause of the command are used to determine whether two commits are isolated.
Third, design your pipelines so that any given partition or file is owned by one writer at a time when possible.

In the coming exercise, you'll practice different write patterns and compare streaming-style frequent small commits with batch-style large commits. You'll experiment with row-level updates in both merge-on-read and copy-on-write modes and observe the performance differences. Lastly, you'll deliberately create conflicting writes to see how Iceberg detects and handles them. This hands-on experience will give you intuition for how to design write patterns that are both safe and performant.

The key takeaway is this: Iceberg gives you flexibility in how you write data, but with that flexibility comes responsibility. Understanding the trade-offs between different write strategies (copy-on-write versus merge-on-read, small frequent commits versus large batches, partition-level operations versus row-level operations) lets you make informed decisions that balance write performance, read performance, and operational complexity for your specific use case. With your new knowledge, you should be well-equipped to handle these weighty decisions. See you in the next video.
2. Let's practice!
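To build some intuition before the exercise, the optimistic concurrency flow described in this lesson can be sketched as a toy model in plain Python. This is not Iceberg's API: the `ToyTable` class, the `CommitConflict` exception, and the file names are all hypothetical, and the check shown (overlapping file paths) is a simplification. Real conflict detection depends on the engine and the configured isolation level.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when a commit overlaps with changes made after its base snapshot."""


@dataclass
class ToyTable:
    # Hypothetical stand-in for table metadata: one entry per committed
    # snapshot, holding the set of data files that the commit touched.
    history: list = field(default_factory=list)

    @property
    def snapshot_id(self) -> int:
        # Snapshot 0 is the empty table; each commit adds one snapshot.
        return len(self.history)

    def commit(self, base_snapshot: int, touched_files: set) -> tuple:
        """Try to commit work that was prepared against base_snapshot.

        Returns (new_snapshot_id, rebased), where rebased is True when other
        commits landed in the meantime but none of them overlapped.
        """
        # Everything committed since this writer started its operation.
        newer = self.history[base_snapshot:]
        for committed_files in newer:
            if committed_files & touched_files:
                # Same file modified by someone else: the writer must redo
                # its work and produce new data files.
                raise CommitConflict("overlapping files, rerun the write")
        # No overlap: only the metadata needs to be regenerated, which is
        # the cheap retry case.
        self.history.append(set(touched_files))
        return self.snapshot_id, bool(newer)


table = ToyTable()
base = table.snapshot_id  # all three writers start from the same snapshot

# Writers A and B touch different files, so both commits succeed.
snap_a, rebased_a = table.commit(base, {"day=01/file-a.parquet"})
snap_b, rebased_b = table.commit(base, {"day=02/file-b.parquet"})

# Writer C also started from the old snapshot and rewrites A's file.
try:
    table.commit(base, {"day=01/file-a.parquet"})
    outcome = "committed"
except CommitConflict:
    outcome = "conflict"

print(snap_a, rebased_a, snap_b, rebased_b, outcome)
```

In this sketch, writer B lands in the cheap "retry with new metadata" case, while writer C hits the case where the caller has to decide how to rerun the write.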