Copy on Write and Merge on Read

1. Copy on Write and Merge on Read

Iceberg supports two strategies for handling row-level operations, copy-on-write or COW and merge-on-read or MOR. This is a fundamental design choice that affects both write performance and read performance. At a high level, copy-on-write is slower at write time, but faster at read time, while merge-on-read is faster at write time, but slower at read time. In copy-on-write mode, when you update or delete rows, Iceberg rewrites the entire affected data file. Let's say that you're updating 10 rows, but those rows are spread across three data files that each contain 100,000 rows. Iceberg will copy all three files, apply the updates or deletions, and then write out three new files with the modifications, thus copying-on-write. The old files are marked as deleted in the new snapshots manifests. Let's take a look at a small example of this. On our existing dataset, we can run this code and when it's finished, we can view our results. This approach means slower writes because we have to read and then write all of the rows we aren't changing as well as those we are. The upside is that the reads are fast because our query engine can just read the new data files as is. When you query the table, you're just reading Parquet files like normal. In merge-on-read mode, Iceberg takes a different approach. Instead of rewriting data files, it writes small position deletes, either as Parquet in Iceberg V2 or Puffin file delete vectors in V3 that track which rows in existing data files should be ignored. For our same update of 10 rows across three files, Iceberg would write delete files marking those old versions of the rows as deleted, then write a small data file with the 10 new versions. Just like we did before for copy-on-write, let's look at an example of merge-on-read by running the following snippet. With it finished, we can look at the results. We can also jump over to the metadata table and see that the position deletes have been listed there. This makes writes much faster because you're only writing the changes, not rewriting entire files. We can also benefit from storage that is caching since we are not changing the files entirely, just adding an attachment. The trade-off is that reads become slower because the query engine has to merge the data files with the delete files at read time, thus merge-on-read, filtering out the deleted rows and applying the updates. It's doing extra work on every scan. So which mode should you use? Well, it depends on your specific workload. The easiest decision point is to look at what portion of the data you are changing in a given command. A good rule of thumb is that if you are changing more than 20% of a file, it makes sense to use copy-on-write. This is common in European GDPR situations or modifying the state of a large portion of your table, like we did in our branching examples in previous lessons. If you're doing near real-time or smaller updates like IoT sensor data ingestion or change data capture transactions, merge-on-read can give you much better write throughput, but reads will be slightly slower. Many systems default to merge-on-read and then periodically compact the delete files back into the data files during maintenance windows, getting the best of both worlds. Each of these strategies is perfectly valid in its own right, but be sure you think through your needs before choosing one to implement, or you'll waste time and money and potentially leave your table users frustrated.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.