Writing efficiently to Iceberg Tables

1. Writing efficiently to Iceberg Tables

We've talked a lot about how Iceberg works under the hood and how to maintain tables. Now let's discuss something that has perhaps the biggest impact on real-world performance: how you model your tables for efficient writing. The way files are laid out in your Iceberg table fundamentally determines how well it handles reads and writes over time. Iceberg gives you several options to control this layout, and choosing the right settings for your use case can mean the difference between a table whose queries run quickly and one that crawls.
Let's start with the most fundamental concept: file size. This might seem simple on the surface, but it's a crucial piece of your optimization framework. Most query engines available today still cannot open multiple data files simultaneously within a single worker thread. This means that keeping file sizes reasonably large is critical for scan performance. Why is this? It's because the cost of opening a file, which means establishing the connection, reading the footer, and parsing metadata, is generally much higher than the cost of continuing to read from a file that is already open.
Think about it this way. Would you rather open and read 100 files that are one megabyte each, or open and read one file that's 100 megabytes? The total data is the same, but the first scenario involves 100 file-open operations with all their overhead, while the second involves just one. If your engine has to open files serially, reading the 100 small files is much more expensive.
The ideal file size unfortunately depends on your use case and the engines involved. But as a rule of thumb, fewer, larger files are better than many small ones. The actual size of the files written is usually engine dependent, and this is where things get complicated. Different engines have different mechanisms for grouping and distributing data before writing, and these decisions affect the resulting file size.
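
The 100-small-files versus one-large-file comparison can be sketched as a toy cost model. This is only an illustration: the per-open overhead and read throughput below are made-up numbers, not measurements from any engine or object store, and the function name is hypothetical.

```python
def scan_cost_ms(num_files, file_size_mb, open_overhead_ms=50.0, read_ms_per_mb=5.0):
    """Estimate the time for one worker to scan files serially.

    open_overhead_ms and read_ms_per_mb are assumed values chosen only
    to illustrate the shape of the tradeoff.
    """
    open_cost = num_files * open_overhead_ms               # one open per file
    read_cost = num_files * file_size_mb * read_ms_per_mb  # same bytes either way
    return open_cost + read_cost


# Same 100 MB of data, laid out two different ways.
many_small = scan_cost_ms(num_files=100, file_size_mb=1)
one_large = scan_cost_ms(num_files=1, file_size_mb=100)

print(f"100 x 1 MB files: {many_small:.0f} ms")
print(f"1 x 100 MB file:  {one_large:.0f} ms")
```

Under these assumed numbers the read cost is identical in both layouts, so the whole difference comes from the repeated file-open overhead.
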
Let's focus on Spark, as we've been doing throughout this course, but keep in mind that Trino, Flink, and other engines have their own approaches. In Spark, Iceberg uses something called distribution modes to tell Spark what to do with data before writing it. By default, Iceberg attempts to match the distribution to the table's configuration and write optimally sized files, but this has important performance implications on the write side as well.
For partitioned tables, the default distribution mode is to request that data be grouped by partition value before writing. This makes intuitive sense. If you're partitioning by date, you want all of the data for January 15th together, so it can be written into the smallest number of files possible. If the data were split among multiple tasks, each task would write its own separate files. But here's the catch. This grouping requires a shuffle operation in Spark, which redistributes data across the cluster. And shuffles are expensive because they involve network I/O, serialization, and potentially spilling to disk.
The benefit is that this approach writes the minimal number of files per partition. However, it's still limited by Spark's internal shuffle size configuration. In newer versions of Spark, Iceberg can suggest a shuffle partition size large enough to get close to the table's target file size, but this behavior varies from version to version and isn't always perfect. You might still end up with multiple smaller files per partition if the shuffle partition size is configured too small, or if the data compresses very well on disk, leading to a less-than-optimal result.
Now, if your partitioned table also has a sort order defined, which we'll talk more about in the next video, the default distribution mode becomes more sophisticated. Iceberg requests that the data be grouped not just by partition, but also sorted within each partition according to the defined sort transforms.
This ensures you're writing the minimal number of files per partition, while also ensuring that those files have the most selective min and max statistics possible for the sort columns. Why does this matter? Remember that Iceberg uses column metrics for file skipping. If you have a sort order on customer ID and write data in sorted order, each file will have a tight range of customer IDs. Maybe file 1 has customers 1 through 1,000, file 2 has 1,001 through 2,000, file 3 has 2,001 through 3,000, and so on. When you query for customer ID 1,500, Iceberg can skip files 1 and 3 entirely. But if the data wasn't sorted, each file might have customer IDs scattered throughout the entire range, and Iceberg would have to scan all of them.
The tradeoff is that sorting data before writing is expensive, because it requires additional processing and potentially more shuffling to get everything in the correct order. If you're familiar with sorting algorithms from computer science, this is the same basic concept, just distributed across a cluster. This is why sort orders should be chosen carefully based on your actual query patterns.
The last distribution mode in Spark is none, which tells Spark to skip all the grouping and sorting and just have each task write directly to files without further processing. If your data is already well organized within Spark, this is the fastest way to write, because you avoid expensive shuffle operations. This could happen if your data is being ingested from another source that's already partitioned and sorted correctly.
But here's the danger. If your data isn't already well organized, distribution mode none can produce a disastrous number of small files. In a worst-case scenario, each Spark task might write a tiny bit of data to every possible partition. If you have 200 Spark tasks and 100 partitions, you could end up with 20,000 tiny files in a single write operation. This is exactly the kind of small-file problem that we said needs compaction to fix.
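
Two of the ideas above can be sketched in plain Python: file skipping based on min and max statistics, and the worst-case file count when nothing is grouped before writing. This is a simplified model, not Iceberg's actual planning code. The DataFile class, the function names, and the ID ranges in the unsorted case are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file and its column statistics."""
    name: str
    min_customer_id: int
    max_customer_id: int


def files_to_scan(files, customer_id):
    """Keep only the files whose [min, max] range could contain customer_id."""
    return [
        f.name
        for f in files
        if f.min_customer_id <= customer_id <= f.max_customer_id
    ]


def worst_case_file_count(num_tasks, num_partitions):
    """Every task writes at least one file into every partition."""
    return num_tasks * num_partitions


# Data written in sorted order: each file covers a tight, distinct range.
sorted_files = [
    DataFile("file-1", 1, 1000),
    DataFile("file-2", 1001, 2000),
    DataFile("file-3", 2001, 3000),
]

# Data written unsorted: every file spans nearly the whole ID range.
unsorted_files = [
    DataFile("file-1", 3, 2990),
    DataFile("file-2", 1, 3000),
    DataFile("file-3", 12, 2875),
]

print(files_to_scan(sorted_files, 1500))
print(files_to_scan(unsorted_files, 1500))
print(worst_case_file_count(num_tasks=200, num_partitions=100))
```

With sorted data only file-2 survives the check. With unsorted data every file's range overlaps the lookup value, so nothing can be skipped.
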
All of these distribution options are strongly influenced by how you've configured your table's partitioning and sort order. In the next video, we'll explore sort orders and partitioning strategies in depth and learn how to choose the right configuration for your use case.
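
As a recap of the distribution modes covered in this video, here is a minimal sketch of how one might set them explicitly as table properties. It assumes the Iceberg table properties write.distribution-mode (with the values none, hash, and range, where hash groups by partition value and range also orders the data) and write.target-file-size-bytes. The helper function and the table name db.events are hypothetical, and the function only builds the SQL text rather than running it.

```python
VALID_MODES = ("none", "hash", "range")


def distribution_mode_sql(table, mode, target_file_size_bytes=None):
    """Build a Spark SQL statement that sets Iceberg write properties.

    Hypothetical helper: it only renders the statement as a string.
    Running it would need a Spark session with an Iceberg catalog.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown distribution mode: {mode!r}")

    props = {"write.distribution-mode": mode}
    if target_file_size_bytes is not None:
        props["write.target-file-size-bytes"] = str(target_file_size_bytes)

    rendered = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({rendered})"


# Group rows by partition value before writing (costs a shuffle).
print(distribution_mode_sql("db.events", "hash"))

# Skip the shuffle entirely; only safe when the data is already well organized.
print(distribution_mode_sql("db.events", "none", target_file_size_bytes=512 * 1024 * 1024))
```

In a live session the returned string could be passed to spark.sql(...). Check the property names and defaults against the documentation for the Iceberg version you are running before relying on them.
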

2. Let's practice!
