Sort Orders
1. Sort Orders
Now let's dive into sort orders and partitioning, which are two of the more powerful tools for optimizing Iceberg table performance. Sort orders are key to ensuring that certain ranges of values are isolated to particular data files, making file skipping more effective. When you define a sort order on a table, you're telling Iceberg that data should be organized within files according to that ordering. This generally makes writes more expensive because data needs to be sorted before writing, but it can dramatically reduce the number of files that need to be scanned for queries that filter on the sorted columns, thus speeding up any read actions. Let's look at a concrete example of what it looks like to define a sort order. Say you're building a table of user events, and you know most queries filter by user ID and timestamp. If you define a sort order on user ID timestamp, files will be written with user events grouped together and ordered chronologically within each user. When someone queries for a specific user's events in a time range, Iceberg can use the min-max statistics to skip files that don't contain the user or don't overlap the time range. Sort orders also include more advanced concepts like clustering, though this is still evolving in the Iceberg library at the time of this recording. Techniques like Z-order curves or Hilbert curves are types of multidimensional ordering that can isolate files based on correlated values. If queries frequently filter on both region and product category simultaneously, a Z-order, which is a space-filling curve that groups values close in multidimensional space together, can be used on those columns to create files that are selective for queries on either column or both together. However, sort orders have key limitations you need to understand. They tend to need either a huge amount of correlated data to be selective or they need smaller file sizes to create enough boundaries between different value ranges. Unfortunately, as we noted earlier, most engines are still not performant with small files. This creates a tension. Sort orders work best with many small files, but query engines work best with fewer large files. The current recommendation is to only use sort orders when you have partitions with several full-sized data files. If a partition only ends up with a single file, sort orders are useless because the whole range for the sort columns will be present in a single file. So the min and max, which are already calculated for free, are sufficient to know whether the file can be skipped. In this case, you're paying the write-time cost of sorting for no read time benefit. This is why sort orders are typically more useful on high volume tables with large partitions rather than on small tables or tables with very granular partitioning. Choosing a sort order should only be done if you have strong evidence of your query patterns. You're accepting slower writes in exchange for faster reads, so you need to be confident the trade-off is worth it. A column or set of columns should only be chosen for a sort order if it's discriminatory for the kinds of queries you want to run. Let's look back at our New York City taxi data and create a sort order for it. We can create a table with a sort order on the pickup location and the pickup time. We then can insert some data and show by looking at the files table that we've now created files that do not have overlapping ranges for particular values. This means when we query the table later, we'll be able to pick out one of these files without having to read any of the others because they all have independent ranges for pickup location. Now let's talk about partitioning strategy. We touched on this in the first module, but it deserves a deeper look from a performance perspective. Partitioning is not just important for isolating queries, but also for write performance and creating appropriately sized files. The key insight is that partition cardinality or how many distinct partition values you have directly affects the number of data files created. High cardinality partitioning means more partitions, which means more data files, which as we've discussed increases planning costs, decreases individual file sizes and decreases read performance. Low cardinality does the opposite. Here's a concrete example. 100 megabytes of data in two partitions will most likely result in two files of about 50 megabytes each. The same 100 megabytes in 100 partitions might result in 100 files of about one megabyte each. The first scenario is far more efficient to query because you're opening two files instead of 100, even though you're reading the same total amount of data. There's a trade-off here. Yes, partitioning allows you to skip a relevant data, which is valuable, but there's a point at which the cost of opening and processing many small files outweighs the benefit of being able to skip some of them. If your query needs to scan 90 out of 100 partitions anyway, you haven't gained much from granular partitioning and you're paying the cost of dealing with 90 small files. So how do you choose the right partition granularity? Start by analyzing your query patterns. If queries typically filter on date and most queries look at a full day or more of data, partitioning by day makes sense. If queries typically look at a single hours, hourly partitioning might be more appropriate. But if queries often span multiple days or weeks, daily partitioning might already be too granular and will result in too many files being created. We can see the difference your partitioning choice makes in a simple example. First, let's partition our New York City taxi dataset based on the month of passenger pickup. As you can see, that created relatively few files, but each file contains larger amounts of data and will result in lower ability to skip files. Let's do the same exercise, but partition by day. Now we have a completely different result in terms of the number and size of files. Same data, different way it's stored, and in the end, different query performance. Another consideration is data volume per partition. A good rule of thumb is that each partition should contain at least a few hundred megabytes of data, ideally multiple gigabytes. This ensures you're writing reasonably sized files. If your partitions are consistently small, that's a signal you're over partitioned for your data volume. You may also need to consider partition evolution over time. Maybe today you're ingesting one gigabyte per day, so daily partitions make sense. But if your data volume grows to a hundred gigabytes per day, you might need to switch to hourly partitioning to keep partition sizes manageable. The good news is that Iceberg supports partition evolution, as we discussed in module two, so you're not locked into your initial decision forever. There's also an interaction between partitioning and sort orders that I feel is worth mentioning. If you partition by date and sort by user ID within each partition, you're creating files that are selective on both dimensions. A query filtering on both date and user ID can skip based on partition, eliminating wrong dates, and then based on sort order, eliminating wrong user IDs within the right date. This layered selectivity is powerful, but requires careful planning. Let me give you a practical framework for making these decisions. Start by profiling your query workload. What columns appear most frequently in the WHERE clauses? What's the selectivity of those filters? Do they typically match 1% of the data, 10% or 50%? How much data do queries typically process? Next, consider your write patterns. How frequently are you writing? What's the write volume? Can you afford expensive shuffles and sorts during writes? Or do you need to get maximum write throughput? It's worth noting that the compaction command has the ability to reapply the sort order or add a brand new one to all data files, not just the ones being written. If sorting on write isn't worthwhile because of the small volume of data you insert at a given time, you can always trigger the sorting after accumulating a significant amount of data. With this information, you can make informed decisions. If queries are highly selective on timestamp and you're writing large volumes frequently, daily partitioning with a sort order on another commonly filtered column might make sense. If queries are less selective and scan large portions of the table anyway, simpler partitioning without sort orders might be better to keep writes fast. And remember the beauty of Iceberg. You're allowed to experiment. Create a test table with one partitioning and sort strategy, run representative queries, measure performance, then try a different strategy and compare. Iceberg's branching feature makes this kind of experimentation safe and easy. With this information, you should be able to make good first choices for your Apache Iceberg tables. You won't always get it perfect on the first try, but that's okay because Iceberg's evolution capabilities mean you can adapt as you learn. The exercises will give you hands-on experience with different modeling and configuration decisions and let you see their effects on both write and read performance. Try different partition granularities, experiment with sort orders, and observe how distribution modes affect file layouts. This practical experience will build your intuition for making these trade-offs in a production system.2. Let's practice!
Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.