Micro-Partitions and Data Clustering

1. Micro-Partitions and Data Clustering

Welcome to Chapter 3. In the last video, we covered the different ways you can load data into Snowflake. Now it's time to take a look underneath the hood at what's actually happening when Snowflake stores that data and why it matters for how your queries perform.

2. Querying a Large Table

Let’s start with an example. At month end, the team at Snowy Peak are running a query to pull the last month’s subscription sign-ups for a commercial report. This table contains three years of event data, with over one billion rows. Everytime it’s run, the query has to decide how much data do I need to read from this table? Traditional databases use index, but Snowflake doesn’t - so what method do they apply?

3. Micro-Partitions

Enter micro-partitions. When data is loaded into Snowflake, it automatically gets divided into micro-partitions, which are chunks of on-disk storage on the order of 50 to 500 MB of uncompressed data The actual size on disk is compressed and smaller. Within each partition, the columns are stored and compressed separately. This columnar layout is one of the primary reasons why Snowflake queries can be fast on large tables because it only reads columns that a query actually needs. More importantly, this all happens automatically in the background.

4. Metadata

For every micro-partition that Snowflake creates it stores metadata including min and max value for each column, the row count, null counts and distinct value counts. This metadata is stored in the cloud services layer and is maintained automatically as data is loaded and updated.

5. Partition Pruning

Now let's look at example, let's say that we want to filter a sales table based on the month being May.

6. Partition Pruning

Snowflake checks the min and max date values stored for each partition, and any partition where the date range does not include the May data will get skipped. This engine never touches those rows, this concept is called partition pruning.

7. Clustering

Pruning only works well when the column you are trying to filter on is well-clustered, meaning that the rows with similar values are grouped together in the same micro-partitions. If Snowy Peak’s events are loaded in date order, date-based queries will prune well because the data’s natural sort order does the work. However, if you filter on region or subscription plan and those values are scattered across partitions regardless of load order, pruning may not help as much. You can define a clustering key so Snowflake organizes micro-partitions around that column for future maintenance passes.

8. Clustering Key: When to Use

A clustering key makes sense when three things are true: the table is large enough that pruning actually matters, think hundreds of gigabytes; you're consistently filtering on the same column; and that column is poorly clustered, meaning pruning isn't working for it.

9. Clustering Key: Cardinality and Cost

Cardinality matters here too. A column with very few distinct values, like a boolean flag, won't create meaningful partition boundaries, so it's not a good candidate. There is one more thing worth knowing: Snowflake reclusters automatically over time as new data arrives, using its own serverless compute rather than your virtual warehouse. That's useful, but it does add to your bill. So clustering keys need to be a deliberate, cost-conscious decision. Not something you apply by default.

10. Let's practice!

Let's put your knowledge of micro-partitions and data-clustering to the test!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.