
Advanced column-oriented database techniques

1. Advanced column-oriented database techniques

In this video, we'll talk about a few of the advanced tools that Snowflake offers to improve both the overall performance of a data warehouse and the performance of individual queries.

2. Micro-partitioning data with Snowflake

Behind the scenes, Snowflake uses micro-partitioning to create smaller "chunks" of rows, while still storing data in a columnar fashion. For example, if a table has one hundred thousand rows, Snowflake may create one hundred micro-partitions, each consisting of one thousand rows. For each of these micro-partitions, Snowflake stores metadata about the rows within it, including the range of values and the number of distinct values for each column. Rather than scanning every row in a table when a query is executed, Snowflake can first check each micro-partition's metadata to determine whether that micro-partition contains records matching the query's parameters, or whether it can be ignored. This is known as "query pruning" and reduces the amount of data processed when a query is run. Micro-partitions also allow DML operations to be executed more efficiently. Let's take a look at an example now!
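To make that metadata concrete, the per-column statistics Snowflake tracks for each micro-partition resemble what the aggregate query below computes. This is a minimal sketch for illustration only, assuming a "books" table with a "price" column; Snowflake maintains this metadata automatically, so you never run such a query yourself.

-- Illustration only: per-column statistics of the kind Snowflake
-- tracks automatically for each micro-partition (range of values
-- and number of distinct values).
SELECT MIN(price)            AS min_price,
       MAX(price)            AS max_price,
       COUNT(DISTINCT price) AS distinct_prices
FROM books;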

3. Micro-partitioning data

Here, we see the "books" table before and after partitioning. On the right, the table is grouped into two micro-partitions, each containing three rows. While not explicitly shown, each micro-partition contains metadata about the group of rows it is comprised of, including each column's unique values, as well as the range of entries. These records are still part of the same table, but now belong to a micro-partition as well.

4. Data clustering with Snowflake

Once data has been micro-partitioned, Snowflake then organizes the data within each micro-partition. This is known as data clustering and is performed automatically when data is loaded into a table. By organizing data within micro-partitions and storing the associated metadata, Snowflake can reduce the amount of data accessed during query execution, which helps to improve query performance. While clustering is done automatically by Snowflake, a user can specify a subset of a table's columns, known as "cluster keys", to cluster data in a custom manner.
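As a sketch of what specifying cluster keys looks like, the statements below use Snowflake's CLUSTER BY syntax on an assumed "books" table (the table and column names are taken from this lesson's example):

-- Define a cluster key when creating the table.
CREATE TABLE books (
    title  VARCHAR,
    author VARCHAR,
    price  NUMBER(10, 2)
)
CLUSTER BY (price);

-- Or add (or change) a cluster key on an existing table.
ALTER TABLE books CLUSTER BY (price);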

5. Data clustering

Here, we see the two micro-partitions created from the "books" table. On the right are these same micro-partitions after being clustered by the "price" field. In this example, "price" is the only cluster key, and the records within each micro-partition are sorted by the price of each book.
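If you want to check how well a table is clustered on a given set of columns, Snowflake provides the SYSTEM$CLUSTERING_INFORMATION function; the call below assumes the "books" table from this example:

-- Returns JSON describing clustering quality for the given columns,
-- such as the total number of micro-partitions and the average depth.
SELECT SYSTEM$CLUSTERING_INFORMATION('books', '(price)');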

6. Query pruning

Combining micro-partitions and data clustering allows Snowflake to reduce the amount of data that is processed when a query is executed. Here, we've written a query to return the title, author, and price of all records from the books table with a price greater than twenty-five dollars. When executed, Snowflake first looks at the metadata for each micro-partition. Since the range of prices in the second micro-partition doesn't contain values greater than twenty-five, it's "pruned", and only the first micro-partition will be processed. Within the first micro-partition, data is sorted by the "price" field in ascending order. After the second value is scanned, the remaining values can be returned without further scanning: the ascending sort guarantees they are greater than or equal to that value, 49.99, and therefore greater than twenty-five. The corresponding values in the "title" and "author" columns are then added to the result set and returned. In this example, only two values in the price column were scanned to generate the result set, thanks to a column-oriented design that uses micro-partitions and data clustering. As the amount of data in a Snowflake table grows, these tools can dramatically reduce the amount of data processed when a query executes, reducing both time-to-insights and cost.
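Written out, the query described in this example looks like the following (table and column names as narrated above):

-- Only the first micro-partition is scanned; the second is pruned
-- because its price range contains no values above 25.
SELECT title, author, price
FROM books
WHERE price > 25;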

7. Let's practice!

Now that we've explored some of Snowflake's more advanced techniques, let's practice with a few exercises.