Optimize compute resources for efficient batch ingestion
1. Optimize compute resources for efficient batch ingestion
Until now, we haven't explicitly discussed Snowflake's compute resources, although we've been actively using them to ingest data. They're incredibly important to the batch ingestion process and to so many other general operations within Snowflake. So what exactly are they? Snowflake's compute resources are known as virtual warehouses. Under the hood, a virtual warehouse is a cluster of compute resources in the cloud. For now, let's simply refer to them as compute clusters. Compute clusters can be made up of one or more nodes. Each node is a virtual machine that provides the CPU, memory, and temporary storage to execute SQL and other operations against your data. A single node is able to perform multiple data operations in parallel using threads. You don't need to understand the technical details behind threads, but this specific capability of nodes being able to run threads in parallel is key to understanding how to optimize virtual warehouses for the most efficient batch ingestion. In short, understanding the concept of parallel threads will help you understand how to get the most out of a virtual warehouse. Okay, so that was a little abstract. How exactly does this impact ingestion? Well, let's get a little more concrete. As you may have seen within the UI, Snowflake virtual warehouses come in different sizes. They're specifically t-shirt sized. The smallest size for a virtual warehouse is extra small, scaling up to as large as 6XL. The important takeaway for you to understand is that there's a direct relationship between the size of the virtual warehouse and the number of threads available to you in that warehouse. Simply put, the larger the virtual warehouse, the more threads you're able to use. In fact, each subsequent warehouse size indicates a doubling of the number of threads provided by the virtual warehouse. So an extra small virtual warehouse provides 8 threads, a small virtual warehouse has 16 threads, a medium has 32 threads, and so on. And keep in mind, these numbers are for a single node in the compute cluster for that virtual warehouse. And since there can be multiple nodes in a compute cluster, to get the total number of threads available to you, you'd simply multiply the number of threads in a single node for a given virtual warehouse size times the number of nodes in the compute cluster. For example, we know that an extra small virtual warehouse with a single node provides you with 8 threads. So adding a node to that compute cluster would bring the total number of threads in the extra small virtual warehouse to 16. And a 3-node compute cluster would bring the total number of threads to 24. Okay, so what exactly do these threads have to do with ingesting data? Well, remember, threads are able to operate in parallel. So to fully utilize a virtual warehouse, all threads should be doing some amount of work during ingestion. That is, there are no idle or unutilized threads during ingestion. That's the key takeaway for optimally using compute resources for batch loading. As an example, to fully and optimally utilize a small virtual warehouse during ingestion, you'd ideally upload 16 data files at once. And that's because a small virtual warehouse with a single node has 16 threads. Each thread can handle ingesting a file, and the 16 threads can operate in parallel. You can, of course, always ingest fewer than 16 files on a small warehouse, but you simply would not be fully utilizing the warehouse. Say, for example, that you want to ingest one large file. The ingestion in this case would run on one thread, while the other 15 threads would sit idle. That's not very efficient. Okay, so now you know how to fully utilize a warehouse by making sure each thread in the warehouse is doing some work. But you can optimize ingestion even further. For the most efficient ingestion, each thread has a sweet spot, a preference, really, for the size of the data file. In fact, the optimal size for a file is between 100 and 250 megabytes. Now, naturally, I hear the following question a lot. My data files won't always fall within that range in file size. So what should I do? You should split your large files into smaller files, such that each of the smaller files fall into the recommended 100 to 250 megabyte range in file size. Now, that might sound like a huge effort, but there are lots of tools and handy command line utilities out there to greatly expedite this. And the effort is totally worth it, because ingesting large files on a single thread is far less efficient than ingesting the contents of that same file across multiple appropriately sized chunks. For example, with an extra small virtual warehouse, it's more efficient to ingest eight 128 megabyte files that were created as chunks from a one gigabyte file rather than ingesting that one gigabyte file in whole. And that's because smaller files can be processed in parallel more efficiently, which speeds up the ingestion process and ensures a more even distribution of work across the warehouse's compute resources. So keep the following in mind when working with virtual warehouses for batch ingestion. Always aim to utilize all of the threads of the virtual warehouse you're using. The optimal file size of a data file is between 100 and 250 megabytes. Always aim to ingest data files in this range on each thread of the virtual warehouse. You should split large files into chunks of this size if necessary. And you can go even further by compressing the files to get them to be smaller. Snowflake will handle the decompression for you during ingestion. If cost effectiveness is a constraint, consider starting with an extra small virtual warehouse and scale up only if needed. Personally, I always start with an extra small virtual warehouse myself. Okay, we covered a lot of ground in this video. Let's recap what you learned. You learned that a virtual warehouse is a cluster of cloud-based compute resources. You learned that clusters are made up of one or more nodes and that a node is a virtual machine that offers memory and computing power. Nodes use threads to perform operations. You learned that a single node offers eight threads and that an increase in size for a virtual warehouse corresponds to a doubling of the amount of available threads. You learned that files are ideally between 100 and 250 megabytes in size. In the next video, we'll learn how to load data using Snowflake's command line interface.2. Let's practice!
Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.