DE - Snowflake Data Engineering Overview - Part II
1. DE - Snowflake Data Engineering Overview - Part II
In continuing our discussion about Snowflake Data Engineering, we’re now going to cover data transformation, orchestration, and observability. First let’s talk about data transformation. We already did a video on Snowpark dataframes, so you should be at least somewhat familiar with that – though I recognize we moved through that quickly, so no worries if you didn’t retain much. We’ve also covered UDFs and UDTFs, as well as Stored Procedures, which are useful tools as you try to convert your data from a raw state into a more usable final form. And of course, SQL is a powerful tool for data transformation – It has a ton of built-in functions, and I wanted to include it here just so we don’t overlook one of the most critical tools of them all just because it’s so familiar to us. The major item here that we haven’t covered much yet is Dynamic Tables, though we touch on it very briefly in our video on views, since they are similar in many ways. In fact, I think the easiest way to think about Dynamic Tables is to think of them as sort of similar to materialized views, but much, much more flexible. Where materialized views update right away when a table on which they’re based changes, Dynamic Tables let you specify a refresh rate of 1 minute or more. And crucially, dynamic tables can handle much more complex transformations with automated orchestration. It only recently came together in my head that Dynamic Tables pair really well with streaming ingestion methods, like Snowpipe, because you can get really fast refresh rates for your data ingestion *and* really fast refresh rates for your downstream tables that depend on that data, and everything happens without your needing to set up and maintain a lot of other orchestration. For this reason, we could probably also have covered Dynamic Tables in our section on orchestration – it’s like a table with in-built orchestration functionality. Okay, so here are some code snippets of these, many of which I took from earlier videos in the course. This first one is of a Snowpark Dataframe transformation, and I took it directly from the section of our course on Snowpark dataframes. It’s just selecting specific columns from a dataframe. In this next one, you can see that we’re creating a Dynamic Table with a refresh rate of 1 minute. Snowflake is also working to drive this latency even lower, and the best part is you don’t have to rebuild anything in your pipeline, simply adjust a single target lag parameter here and your streaming processing will be all set. Then you’ll see a stored procedure we created earlier in the course (“CREATE PROCEDURE”), a UDF we created earlier in the course (“CREATE FUNCTION”), and a SQL query we ran in our section on semi-structured data. Now let’s talk about some of the ways Snowflake helps you do orchestration. There are a ton of open-source tools you can use to orchestrate your Snowflake pipelines – like Airflow, Prefect, or DBT – but here we’re going to focus on the Snowflake-native tools: Streams and tasks. The key thing to know about streams is that you create a stream *on* a table, and this stream keeps track of changes to that table (appends, deletes, updates), so that you can update downstream tables based only on those incremental changes. I like to think of streams as a tool that keeps track of what’s happened recently inside a table. Tasks are different – tasks can actually *do* something by running stored procedures, SQL queries, etc., and they can do this on a particular schedule, at a particular interval, or based on when another task completes. Streams and tasks are *really* powerful when combined, because you can use a task to insert only those rows it pulls from a stream (meaning, those roles that were identified by the stream as being newly added to a table, for example). So you can have a task that checks every once in a while for updates to another table *by looking at the stream for that table*, and then it can insert those newly updated rows into a downstream table, or do something else with them. A great structure for efficient, incremental data processing. Here are some code snippets that might make this more concrete. The syntax for creating a stream is REALLY simple: You just use the “CREATE STREAM” command followed by the stream name, and then you specify the table on which you’d like to create that stream. Creating a task is also pretty simple, though there are a few more components to it – You just use the “CREATE TASK” command. Then you specify the warehouse you want to use (there are also serverless tasks, which don’t require you to specify a warehouse, but we’re not covering those in this course). You’ll notice this task is set to run every minute, and what it actually does is look at the stream, pull the new rows from the associated table, and insert them into this new_sales table. And finally, let’s talk about some of the ways Snowflake helps you with observability. Snowflake lets you set alerts and notifications. An [alert](https://docs.snowflake.com/en/user-guide/alerts) takes an action you specify when a condition is met. So you could check to see if a query is taking too long, and then record that in a table somewhere. Or you could have it initiate sending you an email. Notifications are fairly similar – it’s not a distinction we need to dig into here. Snowflake also has [logging](https://docs.snowflake.com/en/developer-guide/logging-tracing/logging) functionality – One convenient way to store and access logging is through Snowflake Event Tables, which let you record messages like warnings or errors that are generated from UDFs, Stored Procedures, etc. written in any one of multiple languages (Java, Python, etc.) Similarly, you can use Event Tables to store [Trace](https://docs.snowflake.com/en/developer-guide/logging-tracing/tracing) events, which can hold really flexible event types – so say you’re running some calculation, and you want to record an intermediate value to your event table. Trace events can help you do this. And if you have tasks that trigger other tasks, Snowflake lets you view the connections between your tasks as a DAG (a directed acyclic graph). Some quick examples – You can see a basic alert here, which checks every once in a while to see if the values in a table are exceeding some threshold, and then it stores those values in another table. You can see a call to the SYSTEM$SEND_EMAIL procedure, which will send an email to a particular account with the subject line and body you specify. You can see how to create an event table, though you add values to that table through the body of the relevant UDF, Stored Procedure, etc. And then you can see an example of a Dag that helps you debug a task graph run. Whew! That was a lot. I don’t expect you to have internalized what we just discussed, but I’m hoping that you’ll come away from this topic with a few high-level takeaways: First, Snowflake provides you with ways to handle ingestion, transformation, orchestration, and observability. Second, Snowflake’s platform natively supports both batch and streaming ingestion, and several of the major ingestion methods are Snowpipe auto-ingest, COPY, standard Snowpipe, Snowflake Native Connectors, and data shares. Third, Snowflake lets you transform data with dynamic tables, Snowpark dataframes, Stored Procedures, UDFs + UDTFs, and SQL functions. Fourth, Snowflake lets you do data orchestration with streams and tasks. Fifth, Snowflake helps you monitor your data pipelines with alerts + notifications, logging + tracing with event tables, and the task DAG. Now that we’ve done that whirlwind high-level overview, let’s put our feet back on the ground and see what it takes to set up a standard Snowpipe to ingest some data.2. Let's practice!
Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.