Efficient transformations with streams

1. Efficient transformations with streams

So far, you've learned how to perform transformations against your data using some of Snowflake's core features, SQL, Snowpark, and UDFs. You'll run into these concepts and features over and over again, and it's good to have a baseline knowledge of them as you learn more about building data pipelines. In this video, we'll level things up a bit and start focusing on efficient transformations. If you've noticed, in all of the exercises that we've completed on transformations so far, we've performed our transformations and calculations against entire sets of data. Every time we wanted to modify our views, like Daily City Metrics V, for example, we reprocessed every single row in that view to do so. But what if we could focus on only processing rows with changes, rather than all of the rows within a table or view? That would be far more efficient. For example, let's say you have a table with 20 million rows. You need to calculate an aggregate value, like a sum, using a column in that table. And you need to recalculate this value daily, because new data is added to this table every day. If, say, only 1,000 rows are added to the table overnight, do you really need to reprocess the other 20 million rows to perform your aggregate calculation? Thankfully no, and this is where streams come in. So what are streams? Streams are objects in Snowflake that are able to keep precise track of all of the changes that have been made to a table. They can actually keep track of changes to views too, but we're going to focus on how they're used with tables in this video. If you've created a stream to track a table, you can query the stream to view the latest changes made to that table. After querying the stream, it essentially resets and starts recording subsequent changes from that point forward. Pretty cool, right? This makes streams great for processing incremental changes to tables and making transformations more efficient. You can see how you might start extending this behavior for more complex transformations. For example, rather than perform an aggregation by reprocessing an entire table when new data is added to it, you could perform the same aggregations by processing only the new rows, and then combine that value with whatever previously aggregated value was calculated for the table. Let's create a stream that will help us make sure our views stay up to date. Now is a good time to pause the video to log into your Snowflake account. We've already created a view that tracks the weather for Hamburg, Germany. We can use that view to keep an eye on the weather there. But just to be exhaustive, I'd also like to keep an eye on the daily sales there. Take a look at the order header table in the raw POS schema. If you explore this table, you'll note that it contains data about orders. It also contains a location ID column, which can be used to look up the city where the sale occurred. This is perfect for what I'm trying to do. I'll create a stream on the order header table, and track any new sales that are added to the table. Later, I'll write logic to determine whether or not that sale occurred in Hamburg. Okay, open a new SQL worksheet. Set your context. Next, type create or replace stream order header stream on table tastybites.rawPOS.orderheader semicolon. Run the file. The stream should be successfully created. Okay, that's it. The stream is now tracking changes to the order header table. If you're not convinced, let's test it out with some dummy data. Navigate to the order header stream SQL file in the module 3 folder. Copy everything starting at the insert into statement all the way to the bottom. Paste it into the SQL worksheet. Run only the insert statement. Okay, if all went well, I should be able to query the stream directly and confirm the addition of the new row. Run select star from order header stream. There it is. That's the row we just inserted. I want to show you something really important. We have a lot of columns in this table, so it's actually not apparent at first glance. Scroll all the way to the right in the results pane. Take a look at the last three columns here. Metadata action, metadata is update, and metadata row ID. These columns provide more information about the changes to the table. For example, metadata action lets me know that an insert operation occurred on the table, which is correct. We use an insert statement to add our dummy data. Metadata is update is false because this row wasn't part of an update operation. Okay, let's remove the dummy row that we added. Optionally, you can run the final statement in the SQL, but make sure you read the note in the comment to understand why the deleted row won't be returned in the query. Okay, that's it. We'll use a stream to keep an eye on sales in Hamburg, and we'll set up the rest of the logic to do that in the next video. For now, let's recap what you learned. You learned that streams are objects that keep track of changes to tables or views. Streams can be queried to view these changes, and streams can help you perform efficient transformations by allowing you to process only incremental changes to a table or view. This gives you a lot of granular and efficient control over your transformations. Coming up, I'll show you how to use stored procedures to build out more complex procedural logic.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.