DE - Snowflake Data Engineering Overview - Part I

1. DE - Snowflake Data Engineering Overview - Part I

I was never a data engineer – I was a data scientist. But back when I was a data scientist, data engineers saved my life many times by helping undo mistakes I’d made or by creating high-quality tables for me to use. When data engineers are around, I feel warm and safe. Like everything is going to be okay. I don’t drink alcohol, but if I did, and a data engineer were sitting right here next to me, I would raise my glass to that data engineer, and say: “Data engineer, I dedicate this video to you.”

So it’s with a great deal of respect that I now cover how Snowflake supports data engineering workloads. You might recall that in the video on Snowpark DataFrames, we talked very briefly about the ITD framework – where the I, T, and D stand for Ingestion, Transformation, and Delivery. Here we’re going to focus on ingestion and transformation – we’re not going to talk about the delivery part – and we’re going to add in two other subjects we haven’t explicitly covered in much detail yet, but that are important for data engineering: Orchestration and Observability. Let’s dive in.

Okay, so this slide gives you a sense of the Data Engineering landscape at Snowflake. It’s not everything, but we’ve got many of the key elements. As a refresher, this first column, Ingestion, refers to how you collect data and get it into Snowflake. You can’t do any data work without having data in your system. The second column, Transformation, refers to cleaning and preparing that data so it’s useful for downstream purposes. The third column, Orchestration, refers to scheduling how and when you run specific steps in your data pipeline – so when you ingest the data, what triggers transformations to occur, etc. And the fourth column, Observability, refers to how Snowflake helps you make sure everything’s running as expected and debug problems when something’s gone wrong.

I’m a big believer that data practitioners get nowhere near enough opportunities for group therapy.
So in the spirit of data therapy, I wanted to say that the world of data tooling is so vast that there can be concepts you’ve heard a bunch about but don’t understand, so you get anxious when they come up. For a long time, observability was like this for me. I’d hear “observability,” and panic a little, thinking that I wasn’t competent because I didn’t have a clear sense of what it meant. You don’t need to panic. I don’t need to panic. We’ll all figure it out bit by bit. The thing that tends to help me go most quickly from that sense of uncertainty to feeling comfortable and grounded is seeing concrete code examples. As I come to understand what that code is doing, I think: “Okay, I see what you’re doing here,” and by extension I come to understand what people mean by the broader category that term fits into. So if you feel uncomfortable with any or even all of these four categories, once we look at some code snippets, I think you’ll feel more grounded.

Okay, so let’s focus on ingestion for a moment. The first two types of ingestion are 1) streaming, and 2) batch. I’ve heard lots of debates about the difference between streaming and batch, and I’m not going to get too academic here. My rough definition is that if you’re ingesting your data in a streaming fashion – like with Snowpipe Streaming, or the Kafka connector – it means that your Snowflake data picks up changes from an external source very quickly. So say you have data coming from Kafka, and that data gets updated. With streaming ingestion, your Snowflake data would reflect this with low latency. We think of streaming as a spectrum.

If you’re ingesting your data in a batch fashion – like with COPY INTO – it means that your data is updating less frequently. So the data in your S3 bucket might get updated, and then have another update, and on some cadence – every hour, every day, etc. – the data will get pulled into Snowflake.

You’ll see Snowpipe listed here under streaming.
An important thing to know here – and something that confused me for a while – is that there are a few different flavors of Snowpipe at Snowflake. There’s the Snowpipe Streaming API, which you can access through the Java SDK. And there’s regular Snowpipe. We’ll get into the details of regular Snowpipe in the next video, but we won’t cover the Snowpipe Streaming API in this course.

Okay, so the next group here is the Snowflake native connectors. Snowflake provides lots of ways to connect from other systems, through [interfaces](https://docs.snowflake.com/en/user-guide/ecosystem-lang) like Python, ODBC, and SQLAlchemy. Snowflake also has native connectors to connect to ServiceNow and to pull raw or aggregated Google Analytics data, for example. If this is confusing to you – you don’t know what ODBC is, say – don’t worry about it. This is not the focus of this course, but I wanted to mention that these exist.

And finally, there’s a whole world of data sharing that Snowflake supports. You can share data from your account in a zero-copy way – meaning the accounts you share with can access that data in a read-only fashion directly from the source – both in the Snowflake Marketplace, and through direct shares.

So here are some quick examples of code snippets to make this concrete. If you want to use regular Snowpipe, you can just use the CREATE PIPE command, and then set up your pipe to copy data from a stage to a table. It’s not rocket science, and you don’t need to be intimidated. It’s nothing that different from what we’ve already covered. Below that I’m showing a COPY INTO command we already ran in this course when we copied the TastyBytes menu data from a stage into the target table. This code here is an example of how you can go to a Python interface, and connect to Snowflake through SQLAlchemy. Once you’re connected, there are ways you can push data from another source to Snowflake using SQLAlchemy.
And down below is an example of creating a database from a share that another account has created.

So, we talked about some of the ways the Snowflake platform lets you do data ingestion, and coming up we’ll cover Transformation, Orchestration, and Observability. Let’s get to it.
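
The snippets described above live on the slide rather than in this transcript. As a rough sketch of what they might look like, the Python below assembles the four pieces mentioned: a COPY INTO statement, a CREATE PIPE statement that wraps it, a CREATE DATABASE ... FROM SHARE statement, and a connection string in the format documented for the snowflake-sqlalchemy dialect. Every object name here (`menu_pipe`, `menu`, `menu_stage`, the account, the share, the credentials) is a hypothetical placeholder, and the helper functions are illustrative, not part of any Snowflake library.

```python
from urllib.parse import quote

# All identifiers passed to these helpers are hypothetical placeholders,
# not names taken from the course slides.


def copy_into_sql(table: str, stage: str, file_type: str = "CSV") -> str:
    """Batch ingestion: load the files sitting in a stage into a table."""
    return (
        f"COPY INTO {table}\n"
        f"  FROM @{stage}\n"
        f"  FILE_FORMAT = (TYPE = '{file_type}')"
    )


def create_pipe_sql(pipe: str, table: str, stage: str) -> str:
    """Regular Snowpipe: a pipe is a named wrapper around a COPY INTO statement."""
    return f"CREATE PIPE {pipe} AS\n{copy_into_sql(table, stage)}"


def create_database_from_share_sql(database: str, provider_account: str, share: str) -> str:
    """Data sharing: the consumer creates a read-only database on top of a share."""
    return f"CREATE DATABASE {database} FROM SHARE {provider_account}.{share}"


def snowflake_sqlalchemy_url(user: str, password: str, account: str,
                             database: str, schema: str, warehouse: str) -> str:
    """Build a snowflake:// connection string; credentials are URL-escaped."""
    return (
        f"snowflake://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{account}/{database}/{schema}?warehouse={warehouse}"
    )


if __name__ == "__main__":
    print(create_pipe_sql("menu_pipe", "menu", "menu_stage"))
    print(create_database_from_share_sql("shared_db", "provider_acct", "menu_share"))
    print(snowflake_sqlalchemy_url("me", "secret", "my_account", "my_db", "public", "my_wh"))
```

With the `sqlalchemy` and `snowflake-sqlalchemy` packages installed, the connection string could be passed to `sqlalchemy.create_engine(...)`, and the SQL strings could be run through that engine or pasted into a worksheet. Building the statements as plain strings keeps the sketch runnable without a Snowflake account.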
