What is data ingestion?

"80% of the battle is going to be getting all of the data from all of the different sources into one platform." These are wise words from my colleague, Jeremiah Hansen, and I think they're just the right way to kick off our exploration of the first phase of our data engineering framework: data ingestion.

In the context of building data pipelines, ingestion refers to the gathering, collecting, or loading of raw data, often into a central platform. In this course, Snowflake will be that central platform. Out of the three phases in our ITD data engineering framework, I find ingestion to be the most interesting, mainly because the approaches to ingesting data can vary so widely.

The reason these approaches vary so much is that there are some pretty big challenges associated with ingesting data. First, scale: how much data will need to be ingested? Second, frequency: at what rate does the data need to be ingested? Is your use case satisfied with, say, a daily ingestion routine, or will you need to ingest data on a real-time basis? Third, sources: where is the data coming from, and how many sources are there? My colleague Jeremiah has some more wise words on this topic: "Data sources are a very strong dictator of approach to ingestion." And finally, there are challenges around data formats. What format will the data be in? Will all of it be in the same format, or will different sources produce data in different formats? Or maybe in different shapes, like columnar versus document-based? And those are just a sample of the challenges you might encounter as you think about how to ingest data with your pipeline.

The great news is that Snowflake is excellent at addressing challenges like these. Snowflake can ingest data at massive scales; I'm talking petabytes and even larger. It also excels at ingesting data at different frequencies: it's great for batch loading of data and equally great for near real-time ingestion of data.
Snowflake also plays really well with lots of different data sources, like cloud object storage or Kafka architectures. And finally, Snowflake can ingest all sorts of data formats (CSV, JSON, and Parquet, just to name a few) and all sorts of compression formats as well. What's also neat is that Snowflake can ingest structured and semi-structured data directly. And with newer features, it can actually extract data from unstructured formats, like PDF documents.

In any case, keep in mind that your approach to data ingestion is going to depend on your specific use case, and that it'll likely be shaped by things like the number of data sources, the scale of the data, latency requirements, and more.

Finally, it's important to note that in this course, ingestion of raw data refers to a one-to-one copy of the source data. That is, we won't alter or change raw data during the ingestion process. Data loaded into Snowflake will simply represent a direct copy of the source data.

Okay, let's learn a little more about how to perform ingestion using Snowflake.
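To make the idea of a one-to-one load a bit more concrete, here is a minimal Python sketch that only builds the text of a Snowflake `COPY INTO` statement for the three formats mentioned above. It does not connect to Snowflake or run anything. The table names, the stage name `@raw_stage`, and the helper `build_copy_statement` are all hypothetical, and the target tables are assumed to already exist with a shape that fits the files (for example, a single VARIANT column for JSON or Parquet). The statement contains no transformation, so the loaded rows would mirror the staged files.

```python
# Formats named in the lesson; Snowflake's COPY INTO accepts others as well.
SUPPORTED_FORMATS = {"CSV", "JSON", "PARQUET"}


def build_copy_statement(table: str, stage_path: str, file_format: str) -> str:
    """Return a COPY INTO statement that loads staged files without transforming them.

    `table` and `stage_path` are hypothetical names supplied by the caller.
    """
    fmt = file_format.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format for this sketch: {file_format}")
    return f"COPY INTO {table} FROM {stage_path} FILE_FORMAT = (TYPE = {fmt})"


if __name__ == "__main__":
    # Hypothetical tables and stage paths, one per source format.
    examples = [
        ("raw_orders", "@raw_stage/orders/", "csv"),
        ("raw_events", "@raw_stage/events/", "json"),
        ("raw_sensors", "@raw_stage/sensors/", "parquet"),
    ]
    for table, path, fmt in examples:
        print(build_copy_statement(table, path, fmt))
```

In a real pipeline the statement would be executed through a Snowflake session, and the right options would depend on the sources, scale, and latency needs discussed above.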
