Streams and Change Data Capture

1. Streams and Change Data Capture

In this chapter we move from data ingestion to orchestration - how you build pipelines that process data reliably after it lands. The first challenge is fundamental: how do you know what changed? Harbr's shipments table is updated continuously throughout the day. A downstream reporting table needs to reflect those changes, but reprocessing everything on every run is expensive and slow. That's the problem Change Data Capture solves, and streams are how Snowflake implements it.

2. Processing Only What's Changed

Picture Harbr's shipments table with ten million rows. Without Change Data Capture, every pipeline run reprocesses all ten million - slow and costly. The alternative is tracking changes with timestamps, but that's fragile and easy to miss things. With CDC, instead of asking what's in the table, you ask what changed since you last looked. Snowflake streams are built exactly for this.

3. Streams

A stream tracks every insert, update, and delete made to a table since it was last consumed. It doesn't duplicate the data — it maintains a running change log. When Harbr's pipeline reads the stream, it sees exactly which rows changed since the last run. Once consumed, the stream's offset advances and the next read starts fresh.

4. Stream Types

Snowflake has three stream types. Standard streams capture all inserts, updates, and deletes - the right choice for tables like Harbr's shipments where any row can change. Append-only streams track only new rows, ignoring updates and deletes. For Harbr's delivery events table - records written once and never modified - this is more efficient. Insert-only streams are designed for external tables and directory tables. A directory table surfaces metadata about files in a stage - every file's name, size, and last modified timestamp. Since files are added but not updated in place, insert-only is the right fit.

5. Creating a Stream

Creating a stream is straightforward. Use CREATE STREAM, give it a name, and specify ON TABLE. For an append-only stream, add APPEND_ONLY equals TRUE. From that point the stream tracks changes and can be queried just like a regular table - but what you get back is the change log, not the current state.

6. Stream Metadata Columns

When you query a stream, three metadata columns appear alongside the table's own columns. METADATA ACTION tells you whether the row was inserted or deleted. METADATA ISUPDATE flags whether this is part of an update. METADATA ROW_ID is a unique identifier for the physical row, but when you update it the ROW_ID stays the same. Importantly, Snowflake represents updates as a DELETE of the old row plus an INSERT of the new one - both with METADATA ISUPDATE set to true. Understanding this pattern is essential for writing correct downstream logic.

7. The Stream Offset

Every stream has an offset - a marker tracking how far through the change log it has read. When a stream is created, the offset starts at now. As changes happen, the stream accumulates those records. The offset only advances when the stream is consumed in a DML transaction - an INSERT, UPDATE, or DELETE that uses the stream's data to modify another object. A Create Table As Select, or CTAS, counts too. Simply querying the stream doesn't move the offset. Take Harbr's shipping stream. When the pipeline inserts new shipping records into the target table using the stream's data, that INSERT is the DML transaction that advances the offset. If the pipeline fails mid-transaction, the transaction rolls back and the offset stays put, meaning the same shipping changes are waiting on the next run, with nothing lost.

8. Streams in a Pipeline Overview

Streams are rarely standalone. In practice they're paired with a task - a Snowflake object that runs SQL on a schedule. The task fires, checks whether the stream has data, and applies only the changed rows to the reporting table. With ten million rows in shipments, processing only what changed is the difference between a two-second job and a two-minute one.

9. Streams in a Pipeline Query

Streams become even more powerful when paired with tasks - Snowflake objects that run SQL on a schedule. Here, Harbr creates a task called sync shipments. It runs every five minutes, but only when the stream actually has data - that's what the SYSTEM$STREAM_HAS_DATA check does. No new shipments, no compute used. We'll cover more on tasks in the next video.

10. Let's practice!

You've covered how streams track changes to Snowflake tables, the three metadata columns, the difference between stream types including directory tables, and how the stream offset advances. Let's test your knowledge!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.