1. SCD2 with dbt snapshots
Welcome back! Let's cover a more advanced topic in dbt - snapshots.
2. What is a snapshot?
In general, a snapshot is a look into the changes of a dataset over time. Put another way, this illustrates the various states of an object and the times those states were valid.
Consider something like a fast-food order status, the state of production for a car, or the shipping status for a book purchased online.
3. SCD2
Multiple methods exist to monitor data changes/updates/etc within a data warehouse. One primary method is using what's known as a slowly changing dimension, specifically type 2. This is often abbreviated as SCD2, and is a common component of a Kimball-style data warehouse. You can refer to DataCamp's Introduction to Data Warehousing for more information.
Primarily, SCD2 is a standard method to track changes over time. Snapshots are used in dbt to implement SCD2.
4. SCD2 example
Let's consider an example: order status, where we have three available states: Received, Packed, and Shipped. Normally, if we only update a given row, we lose the additional information about the states in between. If we show rows with a state of Packed, we're unsure how long it takes to complete the packing process once received. With Shipped alone, we can no longer determine how long that took place.
Here we see a simple row showing the order id, its order status, and the last time it was updated.
We'd love to see the following to easily identify when a row had a specific state without requiring changes to our application.
5. SCD2 in dbt
dbt can implement SCD2 using snapshots. It can automatically track our changes and add extra columns to our output. These columns are dbt_valid_from and dbt_valid_to, which track the datetimes of when the individual values for a row are valid.
From our earlier example, you can see an updated version of our order_status information showing the state for Received, Packed, and Shipped along with the datetimes the rows were current. This is the most recent row if you see a null value in dbt_valid_to.
6. Implementing dbt snapshots
To implement a snapshot in dbt, we must define another file representing the data we want to snap. This is contained in a SQL file in the snapshots directory named snapshotname.sql, such as snapshot_orders.sql.
The actual file content has several parts. First, we start with a Jinja snapshot command and the snapshot's name.
The file ends with the endsnapshot directive.
We use a Jinja config function within the directives, taking several arguments. The first is the target schema. This should be a separate schema from your default one, as snapshots can break existing tools due to the additional columns present.
Next is the strategy used. Usually, this is timestamp, meaning that dbt should check the timestamp to determine validity. There is another strategy, check_cols, that we won't cover here. Basically, it checks a set of columns to determine what changed since the last snapshot.
We then add a unique_key entry, the column's name representing the row, usually a primary key.
The last argument is the updated_at entry, which takes the column's name, showing the datatime of the change.
After closing the config entry, we need to add our SQL command to query against. This is often a dbt source, as a snapshot should be fairly early in the data process. Therefore, we can query for select * from our raw orders source. Note that we could use ref as well to access a model if preferred.
There are other options available for snapshots and you can refer to the dbt documentation for an exhaustive list.
7. dbt snapshot
To create the snapshots, we need to use the dbt snapshot subcommand. Then, we can create or update a model to query against the snapshot_orders table using the ref Jinja command.
Finally, we must run dbt snapshot often to see and maintain the potentially changed data. If we run the command infrequently, we may miss the changes in the data, so it's best to schedule the command to run automatically.
8. Let's practice!
We've covered a considerable amount here, let's practice what we've learned!