1. Setting up the dbt project and loading data
Welcome back! It's time to set up the dbt project and load in the data.
2. Review: dbt setup and initialization
At this point, we have done the following in our IDE exercises:
Installed dbt.
Initialized a dbt project called `looker_ecommerce`. dbt does a lot of the work for us by auto-generating the file directories inside the project.
Lastly, we verified that there are no issues with our dbt setup using `dbt debug` (the full command sequence is recapped below).
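In terminal form, those steps look something like the following. This is a sketch, assuming we installed the DuckDB adapter with pip:

```bash
# Install dbt together with the DuckDB adapter
pip install dbt-duckdb

# Scaffold a new project called looker_ecommerce
dbt init looker_ecommerce

# Check that the installation, project files, and database connection are healthy
cd looker_ecommerce
dbt debug
```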
3. Getting familiar with the data: distribution centers
With the setup done, it's now time for data loading, which means we need to make some decisions about how to load the data.
Our company is an e-commerce business and requires distribution centers to store our goods before shipping. We have 10 such centers, which is why the distribution centers file has only 10 rows. The file contains four columns: ID, name, and location as latitude and longitude.
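To make this concrete, here is an illustrative sketch of the file's shape; the values below are made up, and only the header layout matters:

```
id,name,latitude,longitude
1,Memphis TN,35.1174,-89.9711
2,Chicago IL,41.8369,-87.6847
```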
4. Getting familiar with the data: orders
By contrast, the orders file is large and constantly updating, because an e-commerce business generates a steady stream of new orders.
The orders file contains 125,000 rows and 9 columns.
The columns contain data related to shipping and handling, including order and user IDs, status, timestamps, and the number of items.
5. Getting familiar with the data: orders
Here is a preview of the orders data file.
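The preview itself appears on the slide. For this transcript, a plausible header row, consistent with the nine columns described above, might look like this (the column names are assumptions):

```
order_id,user_id,status,gender,created_at,returned_at,shipped_at,delivered_at,num_of_item
```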
6. Setting up raw source and seed data sources
As dbt owners, we decide whether to use `dbt source` or `dbt seed` to load our raw data, based on the nature of each dataset.
Because the distribution center data is small and will rarely change, we load it with `dbt seed` as a one-time file.
Because the order data is large and rapidly changing, we load this with `dbt source`. `dbt source` connects dbt models to raw data in a live database.
As a reminder, we are using DuckDB as our database.
7. Setting up raw source and seed data sources
Okay, we've made our decisions on how to load our data, but how do we implement them?
When we ran `dbt init`, it auto-generated a bare-bones structure, like so.
The directory structure is empty. We will populate it with SQL files and yaml files. The folder structure will help us keep track of the data flow.
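As a sketch, the scaffold from `dbt init` looks roughly like this (recent dbt versions also generate an example model folder inside `models`, which we can delete):

```
looker_ecommerce/
├── dbt_project.yml
├── analyses/
├── macros/
├── models/
├── seeds/
├── snapshots/
└── tests/
```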
8. Setting up raw source and seed data sources
For `distribution_centers`, move the CSV into the `seeds` directory and create the model under `models`, referencing the dbt seed.
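As a minimal sketch, assuming the seed file is named `distribution_centers.csv` and the model lives in a `staging` subfolder, the model references the seed by name with `ref`:

```sql
-- models/staging/stg_distribution_centers.sql
-- {{ ref(...) }} resolves to the table that dbt seed created from the CSV
select
    id,
    name,
    latitude,
    longitude
from {{ ref('distribution_centers') }}
```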
9. Setting up raw source and seed data sources
For `orders`, create the staging SQL model, referencing the dbt source.
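A minimal sketch of that model, assuming the source is declared under the name `raw` with an `orders` table (the declaration itself comes on the next slide):

```sql
-- models/staging/stg_orders.sql
-- {{ source(...) }} resolves to the raw orders table in DuckDB
select *
from {{ source('raw', 'orders') }}
```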
10. Documenting sources and staging models
To document dbt sources like `orders`, use a source-specific yaml file, like so.
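Here is a sketch of such a file; the source name `raw` and the file name are assumptions:

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: raw
    description: Raw e-commerce data living in our DuckDB database
    tables:
      - name: orders
        description: One row per customer order
```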
11. Documenting sources and staging models
To document dbt models, use a model-specific yaml file under the same `models` directory.
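A sketch of that file, with names assumed to match the staging models above:

```yaml
# models/staging/stg_models.yml
version: 2

models:
  - name: stg_orders
    description: Staging model for the raw orders source
    columns:
      - name: order_id
        description: Unique identifier for each order
  - name: stg_distribution_centers
    description: Staging model built on the distribution_centers seed
```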
12. Sources, seeds, models, and yaml
Altogether, our repository now looks like this.
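One plausible layout, with file names carried over from the sketches above (your exact names may differ):

```
looker_ecommerce/
├── dbt_project.yml
├── seeds/
│   └── distribution_centers.csv
└── models/
    └── staging/
        ├── stg_distribution_centers.sql
        ├── stg_orders.sql
        ├── sources.yml
        └── stg_models.yml
```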
13. Review: dbt subcommands
There are seven files in total, and we will create and load all of them in this chapter. Here is a recap of the dbt subcommands we will be using.
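In command form, these are all standard dbt subcommands:

```bash
dbt debug   # check the project configuration and database connection
dbt seed    # load CSV files from the seeds directory into the database
dbt run     # compile and execute the SQL models
```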
14. Review: best practice guides
Lastly, we will adhere to dbt's style guides. Here are some naming convention guidelines.
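For example, dbt's style guide favors snake_case names and role-based prefixes, along these lines:

```
stg_orders                  -- staging models get the stg_ prefix
stg_distribution_centers    -- one staging model per raw object
```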
15. Let's practice!
Let's dive right in!