Creating a dbt project

1. Creating a dbt project

Welcome back! We've learned a little about dbt as a command-line application, but let's now discuss dbt projects and how they're used.

2. What is a dbt project

So what is a project in the context of dbt? A project encompasses all the components for working with data within dbt. It is a structured collection of files that define how to transform and organize your data. This includes the project configuration, such as the project name and folder names. It also includes data sources and destinations, meaning where the source data comes from and which data warehouses it is written to. dbt projects also include the SQL queries and templates that define how to access and transform the data into the desired formats. A project can also include documentation for the data and the relationships within it. A project is implemented as a folder structure, so it is easily copied, modified, or placed into source control as needed. We'll cover each of these further in this course.
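To make the project configuration a little more concrete, here is a rough sketch of what a dbt_project.yml file at the top of the project folder might contain. The project name test_project, the version number, and the folder names are illustrative assumptions, not values taken from this course, and a real file may hold more settings.

```yaml
# dbt_project.yml (illustrative sketch; all values are assumptions)
name: 'test_project'        # the project name
version: '1.0.0'
config-version: 2

# Name of the profile in profiles.yml that holds the connection settings
profile: 'test_project'

# Folder names dbt looks in for each kind of file
model-paths: ["models"]     # SQL queries and templates
seed-paths: ["seeds"]       # CSV files to load as tables
test-paths: ["tests"]
macro-paths: ["macros"]
```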

3. How to create a new project

To create a project within dbt, we use the dbt init subcommand. When run on its own, dbt init asks two questions: the project name and which database or data warehouse type you'd like to use. If you run dbt init projectname instead, it only asks for the database type. dbt init creates the top-level project folder, subfolders, and configuration files for the project. Here is an example of running dbt init with a project name of test_project and duckdb as our database type.

4. Defining configuration with project profiles

The next thing to understand about dbt projects is the profile. Within dbt, a profile describes how the project connects to a warehouse in each deployment scenario, such as development, staging or testing, and production. A profile can define several of these scenarios, which dbt calls targets, allowing a different warehouse configuration for each one. These configurations are defined in the profiles.yml file, which must be created for new projects. This is an example profiles.yml file with two deployment types (dev and prod). The target option defines the default, in this case dev. You may also wonder why you would select DuckDB versus Snowflake. DuckDB is useful for developing and testing locally, while Snowflake is a better fit for production, as other users will likely need to access the data.
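As a sketch of the kind of file described above, a profiles.yml with a DuckDB dev target and a Snowflake prod target could look roughly like this. The profile name test_project, the file path, and every Snowflake value are placeholders assumed for illustration, and the exact set of fields depends on the adapter version you have installed.

```yaml
# profiles.yml (illustrative sketch; names and values are placeholders)
test_project:               # profile name, referenced by the project
  target: dev               # the default deployment type
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb      # hypothetical local database file
    prod:
      type: snowflake
      account: my_account   # hypothetical account identifier
      user: my_user         # hypothetical user name
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"  # read from an environment variable
      database: analytics   # hypothetical database name
      warehouse: transforming  # hypothetical warehouse name
      schema: public
```

With a file like this, dbt would use the dev settings unless a different target is requested when a command is run.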

5. YAML

You may be wondering what YAML is. YAML originally stood for Yet Another Markup Language, and today it officially stands for YAML Ain't Markup Language. It is a text-based file format where whitespace indentation matters, much like Python. YAML is used in many development scenarios for configuration because it is relatively human-readable. Writing or modifying YAML can be tricky, as you must maintain indentation as illustrated. In the profiles.yml example, dev: and prod: are at the same level of indentation. A YAML skeleton will be provided in this course, but be aware of the formatting requirements when creating one from scratch.
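To see how indentation carries meaning, consider this small fragment. It is only a sketch that assumes two-space indentation, a common convention; YAML needs the indentation to be consistent and made of spaces, since tabs are not allowed for indentation.

```yaml
outputs:
  dev:                # indented two spaces: a child of outputs
    type: duckdb      # indented four spaces: a child of dev
  prod:               # same indentation as dev, so a sibling of dev
    type: snowflake   # a child of prod
```

If prod: were indented four spaces instead of two, it would be read as part of dev rather than as a second entry alongside it.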

6. DuckDB

A word about DuckDB: DuckDB is an open-source, serverless database, like SQLite. No server process is required, unlike PostgreSQL or MySQL. It is designed for analytics, i.e., data warehouse style workloads, and is fast due to its vectorized query engine. We're using DuckDB in this course because it's easy to use and works with dbt. We won't cover specifics about DuckDB, but you can access DuckDB easily on your computer or in a DataCamp Workbook.

7. Let's Practice!

Let's take what we've learned and create our first dbt project in the exercises ahead.