Production pipelines with workflows

1. Production pipelines with workflows

We've built transformations across this course: cleaning, aggregating, and streaming data. Now let's make them production-ready.

2. Why Delta Lake?

Everything we've built so far lives in memory. It disappears when the cluster shuts down. Delta Lake is the storage layer that makes our data permanent. It gives us three guarantees that flat files can't: ACID transactions roll back failed writes, schema enforcement blocks mismatched data types, and every change is versioned.

3. Writing to Delta

Here's how we persist a DataFrame. saveAsTable creates a new managed Delta table in Unity Catalog, and it appears right away in the catalog browser. Thirty-three thousand clean rows, safely stored and queryable from any notebook in the workspace.
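The write described above can be sketched as follows. This is a minimal, Databricks-only sketch: the DataFrame name `df_clean` and the catalog and schema names are placeholder assumptions, not the exact course code.

```python
# Sketch (runs only on Databricks): persist a cleaned DataFrame as a
# managed Delta table in Unity Catalog. `df_clean` and the
# catalog.schema.table name are illustrative assumptions.
(
    df_clean.write
    .format("delta")      # Delta is the default on Databricks; shown for clarity
    .mode("overwrite")    # replace the table contents on re-runs
    .saveAsTable("main.retail.transactions_clean")
)
```

Because the table is managed and registered in Unity Catalog, any notebook in the workspace can read it back with `spark.read.table("main.retail.transactions_clean")`.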

4. Notebook tasks

Now let's chain multiple notebooks into a pipeline. Our first notebook, task1_ingest, loads the raw CSV from a Volume, applies our cleaning steps, and writes the result as a Delta table called transactions_clean.
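A sketch of what task1_ingest might contain, assuming a hypothetical Volume path and illustrative cleaning steps (the exact rules aren't shown in the narration):

```python
# Sketch of task1_ingest (Databricks-only). The Volume path, column names,
# and cleaning steps are illustrative assumptions.
from pyspark.sql import functions as F

# Load the raw CSV from a Unity Catalog Volume
raw = spark.read.csv(
    "/Volumes/main/retail/raw/transactions.csv",
    header=True,
    inferSchema=True,
)

# Apply cleaning: drop incomplete rows and keep only valid amounts
clean = (
    raw.dropna(subset=["Customer_ID", "Amount"])
       .filter(F.col("Amount") > 0)
)

# Persist as a managed Delta table for the downstream tasks
clean.write.mode("overwrite").saveAsTable("transactions_clean")
```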

5. Notebook tasks

task2_metrics reads the clean table, groups transactions by Category, and computes revenue and count for each one. The output goes to a new table called category_metrics.
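The aggregation in task2_metrics could look like this; column and output names follow the narration, the rest is an assumption:

```python
# Sketch of task2_metrics (Databricks-only): revenue and transaction count
# per category, written to a new Delta table.
from pyspark.sql import functions as F

clean = spark.read.table("transactions_clean")

metrics = (
    clean.groupBy("Category")
         .agg(
             F.sum("Amount").alias("revenue"),            # total revenue per category
             F.count("*").alias("transaction_count"),     # number of transactions
         )
)

metrics.write.mode("overwrite").saveAsTable("category_metrics")
```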

6. Notebook tasks

task3_customers reads the same clean table but answers a different question: which customers spent the most? It orders them by total spend and saves the leaderboard to a top_customers table.
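A sketch of task3_customers, shown here with the correct Customer_ID column (the underscore spelling that the first job run gets wrong); the Amount column is an assumption:

```python
# Sketch of task3_customers (Databricks-only): rank customers by total spend.
# Note the column is Customer_ID with an underscore, not customer-id.
from pyspark.sql import functions as F

clean = spark.read.table("transactions_clean")

top = (
    clean.groupBy("Customer_ID")
         .agg(F.sum("Amount").alias("total_spend"))
         .orderBy(F.desc("total_spend"))   # highest spenders first
)

top.write.mode("overwrite").saveAsTable("top_customers")
```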

7. Creating the job

To run our notebooks as a pipeline, we open Jobs and Pipelines and create a new job. We add each notebook as a task and set the dependencies: task2 depends on task1, and task3 depends on task2. Beyond manual runs, we can schedule jobs, trigger them on file arrival, and configure alerts. Let's run it manually first.
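The same job can be expressed as a Jobs API payload. This is a hedged config sketch: the job name and notebook paths are placeholders, but the `task_key`/`depends_on` structure is how the Jobs API wires dependencies.

```json
{
  "name": "retail_pipeline",
  "tasks": [
    {
      "task_key": "task1_ingest",
      "notebook_task": { "notebook_path": "/Workspace/pipelines/task1_ingest" }
    },
    {
      "task_key": "task2_metrics",
      "depends_on": [ { "task_key": "task1_ingest" } ],
      "notebook_task": { "notebook_path": "/Workspace/pipelines/task2_metrics" }
    },
    {
      "task_key": "task3_customers",
      "depends_on": [ { "task_key": "task2_metrics" } ],
      "notebook_task": { "notebook_path": "/Workspace/pipelines/task3_customers" }
    }
  ]
}
```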

8. Running the job

After clicking Run, we watch the DAG. task1_ingest goes green. task2_metrics goes green. But task3_customers turns red.

9. Running the job

We click into the failed task and see the error: the code references customer-id with a hyphen, but the actual column is Customer_ID with an underscore.

10. Running the job

We open the notebook, fix the column name, and click Rerun. This time, all three tasks turn green. The Timeline view shows each task running in sequence: twenty seconds for ingest, about twelve seconds each for metrics and customers.

11. What is Lakeflow?

Jobs are great for chaining notebooks, but we still manage each step ourselves: writing each task, wiring each dependency. Lakeflow takes a different approach. We declare what each table should look like, and Databricks figures out the execution order, handles retries, and manages compute automatically. We describe the what, Databricks handles the how.

12. The @dlt.table pattern

Here's the pattern. Each function decorated with @dlt.table defines one table. Bronze loads the raw CSV with no transformations. Silver calls dlt.read on transactions_bronze, which tells Databricks to build bronze first. It then cleans the data. Gold reads from silver and aggregates revenue by category. Three functions, three layers.
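The three functions described above might look like this. A sketch only (it runs inside a Databricks pipeline, not a plain notebook); the file path, column names, and cleaning rules are assumptions:

```python
# Sketch of the three-layer declarative script (Databricks pipelines only).
import dlt
from pyspark.sql import functions as F

@dlt.table
def transactions_bronze():
    # Bronze: raw CSV, no transformations
    return spark.read.csv(
        "/Volumes/main/retail/raw/transactions.csv",
        header=True, inferSchema=True,
    )

@dlt.table
def transactions_silver():
    # dlt.read() declares the dependency, so bronze is built first
    return (
        dlt.read("transactions_bronze")
           .dropna(subset=["Customer_ID", "Amount"])
           .filter(F.col("Amount") > 0)
    )

@dlt.table
def category_gold():
    # Gold: revenue aggregated by category
    return (
        dlt.read("transactions_silver")
           .groupBy("Category")
           .agg(F.sum("Amount").alias("revenue"))
    )
```

No scheduling or dependency wiring appears anywhere in the script; the table-to-table reads are the dependency graph.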

13. Pipeline run

We create a new pipeline, paste our script, and Databricks runs each table in order: bronze, then silver, then gold. Each one is created as a materialized view in Unity Catalog. A hundred thousand raw rows flow in, thirty-three thousand survive cleaning, and six category summaries come out at the end.

14. Notebooks, Jobs, or Lakeflow?

So when do we use each? Run a notebook directly when exploring or prototyping. It's quick and interactive. Use a Databricks Job when you need multi-step pipelines with scheduling and dependency management. And choose Lakeflow for fully managed, declarative pipelines that refresh automatically.

15. Let's practice!

Now it's your turn to build production pipelines!
