
Data orchestration in Databricks

1. Orchestration in Databricks

Welcome back! Previously, you learned how to manage data and perform powerful transformations in the Databricks Lakehouse Platform. Now we will learn how to automate and orchestrate those transformations.

2. What is data orchestration?

Before discussing specifics, we should address the foundational question: "What actually is data orchestration?" Simply put, data orchestration is a way to automate your work! As a data engineer, you must manage various tasks to provide a usable data product for your downstream users or use cases. Data orchestration is a set of processes that automate each of these steps, ranging from ingestion to serving.

3. Databricks Workflows

In the Databricks platform, data engineers can orchestrate their data using a set of tools grouped into an offering called Databricks Workflows, which lets you automate every task you can achieve in Databricks with built-in capabilities at no additional cost. Here is a diagram showing, at a high level, a potential pipeline you could orchestrate in Databricks Workflows. This pipeline reads in two different datasets (one batch, one streaming) and uses Delta Live Tables to process and eventually join them. From there, different data personas could perform further analytics work, such as ML model training and dashboard serving.

4. What can we orchestrate?

Databricks Workflows is designed to orchestrate and automate nearly everything you can do with the Databricks platform, so what exactly can we include in our Workflows? If you are a data engineer or data scientist, you can automate all of your programmatic work, such as Databricks notebooks or Delta Live Tables pipelines. Users can also include external .jar files, Spark submit jobs, or Java applications. Databricks also allows you to include external tools in Workflows; for example, dbt can be used to build and test SQL transformations. As a data analyst, you can include any queries, dashboards, or alerts as part of a Workflow.
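
To make this concrete, here is a minimal, hedged sketch of how a few of these task types might sit side by side in one job definition. The names, paths, and pipeline ID below are placeholders, not values from the video.

```python
# Hypothetical job definition mixing several task types that Workflows can
# orchestrate. All names, paths, and IDs below are placeholders.
example_job = {
    "name": "daily_orders_pipeline",
    "tasks": [
        {
            "task_key": "ingest_orders",
            "notebook_task": {"notebook_path": "/Repos/data-eng/ingest_orders"},
        },
        {
            "task_key": "transform_orders",
            "depends_on": [{"task_key": "ingest_orders"}],
            # A Delta Live Tables pipeline, referenced by its pipeline ID
            "pipeline_task": {"pipeline_id": "<your-dlt-pipeline-id>"},
        },
        {
            "task_key": "dbt_tests",
            "depends_on": [{"task_key": "transform_orders"}],
            # A dbt task running build and test commands
            "dbt_task": {"commands": ["dbt run", "dbt test"]},
        },
    ],
}
```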

5. Databricks Jobs

Databricks Jobs are the core of Workflows and define a particular step in the overall Workflow pipeline. For example, a single Job could join order and customer datasets together. In Databricks, there are multiple ways to create a Job. In the UI, users can create a Job quickly, no matter where they are: either from the context of a notebook or from the Workflows area of the UI, where Workflows and Jobs can be created and managed together. Here is a screenshot of the pop-up box for creating a Job directly from the context of a notebook.

6. Databricks Jobs

Databricks also provides programmatic approaches to orchestration through a command-line tool or REST API. These are useful when you already have a series of scripts that perform tasks across your data ecosystem. Here is a high-level example of the structure of the Jobs REST API. In this code, a user could provide a variety of configurations, such as the tasks and the kind of cluster, to ensure the job runs correctly.
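
As a rough sketch of that structure (the workspace URL, access token, notebook path, and cluster settings below are placeholders), a call to create a job through the Jobs REST API might look like this:

```python
import requests

# Placeholders: substitute your own workspace URL and personal access token.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<your-personal-access-token>"

job_config = {
    "name": "join_orders_and_customers",
    "tasks": [
        {
            "task_key": "join_datasets",
            "notebook_task": {"notebook_path": "/Repos/data-eng/join_datasets"},
            # The kind of cluster the task should run on
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}

# Create the job through the Jobs REST API (version 2.1)
response = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_config,
)
print(response.json())  # A successful response includes the new job_id
```

The returned job_id can then be used with related endpoints, such as /api/2.1/jobs/run-now, to trigger the job programmatically.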

7. Delta Live Tables

With Delta Live Tables, data engineers can declare what the resulting dataset must look like, such as schema and data formats. Databricks will create the underlying data pipeline for the user. This has many benefits:
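
For instance, a declarative table definition inside a Delta Live Tables pipeline might look like the sketch below; the table and column names are purely illustrative.

```python
import dlt
from pyspark.sql import functions as F

# Declare what the resulting dataset should look like; Databricks builds and
# manages the underlying pipeline. Table and column names are illustrative.
@dlt.table(
    name="orders_clean",
    comment="Orders with standardized column types",
)
def orders_clean():
    return (
        dlt.read("orders_raw")
        .select(
            F.col("order_id").cast("long"),
            F.col("customer_id").cast("long"),
            F.col("order_ts").cast("timestamp"),
            F.col("amount").cast("double"),
        )
    )
```

Because the function only declares the result, Databricks can handle dependency resolution, cluster management, and execution when the pipeline runs.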

8. Delta Live Tables

Firstly, with this approach, creating new data pipelines is quick and easy, especially since the same API is used for both batch and streaming datasets.
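
As a small illustration of that point, the sketch below declares one table from a batch source and one from a streaming source using the same decorator; the source table names are made up.

```python
import dlt

# Batch source: read a complete snapshot of the upstream table
@dlt.table(name="customers_batch")
def customers_batch():
    return dlt.read("customers_raw")

# Streaming source: same decorator, the source is simply read as a stream
@dlt.table(name="orders_streaming")
def orders_streaming():
    return dlt.read_stream("orders_raw")
```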

9. Delta Live Tables

Secondly, it is also easy to maintain data quality through declared schemas and expectations, and you can refresh individual portions of these pipelines directly.
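
As a hedged example, expectations can be attached directly to a table declaration like this; the constraint names and conditions are invented for illustration.

```python
import dlt

# Expectations declare data quality rules; depending on the decorator used,
# violating rows can be recorded in metrics, dropped, or fail the update.
@dlt.table(name="orders_validated")
@dlt.expect("valid_amount", "amount >= 0")                      # record violations
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")   # drop bad rows
def orders_validated():
    return dlt.read("orders_clean")
```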

10. Let's practice!

This video provided a well-rounded introduction to Databricks Workflows. Let's jump back into the platform and practice orchestrating our data!