
Transforming data

1. Transforming data

Hi there! Welcome back. In this video, we will go over how to transform data using Databricks SQL and perform some common data engineering tasks.

2. Motivation

In our previous exercise, we learned how to ingest data into the lakehouse architecture using different components of the Databricks SQL product. This allowed us to create our initial Bronze data layer.

3. Transformation in the lakehouse

The next step of our data engineering process is to clean and transform our data. In the context of the medallion architecture, this means taking data from the Bronze, or raw, data layer and preparing it as our Silver, or analytics-ready, data layer.

4. Cleaning and transforming data

This step is where we apply our business knowledge, leveraging a deep understanding of the data to prepare it for analysis. While this phase can include many processes, the most common tasks are cleaning the data by removing NULL values, standardizing entries, and adjusting data types for consistency. In the code example provided, we demonstrate a SQL query that defines a Silver layer table. Within a single query, we handle several key tasks: filtering out NULL values, removing duplicate rows with the DISTINCT keyword, applying formatting or rounding as needed, and enriching the dataset by joining it with another table, as sketched below.
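
The slide's code isn't reproduced in this transcript, so here is a minimal sketch of what such a query might look like; the table and column names (bronze_sales, customers, and so on) are illustrative assumptions, not the exact example from the video:

-- Hypothetical Silver table built from a Bronze table; all names are illustrative.
CREATE OR REPLACE TABLE silver_sales AS
SELECT DISTINCT                               -- remove duplicate rows
    s.order_id,
    CAST(s.order_date AS DATE) AS order_date, -- standardize the data type
    ROUND(s.amount, 2) AS amount,             -- apply consistent rounding
    c.customer_name                           -- enrich by joining another table
FROM bronze_sales AS s
JOIN customers AS c
    ON s.customer_id = c.customer_id
WHERE s.amount IS NOT NULL;                   -- filter out NULL values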

5. Aggregating data

Now that we have clean, analysis-ready datasets in the Silver layer, it’s time to take things a step further. In the medallion architecture, this means moving to the Gold layer, where we aggregate datasets for specific purposes—often tailored for a single BI report. The Gold layer is also a perfect spot to create some views! When building a Gold layer dataset, we’ll usually drop unnecessary columns, aggregate data across key dimensions, and calculate specific metrics as needed. In the code example, we’re taking our Silver layer data and creating a view that aggregates sales revenue for a single fiscal quarter. This makes the data more focused and ready for reporting.
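
Again as a hedged sketch with assumed names, building on the hypothetical silver_sales table above, a Gold layer view like the one described might look as follows:

-- Hypothetical Gold view aggregating revenue for a single fiscal quarter.
CREATE OR REPLACE VIEW gold_q1_revenue AS
SELECT
    customer_name,                 -- keep only the dimension the report needs
    SUM(amount) AS total_revenue   -- calculate the metric
FROM silver_sales
WHERE order_date >= '2024-01-01'
  AND order_date < '2024-04-01'    -- one fiscal quarter (dates are illustrative)
GROUP BY customer_name;

Because this is a view rather than a table, the aggregation stays in sync with the underlying Silver data without duplicating storage.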

6. Automating tasks

All of these tasks are important parts of an overall data pipeline, and they can be fully automated using Databricks Workflows. In these workflows, users can automate many different kinds of tasks, such as running Databricks Notebooks, executing SQL queries, and refreshing dashboards.

7. Let's practice!

Now, let's practice transforming data using Databricks SQL.