
Common data engineering patterns

1. Common data engineering patterns

Welcome back! In this lesson, we will cover some more advanced data engineering topics in Databricks SQL.

2. Motivation

In previous lessons, we talked about the importance of transforming our data into the right format so that other data teams can analyze it. The examples we went through were fairly basic; in reality, data engineering pipelines can become quite complex depending on how the source data system behaves.

3. Handling incoming data

When handling incoming data, there are two main approaches you can take. The first is to incrementally append all of the data that comes in. In this strategy, we simply add all incoming data to the end of our existing tables: we assume that every incoming row is new and do not account for the possibility that existing rows may have changed. In Databricks SQL, we use the INSERT syntax to accomplish this. The other strategy is to implement Change Data Capture, or CDC. With this approach, we integrate the new data into our existing datasets: we append rows that are genuinely new and update existing rows with the new information. In Databricks SQL, we use the MERGE syntax to accomplish this, as sketched below.
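As a rough sketch, here is what the two approaches can look like in Databricks SQL. The table names sales_silver and sales_updates, and the sale_id key column, are hypothetical placeholders for your own tables.

-- Incremental append: treat every incoming row as new and add it to the table
INSERT INTO sales_silver
SELECT * FROM sales_updates;

-- Change Data Capture: update rows that already exist, insert the ones that do not
MERGE INTO sales_silver AS target
USING sales_updates AS source
ON target.sale_id = source.sale_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;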

4. Data optimizations

Another way we can improve our data pipelines is by optimizing the underlying data storage. While storing data in the Delta format already addresses many potential problems with data formats, there are several techniques we can apply to make our data more performant to analyze. The first, and most straightforward, is the OPTIMIZE command. Essentially, OPTIMIZE takes many small data files and compacts them into a smaller number of larger files. This mitigates what is known as the small file problem, which can significantly slow down data processing. Secondly, we can apply a Z-Order. Z-Ordering is similar to indexing in traditional database systems: it co-locates related data in the same files, which speeds up queries that filter on those columns. The code examples below show OPTIMIZE applied to an entire table in the first statement, to a subset of the data in the second statement, and combined with ZORDER in the third statement.
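As a sketch of those three statements, assuming a hypothetical Delta table called sales_silver that is partitioned by event_date and frequently filtered on event_id:

-- 1. Compact small files across the entire table
OPTIMIZE sales_silver;

-- 2. Compact only a subset of the data
--    (the WHERE clause must reference partition columns)
OPTIMIZE sales_silver WHERE event_date >= '2024-01-01';

-- 3. Compact and co-locate related data with a Z-Order on event_id
OPTIMIZE sales_silver ZORDER BY (event_id);

Z-Ordering tends to pay off on columns you filter by often, such as an ID or timestamp column.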

5. Let's practice!

Now, let's practice improving our data pipelines using Databricks SQL!