1. Data Engineering foundations in Databricks
Hello, and welcome back! In this video, we will discuss some of the foundational aspects of data engineering in Databricks.
2. Medallion architecture
At the core of the lakehouse architecture is the idea of the Medallion architecture. This logical design lets us think about our data in terms of the stages it passes through. Data starts in Bronze tables, which hold our raw ingested data. Once data has been cleaned, transformed, and joined together, we have our Silver tables, which are ready for most analytics applications. Data usually ends in Gold tables, which are aggregated for BI reports or dashboard use cases.
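To make that flow concrete, here is a minimal PySpark sketch of data moving through the three layers; the paths, table names, and columns (raw_orders, order_id, order_total, and so on) are hypothetical, and `spark` is the SparkSession Databricks notebooks provide automatically.

from pyspark.sql import functions as F

# Bronze: raw ingested data, landed as-is
bronze = spark.read.format("json").load("/mnt/landing/raw_orders/")
bronze.write.format("delta").mode("append").saveAsTable("sales.orders_bronze")

# Silver: cleaned, deduplicated, and ready for most analytics
silver = (
    spark.table("sales.orders_bronze")
    .dropDuplicates(["order_id"])
    .filter(F.col("order_total") > 0)
)
silver.write.format("delta").mode("overwrite").saveAsTable("sales.orders_silver")

# Gold: aggregated for BI reports and dashboards
gold = (
    spark.table("sales.orders_silver")
    .groupBy("order_date")
    .agg(F.sum("order_total").alias("daily_revenue"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("sales.orders_gold")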
3. Reading data
The start of any data engineering pipeline is reading in some data. Since Databricks runs on Spark, data engineers have access to a very flexible and performant data processing framework. Spark can read from nearly any data source and format.
Some of the most common sources, at least in the lakehouse architecture, will be Delta tables, the default in Databricks, and your existing open file formats like CSV, JSON, and Parquet. These will be the most common because most systems write data out as files into a data lake. Sometimes data lives in another database system, such as a MySQL or Postgres database. Data can also be streaming in, where Spark can tap into a high-velocity flow of data. Data can even be more complex, such as image or video data!
4. Reading data
Here are some examples of how to read data from a Delta table, from a collection of CSV files, and from a Postgres database through a JDBC connection. All these methods and more are supported in the Databricks environment.
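Since the slide's exact snippets aren't reproduced in this transcript, a rough PySpark sketch of those three reads might look like the following; the table name, folder path, and Postgres connection details are all placeholders.

# Reading a Delta table registered in the catalog (table name is a placeholder)
df_delta = spark.read.table("main.sales.orders_bronze")

# Reading a folder of CSV files with a header row and inferred schema (path is a placeholder)
df_csv = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("/mnt/landing/csv_files/")
)

# Reading from a Postgres database over JDBC (connection details are placeholders)
df_pg = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")
    .option("dbtable", "public.customers")
    .option("user", "reader")
    .option("password", "example-password")
    .load()
)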
5. Structure of a Delta table
Since Delta is the default storage format for Databricks, let's try to understand the Delta format. Delta provides a table-like interface for your program to read while relying on open-source file formats.
Here we see an example of a Delta table that comes out-of-the-box with Databricks. At the top, we can see the _delta_log directory, which holds JSON files that track all of the changes made to this particular table. After that, a collection of Parquet files stores the actual data in the table. When reading a Delta table, Spark and other programs know how to organize and sift through these Parquet files to show only the current state of your data table.
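As a rough illustration, you can list the contents of a Delta table's folder from a Databricks notebook with dbutils.fs.ls; the path below is a placeholder for wherever your table's files actually live, and the commented output is only indicative.

# List the files behind a Delta table (path is hypothetical)
for entry in dbutils.fs.ls("/user/hive/warehouse/orders_bronze/"):
    print(entry.name)

# Typical output looks something like:
#   _delta_log/              <- directory of JSON transaction-log files
#   part-00000-...parquet    <- Parquet files holding the table's rows
#   part-00001-...parquet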
6. Explaining the Delta Lake structure
Let's use an analogy to help the Delta table structure make more sense. Imagine we are reading a food blog and we find a recipe that intrigues us. When we look at a picture of the food from the recipe, we see the finished product. This is exactly how the Delta table looks to the end user who is reading the data.
Now imagine we read a bit further down in the food blog and find the section on how to actually make the food, where we see a list of required ingredients and the steps we would have to follow to recreate the meal. These play the same role as the JSON transaction log and the Parquet data files held within the Delta table's folder.
7. DataFrames
When handling data in Databricks, more often than not, you will be using DataFrames, which are two-dimensional representations of data. You can think of a DataFrame as a programmatic interpretation of a standard table with rows and columns, like the one we have here on the right of the screen. The DataFrame concept is standard across data processing frameworks like Spark, pandas, dplyr, and SQL, each with a slightly different structure. In Databricks, Spark DataFrames are the default and the construct underlying almost all of your data processing.
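As a quick illustration, here is a small Spark DataFrame built from a few in-memory rows; the column names and values are made up for the example.

# A small Spark DataFrame; names and values are illustrative
df = spark.createDataFrame(
    [("Ada", 36), ("Grace", 45)],
    schema=["name", "age"],
)
df.printSchema()  # column names and types
df.show()         # renders the rows and columns like a standard table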
8. Writing data
Data engineering aims to transform data from one form into another. As a final step, we must write that data out somehow. In Databricks, we think about writing two main kinds of tables.
The first kind is Managed Tables. These are the default kind of Delta table, managed by Databricks, and they are stored in the managed storage location defined for your Unity Catalog. As we see in this code snippet, Managed Tables can be quickly written without any additional configuration in both Python and SQL.
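A minimal sketch of that write from Python, with a hypothetical three-level table name; the SQL variant is run here through spark.sql for convenience.

# Writing a Managed Table: no path needed, Databricks stores the files
# in the Unity Catalog managed location (table names are hypothetical)
df.write.format("delta").mode("overwrite").saveAsTable("main.sales.orders_managed")

# The SQL equivalent, run from Python
spark.sql(
    "CREATE OR REPLACE TABLE main.sales.orders_managed_sql "
    "AS SELECT * FROM main.sales.orders_managed"
)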
The second kind of Delta table is the External Table. While these are still Delta tables, they are written to a separate location, typically a bucket in a data lake. To write an External Table, we need to set a LOCATION parameter when writing the table. External Tables are managed by the customer rather than by Databricks. The following code snippets show that the code is very similar except for the location parameter on each.
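A hedged sketch of the same write as an External Table; the cloud storage paths and table names below are placeholders, and in Python the explicit path option is what makes the table external.

# Writing an External Table from Python: same write, plus an explicit storage path
(
    df.write.format("delta")
    .mode("overwrite")
    .option("path", "abfss://lake@mystorageacct.dfs.core.windows.net/tables/orders_external")
    .saveAsTable("main.sales.orders_external")
)

# The SQL equivalent uses a LOCATION clause
spark.sql("""
    CREATE TABLE main.sales.orders_external_sql
    LOCATION 'abfss://lake@mystorageacct.dfs.core.windows.net/tables/orders_external_sql'
    AS SELECT * FROM main.sales.orders_managed
""")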
9. Let's practice!
Let's review some of these data engineering fundamentals in Databricks!