1. Databricks Architecture
In this video, we will be discussing the Databricks architecture at a high level. After this video, you should have a better understanding of how Databricks works and how you can get started in your own environment.
2. The data persona
To start out, the first fundamental piece we need to consider in the architecture is the user, or data persona. This person has data they want to analyze, a skill in some kind of programming language, and the analytical capability to derive insights. Usually, people start to run out of processing power, so they look for a scaling mechanism. Thus, this person would like to adopt the Lakehouse architecture.
3. Data in the cloud
For most organizations these days, scaling your analytics capabilities starts by putting data in the cloud. In most areas of the world, there are three major cloud vendors: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. Since this person will be implementing a Lakehouse architecture, they have put their data into a data lake in the Delta format.
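To make the idea of "data in the cloud" a little more concrete, here is a minimal sketch of how the location of a Delta table in a data lake is typically written for each of the three vendors. The helper `lake_uri` and the bucket, container, and account names are made up for illustration; only the URI schemes themselves (`s3://`, `abfss://`, `gs://`) are standard.

```python
def lake_uri(cloud: str, container: str, path: str, account: str = "") -> str:
    """Build a storage URI for a folder in a cloud data lake.

    Hypothetical helper for illustration only; it is not part of any
    Databricks API. `container` is the bucket (AWS, GCP) or container (Azure).
    """
    path = path.lstrip("/")
    if cloud == "aws":
        # Amazon S3 bucket
        return f"s3://{container}/{path}"
    if cloud == "azure":
        # Azure Data Lake Storage Gen2 also needs the storage account name
        if not account:
            raise ValueError("Azure paths need a storage account name")
        return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    if cloud == "gcp":
        # Google Cloud Storage bucket
        return f"gs://{container}/{path}"
    raise ValueError(f"unknown cloud vendor: {cloud}")


# Example names below are assumptions, not real resources.
aws_path = lake_uri("aws", "my-lake", "/delta/sales")
azure_path = lake_uri("azure", "lake", "delta/sales", account="myaccount")
gcp_path = lake_uri("gcp", "my-lake", "delta/sales")
```

Whichever vendor you pick, the Delta table is just a folder of files at a path like this inside your own cloud account.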
4. Databricks service
Next, our data persona will look to start using Databricks. Databricks is a cloud-based SaaS application that is hosted on all three major cloud vendors around the globe. When you sign up for a Databricks account, you get access to the Databricks service, which is maintained in the Databricks cloud environment.
5. Databricks workspace
Once our data persona creates their first Databricks workspace, they have set up the connection between the Databricks service and their own cloud environment. At a very high level, the Databricks service will act like the "brain" of your analytics platform, creating and orchestrating different computations on your data to provide the insights you are looking for.
6. The Databricks Architecture
Here is a much more detailed diagram of the Databricks Architecture.
The Databricks portion of this architecture is known as the Control Plane. This is a cloud environment owned by Databricks, and is the "brain" we just referred to. In the Control Plane we will find the Databricks web UI itself, alongside the notebooks, queries, and jobs that we run. We will also find the controller that creates different compute nodes to perform your analytical processes.
The customer portion of this architecture is known as the Compute Plane. This is your cloud environment, and is where you will store your data in the lakehouse. Databricks is a compute engine that operates on top of the data. In the classic model, which we will cover in a separate chapter, the Control Plane will create clusters in the Compute Plane in order to reach the data. By having data and compute in the Compute Plane, you can ensure that all of your networking, applications, or external systems in the cloud will continue to work as expected.
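If it helps to think in code, the split between the two planes can be sketched as a toy Python model. This is only an illustration of the division of responsibilities described here, not how Databricks is implemented: the class names, method names, and the sample "sales" data are all assumptions made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    num_workers: int


@dataclass
class ComputePlane:
    """Toy stand-in for the customer's cloud environment (hypothetical)."""
    data_lake: dict = field(default_factory=dict)  # table name -> rows
    clusters: dict = field(default_factory=dict)   # cluster name -> Cluster

    def run(self, cluster_name, table, fn):
        # The computation happens here, next to the data.
        if cluster_name not in self.clusters:
            raise KeyError(f"no cluster named {cluster_name!r}")
        return fn(self.data_lake[table])


@dataclass
class ControlPlane:
    """Toy stand-in for the Databricks-owned environment (hypothetical)."""
    notebooks: dict = field(default_factory=dict)  # notebook name -> function

    def create_cluster(self, compute_plane, name, num_workers):
        # The "brain" decides what compute to create, but the cluster
        # itself lives in the customer's Compute Plane.
        cluster = Cluster(name, num_workers)
        compute_plane.clusters[name] = cluster
        return cluster

    def run_notebook(self, compute_plane, notebook, cluster_name, table):
        # The notebook is stored in the Control Plane; only the instruction
        # to run it is sent over. The data never leaves the Compute Plane.
        fn = self.notebooks[notebook]
        return compute_plane.run(cluster_name, table, fn)


compute = ComputePlane(data_lake={"sales": [10, 20, 30]})
control = ControlPlane(notebooks={"total_sales": sum})
control.create_cluster(compute, "demo-cluster", num_workers=2)
result = control.run_notebook(compute, "total_sales", "demo-cluster", "sales")
```

Notice that the `ControlPlane` object never holds the data itself. It only holds the notebook and tells the `ComputePlane` what to run, which mirrors the idea that your data and the clusters that process it stay in your own cloud environment.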
7. Let's review!
Let us quickly review the Databricks architecture to ensure we understand it.