1. Core features of the Databricks Lakehouse Platform
Hello, and welcome back! In this video, we will be discussing some of the core features and concepts of the Databricks Lakehouse platform.
2. Apache Spark
Databricks is built on Apache Spark, an open-source framework for processing Big Data. The founders of Databricks originally created Apache Spark and brought that expertise into the platform. We won't be diving into the specifics of Apache Spark in this course; if you want more details, please refer to one of the Spark courses on DataCamp.
3. Benefits of Spark
Using Spark under the hood brings a few key benefits. Firstly, Spark is a highly extensible framework, which means a single framework can support many data sources, data volumes, and use cases, all while letting you work in the programming language you prefer.
Secondly, Apache Spark has a very large and active developer community, meaning you can learn from their collective experience and apply that to your situation.
Thirdly, Spark is very performant when processing Big Data and will scale with your data.
Finally, Databricks has built a variety of optimizations into its platform-specific version of Spark, which results in significantly better performance and some quality-of-life improvements.
4. Cloud computing basics
Now, how do we actually use the Spark framework? Before we dive into that in a Databricks context, we need to understand cloud computing. In a more traditional setup, users perform analytics with whatever processing power is available on their own computers. This can be scaled up somewhat, but it ultimately struggles to keep up with the growing amount of data we see.
This is where cloud computing comes into play. With cloud computing, users keep the same analysis they built on their computer but run those processes on resources in the cloud, drawn from large pools of servers at their disposal. This allows you to scale up your work easily and almost without limit.
5. Databricks Compute
In the Databricks environment, there are a few different options for providing computing power to the Spark framework. The default compute option is Clusters: collections of virtual machines on which Databricks installs its Runtime. Clusters can handle any data workload or use case and run any of the supported languages in Databricks (Python, R, Scala, and SQL). Different kinds of clusters come with different SKUs, depending on whether your work is interactive or automated.
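To make that concrete, here is a minimal sketch of the kind of PySpark code you might run interactively on a cluster. The dataset path and column names are hypothetical, and the `spark` session and `display()` helper are provided by Databricks notebooks.

```python
# A minimal PySpark sketch of interactive work on a cluster.
# The `spark` session and `display()` helper are predefined in Databricks
# notebooks; the path and column names here are hypothetical.
from pyspark.sql import functions as F

sales = spark.read.format("parquet").load("/mnt/demo/sales")  # hypothetical dataset

daily_totals = (
    sales.groupBy("order_date")
         .agg(F.sum("amount").alias("total_amount"))
)

display(daily_totals)
```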
SQL Warehouses are relatively new to the Databricks platform and provide a SQL-optimized engine for your Business Intelligence use cases. One of those optimizations is Photon, a rewritten Spark engine that significantly increases SQL performance.
6. Cloud data storage
Now that we have some compute power, let's talk about data storage. In the cloud, there historically have been two main ways to store data. First, many organizations put data into various databases hosted in the cloud, which provide a rigid location for their structured data. These are great for traditional data needs, but lack the flexibility that many datasets need to grow with the business.
Organizations can also store data in various file formats, such as CSV, JSON, and Parquet. These are open formats and much more flexible, but they require more work to maintain data quality.
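As a quick illustration, here is a small sketch of how Spark reads these formats; the paths are hypothetical placeholders, and the `spark` session is predefined in Databricks notebooks.

```python
# Reading common open file formats with Spark; paths are hypothetical.
customers = spark.read.option("header", "true").csv("/mnt/raw/customers.csv")
events    = spark.read.json("/mnt/raw/events/")
payments  = spark.read.parquet("/mnt/raw/payments/")
```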
7. Delta
With Databricks, we recommend storing data in Delta Lake tables, an open-source storage format built on top of Parquet. By storing data in Delta, you gain the performance and reliability of traditional data warehouse tables while keeping your data in a flexible, scalable data lake. ACID transactions ensure that each write is applied exactly once, and the same tables can be used across batch and streaming workloads. Delta also maintains the flexibility of the data lake, allowing for schema evolution and for restoring previous versions of your data through time travel.
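Here is a minimal sketch, using a hypothetical table path, of the Delta features just mentioned: an ACID write, schema evolution on append, and time travel back to an earlier version.

```python
# A minimal sketch of Delta features: ACID writes, schema evolution, and
# time travel. The table path and columns are hypothetical, and the `spark`
# session is predefined in Databricks notebooks.
from pyspark.sql import functions as F

path = "/mnt/lakehouse/orders"

# Initial write: an ACID transaction that either fully succeeds or leaves no trace.
orders_v0 = spark.range(3).withColumn("amount", F.lit(10.0))
orders_v0.write.format("delta").mode("overwrite").save(path)

# Schema evolution: append rows that introduce a new "channel" column.
orders_v1 = (
    spark.range(3, 6)
         .withColumn("amount", F.lit(12.5))
         .withColumn("channel", F.lit("web"))
)
orders_v1.write.format("delta").mode("append").option("mergeSchema", "true").save(path)

# Time travel: read the table exactly as it looked at version 0.
first_version = spark.read.format("delta").option("versionAsOf", 0).load(path)
```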
8. Unity Catalog
Unity Catalog is the centralized governance solution for the Databricks Lakehouse. Using straightforward SQL GRANT statements, administrators can control access to all assets in the Lakehouse, not just the data itself: governance covers data catalogs, notebooks, clusters and warehouses, and even the Databricks Feature Store and any machine learning models.
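As a hedged example, the grants below show what that access control can look like; the catalog, schema, table, and group names are made up for illustration.

```python
# A sketch of Unity Catalog access control via SQL GRANT statements.
# The catalog, schema, table, and group names are hypothetical examples.
spark.sql("GRANT USE CATALOG ON CATALOG sales TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA sales.reporting TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE sales.reporting.daily_revenue TO `data_analysts`")
```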
9. Databricks UI
The Databricks Lakehouse platform aims to unify the developer experience for all data personas, and the Databricks UI exemplifies that goal. In the UI, the on-screen menu provides access to different platform capabilities based on your data persona. While everyone needs access to data and compute resources, you can pick exactly the components you want to interact with. As a data analyst, you can leverage the familiar SQL scripting environment for your queries. As a data engineer, you can use Delta Live Tables to build robust pipelines for your teams. As a data scientist, you can develop, train, and deploy your models with the platform's native capabilities.
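For the data engineering persona, here is a hedged sketch of what a very small Delta Live Tables pipeline can look like. It assumes the code runs as a DLT pipeline, where the `dlt` module and `spark` session are available; the source path, table names, and filter column are illustrative only.

```python
# A hedged sketch of a tiny Delta Live Tables pipeline; the source path and
# column names are hypothetical.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested from a hypothetical landing folder.")
def raw_orders():
    return spark.read.format("json").load("/landing/orders/")

@dlt.table(comment="Orders with obviously invalid amounts filtered out.")
def clean_orders():
    return dlt.read("raw_orders").where(F.col("amount") > 0)
```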
10. Let's review!
With all of that, let us quickly review the platform's key components!