
Data Intelligence Platform - Compute

1. Data Intelligence Platform - Compute

Welcome! This video will discuss how compute power works in the Data Intelligence Platform.

2. Why do organizations care about compute?

For organizations, managing compute is a significant concern: adequate compute power is critical to getting the best analytics insights. Without it, your organization is stuck with slow, unscalable systems; with sufficient computational resources, you can process data quickly, unlocking hidden insights and gaining a competitive advantage.

3. Apache Spark

Databricks is a platform built on Apache Spark, an open-source distributed computing framework created by the Databricks co-founders. Spark is highly efficient: it distributes work across many computing resources, supports multiple languages (including Python, SQL, Scala, and R), and covers a wide range of use cases. We will not be diving deeper into Spark in this course, but if you would like to know more, feel free to search for an Apache Spark course on DataCamp, where you can learn more about the specifics.
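To make the idea of distributing work concrete, here is a minimal sketch in plain Python of the split-process-combine pattern Spark applies across machines. This uses threads on one machine rather than a real cluster, and `count_words` and `distributed_word_count` are illustrative names, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    # "Map" step: each worker counts the words in its own partition.
    return sum(len(line.split()) for line in partition)

def distributed_word_count(lines, workers=4):
    # Split the input into roughly equal partitions, one per worker.
    size = max(1, len(lines) // workers)
    partitions = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Run the workers in parallel, then combine ("reduce") the partial counts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_words, partitions))
```

In real Spark the partitions live on different machines and a driver coordinates the workers, but the shape of the computation is the same.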

4. Cluster Types

In the Databricks environment, clusters process data and perform analytics. A cluster is a collection of compute resources that Databricks creates, manages, and terminates based on your needs. There are two ways to create and implement clusters. The first method uses the platform's classic architecture, which you saw previously: Databricks creates clusters in your own cloud environment by sending instructions to the cloud provider. This method gives you full control and lets you leverage existing cloud resources. However, cluster startup is slow, since every resource is created from scratch.

5. Cluster Types

The second and newer method is the platform's serverless architecture. In this design, Databricks instead creates the compute resources in its own Control Plane, granting access to the users you have authorized through Unity Catalog. Besides much faster startup times, the big advantage is that you always get the latest Databricks features, and performance improves over time as Databricks learns from your usage. The main potential drawback is that you no longer control the resources directly, although most organizations are comfortable with this kind of SaaS architecture.

6. Single-node vs. Multi-node

You can choose between a single-node and a multi-node design when creating a cluster, and the correct cluster design depends on your workload! Single-node clusters are simply clusters with one machine, the Driver Node. They can still run Spark but are generally used to run common single-node frameworks like pandas or dplyr, and they are a great choice for small datasets, as they are inexpensive to run. Multi-node clusters have the same Driver Node plus one or more Worker Nodes attached to it. These use Spark to distribute work across all the available resources and are great for large datasets.
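As an illustration of the difference, a cluster definition in the Databricks clusters API includes a `num_workers` field. The abridged, hypothetical spec below (field values are placeholders) sketches a single-node cluster; a multi-node cluster would instead set `num_workers` to a positive value and drop the single-node Spark settings. Check the Databricks clusters documentation for the exact fields your workspace expects.

```json
{
  "cluster_name": "small-analysis",
  "num_workers": 0,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  }
}
```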

7. Databricks Runtime

No matter what kind of cluster you create, each one will have the Databricks Runtime installed. The Runtime bundles several components, such as an optimized version of Spark, the Photon engine for fast SQL queries, and the various libraries and APIs you will need to work in Databricks. We do not need to explore the Runtime in depth, as Databricks manages it automatically. We recommend picking the most recent LTS, or Long-Term Support, version.
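When defining a cluster through the API or UI, the Runtime is selected with a version string. The fragment below is a hypothetical example: `14.3.x-scala2.12` corresponds to a 14.3 LTS runtime at the time of writing, and `node_type_id` is cloud-specific (the value shown is an AWS instance type), so the exact values in your workspace will differ.

```json
{
  "cluster_name": "team-etl",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2
}
```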

8. Let's practice!

With that, let us review some of the compute-related concepts we just went over!