
Cluster Types and Runtimes

1. Cluster types and runtimes

None of the lakehouse architecture matters without compute. Let's nail down what a cluster is, compare the two main cluster types, and see how runtimes work.

2. What is a cluster?

A cluster is simply a group of virtual machines working together. There's always one driver node that coordinates the work - accepting your commands, planning how to split the computation, and returning results. Then there are one or more worker nodes that do the heavy lifting, reading data from cloud storage and executing transformations in parallel. When you run a notebook cell or submit a job, the driver farms it out to the workers.
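The driver-and-workers pattern can be sketched in plain Python. This is a loose analogy, not actual Spark code: a "driver" function plans how to split the input into partitions, hands each partition to a "worker" thread, and collects the results. All names here are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # Each "worker" transforms its slice of the data independently.
    return [x * 2 for x in partition]

def driver(data, num_workers=4):
    # The "driver" plans how to split the computation...
    size = max(1, len(data) // num_workers)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # ...farms the partitions out to workers, and gathers the results.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(process_partition, partitions)
    return [x for part in results for x in part]

print(driver(list(range(8))))  # → [0, 2, 4, 6, 8, 10, 12, 14]
```

In a real cluster the workers are separate virtual machines reading from cloud storage, not threads in one process, but the division of labor is the same.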

3. All-purpose clusters

All-purpose clusters are built for interactive work. You spin one up, attach your notebook, and start exploring data, testing transformations, or running ad-hoc queries. Multiple people on your team can share the same cluster, which is great for collaboration. The catch? These clusters stay running until someone explicitly terminates them. Think of it as your personal car - always available, but you pay even when it's parked.
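As a rough sketch, here is what a request body for creating an all-purpose cluster might look like with the Databricks Clusters API (`clusters/create`). The field names follow the public API docs, but the runtime version and node type strings are placeholders - check what your workspace actually offers. Note the auto-termination setting, which softens the "paying while it's parked" problem.

```python
# Hypothetical all-purpose cluster config; version and node type are placeholders.
all_purpose_cluster = {
    "cluster_name": "team-exploration",
    "spark_version": "15.4.x-scala2.12",  # an LTS runtime string (placeholder)
    "node_type_id": "i3.xlarge",          # cloud-specific VM type (placeholder)
    "num_workers": 2,
    # Guardrail: shut down after 60 idle minutes instead of waiting for
    # someone to remember to terminate the cluster manually.
    "autotermination_minutes": 60,
}

print(sorted(all_purpose_cluster))
```

You would send this payload to your workspace's REST endpoint (or pass the same settings through the UI or SDK); the dict above only illustrates the shape of the configuration.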

4. Jobs clusters

Jobs clusters work differently. They're created automatically when a scheduled job kicks off, they run that one task, and they terminate as soon as it's done. No one needs to babysit them. This makes them ideal for production workloads - nightly ETL pipelines, scheduled reports, automated data quality checks. You only pay for the compute you actually use. If all-purpose clusters are your personal car, jobs clusters are a taxi: show up when you need them, gone when the ride's over.
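A jobs cluster isn't created directly - it's declared inside a job definition. The sketch below mimics the shape of a Databricks Jobs API (`jobs/create`) payload: because the compute is specified under `new_cluster` rather than referencing an existing cluster, the platform provisions it when the schedule fires and tears it down when the task finishes. The notebook path, version, and node type are placeholders.

```python
# Hypothetical nightly ETL job; paths and version strings are placeholders.
nightly_etl_job = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "run_pipeline",
            "notebook_task": {"notebook_path": "/Repos/data/etl_pipeline"},
            "new_cluster": {              # created per run, gone when the run ends
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every night at 02:00
        "timezone_id": "UTC",
    },
}

print(nightly_etl_job["name"])
```

The key design point is the `new_cluster` block: ephemeral compute is part of the job's definition, so nobody has to babysit it.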

5. When to use which?

The rule of thumb is straightforward. Use all-purpose clusters when you need to think and iterate - exploring datasets, developing notebooks, running one-off queries. Use jobs clusters when the work is repeatable and automated - ETL pipelines, scheduled reports, anything that runs on a timer. A common mistake teams make is running production pipelines on all-purpose clusters. It works, but you're paying for idle time between runs. Jobs clusters eliminate that waste entirely. Other specialized cluster types also exist, but all-purpose and jobs clusters cover most day-to-day work.
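The "paying for idle time" point is easy to quantify with back-of-the-envelope arithmetic. The hourly rate below is a made-up number purely for illustration; plug in your own cloud and Databricks costs.

```python
# Hypothetical cost comparison for a pipeline doing 2 hours of real work per day.
rate_per_hour = 3.0        # assumed blended $/hour for the cluster (made up)
work_hours_per_day = 2

all_purpose_cost = rate_per_hour * 24                   # runs all day, idle or not
jobs_cluster_cost = rate_per_hour * work_hours_per_day  # pay only while working

print(all_purpose_cost, jobs_cluster_cost)  # → 72.0 6.0
```

Even with these toy numbers, an always-on all-purpose cluster costs 12x more per day than a jobs cluster doing the same two hours of work.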

6. Databricks Runtime

Every cluster runs a Databricks Runtime - a versioned software stack that includes Apache Spark, Delta Lake, and language interpreters for Python, R, and Scala. Think of it as the operating system for your cluster. Databricks releases new runtime versions regularly, but for production work, you'll want to stick with an LTS version - long-term support - which gets security patches without breaking changes. There's also an ML Runtime that comes pre-loaded with machine learning libraries like TensorFlow and PyTorch, so data scientists can get started without installing anything.
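Picking the LTS version can be sketched as a simple filter-and-sort. The version names below mimic how Databricks labels releases (e.g. "13.3 LTS"), but the list itself is invented for illustration.

```python
# Hypothetical list of available runtime names; only some are long-term support.
runtimes = ["14.2", "13.3 LTS", "14.3 LTS", "15.1", "12.2 LTS"]

# Keep only LTS releases, then take the highest version number.
lts_only = [r for r in runtimes if "LTS" in r]
newest_lts = max(lts_only, key=lambda r: tuple(int(p) for p in r.split()[0].split(".")))

print(newest_lts)  # → 14.3 LTS
```

Sorting on the parsed `(major, minor)` tuple rather than the raw string avoids the classic trap where "9.1" compares greater than "14.3" lexicographically.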

7. Summary

To recap: all-purpose clusters are for interactive development and exploration - flexible but require manual management. Jobs clusters are for automated production workloads - they spin up, do the work, and shut down automatically. And the Databricks Runtime is the versioned software stack that powers both. Choosing the right cluster type for each workload is one of the simplest ways to keep your Databricks bill under control.

8. Let's practice!

Time to get hands-on. In the exercises ahead, you'll create a cluster and classify the characteristics of each type.
