1. Cluster creation and management
In this lesson, we will learn how to create, list, start, and delete Databricks All-Purpose clusters.
2. Serverless vs. managed infrastructure
The Databricks jobs you've seen so far can run on multiple types of infrastructure: either clusters that we create and manage ourselves, or serverless infrastructure. Running on serverless infrastructure means we don't have to concern ourselves with managing the specific servers the code will run on; instead, we can focus on what our code will do and pay on demand every time our code is run. In contrast, we can create our own Spark clusters that are customized to meet our specific use case. For long-running jobs, such as training AI models or running data engineering pipelines, it can be more cost-effective to create our own clusters instead of using serverless.
3. Create a Databricks Spark cluster
The Databricks clusters service can be accessed via the `clusters` attribute of the `WorkspaceClient` object. We will use this service to create and manage clusters in our Databricks workspace. The `.create()` method of the `clusters` service creates a new Spark cluster in our authenticated Databricks workspace. A Spark cluster is a group of servers that work together to process big data. There are many optional parameters that can be used to customize the new cluster to your needs. Here, we use the `cluster_name` parameter to pass in a name that identifies the cluster being created. We set the `spark_version` of the cluster to the latest stable version. We use `autotermination_minutes` to set the number of minutes the cluster can be inactive before terminating to 20 minutes. And we use the `num_workers` parameter to specify that there should be 3 worker nodes available in this cluster. We use `.result()` to wait for the cluster to finish being created.
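A minimal sketch of what this might look like with the Databricks Python SDK; the cluster name and the node type selection are illustrative choices not shown in the lesson:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create a cluster and block until it is ready
cluster = w.clusters.create(
    cluster_name="my-cluster",  # illustrative name
    spark_version=w.clusters.select_spark_version(latest=True),
    node_type_id=w.clusters.select_node_type(local_disk=True),  # assumed node type choice
    autotermination_minutes=20,
    num_workers=3,
).result()
```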
4. List clusters
We can use the `list()` method of the `clusters` service to list all of the clusters in the authenticated Databricks workspace. It returns a collection of cluster objects. In this example, we iterate through all of the cluster objects and print the `cluster_id` attribute of each cluster.
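For example, sketched with the same `WorkspaceClient` instance as above:

```python
# Print the ID of every cluster in the workspace
for cluster in w.clusters.list():
    print(cluster.cluster_id)
```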
5. Start a cluster
We can start a Databricks cluster by using the `start()` method of the `clusters` service and passing in the `cluster_id` of the cluster in your workspace that you want to start. In this lesson, we will assume that the `cluster_id` we are interested in is stored in an environment variable called "DATABRICKS_CLUSTER_ID".
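A sketch of reading the ID from the environment and starting the cluster:

```python
import os

cluster_id = os.environ["DATABRICKS_CLUSTER_ID"]

# Start the cluster and wait until it is running
w.clusters.start(cluster_id=cluster_id).result()
```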
6. Check the state of a cluster
To retrieve information about a cluster, we can pass the ID of the cluster to the `.get()` method of the `clusters` service. This method returns an object whose attributes contain information about the given cluster. We can use the `state` attribute to retrieve the state of a cluster, which tells us whether it is in a state such as `PENDING`, `RUNNING`, or `TERMINATED`.
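For instance, assuming the `cluster_id` variable from the previous step:

```python
# Fetch the cluster's details and inspect its state
info = w.clusters.get(cluster_id=cluster_id)
print(info.state)  # e.g. RUNNING or TERMINATED
```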
7. Delete a cluster
We can use the `.delete()` method of the `clusters` service to delete the Databricks Spark cluster with a specified `cluster_id`. This terminates the Spark cluster with the specified ID asynchronously. Once termination is complete, the cluster will still exist in the workspace, but it will be in a `TERMINATED` state.
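A minimal sketch, reusing the same `cluster_id`:

```python
# Terminate the cluster; .result() waits until it reaches TERMINATED
w.clusters.delete(cluster_id=cluster_id).result()
```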
8. Let's practice!
Congratulations on learning how to create and manage Databricks Spark clusters. You can use this knowledge to create and use clusters customized to your specific use case.