Job creation and management
1. Job creation and management
Now we will learn about creating and managing Databricks jobs using the Python SDK. Databricks jobs are used to run, automate, and manage the execution of data processing tasks on a Databricks cluster.
2. Databricks Notebook path
Databricks jobs run code written inside a Databricks Notebook on a Databricks cluster. To find a notebook's path, we first need to know our username, which we can retrieve from the `current_user.me().user_name` attribute of the `WorkspaceClient`. Let's assume we created a notebook in our workspace called `My_Notebook`. We can specify which notebook the job should run by setting up a path variable to use when we create the job.
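Here is a minimal sketch of building that path, assuming `My_Notebook` sits directly under the current user's home folder in the workspace:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Determine the username of the current user
user_name = w.current_user.me().user_name

# Assumption: the notebook lives directly under the user's home folder
notebook_path = f"/Users/{user_name}/My_Notebook"
print(notebook_path)
```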
3. Creating a Databricks job
We create a Databricks job using the `.create()` function of the `jobs` API in the `WorkspaceClient`. Many parameters can be passed into the `create` function, but in this course we will focus on the `name` and `tasks` parameters. There are many types of job tasks, including running a Python file hosted in the cloud and running code in a Databricks notebook. We will create a notebook task, which is defined by passing the `description`, `notebook_task`, and `task_key` parameters into the `jobs.Task()` function. The notebook task itself is defined by passing the `notebook_path` into the `jobs.NotebookTask()` function. By default, the job runs on serverless infrastructure managed by Databricks. However, you can use the `existing_cluster_id` field to specify the Databricks cluster for the job to run on.
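As a sketch, a notebook task could be assembled like this (the description, `task_key`, and cluster ID are illustrative values):

```python
from databricks.sdk.service import jobs

# Point the task at the notebook it should run
notebook_task = jobs.NotebookTask(
    notebook_path="/Users/someone@example.com/My_Notebook"
)

# Wrap the notebook task in a job task definition
task = jobs.Task(
    description="Runs My_Notebook",
    notebook_task=notebook_task,
    task_key="my_notebook_task",
    # Optional: run on an existing cluster instead of serverless
    # existing_cluster_id="1234-567890-abcdefgh",
)
```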
4. Creating and running a Databricks job
Let's say we want to create a Databricks job that runs a notebook called "My_Notebook" on a specific cluster. First, we generate the `notebook_path` of `My_Notebook`. Next, we use the `WorkspaceClient.jobs.create()` function with the `name` and `tasks` parameters passed in to create the job. In the `tasks` parameter, we pass in a list of tasks the job should run. A task is created using the `jobs.Task()` function, passing in the description of the notebook and the `notebook_task` the job should run. We can run the created job on demand by passing the `job_id` into the `.run_now()` function of the Workspace Client's Jobs API. After creating a job, we will see it in our Databricks workspace.
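Putting the pieces together, here is a sketch of creating the job and triggering it on demand (the job name, `task_key`, and cluster ID are hypothetical):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Build the notebook path for the current user
notebook_path = f"/Users/{w.current_user.me().user_name}/My_Notebook"

# Create a job with a single notebook task pinned to a specific cluster
job = w.jobs.create(
    name="My_Job",
    tasks=[
        jobs.Task(
            description="Runs My_Notebook",
            notebook_task=jobs.NotebookTask(notebook_path=notebook_path),
            task_key="my_notebook_task",
            existing_cluster_id="1234-567890-abcdefgh",  # hypothetical cluster ID
        )
    ],
)

# Trigger an on-demand run of the job we just created
run = w.jobs.run_now(job_id=job.job_id)
```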
5. Listing Databricks jobs
As we learned in chapter one, we can list all Databricks jobs in our workspace by using the `jobs.list()` function. Here, we list all jobs, iterating through each one and printing the `job_id`.
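A minimal sketch of that loop:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Iterate over every job in the workspace and print its ID
for job in w.jobs.list():
    print(job.job_id)
```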
6. Deleting a Databricks job
We can delete an existing job in our workspace by passing the `job_id` into the `delete` function of the `WorkspaceClient` Jobs API.
7. Deleting a Databricks job
In this example, we first create a job, and then pass the `job_id` attribute of the created job into the `jobs.delete()` function to delete the job.
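A sketch of that create-then-delete round trip (the job name and notebook path are hypothetical):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Create a job to delete (hypothetical name and notebook path)
job = w.jobs.create(
    name="Job_To_Delete",
    tasks=[
        jobs.Task(
            description="Temporary job",
            notebook_task=jobs.NotebookTask(
                notebook_path=f"/Users/{w.current_user.me().user_name}/My_Notebook"
            ),
            task_key="temp_task",
        )
    ],
)

# Delete the job using the job_id attribute of the create response
w.jobs.delete(job_id=job.job_id)
```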
8. Cron syntax
To schedule a job with the Databricks SDK, we need to use cron expressions to specify when the job should run. Cron expressions are the syntax we use to specify at which times during the day and at what frequency jobs should run. In this course, we will reference cron expressions but won't go into detail about the syntax.
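For reference, Databricks job schedules use Quartz cron syntax, whose six fields are second, minute, hour, day of month, month, and day of week. For example:

```python
# Quartz cron fields: second minute hour day-of-month month day-of-week
daily_3am = "0 0 3 * * ?"  # every day at 3:00 AM
hourly = "0 0 * * * ?"     # at the top of every hour
```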
9. Scheduling a job
Here, we create a Databricks job that runs a notebook at 3 AM every day and times out after an hour. We might have the code in this notebook query an LLM about a question on data that gets updated every day.
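A sketch of that scheduled job, assuming UTC as the timezone and hypothetical job and task names:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

scheduled_job = w.jobs.create(
    name="Daily_LLM_Query",  # hypothetical job name
    tasks=[
        jobs.Task(
            description="Queries an LLM about data that updates daily",
            notebook_task=jobs.NotebookTask(
                notebook_path=f"/Users/{w.current_user.me().user_name}/My_Notebook"
            ),
            task_key="daily_llm_task",
        )
    ],
    # Run every day at 3:00 AM (Quartz cron), assuming UTC
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 3 * * ?",
        timezone_id="UTC",
    ),
    # Cancel any run that exceeds one hour
    timeout_seconds=3600,
)
```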
10. Let's practice!
Now that we've learned how to create, modify, view, delete, and schedule Databricks jobs, let's practice!