
Working with Databricks notebooks

1. Working with Databricks notebooks

Welcome to this hands-on Databricks course.

2. Your instructor

Our instructor is Disha, a lead data engineer with hands-on experience building analytics pipelines using Spark and Databricks on large transactional datasets. In this course, we will use these tools in practical ways and follow workflows we commonly see in production environments.

3. Prerequisites

Before we begin, it's helpful to be comfortable with basic SQL, especially Spark SQL syntax. Introductory Databricks SQL and PySpark knowledge also helps. We'll build on that foundation and focus on applying those skills inside Databricks.

4. What to expect

In this course, we'll work directly on the Databricks platform. We'll load transactional datasets, inspect and transform data, write Spark SQL queries, and build toward end-to-end analytical pipelines. But first, let's understand what Databricks is and how it works.

5. What is Databricks?

Databricks is a collaborative analytics platform built on Apache Spark. It provides notebooks, managed clusters, and integrated tools for engineering and analytics. Rather than managing servers ourselves, we let Databricks handle provisioning and scaling so we can focus on analysis.

6. Clusters and Spark

When we run code in a notebook, it executes on a Spark cluster. A cluster is a group of machines that work together to process data in parallel. This approach allows Spark to handle large datasets efficiently, far beyond what a single machine could manage. In our exercises, Databricks provides serverless compute, so you won't need to configure a cluster yourself, but the same driver-worker architecture runs behind the scenes. Now let's look at where we'll write our code.

7. Databricks notebooks

The Databricks notebook is where we'll spend most of our time. It's organized into cells that run independently. A cell can run SQL or Python, or it can be markdown for notes and explanations. Output appears directly below each cell, which makes iteration fast. Let's see this in action with some real data.

8. Dataset introduction

For our demonstrations, we'll use a customer transaction dataset that's already stored in Unity Catalog. Unity Catalog is Databricks' governance layer for permissions, metadata, and data access. The dataset includes transaction dates, amounts, and countries, giving us a realistic base for our Spark workflow.

9. Unity Catalog Volumes

Data can also be uploaded to Unity Catalog volumes. A volume is a secure storage location for files such as CSVs. In our exercises, the datasets are already stored in the volume for you, and we'll load them into a Spark DataFrame.

10. Loading data into a DataFrame

Next, we load the data into a Spark DataFrame. A DataFrame is Spark's distributed table abstraction, designed to work across many machines. We call spark.read.csv(), pass the file path, and set header=True so Spark uses the first row as column names. We also set inferSchema=True so Spark infers data types automatically. Running that cell loads the file and returns a DataFrame with some initial details about our data. Note that if we upload a file manually, we can copy its path from the Unity Catalog volume.

11. Inspecting the DataFrame

Before transforming data, we inspect its structure with printSchema(). This confirms column names and data types, and helps us catch issues early. It also confirms Spark interpreted the data correctly. Here we expect IDs as integers and dates as timestamps.

12. Previewing the data

Next, we call show() to preview sample rows. We can pass a row count and set truncate=False to display full values. This gives us a quick validation step, and we should see customer IDs, transaction dates, and amounts that align with the schema we just checked. One more thing before we practice: let's look at where to find logs when something goes wrong.

13. Driver logs

As we work in Databricks, we can open driver logs from the notebook by going to the Serverless section and clicking Logs. Logs are essential for debugging errors and understanding how code runs on the cluster. For now, we only need to know where to find them, and we'll use them later.

14. Let's practice!

You've completed your first data exploration in Databricks. Now, it's time for practice.
