Loading and Shaping Data

In this chapter, you'll learn how to work with Databricks notebooks, load CSV data into Spark DataFrames, and shape data using PySpark and SQL.

Working with Databricks notebooks

Understanding Databricks notebooks

Loading your first dataset

Exploring driver logs

Shaping data with PySpark and SQL

Using PySpark to shape data

Analyzing data with SQL

Understanding temporary views

Data Cleaning and Optimization

Learn how to define explicit schemas, build a data cleaning pipeline, and optimize query performance with broadcast joins.

Data cleaning and quality checks

Why explicit schemas matter

Cleaning the online retail dataset

Choosing the right quality metric

Aggregating and joining data efficiently

Joining and aggregating retail data

Understanding the shuffle bottleneck

When to use a broadcast join

Analytics and Production Pipelines

Learn how to calculate running totals and rankings with window functions, build streaming pipelines, and deploy production workflows.

Window functions and streaming queries

Ranking customers with window functions

Streaming retail data into Delta Lake

Resuming after a restart

Production pipelines with workflows

Writing and reading a Delta table

Building a multi-task job pipeline

Why switch to Lakeflow?

Wrapping up

online_retail

transactions

country_lookup

Ready to handle real-world data at scale? This course teaches you to transform large datasets using Spark SQL and PySpark in Databricks. Learn to shape and clean data, run aggregations with optimized joins, and apply window functions for advanced analytics. You'll also set up file-based streaming with fault-tolerant checkpoints and persist results as Delta tables. By the end, you'll be orchestrating multi-step production pipelines with Databricks Workflows and Lakeflow Declarative Pipelines.


Introduction to Databricks SQL

Introduction to PySpark

Build end-to-end data pipelines, from cleaning and aggregation to streaming and orchestration.

Data Transformation with Spark SQL in Databricks

Associate Data Engineer in Databricks
