
Window functions and streaming queries

1. Window functions and streaming queries

Over the last two chapters, we cleaned and aggregated our data. Now we'll add window functions for customer analytics and streaming for continuous processing.

2. When groupBy isn't enough

Previously, groupBy gave us one row per customer with their total spend. That's great for summaries, but the individual transactions disappear. Window functions solve this: they calculate across related rows while keeping every row intact. That means running totals that accumulate over time, rankings across all customers, and even comparisons between consecutive rows.

3. Running total

The window specification has three parts: partitionBy groups by Customer_ID, so each customer gets their own window; orderBy sorts by Date within that group; and rowsBetween says to sum everything from the first transaction up to the current one. Then sum().over() applies that window to create a running_total column.

4. Running total

Looking at customer three, their first transaction is also the running total, since it's the only one so far. By the second transaction, the running total jumps to over eighteen thousand. Every row stays intact; that's the power of window functions. Next, let's rank customers by total revenue.

5. Ranking customers

We aggregate with groupBy first to get one row per customer with total revenue, then apply rank() ordered descending. Without partitionBy, Spark ranks globally across all customers. Customer fourteen sixty-nine leads with just under a hundred thousand in total revenue.

6. What is streaming?

Window functions complete our batch analytics toolkit. Now let's look at how to handle data that arrives continuously. In production, new transaction files might land every hour. Reprocessing the entire dataset each time is wasteful and slow. Streaming solves this by processing data incrementally, handling only new files as they arrive. Think of it as a pipeline that stays open, ready for the next batch of data.

7. File-based streaming

Our transaction data arrives as five daily CSV files in a directory. Spark treats this as a streaming source: every file becomes a micro-batch, processed in order, and new files are picked up automatically. One requirement: the schema must be defined explicitly, since Spark won't infer types for streaming file sources by default.

8. Reading the stream

This looks almost identical to a batch read. The key difference: readStream instead of read, telling Spark to watch the directory for new files. We pass in streaming_schema, the same StructType we used for batch reads. The output confirms this is now a streaming DataFrame.

9. What are checkpoints?

Before writing the stream, let's cover checkpoints. They're essential for reliable streaming. A checkpoint is a directory where Spark saves metadata about which files have been processed. If the stream stops for any reason, like a cluster restart, an error, or a scheduled shutdown, Spark reads the checkpoint and picks up exactly where it left off. No reprocessed files, no duplicated rows.

10. Writing the stream

writeStream sends data to a Delta table. We set three key options: a checkpoint directory, a Delta output path, and a trigger. trigger(availableNow=True) processes all available files and stops, ideal for demos and scheduled pipelines. After awaitTermination finishes, five thousand rows have landed in Delta.

11. Monitoring: status and lastProgress

Two properties give visibility into the stream. query.status shows the current state: here the stream is stopped with no data waiting. query.lastProgress reports the last micro-batch: five thousand rows at about seven hundred fifty rows per second. In production, these metrics help you catch performance bottlenecks early.

12. Checkpoint recovery

Let's prove it. We restart the stream with the exact same checkpoint directory. Spark reads the metadata, sees all five files were already processed, and skips them. Zero rows on restart, no duplicates, no reprocessing. Note that checkpoint recovery requires a persistent sink like Delta.

13. Let's practice!

We've covered window functions and built a streaming pipeline with checkpoint recovery. Now let's practice!
