
The Spark UI

1. The Spark UI

Good job caching dataframes! We will now take a look at the Spark UI. The Spark UI is a web interface for inspecting Spark execution.

2. Use the Spark UI to inspect execution

A task is a unit of execution that runs on a single CPU. A stage is a group of tasks that run the same computation in parallel. A job comprises one or more stages. The Spark UI also shows the cache, configuration settings, and SQL queries.

3. Finding the Spark UI

The Spark UI runs on the driver host. When running Spark locally the Spark UI is typically found at localhost, port 4040. If that port is already in use, then Spark will try 4041, 4042, 4043, and so on, in succession. If you are running Spark on a managed cluster, its admin console will provide a link to the Spark UI on the driver host.

4. Spark UI initial view

When Spark is first started and before any action has been performed, it will look something like this. It has six tabs: Jobs, Stages, Storage, Environment, Executors, and SQL. However, not much has happened yet.

5. Spark UI after load

This is the view after loading a small dataframe from a file. It indicates that one job completed, consisting of a single stage with a single task.

6. Cached dataframe in Spark UI

You can inspect the cache under the Storage tab. Here it indicates that the cache contains a dataframe loaded from a file named sherlock_full_parts.parquet, with an in-memory size of 554.9 KB.

7. Spark catalog operations

We've learned that the Spark catalog provides operations on a table, here called table1, namely cacheTable(), uncacheTable(), and isCached(). A fourth operation, dropTempView(), removes a temporary table from the catalog.

8. Spark Catalog

The Spark catalog provides some information about what Spark tables exist, and their properties. Seen here, spark.catalog.listTables() tells us that there is a single temporary table called text.

9. Cached table in Spark UI

The Spark UI gives additional insight into the cached table. Here's what the Storage tab looked like after caching a temporary table called 'df'. It indicates that the table has a single partition, a size of 554.9 KB, and resides completely in memory.

10. Spark UI SQL

Here is the SQL tab after a SELECT COUNT(*) query was run on a table with 107,462 rows.

11. Spark UI Storage Tab

The Spark UI Storage tab shows where partitions exist in memory, or on disk, across the cluster, at a snapshot in time.

12. Spark UI SQL tab

Let's see what a nontrivial SQL query looks like in the Spark UI. Here we run a window function query we learned in a previous lesson.

13. Spark UI SQL tab

Recall that this had a window function subquery wrapped within an aggregate query. At the bottom of the image, it indicates that the subquery is a window function.

14. Spark UI SQL tab

Scrolling down shows more about this query.

15. Spark UI Stages tab

Going to the Stages tab, we see that there were three stages involved in this query. Stages are presented in reverse chronological order. The first stage, stage 3, read 677.8 KB of input, then wrote 1972.4 KB in a shuffle operation. The next stage, stage 4, read in the data written by stage 3, performed 200 tasks, then wrote 3.7 MB in a shuffle operation. The final stage, stage 5, read in that 3.7 MB and performed 200 tasks. In total, 401 tasks were completed.

16. Spark UI Jobs tab

Going to the Jobs tab, we see that the first job had 3 stages, and 401 tasks. This coincides with what we saw on the stages tab.

17. Let's practice

Let's practice what we just learned about the Spark UI.
