Get startedGet started for free

Query plans

1. Query Plans

In this lesson you will learn about query plans in Spark SQL, for both dataframes as well as for Spark SQL tables.

2. Explain

Spark SQL is an easier transition for those of you who are already proficient with Relational database management systems. One of the tools familiar to SQL practitioners is EXPLAIN. If you put the Explain keyword at the head of an sql query, running the query provides detailed plan information about the query without actually running it. Instead of the usual table of data, it returns a query execution plan, also called a query plan. A query plan is a string, representing a set of steps used to access the data.

3. Load dataframe and register

Suppose we load a dataframe from file, located at /temp/df.parquet. Then, we use it to create an SQL table.

4. Running an EXPLAIN query

Running a simple explain select query gives this result, which when formatted

5. Interpreting an EXPLAIN query

tells us that it read the data from a parquet file, having 4 columns along with their names, located at /temp/df.parquet. It also tells us the schema of the table, including the column types. This allows us to determine how the data was obtained and from where.

6. df.explain()

Spark provides a way to run EXPLAIN() on a dataframe. It formats the result to be easier to read. When run on a dataframe, it tells you how Spark obtained the dataframe. Next, we ran explain() on a result obtained from running a query. See that the two query plans are identical.

7. df.explain(), on cached dataframe

Here we cache the dataframe, then explain. Reading a query plan from the top-down gives reverse-chronological order of the steps involved. Reading it from the bottom-up tells us the steps in order from the first step first. Next, we explain a SELECT * query on the corresponding table. The query plans are identical. The principles of relational data management run deep in Spark SQL. In particular, you can use explain to understand the steps used to obtain a dataframe even if it was not obtained using an SQL query.

8. Words sorted by frequency query

Consider this query.

9. Same query using dataframe dot notation

Here is the query plan obtained by running explain() on the dataframe result. This is a bit overwhelming, so let's examine it step by step. We will do so in the next slide, reading it from the bottom up.

10. Reading from bottom up

Reading the result from the previous slide from the bottom up, we see that this data was originally loaded from a file. It was cached in memory. A table scan was performed on the in-memory table. An aggregation operation was performed on the column word. A count operation was performed on the groups. Finally, the result was sorted on the count column in descending order.

11. Query plan

Suppose you had seen this query plan instead. Examine the query plan from the bottom-up and see if you can discern the difference between this one and the query plan from the previous slide. You would be able to determine that the data had not been cached. This could explain an unexpected slow query duration run on data of unknown provenance.

12. Let's practice

Let's jump into some exercises.