1. Query Plans
In this lesson you will learn about query plans in Spark SQL,
for both dataframes as well as for Spark SQL tables.
2. Explain
Spark SQL is an easier transition for those of you who are already proficient with Relational database management systems. One of the tools familiar to SQL practitioners is EXPLAIN. If you put the Explain keyword at the head of an sql query, running the query provides detailed plan information about the query without actually running it. Instead of the usual table of data, it returns a query execution plan, also called a query plan. A query plan is a string, representing a set of steps used to access the data.
3. Load dataframe and register
Suppose we load a dataframe from file, located at /temp/df.parquet.
Then, we use it to create an SQL table.
4. Running an EXPLAIN query
Running a simple explain select query gives this result, which when formatted
5. Interpreting an EXPLAIN query
tells us that it read the data from a parquet file, having 4 columns along with their names, located at /temp/df.parquet. It also tells us the schema of the table, including the column types. This allows us to determine how the data was obtained and from where.
6. df.explain()
Spark provides a way to run EXPLAIN() on a dataframe. It formats the result to be easier to read.
When run on a dataframe, it tells you how Spark obtained the dataframe.
Next, we ran explain() on a result obtained from running a query.
See that the two query plans are identical.
7. df.explain(), on cached dataframe
Here we cache the dataframe, then explain.
Reading a query plan from the top-down gives reverse-chronological order of the steps involved. Reading it from the bottom-up tells us the steps in order from the first step first.
Next, we explain a SELECT * query on the corresponding table.
The query plans are identical. The principles of relational data management run deep in Spark SQL. In particular, you can use explain to understand the steps used to obtain a dataframe even if it was not obtained using an SQL query.
8. Words sorted by frequency query
Consider this query.
9. Same query using dataframe dot notation
Here is the query plan obtained by running explain() on the dataframe result. This is a bit overwhelming, so let's examine it step by step. We will do so in the next slide, reading it from the bottom up.
10. Reading from bottom up
Reading the result from the previous slide from the bottom up, we see that this data was originally loaded from a file.
It was cached in memory.
A table scan was performed on the in-memory table.
An aggregation operation was performed on the column word.
A count operation was performed on the groups.
Finally, the result was sorted on the count column in descending order.
11. Query plan
Suppose you had seen this query plan instead. Examine the query plan from the bottom-up and see if you can discern the difference between this one and the query plan from the previous slide.
You would be able to determine that the data had not been cached. This could explain an unexpected slow query duration run on data of unknown provenance.
12. Let's practice
Let's jump into some exercises.