Most frequent 3-tuples per chapter
We will now nest a query as a subquery in a larger query, one of the more advanced SQL features that Spark SQL supports. Previously you learned how to find the most common word sequences across an entire book of 12 chapters. Now you will obtain the most frequent 3-tuple for each of the 12 chapters, using a window function to retrieve the top row per group.
There is a table having columns word, id, and chapter.
- The chapter column corresponds to the number of a chapter.
- The word column corresponds to a single word in the document.
- The id column corresponds to the word's position in the document.
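For context, a table with this schema could be registered as a temporary view roughly as follows. This is a minimal sketch: the sample rows are invented for illustration and are not the book used in the course.

# Assumed setup (not from the exercise): register a toy "text" table
# with the schema described above -- word, id, chapter.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    ("it", 0, 1), ("was", 1, 1), ("a", 2, 1),     # chapter 1, positions 0..2
    ("up", 3, 1), ("and", 4, 1), ("down", 5, 1),  # chapter 1, positions 3..5
]
df = spark.createDataFrame(rows, ["word", "id", "chapter"])
df.createOrReplaceTempView("text")  # now queryable via spark.sql(...)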
We also have the following query:
subquery = """
SELECT chapter, w1, w2, w3, COUNT(*) AS count
FROM
(
    SELECT
    chapter,
    word AS w1,
    LEAD(word, 1) OVER(PARTITION BY chapter ORDER BY id) AS w2,
    LEAD(word, 2) OVER(PARTITION BY chapter ORDER BY id) AS w3
    FROM text
)
GROUP BY chapter, w1, w2, w3
ORDER BY chapter, count DESC
"""
spark.sql(subquery).show(5)
+-------+---+-----+----+-----+
|chapter| w1|   w2|  w3|count|
+-------+---+-----+----+-----+
|      1| up|   to| the|    6|
|      1|one|   of| the|    6|
|      1| in|front|  of|    5|
|      1| up|  and|down|    5|
|      1| it|  was|   a|    5|
+-------+---+-----+----+-----+
only showing top 5 rows
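To see why the inner SELECT yields 3-tuples: LEAD(word, n) looks n rows ahead within the window, so each row pairs a word with the two words that follow it in the same chapter. A minimal sketch, assuming the toy text table registered above:

# Inspect what LEAD produces before grouping: each row holds a word
# and the next two words in the same chapter, ordered by position.
spark.sql("""
    SELECT
    chapter,
    word AS w1,
    LEAD(word, 1) OVER(PARTITION BY chapter ORDER BY id) AS w2,
    LEAD(word, 2) OVER(PARTITION BY chapter ORDER BY id) AS w3
    FROM text
""").show()
# Near the end of each chapter, w2 and w3 are NULL because no word
# exists 1 or 2 positions ahead; those rows are incomplete 3-tuples.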
From this table you can determine that the first row of the desired result will be:
+-------+---+-----+----+-----+
|chapter| w1|   w2|  w3|count|
+-------+---+-----+----+-----+
|      1| up|   to| the|    6|
+-------+---+-----+----+-----+
Your task is to use subquery as a subquery in a larger query to obtain the most frequent 3-tuple per chapter. The desired result will have the same schema, but with one row per chapter. Use ROW_NUMBER() to number the rows within each chapter.
Exercise instructions
- Get the most frequent 3-tuple per chapter.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Most frequent 3-tuple per chapter
query = """
SELECT chapter, w1, w2, w3, count FROM
(
    SELECT
    chapter,
    ____() OVER (PARTITION BY chapter ORDER BY ____ DESC) AS row,
    w1, w2, w3, count
    FROM ( %s )
)
WHERE row = ____
ORDER BY chapter ASC
""" % subquery
spark.sql(query).show()
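For reference, one possible completion of the blanks is sketched below: ROW_NUMBER() numbers each chapter's tuples in descending order of count, and the WHERE clause keeps only the top-ranked row. Note that ties, such as the two count-6 tuples in chapter 1, are broken arbitrarily by ROW_NUMBER().

# One possible completion (a sketch, not the only valid answer)
query = """
SELECT chapter, w1, w2, w3, count FROM
(
    SELECT
    chapter,
    ROW_NUMBER() OVER (PARTITION BY chapter ORDER BY count DESC) AS row,
    w1, w2, w3, count
    FROM ( %s )
)
WHERE row = 1
ORDER BY chapter ASC
""" % subquery
spark.sql(query).show()
# Expect one row per chapter: that chapter's most frequent 3-tuple.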