Get startedGet started for free

Finding common word sequences

Previously we saw how to create a query that finds word sequences of length three ("3-tuples"). We used that query as a subquery in a traditional SQL query to find the most common 3-tuples in the text document. You will now perform a similar task to find the most common 5-tuples.

DataFrame text_df is available. It contains the first five chapters of the Sherlock Holmes text. It has columns: word, id, part, title. The id column is an integer such that a word that comes later in the document has a larger id than a word that comes before it. The part column separates the data into chapters. The DataFrame text_df is also registered as temporary table called text. Our objective is to create a dataset where each row corresponds to a 5-tuple, having a count indicating how many times the tuple occurred in the dataset.

This exercise is part of the course

Introduction to Spark SQL in Python

View Course

Exercise instructions

  • Create a query query that finds the 10 most common 5-tuples in the dataset.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Find the top 10 sequences of five words
query = """
SELECT w1, w2, w3, w4, w5, COUNT(*) AS count FROM (
   SELECT word AS w1,
   LEAD(____) OVER(____ ) AS w2,
   ____ AS w3,
   ____ AS w4,
   ____ AS w5
   FROM text
)
GROUP BY w1, w2, w3, w4, w5
ORDER BY count DESC
LIMIT ____
"""
df = spark.sql(query)
df.show()
Edit and Run Code