Get startedGet started for free

Creating context window feature data

The moving window technique is useful for machine learning algorithms models that use context window feature data.

A table text having columns id, word, part, title is available in your workspace. It contains chapters 9, 10, 11 and 12 of the Sherlock Holmes book. The words are already processed and organized into one word per row. Each word has a unique integer index provided by the column id. The id column is lower for words that appear earlier in the text and greater for words appearing later in the text.

The first 10 rows of the dataset for chapter 12 are printed to the console as Table1. The first ten rows of the desired result, constrained to show part 12 (Chapter 12) are printed to the console as Table2. In Table2, the "given" word for the row is provided in column w3. Columns w1 and w2 give the two words immediately prior to the given word. Columns w4 and w5 give the two words immediately after the given word.

Note how w1 and w2 are null for the first row. This is because there are not any words prior to w3 (here, "xii") that are within part 12.

Don't hesitate to refer to the slides available at the right of the console if you forget how something was done in the video.

This exercise is part of the course

Introduction to Spark SQL in Python

View Course

Exercise instructions

  • Get the word for each row, along with the previous two words and the subsequent two words.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Word for each row, previous two and subsequent two words
query = """
SELECT
part,
LAG(word, 2) OVER(PARTITION BY ____ ORDER BY ____) AS w1,
LAG(word, ____) OVER(____ BY part ____ BY id) AS w2,
word AS w3,
____(word, 1) OVER(____ BY part ____ BY id) AS w4,
LEAD(word, 2) OVER(____ BY part ____ BY id) AS w5
FROM text
"""
spark.sql(query).where("part = 12").show(10)
Edit and Run Code