Creating context window feature data

The moving window technique is useful for machine learning models that use context window feature data.

A table named text, with columns id, word, part, and title, is available in your workspace. It contains chapters 9, 10, 11, and 12 of the Sherlock Holmes book. The text has already been processed into one word per row. Each word has a unique integer index in the id column: id values are lower for words that appear earlier in the text and higher for words that appear later.

The first ten rows of the dataset for chapter 12 are printed to the console as Table1. The first ten rows of the desired result, restricted to part 12 (Chapter 12), are printed to the console as Table2. In Table2, the "given" word for each row is in column w3. Columns w1 and w2 give the two words immediately before the given word. Columns w4 and w5 give the two words immediately after it.

Note that w1 and w2 are null for the first row. This is because no words precede w3 (here, "xii") within part 12.

Don't hesitate to refer to the slides available at the right of the console if you forget how something was done in the video.

Get the word for each row, along with the previous two words and the subsequent two words.
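
One way to approach this instruction is sketched below. It is an illustration under stated assumptions, not the exercise's official solution. To stay runnable without a Spark session, it uses Python's built-in sqlite3 module as a stand-in. The LAG and LEAD window functions are standard SQL, so the same query text should carry over to spark.sql(...) in the workspace. The sample rows are hypothetical placeholders (only "xii" comes from the description above), and the title column is omitted for brevity.

```python
import sqlite3

# Hypothetical sample rows (id, word, part); placeholders, not the real dataset.
rows = [
    (1, "omega", 11),
    (2, "end", 11),
    (3, "xii", 12),
    (4, "alpha", 12),
    (5, "beta", 12),
    (6, "gamma", 12),
    (7, "delta", 12),
]

# In-memory SQLite database standing in for the Spark table named text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE text (id INTEGER, word TEXT, part INTEGER)")
conn.executemany("INSERT INTO text VALUES (?, ?, ?)", rows)

# LAG looks backward and LEAD looks forward within each part, ordered by id.
# Partitioning by part keeps the window from crossing chapter boundaries.
query = """
SELECT
    id,
    part,
    LAG(word, 2)  OVER (PARTITION BY part ORDER BY id) AS w1,
    LAG(word, 1)  OVER (PARTITION BY part ORDER BY id) AS w2,
    word AS w3,
    LEAD(word, 1) OVER (PARTITION BY part ORDER BY id) AS w4,
    LEAD(word, 2) OVER (PARTITION BY part ORDER BY id) AS w5
FROM text
ORDER BY id
"""
result = conn.execute(query).fetchall()

for row in result:
    print(row)
```

In this sketch the row for "xii" has null (None in Python) in w1 and w2, because PARTITION BY part prevents the window from reaching back into part 11. Likewise, the last two rows of each part have nulls in the trailing columns.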
