Moving window analysis
1. Moving window analysis
Congratulations on loading and manipulating text data! In this lesson you will keep working with the Sherlock Holmes ebook, and will now learn how to use a moving window in SQL.

2. The raw text
Here is a sample of the raw text.

3. The processed text
This dataset has already been processed to remove unwanted characters and put one word per row. An id column was added to identify the position of each word in the document.
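
The lesson provides this table ready-made. Purely as an illustration, a processing pipeline along the following lines could produce a similar table; the filename and DataFrame names are assumptions, and the chapter columns ("part", "title") are omitted from this sketch.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import (lower, regexp_replace, split, explode,
                                       monotonically_increasing_id)

    spark = SparkSession.builder.getOrCreate()

    # Load the raw ebook, one line of text per row (filename is an assumption).
    raw_df = spark.read.text("sherlock.txt").withColumnRenamed("value", "line")

    words_df = (
        raw_df
        # Remove unwanted characters and lowercase each line.
        .select(lower(regexp_replace("line", "[^a-zA-Z ]", "")).alias("line"))
        # Split each line into words and put one word per row.
        .select(explode(split("line", " ")).alias("word"))
        .where("word != ''")
        # Add an id marking each word's position in the document.
        # monotonically_increasing_id() is increasing but not necessarily consecutive.
        .withColumn("id", monotonically_increasing_id())
    )
    words_df.show(5)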

4. Partitions
The data is partitioned into 12 parts corresponding to chapters, each identified by its own "part" and "title" values. The distinct() operation eliminates duplicates, returning the unique records. This will allow us to easily parallelize our work across up to 12 machines in a cluster, or up to 12 cores of a CPU on a single machine.
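
A quick way to see those 12 chapters, assuming the processed table is registered as a temporary view named "text" (the view name is an assumption; the columns come from the lesson):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    text_df = spark.table("text")

    # distinct() drops duplicate rows, leaving one (part, title) row per chapter.
    text_df.select("part", "title").distinct().sort("part").show(truncate=False)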

5. MV0
Here is what the text looks like in the processed table. We learned earlier that a window function allows using the values of other rows without using complicated joins.

6. MV1
A sliding window is analogous to an actual window that opens from top to bottom.

7. MV2
When opening the window, its top edge moves down, as does its bottom edge.

8. MV3
You slide the window across the dataset,

9. MV4
calculating things using the data in the window.

10-13. MV5-MV13
(The animation continues, sliding the window down the dataset.)

14. The words are indexed
Take a mental snapshot of these rows.

15. A moving window query
This query prints the word of the current row as w1, the word from the following row as w2, and the word from the row after next as w3. The w1 column corresponds to the word column in the previous slide. As you move down the rows, each row gives a sliding window view of its own word and the words from the next two rows. The columns w1, w2, and w3 correspond to a sequence of 3 words, also called a 3-tuple. Direct your attention to the OVER clause. It uses a PARTITION BY clause. This allows Spark to distribute the partitions, each containing a separate chapter, to different workers, thereby running the query in parallel on multiple CPUs.
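
A sketch of such a query, assuming the processed table is registered as a temporary view named "text" (the view name is an assumption; the id, word, and part columns come from the lesson). LEAD(word, n) fetches the word n rows ahead within the partition.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    query = """
    SELECT
        word AS w1,
        LEAD(word, 1) OVER (PARTITION BY part ORDER BY id) AS w2,
        LEAD(word, 2) OVER (PARTITION BY part ORDER BY id) AS w3
    FROM text
    ORDER BY id
    """
    spark.sql(query).show(7, truncate=False)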

16. Moving window output
Let's read the first 7 rows of the w1 column: "the project gutenberg ebook of the adventures". Now let's read the first three rows across the columns: "the project gutenberg", "project gutenberg ebook", "gutenberg ebook of". This is a sliding window. It is useful for creating features for algorithms that take moving window sequence data as input.

17. LAG window function
Let's do a similar thing, this time using the LAG window function. The LAG function gets its values from a previous row.
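
A comparable query, again assuming the "text" view; the exact column layout is an assumption chosen to match the output described on the next slide (w3 is the current word, w2 the previous word, and w1 the word before that).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    query = """
    SELECT
        LAG(word, 2) OVER (PARTITION BY part ORDER BY id) AS w1,
        LAG(word, 1) OVER (PARTITION BY part ORDER BY id) AS w2,
        word AS w3
    FROM text
    ORDER BY id
    """
    spark.sql(query).show(7, truncate=False)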

18. LAG window function – output
The w1 and w2 column values are null in the first row, because there are no previous rows in the partition. Likewise, the w1 column value is null in the second row, because the LAG function cannot find a row two positions back from there.

19. Windows stay within partition
Suppose we look at the result for a later partition using WHERE part=2.
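
One way to write this, assuming the same "text" view: compute the window over the whole table, then filter the result to part 2, so any nulls at the top of the output show that the window did not reach back into partition 1.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    query = """
    SELECT *
    FROM (
        SELECT
            part,
            id,
            LAG(word, 2) OVER (PARTITION BY part ORDER BY id) AS w1,
            LAG(word, 1) OVER (PARTITION BY part ORDER BY id) AS w2,
            word AS w3
        FROM text
    ) AS windowed
    WHERE part = 2
    ORDER BY id
    """
    spark.sql(query).show(3, truncate=False)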

20. Windows stay within partition – output
Look at the first two rows. They contain null values in columns w1 and w2. The window does not go outside of its partition to look at rows in partition 1!

21. Repartitioning
Window SQL makes use of partitions via the PARTITION BY clause. DataFrames are partitioned automatically, without programmer intervention. However, it is sometimes useful to specify the partitioning scheme yourself to make the application more efficient, using knowledge about the application that Spark would not exploit automatically. Previously we learned how to repartition data. Don't hesitate to refer to the slides available at the right of the console if you forget how this is done. Recall that properly partitioning data allows Spark to parallelize operations more efficiently. Rows that are in the same partition are guaranteed to be on the same machine, so data won't be unnecessarily shuffled from one machine to another.
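
A brief sketch of that repartitioning, assuming the processed data is available as the "text" view; partitioning by the "part" column lines up with the PARTITION BY part used in the window queries above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    text_df = spark.table("text")

    # Hash-partition the rows by chapter into 12 partitions, so rows from the
    # same chapter land in the same partition and window queries partitioned
    # by "part" can run on up to 12 workers or cores in parallel.
    repartitioned = text_df.repartition(12, "part")
    print(repartitioned.rdd.getNumPartitions())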

22. Let's practice!
Now it's your turn to play with moving windows!