
Common word sequences

1. Common word sequences

Previously we learned about moving window queries, a powerful tool for handling sequential data. In this lesson you will learn how to use Spark SQL to find the most frequent word sequences in a natural language text document. Before we do that, let's review what we can do with what we've already covered. One powerful application is

2. Training

creating training sets

3. Predicting

for predictive models.

4. Endword prediction

A special case of this is when you are trying to predict a word from previous words in a sequence.

5. Sequence

Suppose you have a sequence of things.

6. Sequence last

And you want to predict the last item in this sequence,

7. The quick brown fox

such as the end word of a sequence of words.

8. Sentence bracket

You can use the previous words to predict the end word. In general we represent such items using symbolic "tokens". These tokens could represent something other than words,

9. Songs

such as song ids in a person's listening history,

10. Videos

or videos.

11. Categorical Data

Categorical data can take one of a limited number of possible values. It is sometimes also referred to as nominal data or qualitative data. Categories that are near each other in lexical order are not necessarily qualitatively similar. For example, "he" and "she" are both gender pronouns, but are far apart alphabetically. "He" and "hi" are close together alphabetically, but are not very similar qualitatively.

12. Categorical vs Ordinal

Categorical data generally have no logical order. When they do, they are called ordinal data.

13. Sequence Analysis

Another powerful class of applications involves sequence analysis.

14. Word preceding and following

Suppose you want to determine what words tend to appear together. You will now learn how to do just that. We will use the same dataset used in the previous video lesson.

15. 3-tuples

Here is a moving window query that we used in a previous lesson. This query gives all word sequences of length 3 in the text document. In general these tokens could be things other than words, such as song ids, or video ids. A more general way to refer to sequences of length 3 is "3-tuples". In each row, the columns w1, w2 and w3 identify a 3-tuple. Let's use this query as a subquery to tally the most common 3-tuples.
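
As a rough sketch, a query along these lines could be written as follows. The table name text and the columns word, id (word position), and part (the partition column from the previous lesson) are assumptions for illustration; your schema may differ.

# Hypothetical schema: a view named "text" with columns id, part, word
query = """
SELECT
    word AS w1,
    LEAD(word, 1) OVER (PARTITION BY part ORDER BY id) AS w2,
    LEAD(word, 2) OVER (PARTITION BY part ORDER BY id) AS w3
FROM text
"""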

16. A window function SQL as subquery

Here we use the previous moving window query as a subquery. The outer query groups on w1, w2, and w3, counting the number of occurrences of each 3-tuple. Ordering by count in descending order (ORDER BY count DESC) puts the most common 3-tuples first.
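
A sketch of this counting query, reusing the hypothetical table and columns from above as the subquery, might look like this:

# The inner SELECT is the moving window query; the outer query tallies each 3-tuple
query3agg = """
SELECT w1, w2, w3, COUNT(*) AS count
FROM (
    SELECT
        word AS w1,
        LEAD(word, 1) OVER (PARTITION BY part ORDER BY id) AS w2,
        LEAD(word, 2) OVER (PARTITION BY part ORDER BY id) AS w3
    FROM text
) AS tuples
GROUP BY w1, w2, w3
ORDER BY count DESC
"""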

17. A window function SQL as subquery – output

We are thus able to obtain the most common 3-tuples in this dataset in two statements, the first one defining a query, the second one running and displaying it!
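
In PySpark, for example, the two statements would be the query string defined above and a call that runs and displays it (assuming a SparkSession named spark and that the text view has been registered):

# Run the hypothetical query3agg defined above and show the top rows
spark.sql(query3agg).show(10, truncate=False)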

18. Most frequent 3-tuples

Here is the result of the previous query again, expanded so that you can see more of its rows. The 3-tuple "one of the" occurred 49 times. The 3-tuple "I think that" is the second most frequent, at 46 occurrences.

19. Another type of aggregation

We are not limited to counting. We can look at other aspects of the word sequences. This query finds the longest 3-tuples.
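
A sketch of such a query, using the same hypothetical table and columns, orders by the combined character length of the three words rather than by their count:

# Longest 3-tuples by total character length (hypothetical schema as before)
query3len = """
SELECT w1, w2, w3,
       length(w1) + length(w2) + length(w3) AS total_length
FROM (
    SELECT
        word AS w1,
        LEAD(word, 1) OVER (PARTITION BY part ORDER BY id) AS w2,
        LEAD(word, 2) OVER (PARTITION BY part ORDER BY id) AS w3
    FROM text
) AS tuples
ORDER BY total_length DESC
"""
spark.sql(query3len).show(10, truncate=False)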

20. Another type of aggregation

Again, note that Spark is able to run this query across multiple workers in parallel, without us having to tell it exactly how to do so. Because the data is partitioned, Spark is able to automatically parallelize the window function SQL.
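
The PARTITION BY clause in the window specification is what makes this possible: rows in different partitions can be processed independently on different workers. If you want to check or adjust the DataFrame's partitioning yourself, a minimal sketch (assuming a DataFrame named df and the same hypothetical part column) might be:

# Inspect how many partitions the hypothetical df currently has
print(df.rdd.getNumPartitions())
# Repartition on the window's PARTITION BY column so each group stays together
df = df.repartition("part")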

21. Let's practice

Now it's your turn to experiment!