1. Common word sequences
Previously, we learned about a powerful tool for handling sequential data.
In this lesson you will learn how to use Spark SQL to find the most frequent word sequences in a natural language text document. Before we do that, let's review what we can do with what we've already covered. One powerful application is
2. Training
creating training sets
3. Predicting
for predictive models.
4. Endword prediction
A special case of this is when you are trying to predict a word from previous words in a sequence.
5. Sequence
Suppose you have a sequence of things.
6. Sequence last
And you want to predict the last item in this sequence,
7. The quick brown fox
such as the end word of a sequence of words.
8. Sentence bracket
You can use the previous words to predict the end word. In general, we represent such items using symbolic "tokens". These tokens could represent things other than words,
9. Songs
such as song ids in a person's listening history,
10. Videos
or videos.
11. Categorical Data
Categorical data can take one of a limited number of possible values. This is sometimes also referred to as nominal data or qualitative data. Categories that are near each other in lexical order are not necessarily qualitatively similar. For example, "he" and "she" are both gendered pronouns, but are far apart alphabetically. "He" and "hi" are close together alphabetically, but are not very similar qualitatively.
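A quick way to see that point in plain Python:

    # Alphabetical (lexical) order does not track meaning:
    # "he" lands next to "hi", while "she" sorts away from "he".
    print(sorted(["she", "hi", "he"]))  # ['he', 'hi', 'she']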
12. Categorical vs Ordinal
Categorical data generally have no logical order.
When they do, they are called ordinal data.
13. Sequence Analysis
Another powerful class of applications involves sequence analysis.
14. Word preceding and following
Suppose you want to determine what words tend to appear together. You will now learn how to do just that. We will use the same dataset used in the previous video lesson.
15. 3-tuples
Here is a moving window query that we used in a previous lesson. This query gives all word sequences of length 3 in the text document. In general, these tokens could be things other than words, such as song ids or video ids. A more general way to refer to sequences of length 3 is "3-tuples". In each row, the columns w1, w2 and w3 identify a 3-tuple. Let's use this query as a subquery to tally the most common 3-tuples.
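To make this concrete, here is a minimal sketch of such a moving window query in PySpark. The view name text, the columns id and word, and the sample sentence are assumptions made for the sketch, not taken from the lesson's dataset.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical setup: one token per row, with a position column to order by.
    df = spark.createDataFrame(
        [(0, "the"), (1, "quick"), (2, "brown"), (3, "fox")],
        ["id", "word"],
    )
    df.createOrReplaceTempView("text")

    # Moving window: pair each word with the two words that follow it,
    # so each row identifies one 3-tuple (w1, w2, w3).
    query = """
    SELECT word AS w1,
           LEAD(word, 1) OVER (ORDER BY id) AS w2,
           LEAD(word, 2) OVER (ORDER BY id) AS w3
    FROM text
    """
    spark.sql(query).show()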
16. A window function SQL as subquery
Here we use the previous moving window query as a subquery. The outer query groups on w1, w2, and w3, counting the number of occurrences of each 3-tuple. ORDER BY count DESC puts the most common 3-tuples first.
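Continuing the sketch above (same hypothetical text view), the grouped version might look like this; the subquery alias sub is an assumption:

    # Tally each 3-tuple: the window query becomes a subquery,
    # the outer query groups and counts, most frequent first.
    query = """
    SELECT w1, w2, w3, COUNT(*) AS count
    FROM (
        SELECT word AS w1,
               LEAD(word, 1) OVER (ORDER BY id) AS w2,
               LEAD(word, 2) OVER (ORDER BY id) AS w3
        FROM text
    ) AS sub
    GROUP BY w1, w2, w3
    ORDER BY count DESC
    """
    result = spark.sql(query)  # first statement: defines the query (lazily)
    result.show()              # second statement: runs it and displays rows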
17. A window function SQL as subquery – output
We are thus able to obtain the most common 3-tuples in this dataset with just two statements: the first defines the query, and the second runs and displays it!
18. Most frequent 3-tuples
Here is the result of the previous query again, expanded so that you can see more of its rows. The 3-tuple "one of the" occurred 49 times. The 3-tuple "I think that" is the second most frequent, at 46 occurrences.
19. Another type of aggregation
We are not limited to counting. We can look at other aspects of the word sequences.
This query finds the longest 3-tuples.
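One plausible version of such a query, continuing the sketch above. Here "longest" is taken to mean the greatest total character length of the three words, which is an assumption about the lesson's definition:

    # Another aggregation over the same 3-tuples: order by total character
    # length instead of frequency. (In Spark SQL, DESC puts NULLs last, so
    # the incomplete tuples at the end of the text sort to the bottom.)
    query = """
    SELECT w1, w2, w3,
           LENGTH(w1) + LENGTH(w2) + LENGTH(w3) AS length
    FROM (
        SELECT word AS w1,
               LEAD(word, 1) OVER (ORDER BY id) AS w2,
               LEAD(word, 2) OVER (ORDER BY id) AS w3
        FROM text
    ) AS sub
    ORDER BY length DESC
    """
    spark.sql(query).show()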
20. Another type of aggregation
Again, note that Spark runs this query across multiple workers in parallel, without us having to tell it exactly how. Because the data is partitioned, Spark can parallelize the window function SQL automatically.
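You can peek at that partitioning yourself; a small sketch, again assuming the hypothetical text view from above:

    # Each partition can be processed by a different worker in parallel.
    words = spark.table("text")
    print(words.rdd.getNumPartitions())

    # Repartitioning changes how the rows are spread across workers.
    print(words.repartition(4).rdd.getNumPartitions())  # 4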
21. Let's practice
Now it's your turn to experiment!