1. The bag-of-words representation
Storing text in various forms is important; you could use the corpus or the tibble approach.
Regardless, it's time we actually do something with our text.
2. The previous example
We have already seen
how to tokenize and count words within a piece of text.
The result is a tibble of words and their counts. This is a great way to manually explore the results, but what if we wanted to use this information to do some analysis?
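As a quick reminder, a minimal sketch of that tokenize-and-count step with tidytext might look like this (the tibble and its columns are made up for illustration):

library(dplyr)
library(tidytext)

# A hypothetical two-document tibble
text_df <- tibble(id = 1:2,
                  text = c("A few words to count.",
                           "All the words, all of them."))

text_df %>%
  unnest_tokens(word, text) %>%   # tokenize: one word per row
  count(word, sort = TRUE)        # a tibble of words and their counts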
3. The bag-of-words representation
The bag-of-words representation uses vectors to specify which words are in each text.
Consider the following three texts. There are not many words here, but to create a bag-of-words representation, we find the unique words and then convert each text into a vector representation.
4. Typical vector representations
The first step is to create a clean vector of the unique words used across all of the texts. Although the details are project dependent, this usually means lowercasing the words and removing stop words. Removing punctuation and stemming words are both optional, but they are generally good ideas.
We then convert each text into a binary representation of which words are in that text. For text 1, since "few" is in the text, the first entry in text1_vector is a 1. Similarly, "all" is not in text 1, so the second entry is a 0. Although this is a binary representation, we could have used word counts instead of 1's and 0's.
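A minimal sketch of this binary conversion in base R, using made-up texts and variable names:

# Three short, hypothetical texts
texts <- c(text1 = "a few words",
           text2 = "all the words",
           text3 = "a few of all the words")

# Step 1: tokenize on whitespace and collect the unique vocabulary
tokens <- strsplit(tolower(texts), "\\s+")
vocabulary <- sort(unique(unlist(tokens)))

# Step 2: one binary vector per text, 1 if the word appears and 0 otherwise
text1_vector <- as.integer(vocabulary %in% tokens[[1]])

# Using table() on each text's tokens instead of the binary test
# would give word counts rather than 1's and 0's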
5. tidytext representation
tidytext's representation is slightly different. To find the words that appear in each chapter,
we can chain unnest_tokens, anti_join, and count.
The result is a tibble of word counts by chapter, sorted from most to least common.
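Assuming the chapters live in a tibble called animal_farm with chapter and text columns (names assumed for illustration), the pipeline would look something like:

library(dplyr)
library(tidytext)

chapter_words <- animal_farm %>%
  unnest_tokens(word, text) %>%            # tokenize: one word per row
  anti_join(stop_words, by = "word") %>%   # remove stop words
  count(chapter, word, sort = TRUE)        # word counts by chapter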
6. One word example
In the original vector representation, a document that does not contain a word receives a 0 for that word. In this representation, chapter-and-word pairs that do not occur are simply left out. Napoleon, interestingly enough, is a main character but is not mentioned in chapter 1.
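Filtering the assumed chapter_words tibble from above for a single word makes this concrete: chapters where the word never appears contribute no row at all, rather than a row with a 0 count.

chapter_words %>%
  filter(word == "napoleon")   # chapter 1 is simply absent from the result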
7. Sparse matrices
Let's take a quick look at problems with this type of representation.
In our Russian tweet dataset, we have only 20,000 out of over 3 million original tweets.
These contain over 43,000 unique non-stop words. Tweets are short, with most containing only a handful of words.
8. Sparse matrices continued
With 20,000 tweets and 43,000 words, we would create what is known as a sparse matrix. In our case, we would need 860 million values, but only 177,000 would be nonzero.
Consider this sparse matrix example. Representing text as vectors can use a lot of computational resources, and there are very few nonzero entries. However, the R packages we are using, such as tidytext and tm, relieve this burden by storing the values in a smart and efficient manner.
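One concrete way this works: tidytext can cast the word counts into a sparse document-term matrix backed by the tm package, which stores only the nonzero entries. A sketch, reusing the assumed chapter_words tibble from earlier:

library(tidytext)

# Cast the per-chapter word counts into a sparse tm DocumentTermMatrix
dtm <- cast_dtm(chapter_words, chapter, word, n)

# Printing the object reports its dimensions and sparsity,
# rather than materializing every cell
dtm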
9. BoW Practice
Let's practice creating a bag-of-words representation for our text.