
The bag-of-words representation

1. The bag-of-words representation

Storing text in various forms is important; you could use the corpus or the tibble approach. Regardless, it's time we actually do something with our text.

2. The previous example

We have already seen how to tokenize and count words within a piece of text. The result is a tibble of words and their counts. This is a great way to manually explore the results, but what if we wanted to use this information to do some analysis?

3. The bag-of-words representation

The bag-of-words representation uses vectors to specify which words are in each text. Consider the following three texts. There are not many words here, but to create a bag-of-words representation, we find the unique words, and then convert this into vector representations.

4. Typical vector representations

The first step is to create a clean vector of the unique words used across all of the texts. Although this is project dependent, it usually means using lowercased words with no stop words. Removing punctuation and stemming words are both optional, but they are generally good ideas. We then convert each text into a binary representation of which words are in that text. For text 1, since "few" is in the text, the first entry in text1_vector is a 1. Similarly, "all" is not in text 1, so the second entry is a 0. Although this is a binary representation, we could have used word counts instead of 1's and 0's.
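The steps above can be sketched in base R. This is a minimal illustration with three made-up texts (the texts and variable names are hypothetical, not the course data), skipping stop-word removal and stemming for brevity:

```r
# Three short example texts (hypothetical)
texts <- c("a few words", "all the words", "a few more")

# Tokenize by lowercasing and splitting on whitespace
tokens <- strsplit(tolower(texts), "\\s+")

# The clean vector of unique words across all texts
vocab <- sort(unique(unlist(tokens)))

# Convert each text into a 0/1 vector indicating which vocabulary
# words it contains; rows are texts, columns are words
text_vectors <- t(sapply(tokens, function(tok) as.integer(vocab %in% tok)))
colnames(text_vectors) <- vocab
text_vectors
```

With these texts, the vocabulary is "a", "all", "few", "more", "the", "words", and the first row has a 1 for "few" and a 0 for "all", mirroring the text1_vector example.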

5. tidytext representation

tidytext's representation is slightly different. To find the words that appear in each chapter, we can use the process of unnest_tokens, anti_join, and count. The result is a tibble of word counts by chapter, sorted from most to least common.

6. One word example

In the original vector representation, a document that does not contain a word receives a 0 for that word. In this representation, chapter and word pairs that do not exist are simply left out. Napoleon, interestingly enough, is a main character but is not mentioned in chapter 1.

7. Sparse matrices

Let's take a quick look at problems with this type of representation. In our Russian tweet dataset, we have only 20,000 out of over 3 million original tweets. These contain over 43,000 unique non-stop words. Tweets are short, with most containing only a handful of words.

8. Sparse matrices continued

With 20,000 tweets and 43,000 words, we would create what is known as a sparse matrix. In our case, we would need 860 million values, but only about 177,000 would be non-zero. Consider this sparse matrix example: very few entries are non-zero, yet representing text as full vectors would still use a lot of computational resources. The R packages we are using, however, such as tidytext and tm, relieve this burden by storing the values in a smart and efficient manner.
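The arithmetic behind that claim is quick to check (using the figures quoted above):

```r
n_docs  <- 20000    # tweets
n_words <- 43000    # unique non-stop words
total_cells <- n_docs * n_words   # 860 million cells in the full matrix
nonzero <- 177000                 # cells that actually hold a count

# Fraction of the matrix that is filled
sparsity <- nonzero / total_cells
sparsity   # about 0.0002, i.e. roughly 0.02% of cells are non-zero
```

So a dense representation would waste memory on hundreds of millions of zeros, which is why sparse storage only records the non-zero entries.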

9. BoW Practice

Let's practice creating a bag-of-words representation for our text.