Text Mining with Bag-of-Words in R

Exercise

How do bigrams affect word clouds?

Now that you have made a bigram DTM, you can examine it and remake a word cloud. The new tokenization method affects not only the matrices but also any visuals or modeling based on the matrices.

Remember how "Marvin" and "Gaye" were separate terms in the chardonnay word cloud? With bigram tokenization, every pair of adjacent words becomes a single term, so "marvin gaye" can appear as one entry. Observe what happens to the word cloud in this exercise.
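To make the idea of adjacent-word pairs concrete, here is a minimal base R sketch. The helper name make_bigrams() and the sample sentence are hypothetical and are not part of the exercise, which relies on a tokenizer supplied when the DTM was built.

```r
# Hypothetical helper: lower-case a string, split it on whitespace,
# and paste each word together with the word that follows it.
make_bigrams <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[words != ""]            # drop empty tokens from leading spaces
  if (length(words) < 2) {
    return(character(0))                 # a single word has no bigrams
  }
  paste(head(words, -1), tail(words, -1))
}

toy_bigrams <- make_bigrams("Marvin Gaye loves chardonnay")
print(toy_bigrams)
# "marvin gaye" "gaye loves" "loves chardonnay"
```

A sentence of n words yields n - 1 bigrams, which is why a bigram DTM usually has many more columns than a unigram DTM built from the same text.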

This exercise uses str_subset() from stringr. Keep in mind that other DataCamp courses cover regular expressions in more detail. As a reminder, the regular expression anchor ^ matches the start of a string, so the pattern "^marvin" matches only bigrams that begin with "marvin".
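As a quick illustration of the ^ anchor, the sketch below runs str_subset() on a small made-up vector. The vector toy_bi_words is an assumption for demonstration only; the real bigrams come from the chardonnay tweets. It assumes the stringr package is installed.

```r
library(stringr)

# Made-up bigrams standing in for the exercise's terms
toy_bi_words <- c("marvin gaye", "gaye chardonnay", "like marvin", "marvin said")

# Keep only the strings that start with "marvin"
starts_with_marvin <- str_subset(toy_bi_words, "^marvin")
print(starts_with_marvin)
# "marvin gaye" "marvin said"
```

Note that "like marvin" is excluded: without the anchor, the pattern "marvin" would match it as well.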

Instructions

The chardonnay tweets have been cleaned and organized into a DTM called bigram_dtm.

  • Create bigram_dtm_m by converting bigram_dtm to a matrix.
  • Create an object freq containing the frequency of each bigram by applying colSums() to bigram_dtm_m.
  • Extract the character vector of word combinations with names(freq) and assign the result to bi_words.
  • Pass bi_words to str_subset() with the matching pattern "^marvin" to review all bigrams starting with "marvin".
  • Plot a simple word cloud by passing bi_words, freq, and max.words = 15 to wordcloud().
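The steps above can be sketched on toy data as follows. This is not the exercise solution: the small matrix is a hypothetical stand-in for the result of as.matrix(bigram_dtm), and min.freq = 1 is added only because wordcloud() defaults to min.freq = 3, which would hide most of the toy terms. The stringr and wordcloud calls run only if those packages are installed.

```r
# Toy stand-in for as.matrix(bigram_dtm): rows are tweets, columns are bigrams
bigram_dtm_m <- matrix(
  c(1, 0, 1,
    1, 1, 0,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Docs  = c("1", "2", "3"),
    Terms = c("marvin gaye", "gaye chardonnay", "little chardonnay")
  )
)

# Column sums give the total count of each bigram across all tweets
freq <- colSums(bigram_dtm_m)

# The names of freq are the bigrams themselves
bi_words <- names(freq)

# Review the bigrams that start with "marvin"
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_subset(bi_words, "^marvin"))
}

# Plot the word cloud; min.freq = 1 is an adjustment for the toy data only
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(bi_words, freq, max.words = 15, min.freq = 1)
}
```

In the exercise itself, the first step would be as.matrix(bigram_dtm) rather than a hand-built matrix, and the remaining calls follow the same order.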