How do bigrams affect word clouds?

Now that you have made a bigram DTM, you can examine it and remake a word cloud. The new tokenization method affects not only the matrices but also any visuals or modeling based on the matrices.

Remember how "Marvin" and "Gaye" were separate terms in the chardonnay word cloud? Bigram tokenization treats every pair of adjacent words as a single term, so "marvin gaye" can appear as one token. Observe what happens to the word cloud in this exercise.
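To make the idea of two-word combinations concrete, here is a minimal base R sketch that pairs each word with the one that follows it. The tweet text is invented for illustration, and the course itself may build bigrams with a different tokenizer.

```r
# Toy tweet, invented for illustration (not from the chardonnay data)
tweet <- "marvin gaye loves chardonnay"

# Split on whitespace to get the individual words (unigrams)
unigrams <- strsplit(tweet, "\\s+")[[1]]

# Pair each word with its successor to form bigrams
bigrams <- paste(head(unigrams, -1), tail(unigrams, -1))

print(bigrams)
# "marvin gaye" "gaye loves" "loves chardonnay"
```

A sentence of n words yields n - 1 bigrams, which is why a bigram DTM usually has many more columns than a unigram DTM built from the same text.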

This exercise uses str_subset() from stringr. Other DataCamp courses cover regular expressions in more detail. As a reminder, the regular expression ^ anchors a match to the start of a string, so the pattern "^marvin" matches only bigrams that begin with "marvin".
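As a quick illustration of the ^ anchor, the sketch below compares an anchored and an unanchored pattern. The candidate strings are made up for the example and are not taken from the exercise data.

```r
library(stringr)

# Hypothetical bigrams, invented for illustration
candidates <- c("marvin gaye", "gaye marvin", "marvin sapp", "love marvin")

# Anchored: keeps only strings that start with "marvin"
starts_with_marvin <- str_subset(candidates, "^marvin")
print(starts_with_marvin)
# "marvin gaye" "marvin sapp"

# Unanchored: keeps any string containing "marvin" anywhere
contains_marvin <- str_subset(candidates, "marvin")
print(contains_marvin)
# "marvin gaye" "gaye marvin" "marvin sapp" "love marvin"
```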

This exercise is part of the course

Text Mining with Bag-of-Words in R

Exercise instructions

The chardonnay tweets have been cleaned and organized into a DTM called bigram_dtm.

  • Create bigram_dtm_m by converting bigram_dtm to a matrix.
  • Create an object freq containing the word frequencies by applying colSums() to bigram_dtm_m.
  • Extract the character vector of word combinations with names(freq) and assign the result to bi_words.
  • Pass bi_words to str_subset() with the matching pattern "^marvin" to review all bigrams starting with "marvin".
  • Plot a simple word cloud by passing bi_words, freq, and max.words = 15 to wordcloud().

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create bigram_dtm_m
___ <- ___(___)

# Create freq
___ <- ___(___)

# Create bi_words
___ <- ___(___)

# Examine part of bi_words
___(___, ___)

# Plot a word cloud
___(___, ___, ___)
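
For reference, the steps in the instructions might look like the sketch below when run end to end on a toy corpus. Several pieces are assumptions rather than course material: the three tweets are invented, bigram_tokenizer is adapted from the pattern in the tm FAQ and may differ from how the course builds bigram_dtm, and min.freq = 1 is added only because the toy corpus is too small for wordcloud()'s default minimum frequency.

```r
library(tm)         # VCorpus(), VectorSource(), DocumentTermMatrix()
library(wordcloud)  # wordcloud()
library(stringr)    # str_subset()

# ASSUMPTION: invented tweets standing in for the cleaned chardonnay corpus
toy_tweets <- c(
  "marvin gaye loves chardonnay",
  "marvin gaye and chardonnay",
  "little chardonnay tonight"
)

# ASSUMPTION: bigram tokenizer following the tm FAQ pattern;
# the course may construct bigram_dtm with a different tokenizer
bigram_tokenizer <- function(x) {
  unlist(lapply(NLP::ngrams(NLP::words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

# Stand-in for the bigram_dtm that the exercise provides
bigram_dtm <- DocumentTermMatrix(
  VCorpus(VectorSource(toy_tweets)),
  control = list(tokenize = bigram_tokenizer)
)

# Create bigram_dtm_m: documents in rows, bigrams in columns
bigram_dtm_m <- as.matrix(bigram_dtm)

# Create freq: total count of each bigram across all documents
freq <- colSums(bigram_dtm_m)

# Create bi_words: the bigram labels themselves
bi_words <- names(freq)

# Examine part of bi_words: bigrams that start with "marvin"
str_subset(bi_words, "^marvin")

# Plot a word cloud; min.freq = 1 is needed only for this tiny toy corpus
wordcloud(bi_words, freq, max.words = 15, min.freq = 1)
```

On the real bigram_dtm, the call from the instructions, wordcloud(bi_words, freq, max.words = 15), should be enough, since frequent bigrams will clear the default frequency threshold.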