How do bigrams affect word clouds?
Now that you have made a bigram DTM, you can examine it and remake a word cloud. The new tokenization method affects not only the matrices but also any visuals or modeling based on the matrices.
Remember how "Marvin" and "Gaye" were separate terms in the chardonnay word cloud? Bigram tokenization grabs all two-word combinations, so the two names can now appear together as a single term. Observe what happens to the word cloud in this exercise.
This exercise uses str_subset() from stringr. Other DataCamp courses cover regular expressions in more detail. As a reminder, the regular expression ^ anchors a match to the start of a string, so the pattern "^marvin" matches only the bigrams that begin with "marvin".
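To make the effect of the ^ anchor concrete, here is a minimal sketch on a made-up vector of bigrams. The vector toy_bigrams and its contents are assumptions for illustration only and are not the exercise's data.

```r
library(stringr)

# Hypothetical bigrams, invented for this illustration
toy_bigrams <- c("marvin gaye", "gaye marvin", "marvin sang", "chardonnay wine")

# With the ^ anchor: keep only strings that START with "marvin"
starts_with_marvin <- str_subset(toy_bigrams, "^marvin")

# Without the anchor: keep strings that contain "marvin" anywhere
contains_marvin <- str_subset(toy_bigrams, "marvin")

print(starts_with_marvin)
print(contains_marvin)
```

Without the anchor, "gaye marvin" is also returned, which is why the exercise pattern includes ^.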
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
The chardonnay tweets have been cleaned and organized into a DTM called bigram_dtm.
- Create bigram_dtm_m by converting bigram_dtm to a matrix.
- Create an object freq consisting of the word frequencies by applying colSums() on bigram_dtm_m.
- Extract the character vector of word combinations with names(freq) and assign the result to bi_words.
- Pass bi_words to str_subset() with the matching pattern "^marvin" to review all bigrams starting with "marvin".
- Plot a simple wordcloud() passing bi_words, freq and max.words = 15 into the function.
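The mechanics of these steps can be sketched on a tiny hand-built matrix. This is not the exercise's data: toy_m and its counts are invented, and it starts from a plain matrix rather than from a real DTM. The plotting call sets min.freq = 1 only because the toy counts are small (wordcloud() drops terms below its default minimum frequency of 3), and it is guarded so the sketch still runs if the wordcloud package is not installed.

```r
# Hypothetical document-by-bigram count matrix (3 documents, 3 bigrams)
toy_m <- matrix(
  c(1, 0, 1,
    1, 1, 0,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Docs  = c("1", "2", "3"),
    Terms = c("marvin gaye", "gaye chardonnay", "chardonnay wine")
  )
)

# Column sums give the total count of each bigram across all documents
freq <- colSums(toy_m)

# The names of the frequency vector are the bigrams themselves
bi_words <- names(freq)

print(freq)

# Plot only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(bi_words, freq, max.words = 15, min.freq = 1)
}
```

Because freq is a named vector, bi_words and freq stay aligned element by element, which is what wordcloud() expects for its first two arguments.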
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create bigram_dtm_m
___ <- ___(___)
# Create freq
___ <- ___(___)
# Create bi_words
___ <- ___(___)
# Examine part of bi_words
___(___, ___)
# Plot a word cloud
___(___, ___, ___)