How do bigrams affect word clouds?
Now that you have made a bigram DTM, you can examine it and remake a word cloud. The new tokenization method affects not only the matrices but also any visuals or modeling based on the matrices.
Remember how "Marvin" and "Gaye" were separate terms in the chardonnay word cloud? Bigram tokenization grabs all two-word combinations, so the two names can now appear together as a single term. Observe what happens to the word cloud in this exercise.
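To make "all two-word combinations" concrete, here is a minimal base R sketch. The helper make_bigrams() is a hypothetical name used only for illustration; it is not the tokenizer the course used to build bigram_dtm, and it assumes words are separated by whitespace.

```r
# Hypothetical helper for illustration only (not the course's tokenizer):
# lowercase the text, split it on whitespace, and pair each word with the
# word that follows it.
make_bigrams <- function(text) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) < 2) {
    return(character(0))  # fewer than two words means no bigrams
  }
  paste(head(words, -1), tail(words, -1))
}

bigrams <- make_bigrams("Marvin Gaye sings about chardonnay")
print(bigrams)
```

Each word except the last starts one bigram, so a five-word sentence yields four bigrams, and "marvin gaye" stays together as one term.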
This exercise uses str_subset() from stringr. Keep in mind, other DataCamp courses cover regular expressions in more detail. As a reminder, the regular expression ^ matches the starting position of a string, which here means the start of each bigram.
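As a quick illustration of the ^ anchor, the sketch below compares an anchored and an unanchored pattern. The bigrams vector is made-up sample data, not taken from the chardonnay tweets.

```r
library(stringr)

# Made-up sample bigrams for illustration
bigrams <- c("marvin gaye", "gaye marvin", "marvin sapp", "chardonnay wine")

# Anchored: keeps only strings that begin with "marvin"
starts_with_marvin <- str_subset(bigrams, "^marvin")

# Unanchored: keeps any string containing "marvin" anywhere
contains_marvin <- str_subset(bigrams, "marvin")

print(starts_with_marvin)
print(contains_marvin)
```

The anchored pattern drops "gaye marvin" because the match has to begin at the first character of the string.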
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
The chardonnay tweets have been cleaned and organized into a DTM called bigram_dtm.
- Create bigram_dtm_m by converting bigram_dtm to a matrix.
- Create an object freq consisting of the word frequencies by applying colSums() on bigram_dtm_m.
- Extract the character vector of word combinations with names(freq) and assign the result to bi_words.
- Pass bi_words to str_subset() with the matching pattern "^marvin" to review all bigrams starting with "marvin".
- Plot a simple wordcloud() passing bi_words, freq and max.words = 15 into the function.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create bigram_dtm_m
___ <- ___(___)
# Create freq
___ <- ___(___)
# Create bi_words
___ <- ___(___)
# Examine part of bi_words
___(___, ___)
# Plot a word cloud
___(___, ___, ___)
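If it helps to see the same sequence of steps on something small first, here is a rough sketch on a hand-made matrix. Everything in it is an assumption for illustration: toy_dtm_m and its counts are invented, and it stands in for the matrix you would get by converting bigram_dtm in the exercise. The min.freq = 1 argument is also an addition, included because wordcloud() by default leaves out terms that occur fewer than 3 times, which would hide most of this tiny example.

```r
library(stringr)
library(wordcloud)

# Invented toy data: 3 documents (rows) by 5 bigrams (columns).
# In the exercise, the matrix comes from converting bigram_dtm instead.
toy_dtm_m <- matrix(
  c(2, 0, 1, 1, 1,
    1, 1, 2, 0, 1,
    1, 0, 1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("marvin gaye", "marvin sapp", "chardonnay wine",
                          "glass of", "of chardonnay"))
)

# Column sums give the total count of each bigram across documents
toy_freq <- colSums(toy_dtm_m)

# The names of the frequency vector are the bigrams themselves
toy_words <- names(toy_freq)

# Review the bigrams that start with "marvin"
marvin_bigrams <- str_subset(toy_words, "^marvin")
print(marvin_bigrams)

# Plot the cloud; set.seed() only makes the random layout repeatable
set.seed(1)
wordcloud(toy_words, toy_freq, max.words = 15, min.freq = 1)
```

The exercise itself follows the same order of steps, applied to the real bigram_dtm rather than this toy matrix.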