Polarized tag cloud
Commonality clouds show words that are shared across documents. One interesting thing that they can't show you is which of those words appear more commonly in one document compared to another. For this, you need a pyramid plot; these can be generated using pyramid.plot() from the plotrix package.
First, some manipulation is required to get the data in a suitable form. This is most easily done by converting it to a data frame and using dplyr. Given a matrix of word counts, as created by as.matrix(tdm), you need to end up with a data frame with three columns:
- The words contained in each document.
- The counts of those words from document 1.
- The counts of those words from document 2.
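The three-column layout described above can be sketched on a toy matrix. This is only an illustration: the words, the counts, and the names word_count_m, count1 and count2 are assumptions made for the example, not data from the exercise.

```r
library(tibble)  # as_tibble()
library(dplyr)   # %>% and rename()

# Toy word-count matrix standing in for as.matrix(tdm); the counts are invented.
word_count_m <- matrix(
  c(12, 3,
     7, 9,
     0, 5),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("wine", "cup", "latte"), c("chardonnay", "coffee"))
)

# Move the row names (the words) into a column and give the two
# count columns generic names.
word_count_data <- word_count_m %>%
  as_tibble(rownames = "word") %>%
  rename(count1 = chardonnay, count2 = coffee)

print(word_count_data)
```

Passing rownames = "word" is what keeps the words, since a matrix stores them as row names rather than as a column.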
Then call pyramid.plot() as follows:
pyramid.plot(word_count_data$count1, word_count_data$count2, word_count_data$word)
There are some additional arguments to improve the cosmetic appearance of the plot.
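As a rough sketch of what such a call might look like with a few cosmetic arguments set, assuming a small hand-made word_count_data data frame (the words, counts, titles and gap value are invented for illustration):

```r
library(plotrix)

# Hypothetical data in the three-column form described earlier.
word_count_data <- data.frame(
  word   = c("wine", "glass", "cup"),
  count1 = c(12, 8, 3),
  count2 = c(2, 4, 9)
)

pyramid.plot(
  word_count_data$count1,               # bars drawn to the left
  word_count_data$count2,               # bars drawn to the right
  labels = word_count_data$word,        # labels shown in the central gap
  top.labels = c("Document 1", "Words", "Document 2"),
  main = "Words in common",
  unit = "count",                       # label for the horizontal axis
  gap = 4                               # width of the central gap, in axis units
)
```

The gap value usually needs adjusting by eye so that the longest word fits between the two sets of bars.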
Now you'll explore words that are common in chardonnay tweets, but rare in coffee tweets. all_tdm_m is created for you.
This exercise is part of the course Text Mining with Bag-of-Words in R.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
top25_df <- all_tdm_m %>%
  # Convert to data frame
  as_tibble(rownames = "___") %>%
  # Keep rows where word appears everywhere
  filter(if_all(everything(), ___)) %>%
  # Get difference in counts
  mutate(difference = ___) %>%
  # Keep rows with biggest difference
  slice_max(___, n = ___) %>%
  # Arrange by descending difference
  arrange(___(___))