
Polarized tag cloud

Commonality clouds show words that are shared across documents. One interesting thing that they can't show you is which of those words appear more commonly in one document compared to another. For this, you need a pyramid plot; these can be generated using pyramid.plot() from the plotrix package.

First, some manipulation is required to get the data in a suitable form. This is most easily done by converting it to a data frame and using dplyr. Given a matrix of word counts, as created by as.matrix(tdm), you need to end up with a data frame with three columns:

  • The words contained in each document.
  • The counts of those words from document 1.
  • The counts of those words from document 2.
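As a minimal sketch of that reshaping step, the snippet below builds a small made-up count matrix in place of as.matrix(tdm) and converts it into the three-column form. The matrix values, the name toy_tdm_m, and the column names count1 and count2 are assumptions chosen for illustration; they are not the course data.

```r
library(dplyr)

# Hypothetical word-count matrix standing in for as.matrix(tdm).
# Rows are words, columns are documents; the counts are made up.
toy_tdm_m <- matrix(
  c(10, 2,
     4, 9,
     0, 5,
     7, 7),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("glass", "cup", "morning", "love"),
    c("chardonnay", "coffee")
  )
)

word_count_data <- toy_tdm_m %>%
  # Move the row names (the words) into a regular column
  as_tibble(rownames = "word") %>%
  # Rename the document columns to the generic names used in the text
  rename(count1 = chardonnay, count2 = coffee)

word_count_data
```

The result has one row per word, with the word itself in the first column and one count column per document.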

Then call pyramid.plot() as follows:

pyramid.plot(word_count_data$count1, word_count_data$count2, word_count_data$word)

There are some additional arguments to improve the cosmetic appearance of the plot.
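As a rough illustration of those cosmetic arguments, the sketch below calls pyramid.plot() with labels, top.labels, main, unit, and gap set explicitly. The data frame and its counts are made up, and the label text and gap width are arbitrary choices rather than values prescribed by the exercise.

```r
library(plotrix)

# Hypothetical counts, for illustration only
word_count_data <- data.frame(
  word   = c("glass", "wine", "love"),
  count1 = c(10, 6, 7),
  count2 = c(2, 1, 7)
)

pyramid.plot(
  word_count_data$count1,            # bars drawn to the left
  word_count_data$count2,            # bars drawn to the right
  labels = word_count_data$word,     # text shown in the central gap
  top.labels = c("Document 1", "Words", "Document 2"),
  main = "Words in Common",
  unit = NULL,                       # drop the default "%" axis label
  gap = 4                            # room in the middle for the labels
)
```

A larger gap leaves more room for long words in the middle column; it is measured in the same units as the counts, so a sensible value depends on the scale of the data.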

Now you'll explore words that are common in chardonnay tweets, but rare in coffee tweets. all_tdm_m is created for you.

This exercise is part of the course

Text Mining with Bag-of-Words in R


Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

top25_df <- all_tdm_m %>%
  # Convert to data frame
  as_tibble(rownames = "___") %>% 
  # Keep rows where word appears everywhere
  filter(if_all(everything(), ___)) %>% 
  # Get difference in counts
  mutate(difference = ___) %>% 
  # Keep rows with biggest difference
  slice_max(___,  n = ___) %>% 
  # Arrange by descending difference
  arrange(___(___))
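
For reference, here is one possible way the same sequence of steps could look on a small made-up matrix. This is a sketch under assumptions, not the exercise's official solution: the counts, the names toy_tdm_m and top_df, and n = 2 are all hypothetical, and the filter names the two count columns explicitly instead of using everything().

```r
library(dplyr)

# Hypothetical toy counts; the real all_tdm_m comes from the course data
toy_tdm_m <- matrix(
  c(10, 2,
     4, 9,
     0, 5,
     7, 7,
     6, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("glass", "cup", "morning", "love", "wine"),
    c("chardonnay", "coffee")
  )
)

top_df <- toy_tdm_m %>%
  # Convert to a data frame, keeping the words as a column
  as_tibble(rownames = "word") %>%
  # Keep words that occur at least once in both documents
  filter(if_all(c(chardonnay, coffee), ~ .x > 0)) %>%
  # Positive values mean the word is more common in chardonnay tweets
  mutate(difference = chardonnay - coffee) %>%
  # Keep the rows with the largest difference
  slice_max(difference, n = 2) %>%
  # Arrange by descending difference
  arrange(desc(difference))

top_df
```

On this toy input, "morning" is dropped because it never appears in the chardonnay column, and the two words kept are the ones with the largest chardonnay-minus-coffee difference.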