Exercise

Polarized tag cloud

Commonality clouds show words that are shared across documents. One interesting thing that they can't show you is which of those words appear more commonly in one document compared to another. For this, you need a pyramid plot; these can be generated using pyramid.plot() from the plotrix package.

First, some manipulation is required to get the data in a suitable form. This is most easily done by converting it to a data frame and using dplyr. Given a matrix of word counts, as created by as.matrix(tdm), you need to end up with a data frame with three columns:

  • The words contained in each document.
  • The counts of those words from document 1.
  • The counts of those words from document 2.
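
As a minimal sketch of that target shape, the snippet below builds the three-column data frame from a small made-up matrix. The name tdm_m, the words, and the counts are illustrative assumptions, and base R is used here only to keep the sketch free of package dependencies.

```r
# Toy term-document matrix: rows are words, columns are documents.
# The words and counts are invented for illustration.
tdm_m <- matrix(
  c(10, 2,
     4, 7,
     0, 5),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("glass", "cup", "morning"), c("doc1", "doc2"))
)

# One column of words plus one count column per document.
word_count_data <- data.frame(
  word   = rownames(tdm_m),
  count1 = tdm_m[, "doc1"],
  count2 = tdm_m[, "doc2"],
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(word_count_data)
```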

Then call pyramid.plot() using

pyramid.plot(word_count_data$count1, word_count_data$count2, word_count_data$word)

There are some additional arguments to improve the cosmetic appearance of the plot.
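
For instance, a call with a few of those cosmetic arguments might look like the sketch below. The data frame is a made-up stand-in, and the labels, title, and gap value are arbitrary choices for illustration, not settings prescribed by the exercise.

```r
library(plotrix)

# Made-up counts standing in for real word frequencies.
word_count_data <- data.frame(
  word   = c("glass", "cup", "morning"),
  count1 = c(10, 4, 1),
  count2 = c(2, 7, 5)
)

pyramid.plot(
  word_count_data$count1,              # bars drawn to the left
  word_count_data$count2,              # bars drawn to the right
  labels     = word_count_data$word,   # category labels in the middle
  top.labels = c("Document 1", "Words", "Document 2"),
  main       = "Words in common",      # plot title
  unit       = "Count",                # label for the horizontal axis
  gap        = 3,                      # width of the central label gap, in axis units
  labelcex   = 0.8                     # text size of the category labels
)
```

The gap value usually needs tuning to the scale of the counts so that the longest word fits between the two sets of bars.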

Now you'll explore words that are common in chardonnay tweets, but rare in coffee tweets. all_tdm_m is created for you.

Instructions 1/2

  • Coerce all_tdm_m to a tibble. Set the rownames to a column named "word".
  • Filter to keep only the rows where all variables are greater than zero, using the syntax ~. > 0.
  • Add a column named difference, equal to the count in the chardonnay column minus the count in the coffee column.
  • Use slice_max with difference to obtain the top n = 25.
  • Arrange the rows in descending order of difference, using desc().
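
The steps above can be sketched on a toy matrix as follows. This is one possible reading of the instructions, not necessarily the exercise's expected answer: toy_tdm_m and its counts are invented, the filter is applied with if_all() to the two count columns only (an assumption), and n is reduced to 2 to suit the tiny data set.

```r
library(dplyr)
library(tibble)

# Toy stand-in for all_tdm_m; the counts are invented for illustration.
toy_tdm_m <- matrix(
  c(12, 3,
     5, 9,
     8, 0,
     6, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("glass", "cup", "bottle", "taste"),
                  c("chardonnay", "coffee"))
)

top_words <- toy_tdm_m %>%
  # Move the row names into a column called "word".
  as_tibble(rownames = "word") %>%
  # Keep words that occur at least once in both documents.
  filter(if_all(c(chardonnay, coffee), ~ . > 0)) %>%
  # Positive values mean the word is more frequent in chardonnay tweets.
  mutate(difference = chardonnay - coffee) %>%
  # Keep the rows with the largest differences (n = 25 in the exercise).
  slice_max(difference, n = 2) %>%
  arrange(desc(difference))

print(top_words)
```

With these toy counts, "bottle" is dropped by the filter because it never appears in the coffee column, and the result keeps "glass" (difference 9) followed by "taste" (difference 5).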